<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ObservabilityGuy</title>
    <description>The latest articles on DEV Community by ObservabilityGuy (@observabilityguy).</description>
    <link>https://dev.to/observabilityguy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3433708%2Faf43ef59-cf80-46ad-930d-f76811e673a2.png</url>
      <title>DEV Community: ObservabilityGuy</title>
      <link>https://dev.to/observabilityguy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/observabilityguy"/>
    <language>en</language>
    <item>
      <title>Add Enterprise Memory to OpenClaw, and Your Agent Finally Doesn’t Have to Ask Again</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Tue, 26 May 2026 03:12:47 +0000</pubDate>
      <link>https://dev.to/observabilityguy/add-enterprise-memory-to-openclaw-and-your-agent-finally-doesnt-have-to-ask-again-2n89</link>
      <guid>https://dev.to/observabilityguy/add-enterprise-memory-to-openclaw-and-your-agent-finally-doesnt-have-to-ask-again-2n89</guid>
      <description>&lt;p&gt;This article introduces AgentLoop MemoryStore, a fully managed, enterprise-grade memory solution designed to give AI Agents long-term, reliable memory for production environments.&lt;/p&gt;

&lt;p&gt;Most AI developers have probably lived through this scenario: your Agent is finally online. The demo ran smoothly, the internal review passed, and the boss nodded his approval. After two months of hard work, the team pushed it into production. In the first week, user feedback was acceptable. But in the second week, you receive a user message like this: "Last time I explicitly said I wanted to return it, so why is your bot still asking me whether I want to exchange it?" You go through the conversation log, and the user is right: in the previous conversation, the intention to return was perfectly clear. The Agent, however, retained none of it. Every conversation is like meeting for the first time. You suddenly realize that getting the Agent online is only the starting point; what really matters is that it "remembers". And the pain behind this runs far deeper than you imagined.&lt;/p&gt;

&lt;p&gt;The First Layer of Pain: Users Do Not Want to Repeat Themselves&lt;br&gt;
This is the most direct damage to the experience, and also the quietest cause of user churn. Users don't care about your technical architecture or which large model you use. All they know is that what they said yesterday has to be said again today. In customer service, the user has already explained the order problem, the shipping address, and the return request, but has to start over on the next contact. The experience collapses instantly and the complaint rate rises sharply. In sales, the customer has made it clear that "the budget has not been approved", yet the Agent keeps pushing quotations, which only makes the customer feel that the assistant is not listening at all. In learning, a topic the user has already worked through is still drilled as a weak point the next day, which only makes the product feel perfunctory.&lt;/p&gt;

&lt;p&gt;Users will not complain that "your memory system is not working". They simply leave quietly, or come back next time already expecting the worst: it won't remember what I said anyway.&lt;/p&gt;

&lt;p&gt;The Second Layer of Pain: Build It Yourself, and You Hit Every Pitfall Yourself&lt;br&gt;
After noticing the problem, many teams choose to build their own memory system, only to find the road far harder than expected. A memory feature planned for three weeks turns into three months of rebuilding the underlying infrastructure.&lt;/p&gt;

&lt;p&gt;● Easy to store, hard to recall: Storing the dialogue history in a vector database is not complicated. The difficulty is accurately recalling the most relevant information in the next round, rather than bringing back a pile of noise. If retrieval quality is not up to standard, the memory is worse than useless: recalling five items of which four are distractors actively biases the model's judgment.&lt;/p&gt;

&lt;p&gt;● Append-only storage leads to contradictory memory: Last month the user preferred concise answers; this month they want more detailed explanations. If the system only adds and never updates, the two contradictory facts coexist, and the more stale data accumulates, the more inconsistent the Agent's judgments become.&lt;/p&gt;

&lt;p&gt;● Context stacking backfires: Some teams simply put the entire history into the prompt. It looks simple, but it drives up token cost and slows responses. The model has to pick the valid content out of redundant information, and accuracy goes down rather than up. A long context does not equal good memory; often it is just more expensive noise.&lt;/p&gt;

&lt;p&gt;● Smooth in the demo, unstable in production: Single-machine memory performs well in the testing phase. Once in production, problems appear frequently: memory is not shared between instances in a multi-instance deployment, memory is lost when an instance is destroyed, and memory extraction under high concurrency slows down the main request path...&lt;/p&gt;
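
&lt;p&gt;The "append-only" pitfall in the list above can be illustrated with a minimal sketch. It assumes a toy in-process store keyed by user and attribute; the class and method names are hypothetical and are not part of any AgentLoop API. Writing a new value for an existing attribute overwrites the old one, so two contradictory preferences never coexist.&lt;/p&gt;

```python
from dataclasses import dataclass, field
import time


@dataclass
class MemoryRecord:
    value: str
    updated_at: float = field(default_factory=time.time)


class PreferenceMemory:
    """Toy store keyed by (user_id, attribute); hypothetical, for illustration only."""

    def __init__(self):
        self._records = {}

    def upsert(self, user_id, attribute, value):
        # Overwrite the earlier value for the same attribute instead of appending,
        # so stale and fresh preferences cannot contradict each other.
        self._records[(user_id, attribute)] = MemoryRecord(value)

    def recall(self, user_id):
        # Return the current value of every attribute known for this user.
        return {
            attr: rec.value
            for (uid, attr), rec in self._records.items()
            if uid == user_id
        }


store = PreferenceMemory()
store.upsert("user123", "answer_style", "concise")   # last month
store.upsert("user123", "answer_style", "detailed")  # this month
store.upsert("user123", "language", "en")
print(store.recall("user123"))
```

&lt;p&gt;An append-only list would return both "concise" and "detailed" here; the keyed upsert returns only the latest preference.&lt;/p&gt;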

&lt;p&gt;The Third Layer of Pain: The Feature Is Built, but Nobody Dares Put It on the Main Request Path&lt;br&gt;
This is the most hidden and most realistic pain point. The memory feature can be built technically, but once it lands, the questions follow: Who will maintain the vector database? How do we troubleshoot and locate exceptions? Users' historical memory involves privacy, so how is data isolation ensured? Compliance requires that memory be traceable and deletable; can the existing solution support that? If traffic surges tenfold, will the memory pipeline drag down the entire service? Until these questions are clearly answered, no prudent technical leader dares to connect the core agent to the main request path. It is not that memory cannot be built; it is that once it is built, nobody dares to truly take responsibility for it. As a result, many agents end up in an awkward position: the feature exists, but the project is not production-ready and the business is slow to see delivery.&lt;/p&gt;

&lt;p&gt;In the past few years, memory has become almost the most crowded track in agent infrastructure. Simply storing conversations, enabling vector retrieval, and recording user preferences are no longer scarce capabilities. What is really scarce is an enterprise-level memory system that enterprises can adopt quickly, fit to their business scenarios, and run stably in production. This is the core problem AgentLoop MemoryStore wants to solve. As a fully managed enterprise-level memory management agent, AgentLoop MemoryStore has three advantages: it works out of the box, it is flexibly customizable, and it is serverless and O&amp;amp;M-free. It is equipped with core capabilities such as multi-dimensional memory retrieval, intelligent memory update, an asynchronous pipeline architecture, and hierarchical precision retrieval. It no longer asks whether memory matters; you already know the answer. What it sets out to solve is why enterprises have been slow to put their core agents online, and how to remove that blocker for good.&lt;/p&gt;

&lt;p&gt;For agents, the value of memory goes far beyond "preserving historical conversations." It determines whether the agent can grow from a one-off question-and-answer tool into a long-term collaborator that continuously understands users, reuses context, and accumulates business experience. Without memory, every round of dialogue is like a first meeting. With reliable memory, the Agent can truly understand "who you are, what happened, and how to proceed".&lt;/p&gt;

&lt;p&gt;For enterprises, memory is never an add-on feature; it is the dividing line that decides whether an Agent can really be used. Does the customer service bot remember the user's last ticket? Does the sales assistant remember the customer's decision-making progress and past objections? Can the learning assistant adjust the content dynamically according to learning progress? The core of these questions is not how human-like the model is, but whether the entire memory system is sufficiently engineered, operable, and scalable.&lt;/p&gt;

&lt;p&gt;However, scattered memory features are far from enough to solve these pain points. What is needed is a complete solution designed for production, covering integration, usage, operations and maintenance, and compliance. AgentLoop MemoryStore starts from the real pain points of enterprises and uses an out-of-the-box, flexible, open, stable, and reliable memory system to turn agents that are merely "usable" into agents that teams dare to deploy and find easy to use.&lt;/p&gt;

&lt;p&gt;Out-of-the-Box: No Duplicate Infrastructure, Memory Capabilities Plug Directly Into Existing Business&lt;br&gt;
Many teams are perfectly able to build a memory demo; what stalls them is the integration cost. A self-built memory system often means handling vector storage, structured storage, model invocation, asynchronous tasks, monitoring and alerting, permission isolation, and SDK encapsulation all at once. Technically it is not impossible, but it seriously slows the pace of product launch. The first value of AgentLoop MemoryStore is not how impressive its features are, but how convenient it is:&lt;/p&gt;

&lt;p&gt;a. Out-of-the-box: You do not need to build your own vector database, message queue, or background task system. You activate the service and use it in a one-stop manner. It covers the whole path from writing and storing raw data to long-term memory recall. Enterprise teams only need to focus on developing their own agents, not on the complex memory extraction process.&lt;/p&gt;

&lt;p&gt;b. Multiple integration options: It provides a complete API and SDK for data writing and memory recall, so clients can connect seamlessly. AgentLoop MemoryStore can also consume trace data collected by observability probes: you only need to load the probes in your program to collect user interaction information non-intrusively, without modifying the original business logic. For teams with existing memory-related code, the product is compatible with the Mem0 API, enabling zero-cost migration. It also supports other access forms such as an MCP Server and OpenClaw plug-ins, so it can be integrated into mainstream Agent frameworks and quickly give existing systems long-term memory.&lt;/p&gt;

&lt;p&gt;c. Cross-device memory sharing: It is provided as a hosted SaaS service, and memory is shared across machines, instances, and sessions, which open-source standalone memory systems do not offer. In an enterprise setting, the agent generally runs in a sandbox for permission isolation. If the memory system is a standalone version, the memory disappears when the agent instance is destroyed. With AgentLoop Memory, the agent instance can be destroyed at any time while the memory persists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business Scenario Example: Intelligent Customer Service&lt;/strong&gt;&lt;br&gt;
What a typical customer service Agent fears most is "we talked yesterday, and today it has forgotten everything". The user explained the order problem, delivery preferences, and communication habits yesterday. When the user comes back today, the system starts asking questions from scratch and the experience collapses immediately. After you connect to AgentLoop MemoryStore, the customer service team does not need to rewrite the entire memory logic. Mem0-compatible interfaces or OpenClaw plug-ins can be used to add memory recall and writing to existing processes. When users come back, the Agent can first see key information such as "progress of the last ticket", "the user's frequently used addresses", and "preferred communication methods". Answers become more continuous, and handoff to a human agent becomes more efficient. Compared with many open-source memory solutions that are better suited to local experiments or single-machine deployment, the SaaS-based AgentLoop MemoryStore has a very practical advantage: memory is not tied to a single machine, but can be shared continuously among different devices, instances, and service nodes. If the user talks to the Agent on the web page in the morning and moves to a mobile device in the afternoon, or the request is routed to another machine, the system can still continue from the same memory. This cross-machine sharing is closer to the way enterprises operate real online services.&lt;/p&gt;

&lt;p&gt;The point of this kind of value is not whether it is "technically achievable", but how soon the business team can start using it. For many enterprises, going live within a week is often worth more than one more conceptual feature.&lt;/p&gt;

&lt;p&gt;Flexible and Open: Memory Is Not Only Stored, but Also Supports Business Processing and Precise Retrieval&lt;br&gt;
Once "fast access" is solved, the next key is to make the memory really fit the business, rather than simply piling up historical conversations. Memory products tend to look alike because many of them only solve the "storage" problem and never really solve "how to remember, what to remember, and when to retrieve". In an enterprise scenario, memory is never a static file, but a set of dynamic assets that are updated as the business changes. The core difference of AgentLoop MemoryStore is that it opens up both "memory processing" and "memory retrieval". It supports multi-dimensional memory extraction: besides retaining the original dialogue content, it automatically extracts structured memories such as user preferences, factual information, and scene summaries, so that memories are no longer scattered chat records. It supports dynamic update of memory rather than mere addition: when a user's preference changes, the system automatically updates the old memory, reducing the accumulation of stale data at the source. It supports flexible custom rules: both the global extraction policy of an entire memory store and the special processing rules for a single message can be defined according to business requirements, so that the memory fits your business logic. It also provides a hierarchical retrieval strategy from L1 to L3, ranging from basic hybrid retrieval through refined Rerank to deep Agentic Search, balancing response speed, recall accuracy, and deep semantic understanding. The most important point is that enterprises do not have to accept the default behavior of a "black box Memory"; they can inject their own business judgment into it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business Scenario: Sales Assistant&lt;/strong&gt;&lt;br&gt;
The key memory in a sales scenario is often not "the customer is interested in the product", but more detailed structured information: the customer's current procurement stage, who the decision maker is, whether the budget is approved, what objections were raised in the last phone call, and what next actions were agreed. If you just put all the chat records back into the context, the cost is high, the noise is heavy, and the results are unstable. A more effective way is to extract information such as "organizational structure", "opportunity stage", "past objections", and "next action" into updatable long-term memory, and then use hierarchical retrieval to recall only the parts most relevant to the current round. In this way, the Agent's reply is no longer generic chat, but reads like a sales colleague who has actually followed the customer through the process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business Scenario: Learning Assistant&lt;/strong&gt;&lt;br&gt;
In a learning scenario, more memory is not necessarily better. The system needs to distinguish between "long-term stable learning goals" and "short-term changes in knowledge mastery". For example, a user may prefer video explanations at the beginning and later make it clear that they prefer problem-driven learning. Likewise, after several rounds of practice, an old memory that marks a topic as a "weak point" should be corrected instead of being kept.&lt;/p&gt;

&lt;p&gt;AgentLoop MemoryStore supports separate processing by memory type and extraction strategy, allowing a learning assistant not only to remember users, but also to "remember changes". This improvement to the personalized experience is often more direct than simply expanding the context window.&lt;/p&gt;

&lt;p&gt;Serverless, Elastic, and O&amp;amp;M-Free: Memory Does Not Become a System Bottleneck and Does Not Add Infrastructure Burden&lt;br&gt;
Ease of use and flexibility are not enough. Once memory is in production, stability and operating cost decide whether it can really land. The real test is often not "whether memories can be extracted", but "whether the main request path slows down under high concurrency". Many solutions work well in the demo phase, but problems surface under real business traffic: synchronous extraction is too slow, calls queue up, upstream and downstream services time out, scaling depends on manual work, and monitoring and alerting are not systematic. AgentLoop MemoryStore is designed to be "production-ready". It uses an asynchronous-write memory pipeline that handles time-consuming memory extraction in the background to minimize the impact on the main process. Relying on the data processing pipeline developed by AgentLoop, it can also perform multi-dimensional deduplication on large-scale interaction data, covering lexical deduplication, hash deduplication, and semantic vector deduplication, which reduces redundant data at the source. It also fully decouples the storage, compute, and retrieval modules. Each module can scale independently according to actual load and works with Auto Scaling however business traffic fluctuates. In addition, it natively supports multi-tenant isolation, complete audit logs, and end-to-end observability to meet enterprise O&amp;amp;M and compliance requirements.&lt;/p&gt;
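
&lt;p&gt;As a rough illustration of the layered deduplication described above, the sketch below combines a hash check on normalized text with a lexical (token-overlap) check. It is an assumption-laden toy, not the AgentLoop pipeline: the function names and the 0.7 threshold are made up for this example, and the semantic vector stage is left out because it would need an embedding model.&lt;/p&gt;

```python
import hashlib
import re


def normalize(text):
    # Lowercase and collapse whitespace so trivially different copies compare equal.
    return re.sub(r"\s+", " ", text.strip().lower())


def content_hash(text):
    # Hash of the normalized text, used for exact-duplicate detection.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def jaccard(a, b):
    # Token-set overlap in [0, 1]; 1.0 means the two texts use the same words.
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    if not ta and not tb:
        return 1.0
    return len(ta.intersection(tb)) / len(ta.union(tb))


def deduplicate(memories, lexical_threshold=0.7):
    kept, seen_hashes = [], set()
    for text in memories:
        digest = content_hash(text)
        if digest in seen_hashes:
            continue  # hash stage: exact duplicate after normalization
        if any(jaccard(text, k) >= lexical_threshold for k in kept):
            continue  # lexical stage: near duplicate by wording
        seen_hashes.add(digest)
        kept.append(text)
    return kept


candidates = [
    "Lives in Hangzhou",
    "lives in   hangzhou",      # same text after normalization
    "Lives in Hangzhou now",    # 3 of 4 tokens shared, overlap 0.75
    "Often visits West Lake",
]
print(deduplicate(candidates))
```

&lt;p&gt;Running the cheap hash check first means the more expensive pairwise comparison only sees candidates that are not exact repeats.&lt;/p&gt;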

&lt;p&gt;&lt;strong&gt;Business Scenario: Customer Service and Shopping Guide During the Promotion Period&lt;/strong&gt;&lt;br&gt;
During e-commerce promotions, the load on customer service and shopping guide agents is usually several times or even dozens of times higher than usual. If memory extraction runs fully synchronously, every dialogue turn has to wait for model extraction and writing to complete, and latency on the main request path rises quickly, eventually affecting the experience of the whole site. A more reasonable approach is to keep "the recall that is most critical to the user's reply" on the real-time path and put "more complex memory processing and consolidation" into the asynchronous pipeline. In this way, the Agent can respond in time without background memory processing blocking the foreground service. For enterprises, this is not a simple architecture optimization, but a question of whether they can keep service quality stable at critical moments.&lt;/p&gt;
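
&lt;p&gt;The split between a real-time recall path and an asynchronous write path can be sketched with the standard library alone. This is a minimal single-process illustration under stated assumptions: the in-memory dictionary, the &lt;code&gt;extract&lt;/code&gt; stand-in, and all function names are hypothetical, whereas a managed service would use durable queues and storage.&lt;/p&gt;

```python
import queue
import threading

memory_store = {}        # user_id -> list of extracted memories (toy storage)
pending = queue.Queue()  # raw turns waiting for background extraction


def extract(message):
    # Stand-in for slow, model-based memory extraction (hypothetical).
    return message.strip().rstrip(".")


def worker():
    # Background consumer: performs extraction and writes off the request path.
    while True:
        item = pending.get()
        if item is None:  # sentinel value: stop the worker
            pending.task_done()
            break
        user_id, message = item
        memory_store.setdefault(user_id, []).append(extract(message))
        pending.task_done()


def handle_turn(user_id, message):
    recalled = list(memory_store.get(user_id, []))  # fast recall on the real-time path
    pending.put((user_id, message))                 # defer extraction and writing
    return f"reply using {len(recalled)} recalled memories"


thread = threading.Thread(target=worker, daemon=True)
thread.start()

first = handle_turn("user123", "I want to return order 42.")
pending.join()  # demo only: wait until the background write has finished
second = handle_turn("user123", "Any update?")
pending.join()

pending.put(None)
thread.join()
print(first)
print(second)
```

&lt;p&gt;In this sketch &lt;code&gt;handle_turn&lt;/code&gt; never waits for extraction; the calls to &lt;code&gt;pending.join()&lt;/code&gt; exist only so the demo output is deterministic.&lt;/p&gt;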

&lt;p&gt;This is also where Serverless and O&amp;amp;M-free matter. What the enterprise team really wants to save is not just a few machines, but the whole set of maintenance costs around Memory: scaling, monitoring, exception troubleshooting, task backlogs, data isolation, and permission control. If you do all of this on your own, Memory quickly turns from an "enabler" into a "new burden."&lt;/p&gt;

&lt;p&gt;Why AgentLoop Memory Is More Suitable for Production Environments: It Not Only Remembers, but Can Also Be Verified, Managed, and Audited&lt;br&gt;
Fast access, flexibility, and stability are not the end of the story. A memory system must also be measurable, controllable, and compliant before it can truly enter the core path of an enterprise. When enterprises choose a Memory product, they look at results, not just concepts. Judge by results rather than by advertising: whether a system works is best shown by running a benchmark, and a unified benchmark is the touchstone for comparing different Memory systems. In the LoCoMo benchmark evaluation, the accuracy score of AgentLoop Memory reaches 84.07%. At the same time, compared with EverMemos, the volume of recalled memory is 30% lower. This means that it doesn't just "remember more", but delivers more efficient hits with less context overhead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mk4m0gb0qjombzn8lgd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mk4m0gb0qjombzn8lgd.png" alt=" " width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to effectiveness, enterprises are also concerned about long-term operation. AgentLoop MemoryStore provides several capabilities that are critical for production environments. It has built-in multi-tenant data isolation to meet enterprise-level security boundary requirements. It provides complete audit logs that track every memory creation, deletion, modification, and query to meet compliance audit requirements. It supports comprehensive observability and cost analysis: you can view latency, token consumption, request volume, and storage volume, and quickly troubleshoot problems. It also supports multiple integration methods, which lowers the access threshold for different technology stacks.&lt;/p&gt;

&lt;p&gt;In other words, what it aims to deliver is not just a "memory agent", but a memory infrastructure that enterprises can confidently build into their core business paths.&lt;/p&gt;

&lt;p&gt;Best Practice: OpenClaw + AgentLoop MemoryStore - Low-Barrier Access to Long-Term Memory&lt;br&gt;
To let more teams use reliable long-term memory, OpenClaw is now integrated with AgentLoop MemoryStore. Developers can quickly give existing agents stable, reusable, and operable enterprise-level memory without building memory modules from scratch. If you are already using OpenClaw, the cost of adopting AgentLoop MemoryStore is even lower. We have packaged the integration as a separate npm package, openclaw-plugin-agentloop-memory. Once installed and configured, it adds enterprise-class long-term memory to OpenClaw without modifying the OpenClaw code itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;br&gt;
Before you install the plug-in, make sure the following are in place:&lt;/p&gt;

&lt;p&gt;■ You have an Alibaba Cloud account and have activated the AgentLoop MemoryStore service.&lt;/p&gt;

&lt;p&gt;■ You have created a Workspace and a MemoryStore in the AgentLoop MemoryStore console.&lt;/p&gt;

&lt;p&gt;■ You have the AccessKey ID and AccessKey secret of your Alibaba Cloud account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
Run the following command in the OpenClaw project directory:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;npm install openclaw-plugin-agentloop-memory&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Configure&lt;/strong&gt;&lt;br&gt;
After the installation is complete, enable the plug-in in the OpenClaw configuration and specify the connection parameters. A typical configuration is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "memory-agentloop": {
    "endpoint": "cms.cn-hangzhou.aliyuncs.com",
    "accessKeyId": "${ALIBABA_CLOUD_ACCESS_KEY_ID}",
    "accessKeySecret": "${ALIBABA_CLOUD_ACCESS_KEY_SECRET}",
    "workspace": "my-workspace",
    "memoryStore": "my-memory-store"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The core parameters are described below:&lt;/p&gt;

&lt;p&gt;■ endpoint: The API endpoint address of AgentLoop MemoryStore. Enter the endpoint that matches the region where the instance is located, for example, cms.cn-hangzhou.aliyuncs.com.&lt;/p&gt;

&lt;p&gt;■ accessKeyId / accessKeySecret: The Alibaba Cloud access credentials. Environment variable injection is supported to avoid storing them in plaintext.&lt;/p&gt;

&lt;p&gt;■ workspace: The name of the workspace created in the AgentLoop MemoryStore console.&lt;/p&gt;

&lt;p&gt;■ memoryStore: The name of the memory store in the workspace.&lt;/p&gt;

&lt;p&gt;The plug-in also provides the following optional configurations:&lt;/p&gt;

&lt;p&gt;■ userId / agentId: Used for user-level and agent-level data isolation. Applicable to multi-tenant scenarios.&lt;/p&gt;

&lt;p&gt;■ autoCapture: Enabled by default. Automatically extracts valuable information from the conversation and writes it to the memory store.&lt;/p&gt;

&lt;p&gt;■ autoRecall: Enabled by default. Automatically retrieves relevant memories and injects them into the context before each conversation starts.&lt;/p&gt;

&lt;p&gt;■ inferOnAdd: Enabled by default. Turns on intelligent extraction when data is written to memory, so that multi-dimensional memory extraction and deduplication are performed automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capabilities provided by the plug-in&lt;/strong&gt;&lt;br&gt;
After installation, the plug-in adds three types of capabilities to OpenClaw:&lt;/p&gt;

&lt;p&gt;■ Agent tools: Registers three memory operation tools, memory_recall, memory_store, and memory_forget, so that the Agent can actively retrieve, write, and delete memories during a dialogue.&lt;/p&gt;

&lt;p&gt;■ Automated hooks: When autoRecall and autoCapture are enabled, memory recall and asynchronous memory capture are completed automatically, which reduces changes to business code.&lt;/p&gt;

&lt;p&gt;■ CLI command: Provides the openclaw agentloop command line so that developers can search, add, list, and delete memories directly in the terminal and run connectivity checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python SDK Quick Start Demo&lt;/strong&gt;&lt;br&gt;
If you want to quickly verify the effect first, you can also try it directly through the Python SDK:&lt;/p&gt;

&lt;p&gt;1. Get the AgentLoop Memory SDK&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install agentloop-memory&lt;/code&gt;&lt;br&gt;
2. Run the sample program&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;import os
import time

from agentloop_memory import Config
from agentloop_memory.client import AgentLoopMemoryClient


def main():
    # 1. Init memory store client (credentials and endpoint come from environment variables)
    config = Config(
        access_key_id=os.getenv("ALIYUN_ACCESS_KEY_ID"),
        access_key_secret=os.getenv("ALIYUN_ACCESS_KEY_SECRET"),
        endpoint=os.getenv("CMS_ENDPOINT", "cms.cn-shanghai.aliyuncs.com"),
    )
    client = AgentLoopMemoryClient(
        config,
        workspace=os.getenv("CMS_WORKSPACE"),
        memory_store=os.getenv("CMS_MEMORY_STORE"),
    )

    # 2. Create memory store
    result = client.create_memory_store(
        description="Example memory store",
        extraction_strategies=["FACT"],
    )
    print("create_memory_store:", result)
    time.sleep(5)  # brief pause after creation before writing

    # 3. Add memory (the write is queued and processed in the background)
    result = client.add(
        messages="I live in Hangzhou and love visiting West Lake",
        user_id="user123",
    )
    print("add:", result)
    time.sleep(120)  # wait for background extraction before searching

    # 4. Search memory
    result = client.search(
        query="Where do I live?",
        user_id="user123",
    )
    print("search:", result)

    # 5. Get all memories
    result = client.get_all(
        user_id="user123",
        page=1,
        page_size=10,
    )
    print("get_all:", result)

    # 6. List memory stores
    result = client.list_memory_stores(max_results=10)
    print("list_memory_stores:", result)


if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
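
&lt;p&gt;Because writes are processed in the background, the sample waits a fixed 120 seconds before searching. One possible alternative is to poll until results appear. The helper below is a minimal sketch, not part of the SDK: &lt;code&gt;wait_for_results&lt;/code&gt; and its parameters are hypothetical, and it assumes the search call returns a dictionary with a &lt;code&gt;results&lt;/code&gt; list, as in the sample output. A stub stands in for the real client so the snippet runs on its own.&lt;/p&gt;

```python
import time


def wait_for_results(search, attempts=10, delay_seconds=1.0):
    """Call search() until it returns a non-empty 'results' list or attempts run out."""
    for attempt in range(attempts):
        response = search()
        if response.get("results"):
            return response
        if attempt + 1 != attempts:
            time.sleep(delay_seconds)  # back off before the next try
    raise TimeoutError("memory was not searchable within the polling window")


# Stub standing in for client.search(...): empty twice, then one hit.
_responses = iter([
    {"results": []},
    {"results": []},
    {"results": [{"memory": "lives in Hangzhou"}]},
])
result = wait_for_results(lambda: next(_responses), delay_seconds=0.01)
print(result["results"][0]["memory"])
```

&lt;p&gt;With the real client, the callable could be &lt;code&gt;lambda: client.search(query="Where do I live?", user_id="user123")&lt;/code&gt;, with the attempt count and delay tuned to how long extraction usually takes.&lt;/p&gt;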



&lt;p&gt;Sample result&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{'status_code': 200, 'headers': {'server': 'AliyunSLS', 'content-length': '0', 'connection': 'keep-alive', 'access-control-allow-origin': '*', 'date': 'Mon, 02 Feb 2026 03:27:53 GMT', 'x-log-time': '1770002873', 'x-log-requestid': '698019B5FA0F42BA63073DF6'}}
{'results': [{'event_id': '800c03bc-dc54-42de-bd07-153421f88259', 'message': 'Memory processing has been queued for background execution', 'status': 'PENDING'}]}
{'results': [{'created_at': 1770002874, 'hash': '55566d2fdec59e0a3bf8870b1cb17bfd', 'id': '019c1c65-9745-7773-92f8-189a2b4a3721', 'memory': 'lives in Hangzhou', 'score': 0.5316177221048695, 'updated_at': 1770002874, 'user_id': 'user123'}, {'created_at': 1770002874, 'hash': '7b869aba23294ab37679c5f7e7465921', 'id': '019c1c65-990e-7381-8ba4-794867a634bd', 'memory': 'like the scenery of hangzhou', 'score': 0.4317308740071, 'updated_at': 1770002874, 'user_id': 'user123'}]}
{'results': [{'created_at': 1770002874, 'hash': '55566d2fdec59e0a3bf8870b1cb17bfd', 'id': '019c1c65-9745-7773-92f8-189a2b4a3721', 'memory': 'lives in Hangzhou', 'updated_at': 1770002874, 'user_id': 'user123'}, {'created_at': 1770002874, 'hash': '7b869aba23294ab37679c5f7e7465921', 'id': '019c1c65-990e-7381-8ba4-794867a634bd', 'memory': 'like the scenery of hangzhou', 'updated_at': 1770002874, 'user_id': 'user123'}, {'created_at': 1770002874, 'hash': '939ed9d15f907d252363fd0e2cffb9a9', 'id': '019c1c65-9ac3-7cd1-afea-1f091dcdc6fe', 'memory': 'frequent visit to the West Lake ', 'updated_at': 1770002874, 'user_id': 'user123'}], 'relations': []}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the memory is added, the system automatically extracts and stores three key pieces of information:&lt;/p&gt;

&lt;p&gt;■ "lives in Hangzhou"&lt;/p&gt;

&lt;p&gt;■ "like the scenery of hangzhou"&lt;/p&gt;

&lt;p&gt;■ "frequent visit to the West Lake"&lt;/p&gt;

&lt;p&gt;When queried with "Where do I live?", the system returns "lives in Hangzhou" as the top hit, followed by other associated memories ranked by relevance. The whole process requires no manual annotation; memory extraction and retrieval are done automatically.&lt;/p&gt;
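
&lt;p&gt;To use such a search response in an agent, you would typically keep only the best-scoring memories before adding them to the prompt. The helper below is a small sketch under the assumption that the response has the shape shown in the sample result (a &lt;code&gt;results&lt;/code&gt; list whose items carry &lt;code&gt;memory&lt;/code&gt; and &lt;code&gt;score&lt;/code&gt;). The function name, threshold, and limit are illustrative and not part of the SDK.&lt;/p&gt;

```python
def top_memories(response, min_score=0.0, limit=3):
    """Return (memory, score) pairs at or above min_score, best first."""
    hits = [
        (item["memory"], item.get("score", 0.0))
        for item in response.get("results", [])
        if item.get("score", 0.0) >= min_score
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)[:limit]


# Trimmed copy of the search response from the sample result.
sample = {
    "results": [
        {"memory": "like the scenery of hangzhou", "score": 0.4317308740071, "user_id": "user123"},
        {"memory": "lives in Hangzhou", "score": 0.5316177221048695, "user_id": "user123"},
    ]
}
print(top_memories(sample, min_score=0.5))
```

&lt;p&gt;Filtering by score and capping the count keeps the injected context small, which is in line with the article's point that less, more relevant context beats a long noisy one.&lt;/p&gt;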

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
Today's Memory market is not short of new concepts; what it lacks is solutions that really help enterprises get agents running, keep them running stably, and turn them into business value. The focus of AgentLoop MemoryStore is not to make "memory" more mysterious, but to do the three most practical things well: connect to existing systems faster, fit specific businesses more flexibly, and run in production more reliably. For teams already building agents for customer service, sales, learning, shopping guidance, and similar scenarios, this kind of Memory is well worth evaluating and connecting to the main request path.&lt;/p&gt;

&lt;p&gt;Don't let your agents have only seven seconds of memory. Try AgentLoop MemoryStore now so that your data truly accumulates into reusable business knowledge:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cmsnext.console.alibabacloud.com/agentloop/home" rel="noopener noreferrer"&gt;https://cmsnext.console.alibabacloud.com/agentloop/home&lt;/a&gt;&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>cloudnative</category>
      <category>agents</category>
    </item>
    <item>
      <title>LoongCollector + ACS Agent Sandbox: Build a Production-grade AI Agent Runtime Platform</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Tue, 26 May 2026 02:56:58 +0000</pubDate>
      <link>https://dev.to/observabilityguy/loongcollector-acs-agent-sandbox-build-a-production-grade-ai-agent-runtime-platform-5pj</link>
      <guid>https://dev.to/observabilityguy/loongcollector-acs-agent-sandbox-build-a-production-grade-ai-agent-runtime-platform-5pj</guid>
      <description>&lt;p&gt;This article introduces how ACS Agent Sandbox and LoongCollector combine into a production-grade runtime platform for AI Agents, providing runtime security and full-link observability.&lt;/p&gt;

&lt;p&gt;1. Security and Observability Challenges of AI Agents&lt;br&gt;
With the rapid development of Large Language Models (LLMs), AI Agents are moving from the lab to production. From intelligent customer service to code assistants, and from data analytics to automated O&amp;amp;M, AI Agents are transforming how we work. However, unlike traditional applications, AI Agents possess two distinct characteristics:&lt;/p&gt;

&lt;p&gt;● &lt;strong&gt;Unpredictable behavior:&lt;/strong&gt; The same input might generate different outputs and invoke different toolchains.&lt;/p&gt;

&lt;p&gt;● &lt;strong&gt;Execution capability:&lt;/strong&gt; Agents don't just "talk"; they "act"—accessing data, invoking APIs, and executing operations.&lt;/p&gt;

&lt;p&gt;These two characteristics present entirely new challenges.&lt;/p&gt;

&lt;p&gt;Core Challenge 1: Runtime Security (What are Agents permitted to do? Who defines the boundaries?)&lt;br&gt;
Consider this scenario: A customer service Agent answering a query is subjected to a prompt injection attack. It accidentally accesses another user's order information, or even triggers a refund API. This is a real-world security risk, not science fiction.&lt;/p&gt;

&lt;p&gt;AI Agent security risks primarily stem from two areas:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Lack of strong isolation in execution environments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agents require data access and tool invocation at runtime. Without strict permission controls, prompt injections or accidental triggers can lead to unauthorized access, data leaks, or unintended operations—such as an Agent bypassing security checks to access a restricted database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Lack of control over external capabilities&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The greatest threats often arise from the abuse of external capabilities—such as abnormal outbound calls, SSRF/intranet probing, or sensitive data persistence and exfiltration. For example, an Agent might be tasked with "checking the weather" but actually initiates a scan of internal network services.&lt;/p&gt;
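
&lt;p&gt;To make the SSRF and intranet probing risk concrete, here is a minimal sketch of an outbound-call check that rejects targets inside internal address ranges. The &lt;code&gt;is_internal_target&lt;/code&gt; function is hypothetical and only handles literal IP addresses; it is not how ACS Agent Sandbox enforces network boundaries, just an illustration of the kind of rule such a boundary expresses.&lt;/p&gt;

```python
import ipaddress
from urllib.parse import urlsplit


# Hypothetical egress check: refuse literal-IP targets inside internal ranges.
def is_internal_target(url):
    host = urlsplit(url).hostname
    if host is None:
        return True  # Unparseable target: treat as unsafe.
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Hostname, not an IP literal. A real guard would resolve it and
        # check every resolved address; this sketch skips resolution.
        return False
    return addr.is_private or addr.is_loopback or addr.is_link_local


print(is_internal_target("http://10.0.0.8:8080/admin"))
print(is_internal_target("http://169.254.169.254/latest/meta-data"))
print(is_internal_target("http://8.8.8.8/"))
```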

&lt;p&gt;Core Challenge 2: Full-link Observability (What did the Agent do? Why did it do it? How effective was it?)&lt;br&gt;
Traditional applications are deterministic; the same input yields the same output. AI Agents, however, may make different decisions each time, leading to three major observability hurdles:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Behavior is hard to reproduce and troubleshoot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the same query, an Agent might use Tool A today, Tool B tomorrow, or simply provide a direct answer the day after. When errors occur, identifying the exact point of failure is difficult.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Difficulty in cost control and attribution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Costs are driven by LLM token consumption and external API calls, both of which fluctuate significantly. It is often unclear which users, tasks, or models are driving up expenses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Quality is hard to measure and optimize&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Output quality depends on model capability, prompt design, and retrieval data. Because these factors change constantly, it is difficult to pinpoint what is working, what isn't, and how to optimize.&lt;/p&gt;
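
&lt;p&gt;The cost attribution problem can be sketched with a few lines of aggregation. The example below assumes each LLM call is recorded with a user, a model, and a token count; the record fields, the model names, and the per-token prices are all made up for illustration.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical per-1K-token prices; real prices depend on the model provider.
PRICE_PER_1K_TOKENS = {"model-a": 0.002, "model-b": 0.01}


def cost_by_user(call_records):
    """Sum the estimated LLM cost of each user from per-call token counts."""
    totals = defaultdict(float)
    for record in call_records:
        price = PRICE_PER_1K_TOKENS[record["model"]]
        totals[record["user"]] += record["tokens"] / 1000 * price
    return dict(totals)


records = [
    {"user": "alice", "model": "model-a", "tokens": 1500},
    {"user": "alice", "model": "model-b", "tokens": 500},
    {"user": "bob", "model": "model-a", "tokens": 4000},
]
print(cost_by_user(records))
```

&lt;p&gt;The same grouping works per task or per model by changing the key, which is why trace data that carries these attributes matters for attribution.&lt;/p&gt;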

&lt;p&gt;&lt;strong&gt;Why Is a Specialized Solution Necessary?&lt;/strong&gt;&lt;br&gt;
Traditional monitoring and security solutions fall short in AI Agent scenarios:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntuc85ha57k1jzz8hizf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntuc85ha57k1jzz8hizf.png" alt=" " width="789" height="379"&gt;&lt;/a&gt;&lt;br&gt;
This is why a runtime platform and observability solution specifically designed for AI Agents are essential. Let's explore how ACS Agent Sandbox and LoongCollector address these challenges.&lt;/p&gt;

&lt;p&gt;2. ACS Agent Sandbox and LoongCollector: Comprehensive Security and Observability&lt;br&gt;
ACS Agent Sandbox provides a secure execution environment based on Kubernetes, while LoongCollector acts as a telemetry data collector to provide agents with comprehensive monitoring and analysis. Together, their deep integration forms a complete production-grade execution platform for AI Agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.1 ACS Agent Sandbox: Providing Runtime Security&lt;/strong&gt;&lt;br&gt;
Alibaba Cloud Container Compute Service (ACS) Agent Sandbox is a specialized runtime environment from Alibaba Cloud. Built on Kubernetes, it provides a secure, isolated, and scalable platform for running AI Agents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8fktnnu3b9ir87ooks5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8fktnnu3b9ir87ooks5.png" alt=" " width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.2 LoongCollector: Providing Sandbox Observability&lt;/strong&gt;&lt;br&gt;
LoongCollector is a unified telemetry collector open-sourced by the Alibaba Cloud Observability team. Designed for cloud-native and high-performance scenarios, it offers unique advantages for AI Agent use cases:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jpt0aqfffmunql3kd5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jpt0aqfffmunql3kd5x.png" alt=" " width="799" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extreme Performance and Ultra-low Overhead&lt;/strong&gt;&lt;br&gt;
AI Agents are compute-intensive, so observability components must be lightweight to avoid impacting business operations:&lt;/p&gt;

&lt;p&gt;● Zero-copy architecture: Utilizes Memory Arena and zero-copy to minimize unnecessary memory overhead.&lt;/p&gt;

&lt;p&gt;● Event pooling and reuse: High-frequency object pooling reduces memory allocation and Garbage Collection (GC) pressure.&lt;/p&gt;

&lt;p&gt;● High single-core throughput: A single core can support log collection throughput of up to 500 MB/s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unified Collection: Full Coverage of Logs, Metrics, and Traces&lt;/strong&gt;&lt;br&gt;
● Logs: Supports stdout/stderr and file logs; automatically associates Kubernetes metadata such as Pods, Namespaces, and Labels.&lt;/p&gt;

&lt;p&gt;● Metrics: Native support for Prometheus Exporter, system metrics (CPU, memory, network, and disk I/O), and GPU metrics (NVIDIA DCGM).&lt;/p&gt;

&lt;p&gt;● Traces: Full support for OpenTelemetry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edge Computing: Moving Processing to the Data Source&lt;/strong&gt;&lt;br&gt;
Beyond collection, it performs edge-side preprocessing to reduce transmission and storage costs:&lt;/p&gt;

&lt;p&gt;● High-performance C++ plugins and Structured Process Language (SPL) engine.&lt;/p&gt;

&lt;p&gt;● Supports complex processing: Filtering, transformation, and aggregation.&lt;/p&gt;

&lt;p&gt;● Edge-side dimensionality reduction: Minimizing noise and data volume at the source.&lt;/p&gt;
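
&lt;p&gt;The idea of filtering and aggregating at the source can be sketched as follows. This is plain Python standing in for the C++ plugins and SPL engine, not SPL syntax; the log format and the &lt;code&gt;reduce_at_edge&lt;/code&gt; function are assumptions for illustration.&lt;/p&gt;

```python
from collections import Counter


# Hypothetical edge-side step: drop debug noise, then aggregate by level.
def reduce_at_edge(log_lines):
    kept = [line for line in log_lines if not line.startswith("DEBUG")]
    # Aggregation: ship one count per level instead of every raw line.
    counts = Counter(line.split(" ", 1)[0] for line in kept)
    return dict(counts)


lines = [
    "DEBUG cache probe",
    "INFO tool call ok",
    "ERROR tool call failed",
    "INFO tool call ok",
]
print(reduce_at_edge(lines))
```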

&lt;p&gt;&lt;strong&gt;Enterprise-Grade Reliability: Ensuring Zero Data Loss and Stable Operations&lt;/strong&gt;&lt;br&gt;
Data reliability:&lt;/p&gt;

&lt;p&gt;● At-least-once delivery semantics.&lt;/p&gt;

&lt;p&gt;● Local disk caching: Persisting data to disk during network anomalies and retransmitting upon recovery.&lt;/p&gt;

&lt;p&gt;● Automatic retry and exponential backoff.&lt;/p&gt;

&lt;p&gt;● Backpressure and rate limiting: Protects the system during downstream congestion.&lt;/p&gt;

&lt;p&gt;Operational reliability:&lt;/p&gt;

&lt;p&gt;● Multi-tenant pipeline isolation.&lt;/p&gt;

&lt;p&gt;● Priority scheduling: Ensuring critical data is processed first.&lt;/p&gt;

&lt;p&gt;● Hot updates and graceful changes: Configuration changes take effect without restarts or service interruptions.&lt;/p&gt;
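
&lt;p&gt;Retry with exponential backoff and at-least-once delivery, as listed above, can be illustrated with a short sketch. The &lt;code&gt;send_with_backoff&lt;/code&gt; function, its parameters, and the simulated sender are hypothetical; they show the general technique rather than LoongCollector's internal implementation.&lt;/p&gt;

```python
import time


# Hypothetical sender with retry: keep the batch until delivery succeeds,
# which gives at-least-once behavior (a retried batch may arrive twice).
def send_with_backoff(send, batch, max_attempts=5, base_delay=0.01, cap=1.0):
    for attempt in range(max_attempts):
        try:
            send(batch)
            return attempt + 1  # Number of attempts used.
        except ConnectionError:
            # Delay doubles each attempt: base, 2*base, 4*base, ... up to cap.
            time.sleep(min(cap, base_delay * 2 ** attempt))
    raise RuntimeError("delivery failed; batch should stay in the local cache")


calls = []


def flaky_send(batch):
    calls.append(batch)
    if len(calls) != 3:  # Fail on the first two attempts, succeed on the third.
        raise ConnectionError("downstream unavailable")


print(send_with_backoff(flaky_send, ["log line"]))
```

&lt;p&gt;Because a batch is only released after a successful send, a receiver may occasionally see duplicates, which is the usual trade-off of at-least-once semantics.&lt;/p&gt;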

&lt;p&gt;&lt;strong&gt;Unified Management for Large-Scale Elastic Scenarios&lt;/strong&gt;&lt;br&gt;
● ConfigServer: Centralized configuration management supporting tens of thousands of Agents.&lt;/p&gt;

&lt;p&gt;● Remote configuration delivery: Changes take effect in real-time without requiring manual login.&lt;/p&gt;

&lt;p&gt;● Status and performance monitoring: A unified view of health and resource overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.3 Deep Integration: LoongCollector Provides Zero-Intrusion, Automated, and Highly Reliable Observability for Sandbox&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq5dvwh5ypdowbs24s5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq5dvwh5ypdowbs24s5p.png" alt=" " width="799" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● ACS management automatically injects the LoongCollector container into the Sandbox.&lt;/p&gt;

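&lt;p&gt;● Log files are collected through a shared file path mount between the Agent container and LoongCollector.&lt;/p&gt;

&lt;p&gt;● Metrics and traces travel over the Pod network: LoongCollector scrapes Prometheus endpoints exposed by AI Agents or receives OpenTelemetry data from them.&lt;/p&gt;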

&lt;p&gt;Through the deep integration of ACS Agent Sandbox and LoongCollector, we have built a comprehensive production-grade platform for AI Agents:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzb326cnkr6zk7m0q50h4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzb326cnkr6zk7m0q50h4.png" alt=" " width="789" height="463"&gt;&lt;/a&gt;&lt;br&gt;
3. Running OpenClaw Using ACS Agent Sandbox and LoongCollector&lt;br&gt;
OpenClaw is a trending AI application that redefines the boundaries of AI assistants. Its core value is no longer just answering questions, but understanding intent, planning steps, and invoking tools to complete tasks—acting as an "always-on" digital employee. Next, let's explore how to run OpenClaw securely and with full observability using ACS Agent Sandbox and LoongCollector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.1 Enabling Sandbox LoongCollector Injection for ACK and ACS Clusters&lt;/strong&gt;&lt;br&gt;
ACK clusters&lt;br&gt;
Note: Install the following components in advance:&lt;/p&gt;

&lt;p&gt;● Install the LoongCollector component in Components and Add-ons.&lt;/p&gt;

&lt;p&gt;● Install the ACK Virtual Node component in Components and Add-ons.&lt;/p&gt;

&lt;p&gt;● Install the ack-agent-sandbox-controller component in Components and Add-ons.&lt;/p&gt;

&lt;p&gt;● To expose services via EIP, install the ack-extend-network-controller component from the Marketplace. Refer to the help document for specific configuration steps.&lt;/p&gt;

&lt;p&gt;Modify the eci-profile ConfigMap in the kube-system namespace. The slsMachineGroup parameter defines the Sandbox machine group identifier; we recommend using a unique identifier different from the ACK DaemonSet group.&lt;/p&gt;

&lt;p&gt;ACS clusters&lt;/p&gt;

&lt;p&gt;Note: Install the following components first:&lt;/p&gt;

&lt;p&gt;● Go to Components and Add-ons and install the ack-agent-sandbox-controller component (version ≥0.5.3).&lt;/p&gt;

&lt;p&gt;● To expose services via EIP, go to Components and Add-ons in the ACK cluster and install the ack-extend-network-controller component.&lt;/p&gt;

&lt;p&gt;● Go to Components and Add-ons and install the alibaba-log-controller component.&lt;/p&gt;

&lt;p&gt;The machine group identifier is the unified ACS cluster group ID: k8s-log-${cluster_id}&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.2 Deploying OpenClaw in ACS Agent Sandbox&lt;/strong&gt;&lt;br&gt;
Enable the OpenTelemetry (OTel) plugin for OpenClaw&lt;/p&gt;

&lt;p&gt;Note&lt;/p&gt;

&lt;p&gt;● Ensure extensions/diagnostics-otel is included when packaging the OpenClaw image.&lt;/p&gt;

&lt;p&gt;● You must enable diagnostics-otel in the configuration to report metrics and trace data.&lt;/p&gt;

&lt;p&gt;Configure &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Note: The endpoint configured here will be required for the LoongCollector collection configuration later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{  
  "plugins": {  
    "allow": ["diagnostics-otel"],  
    "entries": {  
      "diagnostics-otel": { "enabled": true }  
    }  
  },  
  "diagnostics": {  
    "enabled": true,  
    "otel": {  
      "enabled": true,  
      "endpoint": "http://127.0.0.1:4318",  
      "protocol": "http/protobuf",  
      "serviceName": "openclaw-gateway",  
      "traces": true,  
      "metrics": true,  
      "logs": true,  
      "sampleRate": 1,  
      "flushIntervalMs": 60000  
    }  
  }  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
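
&lt;p&gt;Since the endpoint set here has to line up with the collector side later, a small check can derive the &lt;code&gt;host:port&lt;/code&gt; value from the OpenClaw configuration. The sketch below works on a trimmed copy of the JSON fragment above; the &lt;code&gt;collector_endpoint&lt;/code&gt; helper is hypothetical and only shows the string relationship between the two settings.&lt;/p&gt;

```python
import json
from urllib.parse import urlsplit

# Trimmed copy of the openclaw.json fragment shown above.
config_text = """
{
  "diagnostics": {
    "otel": {"enabled": true, "endpoint": "http://127.0.0.1:4318"}
  }
}
"""


def collector_endpoint(openclaw_config):
    """Return the host:port part that the collector side needs to listen on."""
    url = openclaw_config["diagnostics"]["otel"]["endpoint"]
    return urlsplit(url).netloc


print(collector_endpoint(json.loads(config_text)))
```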



&lt;p&gt;OpenClaw sandbox deployment example&lt;/p&gt;

&lt;p&gt;Below is a simplified example of creating an OpenClaw sandbox directly using a Sandbox CR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;apiVersion: agents.kruise.io/v1alpha1  
kind: Sandbox  
metadata:  
  name: openclaw  
  namespace: default  
spec:  
  template:  
    metadata:  
      labels:  
        alibabacloud.com/acs: 'true'  
        app: openclaw  
    spec:  
      containers:  
        - name: openclaw  
          # Replace with the actual OpenClaw image address  
          image: &lt;span class="nt"&gt;&amp;lt;open-claw&lt;/span&gt; &lt;span class="err"&gt;image&lt;/span&gt; &lt;span class="err"&gt;address&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;   
          imagePullPolicy: IfNotPresent   
          resources:  
            limits:  
              cpu: '4'  
              memory: 8Gi  
            requests:  
              cpu: '4'  
              memory: 8Gi  
          securityContext:  
            readOnlyRootFilesystem: false  
          terminationMessagePath: /dev/termination-log  
          terminationMessagePolicy: File  
      dnsPolicy: ClusterFirst  
      paused: true  
      restartPolicy: Always  
      schedulerName: default-scheduler  
      securityContext: {}  
      terminationGracePeriodSeconds: 1  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3.3 Full Observability Collection Configuration&lt;/strong&gt;&lt;br&gt;
As described in "Is Your OpenClaw Really Running Under Control?", the observability data for OpenClaw is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zych2ip6y5qa1u1rnq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zych2ip6y5qa1u1rnq3.png" alt=" " width="789" height="187"&gt;&lt;/a&gt;&lt;br&gt;
Session logs&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;apiVersion: telemetry.alibabacloud.com/v1alpha1  
kind: ClusterAliyunPipelineConfig  
metadata:  
  name: openclaw-session-log  
spec:  
  config:  
    aggregators: []  
    global: {}  
    inputs:  
      - Type: input_file  
        # This path varies depending on the run path of the openclaw image.  
        FilePaths:  
          - /home/node/.openclaw/agents/main/sessions/*.jsonl  
        MaxDirSearchDepth: 0  
        FileEncoding: utf8  
        EnableContainerDiscovery: true  
        # Filter containers based on the OpenClaw sandbox information.  
        ContainerFilters:  
          K8sPodRegex: ^(openclaw.*)$  
    processors:  
      - Type: processor_parse_json_native  
        SourceKey: content  
    flushers:  
      - Type: flusher_sls  
        Logstore: openclaw-session-log  
    sample: ''  
  # Replace this with the sandbox machine group name of the ACK or ACS cluster.  
  machineGroups:  
    - name: &lt;span class="nt"&gt;&amp;lt;your-sandbox-machine-group&amp;gt;&lt;/span&gt;  
  # The project to which logs are collected.  
  project:  
    name: k8s-log-xxx  
  # The Logstore to which logs are collected.  
  logstores:  
    - name: openclaw-session-log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
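
&lt;p&gt;Before applying a collection configuration like this one, it can help to sanity-check the Pod filter and the file path pattern locally. The sketch below uses Python's &lt;code&gt;re&lt;/code&gt; and &lt;code&gt;fnmatch&lt;/code&gt; as stand-ins for the collector's own matching, which is an assumption: the collector's regex engine and path semantics may differ in detail, and the Pod names are invented.&lt;/p&gt;

```python
import re
from fnmatch import fnmatch

POD_REGEX = r"^(openclaw.*)$"  # Value of K8sPodRegex in the config above.
FILE_GLOB = "/home/node/.openclaw/agents/main/sessions/*.jsonl"


def would_collect(pod_name, file_path):
    """Rough check: does the Pod name match and does the path fit the glob?"""
    # Note: fnmatch's * also crosses directory separators, so this check is
    # looser than a collector that does not descend into subdirectories.
    return bool(re.match(POD_REGEX, pod_name)) and fnmatch(file_path, FILE_GLOB)


session_file = "/home/node/.openclaw/agents/main/sessions/a1.jsonl"
print(would_collect("openclaw-7d9f", session_file))
print(would_collect("my-openclaw", session_file))
```

&lt;p&gt;The second call returns a negative result because the pattern is anchored at the start, so only Pods whose names begin with &lt;code&gt;openclaw&lt;/code&gt; pass the filter.&lt;/p&gt;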



&lt;p&gt;Application logs&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;apiVersion: telemetry.alibabacloud.com/v1alpha1  
kind: ClusterAliyunPipelineConfig  
metadata:  
  name: openclaw-app-log  
spec:  
  config:  
    aggregators: []  
    global: {}  
    inputs:  
      - Type: input_file  
        FilePaths:  
          - /tmp/openclaw/*.log  
        MaxDirSearchDepth: 0  
        FileEncoding: utf8  
        EnableContainerDiscovery: true  
        # Filter containers based on OpenClaw sandbox information.  
        ContainerFilters:  
          K8sPodRegex: ^(openclaw.*)$  
    processors:  
      - Type: processor_parse_json_native  
        SourceKey: content  
    flushers:  
      - Type: flusher_sls  
        Logstore: openclaw-app-log  
    sample: ''  
  # Replace this with the name of the sandbox machine group for your ACK or ACS cluster.  
  machineGroups:  
    - name: &lt;span class="nt"&gt;&amp;lt;your-sandbox-machine-group&amp;gt;&lt;/span&gt;  
  # The destination project for data collection.  
  project:  
    name: k8s-log-xxx  
  # The destination Logstore for data collection.  
  logstores:  
    - name: openclaw-app-log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenTelemetry&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;apiVersion: telemetry.alibabacloud.com/v1alpha1  
kind: ClusterAliyunPipelineConfig  
metadata:  
  name: openclaw-otel-config  
spec:  
  config:  
    # This corresponds to the logstores below. It distributes and stores OpenTelemetry logs, metrics, and trace data.  
    aggregators:  
      - Type: aggregator_opentelemetry  
        MetricsLogstore: openclaw-otel-metrics  
        TraceLogstore: openclaw-otel-traces  
        LogLogstore: openclaw-otel-logs  
    global: {}  
    inputs:  
      - Type: service_otlp  
        Protocals:  
          HTTP:  
            # Corresponds to the diagnostics-otel Endpoint enabled in OpenClaw.  
            Endpoint: '127.0.0.1:4318'  
            ReadTimeoutSec: 10  
            ShutdownTimeoutSec: 5  
            MaxRecvMsgSizeMiB: 64  
    processors: []  
    flushers:  
      - Type: flusher_sls  
        Logstore: openclaw-otel-logs  
  # Replace with the sandbox machine group name for the ACK or ACS cluster.  
  machineGroups:  
    - name: &lt;span class="nt"&gt;&amp;lt;your-sandbox-machine-group&amp;gt;&lt;/span&gt;  
  # The project that receives the collected data.  
  project:  
    name: k8s-log-xxx  
  # The Logstores that receive the collected data. OpenTelemetry has three data types, so you must define three Logstores.  
  # For metrics data, set telemetryType to Metrics.  
  logstores:  
    - name: openclaw-otel-logs  
    - name: openclaw-otel-metrics  
      telemetryType: Metrics  
    - name: openclaw-otel-traces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
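
&lt;p&gt;One easy mistake with this configuration is naming a Logstore in the aggregator that is not declared under &lt;code&gt;logstores&lt;/code&gt;. The sketch below mirrors the relevant parts of the YAML as Python dictionaries and reports any missing names; the &lt;code&gt;missing_logstores&lt;/code&gt; helper is hypothetical and is not part of any Alibaba Cloud tooling.&lt;/p&gt;

```python
# Hypothetical consistency check on dicts mirroring the YAML above:
# every Logstore the aggregator writes to must be declared under logstores.
aggregator = {
    "MetricsLogstore": "openclaw-otel-metrics",
    "TraceLogstore": "openclaw-otel-traces",
    "LogLogstore": "openclaw-otel-logs",
}
declared = [
    {"name": "openclaw-otel-logs"},
    {"name": "openclaw-otel-metrics", "telemetryType": "Metrics"},
    {"name": "openclaw-otel-traces"},
]


def missing_logstores(aggregator_cfg, declared_logstores):
    declared_names = {entry["name"] for entry in declared_logstores}
    return sorted(set(aggregator_cfg.values()) - declared_names)


print(missing_logstores(aggregator, declared))
```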



&lt;p&gt;&lt;strong&gt;3.4 Summary: Fully Resolving OpenClaw Security Challenges&lt;/strong&gt;&lt;br&gt;
Sandbox runs OpenClaw securely and in isolation&lt;/p&gt;

&lt;p&gt;● Each Sandbox runs in an isolated kernel environment, preventing malicious code from attacking host system programs.&lt;/p&gt;

&lt;p&gt;● Each Sandbox uses an isolated temporary file system to prevent unauthorized reading, tampering, or deletion of host files.&lt;/p&gt;

&lt;p&gt;LoongCollector enables full-stack observability for OpenClaw&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpzjp9mkydewcje911fe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpzjp9mkydewcje911fe.png" alt=" " width="789" height="306"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;4. Summary and Outlook&lt;/strong&gt;&lt;br&gt;
The production-readiness of AI Agents is not a matter of "if," but "how." Security and observability are not optional—they are essential requirements.&lt;/p&gt;

&lt;p&gt;If you are building an AI agent application:&lt;/p&gt;

&lt;p&gt;● Start now by prioritizing runtime security and observability.&lt;/p&gt;

&lt;p&gt;● Choose the right tools instead of reinventing the wheel.&lt;/p&gt;

&lt;p&gt;● Establish best practices and promote them within your team.&lt;/p&gt;

&lt;p&gt;● Continually learn and optimize to ensure your Agents create real value.&lt;/p&gt;

&lt;p&gt;Both ACS Agent Sandbox and LoongCollector are open platforms; we invite you to try them and share your feedback. Together, let's build a more secure, reliable, and efficient production environment for AI Agents. We hope this article provides valuable reference and inspiration for your observability journey.&lt;/p&gt;

</description>
      <category>loongcollector</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Human-Robot Half Marathon: The Large-Scale O&amp;M Challenge for Embodied Intelligence Beyond the Racecourse</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Wed, 20 May 2026 02:38:23 +0000</pubDate>
      <link>https://dev.to/observabilityguy/human-robot-half-marathon-the-large-scale-om-challenge-for-embodied-intelligence-beyond-the-5d7p</link>
      <guid>https://dev.to/observabilityguy/human-robot-half-marathon-the-large-scale-om-challenge-for-embodied-intelligence-beyond-the-5d7p</guid>
      <description>&lt;p&gt;This article introduces an Alibaba Cloud-powered O&amp;amp;M observability system tackling humanoid robot challenges in large-scale, outdoor, and long-distance scenarios.&lt;/p&gt;

&lt;p&gt;A special half marathon has just concluded in Beijing. More than 300 humanoid robots competed alongside humans across dimensions such as autonomous navigation, dynamic balance, and multi-robot coordination, setting a global record for the scale of human-robot co-running events. When hundreds of robots collectively run 21 kilometers, what we see is not just a race but a large-scale public stress test for embodied intelligence. As the race ends, a bigger challenge has emerged beyond the racecourse:&lt;/p&gt;

&lt;p&gt;Embodied intelligence scenarios are increasingly clustered, mobile, and complex, and the industry urgently needs a standardized, reusable, integrated O&amp;amp;M system that works in outdoor weak-network and heterogeneous multi-device environments. A collaborative O&amp;amp;M observability system for humanoid robots has been built on Alibaba Cloud's full-spectrum observability capabilities, with Simple Log Service (SLS), CloudMonitor (CMS), and Application Real-Time Monitoring Service (ARMS) as its core foundation. The system targets typical scenarios that involve long-distance movement, multi-robot formation coordination, and interference from many environmental variables, and it offers the industry a practical reference for solving large-scale O&amp;amp;M challenges.&lt;/p&gt;

&lt;p&gt;Three Dilemmas: New Challenges in Embodied Intelligence O&amp;amp;M Observability&lt;br&gt;
The 21-kilometer open course of the half marathon is an extreme stress test of the comprehensive stability of humanoid robots. It also exposes the three core bottlenecks in deploying embodied intelligence clusters at scale — a common challenge across all outdoor large-scale scenarios.&lt;/p&gt;

&lt;p&gt;● Environmental uncertainty is the primary challenge of outdoor operations. In open scenarios, temperature, humidity, and lighting change in real time, while uncontrollable factors such as road bumps, ramps, curves, pedestrian crossings, and wireless signal fluctuations persist. Together they continuously interfere with sensor detection accuracy, communication stability, and power system load balance. Under high temperatures in particular, prolonged high-load operation of the robot's active joints, compute modules, and battery components accelerates hardware aging and significantly increases component failure rates. Device operation stays in a state of dynamic fluctuation, where a single environmental disturbance can trigger cascading anomalies.&lt;/p&gt;

&lt;p&gt;● Hidden damage and coupling risks in highly integrated devices further amplify operational risk. Humanoid robots tightly integrate motion modules, multiple sensor types, edge computing, AI inference, wireless communication, and other layered systems, with precise structures and strong interdependencies. Minor vibrations and low-speed collisions during movement cause no visible external damage, but they can easily lead to irreversible hidden issues such as slight displacement of lidar and vision cameras, loose joint wiring, and micro-deformation of internal support structures. These in turn cause inaccurate navigation and obstacle avoidance, intermittent signal loss, and deviations in task execution. Combined with unit-to-unit differences introduced by manual assembly, a minor anomaly in one device can quickly propagate to the entire formation, causing coordination disorder, rhythm desynchronization, and even cluster-level safety risks.&lt;/p&gt;

&lt;p&gt;● Traditional O&amp;amp;M patterns cannot adapt to these new scenarios. Fixed devices have traditionally relied on post-incident emergency repair, manual offline troubleshooting, and standalone management. This passive pattern responds late and is unsuitable for humanoid robots that move dynamically, operate in all weather, and collaborate in groups. To support stable operation of large-scale clusters, it is essential to break down data silos among hardware metrics, system logs, algorithm traces, and environmental data, move beyond experience-based manual O&amp;amp;M, and shift from passive remediation to active defense through full-dimension status visualization, proactive threat prediction, and rapid containment of losses from anomalies.&lt;/p&gt;

&lt;p&gt;Cloud-edge Collaborative Data Collection Adapted to the Core O&amp;amp;M Features of Humanoid Robots&lt;br&gt;
Humanoid robots move over large areas, work in unstable network environments, come from multiple heterogeneous brands, and operate continuously for long periods. An O&amp;amp;M architecture for the industry therefore has to balance low-latency self-healing at the edge with unified global management in the cloud. The solution adopts a three-layer cloud-edge collaborative design that spans the robot body, the edge gateway, and the cloud platform, and it separates the responsibilities of data collection, local management, compute processing, and global analysis. Built around three core O&amp;amp;M modules (real-time status monitoring, intelligent failure prediction, and hierarchical emergency response), Alibaba Cloud observability products form a complete capability matrix that integrates metrics, traces, and logs. This addresses industry pain points such as fragmented device logs, hardware metrics that are hard to quantify, and hidden algorithm faults that are hard to troubleshoot.&lt;br&gt;
At the data access layer, the solution provides two highly available and flexible deployment modes to adapt to different outdoor conditions and network environments.&lt;/p&gt;

&lt;p&gt;● Direct collection with the lightweight LoongCollector and the Simple Log Service SDK: this mode uses very few resources on the device and compresses and transmits data efficiently. It meets strict real-time monitoring requirements and supports dynamic adjustment of collection policies from the cloud, which removes the need for frequent OTA upgrades on devices. LoongCollector is a new-generation data collector from Alibaba Cloud Simple Log Service that combines performance, stability, and programmability. It extends and integrates the observability technology stack beyond the single-scenario limits of traditional log collectors, and it supports the collection, processing, routing, and sending of Logs, Metrics, Traces, Events, and Profiles.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6oocosgywf5j2mzp52wu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6oocosgywf5j2mzp52wu.png" alt=" " width="799" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● The S3 protocol plus Simple Log Service mode suits weak-network and intermittent-connectivity scenarios. Data is cached and encrypted locally and uploaded during off-peak hours. This mode is low-cost, highly reliable, not tied to a single vendor, and easier to extend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpeiiqntbpksbn6sqxyix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpeiiqntbpksbn6sqxyix.png" alt=" " width="799" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both modes are fully compatible with 5G, Wi-Fi, IoT, and other communication methods, fully adapting to the complex and dynamic network environment of mobile robots.&lt;/p&gt;

&lt;p&gt;Full-Domain, All-Dimension Observability for a Transparent Robot Cluster Operation System&lt;/p&gt;

&lt;p&gt;Whether for outdoor formation movement or routine commercial deployment, the stable operation of large-scale embodied intelligence clusters rests on full-dimension, full-lifecycle, end-to-end observability.&lt;/p&gt;

&lt;p&gt;● At the hardware level, core metrics such as joint motor load, current temperature, power supply health, compute unit resource usage, inertial navigation calibration accuracy, sensing device data streams, sensor readings, and network quality are continuously collected to track the health of core components and detect hardware risks such as overload, overheating, abnormal power supply, and sensor degradation in advance.&lt;/p&gt;

&lt;p&gt;● At the business and algorithm level, the running status of underlying core processes is monitored in real time, and events are handled by severity level, with a focus on catching faults and fatal exceptions. Key metrics such as perception and decision inference latency, path planning efficiency, and collaborative execution success rate are continuously tracked to give a full picture of algorithm health and detect performance degradation and logic errors promptly.&lt;/p&gt;

&lt;p&gt;● At the scenario and environment level, real-world information is recorded, including full-lifecycle job information, device state transitions, outdoor temperature and humidity data, and physical collision events. Cross-referencing data across dimensions quickly separates failure root causes such as environmental interference, mechanical damage, algorithm bugs, and human operations, providing an objective basis for daily O&amp;amp;M and post-event review.&lt;/p&gt;

&lt;p&gt;For these scenarios, the three core dimensions of metric monitoring, Tracing Analysis, and log management are built out in depth to form a full-coverage, tightly integrated, closed-loop observability capability that targets industry pain points such as opaque device operation, hard-to-detect exceptions, and hard-to-trace failures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23unu8jne43dtmyt7gwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23unu8jne43dtmyt7gwq.png" alt=" " width="799" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● Metric monitoring focuses on the model training domain, covering full-dimension time-series monitoring and visual management of the AI infrastructure in AIBoost clusters. Continuous statistics on training resource load, hardware conditions, environment parameters, and cluster running status make the training process quantifiable and allow anomalies to be flagged early, keeping AI model iteration stable and reliable from the ground up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfi1dvqchmdyv8xoqekd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfi1dvqchmdyv8xoqekd.png" alt=" " width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● Tracing Analysis provides deep, end-to-end visibility into service operations, enabling visualization and tracing across the CDN mapping system, motion control services, AI inference links, and cross-device interface interactions. It captures hidden application-layer failures such as algorithm drift, background service stalls, blocked remote instructions, and multi-robot scheduling conflicts, making previously invisible software and algorithm issues transparent and significantly speeding up the troubleshooting of soft faults.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogudq9q8nsaslv74078y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogudq9q8nsaslv74078y.png" alt=" " width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● Log management provides unified collection and standardized management of end-to-end logs, including hardware operation logs, system process logs, AI module run records, edge node events, and job operation traces. It addresses the challenges of scattered logs from heterogeneous devices, inconsistent formats, fragmented data, and difficulty in correlating and tracing issues. With high-throughput ingestion and second-level retrieval, it delivers complete, objective, and verifiable data for failure review, root cause analysis, accountability determination, and batch issue tracing.&lt;/p&gt;

&lt;p&gt;With global visualization and management capabilities, you can gain a macro-level view of overall cluster status, device online status, and overall load fluctuations, while also drilling down into individual device details, linking macro-level management with device-level fault location. Combined with dynamic thresholds and intelligent anomaly detection, real-time alerts are triggered for frequent risks such as sudden power drops, high-temperature overloads, network disconnections, and data drift, enabling proactive risk prevention and control.&lt;/p&gt;
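
&lt;p&gt;As a rough illustration of the dynamic-threshold idea above, the Python sketch below flags a reading that strays more than k standard deviations from a rolling baseline of recent normal readings. The class name, window size, and k value are assumptions chosen for illustration; they are not taken from any Alibaba Cloud product.&lt;/p&gt;

```python
from collections import deque
from statistics import mean, pstdev


class DynamicThreshold:
    """Hypothetical detector: rolling baseline plus k standard deviations."""

    def __init__(self, window=30, k=3.0, min_samples=5):
        self.readings = deque(maxlen=window)  # recent normal readings only
        self.k = k
        self.min_samples = min_samples

    def is_anomaly(self, value):
        anomalous = False
        if len(self.readings) >= self.min_samples:
            baseline = mean(self.readings)
            spread = pstdev(self.readings)
            # Guard against a zero spread on perfectly flat data.
            limit = self.k * max(spread, 1e-9)
            anomalous = abs(value - baseline) > limit
        # Only normal readings update the baseline, so a fault does not
        # drag the threshold toward itself.
        if not anomalous:
            self.readings.append(value)
        return anomalous


detector = DynamicThreshold(window=10, k=3.0, min_samples=5)
motor_temps = [40.0, 40.5, 39.8, 40.2, 40.1, 40.3, 75.0, 40.2]
flags = [detector.is_anomaly(t) for t in motor_temps]
print(flags)
```

&lt;p&gt;Because the threshold follows the recent baseline instead of a fixed limit, the same rule can be applied to devices that normally run at different temperatures or loads.&lt;/p&gt;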

&lt;p&gt;Multi-Source Correlation Analysis to Resolve Gradual Hidden Risks with Predictive O&amp;amp;M&lt;br&gt;
Compared with obvious hardware damage, the slow degradation of sensor accuracy, wiring contact fatigue, chronic component aging, algorithm performance degradation, and hidden structural hazards caused by long-term vibration are the key factors affecting the long-term stable operation of humanoid robots. Such progressive issues cannot be detected through manual inspection; they require correlation analysis across multi-source data to implement data-driven predictive O&amp;amp;M.&lt;/p&gt;

&lt;p&gt;Drawing on full time-series metric data, this capability builds up long-term insight into basic resource O&amp;amp;M, model training and inference efficiency, device load changes, environmental impact patterns, and hardware aging trends to form a quantifiable health baseline. End-to-end Tracing Analysis reconstructs the complete flow of instruction routing, service invocation, and algorithm computation to quickly locate coordination bottlenecks and program anomalies. Unified log management then correlates system events, error records, environmental changes, and external interference before and after an anomaly to fully reconstruct the failure scene.&lt;/p&gt;

&lt;p&gt;Multi-dimension data association and cross-validation enable accurate discovery of latent patterns in device operation and early detection of hidden risks. A tiered alerting mechanism filters out insignificant fluctuations and duplicate alerts, so that risks are escalated and handled by tier. In the early stage of a failure, proactive intervention through automatic parameter tuning, run policy optimization, and fine-grained remote control extends the stable operating life of devices, reducing failure rates and unplanned maintenance costs at the source.&lt;/p&gt;
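
&lt;p&gt;One way to picture tiered alerting with duplicate suppression is the small Python sketch below. It is only an illustration under stated assumptions: the tier names, the cooldown window, and the escalation rule (three suppressed repeats move an alert up one tier) are hypothetical and do not describe the actual product behavior.&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Hypothetical tiers, ordered from least to most severe.
TIERS = ("info", "warning", "critical")


@dataclass
class AlertRouter:
    cooldown_s: float = 60.0   # repeats inside this window are suppressed
    escalate_after: int = 3    # suppressed repeats needed to move up a tier
    _last_sent: dict = field(default_factory=dict)
    _repeats: dict = field(default_factory=dict)

    def handle(self, device, rule, tier, now):
        """Return the tier to notify at, or None if the alert is suppressed."""
        key = (device, rule)
        last = self._last_sent.get(key)
        if last is not None and self.cooldown_s > now - last:
            # Duplicate inside the cooldown window: count it instead of sending.
            self._repeats[key] = self._repeats.get(key, 0) + 1
            if self._repeats[key] >= self.escalate_after:
                # Persistent problem: escalate one tier, capped at the top.
                tier = TIERS[min(TIERS.index(tier) + 1, len(TIERS) - 1)]
            else:
                return None
        self._last_sent[key] = now
        self._repeats[key] = 0
        return tier


router = AlertRouter()
sent = [router.handle("bot-1", "motor_temp", "warning", t) for t in (0, 10, 20, 30)]
print(sent)
```

&lt;p&gt;Keying the state on the device and rule pair keeps one noisy robot from suppressing alerts for the rest of the cluster.&lt;/p&gt;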

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdsm3ee8y26ms96ygcfe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdsm3ee8y26ms96ygcfe.png" alt=" " width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The deeper value of observability goes beyond keeping current operations stable: it uses data from real, complex scenarios to feed back into product R&amp;amp;D and process upgrades, paving the way for long-term commercialization of humanoid robots. With comprehensive data accumulation, you can compare operational differences across devices of the same model and batch, quickly identify common issues caused by defective component batches, structural design shortcomings, and manual assembly deviations, and help manufacturers optimize supply chains and production flows. Quantitative analysis of algorithm performance, component load, and sensing stability under different operating conditions precisely distinguishes hardware limitations from algorithm bottlenecks, helping R&amp;amp;D teams optimize motion control, autonomous navigation, and coordination policies in a targeted manner.&lt;/p&gt;

&lt;p&gt;Meanwhile, large volumes of scenario data such as real road conditions, crowd interference, complex lighting, extreme temperature and humidity, and collision anomalies can continuously enrich the simulation training sample library, narrow the gap between simulation and real outdoor scenarios, speed up algorithm iteration and adaptation to real machines, and help humanoid robots move faster from competition demonstrations to routine, large-scale deployment.&lt;/p&gt;

&lt;p&gt;Tiered Closed-Loop Emergency Response System for High Fault Tolerance Operation Assurance in Complex Scenarios&lt;br&gt;
Open outdoor scenarios inherently involve uncertainty. Instantaneous environmental changes, accidental mechanical disturbances, and short-term network anomalies cannot be completely eliminated. A standardized, tiered, and automated emergency response mechanism is the key line of defense for ensuring continuous and stable cluster operation. Based on the business characteristics of multi-robot formation operation, a comprehensive three-level failure handling logic is established: minor individual anomalies, local coordination failures, and systemic major failures. O&amp;amp;M resources are reasonably allocated through tiered control to avoid excessive response or delayed handling.&lt;/p&gt;

&lt;p&gt;When an abnormal event occurs, the observability system helps quickly locate the root cause: business trace analysis troubleshoots algorithm and scheduling issues, time-series metrics pinpoint the scope of hardware, power supply, and network anomalies, and full logs restore the complete on-site context, significantly reducing the time needed to troubleshoot and fix failures. After each abnormal event is handled, the complete failure timeline, alert records, root cause conclusions, and handling reports are automatically archived. This not only closes the O&amp;amp;M loop, but also builds reusable practical experience for optimizing handling policies and iterating management rules for similar scenarios in the future.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ogf46loyaifikb8zgw7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ogf46loyaifikb8zgw7.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Summary and Outlook&lt;br&gt;
The Beijing Yizhuang Humanoid Robot Half Marathon vividly demonstrates the rapid rise of China's humanoid robot industry and clearly signals that clustering, outdoor operation, and scenario-based deployment are the inevitable direction for the future development of embodied intelligence. As hardware integration and AI algorithms continue to break through, O&amp;amp;M capabilities are becoming a key variable that widens the industry gap. Multi-robot collaboration, hidden threat prevention, and full lifecycle management in open and complex environments are common challenges that all humanoid robot companies must address.&lt;/p&gt;

&lt;p&gt;Alibaba Cloud's full-domain observability solution for embodied intelligence, built on a cloud-edge collaboration architecture, integrates three core capabilities: metric monitoring, Tracing Analysis, and log analysis. It addresses the defining characteristics of humanoid robot scenarios, including mobile operations, cluster formation, weak network adaptation, and long-duration runs. Rather than being limited to a single event, it provides a mature, standardized, and replicable O&amp;amp;M framework for similar outdoor cluster, dynamic operation, and large-scale deployment scenarios across the industry.&lt;/p&gt;

&lt;p&gt;In the future, as the mass production of humanoid robots continues to scale up and application scenarios keep expanding, data-driven AIOps, proactive predictive protection, and end-to-end observability systems will become the core foundation for high-quality development of the embodied intelligence industry, helping China's humanoid robot technology advance from technical demonstration to full-scale commercial deployment.&lt;/p&gt;

&lt;p&gt;Related Products&lt;br&gt;
Simple Log Service: &lt;a href="https://www.alibabacloud.com/en/product/log-service" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/en/product/log-service&lt;/a&gt;&lt;br&gt;
CloudMonitor: &lt;a href="https://www.alibabacloud.com/en/product/cloud-monitor" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/en/product/cloud-monitor&lt;/a&gt;&lt;/p&gt;

</description>
      <category>intelligence</category>
      <category>beginners</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Put a Microscope on Hermes: Full Visibility into Agent Execution</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Wed, 20 May 2026 02:26:18 +0000</pubDate>
      <link>https://dev.to/observabilityguy/put-a-microscope-on-hermes-full-visibility-into-agent-execution-2b1a</link>
      <guid>https://dev.to/observabilityguy/put-a-microscope-on-hermes-full-visibility-into-agent-execution-2b1a</guid>
      <description>&lt;p&gt;Alibaba Cloud's OpenTelemetry-based observability plugin brings full visibility to Hermes AI agent execution, enabling traceable costs, performance, and security auditing.&lt;/p&gt;

&lt;p&gt;Hermes is an autonomous AI agent runtime framework developed by Nous Research. Rather than a one-shot, Q&amp;amp;A-style wrapper around a model, it is an agent runtime that runs continuously, invokes tools, accumulates experience, and grows as it is used.&lt;/p&gt;

&lt;p&gt;When an AI agent actually starts solving a problem, whether it finishes correctly or goes off course, the real challenge is often not whether the result is right, but what exactly the agent did.&lt;/p&gt;

&lt;p&gt;A single run of Hermes is not an ordinary model invocation. A seemingly simple interaction may involve multiple rounds of inference, tool calling, result reinjection, context expansion, and new inference loops. The model decides whether a tool is needed for the next step, and tool results in turn affect the subsequent inference path. Cost, latency, and faults often arise in the middle of this process.&lt;/p&gt;

&lt;p&gt;If the system can only provide a final reply, a few scattered logs, or a usage summary for a single invocation, Hermes remains a black box. You know it completed the job, but you can hardly tell how. You know the request consumed a lot of tokens, but you can hardly tell which step drove up the cost. You know the user experience has slowed down, but you can hardly determine whether model generation slowed, tool execution went abnormal, or ReAct (Reasoning + Acting) loops spiraled out of control.&lt;/p&gt;

&lt;p&gt;This is exactly our starting point for building observability into Hermes.&lt;/p&gt;

&lt;p&gt;This article introduces an observability plugin that Alibaba Cloud provides for Hermes. It reconstructs the real execution process of Hermes as a structured call chain: where a session starts, how many rounds of inference it goes through, which tools are invoked, how many tokens are spent, which step takes the longest, at which node a fault occurs, which operations are malicious, and how much sensitive data has been leaked.&lt;/p&gt;

&lt;p&gt;If you are using Hermes for real-world jobs, you will almost certainly encounter these problems:&lt;/p&gt;

&lt;p&gt;● Why is it so expensive this time?&lt;/p&gt;

&lt;p&gt;● Why is it so slow this time?&lt;/p&gt;

&lt;p&gt;● Did it actually invoke that tool?&lt;/p&gt;

&lt;p&gt;● Did the tool it used leak data?&lt;/p&gt;

&lt;p&gt;What these problems have in common is that they are about the process, not the result. If we can only see the final reply, then from an observability point of view Hermes is still not interpretable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xo8zr1qpczzk8pzix17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xo8zr1qpczzk8pzix17.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What Exactly Are We Trying to Solve&lt;br&gt;
The Alibaba Cloud Hermes observability plugin focuses on solving the following four types of problems.&lt;/p&gt;

&lt;p&gt;The first is that the procedure is invisible.&lt;/p&gt;

&lt;p&gt;After integrating an LLM, many systems still only show user input, final output, and a usage summary. But a real Hermes run is far more than that. Behind a single response, there may be multiple rounds of inference, multiple tool executions, continuous context expansion, and new inference loops. Without a call chain, the intermediate process is essentially a blank. The first thing we did was fill in that gap.&lt;/p&gt;

&lt;p&gt;The second is that costs are not attributable.&lt;/p&gt;

&lt;p&gt;The token bill itself isn't the hardest problem — the hardest part is not knowing where the money actually goes. A Hermes run can be expensive because the context suddenly explodes in a certain round, a tool returns an oversized result, the final round produces overly long output, or a certain class of jobs naturally triggers more steps. Without visibility into the tokens for each round of model invocation, cost analysis is nothing more than guesswork.&lt;/p&gt;
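
&lt;p&gt;To make the attribution idea concrete, here is a minimal Python sketch that sums tokens per inference round and finds the round where the input context grew the most. The span records and the "round" field are hypothetical sample data; only the token attribute names follow the OpenTelemetry GenAI naming used elsewhere in this article.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical chat-span records; "round" is an illustrative field and the
# numbers are made-up sample values, not measurements.
SPANS = [
    {"round": 1, "gen_ai.usage.input_tokens": 1200, "gen_ai.usage.output_tokens": 150},
    {"round": 2, "gen_ai.usage.input_tokens": 9800, "gen_ai.usage.output_tokens": 220},
    {"round": 3, "gen_ai.usage.input_tokens": 10400, "gen_ai.usage.output_tokens": 900},
]


def tokens_by_round(spans):
    """Sum input and output tokens per inference round."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for span in spans:
        bucket = totals[span["round"]]
        bucket["input"] += span.get("gen_ai.usage.input_tokens", 0)
        bucket["output"] += span.get("gen_ai.usage.output_tokens", 0)
    return dict(totals)


def largest_input_jump(spans):
    """Return the round whose input tokens grew most over the previous round."""
    per_round = tokens_by_round(spans)
    rounds = sorted(per_round)
    jumps = {
        cur: per_round[cur]["input"] - per_round[prev]["input"]
        for prev, cur in zip(rounds, rounds[1:])
    }
    return max(jumps, key=jumps.get) if jumps else None


print(tokens_by_round(SPANS))
print("context grew most in round", largest_input_jump(SPANS))
```

&lt;p&gt;In this sample the jump lands on round 2, which would point at whatever was injected into the context just before it, for example an oversized tool result.&lt;/p&gt;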

&lt;p&gt;The third is that performance cannot be broken down.&lt;/p&gt;

&lt;p&gt;Users will only tell you "it's getting slower," but "slow" by itself carries no useful information. What you really need to distinguish is: is the first token slow, or is overall generation slow? Is tool execution slow, or is multi-round ReAct inference itself running too long? Only by separating these stages can a "slowdown" become a problem you can actually pinpoint.&lt;/p&gt;

&lt;p&gt;The fourth is that results cannot be reviewed.&lt;/p&gt;

&lt;p&gt;Often the hardest issues to deal with are not clear-cut faults, but cases where "it looks like it succeeded, but the result is wrong." This is very common in agent systems: Hermes invokes the wrong tool, the tool returns incomplete results, Hermes continues to infer from partial information, and ultimately produces an answer that seems reasonable on the surface but has already gone off track. Without traces, post-mortem review is nearly impossible. With traces, the problem shifts from "guessing the cause" to "examining the path."&lt;/p&gt;

&lt;p&gt;What We Did&lt;br&gt;
What we built for Hermes is a set of tracing capabilities based on OpenTelemetry, the open observability framework.&lt;/p&gt;

&lt;p&gt;The core idea is straightforward: install runtime instrumentation in the Python environment where Hermes runs, create spans around the key execution boundaries of Hermes, and then report traces and metrics to the observability backend through the standard OpenTelemetry Protocol (OTLP).&lt;/p&gt;

&lt;p&gt;Our focus is not on what the final reply looks like, but on the running process of Hermes itself.&lt;/p&gt;

&lt;p&gt;This Solution Has Several Advantages Worth Highlighting&lt;br&gt;
This set of plugins is not a throwaway instrumentation script; it is designed around the OpenTelemetry ecosystem.&lt;/p&gt;

&lt;p&gt;First, it follows the GenAI standard specification as closely as possible at the semantic layer. Reported trace data aligns with the OpenTelemetry GenAI semantic conventions wherever possible. For structures in the agent runtime that are closer to the execution process, extensions are made based on the LoongSuite Semantic Conventions. Instead of defining a batch of field names that only make sense internally, we try to use standard, reusable, and portable semantics. In other words, this is not a makeshift approach, but a well-structured observability design that follows industry best practices.&lt;/p&gt;

&lt;p&gt;Second, it provides not only traces but also basic metrics signals. In addition to the call chain of a single request, you can also view trends such as the number of invocations, number of faults, invocation duration, and token usage. This way, you can replay a single request along a trace, or observe cost fluctuations, performance changes, and abnormal trends from a global perspective.&lt;/p&gt;

&lt;p&gt;Third, it records time to first token (TTFT) separately for streaming scenarios. In many cases, when users perceive something as "slow", it is not necessarily that the entire generation is slow, but rather that the first token takes too long to return. With TTFT, performance issues can be further broken down from "feels slow" into "slow first token" or "slow overall generation".&lt;/p&gt;

&lt;p&gt;Fourth, it is not tied to a single Alibaba Cloud service on the backend. The current solution connects directly to Alibaba Cloud ARMS, but it uses the standard OTLP protocol underneath and is not locked into a private data structure. Connecting to ARMS works today, and the path to other OTLP-compatible backends remains open if you need it later.&lt;/p&gt;

&lt;p&gt;Fifth, it supports security audits of important behaviors in Hermes. By collecting full operation logs, access records, and user behavioral data from the Hermes system, and combining outlier detection algorithms to build a dynamic audit model, it can accurately detect suspicious behaviors such as unauthorized access, abnormal data exporting, and malicious prompt injection.&lt;/p&gt;

&lt;p&gt;What Can Already Be Seen&lt;br&gt;
The current version of the Hermes observability plugin can reconstruct a real agent run as a ReAct-structured trace.&lt;/p&gt;

&lt;p&gt;The core trace structure is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;invoke_agent Hermes
└── react step
    ├── chat
    └── execute_tool &lt;span class="nt"&gt;&amp;lt;tool_name&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a job contains multiple rounds of inference and multiple tool calls, the trace naturally expands:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7ar72a6q6h8y7ii4s8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7ar72a6q6h8y7ii4s8w.png" alt=" " width="799" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The significance of this trace is not that there are more spans, but that the actual execution of Hermes becomes visible for the first time.&lt;/p&gt;

&lt;p&gt;How many rounds an execution ran, which round triggered the tool, and how the tool affected subsequent inference — all of this can now be viewed in the same trace.&lt;/p&gt;
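
&lt;p&gt;The nesting described above can be sketched in a few lines of Python. This is not the plugin's code: MiniTracer is a hypothetical stand-in for a real tracer that only records span names and depth, and web_search is an invented tool name. The span names themselves are the ones shown in the trace structure.&lt;/p&gt;

```python
from contextlib import contextmanager


class MiniTracer:
    """Hypothetical stand-in for a tracer: records (depth, name) in start order."""

    def __init__(self):
        self.records = []
        self._depth = 0

    @contextmanager
    def span(self, name):
        self.records.append((self._depth, name))
        self._depth += 1
        try:
            yield
        finally:
            self._depth -= 1


def run_agent(tracer, tool_calls):
    """One ReAct step per tool call, then a final step with only a chat span."""
    with tracer.span("invoke_agent Hermes"):
        for tool in tool_calls:
            with tracer.span("react step"):
                with tracer.span("chat"):
                    pass  # the model decides to call a tool
                with tracer.span("execute_tool " + tool):
                    pass  # the tool runs; its result feeds the next round
        with tracer.span("react step"):
            with tracer.span("chat"):
                pass  # the model produces the final answer


tracer = MiniTracer()
run_agent(tracer, ["web_search"])  # "web_search" is an invented tool name
for depth, name in tracer.records:
    print("    " * depth + name)
```

&lt;p&gt;Each extra tool call adds one more react step with its own chat and execute_tool children, which is how a multi-round job widens the trace.&lt;/p&gt;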

&lt;p&gt;Model Invocation&lt;br&gt;
Each chat span can currently record:&lt;/p&gt;

&lt;p&gt;● gen_ai.request.model&lt;/p&gt;

&lt;p&gt;● gen_ai.usage.input_tokens&lt;/p&gt;

&lt;p&gt;● gen_ai.usage.output_tokens&lt;/p&gt;

&lt;p&gt;● gen_ai.usage.total_tokens&lt;/p&gt;

&lt;p&gt;● gen_ai.response.time_to_first_token&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi19w6a51mxuu43amt939.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi19w6a51mxuu43amt939.png" alt=" " width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means we can finally view tokens and latency for each actual model invocation instead of only the aggregate for an entire session. In streaming scenarios especially, TTFT (time to first token) helps us further distinguish whether the first token is slow to return or the overall generation is slow.&lt;/p&gt;
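
&lt;p&gt;For readers unfamiliar with how TTFT is derived, the Python sketch below times a token stream and separates the wait for the first chunk from the total generation time. The function name and the fake stream are assumptions for illustration; the plugin's own measurement code is not shown in this article.&lt;/p&gt;

```python
import time


def consume_stream(chunks, clock=time.monotonic):
    """Consume a token stream and return (text, ttft_seconds, total_seconds)."""
    start = clock()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = clock()  # the first chunk has just arrived
        parts.append(chunk)
    end = clock()
    # ttft stays None when the stream produced nothing at all.
    ttft = None if first is None else first - start
    return "".join(parts), ttft, end - start


def fake_stream():
    """Hypothetical stream: a slow first token, then fast follow-up tokens."""
    time.sleep(0.05)
    yield "Hel"
    for piece in ("lo", "!"):
        time.sleep(0.01)
        yield piece


text, ttft, total = consume_stream(fake_stream())
print(text, "ttft:", round(ttft, 3), "total:", round(total, 3))
```

&lt;p&gt;A large TTFT with a small remainder points at queueing or prompt processing, while a small TTFT with a large total points at long generation.&lt;/p&gt;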

&lt;p&gt;Tool Calling&lt;br&gt;
Each execute_tool span can currently record:&lt;/p&gt;

&lt;p&gt;● gen_ai.tool.name&lt;/p&gt;

&lt;p&gt;● gen_ai.tool.call.arguments&lt;/p&gt;

&lt;p&gt;● gen_ai.tool.call.result&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqo7di15j111tos3z40w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqo7di15j111tos3z40w.png" alt=" " width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tools are no longer blank spots in the process. We can see when Hermes decided to invoke a tool, which tool was invoked, what parameters were passed, and what results were returned.&lt;/p&gt;

&lt;p&gt;Agent-Level Summary&lt;br&gt;
The root span, invoke_agent Hermes, can now record aggregate results for the entire run, including:&lt;/p&gt;

&lt;p&gt;● Cumulative token usage&lt;/p&gt;

&lt;p&gt;● Final output message&lt;/p&gt;

&lt;p&gt;● Total elapsed time&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2y41w6atsykoywrgu3c8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2y41w6atsykoywrgu3c8.png" alt=" " width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Important Behavior Audit&lt;br&gt;
The plugin records agent behavior across the full chain, intelligently generates audit views, and surfaces high-risk operations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fav746c10bip98ik2a16l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fav746c10bip98ik2a16l.png" alt=" " width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Quick Observability Integration: Deployment in a Few Steps&lt;br&gt;
The integration path for Hermes observability is streamlined into a straightforward flow: get the command from the console, copy it to the terminal and execute it, enable the plugin, start Hermes, and begin reporting.&lt;/p&gt;

&lt;p&gt;Tracing Integration&lt;br&gt;
Go to the console to obtain the installation command&lt;br&gt;
Log on to the CMS 2.0 (Cloud Monitor Service 2.0) console, go to the corresponding application monitoring workspace, choose Integration Center &amp;gt; AI Application Observability, and click Hermes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69rcc05rakivmacum0r2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69rcc05rakivmacum0r2.png" alt=" " width="799" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the sidebar, enter the application name and click Get to immediately generate the integration command. Click the icon in the upper-right corner to copy it with one click.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws1shidyunnurmuf0szs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws1shidyunnurmuf0szs.png" alt=" " width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One-line command to start installation&lt;br&gt;
Open the terminal on the machine where Hermes is located, paste the copied command, and execute it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;curl -fsSL https://arms-apm-cn-hangzhou-pre.oss-cn-hangzhou.aliyuncs.com/hermes-agent-cms-plugin/hermes-cms.sh | bash -s -- install \
  --x-arms-license-key "auto" \
  --x-arms-project "Your project" \
  --x-cms-workspace "Your Workspace" \
  --serviceName "hermes" \
  --endpoint "https://Your ARMS-OTLP address/apm/trace/opentelemetry"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you execute the installation command for the first time, in addition to installing the plugin itself, the system also registers the hermes-cms command on the local machine for subsequent operations such as enable, disable, and uninstall.&lt;/p&gt;

&lt;p&gt;If the following message appears in the terminal, the plugin has been installed successfully:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;════════════════════════════════════════════════════

✅ hermes-agent-cms-plugin installed successfully!

════════════════════════════════════════════════════
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Throughout the procedure, you do not need to edit any configuration file manually. The script first tries to match the current environment, and falls back to the official default installation location only when the current environment does not meet the requirements.&lt;/p&gt;

&lt;p&gt;Turn on observability, and then start Hermes&lt;br&gt;
After the installation is complete, don't rush to check the console.&lt;/p&gt;

&lt;p&gt;The first step is to turn on the observability switch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;hermes-cms enable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then start Hermes.&lt;/p&gt;

&lt;p&gt;To run in the foreground, execute directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;hermes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run in the background, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;hermes gateway install

hermes gateway start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How to confirm that instrumentation is actually working&lt;br&gt;
If the following message appears in the terminal after startup, the observability instrumentation has taken effect:&lt;/p&gt;

&lt;p&gt;loongsuite-site-bootstrap: started successfully (OpenTelemetry auto-instrumentation initialized).&lt;/p&gt;

&lt;p&gt;After confirming that the instrumentation has taken effect, send Hermes a few test requests, or run a real job that triggers multiple rounds of inference and tool calling. After a minute or two, return to the CMS 2.0 console, and you will see your Hermes application in AI Application Observability.&lt;/p&gt;

&lt;p&gt;At this point, Hermes is no longer just a black-box responder; it becomes a running system that can be drilled into, traced, and analyzed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0acuihguxatpfc3sgb5u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0acuihguxatpfc3sgb5u.png" alt=" " width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvqvbeyjgmzvltck5v1q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvqvbeyjgmzvltck5v1q.png" alt=" " width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open the observability application to view the number of Hermes model invocations, token consumption trends, request fluctuations, and the average number of LLM invocation rounds per request, as well as the latency and invocation distribution across the AGENT, LLM, and TOOL phases. You can also open a complete trace to reconstruct the actual execution of Hermes and see clearly how many rounds of inference a job went through, which tools were invoked, which step took the longest, and which round consumed the most tokens.&lt;/p&gt;
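
&lt;p&gt;To make the questions a trace answers concrete, here is a minimal Python sketch that summarizes one trace. The Span class, its field names, and the sample span names are assumptions made for illustration; they are not the schema that the plugin actually reports. Only the AGENT, LLM, and TOOL phase labels come from the text above.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Span:
    name: str          # hypothetical span name
    kind: str          # "AGENT", "LLM", or "TOOL", the phases named above
    duration_ms: int   # assumed field
    tokens: int = 0    # assumed field; only meaningful for LLM spans

def summarize(spans):
    """Answer the per-trace questions; assumes at least one LLM span exists."""
    llm = [s for s in spans if s.kind == "LLM"]
    tools = [s for s in spans if s.kind == "TOOL"]
    steps = llm + tools
    return {
        "llm_rounds": len(llm),
        "tools_invoked": [s.name for s in tools],
        "slowest_step": max(steps, key=lambda s: s.duration_ms).name,
        "costliest_round": max(llm, key=lambda s: s.tokens).name,
        "total_tokens": sum(s.tokens for s in llm),
    }

# Made-up sample trace: one agent run with two LLM rounds and one tool call.
trace = [
    Span("agent.run", "AGENT", 5200),
    Span("llm.round-1", "LLM", 1200, tokens=850),
    Span("tool.search", "TOOL", 2300),
    Span("llm.round-2", "LLM", 1500, tokens=1400),
]
summary = summarize(trace)
print(summary)
```

&lt;p&gt;The console performs this kind of aggregation for you; the sketch only shows which raw facts (phase, duration, tokens) the breakdown depends on.&lt;/p&gt;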

&lt;p&gt;View the demo examples and the hermes_agentloop_support example at &lt;a href="https://sls.aliyun.com/doc/en/playground/cmsdemo.html" rel="noopener noreferrer"&gt;https://sls.aliyun.com/doc/en/playground/cmsdemo.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Want to disable or uninstall? It's straightforward.&lt;br&gt;
To temporarily turn off observability, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;hermes-cms disable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To completely uninstall the plugin, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;hermes-cms uninstall
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log Ingestion&lt;br&gt;
Configure application info on the access card&lt;br&gt;
Next, open the "Log Access" page, set a custom application name, click Initialize Resources, enter the previously configured Project name, and configure the machine group as prompted. This completes the setup of the Hermes audit feature.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3e0tgd36zoea72512m89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3e0tgd36zoea72512m89.png" alt=" " width="800" height="740"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Auto-generated audit dashboard&lt;br&gt;
After the integration is complete, choose Audit &amp;gt; Hermes Insight &amp;gt; Hermes Audit in the left sidebar to view the audit dashboard of your Hermes agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruup32h7qkxddl55lwe3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruup32h7qkxddl55lwe3.png" alt=" " width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Summary and Outlook&lt;br&gt;
This solution can reliably address Tracing Analysis, token attribution, and basic performance breakdown, while also providing basic metrics signals for trend analysis. However, this does not mean that all observability work for Hermes is complete.&lt;/p&gt;

&lt;p&gt;Next, we will continue to push forward in several directions.&lt;/p&gt;

&lt;p&gt;● On the data plane, continue to expand from traces, span attributes, and basic metrics to more complete log audit and runtime diagnostics capabilities.&lt;/p&gt;

&lt;p&gt;● On the trace plane, continue to refine Hermes-specific execution phases beyond agent, step, llm, and tool, such as memory lifecycle, delegation orchestration, and runtime recovery.&lt;/p&gt;

&lt;p&gt;● On the governance plane, continue to strengthen content collection control, finer-grained data governance, and unified data masking and security policies.&lt;/p&gt;

&lt;p&gt;Today, we already have an active runtime observability infrastructure, and the next goal is to further evolve it into a more complete, more detailed Agent observability system that is better suited for real production environments.&lt;/p&gt;

</description>
      <category>hermes</category>
      <category>agents</category>
      <category>beginners</category>
    </item>
    <item>
      <title>From Observable to Understandable: Building Agent-Native Code Knowledge Graphs with UModel</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Mon, 11 May 2026 06:57:40 +0000</pubDate>
      <link>https://dev.to/observabilityguy/from-observable-to-understandable-building-agent-native-code-knowledge-graphs-with-umodel-dll</link>
      <guid>https://dev.to/observabilityguy/from-observable-to-understandable-building-agent-native-code-knowledge-graphs-with-umodel-dll</guid>
      <description>&lt;p&gt;UModel builds agent-native code knowledge graphs using deterministic AST parsing and cross-domain associations for deeper AI code understanding.&lt;/p&gt;

&lt;p&gt;Background&lt;br&gt;
In recent years, AI agents (Cursor, Copilot, Claude Code, Codex, etc.) have become deeply involved in software development. From code completion to cross-file refactoring, from bug localization to architecture design, agent capabilities are growing stronger. From Prompt Engineering to Context Engineering to Harness Engineering, the ways to harness AI continue to evolve, and the capability boundaries of agents continue to expand.&lt;/p&gt;

&lt;p&gt;However, when we hand a real enterprise-level project to an agent, an overlooked question begins to surface: Does the agent really understand your project?&lt;/p&gt;

&lt;p&gt;The way agents currently understand code is diverging into two distinct schools:&lt;/p&gt;

&lt;p&gt;● No-index school: Claude Code follows the Unix philosophy and performs no pre-indexing at all — it searches the file system in real time using grep, rg, and glob. Anthropic's internal tests found that agentic search outperforms retrieval-augmented generation across the board, by a lot. It is concise, real-time, and free of privacy issues, but each session starts from scratch and is costly for large repositories.&lt;/p&gt;

&lt;p&gt;● CodeIndex school: Cursor, Windsurf, and Copilot follow the vector index route: they use tree-sitter for semantic chunking, generate embeddings, store them in a vector database (such as Turbopuffer), and use a Merkle tree for incremental synchronization. Qodo and Augment Code go a step further by overlaying a code dependency graph and a commit history index on top of the vector index.&lt;/p&gt;

&lt;p&gt;Both schools have their own strengths, but they still struggle with the following problems:&lt;/p&gt;

&lt;p&gt;● I want to change the Adapter interface of pkg/a2a. What is the scope of impact?&lt;/p&gt;

&lt;p&gt;Vector similarity search cannot find the dependency chain, and grep-based file-by-file search is inefficient and incomplete.&lt;br&gt;
● In production, the vibeops-xxx SLO has been breached with a large number of pending requests. What is the cause? Is it a code change?&lt;/p&gt;

&lt;p&gt;The code index only covers the code domain; O&amp;amp;M domain data is not in the graph.&lt;br&gt;
● Are there any abnormal dependencies in the project that cross architectural boundaries?&lt;/p&gt;

&lt;p&gt;Without architecture-level modeling, a boundary violation cannot even be defined.&lt;br&gt;
What these problems have in common is that they require deterministic structural relationships, cross-domain entity associations, and change history across the time dimension.&lt;/p&gt;

&lt;p&gt;The author has worked in the observability field for more than ten years. Looking back at how observability has developed, especially as cloud-native and AI-native systems grow more complex, the challenge has long gone beyond "reading a log and staring at a monitoring chart". It is about putting scattered objects such as applications, services, containers, databases, alerts, changes, and events back into the same context to answer "who is related to whom", "how does the impact spread", and "when did the problem begin".&lt;/p&gt;

&lt;p&gt;Because of this, Alibaba Cloud observability has gradually evolved from collecting and displaying scattered data such as logs, metrics, and traces to unified modeling of objects, relationships, and time series. UModel grew out of this practice.&lt;/p&gt;

&lt;p&gt;Code understanding today is strikingly similar to where observability once stood. Observability evolved from viewing fragmented logs to unified modeling with the UModel knowledge graph. Code understanding, even with the most advanced CodeIndex solution, remains at the stage of helping agents find relevant snippets: the snippets are found, but the structure is not understood.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi9hzarx0fxtswhyt41a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi9hzarx0fxtswhyt41a.png" alt=" " width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Five Paradigms of Code Understanding&lt;br&gt;
Before diving into the technical solution, it is necessary to clarify the complete landscape of current code understanding. The five paradigms represent the evolution from stateless search to stateful inference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3683oejicr4u4d9mp4fh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3683oejicr4u4d9mp4fh.png" alt=" " width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paradigm 1: Agentic Search (Claude Code School)&lt;br&gt;
Claude Code is currently the most extreme index-free route. Anthropic founding engineer Boris Cherny publicly shared the story behind this decision: early versions of Claude Code used retrieval-augmented generation with a local vector database, but internal tests found that agentic search won across the board, by a wide margin, which came as a surprise.&lt;/p&gt;

&lt;p&gt;Its approach is pure to the point of elegance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Agent receives a question  
  → Glob: pattern matching by file name (near-zero token cost)  
  → Grep (ripgrep): regex search by content (low token cost)  
  → Read: read the complete file (high token cost)  
  → Evaluate → next round of search or provide an answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tools are tiered by token cost, and the agent independently determines the search policy — like an experienced developer using rg + cat in the terminal to troubleshoot issues. This Unix-philosophy method has several real advantages:&lt;/p&gt;

&lt;p&gt;● Zero pre-processing: no index build time required — open the project and start working immediately&lt;/p&gt;

&lt;p&gt;● Always Fresh: No index expiration issues. Every search reflects the real-time file system status.&lt;/p&gt;

&lt;p&gt;● Privacy-Friendly: Code never leaves your local machine — no embeddings are generated, and nothing is uploaded to any server.&lt;/p&gt;

&lt;p&gt;● Simple and Reliable: The dependency chain is extremely short: Agent + file system + ripgrep. No vector database to crash.&lt;/p&gt;

&lt;p&gt;But the ceiling of this approach is equally clear:&lt;/p&gt;

&lt;p&gt;● No Structure Awareness: rg HandleRequest can find all occurrences, but cannot distinguish definitions from invocations or comments. The Agent has to read the code itself to determine this.&lt;/p&gt;

&lt;p&gt;● Start from Scratch Every Time: Dependencies analyzed in the previous session are entirely discarded in the next. There is no persistence of accumulated knowledge.&lt;/p&gt;

&lt;p&gt;● Limited scale: A TypeScript project with 200 files is fine, but for an enterprise-level monorepo with 50,000 files, agentic search may require 30+ rounds of tool calling and tens of thousands of tokens to piece together a global dependency graph. In practice, it is impossible to construct a complete global graph — only partial views relevant to the current job can be assembled.&lt;/p&gt;

&lt;p&gt;● Unable to perform global analysis: Cannot answer "list all invocations across architecture levels" because the architecture levels themselves have not been modeled.&lt;/p&gt;
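
&lt;p&gt;The tiered glob, grep, read flow and its "no structure awareness" ceiling can be sketched in a few lines of Python. This is an illustrative sketch only, not how Claude Code is implemented: the helper names glob_files and grep and the tiny sample repository are made up for the example, and Python's re module stands in for ripgrep.&lt;/p&gt;

```python
import re
import tempfile
from pathlib import Path

def glob_files(root: Path, pattern: str) -> list[Path]:
    """Tier 1: match by file name only; no file contents are read."""
    return sorted(root.rglob(pattern))

def grep(files: list[Path], regex: str) -> list[tuple[str, int, str]]:
    """Tier 2: regex search by content, returning (file, line number, line)."""
    rx = re.compile(regex)
    hits = []
    for path in files:
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if rx.search(line):
                hits.append((path.name, lineno, line.strip()))
    return hits

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Made-up sample repository: a comment, a definition, a call, and a doc.
    (root / "server.go").write_text(
        "// HandleRequest is the entry point\n"
        "func (s *Server) HandleRequest() {}\n"
    )
    (root / "main.go").write_text("func main() { s.HandleRequest() }\n")
    (root / "README.md").write_text("HandleRequest docs\n")

    go_files = glob_files(root, "*.go")
    hits = grep(go_files, r"HandleRequest")
    # Tier 3 (reading whole files) would follow only for files worth the cost.

for hit in hits:
    print(hit)
```

&lt;p&gt;All three hits look the same to the search: the comment, the definition, and the call are just matching lines. Telling them apart requires reading the code, which is exactly the work left to the agent.&lt;/p&gt;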

&lt;p&gt;Paradigm 2: CodeIndex / Vector Index (Cursor, Windsurf, and Copilot School)&lt;br&gt;
This is the mainstream technical approach of current AI IDEs. Taking Cursor as an example, its technical architecture has been extensively analyzed in public:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Code Repository  
  → Parse into AST with tree-sitter  
  → Segment by semantic unit (function, class, logic block)  
  → Generate vector embedding  
  → Store in Turbopuffer vector database  
  → Merkle Tree tracks changes for incremental synchronization
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cursor has achieved several elegant optimizations in engineering: it uses Merkle Tree root hash comparison to detect changes every 10 minutes and only re-embeds changed files; 92% codebase similarity among team members allows index reuse, reducing the initial indexing for new members from minutes to seconds; the index scope is controlled via .cursorignore.&lt;/p&gt;
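
&lt;p&gt;The idea behind root hash comparison can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not Cursor's implementation: SHA-256 and the function names are choices made for the example, and the changed-file lookup compares leaf hashes directly instead of descending only into differing subtrees as a real Merkle tree would.&lt;/p&gt;

```python
import hashlib

def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaf_hashes: list[str]) -> str:
    """Fold leaf hashes pairwise until a single root hash remains."""
    if not leaf_hashes:
        return sha(b"")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            sha((level[i] + level[i + 1]).encode())
            for i in range(0, len(level), 2)
        ]
    return level[0]

def snapshot(files: dict[str, bytes]) -> dict[str, str]:
    """Hash each file together with its path, in sorted path order."""
    return {
        path: sha(path.encode() + b"\0" + content)
        for path, content in sorted(files.items())
    }

def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Skip all work when roots match; otherwise list files to re-embed."""
    if merkle_root(list(old.values())) == merkle_root(list(new.values())):
        return []
    return sorted(p for p in new if old.get(p) != new[p])

before = snapshot({"a.py": b"print(1)", "b.py": b"print(2)", "c.py": b"print(3)"})
after = snapshot({"a.py": b"print(1)", "b.py": b"print(20)", "c.py": b"print(3)"})
print(changed_files(before, after))
```

&lt;p&gt;The cheap path is the common one: when nothing changed, a single root comparison ends the check, and only changed files need new embeddings. Deleted files are not handled in this sketch.&lt;/p&gt;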

&lt;p&gt;Windsurf (Codeium) uses a similar retrieval-augmented generation architecture: 768-dimensional vector embedding + proprietary M-Query retrieval, but additionally overlays the Cascade context engine to track edit history, terminal commands, navigation patterns, and other session states. GitHub Copilot achieved sub-second semantic search indexing in March 2025.&lt;/p&gt;

&lt;p&gt;The real value of CodeIndex is semantic search: the agent can find relevant code by describing intent in natural language without knowing the exact function name. This is something grep cannot do.&lt;/p&gt;

&lt;p&gt;But CodeIndex has a fundamental limitation: vector similarity is text-level approximate matching, not structure-level relational reasoning.&lt;/p&gt;

&lt;p&gt;● import pkg/a2a is a deterministic dependency in code, but in vector space it is merely a similarity signal of a text segment.&lt;/p&gt;

&lt;p&gt;● Finding all modules that directly or indirectly depend on pkg/a2a requires graph traversal, not similarity search.&lt;/p&gt;

&lt;p&gt;● Determining how many hops the impact of this interface change propagates along the invocation chain requires deterministic call relationships, not semantic similarity.&lt;/p&gt;

&lt;p&gt;● Augment Code's evaluation shows that Cursor produces inconsistencies in cross-file refactoring across 50+ files: the first 30 files are modified correctly, but the last 20 contain faults due to context window overflow.&lt;/p&gt;

&lt;p&gt;CodeIndex is essentially a smarter search engine: it helps agents find the correct snippets to insert into the context, but does not perform structured inference for agents.&lt;/p&gt;
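
&lt;p&gt;What "graph traversal, not similarity search" means in practice can be shown with a short Python sketch: a breadth-first search over reversed import edges returns every direct and indirect dependent of a package, along with the number of hops. The import graph below is hypothetical; only pkg/a2a is taken from the text.&lt;/p&gt;

```python
from collections import deque

# Hypothetical import edges: importer -> packages it imports.
IMPORTS = {
    "cmd/server": ["pkg/api"],
    "pkg/api": ["pkg/a2a", "pkg/log"],
    "pkg/worker": ["pkg/a2a"],
    "pkg/a2a": ["pkg/log"],
    "pkg/log": [],
}

def dependents_with_hops(imports: dict[str, list[str]], target: str) -> dict[str, int]:
    """Return every package that depends on target, with its hop distance."""
    # Reverse the edges so we can walk from a package to its importers.
    reverse: dict[str, list[str]] = {}
    for importer, imported in imports.items():
        for pkg in imported:
            reverse.setdefault(pkg, []).append(importer)

    hops: dict[str, int] = {}
    queue = deque([(target, 0)])
    while queue:
        node, dist = queue.popleft()
        for importer in reverse.get(node, []):
            if importer not in hops:  # first visit is the shortest path in BFS
                hops[importer] = dist + 1
                queue.append((importer, dist + 1))
    return hops

impact = dependents_with_hops(IMPORTS, "pkg/a2a")
print(impact)
```

&lt;p&gt;The result is exact and complete for the edges in the graph, which is the property that a ranked list of similar snippets cannot offer.&lt;/p&gt;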

&lt;p&gt;Paradigm 3: Code Graph + Retrieval-Augmented Generation Hybrid (Qodo and Augment Code School)&lt;br&gt;
Qodo and Augment Code represent the next evolutionary direction of CodeIndex: layering code structure graphs on top of vector indexes.&lt;/p&gt;

&lt;p&gt;Qodo's technology stack is particularly rigorous:&lt;/p&gt;

&lt;p&gt;● Self-developed Qodo-Embed-1 code embedding model (1.5B parameters surpassing 7B competitors on the CoIR benchmark), capturing syntax, variable dependencies, control flow, API usage, and other code-specific semantics through synthetic data training&lt;/p&gt;

&lt;p&gt;● Client-side code graph building: functions, classes, modules and their call graphs, inheritance relationships, and cross-language links&lt;/p&gt;

&lt;p&gt;● Server-side maintenance of vector database + design documents + architecture diagrams + PR/commit history&lt;/p&gt;

&lt;p&gt;● AST-aware chunking strategy: recursively chunk along AST node boundaries and backfill key context such as import statements and class definitions&lt;/p&gt;

&lt;p&gt;Augment Code's Context Engine goes even further:&lt;/p&gt;

&lt;p&gt;● Semantic index across repositories to understand how services connect and depend on each other&lt;/p&gt;

&lt;p&gt;● Index beyond Code: commit history (why changes were made), codebase patterns, external documents, tickets, and even tribal knowledge&lt;/p&gt;

&lt;p&gt;● Released Context Lineage in 2025 to index commit histories and diff summaries, enabling agents to understand the evolution of architectural decisions&lt;/p&gt;

&lt;p&gt;● Open to any compatible agent via MCP protocol, with benchmarks showing 30–80% quality improvement&lt;/p&gt;

&lt;p&gt;The key advancement of this school of thought is that code is not just text, but a structured graph. Augment, in particular, demonstrates the insight that understanding requires context, and context requires history.&lt;/p&gt;

&lt;p&gt;However, even the most advanced code graph + retrieval-augmented generation hybrid solution still has several systemic limits:&lt;/p&gt;

&lt;p&gt;● The graph scope is limited to the code domain: It knows that A invokes B, but not what alerts the service corresponding to B has triggered in the production environment. The code graph and the O&amp;amp;M graph are disconnected.&lt;/p&gt;

&lt;p&gt;● Limited graph query capabilities: Graphs serving retrieval-augmented generation typically support neighbor lookup and short-path queries, but do not support arbitrary-depth graph traversal, pattern matching, or aggregation and analysis.&lt;/p&gt;

&lt;p&gt;● IDE-local, not team-global: The index is attached to a developer's IDE instance. Structural insights analyzed by one person cannot be directly reused by another.&lt;/p&gt;

&lt;p&gt;● Lack of a standardized time dimension: Augment's Context Lineage has started incorporating commit history, but build logs, deployment logs, test logs, and event logs, which together form the complete temporal memory, are not yet in the graph.&lt;/p&gt;

&lt;p&gt;Paradigm 4: CodeWiki / LLM Document (DeepWiki School)&lt;br&gt;
DeepWiki (GitHub 15.7k stars, produced by the team behind Cognition AI / Devin) represents another approach: Code Repository → LLM → polished Wiki document. Simply replace github.com in the URL with deepwiki.com to see the automatically generated architecture diagrams, module documents, and function annotations.&lt;/p&gt;

&lt;p&gt;This provides an excellent experience for developers to quickly understand unfamiliar projects. DeepWiki also supports controlling the generation scope through the .devin/wiki.json configuration file, and provides tool interfaces such as ask_question, read_wiki_structure, and read_wiki_contents via the MCP Server.&lt;/p&gt;

&lt;p&gt;But documents are essentially linear narratives optimized for human reading:&lt;/p&gt;

&lt;p&gt;● Hard to verify: Descriptions generated by LLMs may hallucinate, and in code understanding, an incorrect "A invokes B" is more dangerous than no information at all.&lt;/p&gt;

&lt;p&gt;● Hard to traverse: Documents cannot answer graph traversal queries such as "list all functions that invoke X."&lt;/p&gt;

&lt;p&gt;● Difficult to infer: Multi-hop analysis is not supported: if A is changed, following the calls relationship for 3 hops, which entry points are affected?&lt;/p&gt;

&lt;p&gt;● Difficult to maintain: Changing a single line of code requires full regeneration. Although DeepWiki supports badge-triggered auto-refresh, each refresh triggers a full LLM regeneration, resulting in high cost and latency.&lt;/p&gt;

&lt;p&gt;● Not programmable: The MCP interface essentially asks a document a question, rather than executing a query on the graph.&lt;/p&gt;

&lt;p&gt;The relationship between CodeWiki and CodeIndex is similar to the relationship between materialized views and query engines in the database realm: documents are precomputed views that answer preset questions quickly, but cannot answer ad-hoc queries outside the view.&lt;/p&gt;

&lt;p&gt;Paradigm 5: Code Knowledge Graph (Our Choice)&lt;br&gt;
The five paradigms can be arranged along a single axis: from "stateless search" to "stateful inference".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0k9vq1u4wspamud8bz9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0k9vq1u4wspamud8bz9.png" alt=" " width="789" height="488"&gt;&lt;/a&gt;&lt;br&gt;
If Agentic Search is a fresh on-site survey every time, CodeIndex is surveying with a high-definition map, Code Graph + retrieval-augmented generation is a map annotated with highways and railways, and CodeWiki is a commissioned local chronicle, then what we want to build is a living GIS: you can query the path between any two points, overlay real-time traffic data, annotate the traffic history of each road, keep it updated as the terrain changes, and run spatial analysis along any dimension.&lt;/p&gt;

&lt;p&gt;The key difference is not better search, but a systematic combination of three dimensions:&lt;/p&gt;

&lt;p&gt;1. Deterministic vs. probabilistic: CodeIndex gives you the snippets most likely to be relevant (vector similarity). Code Graph gives you structural relationships parsed from the AST, but its query capability is limited by the retrieval-augmented generation framework. We combine deterministic AST extraction with arbitrary SPL/graph-match queries: relationships with a confidence level of 1.0 plus a Turing-complete query language.&lt;/p&gt;

&lt;p&gt;2. Code domain vs. cross-domain: From Agentic Search to Code Graph + retrieval-augmented generation, every solution stops at the code domain. "Which functions does this module invoke?" is answerable. "How many alerts did the production service behind this module raise last week?" is not. UModel's EntitySetLink can connect code.module to ops.service, event.alert, and req.issue, so the agent reasons along the links without leaving the graph.&lt;/p&gt;

&lt;p&gt;3. Snapshot vs. timeline: CodeIndex is a snapshot index of the current code. Code Graph is starting to incorporate commit history. We provide a complete time dimension: commit_log, build_log, deploy_log, test_log, and incident_log. Each LogSet is associated with an EntitySet through DataLink. The agent knows not only what the current structure is, but also how it evolved to this point and how it performs in production.&lt;/p&gt;
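
&lt;p&gt;The cross-domain point can be illustrated with a toy Python sketch in which code, service, and alert entities share one set of typed links. The relation names (runs_as, raised), the entity identifiers, and the follow helper are all hypothetical; this is not the UModel API or its query language, only the shape of the traversal.&lt;/p&gt;

```python
# Typed edges: (source entity, relation, destination entity).
# Entity IDs are written as "domain.type:name"; all values are made up.
LINKS = [
    ("code.module:pkg/a2a", "runs_as", "ops.service:a2a-gateway"),
    ("ops.service:a2a-gateway", "raised", "event.alert:alert-101"),
    ("ops.service:a2a-gateway", "raised", "event.alert:alert-102"),
    ("code.module:pkg/log", "runs_as", "ops.service:log-agent"),
]

def follow(start: str, relations: list[str]) -> list[str]:
    """Walk one relation per step, starting from a single entity."""
    frontier = [start]
    for relation in relations:
        frontier = [
            dst for src, rel, dst in LINKS
            if rel == relation and src in frontier
        ]
    return frontier

# From a code module to the alerts raised by the service that runs it.
alerts = follow("code.module:pkg/a2a", ["runs_as", "raised"])
print(alerts)
```

&lt;p&gt;Because the module, the service, and the alerts live in the same graph, the question is a two-step walk instead of a manual lookup across separate systems.&lt;/p&gt;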

&lt;p&gt;From Personal Wiki to Code Wiki: One Paradigm, Different Certainty&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua8ekaipixgm8avpr4o1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua8ekaipixgm8avpr4o1.png" alt=" " width="800" height="707"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The personal Wiki flow is: source data → LLM extracts entities and relationships → snap and normalization → UModel structure layer → Wiki pages. The entire extraction procedure depends entirely on the LLM, so each relationship is inherently uncertain: Are Zhang Cheng and Yuan Yi the same person? Is this article related to that project? Both require LLM judgment and correction by the snap layer.&lt;/p&gt;

&lt;p&gt;There is one fundamental difference in the code realm: the structural relationships of code are deterministic.&lt;/p&gt;

&lt;p&gt;import pkg/a2a imports pkg/a2a, and func (s *Server) HandleRequest() is a method on the Server type. Neither fact requires LLM inference: AST parsing determines them with a confidence level of 1.0.&lt;/p&gt;

&lt;p&gt;This means that a code wiki can add a deterministic guarantee at the model layer on top of the personal wiki paradigm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Personal Wiki:   Source material → [LLM extraction] → Snap → UModel → Wiki Page
                                   ↑ Entirely dependent on LLM, confidence level 0.4–0.9

Code Wiki:       Code Repository → [AST deterministic extraction] + [LLM semantic enhancement] → UModel → CLI query
                                   ↑ Structural relationships determined (1.0)   ↑ Summary/attribution supplement (0.6–0.9)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This layer of determinism is critical to the agent's reasoning: when the agent performs root cause analysis (RCA), it needs to trust every hop on the invocation chain. If a calls relationship is guessed by the LLM, the entire reasoning chain becomes unreliable. Relationships extracted from the AST are deterministic facts that the agent can trust unconditionally.&lt;/p&gt;

&lt;p&gt;At the same time, the code wiki retains the LLM enhancement capabilities of the personal wiki: semantic-layer information such as module summaries, document-code associations, and component attributions is still generated by the LLM and annotated as INFERRED, so the agent can accept it selectively.&lt;/p&gt;
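
&lt;p&gt;A minimal sketch of the split between extracted and inferred relationships, using Python's standard ast module. Python source stands in for the Go snippet in the text so that the example can run without extra tooling; the link dictionary layout, the make_link helper, and the sample INFERRED link are assumptions, and only the relation names and the EXTRACTED / INFERRED labels come from this article.&lt;/p&gt;

```python
import ast

# Python source standing in for the Go snippet in the text (an assumption made
# so the example can run with the standard library's ast module).
SOURCE = '''
import pkg_a2a

class Server:
    def handle_request(self):
        return pkg_a2a.dispatch()
'''

def make_link(src, relation, dst, confidence, method):
    return {"src": src, "relation": relation, "dst": dst,
            "confidence": confidence, "extraction_method": method}

def extract_links(source, module):
    """Derive imports/contains links purely from the syntax tree."""
    links = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                links.append(make_link(module, "imports", alias.name, 1.0, "EXTRACTED"))
        elif isinstance(node, ast.ClassDef):
            links.append(make_link(module, "contains", node.name, 1.0, "EXTRACTED"))
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    links.append(make_link(node.name, "contains", item.name, 1.0, "EXTRACTED"))
    return links

links = extract_links(SOURCE, "server")
# A made-up LLM-generated annotation, kept apart by its extraction_method.
links.append(make_link("docs/server.md", "describes", "Server", 0.7, "INFERRED"))

# An agent doing RCA can restrict itself to facts parsed from the code.
trusted = [lk for lk in links if lk["extraction_method"] == "EXTRACTED"]
for lk in trusted:
    print(lk["src"], lk["relation"], lk["dst"])
```

&lt;p&gt;Keeping the method and confidence on every link lets the agent decide per query whether inferred relationships are acceptable.&lt;/p&gt;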

&lt;p&gt;Entity + Log + Link: Not Just a Structure Graph&lt;br&gt;
The core design of UModel in the observability realm is to describe the IT world with a graph composed of sets and links: EntitySet describes the current state of entities, LogSet describes time-series events, MetricSet describes metrics, and Link connects them into a network.&lt;/p&gt;

&lt;p&gt;When we apply the same modeling methodology to the code realm, we get more than just a structure graph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxcn1ncpk0skz1592aa3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxcn1ncpk0skz1592aa3.png" alt=" " width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Entity: Current Code Structure&lt;br&gt;
Five types of EntitySets describe the current state of the code and support the coexistence of multiple repositories through repo_id composite primary keys:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futd48161x0yn31cepke8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futd48161x0yn31cepke8.png" alt=" " width="789" height="224"&gt;&lt;/a&gt;&lt;br&gt;
repo_id participates in the primary key calculation (Entity ID = md5(repo_id:pk_value)), so that modules with the same name in different repositories do not conflict, and a single graph can accommodate multiple projects simultaneously.&lt;/p&gt;
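
&lt;p&gt;The ID scheme above can be sketched in Python as follows. The formula md5(repo_id:pk_value) is from the text; the UTF-8 encoding, the hex output, and the sample repository and module names are assumptions made for illustration.&lt;/p&gt;

```python
import hashlib

def entity_id(repo_id: str, pk_value: str) -> str:
    """Entity ID = md5(repo_id:pk_value), rendered as a hex string."""
    return hashlib.md5(f"{repo_id}:{pk_value}".encode("utf-8")).hexdigest()

# The same module path in two hypothetical repositories.
id_a = entity_id("repo-a", "pkg/a2a")
id_b = entity_id("repo-b", "pkg/a2a")

print(id_a != id_b)                             # distinct entities per repo
print(id_a == entity_id("repo-a", "pkg/a2a"))   # stable across runs
```

&lt;p&gt;Because the ID is derived from stable inputs, re-indexing the same repository updates existing entities instead of creating duplicates.&lt;/p&gt;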

&lt;p&gt;Six types of EntitySetLink describe structural relationships: contains, imports, calls, extends, describes, and belongs_to. Each relationship is annotated with &lt;strong&gt;confidence&lt;/strong&gt; and &lt;strong&gt;extraction_method&lt;/strong&gt; (EXTRACTED / INFERRED / AMBIGUOUS).&lt;/p&gt;

&lt;p&gt;Log: The Change History of Code&lt;br&gt;
This is the key dividing line between Code-WIKI and pure graph tools.&lt;/p&gt;

&lt;p&gt;In the observability realm, we look at not only the current status of a pod (Entity), but also its logs and metric trends. Code is the same: looking only at the structure without the history is like looking at a single screenshot.&lt;/p&gt;

&lt;p&gt;Logs in the code realm go far beyond Git commits:&lt;/p&gt;

&lt;p&gt;The value of logs lies in querying them together with entities:&lt;/p&gt;

&lt;p&gt;● Who modified this module in the last week? → commit_log WHERE module_path = X AND time &amp;gt; now()-7d&lt;/p&gt;

&lt;p&gt;● Have any new incidents occurred since the last deployment? → deploy_log JOIN incident_log ON time_window&lt;/p&gt;

&lt;p&gt;● Has the build time increased after introducing this dependency? → build_log GROUP BY week, cross-referenced with the dependency change time in commit_log&lt;/p&gt;

&lt;p&gt;Each LogSet is associated with the corresponding EntitySet through DataLink. The agent can navigate from an entity to a log, or trace back from a log to an entity.&lt;/p&gt;

&lt;p&gt;Cross-Domain Association: Code Is Not an Island&lt;br&gt;
Code never exists in isolation. It serves requirements, reaches production through CI/CD, generates observable data at runtime, and is traced back to for troubleshooting when issues arise. In the current toolchain, each stage is an island: requirements are in Jira, code is in Git, builds are in Jenkins, services run in K8s, and alerts are in the monitoring system.&lt;/p&gt;

&lt;p&gt;When a production alert fires, how many systems must you jump through and how many pieces of info must you manually correlate to trace from the alert back to the code change?&lt;/p&gt;

&lt;p&gt;The value of UModel is that all these entities can live in the same graph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9nca7oqh8b8ako28411.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9nca7oqh8b8ako28411.png" alt=" " width="800" height="718"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Technical Architecture: Dual-Track Fetch + Graph Build&lt;br&gt;
Overall Pipeline&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79bo2val5wqhk61sw304.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79bo2val5wqhk61sw304.png" alt=" " width="800" height="371"&gt;&lt;/a&gt;&lt;br&gt;
DETECT: Incremental Change Detection&lt;br&gt;
A SHA256 content fingerprint is computed for each file and compared against the cache from the last build. For vibeops-agents (~2,375 Go files), an incremental build typically processes only dozens of changed files, reducing the time from minutes to seconds.&lt;/p&gt;
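&lt;p&gt;The DETECT step can be sketched as follows. This is an illustrative version under simple assumptions: file contents are passed in as an in-memory dict, the cache is a plain path-to-hash mapping, and deleted files are not handled. The function names are hypothetical and are not part of code-wiki.&lt;/p&gt;

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """SHA256 content fingerprint of one file."""
    return hashlib.sha256(content).hexdigest()


def detect_changes(files: dict, cache: dict) -> tuple:
    """Return (sorted paths whose fingerprint differs from the cache, new cache)."""
    new_cache = {path: fingerprint(data) for path, data in files.items()}
    changed = sorted(p for p, h in new_cache.items() if cache.get(p) != h)
    return changed, new_cache


# First build: the cache is empty, so every file counts as changed.
first = {"a.go": b"package a", "b.go": b"package b"}
changed, cache = detect_changes(first, {})

# Second build: only b.go was edited, so only b.go needs re-extraction.
second = {"a.go": b"package a", "b.go": b"package b // edited"}
changed2, cache = detect_changes(second, cache)
print(changed2)
```

&lt;p&gt;Only the paths returned as changed need to go through EXTRACT again, which is where the reduction from minutes to seconds comes from.&lt;/p&gt;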

&lt;p&gt;EXTRACT: AST + LLM Dual Track&lt;br&gt;
AST track (tree-sitter): A GLR-based incremental parser that supports 40+ languages. It uses tags.scm rules to extract definitions, references, structural relationships, import relationships, call relationships, and inheritance relationships uniformly across languages. All extraction results have a confidence level of 1.0.&lt;/p&gt;

&lt;p&gt;Notably, CodeIndex solutions such as Cursor also use tree-sitter. However, they use tree-sitter for semantic text segmentation (splitting code into chunks suitable for embedding), whereas we use tree-sitter for structure extraction (extracting deterministic relationships such as definitions, references, calls, and inheritance). The same parser serves completely different goals: the former produces vectors, and the latter produces a graph.&lt;/p&gt;

&lt;p&gt;LLM track: Module summaries (context segments to inject into the agent, not human-readable documents), document-code associations, and component attribution. Each is annotated with &lt;strong&gt;extraction_method&lt;/strong&gt;: INFERRED plus a confidence level. Agents can choose a trust threshold per scenario: RCA prefers high confidence, while exploration scenarios can relax it.&lt;/p&gt;
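&lt;p&gt;A minimal sketch of per-scenario trust thresholds is shown below. The Edge class, the sample edges, and the threshold values 0.9 and 0.5 are hypothetical; the article does not state concrete thresholds.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    relation: str
    extraction_method: str  # EXTRACTED / INFERRED / AMBIGUOUS
    confidence: float


def trusted(edges, min_confidence):
    """Keep only the edges whose confidence meets the scenario threshold."""
    return [e for e in edges if e.confidence >= min_confidence]


edges = [
    Edge("pkg/server", "pkg/a2a", "imports", "EXTRACTED", 1.0),
    Edge("docs/a2a.md", "pkg/a2a", "describes", "INFERRED", 0.7),
    Edge("pkg/a2a", "a2a-protocol", "belongs_to", "INFERRED", 0.9),
]

rca_view = trusted(edges, 0.9)      # strict: hypothetical RCA threshold
explore_view = trusted(edges, 0.5)  # relaxed: hypothetical exploration threshold
print(len(rca_view), len(explore_view))
```

&lt;p&gt;AST-extracted edges always pass because their confidence is 1.0; the threshold only decides how many LLM-inferred edges the agent is willing to rely on.&lt;/p&gt;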

&lt;p&gt;RESOLVE: Cross-file Symbol Resolution&lt;br&gt;
A single-file AST cannot resolve cross-file references. RESOLVE handles the following:&lt;/p&gt;

&lt;p&gt;● Go import github.com/org/repo/pkg/a2a → module_path pkg/a2a&lt;/p&gt;

&lt;p&gt;● Method receiver type (s *Server) → attributed to code.type pkg/server.Server&lt;/p&gt;

&lt;p&gt;● Call s.HandleRequest() → pkg/server.Server.HandleRequest&lt;/p&gt;

&lt;p&gt;● Interface implementation (type Adapter struct implements Handler) → extends relationship&lt;/p&gt;

&lt;p&gt;Resolution is deterministic and does not depend on an LLM.&lt;/p&gt;
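&lt;p&gt;The first rule in the list above (Go import path to module_path) can be sketched as plain string handling. The function name and the idea of passing the repository's Go module prefix as a parameter are assumptions for illustration, not the actual RESOLVE implementation.&lt;/p&gt;

```python
def module_path(import_path, module_prefix):
    """Map a Go import path to a repo-relative module_path.

    Returns None when the import does not belong to this repository.
    """
    prefix = module_prefix.rstrip("/") + "/"
    if import_path.startswith(prefix):
        return import_path[len(prefix):]
    return None  # external dependency, not a node of this repository


print(module_path("github.com/org/repo/pkg/a2a", "github.com/org/repo"))
print(module_path("golang.org/x/sync/errgroup", "github.com/org/repo"))
```

&lt;p&gt;Since the mapping is a pure function of the import string and the module prefix, the same input always yields the same edge.&lt;/p&gt;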

&lt;p&gt;BUILD: Graph Assembly + Architecture Discovery&lt;br&gt;
Architecture discovery is not simple community detection: Louvain/Leiden discovers clusters, not architectures. Complete flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Step 1: Graph construction  
  Modules as nodes, imports + calls + extends as directed edges  
  Edge weight: calls &amp;gt; imports &amp;gt; extends  

Step 2: Hierarchical analysis  
  Compute dependency directionality: A→B and B↛A → A is above B  
  Detect top-level entries with indegree = 0 and underlying infrastructure with outdegree = 0  

Step 3: Community detection  
  Leiden algorithm discovers functional clusters on directed graphs  
  Resolution parameter controls granularity (~150 modules → ~15 components)  

Step 4: Annotation and naming  
  Annotate hierarchy based on dependency direction: API/Gateway, Service/Business, Infrastructure/Utility  
  LLM naming and description, cross-validation with project documents 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is a hierarchical, directional, named architecture view. The agent can use this to determine whether an invocation crosses architecture layers.&lt;/p&gt;
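&lt;p&gt;Step 2 of the flow (hierarchical analysis) can be illustrated with a small sketch. It assumes the graph is given as a list of (source, target) dependency pairs and only looks at direct edges; the sample module names are taken from the examples in this article, but the edges between them are made up for the demonstration.&lt;/p&gt;

```python
from collections import defaultdict


def classify(edges):
    """Return (top-level entries, underlying infrastructure) for (src, dst) pairs."""
    indegree, outdegree = defaultdict(int), defaultdict(int)
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        outdegree[src] += 1
        indegree[dst] += 1
    entries = sorted(n for n in nodes if indegree[n] == 0)
    infrastructure = sorted(n for n in nodes if outdegree[n] == 0)
    return entries, infrastructure


def is_above(edges, a, b):
    """A is above B when A depends on B and B does not depend on A."""
    pairs = set(edges)
    return (a, b) in pairs and (b, a) not in pairs


# Hypothetical dependency edges for illustration only.
deps = [
    ("cmd/vibeops-agents", "pkg/server"),
    ("pkg/server", "pkg/a2a"),
    ("pkg/a2a", "pkg/config"),
    ("pkg/server", "pkg/config"),
]
entries, infrastructure = classify(deps)
print(entries, infrastructure)
```

&lt;p&gt;An edge whose source sits below its target in this ordering is a candidate architecture violation, which is the kind of finding the arch check reports.&lt;/p&gt;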

&lt;p&gt;SYNC: Synchronize to UModel&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Entity write: starops umodel post-logs → __entity logstore  
Topo write:  starops umodel post-logs → __topo logstore  
Schema synchronization: starops umodel sync (register EntitySet/Link definitions) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The UModel backend is built on the Simple Log Service (SLS) storage engine and inherits its capabilities, including high-throughput writes, queries that return within seconds, graph traversal with graph-match, SQL aggregation, and full-text indexing.&lt;/p&gt;

&lt;p&gt;SERVE: Engineering Details of the Query&lt;br&gt;
Key patterns explored in practice:&lt;/p&gt;

&lt;p&gt;Two-step query: graph-match returns entity_id without business fields. All graph traversal queries first traverse the topology to obtain the ID set, then pull business fields in batches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Step 1: .topo | graph-match (n1:code@code.module {__entity_id__: '&lt;span class="nt"&gt;&amp;lt;id&amp;gt;&lt;/span&gt;'})  
              -[e]-&amp;gt;(n2) project n1, e, n2  

Step 2: .entity with(domain='code', name='code.module', ids=['id1','id2',...])  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aggregation via direct Simple Log Service (SLS) query: Statistical queries such as hot spot analysis directly run SQL against the __topo Logstore:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT __dest_entity_id__, count(1) AS import_count
FROM log WHERE __relation_type__ = 'imports'
GROUP BY __dest_entity_id__
ORDER BY import_count DESC LIMIT 20&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At the current multi-repository scale (~11,000 entities, ~19,000 edges, including the vibeops-agents and starops-cli projects), the end-to-end latency of a single query is in the hundreds of milliseconds.&lt;/p&gt;

&lt;p&gt;Agent Interaction Layer: Command-Line Interface (CLI) + Skill&lt;br&gt;
CLI Design&lt;br&gt;
The agent's reasoning is progressive: search first, look at the results, and then decide the next step. The CLI's search → context → impact flow naturally matches this pattern and supports batch execution and pipe composition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;code-wiki query &lt;span class="nt"&gt;&amp;lt;subcommand&amp;gt;&lt;/span&gt;     # graph query  
  ├── search &lt;span class="nt"&gt;&amp;lt;keyword&amp;gt;&lt;/span&gt;       # entity search  
  ├── context &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;         # full context of a symbol  
  ├── impact &lt;span class="nt"&gt;&amp;lt;path&amp;gt;&lt;/span&gt;          # change impact analysis  
  ├── callers / callees      # invocation chain  
  ├── deps / rdeps           # dependencies / reverse dependencies  

code-wiki check &lt;span class="nt"&gt;&amp;lt;subcommand&amp;gt;&lt;/span&gt;     # governance check  
  ├── arch                   # architecture violation scan  
  └── hotspots               # coupling hot spots  

code-wiki ingest             # build/update graph  
code-wiki status             # health check  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Subcommands are organized by agent intent. The agent does not need to know whether the underlying implementation is graph-match or Simple Log Service SQL: use impact to view the impact scope.&lt;/p&gt;

&lt;p&gt;Output Format: Optimized for the Agent Context Window&lt;br&gt;
The default --format brief output is optimized for the agent's token budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;$ code-wiki query context pkg/a2a  

Module: pkg/a2a  
  LOC: 1,247 | Language: Go | Component: a2a-protocol  
  Summary: A2A protocol implementation for agent-to-agent communication  

Types (17): TaskStore(struct), A2AServer(struct), AgentCard(struct), ...  
Functions (52): HandleA2ARequest[entry], StartA2AServer[entry], ...  
Reverse dependencies (9): pkg/api/handler, pkg/server, cmd/vibeops-agents, ...  
Component crossings: → api, → scheduler 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of query context is under 500 tokens. Use --format json when full data is required.&lt;/p&gt;

&lt;p&gt;Skill: Scenario-based User Guide&lt;br&gt;
The Agent Skills that accompany the CLI are organized by scenario, so agents do not need to learn Structured Process Language syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;## RCA: From alerting to code  
code-wiki query search &lt;span class="nt"&gt;&amp;lt;keyword&amp;gt;&lt;/span&gt;       # Locate module  
code-wiki query context &lt;span class="nt"&gt;&amp;lt;module&amp;gt;&lt;/span&gt;      # Understand structure  
code-wiki query callers &lt;span class="nt"&gt;&amp;lt;function&amp;gt;&lt;/span&gt;    # Trace invocation chain  

## Development: Evaluate impact before changing code  
code-wiki query impact &lt;span class="nt"&gt;&amp;lt;module&amp;gt;&lt;/span&gt;       # Impact scope  
code-wiki query deps / rdeps          # Dependencies  

## Governance: Regular checks  
code-wiki check arch                  # Architecture violations  
code-wiki check hotspots              # Hot spot analysis  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Case Study&lt;br&gt;
Case 1: Agent Independently Completes Change Impact Assessment&lt;br&gt;
We assign a job to a sub-agent:&lt;/p&gt;

&lt;p&gt;Assess the impact scope of modifying the pkg/a2a module in the vibeops-agents project, including affected modules and entry points, upstream and downstream dependencies, and architecture risks.&lt;/p&gt;

&lt;p&gt;The sub-agent completed the following reasoning using only the code-wiki CLI and Skill, without reading any source code files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;$ code-wiki query search a2a  
Found 9 entities matching a2a:  
  [module] pkg/a2a                    LOC:1247  A2A protocol core  
  [module] pkg/a2a/adapter            LOC:834   A2A adapter implementations  
  [module] pkg/a2a/taskstore          LOC:567   A2A task state management  
  ...  

$ code-wiki query context pkg/a2a  
Module: pkg/a2a  
  Types (17): TaskStore, A2AServer, AgentCard, ...  
  Functions (52): HandleA2ARequest[entry], StartA2AServer[entry], ...  
  Reverse dependencies (9): pkg/api/handler, pkg/server, cmd/vibeops-agents, ...  

$ code-wiki query impact pkg/a2a  
Impact of pkg/a2a:  
  Direct dependents (9): pkg/api/handler, pkg/server, cmd/vibeops-agents, ...  
  Entry points affected: HandleA2ARequest, StartA2AServer, main  
  Component crossings: a2a-protocol → api, a2a-protocol → server  

$ code-wiki check hotspots  
Top hotspots by import count:  
  1. pkg/a2a/adapter    (48 imports)  ← High coupling warning  
  2. pkg/util/logger     (35 imports)  
  3. pkg/config          (32 imports) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agent final output:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;=== A2A Protocol Modification Impact Assessment ===  
  
Scope of Impact:  
- Directly dependent modules: 9 (pkg/api/handler, pkg/server, etc.)  
- Entry points affected: 3 (HandleA2ARequest, StartA2AServer, main)  
- Component boundaries crossed: 2 (a2a-protocol → api, a2a-protocol → server)  
  
Risk points:  
- pkg/a2a/adapter is imported by 48 modules, the highest coupling degree in the entire project  
- bus factor = 1, consider adding a reviewer  
  
Suggested execution order:  
1. Modify the pkg/a2a core protocol definition first  
2. Synchronize the interface implementation of pkg/a2a/adapter  
3. Modify the invocation method of pkg/api/handler  
4. Update the initialization logic of pkg/server  
5. Regression testing on the cmd/vibeops-agents entry point&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;5 commands in total, each taking 1–3 seconds, with a total time of less than 15 seconds. The Agent did not read any source code files and completed a full impact assessment solely through graph queries.&lt;/p&gt;

&lt;p&gt;Case 2: RCA: From Alerting to Code Root Cause&lt;br&gt;
Production alert: service-vibeops error_rate &amp;gt; 5%&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# 1. Locate the code module from the O&amp;amp;M entity  
$ code-wiki query context pkg/server  
Module: pkg/server  
  Functions (23): StartServer[entry], handleRequest, applyMiddleware, ...  
  Dependencies (12): pkg/a2a, pkg/config, pkg/auth, ...  
  
# 2. Trace the invocation chain to locate the potentially faulty downstream  
$ code-wiki query callees pkg/server.handleRequest  
Callees of handleRequest:  
  pkg/auth.ValidateToken          [component: auth]  
  pkg/a2a.HandleA2ARequest        [component: a2a-protocol]  
  pkg/scheduler.DispatchTask      [component: scheduler]  
  
# 3. Check commit_log and find that the a2a module was changed 2 hours ago  
#    author=xxx, message=refactor adapter interface  
  
# 4. Confirm the impact of the change  
$ code-wiki query impact pkg/a2a  
Impact of pkg/a2a:  
  Direct dependents (9): pkg/api/handler, pkg/server, ...  
  Entry points affected: HandleA2ARequest, StartA2AServer  
  
# → Root cause: The a2a interface refactoring affected the server invocation chain. Check interface compatibility.&lt;/code&gt;&lt;/pre&gt; 

&lt;p&gt;Case 3: Architecture Governance: Detecting Architecture Decay&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# 1. Scan for architecture violations  
$ code-wiki check arch  
Architecture violations:  
  pkg/util/logger calls pkg/api/handler.GetRequestID  
    [utility → api] The utility layer should not invoke the api layer  
  pkg/config calls pkg/scheduler.GetDefaultConfig  
    [infra → service] The infrastructure layer should not depend on the business layer  
  
# 2. Identify coupling hot spots  
$ code-wiki check hotspots  
Top hotspots:  
  1. pkg/a2a/adapter      48 imports  [HIGH]  
  2. pkg/util/logger       35 imports  [NORMAL]  
  3. pkg/scheduler/queue   28 imports  [MEDIUM]  
  
# 3. Analyze the highly coupled module in depth  
$ code-wiki query rdeps pkg/a2a/adapter  
Reverse dependencies (48):  
  pkg/api/* (12 modules), pkg/server/* (8 modules), pkg/scheduler/* (6 modules), ...  
  
# Agent suggests splitting into adapter/protocol, adapter/transform, and adapter/routing&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Outlook&lt;br&gt;
Comprehensive Quantitative Evaluation&lt;br&gt;
We plan to build a standardized code comprehension benchmark covering core scenarios such as impact analysis, call chain tracing, architecture violation detection, and root cause localization. On real codebases of varying scales, we will compare the performance of three paradigms — Model + Bash (Agentic Search), Model + CodeWiki (LLM document), and Model + UModel (knowledge graph) — across dimensions including accuracy, recall, number of inference steps, and token consumption.&lt;/p&gt;

&lt;p&gt;We will use SWE-bench-style quantitative evaluation to make the capability boundaries of each paradigm measurable and reproducible, and then use the benchmark scores to optimize the overall technical architecture, including iterative upgrades to the related Skills and the CLI.&lt;/p&gt;

&lt;p&gt;Agent Self-Maintenance&lt;br&gt;
Agents are not just consumers of the graph; they can also maintain it:&lt;/p&gt;

&lt;p&gt;● After the code schema evolves, the associated LLM-inferred relationships are marked for reevaluation&lt;/p&gt;

&lt;p&gt;● Regularly inspect orphaned entities, missing relationships, and expired data&lt;/p&gt;

&lt;p&gt;● On top of the above capabilities, a verification and quality assessment system is also needed to make self-maintenance controllable.&lt;/p&gt;

&lt;p&gt;Architecture Guard Gate&lt;br&gt;
Integrated into the CI flow and run automatically on every PR:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;code-wiki ingest --incremental        # Incremental graph update  
code-wiki check arch                  # Architecture violation check  
code-wiki query impact &amp;lt;changed_files&amp;gt; # Change impact analysis &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;From Observable to Understandable&lt;br&gt;
From modeling observability data to modeling code knowledge, and from describing running systems with Entity + Log to describing code systems with Entity + Log, UModel is evolving from observing IT systems to understanding the code and processes that build them.&lt;/p&gt;

&lt;p&gt;When agents truly understand the structure, history, and production performance of code simultaneously, genuinely AI-native software engineering becomes possible.&lt;/p&gt;

</description>
      <category>umodel</category>
      <category>observable</category>
    </item>
    <item>
      <title>Build Alibaba Cloud API Gateway Monitoring with Realtime Compute for Apache Flink and SLS</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Mon, 11 May 2026 03:23:05 +0000</pubDate>
      <link>https://dev.to/observabilityguy/build-alibaba-cloud-api-gateway-monitoring-with-realtime-compute-for-apache-flink-and-sls-2mkp</link>
      <guid>https://dev.to/observabilityguy/build-alibaba-cloud-api-gateway-monitoring-with-realtime-compute-for-apache-flink-and-sls-2mkp</guid>
      <description>&lt;p&gt;This article introduces how to build a real-time, scalable API gateway monitoring system for Alibaba Cloud Open Platform using Realtime Compute for Apache Flink and SLS.&lt;/p&gt;

&lt;p&gt;By Pan Weilong (Alibaba Cloud Observability), Ruan Xiaozhen (Alibaba Cloud Open Platform)&lt;/p&gt;

&lt;p&gt;Background and Challenges&lt;br&gt;
Background&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvgjiasjpbfmnmmfbk77.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvgjiasjpbfmnmmfbk77.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
Alibaba Cloud Open Platform is the standard entry point for developers to manage cloud resources. The Open Platform hosts the external APIs of almost all cloud products, and allows for automated O&amp;amp;M and cloud resource management. As enterprise dependency on automation deepens, the stability of the Open Platform becomes crucial.&lt;/p&gt;

&lt;p&gt;The stakeholders of the monitoring system include:&lt;/p&gt;

&lt;p&gt;● Open Platform's O&amp;amp;M team: Responsible for the overall availability of the API gateway, requiring centralized monitoring and alerting capabilities.&lt;/p&gt;

&lt;p&gt;● Cloud product teams (such as ECS, RDS, and SLB): Need to view the API call metrics and dashboards of their own products, and configure fine-grained alerting.&lt;/p&gt;

&lt;p&gt;● SRE teams: Need to quickly locate faults and perform root cause analysis.&lt;/p&gt;

&lt;p&gt;Fluctuations in any API may impact the production business of customers. Therefore, a comprehensive metric monitoring system must be established, accompanied by timely alerting capabilities to ensure high availability.&lt;/p&gt;

&lt;p&gt;Challenges&lt;br&gt;
The primary data source for the monitoring system is the access logs of the API gateway. These logs are generated by gateway nodes distributed across various regions. The system faces the following challenges:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq91f143lg2c0ccahplkr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq91f143lg2c0ccahplkr.png" alt=" " width="800" height="788"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpoma89kncdhv2d4jlkbr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpoma89kncdhv2d4jlkbr.png" alt=" " width="789" height="380"&gt;&lt;/a&gt;&lt;br&gt;
Solution&lt;br&gt;
To address these challenges, we adopt the cloud-native combination of Realtime Compute for Apache Flink and SLS to build a real-time monitoring system.&lt;/p&gt;

&lt;p&gt;Components&lt;br&gt;
The core components of this solution and the adoption rationale are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlb5xvr2nj5xq005g8bg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlb5xvr2nj5xq005g8bg.png" alt=" " width="789" height="331"&gt;&lt;/a&gt;&lt;br&gt;
The advantages of this solution are:&lt;/p&gt;

&lt;p&gt;● Fully managed: SLS and Realtime Compute for Apache Flink are both fully managed services, eliminating the need to manage infrastructure.&lt;/p&gt;

&lt;p&gt;● Scalability: Consumption throughput and compute resources can be scaled on demand.&lt;/p&gt;

&lt;p&gt;● End-to-end guarantee: End-to-end observability, from collection to alerting.&lt;/p&gt;

&lt;p&gt;Architecture&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbesxjec8d3n5a1plejj0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbesxjec8d3n5a1plejj0.png" alt=" " width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The entire data processing pipeline adopts a regional deployment and centralized aggregation design. Log collection and aggregation are completed within each region to reduce latency. Processed metric data is aggregated cross-region to a single MetricStore for centralized monitoring.&lt;/p&gt;

&lt;p&gt;Intra-region Processing&lt;br&gt;
An independent data processing pipeline is deployed in each region to reduce latency:&lt;/p&gt;

&lt;p&gt;1. Data collection: Logtail collects the gateway node logs in real time. Logtail is a high-performance, proprietary log collector from Alibaba Cloud. It offers millisecond-level latency and a throughput of millions of events per second (EPS), ensuring the reliable transmission of massive log volumes.&lt;/p&gt;

&lt;p&gt;2. Log storage: The SLS Logstore stores the raw API access logs in the region. It supports real-time query and analysis of request details, and serves as the data source for Flink stream processing.&lt;/p&gt;

&lt;p&gt;3. Regional aggregation: Flink Job 1 is independently deployed in each region. It joins the logs with MySQL dimension tables (which store metadata such as the cluster information of gateway nodes and API business domains like ECS) to aggregate business metrics. This significantly reduces the volume of data sent across regions.&lt;/p&gt;

&lt;p&gt;Cross-region Aggregation&lt;br&gt;
Local aggregation results are sent to a single MetricStore:&lt;/p&gt;

&lt;p&gt;4. Cross-region aggregation: Flink Job 2 (metric transform) is independently deployed in each region. It adds timestamp information to the aggregation results and writes them to the centralized SLS MetricStore, which allows the O&amp;amp;M team to view the metrics of all regions in one place.&lt;/p&gt;

&lt;p&gt;5. Visualization and alerting: Grafana connects to the centralized SLS MetricStore to query multi-dimensional metrics with standard Prometheus Query Language (PromQL) and to alert on abnormal metrics.&lt;/p&gt;

&lt;p&gt;Layered Design&lt;br&gt;
The layered design effectively balances data freshness and resource efficiency:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41b4a1z6hf9mejaj07vi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41b4a1z6hf9mejaj07vi.png" alt=" " width="789" height="329"&gt;&lt;/a&gt;&lt;br&gt;
Why not one-layer aggregation?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Avoid data skew: The API traffic distribution is extremely uneven, and the QPS of certain products (such as ECS) is thousands of times that of other products. Grouping data by product will cause data skew and state bloat in specific Flink tasks.&lt;/li&gt;
&lt;li&gt;Improve resource efficiency: Regional aggregation reduces data sent downstream by more than 90%, which significantly lowers compute and storage overhead.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Metric System Design&lt;br&gt;
The target metric system is composed of metrics and labels, covering the following four dimensions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqmu7jg7ozwatvu4vxxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqmu7jg7ozwatvu4vxxt.png" alt=" " width="789" height="667"&gt;&lt;/a&gt;&lt;br&gt;
Metric naming pattern: Prefix_MetricName. For example, the QPS metric of ECS is namespace_product_gw_http_req.&lt;/p&gt;

&lt;p&gt;Flink Job Development&lt;br&gt;
Job 1: Intra-region Processing&lt;br&gt;
Job 1 consumes raw logs, joins them with MySQL dimension tables, and performs two-stage aggregation: fine-grained multi-dimensional aggregation (by product, API, tenant, etc.), followed by global metric aggregation.&lt;/p&gt;

&lt;p&gt;1. Data Source: Raw logs&lt;br&gt;
Logtail collects raw logs from gateway nodes. Sample log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{  
  "AK": "STS.NZD***Lgwc",  
  "Api": "DescribeCustomResourceDetail",  
  "CallerUid": "109837***3503",  
  "ClientIp": "192.168.xx.xx",  
  "Domain": "acc-vpc.cn-huhehaote.aliyuncs.com",  
  "ErrorCode": "ResourceNotFound",  
  "Ext5": "{\"logRegionId\":\"cn-huhehaote\",\"appGroup\":\"pop-region-cn-huhehaote\",\"callerInfo\":{...},\"headers\":{...}}",  
  "HttpCode": "404",  
  "LocalIp": "11.197.xxx.xxx",  
  "Product": "acc",  
  "RegionId": "cn-huhehaote",  
  "RequestContent": "RegionId=cn-huhehaote;Action=DescribeCustomResourceDetail;Version=2024-04-02;...",  
  "TotalUsedTime": "14",  
  "Version": "2024-04-02",  
  "__time__": "1768484243"  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: Ext5 contains a nested JSON structure (such as caller information and request headers), and RequestContent is request parameters in key-value format. These complex structures need to be parsed.&lt;/p&gt;

&lt;p&gt;Based on the log structure, define a Flink source table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;CREATE TABLE openapi_log_source (  
  `__time__` BIGINT,  
  LocalIp STRING,           -- Gateway node IP  
  Product STRING,           -- Product code  
  Api STRING,               -- API   
  Version STRING,           -- API version   
  Domain STRING,            -- Access domain   
  AK STRING,                -- Access Key  
  CallerUid STRING,         -- Caller UID  
  HttpCode STRING,          -- HTTP code   
  ErrorCode STRING,         -- Error code   
  TotalUsedTime BIGINT,     -- Request time in ms  
  ClientIp STRING,          -- Client IP  
  RegionId STRING,          -- Region ID   
  Ext5 STRING,              -- Extended field (nested JSON)  
  RequestContent STRING,    -- Request parameters (k/v format)   
  ts AS TO_TIMESTAMP_LTZ(`__time__` * 1000, 3),  
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND  
) WITH (  
  'connector' = 'sls',  
  'project' = '*****',  
  'logstore' = 'pop_rpc_trace_log',  
  'endpoint' = 'cn-shanghai-intranet.log.aliyuncs.com'  
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watermark strategy: A ts - INTERVAL '5' SECOND watermark allows for up to 5 seconds of out-of-order data. Adjust this value based on your business needs. In production, with Logtail collecting gateway logs, the end-to-end latency is typically 2 to 3 seconds, making a 5-second delay sufficient for most cases. For cross-region scenarios, consider relaxing this to 10 to 15 seconds.&lt;/p&gt;
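&lt;p&gt;To make the 5-second bound concrete, here is a simplified simulation of the watermark rule. It is a sketch under stated assumptions: the watermark is advanced on every record (Flink emits watermarks periodically), and an event is only flagged as being behind the watermark, whereas Flink decides lateness per window. The function name and sample timestamps are hypothetical.&lt;/p&gt;

```python
def flag_late(event_times, max_out_of_order=5):
    """Return (ts, is_late) pairs with watermark = max(ts seen so far) - max_out_of_order."""
    watermark = float("-inf")
    flagged = []
    for ts in event_times:
        # An event is behind the watermark when the watermark has already reached its timestamp.
        flagged.append((ts, watermark >= ts))
        watermark = max(watermark, ts - max_out_of_order)
    return flagged


# Event times in seconds, listed in arrival order.
events = [100, 103, 99, 110, 104, 106]
result = flag_late(events)
print(result)
```

&lt;p&gt;In this run the event at 99 is still accepted because it trails the maximum seen timestamp (103) by only 4 seconds, while the event at 104 arrives after 110 has pushed the watermark to 105 and is flagged. Raising max_out_of_order to 10 or 15 trades extra output delay for fewer dropped records.&lt;/p&gt;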

&lt;p&gt;2.MySQL Lookup Source: Metadata Enrichment&lt;br&gt;
To add labels (such as app_group and gc_level) to metrics, associate a MySQL lookup source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Gateway cluster info (join on LocalIp)  
CREATE TABLE gateway_cluster_dim (  
  local_ip STRING,  
  app_group STRING,          -- Cluster name   
  region_id STRING,          -- Region ID  
  PRIMARY KEY (local_ip) NOT ENFORCED  
) WITH ('connector' = 'jdbc', ...);  

-- Tenant info (join on Uid)  
CREATE TABLE user_level_dim (  
  uid STRING,  
  gc_level STRING,           -- Customer level (GC5/GC6/GC7)  
  PRIMARY KEY (uid) NOT ENFORCED  
) WITH (  
  'connector' = 'jdbc',  
  'url' = 'jdbc:mysql://xxx:3306/dim_db',  
  'table-name' = 'user_level',  
  'lookup.cache.max-rows' = '50000',       -- Max num of rows to cache  
  'lookup.cache.ttl' = '10min',            -- Cache TTL  
  'lookup.max-retries' = '3'               -- Max retries   
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cache policy: In production, gateway_cluster_dim uses the ALL policy, which loads the entire table at startup and refreshes it periodically. user_level_dim uses the LRU policy, which caches up to 50,000 hot tenant records with a 10-minute TTL to balance hit rate against data freshness.&lt;/p&gt;
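&lt;p&gt;The LRU-plus-TTL behavior can be sketched in a few lines of Python. This is only an illustration of the caching idea, not the connector's actual implementation. The class name LruTtlCache, the loader callback, and the injectable clock are hypothetical names introduced for the example.&lt;/p&gt;

```python
import time
from collections import OrderedDict
from typing import Callable, Hashable, Optional


class LruTtlCache:
    """Tiny LRU cache with a per-entry TTL, similar in spirit to a lookup-join cache."""

    def __init__(self, max_rows: int, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_rows = max_rows
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._rows = OrderedDict()  # key -> (value, loaded_at)

    def get(self, key: Hashable, loader: Callable[[Hashable], Optional[str]]):
        now = self.clock()
        entry = self._rows.get(key)
        if entry is not None and self.ttl_seconds > now - entry[1]:
            self._rows.move_to_end(key)      # mark as most recently used
            return entry[0]
        value = loader(key)                  # miss or expired: query the database
        self._rows[key] = (value, now)
        self._rows.move_to_end(key)
        while len(self._rows) > self.max_rows:
            self._rows.popitem(last=False)   # evict the least recently used row
        return value
```

&lt;p&gt;Constructing LruTtlCache(max_rows=50000, ttl_seconds=600) corresponds to the settings in the table definition above: frequently seen tenants stay cached, and a changed gc_level becomes visible within at most 10 minutes.&lt;/p&gt;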

&lt;p&gt;3.Job 1 Output: Write to Regional Aggregation Log&lt;br&gt;
The processing results are written to the SLS Logstore machine_agg_log as intermediate storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Define a regional log aggregation sink  
CREATE TABLE machine_agg_log_sink (  
  window_start TIMESTAMP(3),  
  product STRING,  
  api STRING,  
  version STRING,  
  caller_uid STRING,  
  region_id STRING,  
  app_group STRING,  
  gc_level STRING,  
  http_code STRING,  
  error_code STRING,  
  qps BIGINT,  
  rt_mean DOUBLE,  
  slow1s_count BIGINT,  
  http_2xx BIGINT,  
  http_5xx BIGINT,  
  http_503 BIGINT  
) WITH (  
  'connector' = 'sls',  
  'project' = '****',  
  'logstore' = 'machine_agg_log',  -- Logstore name  
  'endpoint' = 'cn-shanghai-intranet.log.aliyuncs.com' -- Replace it with actual endpoint   
);  

-- Insert data  
INSERT INTO machine_agg_log_sink  
SELECT   
  TUMBLE_START(l.ts, INTERVAL '10' SECOND),  
  l.Product, l.Api, l.Version, l.CallerUid, g.region_id, g.app_group, u.gc_level, l.HttpCode, l.ErrorCode,  
  COUNT(*) as qps,  
  AVG(CAST(l.TotalUsedTime AS DOUBLE)),  
  SUM(CASE WHEN l.TotalUsedTime &amp;gt; 1000 THEN 1 ELSE 0 END),  
  SUM(CASE WHEN l.HttpCode &amp;gt;= '200' AND l.HttpCode &amp;lt; '300' THEN 1 ELSE 0 END),  
  SUM(CASE WHEN l.HttpCode &amp;gt;= '500' THEN 1 ELSE 0 END),  
  SUM(CASE WHEN l.HttpCode = '503' THEN 1 ELSE 0 END)  
FROM openapi_log_source l  
LEFT JOIN gateway_cluster_dim FOR SYSTEM_TIME AS OF l.ts AS g ON l.LocalIp = g.local_ip  
LEFT JOIN user_level_dim FOR SYSTEM_TIME AS OF l.ts AS u ON l.CallerUid = u.uid  
GROUP BY   
  TUMBLE(l.ts, INTERVAL '10' SECOND),  
  l.Product, l.Api, l.Version, l.CallerUid, g.region_id, g.app_group, u.gc_level, l.HttpCode, l.ErrorCode;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Job 2: Transform and Aggregate Metrics&lt;br&gt;
Job 2 is deployed in each region. It consumes the regional machine_agg_log Logstore, transforms the data into a time series format, and writes it to a centralized MetricStore in China (Shanghai).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Source: Consume a Regional Aggregation Log
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;CREATE TABLE machine_agg_log_source (  
  window_start TIMESTAMP(3),  
  product STRING,  
  region_id STRING,  
  -- ... Other field definitions are identical to machine_agg_log_sink   
  WATERMARK FOR window_start AS window_start - INTERVAL '5' SECOND  
) WITH (  
  'connector' = 'sls',  
  'project' = '****',  
  'logstore' = 'machine_agg_log',  -- Consume the logstore in the region   
  'endpoint' = 'cn-shanghai-intranet.log.aliyuncs.com'  
); 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Sink: Centralized MetricStore Sink
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;CREATE TABLE metricstore_sink (  
  `__time_nano__` BIGINT,  
  `__name__` STRING,  
  `__labels__` STRING,  
  `__value__` DOUBLE  
) WITH (  
  'connector' = 'sls',  
  'project' = '****',      -- The centralized SLS project   
  'logstore' = 'openapi_metrics',            -- The centralized logstore   
  'endpoint' = 'cn-shanghai-intranet.log.aliyuncs.com' -- The region endpoint   
); 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;3.Compute and Aggregation Logic&lt;br&gt;
Job 2 performs further aggregation (such as by product), adds the timestamp info, and writes to the centralized project.&lt;/p&gt;

&lt;p&gt;Example: Calculate QPS by product and aggregate it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;INSERT INTO metricstore_sink  
SELECT   
  UNIX_TIMESTAMP(CAST(TUMBLE_START(window_start, INTERVAL '1' MINUTE) AS STRING)) * 1000000000,  
  'namespace_product_gw_http_req',  
  CONCAT('product=', product, '|region_id=', region_id), -- Retain region info  
  CAST(SUM(qps) AS DOUBLE)  
FROM machine_agg_log_source  
GROUP BY TUMBLE(window_start, INTERVAL '1' MINUTE), product, region_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
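&lt;p&gt;For readers less familiar with Flink SQL, the same roll-up can be sketched in plain Python: 10-second counts are summed into 1-minute points and shaped like the four sink columns. The function name rollup_to_minute and the tuple layout of the input are assumptions made for this illustration. The label string simply copies the CONCAT layout from the statement above.&lt;/p&gt;

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

NANOS_PER_SECOND = 1_000_000_000


def rollup_to_minute(rows: Iterable[Tuple[int, str, str, int]],
                     metric_name: str = "namespace_product_gw_http_req"
                     ) -> List[Tuple[int, str, str, float]]:
    """Sum 10-second request counts into 1-minute points per (product, region).

    Each input row is (window_start_epoch_seconds, product, region_id, count).
    Each output row mirrors the sink columns:
    (__time_nano__, __name__, __labels__, __value__).
    """
    totals: Dict[Tuple[int, str, str], int] = defaultdict(int)
    for window_start, product, region_id, count in rows:
        minute_start = window_start - window_start % 60   # align to the minute
        totals[(minute_start, product, region_id)] += count
    points = []
    for (minute_start, product, region_id), total in sorted(totals.items()):
        labels = f"product={product}|region_id={region_id}"  # same layout as the CONCAT
        points.append((minute_start * NANOS_PER_SECOND, metric_name, labels, float(total)))
    return points
```

&lt;p&gt;Because each region emits one point per product per minute instead of one row per request, only these compact points need to cross regions.&lt;/p&gt;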



&lt;p&gt;Solution benefits:&lt;/p&gt;

&lt;p&gt;Bandwidth savings: Job 1 aggregates massive logs into smaller data (reduced by 99%). Job 2 only transmits these lightweight metrics across regions, which greatly reduces transfer costs.&lt;/p&gt;

&lt;p&gt;Isolation: Data processing in each region is independent. A failure in a single region does not affect other regions.&lt;/p&gt;

&lt;p&gt;Job Configuration and Optimization&lt;br&gt;
To ensure job stability and data accuracy, we performed special optimization on the checkpoint and state backend in the production environment.&lt;/p&gt;

&lt;p&gt;Checkpoint Configuration and Trade-offs&lt;br&gt;
Two checkpointing strategies are provided: one for data consistency, the other for service availability:&lt;/p&gt;

&lt;p&gt;Strategy A: Prioritizing data consistency (recommended for general scenarios)&lt;/p&gt;

&lt;p&gt;This strategy is applicable to most monitoring scenarios that prioritize data accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;SET 'execution.checkpointing.interval' = '60s';           -- Checkpoint every one minute   
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';      -- Exactly-once semantics   
SET 'execution.checkpointing.timeout' = '10min';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strategy B: Prioritizing high availability (this example)&lt;/p&gt;

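&lt;p&gt;Because this example involves highly concurrent data processing and is sensitive to availability, we adopt strategy B. It reduces the performance jitter caused by frequent checkpointing, at the cost of at-least-once semantics: records may be processed more than once after a failure recovery, so aggregated counts can be over-counted for the replayed interval:&lt;br&gt;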
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;SET 'execution.checkpointing.interval' = '180s';          -- Checkpoint at a three-minute interval  
SET 'execution.checkpointing.mode' = 'AT_LEAST_ONCE';     -- Use at-least-once semantics   
SET 'execution.checkpointing.timeout' = '15min';          -- Relax checkpointing timeout   
SET 'execution.checkpointing.max-concurrent-checkpoints' = '1';  
SET 'execution.checkpointing.tolerable-failed-checkpoints' = '10'; -- Tolerate consecutive checkpoint failures to avoid job restart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strategy comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uhtxumfsoa9uccqvaga.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uhtxumfsoa9uccqvaga.png" alt=" " width="789" height="161"&gt;&lt;/a&gt;&lt;br&gt;
State Backend&lt;br&gt;
Realtime Compute for Apache Flink provides the enterprise-level GeminiStateBackend. Compared with RocksDB used in Apache Flink, GeminiStateBackend is optimized for large-state jobs under the storage-compute-separation architecture. This example enables GeminiStateBackend and key-value separation to deal with large state and multiple aggregation keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;SET 'table.exec.state.backend' = 'gemini';                -- Enable GeminiStateBackend  
SET 'state.backend.gemini.kv.separate.mode' = 'GLOBAL_ENABLE'; -- Enable k/v separation 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GeminiStateBackend vs. RocksDB:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uofz4negoyhlhjlff9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uofz4negoyhlhjlff9r.png" alt=" " width="789" height="438"&gt;&lt;/a&gt;&lt;br&gt;
Production recommendations: For scenarios such as log aggregation with a large state size and extremely high throughput requirements, use GeminiStateBackend and key-value separation. Tests show that after key-value separation is enabled, the job's CPU utilization during traffic peaks decreases by 20% and the checkpoint duration becomes more stable.&lt;/p&gt;

&lt;p&gt;Visualization and Alert&lt;br&gt;
Metric Visualization&lt;br&gt;
A multi-dimensional API monitoring Grafana dashboard is built for deep drill-down analysis, by product or specific error code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgjqe9hjecqs1cv0myf7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgjqe9hjecqs1cv0myf7.png" alt=" " width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd66bf2n57buyd4feref6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd66bf2n57buyd4feref6.png" alt=" " width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25gogi31s4vweh9booa6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25gogi31s4vweh9booa6.png" alt=" " width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvwn9eq2ooxf4990y7s3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvwn9eq2ooxf4990y7s3.png" alt=" " width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Self-service Query and Alerting&lt;br&gt;
After SLS MetricStore is added as a data source in Grafana, each cloud product team can use Prometheus Query Language (PromQL) syntax to query metrics and configure their own alert rules:&lt;/p&gt;

&lt;p&gt;Sample query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# QPS trend  
sum(namespace_product_gw_http_req) by (product)  

# Error rate (current vs. 1 hour ago); matches when the rate has more than doubled  
# The series hold per-minute counts (gauges), so rate() is not applied  
(  
  sum(namespace_product_gw_http_5xx) / sum(namespace_product_gw_http_req)  
) / (  
  sum(namespace_product_gw_http_5xx offset 1h) / sum(namespace_product_gw_http_req offset 1h)  
) &amp;gt; 2  

# Avg latency   
avg(namespace_product_gw_rt_mean) by (product)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example alert rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;- alert: HighErrorRate
  expr: sum(namespace_product_gw_http_5xx) by (product) / sum(namespace_product_gw_http_req) by (product) &amp;gt; 0.01
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.product }} error rate is too high"
    description: "Current error rate: {{ $value | printf \"%.2f\" }}%"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each cloud service team can configure their monitoring dashboard and alert rules in Grafana for autonomous O&amp;amp;M.&lt;/p&gt;

&lt;p&gt;Validation in Production&lt;br&gt;
This solution has been running stably in production. Core metrics:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzzguzfjwllb5lt1z8fe0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzzguzfjwllb5lt1z8fe0.png" alt=" " width="800" height="461"&gt;&lt;/a&gt;&lt;br&gt;
Thanks to the distributed computing capability of Flink and the high throughput storage of SLS, this solution has successfully supported the real-time monitoring of all API calls in Alibaba Cloud Open Platform. It covers more than 60 global regions and more than 300 cloud products, processes more than 200 TB of compressed logs (about 2 PB of raw logs, with a single log being about 4 to 5 KB) per day, and generates over 500,000 time series metrics.&lt;/p&gt;

&lt;p&gt;Data Processing Size&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf52xigh2bgcixze4qo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf52xigh2bgcixze4qo4.png" alt=" " width="789" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Metric Generation Capability&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpgdsddpfp6y3q5vyw3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpgdsddpfp6y3q5vyw3c.png" alt=" " width="789" height="188"&gt;&lt;/a&gt;&lt;br&gt;
System Stability&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqq4v2w4o95jxnnlhdq4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqq4v2w4o95jxnnlhdq4g.png" alt=" " width="789" height="187"&gt;&lt;/a&gt;&lt;br&gt;
Business Benefits&lt;br&gt;
● Rapid fault discovery: The fault discovery time is shortened from minutes to seconds.&lt;/p&gt;

&lt;p&gt;● Improved O&amp;amp;M efficiency: More than 300 cloud service teams have achieved self-service monitoring configuration.&lt;/p&gt;

&lt;p&gt;During implementation, we found that the raw logs contain many redundant fields and nested structures, whereas metric calculation needs only a few core fields. To address this, we introduced predicate pushdown at the source to filter rows and prune fields before data enters Flink, which reduced network transmission and sped up Flink processing.&lt;/p&gt;

&lt;p&gt;Advanced Optimization: Predicate Pushdown&lt;br&gt;
Predicate Pushdown Capability by Connector&lt;br&gt;
Predicate pushdown, a classic database and big data optimization, executes filter conditions at the source. This reduces data volume and compute overhead. Flink's pushdown capability depends on its source connector implementation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkmgg059cljtvwlxundt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkmgg059cljtvwlxundt.png" alt=" " width="789" height="272"&gt;&lt;/a&gt;&lt;br&gt;
Predicate Pushdown with SPL&lt;br&gt;
In its early versions, the Realtime Compute for Apache Flink connector for SLS pulled all data from an SLS Logstore, even though many fields were not needed. SPL enables source-side predicate pushdown: filtering and transformation run inside SLS, and only the processed results are sent to Flink.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcctvo9hw2yl4co8qi9so.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcctvo9hw2yl4co8qi9so.png" alt=" " width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;p&gt;● SIMD vectorization: SPL's vectorized execution engine uses CPU SIMD instructions (e.g., AVX2/AVX-512) for batch data processing, achieving several times the performance of row-by-row processing.&lt;/p&gt;

&lt;p&gt;● Local processing: Data is processed on the SLS data node, so raw data does not need to cross the network, which keeps network I/O from becoming a bottleneck.&lt;/p&gt;

&lt;p&gt;● Columnar storage acceleration: SLS's columnar storage, combined with column pruning through the project instruction, reads only the necessary columns. This significantly reduces disk I/O.&lt;/p&gt;

&lt;p&gt;● Zero-copy transmission: The processed data directly enters consumption, which reduces the memory copy overhead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuee9e2uimz92bgal489l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuee9e2uimz92bgal489l.png" alt=" " width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Billing tips:&lt;/p&gt;

&lt;p&gt;Non-SPL consumption: billing is based on the transmitted (compressed) data size.&lt;/p&gt;

&lt;p&gt;SPL consumption: billing is based on the raw (uncompressed) data size.&lt;/p&gt;

&lt;p&gt;For detailed pricing and differences, refer to SLS pricing documentation.&lt;/p&gt;

&lt;p&gt;Sample SPL Configuration&lt;br&gt;
This section introduces filtering data with SPL at the source. Consider the traditional approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Traditional approach: Pull all data and filter with Flink  
SELECT * FROM openapi_log_source  
WHERE Domain != 'popwarmup.aliyuncs.com'  
  AND JSON_VALUE(Ext5, '$.logRegionId') NOT IN ('cn-shanghai', 'cn-beijing') 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With SPL, filtering and transformation are completed on the SLS side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- 1.Row filtering: Exclude invalid data  
*   
| where Domain != 'popwarmup.aliyuncs.com'  

-- 2.Expand nested JSON   
| parse-json -prefix='ext5_' Ext5    
| where ext5_logRegionId not in ('cn-shanghai', 'cn-beijing', 'cn-hangzhou')  
| parse-json -prefix='callerInfo_' ext5_callerInfo    
| parse-json -prefix='headers_' ext5_headers    

-- 3.Extract key-value fields  
| parse-regexp RequestContent, '(?:^|;)RegionId=([^;]*)' as request_regionId    

-- 4.Column pruning: Retain necessary fields to reduce output data size  
| project LocalIp, Product, Version, Api, Domain, ErrorCode, HttpCode,   
         TotalUsedTime, AK, RegionId, ClientIp,   
         callerInfo_callerType, callerInfo_callerUid, callerInfo_ownerId,  
         ext5_regionId, ext5_appGroup, ext5_stage, request_regionId
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
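&lt;p&gt;To show what the key-value extraction in step 3 has to cope with, here is a small Python sketch that splits a RequestContent string into fields. It is an illustration only, not how SPL works internally, and the function name parse_request_content is hypothetical. It assumes that pairs are separated by semicolons and that keys and values are joined by the first equals sign, as in the sample log.&lt;/p&gt;

```python
from typing import Dict


def parse_request_content(raw: str) -> Dict[str, str]:
    """Split 'k1=v1;k2=v2;...' into a dict, ignoring fragments without '='."""
    fields: Dict[str, str] = {}
    for part in raw.split(";"):
        key, sep, value = part.partition("=")   # split on the first '=' only
        if sep and key:
            fields[key.strip()] = value
    return fields
```

&lt;p&gt;For the sample value RegionId=cn-huhehaote;Action=DescribeCustomResourceDetail;Version=2024-04-02, looking up RegionId returns cn-huhehaote regardless of where the pair appears in the string, which is the property any extraction rule for this field needs.&lt;/p&gt;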



&lt;p&gt;Use SPL&lt;br&gt;
In Flink SQL, reference the pre-configured SPL using the processor parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;CREATE TABLE openapi_log_source (  
  `__time__` BIGINT,  
  -- SPL processed fields (JSON object expanded, column pruned)  
  LocalIp STRING,  
  Product STRING,  
  Version STRING,  
  Api STRING,  
  Domain STRING,  
  ErrorCode STRING,  
  HttpCode STRING,  
  TotalUsedTime BIGINT,  
  AK STRING,  
  RegionId STRING,  
  ClientIp STRING,  
  callerInfo_callerType STRING,      -- Get from Ext5.callerInfo  
  callerInfo_callerUid STRING,  
  callerInfo_ownerId STRING,  
  ext5_regionId STRING,              -- Get from Ext5   
  ext5_appGroup STRING,  
  ext5_stage STRING,  
  request_regionId STRING,           -- Get from RequestContent  
  ts AS TO_TIMESTAMP_LTZ(`__time__` * 1000, 3),  
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND  
) WITH (  
  'connector' = 'sls',  
  'project' = '****',  
  'logstore' = 'pop_rpc_trace_log',  
  'endpoint' = 'cn-shanghai-intranet.log.aliyuncs.com',  
  'processor' = 'openapi-processor'  -- Use SPL for filter pushdown  
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Optimization Effects&lt;br&gt;
SPL delivers significant improvements in the following areas:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmn7n383xaez6rfjuf4v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmn7n383xaez6rfjuf4v.png" alt=" " width="789" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Summary&lt;br&gt;
With the cloud-native solution, we have successfully built a real-time monitoring system for Alibaba Cloud API gateway. Recap:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbv3v1gcogpuphv5ri3k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbv3v1gcogpuphv5ri3k.png" alt=" " width="789" height="187"&gt;&lt;/a&gt;&lt;br&gt;
Flink Highlights&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ccyxe2dhprn8fur8h0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ccyxe2dhprn8fur8h0y.png" alt=" " width="789" height="344"&gt;&lt;/a&gt;&lt;br&gt;
Architectural Design Insights&lt;/p&gt;

&lt;p&gt;1.Alleviate data skew: Use layered aggregation, first locally and then globally by business dimension.&lt;br&gt;
2.Reduce costs with predicate pushdown: Filter at the source (e.g., with SPL) to minimize network transmission and compute.&lt;br&gt;
3.Enterprise-grade state backend: For large states, use GeminiStateBackend with key-value separation for improved I/O and job stability.&lt;br&gt;
The approach described in this article can be applied to similar scenarios, such as microservice call-chain monitoring, Alibaba Cloud CDN log analysis, and Internet of Things (IoT) data aggregation.&lt;/p&gt;

&lt;p&gt;References&lt;br&gt;
● Realtime Compute for Apache Flink's SLS connector&lt;/p&gt;

&lt;p&gt;● SLS MetricStore&lt;/p&gt;

&lt;p&gt;● Send time series data from SLS to Grafana&lt;/p&gt;

&lt;p&gt;● SPL syntax&lt;/p&gt;

</description>
      <category>sls</category>
      <category>api</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Building Cross-Cloud Observability: One Architecture, Unified Analytics</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Wed, 29 Apr 2026 07:09:49 +0000</pubDate>
      <link>https://dev.to/observabilityguy/building-cross-cloud-observability-one-architecture-unified-analytics-3ae2</link>
      <guid>https://dev.to/observabilityguy/building-cross-cloud-observability-one-architecture-unified-analytics-3ae2</guid>
      <description>&lt;p&gt;This article introduces a unified observability architecture for cross-cloud log analysis and AIOps, designed to streamline multicloud O&amp;amp;M and reduce costs for global enterprises.&lt;/p&gt;

&lt;p&gt;1.Customer Requirements&lt;br&gt;
1.1 Unified Analysis of Multicloud Logs&lt;br&gt;
A common multicloud pattern is that edge security and access outside China are handled by Cloudflare (Web Application Firewall (WAF), Content Delivery Network (CDN), and Access), with verbose logs delivered to Amazon Simple Storage Service (S3) through Logpush for low-cost archiving and compliance retention. Meanwhile, the headquarters' core business and observability systems often run on Alibaba Cloud. For example, application, gateway, and business logs flow into Simple Log Service (SLS), and the alerting, on-call, and ticket systems are also built around Alibaba Cloud. As a result, the "chain of evidence" for the same user request, attack, or release change is split between the third-party cloud vendor and Alibaba Cloud, which makes unified retrieval, correlation analysis, and closed-loop handling on a single platform difficult.&lt;br&gt;
For the platform engineering team, the core challenge is not where the logs are stored, but the lack of a unified platform for analysis and operational tasks.&lt;/p&gt;

&lt;p&gt;● Logs are in S3, but troubleshooting, security analytics, and operational analysis are often scattered across multiple systems (Cloudflare console, Athena, Glue, Amazon Elastic MapReduce (EMR), CloudWatch, Business Intelligence (BI) tools, and self-built alerting).&lt;/p&gt;

&lt;p&gt;● Metrics cannot be standardized: the same metric (such as 5xx count, P99 latency, or WAF block ratio) is calculated separately in different systems, which makes changes hard to audit, reuse, or migrate.&lt;/p&gt;

&lt;p&gt;● The incident response chain is long: it requires querying logs, then manually summarizing them, then sending notifications, then dispatching tickets or rolling back, which needlessly lengthens the Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6vxbsspzkilecruy48a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6vxbsspzkilecruy48a.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1.2 Reduce Costs and Simplify O&amp;amp;M&lt;br&gt;
If S3 is used for log storage, putting the data to use (query and analysis, visualization, and alerting) usually requires a combination of additional components for querying, ETL, metrics, and alerting. The chain becomes longer, configuration and troubleshooting span multiple systems, and O&amp;amp;M complexity increases significantly.&lt;/p&gt;

&lt;p&gt;If data is sent directly to CloudWatch instead, CloudWatch Logs handles collection and storage, Logs Insights handles query and analysis, and Dashboards and Alarms close the loop on visualization and alerting. The overall cost of this approach is usually very high.&lt;/p&gt;

&lt;p&gt;2.SLS Solutions&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxap1b0b5zqp553dzk3el.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxap1b0b5zqp553dzk3el.png" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following sections walk through the data import, processing, query and analysis, dashboard, and alerting features of this SLS solution step by step.&lt;/p&gt;

&lt;p&gt;2.1 Import Data from S3 to SLS&lt;br&gt;
Data import is often seen as a simple three-step "read, transmit, write" procedure. But consider what happens when you face:&lt;/p&gt;

&lt;p&gt;● Logs that generate thousands of files per minute&lt;/p&gt;

&lt;p&gt;● Attack and defense traffic that instantly surges from 1 GB to 10 GB&lt;/p&gt;

&lt;p&gt;● Various mixed data formats such as gzip, snappy, JavaScript Object Notation (JSON), and Comma-Separated Values (CSV)&lt;/p&gt;

&lt;p&gt;You will find that this is by no means a simple "copy and paste" operation.&lt;/p&gt;

&lt;p&gt;The sections below first describe the difficulties encountered in a real import procedure, and then explain the corresponding implementation methods:&lt;/p&gt;

&lt;p&gt;Challenge 1: "Real-time discovery" of massive numbers of small files is not simple (full traversal vs. timeliness, incremental traversal vs. integrity)&lt;br&gt;
The S3 ListObjects operation only supports traversal in lexicographic order and does not support filtering by time. When a bucket or folder holds a huge number of historical files, a full scan may take a long time. However, if only an incremental scan is performed, files may be missed because file names do not arrive in order.&lt;/p&gt;

&lt;p&gt;Consequences: New files are not discovered in time (latency increases), or in extreme cases they are missed entirely (a threat to data integrity).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8h8z5n7w609mudp7lm6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8h8z5n7w609mudp7lm6k.png" alt=" " width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Challenge 2: The throughput must be able to keep up with the peak, but cannot rely on "manual parameter tuning" (traffic burst + the "long tail" problem, where processing is slowed down by a few oversized files)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In real business, traffic bursts: it is usually 1 GB/minute but may surge to 10 GB/minute during campaigns or incidents. If scale-out is slow, the queue backs up and end-to-end latency quickly gets out of control.&lt;/li&gt;
&lt;li&gt;Even when concurrency is fully utilized, long tails still occur: assigning work evenly by file count lets a single oversized file drag down a job, and the overall latency is determined by the slowest job.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Challenge 3: The data formats are often mixed and unpredictable&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The same bucket often mixes JSON, CSV, and plain text. Even JSON may be line-delimited JSON, a JSON array, or a service-specific format (such as CloudTrail). Compression may be .gz, .snappy, .lz4, or .zstd.&lt;/li&gt;
&lt;li&gt;Attempting to detect the data format automatically introduces sampling misjudgments and additional read overhead, which slows down the transfer pipeline instead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Challenge 4: Data integrity and traceability must be guaranteed (ensuring no data is lost, supporting reprocessing, and enabling problem-file identification)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The import pipeline naturally involves retries and replays: network jitter, consumption timeouts, job restarts, and events and scans hitting the same object at the same time can all cause repeated pulls.&lt;/li&gt;
&lt;li&gt;More importantly, data loss is often harder to spot. Missed events, permission changes, checkpoint drift, and parsing errors can cause data gaps during a certain period without being noticed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our design for these challenges is as follows:&lt;/p&gt;

&lt;p&gt;● Design point 1: A "dual-mechanism" for file discovery ensures both timeliness and completeness.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQS event-driven discovery: S3 events → SQS → consumed by the data import job (suitable for scenarios with irregular file names or low-latency requirements).&lt;/li&gt;
&lt;li&gt;Dual-mode traversal: incremental catch-up to the latest checkpoint + a periodic full scan as a fallback (to prevent missed files).&lt;/li&gt;
&lt;/ul&gt;
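
&lt;p&gt;The dual-mode traversal can be pictured with a small in-memory sketch. It is not the SLS implementation: the Discovery class, its checkpoint field, and the sample keys are hypothetical, and a sorted Python list stands in for the lexicographic listing that S3 returns. It only shows why an incremental scan alone can miss a late file and how a periodic full scan catches it.&lt;/p&gt;

```python
from bisect import bisect_right


def incremental_scan(sorted_keys, checkpoint):
    """Return keys that sort strictly after the checkpoint (cheap, may miss late keys)."""
    start = bisect_right(sorted_keys, checkpoint)
    return sorted_keys[start:]


def full_scan(sorted_keys, seen):
    """Return every key not yet discovered (expensive, but complete)."""
    return [key for key in sorted_keys if key not in seen]


class Discovery:
    """Hypothetical discovery state: a lexicographic checkpoint plus the set of seen keys."""

    def __init__(self):
        self.checkpoint = ""
        self.seen = set()

    def _record(self, keys):
        new_keys = [key for key in keys if key not in self.seen]
        self.seen.update(new_keys)
        if new_keys:
            self.checkpoint = max(self.checkpoint, max(new_keys))
        return new_keys

    def run_incremental(self, bucket):
        return self._record(incremental_scan(sorted(bucket), self.checkpoint))

    def run_full_fallback(self, bucket):
        return self._record(full_scan(sorted(bucket), self.seen))


bucket = ["logs/2024-12-25T10-00.gz", "logs/2024-12-25T10-02.gz"]
discovery = Discovery()
first = discovery.run_incremental(bucket)

# A late file arrives whose name sorts before the current checkpoint.
bucket.append("logs/2024-12-25T10-01.gz")
missed_by_incremental = discovery.run_incremental(bucket)
caught_by_fallback = discovery.run_full_fallback(bucket)
print(first, missed_by_incremental, caught_by_fallback)
```

&lt;p&gt;In this sketch the incremental pass returns nothing for the late file, and only the full fallback pass reports it. This is the gap that the periodic full scan, or the event-driven path, is meant to close.&lt;/p&gt;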

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmxsy86duzdw14qvqzw1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmxsy86duzdw14qvqzw1.png" alt=" " width="789" height="320"&gt;&lt;/a&gt;&lt;br&gt;
● Design point 2: Auto Scaling + balanced allocation by data volume to handle traffic peaks and manage long-tail data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concurrent jobs automatically scale out or in based on queue backlog and data volume, which avoids manual parameter tuning.&lt;/li&gt;
&lt;li&gt;Job assignment is upgraded from "by the number of files" to "balanced allocation by data volume", so that the jobs in a concurrent round finish at roughly the same time.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feo1c632ufiou8uinlrqt.png" alt=" " width="800" height="276"&gt;
&lt;/li&gt;
&lt;/ul&gt;
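
&lt;p&gt;To illustrate why allocation by data volume shortens the long tail, the sketch below compares round-robin assignment by file count with a simple greedy strategy (largest file first, to the least-loaded worker). The greedy heuristic and the file sizes are assumptions chosen for illustration; the article does not state which balancing algorithm SLS uses.&lt;/p&gt;

```python
import heapq


def assign_by_count(sizes, workers):
    """Round-robin by file count: ignores how large each file is."""
    loads = [0] * workers
    for index, size in enumerate(sizes):
        loads[index % workers] += size
    return loads


def assign_by_volume(sizes, workers):
    """Greedy: give the largest remaining file to the least-loaded worker."""
    heap = [(0, worker) for worker in range(workers)]
    heapq.heapify(heap)
    loads = [0] * workers
    for size in sorted(sizes, reverse=True):
        load, worker = heapq.heappop(heap)
        loads[worker] = load + size
        heapq.heappush(heap, (loads[worker], worker))
    return loads


# Hypothetical file sizes in MB: one oversized file plus nine small ones.
sizes_mb = [900, 100, 100, 100, 100, 100, 100, 100, 100, 100]
by_count = assign_by_count(sizes_mb, workers=2)
by_volume = assign_by_volume(sizes_mb, workers=2)
print(max(by_count), max(by_volume))
```

&lt;p&gt;Because a round finishes only when its slowest worker finishes, the largest per-worker load is the number to watch. In this example it drops from 1300 MB to 900 MB.&lt;/p&gt;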

&lt;p&gt;● Design point 3: Auto compression detection and explicit configuration of data formats (no guessing).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compression formats are detected automatically from file suffixes (such as .gz, .snappy, .lz4, and .zstd), and files are decompressed accordingly.&lt;/li&gt;
&lt;li&gt;Data formats are explicitly specified by data import jobs (such as JSON, CSV, single-line, multi-line, CloudTrail, and JSON array). Encoding settings are also provided (UTF-8 by default; another encoding can be specified when necessary).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;● Design point 4: Checkpoint and status management + retry and fencing + file-level tracking to make data backfilling feasible.&lt;/p&gt;

&lt;p&gt;On the discovery side, "events + scan fallback" form a compensating closed loop that reduces the probability of missed files.&lt;br&gt;
On the pull side, checkpoints and processing status are maintained. Failed files enter the retry or fencing queue, and data can be backfilled by replaying object keys.&lt;br&gt;
Deduplication and idempotence are keyed on object identity (such as key + etag/version + offset), which keeps duplicates controllable and makes gaps visible.&lt;/p&gt;

&lt;p&gt;2.2 One-stop Data Analytics&lt;br&gt;
Data import is only the first step. A complete observability closed loop also requires data governance, interactive search, visualization, and intelligent alerting. SLS integrates these capabilities into a unified platform. The core principles of each step are described below.&lt;/p&gt;

&lt;p&gt;Data transformation: fully managed streaming extract, transform, and load (ETL)&lt;br&gt;
SLS data transformation runs on managed real-time consumption jobs and uses Structured Process Language (SPL) syntax to process logs as streams. It is fully managed, supports auto scaling, and makes data visible within seconds. It also supports line-by-line debugging and code hinting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1efibo1kxg6ryooyf2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1efibo1kxg6ryooyf2q.png" alt=" " width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SLS uses the SPL engine as the kernel of the log pipeline. The engine benefits from column-oriented computation, single instruction, multiple data (SIMD) acceleration, and a C++ implementation. Building on the distributed architecture of the SPL engine, we have redesigned the elasticity mechanism: instead of scaling only at the granularity of an instance (such as a Kubernetes pod or a service compute unit), it can scale quickly at the granularity of a DataBlock (MB level).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsy4j58z1j92prd4ewxd5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsy4j58z1j92prd4ewxd5.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scenario capabilities:&lt;/p&gt;

&lt;p&gt;● Pre-compliance: IP-to-geo conversion and desensitization are completed outside China. Only compliant fields are retained for cross-border transfer, to meet General Data Protection Regulation (GDPR) and data export requirements.&lt;/p&gt;

&lt;p&gt;● Data filtering: Invalid data is removed to reduce downstream index and storage overheads.&lt;/p&gt;

&lt;p&gt;● Structured extraction: Raw fields are transformed into analyzable metrics, and nested JSON is parsed once to avoid repeated calculations during queries.&lt;/p&gt;

&lt;p&gt;● Field projection: Only gold fields are delivered, which can reduce cross-border traffic and index costs by 50% to 80%.&lt;/p&gt;

&lt;p&gt;● Field enrichment: Logs (such as order logs) are joined (JOIN) with dimension tables (such as user information tables) to add more dimension information to the logs for data analytics.&lt;/p&gt;

&lt;p&gt;● Data forwarding: Logstore data can be forwarded and aggregated to destination databases. Data can also be routed flexibly based on field content.&lt;/p&gt;
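
&lt;p&gt;Field projection and field enrichment are easy to picture outside SPL. The following sketch is only an illustration in plain Python: the order log, the user dimension table, and every field name in it are hypothetical.&lt;/p&gt;

```python
def project(log, keep):
    """Keep only the listed fields (field projection)."""
    return {name: value for name, value in log.items() if name in keep}


def enrich(log, dimension, key):
    """Left join: copy dimension attributes onto the log when the key matches."""
    enriched = dict(log)
    enriched.update(dimension.get(log.get(key), {}))
    return enriched


# Hypothetical order log and user dimension table.
order_log = {"order_id": "o-1", "user_id": "u-42", "amount": 30, "debug_blob": "..."}
users = {"u-42": {"user_tier": "gold", "user_region": "EU"}}

slim = project(order_log, keep={"order_id", "user_id", "amount"})
result = enrich(slim, users, key="user_id")
print(result)
```

&lt;p&gt;Projecting before enriching keeps the record small, which is where the traffic and index savings come from. A log whose key has no match in the dimension table passes through unchanged.&lt;/p&gt;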

&lt;p&gt;Query and analysis: High-performance engine and responses in seconds&lt;br&gt;
SLS provides a high-performance query engine that supports the index mode (responses in seconds for tens of billions of data records) and the scan mode (lightweight analysis). Queries run directly on indexes, with no need to pre-build datasets or wait for refresh delays. For ultra-large-scale data analytics scenarios, SLS provides Dedicated SQL, which includes an enhancement mode and a complete accuracy mode.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgei9tgmsn1njf674nv9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgei9tgmsn1njf674nv9.png" alt=" " width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Query engine and capabilities:&lt;/p&gt;

&lt;p&gt;● Nearly a hundred built-in functions: Statistical, aggregation, string, time, and geospatial functions are provided out of the box.&lt;/p&gt;

&lt;p&gt;● Cross-store federated queries: StoreView supports joint queries across projects and Logstores.&lt;/p&gt;

&lt;p&gt;● Dedicated SQL: Provides high-precision analysis in large data volume scenarios to avoid sampling errors.&lt;/p&gt;

&lt;p&gt;● Scheduled SQL: Supports scheduled execution of SQL queries for report generation and metric pre-computation.&lt;/p&gt;

&lt;p&gt;Dashboards: Rich visualization, out of the box&lt;br&gt;
SLS dashboards are the data visualization tool of Simple Log Service and display query and analysis results graphically. A dashboard usually contains multiple statistical charts that summarize key performance metrics, important data, and analysis results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgndkgclc9ahfavfhejvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgndkgclc9ahfavfhejvt.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Visualization capabilities:&lt;/p&gt;

&lt;p&gt;● Rich chart types: Multiple statistical charts such as tables, line charts, column charts, pie charts, and maps are supported. The Pro version supports overlaying multiple query results in one chart.&lt;/p&gt;

&lt;p&gt;● Interaction and drill-down: Supports global time filtering, variable-based filter interaction, and chart drill-down to move from the overall picture to details layer by layer.&lt;/p&gt;

&lt;p&gt;● Subscribe and share: Supports periodically rendering dashboards into images and sending them by email or to DingTalk groups. Supports embedding the console into third-party systems.&lt;/p&gt;

&lt;p&gt;● Third-party integration: Can be integrated with visualization tools such as DataV, Grafana, and Tableau, and supports bidirectional import and export of Grafana dashboards.&lt;/p&gt;

&lt;p&gt;Alerting: A one-stop AIOps platform&lt;br&gt;
SLS alerting is a one-stop artificial intelligence for IT operations (AIOps) platform that covers alert monitoring, denoising, incident management, and notification dispatch. It consists of subsystems such as the alert monitoring system, alert management system, and notification management system. After logs or metrics are ingested, you can create monitoring jobs, notification channels, and alert policies within minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftjsc9auu9cxve0208ax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftjsc9auu9cxve0208ax.png" alt=" " width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feature advantages:&lt;/p&gt;

&lt;p&gt;● Low cost and fully managed: Provided as Software as a Service (SaaS). Except for text messages and voice calls, no additional fees are charged for alert monitoring, incident management, or other features.&lt;/p&gt;

&lt;p&gt;● Denoising and dispatch: Supports grouping, deduplication, suppression, and escalation to avoid alert storms. Supports automatic dispatch to different teams based on rules.&lt;/p&gt;

&lt;p&gt;● Rich notification channels: Natively integrates DingTalk, WeCom, Lark, Slack, text messages, voice calls, and Webhooks.&lt;/p&gt;

&lt;p&gt;2.3 O&amp;amp;M Simplification (Using Integration to Replace Multiple Product Portfolios)&lt;br&gt;
2.3.1 AWS multi-product portfolio: which components are usually required to achieve the same closed loop&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3nwq8y52hq0j567g9ab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3nwq8y52hq0j567g9ab.png" alt=" " width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having multiple components is not necessarily bad, but when your requirement is "unified standards, minute-level closed loop, and controllable low cost," multiple components mean:&lt;/p&gt;

&lt;p&gt;● Longer pipeline: Data needs to be moved more times (ETL, saving to intermediate tables, and refreshing datasets).&lt;/p&gt;

&lt;p&gt;● Larger failure surface: Jitter in any step affects end-to-end timeliness.&lt;/p&gt;

&lt;p&gt;● More fragmented billing: Costs for storage, scans, ETL, alerting, visualization, and network traffic are billed separately and all add up.&lt;/p&gt;

&lt;p&gt;2.3.2 SLS integration vs. AWS multi-product portfolio&lt;br&gt;
In SLS, you can create a reusable engineering template that combines "import + processing + index + query + dashboard + alerting/incident management," use the template to deliver the first version, and then use policies to iterate on costs and results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjg7elcqn3zch5g6gtk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjg7elcqn3zch5g6gtk2.png" alt=" " width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3. Case Study of Log Analysis Architecture Upgrades for Globalized Enterprises&lt;br&gt;
Background and Solutions&lt;br&gt;
A large globalized enterprise whose business covers regions such as Europe, Asia-Pacific, and North America uses Cloudflare CDN and security services for global access acceleration and web application protection. To meet data compliance and audit requirements outside China, the enterprise continuously archives its security and access logs to Amazon S3 through the platform's native log push capability (Logpush) for long-term retention and subsequent analysis.&lt;/p&gt;

&lt;p&gt;Currently, the enterprise combines multiple AWS components to analyze and monitor logs outside China, and encounters the following problems:&lt;/p&gt;

&lt;p&gt;● Scattered data: S3 buckets are spread across multiple regions such as Frankfurt and Tokyo, creating data silos that are difficult to manage and analyze uniformly.&lt;/p&gt;

&lt;p&gt;● High query and analysis costs: Athena bills based on scan volume. CloudWatch Logs Insights has limited query capabilities and requires separate queries across regions. The costs of daily retrievals and alerting queries increase linearly with frequency.&lt;/p&gt;

&lt;p&gt;In addition, ETL depends on Glue or Lambda, which the team must maintain itself. QuickSight visualization requires additional authorization and has synchronization latency. CloudWatch Alarms configurations are scattered and lack unified denoising capabilities. The multi-product portfolio leads to high O&amp;amp;M complexity and uncontrollable costs.&lt;/p&gt;

&lt;p&gt;You can build a unified observability analysis platform based on SLS to achieve the following goals:&lt;/p&gt;

&lt;p&gt;● Unified data transformation: You can use Structured Process Language (SPL) to complete data governance outside China (such as field pruning, IP address desensitization, and geo enrichment). This reduces the costs of cross-border transfer.&lt;/p&gt;

&lt;p&gt;● Unified query and analysis: You can aggregate gold data in the central Logstore in China to provide interactive search that responds in seconds over hundreds of millions of data records.&lt;/p&gt;

&lt;p&gt;● Unified visualization: A one-stop dashboard is provided, and no additional business intelligence (BI) tools are required.&lt;/p&gt;

&lt;p&gt;● Unified alerting closed loop: Intelligent alerting based on SLS query and analysis is provided. It supports denoising, dispatching, and multi-channel notifications.&lt;/p&gt;

&lt;p&gt;3.1 Data Flow&lt;br&gt;
Data is pushed from Cloudflare Logpush to Amazon Web Services (AWS) S3 buckets in regions outside China for archiving. SLS imports the data into Logstores in the same region through event-driven mechanisms or scheduled scans. After the data is transformed by SPL, it is aggregated into the central Logstore in China to support unified query and analysis, dashboards, and alerting.&lt;/p&gt;

&lt;p&gt;3.1.1 Sample SPL data transformation&lt;br&gt;
Sample raw log (Cloudflare Web Application Firewall (WAF) log)&lt;/p&gt;

&lt;p&gt;The sample Cloudflare WAF raw log contains sensitive and security fields such as ClientIP, SecurityAction, and SecuritySources, and covers three security action scenarios: block, allow, and challenge. You can directly use these logs to test SPL data transformation statements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{  
  "EdgeStartTimestamp": "2024-12-25T10:30:00Z",  
  "RayID": "abc123def456",  
  "ClientIP": "203.0.113.50",  
  "OriginIP": "10.0.0.100",  
  "ClientRequestURI": "/api/v1/users?id=123",  
  "ClientRequestMethod": "POST",  
  "ClientRequestReferer": null,  
  "SecurityAction": "block",  
  "SecurityRuleID": "rule_001",  
  "SecuritySources": "[{\"source\":\"waf\",\"action\":\"block\"}]",  
  "OriginResponseStatus": 200,  
  "OriginResponseTime": 150,  
  "ResponseHeaders": "{\"x-cache\":\"MISS\"}"  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following SPL script completes data governance outside China: time standardization, IP-to-geo conversion, IP address desensitization into anonymous fingerprints, security metadata parsing, and threat labeling. Finally, sensitive fields such as ClientIP and OriginIP are removed with project-away, so that only gold fields are retained for cross-border transfer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Core tracking and time standardization  
*   
| extend __time__ = cast(to_unixtime(date_parse(EdgeStartTimestamp, '%Y-%m-%dT%H:%i:%SZ')) as bigint)  
| extend RequestId = RayID  
| extend RequestPath = url_extract_path(ClientRequestURI)  

-- IP -&amp;gt; Geo (completed outside China)  
| extend  
    GeoCountry = ip_to_country(ClientIP),  
    GeoRegion  = ip_to_province(ClientIP),  
    GeoCity    = ip_to_city(ClientIP)  

-- IP address desensitization: Retain anonymous fingerprints (optional) and do not carry the raw IP address for cross-border transfer  
| extend ClientFingerprint = to_base64(sha256(to_utf8(ClientIP)))  

-- Security metadata parsing and labeling  
| expand-values -keep SecuritySources  
| parse-json -prefix='Security' SecuritySources  
| extend IsHighRisk = if(ClientRequestMethod = 'POST' and (ClientRequestReferer is null or SecurityAction = 'block'), 1, 0)  

-- Final denoising and field projection  
| project-away ClientIP, OriginIP, ResponseHeaders, RayID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
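
&lt;p&gt;For readers who want to test the desensitization logic outside SLS, the sketch below mirrors a few steps of the SPL script in plain Python: the fingerprint (Base64 of the SHA-256 digest of the UTF-8 IP string), the path extraction, the IsHighRisk label, and the final field removal. It is an approximation rather than an SPL emulator. The geo lookup and the SecuritySources parsing are left out, and the transform and fingerprint helper names are hypothetical.&lt;/p&gt;

```python
import base64
import hashlib
from urllib.parse import urlsplit


def fingerprint(ip):
    """Base64 of SHA-256 over the UTF-8 bytes of the IP, mirroring the SPL expression."""
    digest = hashlib.sha256(ip.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


def transform(record):
    """Apply the labeling and projection steps to one raw log record."""
    out = dict(record)
    out["RequestId"] = record["RayID"]
    out["RequestPath"] = urlsplit(record["ClientRequestURI"]).path
    out["ClientFingerprint"] = fingerprint(record["ClientIP"])
    is_post = record["ClientRequestMethod"] == "POST"
    risky = record["ClientRequestReferer"] is None or record["SecurityAction"] == "block"
    out["IsHighRisk"] = 1 if is_post and risky else 0
    # Equivalent of project-away: raw IPs and bulky fields never leave the region.
    for field in ("ClientIP", "OriginIP", "ResponseHeaders", "RayID"):
        out.pop(field, None)
    return out


record = {
    "RayID": "abc123def456",
    "ClientIP": "203.0.113.50",
    "OriginIP": "10.0.0.100",
    "ClientRequestURI": "/api/v1/users?id=123",
    "ClientRequestMethod": "POST",
    "ClientRequestReferer": None,
    "SecurityAction": "block",
    "ResponseHeaders": "{}",
}
result = transform(record)
print(sorted(result))
```

&lt;p&gt;Note that an unsalted hash of an IPv4 address can be reversed by enumerating the address space. Whether such a fingerprint counts as anonymous for compliance purposes is a question to settle separately.&lt;/p&gt;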



&lt;p&gt;Sample data after transformation&lt;/p&gt;

&lt;p&gt;The transformed data has been geo-enriched, IP-masked, and threat-labeled. Sensitive fields have been removed, and the data can be used directly for downstream query and analysis and for alerting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{  
    "RequestPath": "/api/v1/users",  
    "__time__": "1735122600",  
    "RequestId": "abc123def456",  
    "ClientFingerprint": "O1zTaFfLyH1ZqEHS03UiLSNMzwMX+4ZW7OsIVsDGgEg=",  
    "OriginResponseTime": "150",  
    "GeoCity": "Richardson",  
    "ClientRequestURI": "/api/v1/users?id=123",  
    "IsHighRisk": "1",  
    "EdgeStartTimestamp": "2024-12-25T10:30:00Z",  
    "SecurityAction": "block",  
    "SecurityRuleID": "rule_001",  
    "Securityaction": "block",  
    "GeoCountry": "United State",  
    "GeoRegion": "Texas",  
    "OriginResponseStatus": "200",  
    "Securitysource": "waf",  
    "ClientRequestMethod": "POST"  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.1.2 Query and analysis samples&lt;br&gt;
Sample 1: Web Application Firewall (WAF) rule hit statistics - This sample aggregates the hit count, high-risk hit count, and unique client count by rule.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;* | SELECT   
  SecurityRuleID,  
  count(*) AS TotalHits,  
  count_if(IsHighRisk = 1) AS HighRiskHits,  
  approx_distinct(ClientFingerprint) AS UniqueClients  
FROM log  
WHERE SecurityRuleID IS NOT NULL AND SecurityRuleID &lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&amp;gt; ''  
GROUP BY SecurityRuleID   
ORDER BY TotalHits DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmcpacqc3z8rbix4n8qo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmcpacqc3z8rbix4n8qo.png" alt=" " width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sample 2: Top 10 attack source regions - This sample aggregates the block count and unique attacker count by country and city.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;* | SELECT   
  GeoCountry,  
  GeoCity,  
  count(*) AS AttackCount,  
  approx_distinct(ClientFingerprint) AS UniqueAttackers  
FROM log  
WHERE SecurityAction = 'block'  
GROUP BY GeoCountry, GeoCity  
ORDER BY AttackCount DESC  
LIMIT 10  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ezykvsphsm6de2xr2k4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ezykvsphsm6de2xr2k4.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frq4y0k30xfqkp65121j0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frq4y0k30xfqkp65121j0.png" alt=" " width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sample 3: Origin 5xx error trend - This sample aggregates the error count, error rate, and total request count by minute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;* | SELECT   
  time_series(__time__, '1m', '%Y-%m-%d %H:%i:%s', '0') AS TimeBucket,  
  count_if(OriginResponseStatus &amp;gt;= 500) AS Origin5xxCount,  
  count_if(OriginResponseStatus &amp;gt;= 500) * 100.0 / count(*) AS Origin5xxRate,  
  count(*) AS TotalRequests  
FROM log  
GROUP BY TimeBucket  
ORDER BY TimeBucket  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
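
&lt;p&gt;The per-minute aggregation can be checked against a handful of records with a few lines of Python. This sketch assumes that each record carries an integer __time__ in Unix seconds and a numeric OriginResponseStatus. Unlike the time_series call in the query, it does not fill empty minutes.&lt;/p&gt;

```python
from collections import defaultdict


def origin_5xx_by_minute(records):
    """Group records into 60-second buckets and compute the 5xx count and rate."""
    buckets = defaultdict(lambda: {"total": 0, "errors": 0})
    for record in records:
        minute = record["__time__"] - record["__time__"] % 60
        buckets[minute]["total"] += 1
        if record["OriginResponseStatus"] >= 500:
            buckets[minute]["errors"] += 1
    return [
        (minute, b["errors"], b["errors"] * 100.0 / b["total"], b["total"])
        for minute, b in sorted(buckets.items())
    ]


# Hypothetical records: two in the first minute, two in the next.
records = [
    {"__time__": 1735122600, "OriginResponseStatus": 200},
    {"__time__": 1735122630, "OriginResponseStatus": 502},
    {"__time__": 1735122660, "OriginResponseStatus": 200},
    {"__time__": 1735122661, "OriginResponseStatus": 200},
]
rows = origin_5xx_by_minute(records)
print(rows)
```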



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvl52yuujmupbv3y0dyl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvl52yuujmupbv3y0dyl.png" alt=" " width="800" height="360"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnktpu3t9rk5feyhp6pmd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnktpu3t9rk5feyhp6pmd.png" alt=" " width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sample 4: Request latency percentile analysis - This sample aggregates P50/P90/P99 latency by path to locate slow APIs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;* | SELECT   
  RequestPath,  
  count(*) AS RequestCount,  
  approx_percentile(OriginResponseTime, 0.50) AS LatencyP50,  
  approx_percentile(OriginResponseTime, 0.90) AS LatencyP90,  
  approx_percentile(OriginResponseTime, 0.99) AS LatencyP99  
FROM log  
WHERE OriginResponseTime IS NOT NULL  
GROUP BY RequestPath  
HAVING count(*) &amp;gt; 100  
ORDER BY LatencyP99 DESC  
LIMIT 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
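
&lt;p&gt;As a cross-check for small datasets, the same report can be computed exactly. The sketch below uses the nearest-rank definition of a percentile with integer arithmetic, so its values may differ slightly from what approx_percentile returns, since that function is an approximation. The helper names and the sample data are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict


def percentile(values, percent):
    """Exact nearest-rank percentile; percent is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_by_path(records, min_requests):
    """Return (path, count, P50, P90, P99) rows, slowest P99 first."""
    by_path = defaultdict(list)
    for record in records:
        if record.get("OriginResponseTime") is not None:
            by_path[record["RequestPath"]].append(record["OriginResponseTime"])
    rows = [
        (path, len(v), percentile(v, 50), percentile(v, 90), percentile(v, 99))
        for path, v in by_path.items()
        if len(v) > min_requests
    ]
    return sorted(rows, key=lambda row: row[4], reverse=True)


# Hypothetical data: 200 requests on /a with latencies 1..200 ms, 50 requests on /b.
records = [{"RequestPath": "/a", "OriginResponseTime": ms} for ms in range(1, 201)]
records += [{"RequestPath": "/b", "OriginResponseTime": 10} for _ in range(50)]
rows = latency_by_path(records, min_requests=100)
print(rows)
```

&lt;p&gt;The min_requests filter plays the role of the HAVING clause: paths with too few requests are dropped because their tail percentiles are not meaningful.&lt;/p&gt;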



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsaoi7umbkun72q6ldsur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsaoi7umbkun72q6ldsur.png" alt=" " width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.1.3 Alert rule samples&lt;br&gt;
Alert 1: Sudden increase in origin 5xx errors - This alert is triggered when the error rate exceeds 5%, so that origin anomalies are discovered quickly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;* | SELECT  
    count_if(OriginResponseStatus &amp;gt;= 500) * 100.0 / count(*) AS Origin5xxRate  
  FROM log  
  HAVING Origin5xxRate &amp;gt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjnom7e3uxwdor78a4bdm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjnom7e3uxwdor78a4bdm.png" alt=" " width="800" height="823"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alert 2: Sudden increase in high-risk requests - This alert is triggered when the count exceeds 100 or the proportion exceeds 10%, to detect potential attacks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;* | SELECT  
    count_if(IsHighRisk = 1) AS HighRiskCount,  
    count_if(IsHighRisk = 1) * 100.0 / count(*) AS HighRiskRate  
  FROM log  
  HAVING HighRiskCount &amp;gt; 100 OR HighRiskRate &amp;gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alert 3: Sudden increase in WAF blocks - This alert is triggered when the block count exceeds 1000 or the unique attacker count exceeds 50, to assess the attack posture.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;* | SELECT  
    count(*) AS BlockCount,  
    approx_distinct(ClientFingerprint) AS UniqueAttackers  
  FROM log  
  WHERE SecurityAction = 'block'  
  HAVING BlockCount &amp;gt; 1000 OR UniqueAttackers &amp;gt; 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
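
&lt;p&gt;The three alert conditions can also be written as one plain function, which is handy for unit-testing thresholds before they are configured as alert rules. The sketch evaluates a single window of transformed records. The alert names are hypothetical, and it counts unique fingerprints among blocked requests only, which is an assumption about the intent of the "unique attacker" threshold.&lt;/p&gt;

```python
def evaluate_alerts(records):
    """Return the names of the alert conditions that fire for one evaluation window."""
    total = len(records)
    if total == 0:
        return []
    errors = sum(1 for r in records if r["OriginResponseStatus"] >= 500)
    high_risk = sum(1 for r in records if r["IsHighRisk"] == 1)
    blocked = [r for r in records if r["SecurityAction"] == "block"]
    attackers = {r["ClientFingerprint"] for r in blocked}

    fired = []
    if errors * 100.0 / total > 5:
        fired.append("origin_5xx_rate")
    if high_risk > 100 or high_risk * 100.0 / total > 10:
        fired.append("high_risk_requests")
    if len(blocked) > 1000 or len(attackers) > 50:
        fired.append("waf_blocks")
    return fired


def make_record(status, risk=0, action="allow", client="c0"):
    """Build one hypothetical transformed record."""
    return {
        "OriginResponseStatus": status,
        "IsHighRisk": risk,
        "SecurityAction": action,
        "ClientFingerprint": client,
    }


# Hypothetical window: 100 requests, 6 of which return 502 from the origin.
window = [make_record(502) for _ in range(6)] + [make_record(200) for _ in range(94)]
fired = evaluate_alerts(window)
print(fired)
```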



&lt;p&gt;4. Summary and Outlook&lt;br&gt;
During data migration, the network quality and fees of cross-cloud and cross-border transfer cannot be ignored. We therefore offer an optional capability that uses CloudFront to reduce the overhead of cross-cloud and cross-border transfer.&lt;/p&gt;


</description>
      <category>analytics</category>
      <category>observability</category>
    </item>
    <item>
      <title>One Command Equips Your OpenClaw with an X-ray Machine - Alibaba Cloud Observability Makes Farming Lobsters Cheaper and Safer</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Tue, 28 Apr 2026 02:08:48 +0000</pubDate>
      <link>https://dev.to/observabilityguy/one-command-equips-your-openclaw-with-an-x-ray-machine-alibaba-cloud-observability-makes-farming-4ab9</link>
      <guid>https://dev.to/observabilityguy/one-command-equips-your-openclaw-with-an-x-ray-machine-alibaba-cloud-observability-makes-farming-4ab9</guid>
      <description>&lt;p&gt;One-command observability integration makes OpenClaw AI agent operations transparent via Alibaba Cloud monitoring plugins.&lt;/p&gt;

&lt;p&gt;❓Have you experienced this?&lt;/p&gt;

&lt;p&gt;OpenClaw 🦞 (an open-source AI agent framework) is becoming a "digital employee" for more and more enterprises. It processes emails, writes code, manages files, and executes commands. It does almost anything. Many teams have deployed dozens or hundreds of OpenClaw instances, forming a sizable "digital lobster farm".&lt;/p&gt;

&lt;p&gt;However, a problem arises.&lt;/p&gt;

&lt;p&gt;Lobster farmers can at least watch their pond. What about your OpenClaw? Do you know how many tokens it consumed today? Do you know which model is silently draining your budget? Do you know if a "lobster" was lured into reading /etc/passwd at 3:00 AM?&lt;/p&gt;

&lt;p&gt;The answer for most is: I don't know. 😶&lt;/p&gt;

&lt;p&gt;You carefully deployed OpenClaw. However, when these issues arise, you find yourself without the right tools to pinpoint the problem.&lt;/p&gt;

&lt;p&gt;This article discusses using one command to equip your OpenClaw with an X-ray machine. This makes every LLM invocation, tool execution, and token consumption visible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20rpi8ncee9gkrlwq5gw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20rpi8ncee9gkrlwq5gw.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1.What Is Your Lobster Doing? Three “Blind Spots” Are Affecting Your Confidence&lt;br&gt;
📚 Before we start, let's discuss three "blind spots". If you use OpenClaw, at least one has likely troubled you.&lt;/p&gt;

&lt;p&gt;Blind spot 1: The inference process is a maze and debugging relies on guessing&lt;br&gt;
The complete path OpenClaw takes to process a user message is more complex than you think. A simple question may travel the following journey:&lt;/p&gt;

&lt;p&gt;User input → System prompt assembly → Model inference round 1 → Determine need for tool calling → Tool calling (such as search or code execution) → Return tool result → Model inference round 2 → Call another tool → Generate final response&lt;/p&gt;

&lt;p&gt;If any step fails, the final output may deviate from expectations. Without tracing analysis, you face an "input-output" black box. You can only guess where the problem lies. Is the prompt poor? Is it model hallucination? Did the tool return incorrect data?&lt;/p&gt;

&lt;p&gt;Tuning prompts relies on inspiration. Troubleshooting relies on luck. This is not science. It is mysticism. 🎲&lt;/p&gt;

&lt;p&gt;Blind spot 2: Token bills are like blind boxes and cause pain at month-end&lt;br&gt;
LLMs charge by token. Everyone knows this. However, as an agent, OpenClaw has a token consumption pattern different from directly invoking an API. It has a context snowball effect.&lt;/p&gt;

&lt;p&gt;In every conversation round, the agent stuffs the previous conversation history, system prompts, and tool call results into the context. The first round might use 2,000 tokens. By the fifth round, it might expand to 20,000. If a tool returns a large block of HTML or JSON, the situation worsens.&lt;/p&gt;
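&lt;p&gt;To make the snowball effect concrete, here is a rough Python sketch of how the input size can grow per round. All numbers (system prompt size, message size, tool output size) are hypothetical and chosen only so that the first and fifth rounds land on the figures mentioned above. Real growth depends on your prompts, tools, and any context trimming.&lt;/p&gt;

```python
def context_sizes(system_prompt, per_round_message, per_round_tool_output, rounds):
    """Return the number of input tokens sent to the model in each round."""
    sizes = []
    history = 0
    for _ in range(rounds):
        # Each request carries the system prompt, all earlier history, and the new message.
        sizes.append(system_prompt + history + per_round_message)
        # The new message and the tool output both become history for the next round.
        history += per_round_message + per_round_tool_output
    return sizes


# Hypothetical sizes, in tokens.
sizes = context_sizes(system_prompt=1500, per_round_message=500,
                      per_round_tool_output=4000, rounds=5)
print(sizes)       # [2000, 6500, 11000, 15500, 20000]
print(sum(sizes))  # 55000
```

&lt;p&gt;Under these assumptions, the fifth request alone is ten times the size of the first, and the five rounds together bill 55,000 input tokens. Most of that is repeated history rather than new content.&lt;/p&gt;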

&lt;p&gt;Worse, you do not know the source of the cost. Is a model too expensive? Is an agent prompt too wordy? Was the context not clipped in time? Without fine-grained consumption data, you cannot perform optimization. 💸&lt;/p&gt;

&lt;p&gt;Blind spot 3: System status is like Schrödinger's cat&lt;br&gt;
OpenClaw involves message queues, webhook processing, and session management during operation. When a user asks why it is not responding, the problem could lie in any layer. Did model inference time out? Did a tool call stall? Are message queues backed up? Did the gateway fail?&lt;/p&gt;

&lt;p&gt;Without real-time metric monitoring, you only discover issues after user complaints. By then, a group of users may be affected. ⏰&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F40jl9etxksppcp0zm6sv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F40jl9etxksppcp0zm6sv.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.The Antidote Is Here: openclaw-cms-plugin + diagnostics-otel, Traces and Metrics Working Together&lt;br&gt;
🛠️ To address these three "blind spots", our solution involves two plugins working together. They solve problems at different layers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9sdwav9rafsssxnnvzi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9sdwav9rafsssxnnvzi.png" alt=" " width="789" height="185"&gt;&lt;/a&gt;&lt;br&gt;
Both rely on the standard OpenTelemetry protocol. Data is reported uniformly to Cloud Monitor 2.0 of Alibaba Cloud, so you can view and analyze it on the same platform.&lt;/p&gt;

&lt;p&gt;The openclaw-cms-plugin is the focus of this topic. It is a trace reporting plugin designed for OpenClaw. It follows OpenTelemetry GenAI semantics and generates structured traces for every OpenClaw run.&lt;/p&gt;

&lt;p&gt;Specifically, it records the following types of spans:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kxphynz10vp4lijx9kz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kxphynz10vp4lijx9kz.png" alt=" " width="789" height="307"&gt;&lt;/a&gt;&lt;br&gt;
These spans have a parent-child relationship. Together, they form a complete trace. You can see a trace view similar to this in the Cloud Monitor 2.0 console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmcbnw31kpvaan2if6rjv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmcbnw31kpvaan2if6rjv.png" alt=" " width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see at a glance how many times the LLM was invoked and how many tokens were used. You can also see which tools were invoked, which step took the longest, and if any errors occurred.&lt;/p&gt;

&lt;p&gt;It is that simple to go from "guessing" to "seeing". 👁&lt;/p&gt;

&lt;p&gt;diagnostics-otel is a built-in extension of OpenClaw. It outputs runtime metrics data, including token consumption rate, invocation QPS, response duration distribution, queue depth, and session status. The installation script automatically finds and enables it. You do not need to do anything else.&lt;/p&gt;

&lt;p&gt;Wait, does diagnostics-otel not also report traces? Why is openclaw-cms-plugin needed?&lt;br&gt;
Good question. diagnostics-otel does support trace reporting. However, if you look closely at the generated traces, you will find a fundamental problem: all spans are independent and have no parent-child relationship.&lt;/p&gt;

&lt;p&gt;diagnostics-otel uses an event-driven architecture to generate spans. Each event creates a span independently with a different trace ID. It generates the following five types of spans:&lt;/p&gt;

&lt;p&gt;● openclaw.model.usage: model invocation (records token usage)&lt;/p&gt;

&lt;p&gt;● openclaw.webhook.processed/openclaw.webhook.error: webhook processing&lt;/p&gt;

&lt;p&gt;● openclaw.message.processed: message processing (records processing results and duration)&lt;/p&gt;

&lt;p&gt;● openclaw.session.stuck: session stuck alerting&lt;/p&gt;

&lt;p&gt;There is no trace context propagation between these spans. Simply put, they are just independent data points. The only way to associate them is using business fields such as sessionKey.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Webhook  [openclaw.webhook.processed]  traceId: abc123  
Message  [openclaw.message.processed]  traceId: def456  ❌ Different trace IDs  
Model    [openclaw.model.usage]        traceId: ghi789  ❌ Different trace IDs 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In contrast, openclaw-cms-plugin is designed for complete tracing. All spans share the same trace ID and are linked into a call tree through explicit parent-child relationships, so you can see the full picture of a request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;enter_openclaw_system              traceId: aaa111  
  └── invoke_agent main            traceId: aaa111  ✅ Same trace ID  
        ├── chat qwen3-235b        traceId: aaa111  ✅ Same trace ID  
        ├── execute_tool search    traceId: aaa111  ✅ Same trace ID  
        └── execute_tool exec      traceId: aaa111  ✅ Same trace ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
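&lt;p&gt;The difference between the two approaches comes down to whether each span inherits its parent's trace ID and records a pointer to that parent. The following Python sketch illustrates the idea with a hypothetical Span class. It is not the plugin's implementation and does not use the OpenTelemetry SDK. It only shows why a shared trace ID plus parent links is enough to rebuild the call tree.&lt;/p&gt;

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical minimal span: a name, a trace ID, and a link to its parent."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))

    def child(self, name):
        # A child inherits the trace ID and points back at its parent span.
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)


def new_root(name):
    # A root span starts a new trace with a fresh 16-byte trace ID.
    return Span(name=name, trace_id=secrets.token_hex(16))


def render(spans, parent_id=None, depth=0):
    """Rebuild the call tree as indented lines by following parent links."""
    lines = []
    for span in spans:
        if span.parent_id == parent_id:
            lines.append("  " * depth + span.name)
            lines.extend(render(spans, span.span_id, depth + 1))
    return lines


entry = new_root("enter_openclaw_system")
agent = entry.child("invoke_agent main")
spans = [entry, agent, agent.child("chat qwen3-235b"),
         agent.child("execute_tool search"), agent.child("execute_tool exec")]
tree = render(spans)
print("\n".join(tree))
```

&lt;p&gt;If every span were created with new_root instead of child, as in the event-driven case, render would return five unrelated top-level lines and the tree could not be reconstructed from trace data alone.&lt;/p&gt;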



&lt;p&gt;In addition to trace integrity, there is a fundamental difference in data richness between the two:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ckieb6x6bm1k19j9pqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ckieb6x6bm1k19j9pqc.png" alt=" " width="789" height="392"&gt;&lt;/a&gt;&lt;br&gt;
Simply put: The trace from diagnostics-otel is a set of independent "record cards", while the trace from openclaw-cms-plugin is a complete "invocation map". The former only tells you "what happened," while the latter tells you "every step." Use them together. One handles system metrics, and the other handles business traces. They complement each other perfectly. 🤝&lt;/p&gt;

&lt;p&gt;3.Setup in One Minute: One-Command Integration Tutorial&lt;br&gt;
🚀 Enough theory. Let's get started. The entire integration process takes less than a minute.&lt;/p&gt;

&lt;p&gt;3.1 Get the install command&lt;br&gt;
Log on to the Cloud Monitor 2.0 console. Go to your application monitoring workspace. Choose Integration Center &amp;gt; AI application observability. Click OpenClaw.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkz0kc6mahanirwp0zsr5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkz0kc6mahanirwp0zsr5.png" alt=" " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the sidebar, enter the application name and click the "Click to obtain" button to generate the integration command immediately. Click the icon in the upper-right corner to copy it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhpwfitwuu19ru9322la.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhpwfitwuu19ru9322la.png" alt=" " width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.2 Start installation with one command&lt;br&gt;
Open the terminal on the machine where OpenClaw runs. Paste the command you copied and press Enter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;curl -fsSL https://arms-apm-cn-hangzhou-pre.oss-cn-hangzhou.aliyuncs.com/openclaw-cms-plugin/install.sh | bash -s -- \  
  --endpoint "https://Your ARMS-OTLP address" \  
  --x-arms-license-key "Your license key" \  
  --x-arms-project "Your project" \  
  --x-cms-workspace "Your workspace" \  
  --serviceName "Your service name" 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, sit back and watch it run. ☕&lt;/p&gt;

&lt;p&gt;The installation script automatically does the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;[INFO]  Checking prerequisites...  
[OK]    Node.js v24.14.0  
[OK]    npm 11.9.0  
[OK]    OpenClaw CLI found  
[INFO]  Downloading plugin...  
[OK]    Downloaded  
[INFO]  Extracting...  
[OK]    Extracted  
[INFO]  Installing npm dependencies...  
[OK]    Dependencies installed  
[INFO]  Locating diagnostics-otel extension...  
[OK]    Found diagnostics-otel at: /home/.../extensions/diagnostics-otel  
[OK]    diagnostics-otel dependencies already present  
[INFO]  Updating config...  
[OK]    Config updated  
[INFO]  Restarting OpenClaw gateway...  
[OK]    Gateway restarted  

════════════════════════════════════════════════════  
  ✅ openclaw-cms-plugin installed successfully!  
════════════════════════════════════════════════════  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What does it do?&lt;br&gt;
✅ Checks the environment (Node.js, npm, OpenClaw CLI).&lt;br&gt;
✅ Downloads and decompresses openclaw-cms-plugin to the OpenClaw extension folder.&lt;br&gt;
✅ Installs runtime dependencies for the plugin.&lt;br&gt;
✅ Automatically locates the diagnostics-otel extension. If dependencies are missing, it installs them automatically.&lt;br&gt;
✅ Updates the openclaw.json configuration (configurations for both plugins are written at once).&lt;br&gt;
✅ Restarts the gateway to apply the configuration.&lt;br&gt;
You do not need to manually edit any configuration files. The installation script handles various edge cases: it merges updates into existing configurations instead of overwriting them, and it searches several possible installation locations for diagnostics-otel in priority order.&lt;/p&gt;
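&lt;p&gt;"Merging instead of overwriting" can be pictured as a recursive dictionary merge. The Python sketch below is an illustration under assumptions, not the installation script itself. The key names (sampleRate, logs, endpoint, theme) and the endpoint URL are hypothetical, and the source does not describe how the script resolves a key that exists on both sides.&lt;/p&gt;

```python
import json


def deep_merge(existing, updates):
    """Return a copy of `existing` with `updates` merged in, key by key."""
    merged = dict(existing)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are objects: descend instead of replacing the whole subtree.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical existing configuration and installer updates.
existing = {"diagnostics": {"otel": {"sampleRate": 0.5, "logs": True}}, "theme": "dark"}
updates = {"diagnostics": {"otel": {"endpoint": "https://example.invalid/otlp",
                                    "traces": False}}}

merged = deep_merge(existing, updates)
print(json.dumps(merged, indent=2))
```

&lt;p&gt;The settings that were already present survive, the new fields are added next to them, and the original dictionary is left untouched. A plain top-level assignment would instead have replaced the whole diagnostics object.&lt;/p&gt;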

&lt;p&gt;3.3 Verify installation&lt;br&gt;
After installation, chat with your OpenClaw. Wait a minute or two. Open the Cloud Monitor 2.0 console. Go to AI application observability in the sidebar on the right. Your OpenClaw application appears. Congratulations. Your lobster is no longer a black box. 🎉&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpc2ratb5kz9davwfx0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpc2ratb5kz9davwfx0v.png" alt=" " width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.4 Want to uninstall? It is even simpler&lt;br&gt;
If you want to stop using it (though I doubt it), one command does it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;curl -fsSL https://arms-apm-cn-hangzhou-pre.oss-cn-hangzhou.aliyuncs.com/openclaw-cms-plugin/uninstall.sh | bash  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The uninstall script automatically cleans up the plugin folder and all related configurations in openclaw.json. It also disables the diagnostics-otel configuration. If you only want to uninstall the trace plugin but keep metrics, add the --keep-metrics parameter.&lt;/p&gt;

&lt;p&gt;Clean and quick. No side effects. 🧹&lt;/p&gt;

&lt;p&gt;4.The Highlight: What Can You See After Installation?&lt;br&gt;
📈 Integration is just the beginning. The truly exciting part is what you can see and solve after integration.&lt;/p&gt;

&lt;p&gt;4.1 Complete trace: Finally understand its "thought process"&lt;br&gt;
This is the core value of openclaw-cms-plugin. Cloud Monitor 2.0 displays a structured trace for every user request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;enter_openclaw_system               (Request entry: sender and source)
  └── invoke_agent main             (Agent execution process)
        ├── chat qwen3-235b         (LLM invocation: model inference + token usage details)
        ├── execute_tool search     (Tool call: search)
        └── execute_tool exec       (Tool call: code execution)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a conversation round, the plugin records agent-level LLM invocations and each individual tool call. If the agent runs a tool loop internally (such as "invoke tool → get result → invoke next tool"), each tool call is recorded as its own tool span, including input parameters, return values, and execution status. You can clearly see how the complete toolchain executed.&lt;/p&gt;

&lt;p&gt;💡 In the current version, the LLM invocations in a conversation round are aggregated into one LLM span, which records the total token usage and the input/output content for that round. Future versions will refine this by generating a separate span for each LLM inference, so that even intermediate inference steps in multi-round tool loops become fully visible.&lt;/p&gt;

&lt;p&gt;Each span is annotated with rich properties:&lt;/p&gt;

&lt;p&gt;● Duration—see which step is slowest at a glance&lt;/p&gt;

&lt;p&gt;● Model information—which model and provider were used&lt;/p&gt;

&lt;p&gt;● Token usage—input_tokens, output_tokens, cache_read_tokens, and total_tokens, broken down item by item&lt;/p&gt;

&lt;p&gt;● Tool parameters and return values—what tool was invoked, what parameters were passed, and what results were returned&lt;/p&gt;

&lt;p&gt;● Error message—displayed in red if an error occurs&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftulka8wfztvzv2je5yby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftulka8wfztvzv2je5yby.png" alt=" " width="800" height="674"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feoh7z959e0gynf0j5oir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feoh7z959e0gynf0j5oir.png" alt=" " width="800" height="740"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What does this mean?&lt;/p&gt;

&lt;p&gt;Previously, if a user said the "answer is wrong," you had to guess by checking chat records. Now, check the traces. You see the search tool returned an empty result. The model "creatively" made up a paragraph based on that empty result. Problem localization drops from "two hours" to "two minutes". ⚡&lt;/p&gt;

&lt;p&gt;4.2 Token usage breakdown—know exactly where every penny goes&lt;br&gt;
Each LLM span in a trace carries complete token usage properties:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftm47qi1nesxhfarywz23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftm47qi1nesxhfarywz23.png" alt=" " width="789" height="224"&gt;&lt;/a&gt;&lt;br&gt;
With gen_ai.request.model and gen_ai.provider.name, you know exactly which model consumed how many tokens at which step.&lt;/p&gt;

&lt;p&gt;Consider a real scenario. You find five LLM invocations in a conversation trace. The input_tokens for the third invocation reach 12,000. Click it. You see that the tool returned a full page of HTML, all of it stuffed into the context. You have found the "token-swallowing black hole." Optimization now has a direction.&lt;/p&gt;
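&lt;p&gt;If you export span attributes for offline analysis, the same investigation can be scripted. The sketch below assumes LLM spans are available as plain dictionaries keyed by attribute name. The attribute names follow the GenAI conventions mentioned in this article, while the span values and the second model name are made up for illustration.&lt;/p&gt;

```python
from collections import defaultdict

MODEL = "gen_ai.request.model"
INPUT = "gen_ai.usage.input_tokens"
OUTPUT = "gen_ai.usage.output_tokens"

# Hypothetical LLM spans from one conversation trace.
llm_spans = [
    {MODEL: "qwen3-235b", INPUT: 2000, OUTPUT: 300},
    {MODEL: "qwen3-235b", INPUT: 3500, OUTPUT: 200},
    {MODEL: "qwen3-235b", INPUT: 12000, OUTPUT: 400},
    {MODEL: "hypothetical-small-model", INPUT: 1500, OUTPUT: 100},
    {MODEL: "qwen3-235b", INPUT: 4000, OUTPUT: 500},
]


def tokens_by_model(spans):
    """Sum input and output tokens per model."""
    totals = defaultdict(int)
    for span in spans:
        totals[span[MODEL]] += span[INPUT] + span[OUTPUT]
    return dict(totals)


def heaviest_input(spans):
    """Return the position and span with the largest input token count."""
    index = max(range(len(spans)), key=lambda i: spans[i][INPUT])
    return index, spans[index]


totals = tokens_by_model(llm_spans)
index, span = heaviest_input(llm_spans)
print(totals)
print(index + 1, span[INPUT])
```

&lt;p&gt;With this sample data the third invocation stands out with 12,000 input tokens, which is the span you would open first.&lt;/p&gt;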

&lt;p&gt;Token usage transforms from a "messy account" to a "detailed ledger". 💰&lt;/p&gt;

&lt;p&gt;4.3 System running metrics—pulse visible in real-time&lt;br&gt;
Metrics exported by the diagnostics-otel plugin can be used to build runtime dashboards in Cloud Monitor 2.0 for real-time monitoring of:&lt;/p&gt;

&lt;p&gt;● Token usage rate and fee trends: broken down by model and time dimension&lt;/p&gt;

&lt;p&gt;● Invocation QPS and response duration: Is system throughput normal?&lt;/p&gt;

&lt;p&gt;● Message queue depth and wait time: Is there a backlog?&lt;/p&gt;

&lt;p&gt;● Session stall count: Are any lobsters "playing dead"?&lt;/p&gt;

&lt;p&gt;● Context size trend: Is the context expanding uncontrollably?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra5e70krdi2a4pkednhm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra5e70krdi2a4pkednhm.png" alt=" " width="800" height="601"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gp6bbtwvdnds0rx2d4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gp6bbtwvdnds0rx2d4m.png" alt=" " width="800" height="579"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paired with the alerting feature of Cloud Monitor 2.0, these metrics enable automatic alerts when daily token consumption surges 50% day over day, when queue depth exceeds a threshold, or when sessions stall. You know immediately when a problem occurs, rather than waiting for user complaints. 🔔&lt;/p&gt;
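&lt;p&gt;The day-over-day rule is simple arithmetic. In practice you would configure it in the alerting console rather than write code, but a small Python sketch makes the condition explicit. The function names and token totals are hypothetical, and treating a zero baseline as "no alert" is an assumption of this sketch.&lt;/p&gt;

```python
def surge_ratio(today, yesterday):
    """Relative change against yesterday, or None when there is no baseline."""
    if yesterday == 0:
        return None
    return (today - yesterday) / yesterday


def should_alert(today, yesterday, threshold=0.5):
    """Fire when consumption grew by at least `threshold` (0.5 means 50%)."""
    ratio = surge_ratio(today, yesterday)
    return ratio is not None and ratio >= threshold


# Hypothetical daily token totals.
print(should_alert(today=1_500_000, yesterday=1_000_000))
print(should_alert(today=1_400_000, yesterday=1_000_000))
```

&lt;p&gt;A jump from 1.0 million to 1.5 million tokens is exactly a 50% increase and triggers the alert, while 1.4 million stays below the threshold.&lt;/p&gt;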

&lt;p&gt;4.4 GenAI semantic conventions — Professional standards, not ad hoc solutions&lt;br&gt;
Note that the trace data reported by openclaw-cms-plugin strictly follows the OpenTelemetry GenAI semantic conventions. These are not field names we defined arbitrarily, but international standards.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;p&gt;Standardized data structures: Property names such as gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.tool.name match industry standards. This simplifies integration with other tools.&lt;br&gt;
Normalized message formats: gen_ai.input.messages, gen_ai.output.messages, and gen_ai.system_instructions are formatted according to the standard JSON schema. This supports multiple message types, such as TextPart, ReasoningPart, ToolCallRequestPart, and ToolCallResponsePart.&lt;br&gt;
Future extensibility: As the GenAI semantic conventions evolve, the plugin can be upgraded smoothly.&lt;/p&gt;

&lt;p&gt;4.5 Beyond standards: The "extra helpings" of Alibaba Cloud GenAI conventions&lt;br&gt;
While compatible with the OTel open-source standards, openclaw-cms-plugin also implements extensions from the Alibaba Cloud GenAI semantic conventions. Compared with the community standard, you receive some "extra helpings":&lt;/p&gt;

&lt;p&gt;ENTRY span — A clear "entry point" for the trace&lt;/p&gt;

&lt;p&gt;The OTel community specification defines only span types such as LLM (inference), tool (tool calling), and agent. It lacks an "entry point" concept. The Alibaba Cloud specification extends the ENTRY span type to specifically identify the call entry point of an AI application. In openclaw-cms-plugin, this is the enter_openclaw_system span. It records "who initiated the request" (gen_ai.user.id) and the "current session ID" (gen_ai.session.id). This lets you view the trace and perform analysis and tracking by user and session dimensions.&lt;/p&gt;

&lt;p&gt;🔗 Session-level association —gen_ai.session.id&lt;/p&gt;

&lt;p&gt;The OTel standard provides gen_ai.conversation.id. However, for agent applications, "session" is more appropriate than "conversation". The Alibaba Cloud specification introduces gen_ai.session.id, which spans ENTRY, AGENT, and LLM spans. This lets you search directly by session ID in Cloud Monitor 2.0, retrieve all traces under that session at once, and quickly restore the full session content.&lt;/p&gt;

&lt;p&gt;📊 gen_ai.span.kind — An AI-specific span categorization system&lt;/p&gt;

&lt;p&gt;The SpanKind in the OpenTelemetry standard includes only generic types such as CLIENT, INTERNAL, and SERVER. For an AI application trace, SpanKind alone cannot distinguish between an LLM inference and a tool call. Alibaba Cloud introduces the gen_ai.span.kind property to define a GenAI-specific classification system: LLM, TOOL, AGENT, ENTRY, TASK, STEP (ReAct round), CHAIN, RETRIEVER, and RERANKER. Cloud Monitor 2.0 uses this categorization to automatically detect the AI application structure and render a dedicated AI trace view. LLM calls appear in orange, tool calls in pink, and agents in green. This lets you see the "role distribution" of the entire trace at a glance.&lt;/p&gt;

&lt;p&gt;💡 These extensions do not disrupt standard compatibility. The data reported by openclaw-cms-plugin displays basic information normally on any backend that supports OpenTelemetry. However, Cloud Monitor 2.0 unlocks the complete AI application observability experience.&lt;/p&gt;

&lt;p&gt;This standardized approach benefits future data analytics and platform evolution.&lt;/p&gt;

&lt;p&gt;5.From Black Box to Transparent: How Observability Changes Your Lobster Farming&lt;br&gt;
📈 Installing an X-ray machine fundamentally changes your "lobster farming" method:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc75i2os4c52sjdump463.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc75i2os4c52sjdump463.png" alt=" " width="789" height="320"&gt;&lt;/a&gt;&lt;br&gt;
This is not merely an improvement. It is a leap from "blind farming" to "precision farming."&lt;/p&gt;

&lt;p&gt;A farmer upgrades from "checking water color visually" to using "water quality sensors, cameras, and automatic feeding systems." You manage the same lobsters, but your control level changes completely. 🦞📊&lt;/p&gt;

&lt;p&gt;One more thing: Security audit&lt;br&gt;
Beyond performance tuning and cost control, enterprise AI agent deployment involves an unavoidable topic: security compliance and behavior audit. Agents can execute commands, read and write files, and initiate network requests. Without behavior audit capabilities, you cannot know if an agent secretly read an SSH key at 3:00 a.m.&lt;/p&gt;

&lt;p&gt;Our observability team covers this capability with another solution: the Alibaba Cloud Simple Log Service (SLS) OpenClaw one-click solution. It collects OpenClaw session audit logs and application operational logs. It provides out-of-the-box security audit dashboards, including high-risk command detection, prompt injection detection, and sensitive data leakage analysis. This makes every agent operation traceable.&lt;/p&gt;

&lt;p&gt;If you are interested in security audits, read this article: &lt;a href="https://www.alibabacloud.com/help/sls/enable-managed-openclaw-with-sls" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/help/sls/enable-managed-openclaw-with-sls&lt;/a&gt; (SLS one-click integration and audit solution makes OpenClaw controlled operation possible).&lt;/p&gt;

&lt;p&gt;Cloud Monitor 2.0 manages performance and cost, and SLS manages security and compliance. Together, they form a complete control system for the "lobster farm." 🔐&lt;/p&gt;

&lt;p&gt;6.FAQs&lt;br&gt;
💡 Here are answers to common questions about the process:&lt;/p&gt;

&lt;p&gt;Q: Does the integration impact OpenClaw performance?&lt;/p&gt;

&lt;p&gt;A: The impact is minimal. The openclaw-cms-plugin uses the OpenTelemetry batch export mechanism. Span data is buffered in memory and reported in batches periodically. This does not block the normal processing flow of the agent.&lt;/p&gt;
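&lt;p&gt;The batching idea can be sketched in a few lines of Python. This is a simplified illustration with a hypothetical BatchBuffer class, not the plugin's code and not the OpenTelemetry SDK. A real batch processor typically also flushes on a timer from a background thread, which is omitted here.&lt;/p&gt;

```python
class BatchBuffer:
    """Collect finished spans in memory and hand them to `export` in batches."""

    def __init__(self, export, max_batch_size=3):
        self._export = export
        self._max_batch_size = max_batch_size
        self._pending = []

    def on_span_end(self, span):
        # Appending to a list is cheap, so the agent's own work is not held up.
        self._pending.append(span)
        if len(self._pending) >= self._max_batch_size:
            self.flush()

    def flush(self):
        # Swap the buffer out first so new spans can keep arriving.
        if self._pending:
            batch, self._pending = self._pending, []
            self._export(batch)


exported = []  # stands in for the network exporter
buffer = BatchBuffer(export=exported.append, max_batch_size=3)
for i in range(7):
    buffer.on_span_end(f"span-{i}")
buffer.flush()  # final flush, for example at shutdown
print([len(batch) for batch in exported])  # [3, 3, 1]
```

&lt;p&gt;Seven spans result in three export calls instead of seven, which is where the low overhead comes from.&lt;/p&gt;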

&lt;p&gt;Q: Can I install only traces without metrics?&lt;/p&gt;

&lt;p&gt;A: Yes. Add the --disable-metrics parameter during installation to skip the diagnostics-otel configuration.&lt;/p&gt;

&lt;p&gt;Q: Do traces from diagnostics-otel conflict with traces from openclaw-cms-plugin?&lt;/p&gt;

&lt;p&gt;A: The installation script sets diagnostics.otel.traces to false by default. The openclaw-cms-plugin handles trace reporting. They work independently without duplication.&lt;/p&gt;

&lt;p&gt;Q: I have configured diagnostics-otel. Will the installation overwrite my configuration?&lt;/p&gt;

&lt;p&gt;A: No. Your existing traces, logs, sample rate, and other settings remain unchanged. The script only adds the necessary fields, such as endpoints and headers.&lt;/p&gt;

&lt;p&gt;Q: Which OpenClaw versions are supported?&lt;/p&gt;

&lt;p&gt;A: The version must be 26.2.19 or later (earlier versions do not include the diagnostics-otel plugin). openclaw-cms-plugin works through the standard OpenClaw hook mechanism and does not depend on the internal APIs of specific versions.&lt;/p&gt;

&lt;p&gt;Q: Why is the token consumption always 0?&lt;/p&gt;

&lt;p&gt;A: OpenClaw introduced a bug in V2026.3.8. This causes incorrect token consumption collection. We are urging the community to expedite the fix. Relevant issue link: &lt;a href="https://github.com/openclaw/openclaw/issues/46616" rel="noopener noreferrer"&gt;https://github.com/openclaw/openclaw/issues/46616&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;7.Summary&lt;br&gt;
📋 Back to the first question: Do you know what your lobster is doing underwater?&lt;/p&gt;

&lt;p&gt;If the answer is "I don't know", it is time to install an X-ray machine.&lt;/p&gt;

&lt;p&gt;With openclaw-cms-plugin, diagnostics-otel, and a single command, integration takes about ten minutes and brings three core capabilities to your OpenClaw:&lt;/p&gt;

&lt;p&gt;✅ Tracing analysis: end-to-end visualization of every LLM invocation, tool execution, and token flow.&lt;/p&gt;

&lt;p&gt;✅ Real-time metrics: monitor the system's pulse in real time, including token consumption rate, invocation QPS, queue depth, and session status.&lt;/p&gt;

&lt;p&gt;✅ GenAI semantic standards: standardized data structures that lay the foundation for cost analysis, performance optimization, and anomaly detection.&lt;/p&gt;

&lt;p&gt;Stop letting your lobster "freestyle" in a black box. Install an X-ray machine. Make every step visible, traceable, and optimizable.&lt;/p&gt;

&lt;p&gt;After all, a visible lobster is a good lobster. 🦞✨&lt;/p&gt;

&lt;p&gt;❓Interaction time!&lt;/p&gt;

&lt;p&gt;What is the most troublesome "black box problem" you encountered while using OpenClaw?&lt;br&gt;
How do you troubleshoot OpenClaw issues now? Do you have any hacks to share?&lt;br&gt;
What data do you want to see most after enabling observability?&lt;br&gt;
Share your "lobster farming" insights in the comments. Bring your questions. We are here! 🦞🎉&lt;/p&gt;

</description>
      <category>observability</category>
      <category>openclaw</category>
    </item>
    <item>
      <title>Accepted by Top Conferences! Multiple Alibaba Cloud Achievements Improve O&amp;M Intelligence Accuracy and Efficiency</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Fri, 24 Apr 2026 06:32:07 +0000</pubDate>
      <link>https://dev.to/observabilityguy/accepted-by-top-conferences-multiple-alibaba-cloud-achievements-improve-om-intelligence-accuracy-3fmj</link>
      <guid>https://dev.to/observabilityguy/accepted-by-top-conferences-multiple-alibaba-cloud-achievements-improve-om-intelligence-accuracy-3fmj</guid>
      <description>&lt;p&gt;This article introduces three top-conference-accepted research achievements by Alibaba Cloud that solve core AIOps challenges in data augmentation, se...&lt;/p&gt;

&lt;p&gt;As a core direction of enterprise digital transformation and artificial intelligence for IT operations (AIOps), operation intelligence is becoming a key enabler for improving business stability and reducing O&amp;amp;M costs in the AI-native era. Its technical development and engineering implementation revolve around core aspects such as data processing, semantic understanding, and anomaly detection.&lt;/p&gt;

&lt;p&gt;The Alibaba Cloud Observability team continues to work deeply in this field. Recently, a series of research results on operation intelligence, jointly published with universities such as Fudan University, Tsinghua University, and Tongji University, have been accepted by top international venues: the International Conference on Learning Representations (ICLR) 2026, IEEE Transactions on Software Engineering (TSE) 2026, and the International Symposium on Software Testing and Analysis (ISSTA) 2025. These results systematically address core technical challenges in metric data augmentation, large-scale semantic log parsing, and cross-system anomaly detection. Together they form a complete operation intelligence technology stack, from data infrastructure to semantic understanding to industrial-scale deployment. This further advances the engineering adoption of large language models (LLMs) in scenarios such as automatic inspection by AI agents, assisted root cause analysis, and automatic fault recovery, and lays a solid technical foundation for large-scale applications.&lt;/p&gt;

&lt;p&gt;Three Major Challenges in the Engineering Implementation of AIOps&lt;br&gt;
Challenge 1: Semantic Gap&lt;br&gt;
Traditional tools process O&amp;amp;M data essentially by "format matching". Log parsers group similar strings into one class. Time series analysis borrows common methods from the image domain. Anomaly detection looks at only a single metric. These methods do not understand the essential difference between "timeout after 30s" and "timeout after 0.01s" in the O&amp;amp;M context. They do not understand the statistical semantics of metrics, such as trend, periodicity, and stationarity. They are also unaware of the deep associations among logs, metrics, and traces. This lack of semantics directly leads to persistently high rates of missed detections and false positives.&lt;/p&gt;

&lt;p&gt;Challenge 2: Generalization Bottleneck&lt;br&gt;
Real O&amp;amp;M systems are never static. Microservices frequently release new versions, and log templates continuously evolve. When a new business system goes live, all historical annotations become invalid. The data distribution drifts over time, so a model that was well trained yesterday may fail today. More critically, the annotation cost of industrial-scale systems is extremely high: annotating each new system often requires months of human effort. Existing methods perform excellently in a stable lab environment but struggle to adapt to a dynamically evolving production environment.&lt;/p&gt;

&lt;p&gt;Challenge 3: Industrial Availability&lt;br&gt;
The academic community pursues accuracy. Industry requires both accuracy and efficiency. Log streams of 100,000 entries per second, anomaly response times within 100 ms, and limited memory and compute budgets are hard constraints. These constraints keep many "good methods in papers" confined to the lab, where they never reach real deployment.&lt;/p&gt;

&lt;p&gt;Systematic Breakthroughs of Alibaba Cloud Observability&lt;br&gt;
① AutoDA-Timeseries: Break through the limitations of time series modeling, enabling AI to predict faults with less data&lt;br&gt;
Without a good augmentation policy, the true potential of metrics cannot be tapped. For a long time, metric data augmentation has been limited to paradigms borrowed from the image domain. Time series characteristics are ignored, and augmentation policies cannot adapt. Existing Automated Data Augmentation (AutoDA) frameworks blindly apply image transformations, which destroys autocorrelation and temporal dependencies. This severely restricts the performance of downstream tasks such as classification, forecasting, and anomaly detection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkhrmlj0774gzk04gfnj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkhrmlj0774gzk04gfnj.png" alt=" " width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The paper "AutoDA-Timeseries: Automated Data Augmentation for Time Series" (Tsinghua University &amp;amp; Alibaba Cloud), accepted by ICLR 2026, proposes the first general automated data augmentation framework for metrics. It extracts 24-dimensional time series statistical features and integrates them into a stacked augmentation layer. Through Gumbel-Softmax differentiable sampling, it adaptively optimizes augmentation probability and intensity in a single-stage, end-to-end manner. It covers five major tasks: classification, long-term and short-term forecasting, regression, and anomaly detection. Classification accuracy reaches 0.730 (+6.7%) on Temporal Convolutional Network (TCN) and 0.721 (+5.2%) on ROCKET, comprehensively surpassing 7 state-of-the-art (SOTA) baselines. This provides the first generalized, automated solution for metric data augmentation.&lt;/p&gt;
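
&lt;p&gt;For readers unfamiliar with Gumbel-Softmax, the sampling step can be sketched in plain Python. This is a generic illustration of the technique, not the paper's implementation: the augmentation names, logits, and temperature are hypothetical, and a real system would run this inside an automatic differentiation framework so that gradients can flow back to the logits.&lt;/p&gt;

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed one-hot sample over the given logits."""
    # Add Gumbel(0, 1) noise, -log(-log(u)), to each logit and rescale.
    noisy = []
    for logit in logits:
        u = rng.uniform(1e-9, 1.0 - 1e-9)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(value - peak) for value in noisy]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical augmentation operations and learnable logits.
operations = ["jitter", "scaling", "time_warp"]
logits = [0.2, 1.5, -0.3]
weights = gumbel_softmax(logits, temperature=0.5, rng=random.Random(0))
chosen = operations[weights.index(max(weights))]
```

&lt;p&gt;A lower temperature pushes the weights toward a hard one-hot choice, while a higher temperature keeps them smoother and easier to optimize.&lt;/p&gt;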

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqg4fusrxgqm81conskn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqg4fusrxgqm81conskn.png" alt=" " width="718" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paper address: &lt;a href="https://openreview.net/forum?id=vTLmHAkoIW" rel="noopener noreferrer"&gt;https://openreview.net/forum?id=vTLmHAkoIW&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;② SemanticLog: Balancing high accuracy and high throughput, the peak throughput of semantic log parsing reaches 1.28 million logs per second&lt;br&gt;
Without good semantic understanding, the true meaning behind log parameters cannot be read. Log parsing technology has long remained at the syntax layer: it uniformly replaces dynamic parameters with the wildcard character (*). This discards the semantic information carried by parameters, such as object identifiers (IDs), status codes, and UNIX timestamps, and severely restricts the accuracy of downstream AIOps tasks such as anomaly detection and root cause analysis. Existing LLM-based parsers mostly depend on the online APIs of ChatGPT. They face three major challenges: privacy leakage, unstable latency, and uncontrollable versions, which makes them difficult to deploy in a production environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fair7b83igd98qxtx67n0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fair7b83igd98qxtx67n0.png" alt=" " width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The paper "SemanticLog: Towards Effective and Efficient Large-Scale Semantic Log Parsing" (Fudan University &amp;amp; Alibaba Cloud &amp;amp; Tongji University), accepted by TSE 2026, proposes the first semantic log parser based on an open-source LLM. The parser consists of three core modules that work together. LogLLM removes causal masks and reframes log parsing from text generation as a token classification task, so that it can fully use bidirectional context. The SemPerception module uses multi-head cross-attention to aggregate subword features and achieves fine-grained semantic classification into 16 classes (60% more than the VALB 10-class system; 96% of parameters in enterprise logs can be accurately classified). The EffiParsing prefix tree caches parsed templates to significantly reduce repeated inference overhead.&lt;/p&gt;
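
&lt;p&gt;The caching idea behind a template prefix tree can be sketched as follows. This is a simplified illustration under stated assumptions, not the EffiParsing implementation: the TemplateTrie class, whitespace tokenization, and the "*" wildcard convention are all hypothetical. A log that matches a cached template is resolved without calling the model; only a miss would trigger LLM inference.&lt;/p&gt;

```python
_END = object()  # sentinel key marking the end of a cached template


class TemplateTrie:
    """Hypothetical prefix-tree cache keyed by log tokens; '*' matches any token."""

    def __init__(self):
        self._root = {}

    def insert(self, template_tokens, template_id):
        node = self._root
        for token in template_tokens:
            node = node.setdefault(token, {})
        node[_END] = template_id

    def lookup(self, log_tokens):
        return self._walk(self._root, log_tokens, 0)

    def _walk(self, node, tokens, pos):
        if pos == len(tokens):
            return node.get(_END)
        # Try the exact token first, then fall back to the wildcard branch.
        for key in (tokens[pos], "*"):
            child = node.get(key)
            if child is not None:
                found = self._walk(child, tokens, pos + 1)
                if found is not None:
                    return found
        return None


cache = TemplateTrie()
cache.insert("Connection from * closed".split(), "T1")
hit = cache.lookup("Connection from 10.0.0.8 closed".split())   # cached template
miss = cache.lookup("Disk usage at 91%".split())                # would go to the LLM
```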

&lt;p&gt;A comprehensive evaluation based on LLaMA2-7B on the LogHub-2.0 benchmark shows that SemanticLog achieves the best results on five traditional and semantic parsing metrics (GA 93.3%, PA 93.6%, FTA 84.4%, SPA 83.2%, SPA+ 55.9%). SemanticLog comprehensively surpasses 11 SOTA parsers, including the ChatGPT-based solution. Its semantic parsing accuracy (SPA) is 18.7% higher than that of the similar method VALB, and its inference speed is better than that of all other LLM-based parsers. In the downstream anomaly detection experiment, fine-grained semantic tagging increases the detection F1 score by up to 4%. This provides an efficient and reliable open-source solution for the engineering implementation of semantic log parsing in privacy-sensitive scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0snu5nv2ubktoqii03r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0snu5nv2ubktoqii03r.png" alt=" " width="800" height="680"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paper address: &lt;a href="https://ieeexplore.ieee.org/document/11216353/" rel="noopener noreferrer"&gt;https://ieeexplore.ieee.org/document/11216353/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;③ LogBase: The first semantic log parsing benchmark, enabling AI to truly "understand" every log&lt;br&gt;
Without a good ruler, you cannot measure true progress. The field of semantic log parsing has long faced systematic challenges such as scarce annotations, limited data size, and fragmented evaluation standards. The mainstream benchmark LogHub-2.0 covers only 14 systems and 3,488 templates, which severely restricts the accuracy of downstream AIOps tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72pbelkceq26e69h6ubc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72pbelkceq26e69h6ubc.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The paper "LogBase: A Large-Scale Benchmark for Semantic Log Parsing" (Fudan University &amp;amp; Alibaba Cloud &amp;amp; Tongji University), accepted by ISSTA 2025, builds the first large-scale semantic log parsing benchmark. The benchmark covers 130 open-source projects and provides 85,300 high-quality templates with semantic tags. Compared to LogHub-2.0, the number of data sources increases by about 9 times, and the number of templates by 24.5 times. The benchmark comes with an 8+16 hierarchical semantic classification system and GenLog, an automated construction framework. For the first time, it upgrades the evaluation paradigm from syntax parsing to semantic understanding. A comprehensive evaluation of 15 mainstream parsers exposes the real shortcomings of existing methods in complex scenarios. This provides a unified standard and reliable foundation for the engineering implementation of semantic log parsing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2bwrolmz1neqb33zzm9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2bwrolmz1neqb33zzm9.png" alt=" " width="800" height="187"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7lf0olgcwclb52n2tkwb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7lf0olgcwclb52n2tkwb.png" alt=" " width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paper address: &lt;a href="https://dl.acm.org/doi/10.1145/3728969" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3728969&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Currently, the Alibaba Cloud Observability team has integrated these technologies into products such as Cloud Monitor (CMS), Simple Log Service (SLS), and Application Real-Time Monitoring Service (ARMS). This enables accurate intelligent alerting, in-depth log understanding, and low-barrier intelligent O&amp;amp;M, helping enterprises break through O&amp;amp;M efficiency bottlenecks, reduce costs, and improve business stability.&lt;/p&gt;

&lt;p&gt;LLM and AI agent technologies are iterating ever faster, and the value of observability data as a key link between AI and production systems is becoming increasingly prominent. The Alibaba Cloud Observability team will continue to drive technological breakthroughs through academic innovation, improve the operation intelligence technology system, participate in building industry standards, and promote the large-scale adoption of AIOps, providing more solid AIOps support for the digital transformation of enterprises.&lt;/p&gt;

</description>
      <category>intelligence</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Two Thousand Years of Ontology: From Metaphysics to the Engineering Practice of Alibaba Cloud UModel</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Fri, 24 Apr 2026 06:22:45 +0000</pubDate>
      <link>https://dev.to/observabilityguy/two-thousand-years-of-ontology-from-metaphysics-to-the-engineering-practice-of-alibaba-cloud-umodel-58b2</link>
      <guid>https://dev.to/observabilityguy/two-thousand-years-of-ontology-from-metaphysics-to-the-engineering-practice-of-alibaba-cloud-umodel-58b2</guid>
      <description>&lt;p&gt;This article introduces the evolution of ontology from philosophy to engineering practice, highlighting how Alibaba Cloud UModel utilizes it to unify observability data and empower AIOps.&lt;/p&gt;

&lt;p&gt;Have you ever thought that the underlying logic used by Alibaba Cloud Operations and Maintenance (O&amp;amp;M) engineers today to locate server faults is essentially the same as the thinking of ancient Greek philosophers who asked "what the world is made of" more than two thousand years ago? From the first framework for analyzing existence, built by Aristotle in "Metaphysics", to the observability modeling of enterprise Information Technology (IT) systems in today's digital age, ontology has spanned more than two thousand years, gradually evolving from a core branch of metaphysics into an underlying methodology for the digital transformation of various industries. It has never been obscure philosophical speculation confined to a study. Instead, it has always revolved around the simplest of propositions: how can we clearly understand the world? How can we turn scattered and personal experiences into a transferable, reusable, and verifiable consensus?&lt;/p&gt;

&lt;p&gt;Today, we will follow this path from philosophy to practice to thoroughly analyze the essence of ontology, and see how ontology transforms from an abstract philosophical theory into an engineering implementation tool, and finally completes its native practice in the realm of observability and artificial intelligence for IT operations (AIOps) on Alibaba Cloud UModel.&lt;/p&gt;

&lt;p&gt;I. What Exactly is Ontology?&lt;br&gt;
When many people hear about ontology, their first reaction is that it is a "profound philosophical concept". Put plainly, however, ontology means drawing a unified and unambiguous map of the "world" you want to study. The word comes from the Greek ontos (existence) and logos (doctrine), which literally translates to "the doctrine of existence". In philosophy, ontology is the core of metaphysics, and the ultimate questions it needs to answer are: what is the world made of? What is the essence of things? How does existence become existence? Whether in philosophy or in computer science, ontology must solve three core problems:&lt;/p&gt;

&lt;p&gt;● What truly exists in this world? (What exists?)&lt;/p&gt;

&lt;p&gt;● How should we categorize and define these things? (How to classify?)&lt;/p&gt;

&lt;p&gt;● What are the relationships between these things, and how will they interact with each other? (How to relate?)&lt;/p&gt;

&lt;p&gt;Here we must distinguish three concepts that are easily confused, which is also the foundation for us to understand the value of ontology:&lt;/p&gt;

&lt;p&gt;● Ontology: It defines "what the world itself is". It is the starting point of all cognition. For example, you must first clearly define what a host, pod, and service are, and what the relationships between them are, before you can discuss subsequent O&amp;amp;M operations.&lt;/p&gt;

&lt;p&gt;● Epistemology: It answers "how we should understand this world" and is the method of cognition. For example, we need to decide whether to observe the status of a host through metrics, logs, or traces.&lt;/p&gt;

&lt;p&gt;● Methodology: It solves "what means we should use to transform this world" and is the path to implementation. For example, after a fault occurs, we need to determine what steps to take to locate the root cause and resolve the fault.&lt;/p&gt;
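
&lt;p&gt;The ontology layer in the host, pod, and service example can be sketched as a small set of entity types plus the relations allowed between them. This is only an illustration under stated assumptions: the relation names runs_on and routes_to, the entity names, and the relate helper are hypothetical and do not come from any specific product.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical entity types and allowed relations, for illustration only.
ENTITY_TYPES = {"host", "pod", "service"}
RELATION_TYPES = {
    ("pod", "runs_on", "host"),
    ("service", "routes_to", "pod"),
}


@dataclass(frozen=True)
class Entity:
    kind: str
    name: str

    def __post_init__(self):
        # "What exists?": only declared entity types are accepted.
        if self.kind not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.kind}")


def relate(source, relation, target):
    """Return a triple if the ontology allows it, else raise ValueError."""
    # "How to relate?": only declared relations between types are accepted.
    if (source.kind, relation, target.kind) not in RELATION_TYPES:
        raise ValueError(f"{source.kind} {relation} {target.kind} is not defined")
    return (source, relation, target)


host = Entity("host", "node-1")
pod = Entity("pod", "checkout-7d9f")
edge = relate(pod, "runs_on", host)
```

&lt;p&gt;Once the types and relations are fixed in this way, questions of epistemology (which signals to observe) and methodology (which steps to take) can refer to the same shared vocabulary.&lt;/p&gt;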

&lt;p&gt;Without the underlying "map" of ontology, epistemology and methodology become water without a source. If you have not even clearly defined what you want to study, subsequent observations and operations will inevitably fall into chaos.&lt;/p&gt;

&lt;p&gt;The biggest misunderstanding of ontology is thinking that it only defines things and attaches labels to them. The true soul of ontology is never a static entity definition, but rather dynamic relationships and behaviors. Take the simplest example: to understand "water", we must first establish that its molecular formula is H₂O. This is the essential definition of water. However, what truly makes us understand "water" is its state changes at different temperatures, its chemical reactions with other substances, and its cycles in the ecosystem. Detached from these dynamic behaviors and relationships, "H₂O" is just a cold symbol without any practical significance.&lt;/p&gt;

&lt;p&gt;This is the essential difference between the static perspective and the dynamic perspective in ontology. The static perspective focuses only on the properties of things themselves, while the dynamic perspective holds that the essence of a thing truly manifests only in its relationships with other things and in its own movement and change. This insight is also the fundamental reason why ontology could step out of the philosopher's study and take root in engineering. The most painful problem in enterprise digitalization is never "we do not have data", but rather "we have a pile of data, but we do not know how the data is related, let alone what business logic lies behind it".&lt;/p&gt;

&lt;p&gt;II. Two Thousand Years of Ontology: From Philosophical Speculation to Engineering Practice&lt;br&gt;
The development of ontology has never been a random accumulation of scattered viewpoints, but has completed three key leaps along the main line of "standardizing human cognition".&lt;/p&gt;

&lt;p&gt;2.1 Philosophical Foundation: From "Questioning the Origin" to "Building a System"&lt;br&gt;
The starting point of ontology was ancient Greece in the 6th century BC. Before this, people used myths to explain the world. The philosophers of ancient Greece were the first to use reason to ask "what the origin of the world actually is." Thales said that "water is the origin of all things." He attributed the essence of the world to concrete matter for the first time and opened the prelude to rational inquiry. Heraclitus said that "all things flow, and a person cannot step into the same river twice." He shifted the perspective to "change" and believed that the essence of the world is a process of movement. Parmenides proposed that "true existence is eternal and unchanging." The dispute between these two views planted the core proposition of "static versus dynamic" in ontology. The person who truly transformed ontology into a complete system was Aristotle. In "Metaphysics", he treated "the study of existence itself" as an independent discipline for the first time. He analyzed the underlying logic of why things exist using the theory of four causes (material cause, formal cause, efficient cause, and final cause). He also used his system of ten categories to classify all manifestations of existence. Aristotle thus drew the first universal "ontology map" of the world, turning scattered inquiries into a reusable analytical framework.&lt;/p&gt;

&lt;p&gt;In the subsequent Middle Ages, European philosophy was incorporated into the theological framework, and the dispute between realism and nominalism became central. Realism held that universal concepts truly exist. Nominalism held that only concrete individuals are real, and concepts are just names. This dispute appears to be bound up with theology, but it actually clarified the "relationship between concepts and entities", which is exactly the core premise of "knowledge representation" in later computer science. From the 17th century to the 19th century, the modern scientific revolution and the rational spirit of the Enlightenment completely pulled ontology out of theology. Descartes' mind-body dualism separated the cognitive subject from the objective world and set the research paradigm of "subject-object separation" for modern science. Kant's twelve-category system converted the inquiry of traditional ontology into an epistemological issue: it no longer asked about the unknowable 'thing-in-itself', but instead studied the a priori logical framework through which humans perceive the world. Hegel's dialectics thoroughly injected the idea of dynamic evolution into ontology, completing the crucial upgrade from the "description of static existence" to the "description of the laws of motion of existence."&lt;/p&gt;

&lt;p&gt;At this point, the philosophical kernel of ontology had become completely mature. The remaining task was to wait for an era that could allow ontology to be implemented.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gig6x72wzy193vbtez1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gig6x72wzy193vbtez1.png" alt=" " width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.2 Paradigm Transformation: From Philosophical Theory to Engineering Tools&lt;br&gt;
Since the 20th century, the successive explosion of mathematical logic, computer science, and IT has opened the door of engineering for ontology. The development of ontology has also followed the technological wave and completed four crucial steps:&lt;/p&gt;

&lt;p&gt;The first step is from text to symbols. The concept of a "universal language" proposed by Leibniz in the 17th century was finally realized between the end of the 19th century and the 20th century. The mathematical logic founded by Frege and Russell provided ontology with a rigorous and unambiguous tool for formal expression. The "existence" that could previously be described only in words could now be calculated using symbols and formulas. Ontology transformed from philosophical speculation into a scientific system that can be verified and computed.&lt;/p&gt;

&lt;p&gt;The second step is from science to the core tool of artificial intelligence (AI). When the discipline of AI was born in the mid-20th century, the first problem to be solved was "how to make machines understand human knowledge." This is exactly what ontology is best at. In 1993, Gruber proposed the classic definition: &lt;em&gt;An ontology is an explicit specification of a conceptualization&lt;/em&gt;. Later, scholars such as Studer extended and refined this definition, forming the consensus definition still in use today: an ontology is a formal, explicit specification of a shared conceptualization of a given domain. This completed the paradigm transformation of ontology. Ontology is no longer a toy for philosophers; it has become the core foundation of knowledge representation in AI.&lt;/p&gt;

&lt;p&gt;The third step is from a single system to the infrastructure of the Internet. Around the year 2000, the Internet rapidly became popular. However, the information on it could be read only by humans. Machines could not understand it, and the data of different websites formed completely isolated islands. The Semantic Web, proposed by Tim Berners-Lee, the father of the World Wide Web, aimed to use ontology to add unified semantic tags to information on the Internet. The publication of standards such as Resource Description Framework (RDF) and Web Ontology Language (OWL) made ontology the underlying infrastructure for knowledge interconnection and interoperability on the Internet.&lt;/p&gt;

&lt;p&gt;The fourth step is from the Internet to the era of big data and Large Language Models (LLMs). In 2012, Google launched the Knowledge Graph and used ontology as its "schema layer". The combination of ontology and graph databases allowed ontology to achieve large-scale engineering implementation in the era of big data. After the explosion of large language models in 2022, ontology found a new role. LLMs hold massive knowledge, but they are prone to "talking nonsense", and their reasoning process is hard to control. The structured and precise nature of ontology can put a "halter" on LLMs, which makes ontology and LLMs a golden combination for bringing LLMs into industry.&lt;/p&gt;

&lt;p&gt;2.3 Modern Exploration: The Breakthroughs and Limitations of Palantir&lt;br&gt;
At this point, many people will certainly ask: does ontology have applications in sectors such as healthcare, finance, industry, and government affairs? In essence, ontology addresses three common difficulties that no enterprise can bypass in its digital transformation:&lt;/p&gt;

&lt;p&gt;● Data silos. Different systems and departments have different data standards and disconnected semantics. Even though the data is available, it cannot be used together. For example, in the medical industry, the disease terminologies of different hospitals are not unified, and data cannot interoperate at all. In government, the data of different departments is managed independently, and citizens must visit several departments to complete a single task. Ontology builds a unified "translation language" for this heterogeneous data, which allows data from different systems to communicate.&lt;/p&gt;

&lt;p&gt;● Experience loss. Most of the core capabilities of an enterprise are hidden in the minds of senior employees. For example, a veteran factory worker knows what sound a device makes when it is about to malfunction, and a senior risk control expert in a bank knows which features indicate fraud. Newcomers need several years to learn this tacit experience, and when the employees leave, the experience is lost. Ontology can break this fragmented experience down into standardized rules and turn it into reusable, inheritable knowledge in the system. This knowledge is not lost through personnel turnover.&lt;/p&gt;

&lt;p&gt;● Disconnection between systems and business. The IT systems of many enterprises only move offline processes online, but do not capture business logic. Piles of data sit in the system, but the data cannot support business decisions, and problems cannot be located quickly when they occur. Ontology models business entities, relationships, and rules into the system, so that the system truly understands the business rather than just storing data.&lt;/p&gt;

&lt;p&gt;In the modern engineering practice of ontology, Palantir is an unavoidable benchmark. The company has gained a firm foothold in global intelligence, finance, and industry not because of the power of its big data technology, but because it was the first to truly bring the core value of ontology into enterprise-level scenarios. Palantir hit the nail on the head by exposing the fatal flaw of traditional data systems: enterprise data sits in databases, but the business relationships between data are invisible, and the business experience and judgment rules in the minds of senior employees cannot be incorporated into the system. Everyone is looking at the data, but no one can clearly explain how the business behind the data actually runs. Palantir uses ontology to answer this problem in three ways:&lt;/p&gt;

&lt;p&gt;● Palantir jumps out of the cold association of primary and foreign keys in traditional databases, and adds business semantics to the relationships between entities. This is not a simple "Identifier (ID) match", but an association with practical significance, such as "Company A holds shares in Company B" and "Account C transfers money to Account D".&lt;/p&gt;

&lt;p&gt;● Palantir breaks down the tacit experience in the minds of business experts into configurable and executable rules, and incorporates the rules into the system. This allows the system to replicate the judgment logic of experts.&lt;/p&gt;

&lt;p&gt;● Palantir not only records the final status of data, but also traces the end-to-end flow of data generation and circulation. This makes the complete procedure of the business observable and traceable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ghhol2dt014k3kbawb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ghhol2dt014k3kbawb8.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This set of strategies allowed Palantir to prove the value of ontology in highly complex scenarios such as counterterrorism, financial risk control, and industrial manufacturing. However, its limitations are also obvious. Palantir positions itself to serve only top-tier customers and relies on heavy customization and heavy delivery, so implementation cycles are long and costs are extremely high. Ordinary small and medium-sized enterprises cannot afford it. Moreover, the threshold for ontology modeling is very high: it requires professional teams and cannot be popularized at scale. Palantir paved the way for the engineering of ontology, but it also left a new problem: how can this approach be turned into a reusable, low-threshold capability that ordinary enterprises across the industry can also use? This has become a new challenge for the engineering implementation of ontology.&lt;/p&gt;

&lt;p&gt;III. UModel: Making Ontology "Light" and "Practical"&lt;br&gt;
If we focus on the observability realm, we will find a deeply meaningful point of convergence. With the continuous deepening of the digital transformation of enterprises, and IT architectures fully evolving toward microservices, cloud-native, and containerization, the core dilemma faced by the observability realm is essentially homologous to the philosophical proposition that ontology sought to solve more than 2,000 years ago. Both aim to solve the fundamental problem of how to clearly define cognitive objects, sort out association relationships, and form a unified consensus. The current enterprise observability systems generally face three major core pain points:&lt;/p&gt;

&lt;p&gt;● Data silos and semantic fragmentation. The four major core observability data types, which are metrics, logs, traces, and changes, are scattered in systems of different vendors and different features. The data formats are not unified, and the business semantics are not interoperable. When a fault occurs, O&amp;amp;M engineers need to switch back and forth among multiple platforms for troubleshooting, and cannot achieve end-to-end association analysis or root cause location at all.&lt;/p&gt;

&lt;p&gt;● Tacit experience and inheritance failure. Senior O&amp;amp;M engineers can quickly locate faults based on long-accumulated experience. However, this core judgment logic and handling methods exist in the minds of individuals in the form of tacit knowledge. Not only is the training cycle for newcomers long and difficult to master, but the core O&amp;amp;M capabilities of the enterprise cannot achieve standardized accumulation or scaled reuse. The fault handling efficiency always highly depends on individual capabilities.&lt;/p&gt;

&lt;p&gt;● LLMs lack a reliable foundation for implementation. The industry widely attempts to apply LLMs to AIOps (artificial intelligence for IT operations) scenarios. However, LLMs lack a standardized knowledge framework for the O&amp;amp;M domain and misunderstand professional terms and business logic. They are highly prone to hallucination, and their inference processes and results are hard to control. As a result, they are difficult to deploy reliably in production environments.&lt;/p&gt;

&lt;p&gt;Alibaba Cloud UModel emerged precisely to systematically solve these industry pain points. Based on the underlying logic of ontology, which prioritizes behavior and takes relationships as the core, UModel creates a universal and unified modeling framework for the observability realm. Essentially, it draws a complete and unambiguous cognitive map of the digital world for complex and heterogeneous IT systems, truly transforming ontology from an abstract theory into a practical tool that O&amp;amp;M engineers can use, know how to use, and afford to use. During the design procedure, UModel is not only an abstraction of data, but also a complete system that integrates data, knowledge, and actions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqj2jyzemp5vjrguav99a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqj2jyzemp5vjrguav99a.png" alt=" " width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;UModel builds a complete product system around four core dimensions. Each dimension aligns with the native ideas of ontology and gives UModel a distinct advantage against the pain points of the observability domain. This distinguishes UModel both from general-purpose ontology platforms such as Palantir and from traditional monitoring tools:&lt;/p&gt;

&lt;p&gt;● Standardized semantic definitions that solve the core pain points of data silos and semantic fragmentation&lt;/p&gt;

&lt;p&gt;Built on the native ideas of ontology, UModel provides unified, unambiguous, standardized definitions for all entities, relationships, and business rules in the O&amp;amp;M world. This allows O&amp;amp;M engineers, applications, and LLMs to form a consistent understanding of observability data, solving semantic inconsistency at the root. Unlike general-purpose platforms, which require users to build domain models from scratch, UModel is deeply optimized for IT O&amp;amp;M and cloud resource management scenarios. It ships with mature domain ontology libraries and standardized modeling templates that cover scenarios such as infrastructure, middleware, application performance, and Alibaba Cloud services. Enterprises do not need to start from scratch and can adapt core scenarios out of the box.&lt;/p&gt;

&lt;p&gt;● An end-to-end closed loop that carries implementation all the way from data to actions&lt;/p&gt;

&lt;p&gt;Based on the graph model, UModel closes the "data-knowledge-action" loop. It deeply connects the underlying multi-source observability data, expert knowledge in the O&amp;amp;M domain, and automated remediation actions, achieving end-to-end integration from data observation and root cause analysis to decision-making and remediation, rather than the static data storage and display of traditional tools. At the same time, as the core foundation of Cloud Monitor 2.0, UModel natively connects to full-stack observability products such as Alibaba Cloud Simple Log Service (SLS) and Application Real-Time Monitoring Service (ARMS). It provides one-stop integration of all observability data, including metrics, logs, traces, and changes. Enterprises do not need to perform complex system integration or custom development, which significantly reduces implementation costs.&lt;/p&gt;

&lt;p&gt;● Explicit capture of tacit experience so that enterprise O&amp;amp;M capabilities can be passed on in a standardized way&lt;/p&gt;

&lt;p&gt;Closely following the core definition of "rules and constraints" in ontology, UModel breaks down the tacit experience that O&amp;amp;M engineers accumulate in fault judgment, root cause analysis, and emergency response into a standardized, configurable, and reusable rule system. These rules are captured in the system, converting personal experience into inheritable digital knowledge assets of the enterprise. Unlike Palantir's pattern, which relies heavily on professional teams to customize rules, UModel relies on visual modeling tools and standardized modeling flows to remove the technical barriers to capturing rules. O&amp;amp;M engineers do not need to master complex ontology theory to break down their experience and configure models on their own, which makes core capabilities broadly reusable.&lt;/p&gt;

&lt;p&gt;● LLM-native integration that makes AIOps broadly accessible through bidirectional empowerment&lt;/p&gt;

&lt;p&gt;UModel uses a unified ontology model to give LLMs reliable domain knowledge constraints and a logical framework. This addresses LLM hallucination in vertical O&amp;amp;M scenarios at the root. At the same time, by leveraging the natural language understanding and generation capabilities of LLMs, UModel significantly lowers the technical threshold for ontology modeling and O&amp;amp;M operations, achieving bidirectional empowerment between ontology models and LLMs. This is also the core advantage that distinguishes UModel from traditional tools: traditional tools can only be loosely connected to LLMs, whereas UModel has integrated the ontology model with the Qwen LLM natively from the beginning of its design. Users can complete fault localization, root cause analysis, and model configuration through everyday conversation. They do not need to memorize complex query syntax or operation instructions, which makes "conversational O&amp;amp;M" broadly accessible.&lt;/p&gt;

&lt;p&gt;In terms of specific architecture implementation, UModel adopts a directed graph structure of "nodes + edges" to completely describe the entire IT world. Each architecture component forms a precise one-to-one mapping with the core concepts of ontology.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffn0sydgc74s00z1p6r2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffn0sydgc74s00z1p6r2m.png" alt=" " width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the same time, the implementation procedure of UModel essentially breaks down the philosophical ideas of ontology into standardized flows that can be executed and copied in O&amp;amp;M scenarios. Through five core actions, it helps enterprises convert scattered, tacit O&amp;amp;M experience into standardized ontological models. Each step of the end-to-end flow deeply aligns with the core logic of ontology:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Division of business domains (corresponding to domain concept definition in ontology). Based on the IT architecture, lines of business, and division of labor among O&amp;amp;M teams, clear business domains are delineated, such as the infrastructure domain, application performance domain, Alibaba Cloud service domain, and business system domain. The boundaries and responsible team for each domain are clarified to avoid duplicate model construction from the start, which lays the foundation for ontological modeling.&lt;/li&gt;
&lt;li&gt;Definition of entities and relationships (corresponding to class and association modeling in ontology). The core observability entities within each business domain are identified. The properties and field specifications of entity sets, as well as the semantic relationships between entities, are defined. Examples include the "contains" relationship between a service and a pod, the "runs on" relationship between a pod and a host, and the "invokes" relationship between microservices.&lt;/li&gt;
&lt;li&gt;Making O&amp;amp;M rules explicit (corresponding to constraint rule definition in ontology). Through expert interviews and reviews of historical fault cases, the tacit experience of senior O&amp;amp;M engineers is extracted and broken down into standardized rule elements. These elements include fault trigger conditions, root cause analysis logic, alert denoising rules, and automated handling flows. They are then mapped to the constraint rule system of UModel.&lt;/li&gt;
&lt;li&gt;Multi-source data fusion (corresponding to instantiation in ontology). Relying on the storage decoupling capability of UModel, the model connects to the existing observability data sources of the enterprise. It semantically aligns all of the data and maps the metrics, logs, and traces scattered across different systems into the ontological model, breaking down data silos and producing live data that supports association analysis.&lt;/li&gt;
&lt;li&gt;Scenario-based application and iterative optimization. Based on the ontological model, specific O&amp;amp;M scenarios such as fault early warning, root cause analysis, alert denoising, and automated handling are implemented. Then, according to the results observed in the production environment, the entity definitions, relationship rules, and judgment logic of the model are continuously iterated and optimized. This allows the ontological model to evolve along with the business architecture and O&amp;amp;M requirements of the enterprise.&lt;/li&gt;
&lt;/ol&gt;
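&lt;p&gt;The multi-source data fusion step can be sketched in a few lines of Python. This is only an illustration of semantic alignment under stated assumptions, not how UModel implements it: the &lt;code&gt;FIELD_MAPS&lt;/code&gt; table, the raw field names, and the sample records are all hypothetical.&lt;/p&gt;

```python
# Hypothetical per-source field mappings: raw field name -> unified field name.
FIELD_MAPS = {
    "metrics": {"svc": "service", "pod_name": "pod", "val": "value"},
    "logs": {"app": "service", "k8s.pod": "pod", "msg": "message"},
    "traces": {"service.name": "service", "duration_ms": "value"},
}


def align(source, record):
    """Rename a raw record's fields to the unified names; keep unmapped fields as-is."""
    mapping = FIELD_MAPS[source]
    unified = {mapping.get(key, key): value for key, value in record.items()}
    unified["source"] = source
    return unified


def group_by_entity(records):
    """Group aligned records by (service, pod) so they can be analyzed together."""
    grouped = {}
    for record in records:
        key = (record["service"], record["pod"])
        grouped.setdefault(key, []).append(record)
    return grouped


# Three records about the same pod, each using a different vendor's field names.
raw = [
    ("metrics", {"svc": "checkout", "pod_name": "checkout-7d9f", "val": 0.93}),
    ("logs", {"app": "checkout", "k8s.pod": "checkout-7d9f", "msg": "timeout calling payment"}),
    ("traces", {"service.name": "checkout", "pod": "checkout-7d9f", "duration_ms": 2150}),
]

aligned = [align(source, record) for source, record in raw]
grouped = group_by_entity(aligned)
for entity, records in grouped.items():
    print(entity, [r["source"] for r in records])
```

&lt;p&gt;Once every source uses the same names for the same entity, a metric spike, a log line, and a slow trace for one pod land in the same group and can be correlated without switching platforms.&lt;/p&gt;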

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6mjoos3h82nwr48d2yw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6mjoos3h82nwr48d2yw.png" alt=" " width="800" height="371"&gt;&lt;/a&gt;&lt;br&gt;
IV. UModel Practices in Multiple Industries&lt;br&gt;
Based on this set of standardized methodologies, we are also actively exploring further UModel practices in industries such as the Internet, finance, industrial manufacturing, and government affairs. This yields replicable implementation solutions adapted to the characteristics of each industry and validates the broad applicability of ontology in the observability domain.&lt;/p&gt;

&lt;p&gt;4.1 The Internet Industry: End-to-end Observability of Ultra-large-scale Microservices Architectures&lt;br&gt;
The Internet industry generally adopts distributed microservices architectures. Core business call chains often span tens to hundreds of microservices, with tens of thousands to hundreds of thousands of container instances running online. The industry generally faces three core challenges. First, observability data is scattered across multiple monitoring tools. Metric, trace, log, and change data lack unified semantic definitions, forming serious data silos. When online faults occur, O&amp;amp;M engineers need to troubleshoot repeatedly across multiple platforms, so faults are located very slowly. Second, core fault handling and root cause analysis experience is highly concentrated in the hands of senior O&amp;amp;M engineers. The training period for new team members is long, and experience is difficult to standardize, accumulate, and reuse. Third, alert storms triggered by massive alert volumes are a prominent problem. Useful alerts are buried under noise, and fault response efficiency drops significantly.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Based on the business architecture and the division of O&amp;amp;M labor, you can define five core domains: the application performance domain, infrastructure domain, middleware domain, Alibaba Cloud service domain, and business system domain. You can clarify the boundaries and core entity scope of each domain, and build the basic framework for ontological modeling.&lt;/li&gt;
&lt;li&gt;You can define core entity sets such as service, instance, pod, edge zone, database, and message queue, as well as core semantic relationships such as service invocation, instance deployment, container execution, and data read/write. You can build an end-to-end unified ontological model that covers everything from user requests to infrastructure.&lt;/li&gt;
&lt;li&gt;Through reviews of historical fault cases and extraction of experience from senior O&amp;amp;M experts, you can break down tacit experience, such as fault root cause judgment, alert denoising, and automated handling flows, into a standardized rule system. You can accumulate this in UModel to make O&amp;amp;M experience explicit and reusable.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can connect multi-source heterogeneous monitoring data sources. Through UModel, you can semantically align all of the data, associate it end to end, and break down data silos.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;4.2 The Finance Industry: Compliance-oriented AIOps under IT Application Innovation Transformation&lt;br&gt;
The finance industry is currently in a critical stage of IT application innovation transformation. IT architectures are moving from traditional centralized architectures to distributed hybrid cloud architectures. Hundreds of business systems, such as core trading, credit, and wealth management, run simultaneously in IT application innovation environments and traditional environments. The core pain points of the industry are as follows. First, observability data is scattered across monitoring tools of multiple vendors and types. Data semantics are disconnected across environments, which makes troubleshooting extremely difficult. Second, O&amp;amp;M teams are limited in size, and senior O&amp;amp;M engineers are scarce. Fault handling depends heavily on experts, whose experience cannot cover all business systems. Third, the industry faces strict financial regulatory compliance requirements. O&amp;amp;M operations and transaction paths must be traceable and auditable end to end, and traditional O&amp;amp;M patterns struggle to meet these rigid requirements.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;You can combine the IT application innovation transformation architecture and regulatory compliance requirements to divide the architecture into four core domains: the infrastructure domain, core business domain, IT application innovation resource domain, and compliance audit domain. This adapts to the architecture attributes and compliance requirements of the finance industry.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can define core entities and association relationships, such as hosts, storage, databases, operational systems, and transaction links, to complete the building of the basic ontology model. You do not need to develop from scratch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can break down senior O&amp;amp;M experience, such as fault handling of core transaction systems, threat early warning, and compliance audits, into standardized rules, and accumulate them into the ontology model. This achieves system-wide reuse of expert experience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can connect multi-source monitoring data from the IT application innovation environment and the traditional environment. You can achieve semantic alignment of cross-environment data through a unified ontology model. This ensures that end-to-end transaction links are traceable and meets regulatory compliance requirements.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;4.3 Industrial Manufacturing Industry: End-to-end Observability of Production Lines in Industrial Internet Scenarios&lt;br&gt;
The discrete manufacturing and process manufacturing industries are accelerating their transformation to the Industrial Internet. A single production line is often equipped with thousands of industrial devices, and the automation rate of production lines continues to increase. The core pain points of the industry are as follows. First, the operational data of production line devices, manufacturing execution system (MES) data, and IT O&amp;amp;M data are isolated from each other. They lack a unified semantic definition, so operational technology (OT) and IT data cannot be analyzed together. Second, device fault handling depends heavily on the personal experience of on-site maintenance personnel. The fault handling cycle is long, which easily causes unplanned downtime of production lines. Third, the core experience in device maintenance and process optimization is scattered across production bases. When new bases are built or new employees are trained, the experience cannot be quickly reused. Fourth, a standardized predictive maintenance system is lacking. Sudden device faults occur frequently, and production continuity is difficult to guarantee.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Centered on the end-to-end production line, you can divide the architecture into four core domains: the device domain, production line domain, process domain, and IT system domain. This covers all scenarios from underlying industrial devices to upper-layer operational systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can define core entities, such as industrial robots, machine tools, sensors, production lines, and process segments. You can also define the ownership relationships between devices and production lines, the transfer relationships between process segments, and the association relationships between device faults and parameters. This helps build a full-scenario ontology model for the production line.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can extract the experience of senior maintenance personnel across multiple bases in device fault diagnosis, predictive maintenance, and process optimization. You can break down this experience into a standardized rule system and accumulate it into UModel. This achieves cross-base knowledge reuse.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can connect production line programmable logic controller (PLC) data, sensor data, MES data, and IT O&amp;amp;M data. You can achieve deep integration of OT and IT data through a unified ontology model.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In addition to the standardized O&amp;amp;M scenarios in these core industries, UModel explores and implements various innovative scenarios based on the underlying philosophy of ontology, which prioritizes behavior and centers on relationships. This further expands the reach of ontology in the observability domain.&lt;/p&gt;

&lt;p&gt;One such scenario is conversational O&amp;amp;M with native LLM integration. Based on the unified domain ontology model built by UModel, the LLM can accurately understand the professional terms, entity relationships, and business rules in O&amp;amp;M scenarios. This addresses the hallucination problem of the LLM in vertical O&amp;amp;M scenarios at its source. Users can query the running status of core systems, locate fault root causes, and configure O&amp;amp;M policies through natural language. They do not need to master professional query syntax or technical knowledge, which lowers the technical threshold for O&amp;amp;M operations.&lt;/p&gt;

&lt;p&gt;Another is hybrid cloud and multicloud deployment, which is now common among enterprises. UModel overcomes the limited cross-environment adaptability of traditional monitoring tools and achieves unified ontology modeling across cloud vendors, deployment environments, and technology stacks. A single ontology model is compatible with the observability data of public clouds, private clouds, and traditional self-managed data centers. You do not need to build independent monitoring or O&amp;amp;M systems for different environments. This reduces O&amp;amp;M complexity and management costs under hybrid cloud architectures.&lt;/p&gt;

&lt;p&gt;V. Conclusion&lt;br&gt;
More than two thousand years ago, Aristotle wrote Metaphysics to find a unified and unambiguous explanation for a chaotic world. Today, we use UModel to build ontology models for IT systems, aiming to draw a map of the complex digital world that can be understood and used. Now that LLMs are spreading rapidly, we do not lack AI that can generate content. What we lack is a knowledge framework that can put a rein on AI, make it truly understand the business, and keep it from talking nonsense. Ontology is the core of this framework. Combining the LLM with UModel essentially equips AI with a "business brain," transforming it from merely eloquent into something that can do real work accurately. This is probably the most appealing aspect of ontology. From questioning the origin of the world to locating server faults, ontology has spanned more than two thousand years. What has changed is only the object of study. What remains unchanged is humanity's obsession with explaining what we know clearly and passing it down. To this day, it still provides foundational power for our digital age.&lt;/p&gt;


</description>
      <category>umodel</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>One Command Equips Your OpenClaw with an X-ray Machine - Alibaba Cloud Observability Makes Farming Lobsters Cheaper and Safer</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Thu, 23 Apr 2026 02:46:10 +0000</pubDate>
      <link>https://dev.to/observabilityguy/one-command-equips-your-openclaw-with-an-x-ray-machine-alibaba-cloud-observability-makes-farming-424i</link>
      <guid>https://dev.to/observabilityguy/one-command-equips-your-openclaw-with-an-x-ray-machine-alibaba-cloud-observability-makes-farming-424i</guid>
      <description>&lt;p&gt;One-command observability integration makes OpenClaw AI agent operations transparent via Alibaba Cloud monitoring plugins.&lt;br&gt;
❓Have you experienced this?&lt;/p&gt;

&lt;p&gt;OpenClaw 🦞 (an open-source AI agent framework) is becoming a "digital employee" for more and more enterprises. It processes emails, writes code, manages files, and executes commands. It does almost anything. Many teams have deployed dozens or hundreds of OpenClaw instances, forming a sizable "digital lobster farm".&lt;/p&gt;

&lt;p&gt;However, a problem arises.&lt;/p&gt;

&lt;p&gt;Lobster farmers can at least watch their pond. What about your OpenClaw? Do you know how many tokens it consumed today? Do you know which model is silently draining your budget? Do you know if a "lobster" was lured into reading /etc/passwd at 3:00 AM?&lt;/p&gt;

&lt;p&gt;The answer for most is: I don't know. 😶&lt;/p&gt;

&lt;p&gt;You carefully deployed OpenClaw. However, when these issues arise, you find yourself without the right tools to pinpoint the problem.&lt;/p&gt;

&lt;p&gt;This article discusses using one command to equip your OpenClaw with an X-ray machine. This makes every LLM invocation, tool execution, and token consumption visible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fej3rjwqg84uk0yjczfcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fej3rjwqg84uk0yjczfcb.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1. What Is Your Lobster Doing? Three “Blind Spots” Are Affecting Your Confidence&lt;br&gt;
📚 Before we start, let's discuss three "blind spots". If you use OpenClaw, at least one has likely troubled you.&lt;/p&gt;

&lt;p&gt;Blind spot 1: The inference process is a maze and debugging relies on guessing&lt;br&gt;
The complete path OpenClaw takes to process a user message is more complex than you think. A simple question may travel the following journey:&lt;/p&gt;

&lt;p&gt;User input → System prompt assembly → Model inference round 1 → Determine need for tool calling → Tool calling (such as search or code execution) → Return tool result → Model inference round 2 → Call another tool → Generate final response&lt;/p&gt;

&lt;p&gt;If any step fails, the final output may deviate from expectations. Without tracing analysis, you face an "input-output" black box. You can only guess where the problem lies. Is the prompt poor? Is it model hallucination? Did the tool return incorrect data?&lt;/p&gt;

&lt;p&gt;Tuning prompts relies on inspiration. Troubleshooting relies on luck. This is not science. It is mysticism. 🎲&lt;/p&gt;

&lt;p&gt;Blind spot 2: Token bills are like blind boxes and cause pain at month-end&lt;br&gt;
LLMs charge by token. Everyone knows this. However, as an agent, OpenClaw has a token consumption pattern different from directly invoking an API. It has a context snowball effect.&lt;/p&gt;

&lt;p&gt;In every conversation round, the agent stuffs previous conversation history, system prompts, and tool calling results into the context. The first round might use 2,000 tokens. By the fifth round, it might expand to 20,000. If a tool returns a large block of HTML or JSON, the situation worsens.&lt;/p&gt;
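&lt;p&gt;The snowball effect is easy to reproduce with a little arithmetic. The Python sketch below assumes the history is never trimmed and uses made-up token counts chosen to match the 2,000-to-20,000 growth described above; &lt;code&gt;context_sizes&lt;/code&gt; is a hypothetical helper, not part of OpenClaw.&lt;/p&gt;

```python
from itertools import accumulate


def context_sizes(system_tokens, added_per_round):
    """Input-context size per round when the history is never trimmed.

    Each round resends the system prompt plus everything added so far.
    """
    return [system_tokens + total for total in accumulate(added_per_round)]


# Illustrative numbers only. Round 3 includes a large tool result
# (for example, a block of HTML or JSON).
added = [1500, 2500, 8000, 3000, 4500]
sizes = context_sizes(500, added)

print(sizes)       # [2000, 4500, 12500, 15500, 20000]
print(sum(sizes))  # 54500 input tokens billed across the five rounds
print(sum(added))  # 19500 tokens of genuinely new content
```

&lt;p&gt;In this example only 19,500 tokens of new content were produced, yet 54,500 input tokens were billed, because earlier content is resent on every round. A single large tool result in round 3 keeps costing tokens in rounds 4 and 5 as well.&lt;/p&gt;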

&lt;p&gt;Worse, you do not know the source of the cost. Is a model too expensive? Is an agent prompt too wordy? Was the context not clipped in time? Without fine-grained consumption data, you cannot perform optimization. 💸&lt;/p&gt;

&lt;p&gt;Blind spot 3: System status is like Schrödinger's cat&lt;br&gt;
OpenClaw involves message queues, webhook processing, and session management at runtime. When a user asks why it is not responding, the problem could lie in any layer. Did model inference time out? Did a tool call stall? Are the message queues backed up? Did the gateway fail?&lt;/p&gt;

&lt;p&gt;Without real-time metric monitoring, you only discover issues after user complaints. By then, a group of users may be affected. ⏰&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l1lbj9oubuggmk75vk8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l1lbj9oubuggmk75vk8.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2. The Antidote Is Here: openclaw-cms-plugin + diagnostics-otel, Traces and Metrics Working Together&lt;br&gt;
🛠️ To address these three "blind spots", our solution involves two plugins working together. They solve problems at different layers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29r6z7zk5t5xsu9emgz6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29r6z7zk5t5xsu9emgz6.png" alt=" " width="789" height="185"&gt;&lt;/a&gt;&lt;br&gt;
Both rely on the standard OpenTelemetry protocol and report their data to Alibaba Cloud's Cloud Monitor 2.0, so you can view and analyze everything on one platform.&lt;/p&gt;

&lt;p&gt;The openclaw-cms-plugin is the focus of this topic. It is a trace reporting plugin designed for OpenClaw. It follows OpenTelemetry GenAI semantics and generates structured traces for every OpenClaw run.&lt;/p&gt;

&lt;p&gt;Specifically, it records the following types of spans:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficngd65ankz8rj0hv2q4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficngd65ankz8rj0hv2q4.png" alt=" " width="789" height="307"&gt;&lt;/a&gt;&lt;br&gt;
These spans have a parent-child relationship. Together, they form a complete trace. You can see a trace view similar to this in the Cloud Monitor 2.0 console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqme8au0exetixuqeitl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqme8au0exetixuqeitl.png" alt=" " width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see at a glance how many times the LLM was invoked and how many tokens were used. You can also see which tools were invoked, which step took the longest, and if any errors occurred.&lt;/p&gt;

&lt;p&gt;It is that simple to go from "guessing" to "seeing". 👁&lt;/p&gt;

&lt;p&gt;diagnostics-otel is a built-in extension of OpenClaw. It outputs runtime metrics data, including token consumption rate, invocation QPS, response duration distribution, queue depth, and session status. The installation script automatically finds and enables it. You do not need to do anything else.&lt;/p&gt;

&lt;p&gt;Wait, doesn't diagnostics-otel also report traces? Why is openclaw-cms-plugin needed?&lt;br&gt;
Good question. diagnostics-otel does support trace reporting. However, if you look closely at the traces it generates, you will find a fundamental problem: all spans are independent and have no parent-child relationship.&lt;/p&gt;

&lt;p&gt;diagnostics-otel uses an event-driven architecture to generate spans. Each event independently creates a span with its own trace ID. It generates the following five types of spans:&lt;/p&gt;

&lt;p&gt;● openclaw.model.usage: model invocation (records token usage)&lt;/p&gt;

&lt;p&gt;● openclaw.webhook.processed/openclaw.webhook.error: webhook processing&lt;/p&gt;

&lt;p&gt;● openclaw.message.processed: message processing (records processing results and duration)&lt;/p&gt;

&lt;p&gt;● openclaw.session.stuck: session stuck alerting&lt;/p&gt;

&lt;p&gt;There is no trace context propagation between these spans. Simply put, they are just independent data points. The only way to associate them is using business fields such as sessionKey.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Webhook  [openclaw.webhook.processed]  traceId: abc123  
Message  [openclaw.message.processed]  traceId: def456  ❌ Different trace IDs  
Model    [openclaw.model.usage]        traceId: ghi789  ❌ Different trace IDs  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
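
&lt;p&gt;Because these spans share no trace ID, stitching them together is left to you. A minimal sketch of that manual join, assuming each exported span has been flattened into a dict that carries a sessionKey field (the record layout and the s-1/s-2 values are made up for illustration):&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical flattened span records; the field layout is an assumption.
spans = [
    {"name": "openclaw.webhook.processed", "traceId": "abc123", "sessionKey": "s-1"},
    {"name": "openclaw.message.processed", "traceId": "def456", "sessionKey": "s-1"},
    {"name": "openclaw.model.usage", "traceId": "ghi789", "sessionKey": "s-1"},
    {"name": "openclaw.model.usage", "traceId": "jkl012", "sessionKey": "s-2"},
]


def group_by_session(records):
    """Bucket span names by sessionKey, since trace IDs do not link them."""
    sessions = defaultdict(list)
    for record in records:
        sessions[record["sessionKey"]].append(record["name"])
    return dict(sessions)


grouped = group_by_session(spans)
print(grouped["s-1"])
```

&lt;p&gt;This tells you which events belong to the same session, but not which one caused which. That ordering is exactly what a shared trace ID and parent-child links provide.&lt;/p&gt;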



&lt;p&gt;In contrast, openclaw-cms-plugin is designed for complete tracing. All spans share the same trace ID and are linked into a call tree through explicit parent-child relationships, so you can see the full picture of a request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;enter_openclaw_system              traceId: aaa111  
  └── invoke_agent main            traceId: aaa111  ✅ Same trace ID  
        ├── chat qwen3-235b        traceId: aaa111  ✅ Same trace ID  
        ├── execute_tool search    traceId: aaa111  ✅ Same trace ID  
        └── execute_tool exec      traceId: aaa111  ✅ Same trace ID  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
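
&lt;p&gt;The parent-child links are what make a tree view like this possible. Here is a small sketch of how a backend could rebuild it, assuming each span record exposes a spanId and a parentId (the IDs below are invented; only the span names come from the example above):&lt;/p&gt;

```python
# Hypothetical span records: the spanId/parentId values are made up.
spans = [
    {"spanId": "1", "parentId": None, "name": "enter_openclaw_system"},
    {"spanId": "2", "parentId": "1", "name": "invoke_agent main"},
    {"spanId": "3", "parentId": "2", "name": "chat qwen3-235b"},
    {"spanId": "4", "parentId": "2", "name": "execute_tool search"},
    {"spanId": "5", "parentId": "2", "name": "execute_tool exec"},
]


def render_tree(records):
    """Return indented lines, one per span, following parent-child links."""
    children = {}
    for record in records:
        children.setdefault(record["parentId"], []).append(record)

    lines = []

    def walk(parent_id, depth):
        for record in children.get(parent_id, []):
            lines.append("  " * depth + record["name"])
            walk(record["spanId"], depth + 1)

    walk(None, 0)  # root spans are the ones without a parent
    return lines


tree = render_tree(spans)
print("\n".join(tree))
```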



&lt;p&gt;In addition to trace integrity, there is a fundamental difference in data richness between the two:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56khh4l2n7na5pwnyjsd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56khh4l2n7na5pwnyjsd.png" alt=" " width="789" height="392"&gt;&lt;/a&gt;&lt;br&gt;
Simply put: The trace from diagnostics-otel is a set of independent "record cards", while the trace from openclaw-cms-plugin is a complete "invocation map". The former only tells you "what happened," while the latter tells you "every step." Use them together. One handles system metrics, and the other handles business traces. They complement each other perfectly. 🤝&lt;/p&gt;

&lt;p&gt;3. Setup in One Minute: One-Command Integration Tutorial&lt;br&gt;
🚀 Enough theory. Let's get started. The entire integration process takes less than a minute.&lt;/p&gt;

&lt;p&gt;3.1 Get the install command&lt;br&gt;
Log on to the Cloud Monitor 2.0 console. Go to your application monitoring workspace. Choose Integration Center &amp;gt; AI application observability. Click OpenClaw.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6htz4sjb5ly9autnbh0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6htz4sjb5ly9autnbh0n.png" alt=" " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the sidebar, enter the application name and click "Click to obtain" to generate the integration command. Then click the copy icon in the upper-right corner to copy it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7q8mv7yf7eoxl20jowq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7q8mv7yf7eoxl20jowq.png" alt=" " width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.2 Start installation with one command&lt;br&gt;
Open the terminal on the machine where OpenClaw runs. Paste the command you copied and press Enter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;curl -fsSL https://arms-apm-cn-hangzhou-pre.oss-cn-hangzhou.aliyuncs.com/openclaw-cms-plugin/install.sh | bash -s -- \  
  --endpoint "https://Your ARMS-OTLP address" \  
  --x-arms-license-key "Your license key" \  
  --x-arms-project "Your project" \  
  --x-cms-workspace "Your workspace" \  
  --serviceName "Your service name"  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, sit back and watch it run. ☕&lt;/p&gt;

&lt;p&gt;The installation script automatically does the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;[INFO]  Checking prerequisites...  
[OK]    Node.js v24.14.0  
[OK]    npm 11.9.0  
[OK]    OpenClaw CLI found  
[INFO]  Downloading plugin...  
[OK]    Downloaded  
[INFO]  Extracting...  
[OK]    Extracted  
[INFO]  Installing npm dependencies...  
[OK]    Dependencies installed  
[INFO]  Locating diagnostics-otel extension...  
[OK]    Found diagnostics-otel at: /home/.../extensions/diagnostics-otel  
[OK]    diagnostics-otel dependencies already present  
[INFO]  Updating config...  
[OK]    Config updated  
[INFO]  Restarting OpenClaw gateway...  
[OK]    Gateway restarted  

════════════════════════════════════════════════════  
  ✅ openclaw-cms-plugin installed successfully!  
════════════════════════════════════════════════════  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What does it do?&lt;/p&gt;

&lt;p&gt;✅ Checks the environment (Node.js, npm, OpenClaw CLI).&lt;br&gt;
✅ Downloads and decompresses openclaw-cms-plugin to the OpenClaw extension folder.&lt;br&gt;
✅ Installs runtime dependencies for the plugin.&lt;br&gt;
✅ Automatically locates the diagnostics-otel extension. If dependencies are missing, it installs them automatically.&lt;br&gt;
✅ Updates the openclaw.json configuration (configurations for both plugins are written at once).&lt;br&gt;
✅ Restarts the gateway to apply the configuration.&lt;br&gt;
You do not need to manually edit any configuration files. The installation script handles the common edge cases: it merges updates into existing configurations instead of overwriting them, and it searches several possible installation locations for diagnostics-otel in priority order.&lt;/p&gt;
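
&lt;p&gt;"Merge instead of overwrite" can be pictured roughly as follows. This is not the installer's actual code, only a sketch of the idea: missing fields are added and values you already set are left alone. The key names under diagnostics.otel and the endpoint URL are placeholders.&lt;/p&gt;

```python
def merge_config(existing, updates):
    """Merge updates into a shallow copy of existing without overwriting.

    Nested dicts are merged key by key. A key that already holds a
    non-dict value keeps that value, so user settings are preserved.
    """
    merged = dict(existing)
    for key, value in updates.items():
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_config(merged[key], value)
        # Otherwise the existing value wins.
    return merged


# Hypothetical configuration fragments (key names are assumptions).
existing = {"diagnostics": {"otel": {"logs": True, "sampleRate": 0.5}}}
updates = {"diagnostics": {"otel": {"endpoint": "https://example.invalid/otlp", "sampleRate": 1.0}}}

merged = merge_config(existing, updates)
print(merged)
```

&lt;p&gt;Here the endpoint is added, while the existing sampleRate of 0.5 survives the merge.&lt;/p&gt;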

&lt;p&gt;3.3 Verify installation&lt;br&gt;
After installation, chat with your OpenClaw. Wait a minute or two. Open the Cloud Monitor 2.0 console. Go to AI application observability in the sidebar on the right. Your OpenClaw application appears. Congratulations. Your lobster is no longer a black box. 🎉&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpy3oe278bfnqo00vgqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpy3oe278bfnqo00vgqp.png" alt=" " width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.4 Want to uninstall? It is even simpler&lt;br&gt;
If you want to stop using it (though I doubt it), one command does it:&lt;/p&gt;

&lt;p&gt;curl -fsSL &lt;a href="https://arms-apm-cn-hangzhou-pre.oss-cn-hangzhou.aliyuncs.com/openclaw-cms-plugin/uninstall.sh" rel="noopener noreferrer"&gt;https://arms-apm-cn-hangzhou-pre.oss-cn-hangzhou.aliyuncs.com/openclaw-cms-plugin/uninstall.sh&lt;/a&gt; | bash&lt;br&gt;&lt;br&gt;
The uninstall script automatically cleans up the plugin folder and all related configurations in openclaw.json. It also disables the diagnostics-otel configuration. If you only want to uninstall the trace plugin but keep metrics, add the --keep-metrics parameter.&lt;/p&gt;

&lt;p&gt;Clean and quick. No side effects. 🧹&lt;/p&gt;

&lt;p&gt;4. The Highlight: What Can You See After Installation?&lt;br&gt;
📈 Integration is just the beginning. The truly exciting part is what you see and solve after integration.&lt;br&gt;
4.1 Complete trace: Finally understand its "thought process"&lt;br&gt;
This is the core value of openclaw-cms-plugin. Cloud Monitor 2.0 displays a structured trace for every user request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;enter_openclaw_system              (Request entry: sender and source)
  └── invoke_agent main            (Agent execution)
        ├── chat qwen3-235b        (LLM invocation: model inference + token usage details)
        ├── execute_tool search    (Tool call: search)
        └── execute_tool exec      (Tool call: code execution)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a conversation round, the plugin records the agent-level LLM invocation and each individual tool call. If the agent runs a tool loop internally (such as "call tool → get result → call next tool"), each tool call is recorded as its own tool span, including input parameters, return values, and execution status. You can clearly see how the whole toolchain executed.&lt;/p&gt;

&lt;p&gt;💡 In the current version, the LLM invocations in a conversation round are aggregated into one LLM span, which records the total token usage and the input/output content for that round. Future versions will refine this by generating a separate span for each LLM inference, so that even the intermediate inference steps in multi-round tool loops become fully visible.&lt;/p&gt;

&lt;p&gt;Each span is annotated with rich properties:&lt;/p&gt;

&lt;p&gt;● Duration—see which step is slowest at a glance&lt;/p&gt;

&lt;p&gt;● Model information—which model and provider were used&lt;/p&gt;

&lt;p&gt;● Token usage—input_tokens, output_tokens, cache_read_tokens, and total_tokens, broken down item by item&lt;/p&gt;

&lt;p&gt;● Tool parameters and return values—what tool was invoked, what parameters were passed, and what results were returned&lt;/p&gt;

&lt;p&gt;● Error message—displayed in red if an error occurs&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09ug3ntwhqp9gpew7mng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09ug3ntwhqp9gpew7mng.png" alt=" " width="800" height="674"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferp3fiz02iicnvog2fgx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferp3fiz02iicnvog2fgx.png" alt=" " width="800" height="740"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What does this mean?&lt;/p&gt;

&lt;p&gt;Previously, if a user said the "answer is wrong," you had to guess by checking chat records. Now, check the traces. You see the search tool returned an empty result. The model "creatively" made up a paragraph based on that empty result. Problem localization drops from "two hours" to "two minutes". ⚡&lt;/p&gt;

&lt;p&gt;4.2 Token usage breakdown—know exactly where every penny goes&lt;br&gt;
Each LLM span in a trace carries complete token usage properties:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzhxp9bgerri96gqpya8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzhxp9bgerri96gqpya8.png" alt=" " width="789" height="224"&gt;&lt;/a&gt;&lt;br&gt;
Combined with gen_ai.request.model and gen_ai.provider.name, these properties tell you exactly which model consumed how many tokens at which step.&lt;/p&gt;

&lt;p&gt;Consider a real scenario. You find five LLM invocations in a conversation trace. The input_tokens for the third invocation reach 12,000. Click it, and you see that the tool returned a full page of HTML, all of it stuffed into the context. You have found the "token-swallowing black hole," and optimization now has a direction.&lt;/p&gt;
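
&lt;p&gt;If you export span data for offline analysis, the same hunt takes a few lines. The sketch below assumes LLM spans flattened into dicts keyed by the attribute names; the step numbers and token counts are invented to mirror the scenario above.&lt;/p&gt;

```python
# Hypothetical LLM span records from one trace (all values are illustrative).
llm_spans = [
    {"step": 1, "gen_ai.request.model": "qwen3-235b", "gen_ai.usage.input_tokens": 1800},
    {"step": 2, "gen_ai.request.model": "qwen3-235b", "gen_ai.usage.input_tokens": 3200},
    {"step": 3, "gen_ai.request.model": "qwen3-235b", "gen_ai.usage.input_tokens": 12000},
    {"step": 4, "gen_ai.request.model": "qwen3-235b", "gen_ai.usage.input_tokens": 4100},
    {"step": 5, "gen_ai.request.model": "qwen3-235b", "gen_ai.usage.input_tokens": 4600},
]


def heaviest_span(spans):
    """Return the span with the largest input token count."""
    return max(spans, key=lambda span: span["gen_ai.usage.input_tokens"])


worst = heaviest_span(llm_spans)
print(worst["step"], worst["gen_ai.usage.input_tokens"])
```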

&lt;p&gt;Token usage transforms from a "messy account" to a "detailed ledger". 💰&lt;/p&gt;

&lt;p&gt;4.3 System running metrics—pulse visible in real-time&lt;br&gt;
With the metrics exported by diagnostics-otel, you can build runtime dashboards in Cloud Monitor 2.0 and monitor the following in real time:&lt;/p&gt;

&lt;p&gt;● Token usage rate and fee trends: broken down by model and over time&lt;/p&gt;

&lt;p&gt;● Invocation QPS and response duration: Is system throughput normal?&lt;/p&gt;

&lt;p&gt;● Message queue depth and wait time: Is there a backlog?&lt;/p&gt;

&lt;p&gt;● Session stall count: Are any lobsters "playing dead"?&lt;/p&gt;

&lt;p&gt;● Context size trend: Is the context expanding uncontrollably?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femloqdwchducadyfu1dk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femloqdwchducadyfu1dk.png" alt=" " width="800" height="601"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5knnciu3uyarnekidys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5knnciu3uyarnekidys.png" alt=" " width="800" height="579"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paired with the alerting feature of Cloud Monitor 2.0, these metrics let you set automatic alerts for a 50% day-over-day surge in token consumption, for queue depth exceeding a threshold, and for session stalls. You know immediately when a problem occurs, rather than waiting for user complaints. 🔔&lt;/p&gt;
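
&lt;p&gt;The day-over-day rule boils down to a simple comparison. In practice you would configure it in the alerting console rather than in code, so treat the function below as a plain illustration of the arithmetic; its name and the zero-baseline handling are assumptions.&lt;/p&gt;

```python
def surged(yesterday_tokens, today_tokens, threshold=0.5):
    """Return True when today's usage grew by at least threshold vs. yesterday."""
    if yesterday_tokens == 0:
        # No baseline: treat any usage at all as a surge (an assumption).
        return today_tokens > 0
    growth = (today_tokens - yesterday_tokens) / yesterday_tokens
    return growth >= threshold


print(surged(100000, 150000))  # exactly 50% growth
print(surged(100000, 120000))  # 20% growth
```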

&lt;p&gt;4.4 GenAI semantic conventions — Professional standards, not ad hoc solutions&lt;br&gt;
Note that the trace data reported by openclaw-cms-plugin follows the OpenTelemetry GenAI semantic conventions. These are not field names we defined arbitrarily; they come from an open, community-maintained specification.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;p&gt;Standardized data structures — Property names such as gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.tool.name match industry standards. This simplifies integration with other tools.&lt;br&gt;
Normalized message formats — gen_ai.input.messages, gen_ai.output.messages, and gen_ai.system_instructions are formatted according to standard JSON schema. This supports multiple message types, such as TextPart, ReasoningPart, ToolCallRequestPart, and ToolCallResponsePart.&lt;br&gt;
Future extensibility — As GenAI semantic conventions evolve, the plugin allows smooth upgrades.&lt;br&gt;
4.5 Beyond standards — The "extra helpings" of Alibaba Cloud GenAI conventions&lt;br&gt;
While staying compatible with the open-source OTel standard, openclaw-cms-plugin also implements extensions from the Alibaba Cloud GenAI semantic conventions. Compared to the community standard, you receive some "extra helpings":&lt;/p&gt;

&lt;p&gt;ENTRY span — A clear "entry point" for the trace&lt;/p&gt;

&lt;p&gt;The OTel community specification defines only span types such as LLM (inference), tool (tool calling), and agent. It lacks an "entry point" concept. The Alibaba Cloud specification extends the ENTRY span type to specifically identify the call entry point of an AI application. In openclaw-cms-plugin, this is the enter_openclaw_system span. It records "who initiated the request" (gen_ai.user.id) and the "current session ID" (gen_ai.session.id). This lets you view the trace and perform analysis and tracking by user and session dimensions.&lt;/p&gt;

&lt;p&gt;🔗 Session-level association —gen_ai.session.id&lt;/p&gt;

&lt;p&gt;The OTel standard provides gen_ai.conversation.id. However, for agent applications, "session" is more appropriate than "conversation". The Alibaba Cloud specification introduces gen_ai.session.id, which spans ENTRY, AGENT, and LLM spans. This lets you search directly by session ID in Cloud Monitor 2.0, retrieve all traces under that session at once, and quickly restore the full session content.&lt;/p&gt;

&lt;p&gt;📊 gen_ai.span.kind — An AI-specific span categorization system&lt;/p&gt;

&lt;p&gt;The SpanKind in the OpenTelemetry standard includes only generic types such as CLIENT, INTERNAL, and SERVER. For an AI application trace, SpanKind alone cannot distinguish between an LLM inference and a tool call. Alibaba Cloud introduces the gen_ai.span.kind property to define a GenAI-specific classification system: LLM, TOOL, AGENT, ENTRY, TASK, STEP (ReAct round), CHAIN, RETRIEVER, and RERANKER. Cloud Monitor 2.0 uses this categorization to automatically detect the AI application structure and render a dedicated AI trace view. LLM calls appear in orange, tool calls in pink, and agents in green. This lets you see the "role distribution" of the entire trace at a glance.&lt;/p&gt;

&lt;p&gt;💡 These extensions do not disrupt standard compatibility. The data reported by openclaw-cms-plugin displays basic information normally on any backend that supports OpenTelemetry. However, Cloud Monitor 2.0 unlocks the complete AI application observability experience.&lt;/p&gt;

&lt;p&gt;This standardized approach benefits future data analytics and platform evolution.&lt;/p&gt;

&lt;p&gt;5. From Black Box to Transparent: How Observability Changes Your Lobster Farming&lt;br&gt;
📈 Installing an X-ray machine fundamentally changes your "lobster farming" method:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8lxyj1nqh6so6634fv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8lxyj1nqh6so6634fv5.png" alt=" " width="790" height="319"&gt;&lt;/a&gt;&lt;br&gt;
This is not merely an improvement. It is a leap from "blind farming" to "precision farming."&lt;/p&gt;

&lt;p&gt;A farmer upgrades from "checking water color visually" to using "water quality sensors, cameras, and automatic feeding systems." You manage the same lobsters, but your control level changes completely. 🦞📊&lt;/p&gt;

&lt;p&gt;One more thing: Security audit&lt;br&gt;
Beyond performance tuning and cost control, enterprise AI agent deployment involves an unavoidable topic: security compliance and behavior audit. Agents can execute commands, read and write files, and initiate network requests. Without behavior audit capabilities, you cannot know if an agent secretly read an SSH key at 3:00 a.m.&lt;/p&gt;

&lt;p&gt;Our observability team covers this capability with another solution: the Alibaba Cloud Simple Log Service (SLS) OpenClaw one-click solution. It collects OpenClaw session audit logs and application operational logs. It provides out-of-the-box security audit dashboards, including high-risk command detection, prompt injection detection, and sensitive data leakage analysis. This makes every agent operation traceable.&lt;/p&gt;

&lt;p&gt;If you are interested in security audits, read this article: &lt;a href="https://www.alibabacloud.com/help/sls/enable-managed-openclaw-with-sls" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/help/sls/enable-managed-openclaw-with-sls&lt;/a&gt; (SLS one-click integration and audit solution makes OpenClaw controlled operation possible).&lt;/p&gt;

&lt;p&gt;Cloud Monitor 2.0 manages performance and cost, and SLS manages security and compliance. Together, they form a complete control system for the "lobster farm." 🔐&lt;/p&gt;

&lt;p&gt;6. FAQs&lt;br&gt;
💡 Here are answers to common questions about the process:&lt;/p&gt;

&lt;p&gt;Q: Does the integration impact OpenClaw performance?&lt;/p&gt;

&lt;p&gt;A: The impact is minimal. The openclaw-cms-plugin uses the OpenTelemetry batch export mechanism. Span data is buffered in memory and reported in batches periodically. This does not block the normal processing flow of the agent.&lt;/p&gt;
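
&lt;p&gt;To picture why batching keeps the overhead low, here is a toy buffer. It is not the plugin's code and not the OpenTelemetry SDK: a real batch exporter also flushes on a timer and sends from a background task, while this sketch only shows the buffering idea. The class name and batch size are made up.&lt;/p&gt;

```python
class BatchBuffer:
    """Toy batching buffer: collects spans and exports them in groups."""

    def __init__(self, export, max_batch_size=3):
        self._export = export  # callable that receives a list of spans
        self._max_batch_size = max_batch_size
        self._pending = []

    def add(self, span):
        """Queue a span; export only when the batch is full."""
        self._pending.append(span)
        if len(self._pending) >= self._max_batch_size:
            self.flush()

    def flush(self):
        """Export whatever is queued, if anything."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._export(batch)


exported = []
buffer = BatchBuffer(exported.append, max_batch_size=3)
for i in range(7):
    buffer.add({"name": "span-%d" % i})
buffer.flush()  # final flush, as a shutdown hook would do
print([len(batch) for batch in exported])
```

&lt;p&gt;Seven spans result in three export calls instead of seven, and the code that produces spans never waits on the network in the real, asynchronous version.&lt;/p&gt;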

&lt;p&gt;Q: Can I install only traces without metrics?&lt;/p&gt;

&lt;p&gt;A: Yes. Add the --disable-metrics parameter during installation to skip the diagnostics-otel configuration.&lt;/p&gt;

&lt;p&gt;Q: Do traces from diagnostics-otel conflict with traces from openclaw-cms-plugin?&lt;/p&gt;

&lt;p&gt;A: The installation script sets diagnostics.otel.traces to false by default. The openclaw-cms-plugin handles trace reporting. They work independently without duplication.&lt;/p&gt;

&lt;p&gt;Q: I have configured diagnostics-otel. Will the installation overwrite my configuration?&lt;/p&gt;

&lt;p&gt;A: No. The traces, logs, sample rate, and other configurations remain unchanged. It adds necessary fields such as endpoints and headers.&lt;/p&gt;

&lt;p&gt;Q: Which OpenClaw versions are supported?&lt;/p&gt;

&lt;p&gt;A: The version must be 2026.2.19 or later (earlier versions do not include the diagnostics-otel plugin). The openclaw-cms-plugin works through the standard OpenClaw hook mechanism. It does not depend on internal APIs of specific versions.&lt;/p&gt;

&lt;p&gt;Q: Why is the token consumption always 0?&lt;/p&gt;

&lt;p&gt;A: OpenClaw introduced a bug in V2026.3.8. This causes incorrect token consumption collection. We are urging the community to expedite the fix. Relevant issue link: &lt;a href="https://github.com/openclaw/openclaw/issues/46616" rel="noopener noreferrer"&gt;https://github.com/openclaw/openclaw/issues/46616&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;7. Summary&lt;br&gt;
📋 Back to the first question: Do you know what your lobster is doing underwater?&lt;/p&gt;

&lt;p&gt;If the answer is "I don't know", it is time to install an X-ray machine.&lt;/p&gt;

&lt;p&gt;openclaw-cms-plugin plus diagnostics-otel, installed with a single command in about a minute, bring three core capabilities to your OpenClaw:&lt;/p&gt;

&lt;p&gt;✅ Tracing analysis: End-to-end visualization of every LLM invocation, tool execution, and token flow.&lt;/p&gt;

&lt;p&gt;✅ Real-time metrics: Monitor the system pulse in real time, including token consumption rate, invocation QPS, queue depth, and session status.&lt;/p&gt;

&lt;p&gt;✅ GenAI semantic standards: Standardized data structures that lay the foundation for cost analysis, performance optimization, and exception detection.&lt;/p&gt;

&lt;p&gt;Stop letting your lobster "freestyle" in a black box. Install an X-ray machine. Make every step visible, traceable, and optimizable.&lt;/p&gt;

&lt;p&gt;After all, a visible lobster is a good lobster. 🦞✨&lt;/p&gt;

&lt;p&gt;❓Interaction time!&lt;/p&gt;

&lt;p&gt;What is the most troublesome "black box problem" you encountered while using OpenClaw?&lt;br&gt;
How do you troubleshoot OpenClaw issues now? Do you have any hacks to share?&lt;br&gt;
What data do you want to see most after enabling observability?&lt;br&gt;
Share your "lobster farming" insights in the comments. Bring your questions. We are here! 🦞🎉&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ai</category>
    </item>
    <item>
      <title>Zero-Code Modification in 5 minutes Enables Go Applications to Automatically Obtain End-to-End Observability</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Wed, 22 Apr 2026 02:41:35 +0000</pubDate>
      <link>https://dev.to/observabilityguy/zero-code-modification-in-5-minutes-enables-go-applications-to-automatically-obtain-end-to-end-4jgm</link>
      <guid>https://dev.to/observabilityguy/zero-code-modification-in-5-minutes-enables-go-applications-to-automatically-obtain-end-to-end-4jgm</guid>
      <description>&lt;p&gt;This article introduces the Loongsuite Go agent, a compile-time instrumentation tool that enables zero-code modification for end-to-end observability in Go applications.&lt;/p&gt;

&lt;p&gt;💡Are you still worried about the observability transformation of Go applications?&lt;br&gt;
💡Are you still performing manual tracking, modifying code, or importing SDKs?&lt;br&gt;
💡Are you still worried about tracking points affecting performance? Today, we bring you a zero-code-modification solution: the Loongsuite Go agent, which lets your Go application automatically obtain end-to-end observability at compile time!🚀&lt;/p&gt;

&lt;p&gt;😫Three Pain Points of Traditional Observability Solutions&lt;br&gt;
In the microservices model, observability has become an essential capability for application O&amp;amp;M. However, traditional observability solutions often face three major pain points:&lt;/p&gt;

&lt;p&gt;According to statistics, traditional tracking plans require developers to spend 20-30% of their time on monitoring code, and it is very error-prone.&lt;/p&gt;

&lt;p&gt;1.High code intrusiveness&lt;br&gt;
Traditional tracking solutions require developers to manually insert monitoring code into the business code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Traditional method: manual tracking is required.
// Assumes a package-level OpenTelemetry tracer and the
// go.opentelemetry.io/otel/attribute package are in scope.
func handleRequest(w http.ResponseWriter, r *http.Request) {
	// Manually create a span.
	ctx, span := tracer.Start(r.Context(), "handleRequest")
	defer span.End()

	// Business logic (doSomething is a placeholder for your own code).
	result := doSomething(ctx)

	// Manually record attributes.
	span.SetAttributes(attribute.String("result", result))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This method raises the following issues:&lt;/p&gt;

&lt;p&gt;● Code pollution: Business code and monitoring code are mixed together.&lt;/p&gt;

&lt;p&gt;● High maintenance costs: The monitoring code must be updated each time the business logic is modified.&lt;/p&gt;

&lt;p&gt;● Easy to omit: Developers may forget to add tracking points to some critical paths.&lt;/p&gt;

&lt;p&gt;2. Heavy modification workload&lt;br&gt;
To add observability to an existing Go application, you usually need to:&lt;/p&gt;

&lt;p&gt;● Import the OpenTelemetry SDK.&lt;/p&gt;

&lt;p&gt;● Modify each key function and add tracking code.&lt;/p&gt;

&lt;p&gt;● Configure the exporter and sampling policy.&lt;/p&gt;

&lt;p&gt;● Test to verify that the tracking points are correct.&lt;/p&gt;

&lt;p&gt;This process can take days or even weeks.&lt;/p&gt;

&lt;p&gt;3. Performance overhead concerns&lt;br&gt;
Although runtime tracking is flexible, it incurs some performance overhead:&lt;/p&gt;

&lt;p&gt;● The tracking logic must be executed on every call.&lt;/p&gt;

&lt;p&gt;● Serialization, network transmission, and other operations may affect application performance.&lt;/p&gt;

&lt;p&gt;✨Solution: Automatic Compile-time Instrumentation&lt;br&gt;
The Loongsuite Go agent uses compile-time instrumentation to inject monitoring code automatically during the compilation phase, so no business code has to be modified.&lt;/p&gt;

&lt;p&gt;It is an enterprise-grade observability solution for Go applications, open-sourced by Alibaba and already used at large scale in production.&lt;/p&gt;

&lt;p&gt;Core Strengths&lt;br&gt;
Zero-Code Modification&lt;br&gt;
You only need to add the otel prefix before go build without modifying any business code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Traditional method
go build -o app ./cmd/app

# Use the Loongsuite Go agent
otel go build -o app ./cmd/app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is that simple! Your application automatically obtains end-to-end observability capabilities.&lt;/p&gt;

&lt;p&gt;🚀Automatic Instrumentation&lt;br&gt;
The tool automatically detects the frameworks and libraries you use and injects the corresponding monitoring code:&lt;/p&gt;

&lt;p&gt;● HTTP frameworks: Gin, Echo, Fiber, FastHTTP, and Hertz&lt;/p&gt;

&lt;p&gt;● RPC frameworks: gRPC, Dubbo-go, Kitex, and Kratos&lt;/p&gt;

&lt;p&gt;● Databases: database/sql, GORM, MongoDB, and Elasticsearch&lt;/p&gt;

&lt;p&gt;● Caches: go-redis and redigo&lt;/p&gt;

&lt;p&gt;● Logging libraries: Zap, Logrus, slog, and Zerolog&lt;/p&gt;

&lt;p&gt;● AI frameworks: LangChain and Ollama&lt;/p&gt;

&lt;p&gt;● More: Supports more than 50 mainstream Go frameworks and libraries.&lt;/p&gt;

&lt;p&gt;⚡Performance-friendly&lt;br&gt;
Compile-time instrumentation means:&lt;/p&gt;

&lt;p&gt;● Low runtime overhead: Monitoring code is already optimized at compile time.&lt;/p&gt;

&lt;p&gt;● No reflection overhead: Does not rely on runtime reflection mechanisms.&lt;/p&gt;

&lt;p&gt;● Production-ready: Validated in large-scale production environments.&lt;/p&gt;

&lt;p&gt;🎯Case Study: Automatic Instrumentation for the Official MCP SDK&lt;br&gt;
Recently, we implemented automatic instrumentation support for the official Model Context Protocol (MCP) Go SDK. MCP is an open protocol introduced by Anthropic, and its official Go SDK is maintained in collaboration with Google. The protocol connects LLM applications to external data sources and tools, and it is becoming increasingly important in AI application development.&lt;/p&gt;

&lt;p&gt;Why choose MCP?&lt;br&gt;
As AI applications develop rapidly, more and more developers are building LLM applications on MCP. However, observability for MCP applications remains a challenge:&lt;/p&gt;

&lt;p&gt;● Complex protocol: MCP supports multiple operations (such as tools/call, resources/read, and prompts/get).&lt;/p&gt;

&lt;p&gt;● Middleware mechanism: The official SDK provides middleware, but users may not actively use it.&lt;/p&gt;

&lt;p&gt;● Time measurement: It is necessary to accurately measure the complete time of requests and responses.&lt;/p&gt;

&lt;p&gt;Our Solution&lt;br&gt;
We adopted the strategy of automatic injection during initialization. Monitoring middleware is automatically injected when NewServer and NewClient are created, ensuring 100% coverage.&lt;/p&gt;

&lt;p&gt;Technical Challenges&lt;br&gt;
The official MCP SDK provides a comprehensive middleware mechanism, but how to automatically inject monitoring middleware without modifying user code is a technical challenge.&lt;/p&gt;

&lt;p&gt;Implementation&lt;br&gt;
The hooks below run right after NewServer and NewClient return and register the monitoring middleware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Automatically inject monitoring middleware when NewServer is created.
func afterNewServer(call api.CallContext, s *mcp.Server) {
    if s == nil {
        return
    }
    // Automatically inject monitoring middleware.
    monitoringMiddleware := createServerMonitoringMiddleware()
    s.AddReceivingMiddleware(monitoringMiddleware)
}

// Automatically inject monitoring middleware when NewClient is created.
func afterNewClient(call api.CallContext, c *mcp.Client) {
    if c == nil {
        return
    }
    // Automatically inject monitoring middleware.
    monitoringMiddleware := createClientMonitoringMiddleware()
    c.AddReceivingMiddleware(monitoringMiddleware)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Implementation Effect&lt;br&gt;
In this way, we achieved:&lt;/p&gt;

&lt;p&gt;● 100% coverage: The monitoring middleware is injected automatically, regardless of whether the user calls AddReceivingMiddleware manually.&lt;/p&gt;

&lt;p&gt;● Accurate time measurement: The middleware runs before and after request processing, so it can measure the complete request-response time.&lt;/p&gt;

&lt;p&gt;● Automatic recording of key information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCP method names (initialize, tools/call, and resources/read)&lt;/li&gt;
&lt;li&gt;Tool name, resource URI, and prompt name&lt;/li&gt;
&lt;li&gt;Request parameters and response results&lt;/li&gt;
&lt;li&gt;Error messages and duration statistics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples&lt;br&gt;
User code does not need to be modified at all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// User code: Create an MCP server.
server := mcp.NewServer(&amp;amp;mcp.Implementation{
    Name:    "my-server",
    Version: "1.0.0",
}, nil)

// Register a tool as usual.
mcp.AddTool(server, &amp;amp;mcp.Tool{
    Name:        "greet",
    Description: "Say hi",
}, handler)

// Run the server and surface any error.
if err := server.Run(ctx, transport); err != nil {
    log.Fatal(err)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After compiling with otel go build, all MCP requests are monitored automatically, including:&lt;/p&gt;

&lt;p&gt;● Invoking tools (tools/call)&lt;/p&gt;

&lt;p&gt;● Reading resources (resources/read)&lt;/p&gt;

&lt;p&gt;● Retrieving prompts (prompts/get)&lt;/p&gt;

&lt;p&gt;● Initializing connections (initialize)&lt;/p&gt;

&lt;p&gt;Technical Principle: Compile-time Instrumentation&lt;br&gt;
Workflow&lt;br&gt;
The Loongsuite Go agent adds two phases in front of the regular compilation flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Traditional Go compilation flow:
Source code parsing → Type checking → Semantic analysis → Code optimization → Code generation → Linking

With the Loongsuite Go agent:
Preprocessing → Instrumentation → Source code parsing → Type checking → Semantic analysis → Code optimization → Code generation → Linking
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Preprocessing: Analyze dependencies and select applicable instrumentation rules.&lt;/li&gt;
&lt;li&gt;Instrumentation: Generate code based on rules and inject the code into the source code.&lt;/li&gt;
&lt;/ul&gt;
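
&lt;p&gt;The preprocessing phase can be pictured roughly as follows. This is only a sketch under stated assumptions: the Rule shape, its JSON field names, and the afterNewGrpcServer hook name are hypothetical and chosen for illustration, and the agent's real rule schema may differ. The sketch loads rules from JSON and keeps those whose package appears among the project's dependencies.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Rule is a hypothetical, simplified rule shape used only for this sketch.
// The agent's real rule schema may differ.
type Rule struct {
	ImportPath string `json:"import_path"` // package the rule applies to
	Function   string `json:"function"`    // function to instrument
	OnExit     string `json:"on_exit"`     // hook to run after the function returns
}

// RuleFile is the hypothetical top-level JSON document.
type RuleFile struct {
	Rules []Rule `json:"rules"`
}

// selectRules keeps only the rules whose package is among the dependencies.
func selectRules(rulesJSON []byte, deps []string) ([]Rule, error) {
	rf := new(RuleFile)
	if err := json.Unmarshal(rulesJSON, rf); err != nil {
		return nil, err
	}
	used := make(map[string]bool, len(deps))
	for _, d := range deps {
		used[d] = true
	}
	var selected []Rule
	for _, r := range rf.Rules {
		if used[r.ImportPath] {
			selected = append(selected, r)
		}
	}
	return selected, nil
}

func main() {
	rulesJSON := []byte(`{"rules": [
		{"import_path": "github.com/modelcontextprotocol/go-sdk/mcp", "function": "NewServer", "on_exit": "afterNewServer"},
		{"import_path": "google.golang.org/grpc", "function": "NewServer", "on_exit": "afterNewGrpcServer"}
	]}`)
	// Pretend the project depends only on the MCP SDK.
	deps := []string{"github.com/modelcontextprotocol/go-sdk/mcp"}

	rules, err := selectRules(rulesJSON, deps)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range rules {
		fmt.Printf("instrument %s.%s with %s\n", r.ImportPath, r.Function, r.OnExit)
	}
}
```

&lt;p&gt;Filtering rules by dependency up front means the instrumentation phase only touches packages the application actually imports.&lt;/p&gt;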

&lt;p&gt;Core Technologies&lt;br&gt;
● go:linkname: Links the instrumentation functions into the namespace of the target package.&lt;/p&gt;

&lt;p&gt;● AST manipulation: Modifies the abstract syntax tree to inject monitoring code.&lt;/p&gt;

&lt;p&gt;● Rule-driven: Define instrumentation behavior via JSON rule files.&lt;/p&gt;

&lt;p&gt;Instrumentation Methods&lt;br&gt;
Depending on the characteristics of each framework, we support multiple instrumentation methods:&lt;/p&gt;

&lt;p&gt;● Middleware injection (such as MCP and gRPC): Inject the middleware during initialization.&lt;/p&gt;

&lt;p&gt;● Hook mechanism (such as Redis and Kafka): Use the hook API of the framework.&lt;/p&gt;

&lt;p&gt;● Direct function hooking (such as the OpenAI SDK): Instrument key functions directly.&lt;/p&gt;

&lt;p&gt;● Struct field injection (such as database/sql): Inject fields to store metadata.&lt;/p&gt;

&lt;p&gt;🚀Get Started in 5 Minutes&lt;br&gt;
Step 1: Install the tool (1 minute)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Linux/macOS (Recommended)
sudo curl -fsSL https://cdn.jsdelivr.net/gh/alibaba/loongsuite-go-agent@main/install.sh | sudo bash

# Or download manually
wget https://github.com/alibaba/loongsuite-go-agent/releases/latest/download/otel-linux-amd64
chmod +x otel-linux-amd64
sudo mv otel-linux-amd64 /usr/local/bin/otel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 2: Compile the application (1 minute)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Just prefix go build with otel.
otel go build -o app ./cmd/app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 3: Configure the export (1 minute)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Export to Jaeger (development environment)
export OTEL_EXPORTER_JAEGER_ENDPOINT=http://localhost:14268/api/traces

# Or export to OTLP (production environment)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 4: Run the application (1 minute)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;./app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's that simple! Your application is now equipped with end-to-end observability capabilities.🎉&lt;/p&gt;

&lt;p&gt;Demonstration&lt;br&gt;
Once the application is running, you can see the following in Jaeger, Zipkin, or any other observability platform that supports OpenTelemetry:&lt;/p&gt;

&lt;p&gt;● ✅Complete invocation chain: From HTTP requests to database queries, everything is clear at a glance.&lt;/p&gt;

&lt;p&gt;● ✅Detailed performance metrics: duration and error rate of each operation&lt;/p&gt;

&lt;p&gt;● ✅Rich contextual information: request parameters, response results, and error messages&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8globzd7wxb8v1r7vsk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8globzd7wxb8v1r7vsk2.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Supported export methods&lt;br&gt;
The tool supports multiple export methods. You only need to configure environment variables:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuaxw5ktpgmsl0t1pphb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuaxw5ktpgmsl0t1pphb.png" alt=" " width="789" height="187"&gt;&lt;/a&gt;&lt;br&gt;
For more information about the configuration options, see the official documentation.&lt;/p&gt;

&lt;p&gt;Supported frameworks&lt;br&gt;
The Loongsuite Go agent supports 50+ mainstream Go frameworks and libraries, including:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jqa3gz0zdl57zb85n37.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jqa3gz0zdl57zb85n37.png" alt=" " width="789" height="335"&gt;&lt;/a&gt;&lt;br&gt;
For more information, see the GitHub repository.&lt;/p&gt;

&lt;p&gt;⚡Production-grade Performance&lt;br&gt;
Performance advantages brought by compile-time instrumentation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl2w5dy77bschpo9vurc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl2w5dy77bschpo9vurc.png" alt=" " width="789" height="187"&gt;&lt;/a&gt;&lt;br&gt;
Benefits:&lt;/p&gt;

&lt;p&gt;● ✅Low runtime overhead: Monitoring code is optimized at compile-time, and no runtime reflection is required.&lt;/p&gt;

&lt;p&gt;● ✅Production verification: It has been verified in large-scale production environments of companies such as Alibaba.&lt;/p&gt;

&lt;p&gt;● ✅Performance-friendly: According to benchmarks, the application performance overhead after instrumentation is usually less than 3%.&lt;/p&gt;

&lt;p&gt;💡Note: Compile time increases, but this cost is paid only in the development and build phase and does not affect runtime performance.&lt;/p&gt;

&lt;p&gt;Community and Support&lt;br&gt;
Open-source address&lt;br&gt;
● GitHub: &lt;a href="https://github.com/alibaba/loongsuite-go-agent" rel="noopener noreferrer"&gt;https://github.com/alibaba/loongsuite-go-agent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● Documentation: &lt;a href="https://alibaba.github.io/loongsuite-go-agent/" rel="noopener noreferrer"&gt;https://alibaba.github.io/loongsuite-go-agent/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Join the community&lt;br&gt;
● DingTalk group: 102565007776&lt;/p&gt;

&lt;p&gt;● GitHub issues: Report problems and share suggestions.&lt;/p&gt;

&lt;p&gt;● Code contributions: Pull requests are welcome.&lt;/p&gt;

&lt;p&gt;Comparison summary&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr55gbqabtafpnsqp96c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr55gbqabtafpnsqp96c.png" alt=" " width="791" height="333"&gt;&lt;/a&gt;&lt;br&gt;
📚 Related Resources&lt;br&gt;
● 🌟GitHub: &lt;a href="https://github.com/alibaba/loongsuite-go-agent" rel="noopener noreferrer"&gt;https://github.com/alibaba/loongsuite-go-agent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● 📖Documentation: &lt;a href="https://alibaba.github.io/loongsuite-go-agent/" rel="noopener noreferrer"&gt;https://alibaba.github.io/loongsuite-go-agent/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● 💼Commercial edition: &lt;a href="https://www.alibabacloud.com/help/arms/application-monitoring/user-guide/monitoring-the-golang-applications/" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/help/arms/application-monitoring/user-guide/monitoring-the-golang-applications/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● 💬DingTalk group: 102565007776&lt;/p&gt;

&lt;p&gt;If you find it useful, please star⭐ the repository and share it!&lt;/p&gt;


</description>
      <category>cloudnative</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
