<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ObservabilityGuy</title>
    <description>The latest articles on DEV Community by ObservabilityGuy (@observabilityguy).</description>
    <link>https://dev.to/observabilityguy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3433708%2Faf43ef59-cf80-46ad-930d-f76811e673a2.png</url>
      <title>DEV Community: ObservabilityGuy</title>
      <link>https://dev.to/observabilityguy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/observabilityguy"/>
    <language>en</language>
    <item>
      <title>Is Your OpenClaw Truly Under Control?</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Thu, 02 Apr 2026 03:26:54 +0000</pubDate>
      <link>https://dev.to/observabilityguy/is-your-openclaw-truly-under-control-5dbn</link>
      <guid>https://dev.to/observabilityguy/is-your-openclaw-truly-under-control-5dbn</guid>
      <description>&lt;p&gt;This article details how to build a comprehensive observability and security audit system for OpenClaw AI Agents using Alibaba Cloud Simple Log Service (SLS).&lt;br&gt;
Based on OpenClaw and Alibaba Cloud Simple Log Service (SLS), you can ingest logs and OpenTelemetry (OTEL) telemetry into SLS to build an AI Agent observability system. This system helps achieve a closed loop of behavior audit, O&amp;amp;M observability, real-time alerting, and security audit.&lt;/p&gt;

&lt;p&gt;1. Why Must We Ask: "Is the Agent Really Under Control?"&lt;br&gt;
"Under control" involves at least four aspects: who triggers each invocation, what it costs, what operations are performed (especially high-risk tools), and whether the behavior is traceable and auditable. If you cannot answer these questions, you cannot claim that the Agent is running under control.&lt;/p&gt;

&lt;p&gt;This article focuses on how to use Alibaba Cloud SLS to answer these questions. Session logs answer "what was done and how much it cost." Application logs answer "is the system behaving abnormally?" OTEL metrics and traces answer "what is the current status and latency?" Together, these data pipelines provide a well-documented answer to "Is the Agent really running under control?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpgt4e9cprvaudjv7yeo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpgt4e9cprvaudjv7yeo.png" alt=" " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1.1 Security Attack Surface of AI Agents&lt;br&gt;
There is a fundamental difference between AI Agents and traditional backend services: an Agent's behavior is non-deterministic. For the same user input, the model may generate completely different tool-call sequences. This means that, unlike when you audit a REST API, you cannot enumerate all behavior paths through code review.&lt;/p&gt;

&lt;p&gt;Without observability, you cannot answer who is invoking your model, how much it costs, or whether malicious instructions have been injected—so you cannot claim that the Agent is running under control. Specific attack surfaces include:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3rq1l0p45vzufq3dukc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3rq1l0p45vzufq3dukc.png" alt=" " width="789" height="344"&gt;&lt;/a&gt;&lt;br&gt;
These risks cannot be addressed solely by runtime protection in the Code (such as the tool policies and loop detectors built into OpenClaw). Runtime protection is the "city wall," while observability is the "sentry post." Only by continuously observing what the Agent is doing, who is invoking it, and how much it costs can you discover what the city wall failed to block.&lt;/p&gt;

&lt;p&gt;1.2 Three Pillars of Observability: Different Data Answers Different Questions&lt;br&gt;
Traditional observability is built on three pillars: logs, metrics, and traces. For AI Agents, each pillar plays a different role. Understanding which questions each can answer is the foundation for the system built in the rest of this article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fte1wcnnm04pozd1xdua6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fte1wcnnm04pozd1xdua6.png" alt=" " width="790" height="379"&gt;&lt;/a&gt;&lt;br&gt;
1.3 Why Choose Alibaba Cloud SLS&lt;br&gt;
Alibaba Cloud Simple Log Service (SLS) is naturally suitable for this scenario:&lt;/p&gt;

&lt;p&gt;● Native OTLP support: LoongCollector natively supports the OTLP protocol, integrates seamlessly with OpenClaw's diagnostics-otel plugin, and works out of the box.&lt;/p&gt;

&lt;p&gt;● Rich operators and flexible queries: Built-in processing and analysis operators make it convenient to parse, filter, and aggregate nested JSON fields (such as message.content and message.usage.cost) in session logs. A few lines of Structured Process Language (SPL) are enough for tool-call statistics, cost attribution, and sensitive-pattern matching.&lt;/p&gt;

&lt;p&gt;● Security and compliance capabilities: SLS supports log access audit, RAM permission control, sensitive data masking, and encrypted storage to meet audit-trail and compliance requirements. Alerts can be delivered via DingTalk, text message, and email for timely response to security events.&lt;/p&gt;

&lt;p&gt;● Comprehensive log analysis: SLS provides a one-stop pipeline of "collection → index → query → dashboard → alerting." Small agent deployments produce little log volume, so pay-as-you-go billing keeps costs low; when traffic grows, SLS scales automatically.&lt;/p&gt;

&lt;p&gt;2. Panoramic Architecture&lt;br&gt;
2.1 Data Pipeline&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzokmt2ivodci02mlfbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzokmt2ivodci02mlfbp.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.2 Data Source Mapping Table&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblu3y3kpsjb9fa0vmaxl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblu3y3kpsjb9fa0vmaxl.png" alt=" " width="800" height="426"&gt;&lt;/a&gt;&lt;br&gt;
Next, we walk through each data source in turn, from ingestion to audit scenarios.&lt;/p&gt;

&lt;p&gt;3. Behavior Audit: Session Logs&lt;br&gt;
Session logs are the core data source for AI Agent security audits. They record every round of conversation, every tool call, and every token consumed—completely reconstructing what the Agent actually did.&lt;/p&gt;

&lt;p&gt;3.1 Data Format&lt;br&gt;
Each session corresponds to a .jsonl file in which each line is a JSON object whose entry type is distinguished by the type field. The following is the log sequence generated by a typical conversation (a user request to read a system file):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;User message
{
  "type": "message",
  "id": "70f4d0c5",
  "parentId": "b5690259",
  "message": {
    "role": "user",
    "content": [{ "type": "text", "text": " Help me read the /etc/passwd file " }]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assistant response (including tool calling)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "type": "message",
  "id": "3878c644",
  "parentId": "70f4d0c5",
  "message": {
    "role": "assistant",
    "content": [
    { 
      "type": "toolCall", "id": "call_d46c7e2b...", "name": "read", 
      "arguments": { "path": "/etc/passwd" } 
    }],
    "provider": "anthropic",
    "model": "claude-4-sonnet",
    "usage": { "totalTokens": 2350 },
    "stopReason": "toolUse"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tool execution result&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "type": "message",
  "id": "81fd9eca",
  "parentId": "3878c644",
  "message": {
    "role": "toolResult",
    "toolCallId": "call_d46c7e2b...",
    "toolName": "read",
    "content": [{ "type": "text", "text": "root:x:0:0:root:/root:/bin/bash\n..." }],
    "isError": false
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assistant final response (stopReason is stop)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "type": "message",
  "id": "a025ab9e",
  "parentId": "81fd9eca",
  "message": {
    "role": "assistant",
    "content": [{ "type": "text", "text": "The content of the file `/etc/passwd` is as follows (excerpt): root:x:0:0:..." }],
    "usage": { "totalTokens": 12741, "cost": { "total": 0.0401 } },
    "stopReason": "stop"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From an audit perspective, the above sample (a round of user → assistant toolCall → toolResult → assistant stop) can already answer several key questions: Who (user) asked the Agent to do what (the read tool reads /etc/passwd), which model the Agent used (claude-4-sonnet), how much it cost ($0.0401), and what the result was (successfully read the content of /etc/passwd).&lt;/p&gt;
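&lt;p&gt;As an illustration, the four entries above can be replayed offline to rebuild exactly this audit chain. The following Python sketch is not part of OpenClaw or SLS; the field names (type, message.role, usage.cost.total) are taken from the samples above and may differ in other versions:&lt;/p&gt;

```python
import json

# Hedged sketch: rebuild the audit chain above from a session .jsonl file.
# Field names (type, message.role, usage.cost.total) follow the samples in
# this article; adjust them if your OpenClaw version emits different keys.
def summarize_session(jsonl_text):
    events, total_cost = [], 0.0
    for line in jsonl_text.splitlines():
        entry = json.loads(line)
        if entry.get("type") != "message":
            continue  # skip non-message entries
        msg = entry["message"]
        # tool calls are embedded as content parts of assistant messages
        tools = [part["name"] for part in msg.get("content", [])
                 if part.get("type") == "toolCall"]
        events.append((msg.get("role"), tools or None))
        total_cost += msg.get("usage", {}).get("cost", {}).get("total", 0.0)
    return events, round(total_cost, 4)
```

&lt;p&gt;Applied to the sample session, this yields the user → assistant toolCall → toolResult → assistant chain and the $0.0401 total.&lt;/p&gt;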

&lt;p&gt;3.2 Connect to Simple Log Service (SLS)&lt;br&gt;
LoongCollector collection configuration&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1x4c6l4vld78jnap9tz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1x4c6l4vld78jnap9tz.jpeg" alt=" " width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SLS index configuration&lt;br&gt;
Configure the following field indexes for the session-audit Logstore in the SLS console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpur06c4sxegz7t6hf14p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpur06c4sxegz7t6hf14p.png" alt=" " width="789" height="684"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F238yd0bdecuz1add21eo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F238yd0bdecuz1add21eo.jpeg" alt=" " width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.3 Audit scenario: Sensitive data leakage detection&lt;br&gt;
After the Agent reads files or executes commands via tools, the returned content is recorded in the toolResult entry. If the returned content contains sensitive data such as API keys, AKs, private keys, or passwords, it means that this data has entered the Agent's context—it may be "remembered" by the model and leaked in subsequent conversations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;type: message and message.role : toolResult 
  | extend content = cast(json_extract(message, '$.content')  as array&lt;span class="nt"&gt;&amp;lt;json&amp;gt;&lt;/span&gt;) 
  | project content | unnest 
  | extend content_type = json_extract_scalar(content, '$.type'), content_text = json_extract_scalar(content, '$.text') 
  | where content_type = 'text' | project content_text 
  | where content_text like '%BEGIN RSA PRIVATE KEY%' or content_text like '%password%' or content_text like '%ACCESS_KEY%' or regexp_like(content_text, 'LTAI[a-zA-Z0-9]{12,20}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
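&lt;p&gt;The same detection logic can also run locally, before (or in addition to) the SLS query. This hedged Python sketch mirrors the patterns in the SPL above; it is an illustration, not an OpenClaw feature:&lt;/p&gt;

```python
import json
import re

# Hedged offline counterpart to the SPL query above: scan toolResult text
# for the same sensitive patterns. The LTAI... regex matches Alibaba Cloud
# AccessKey ID shapes, mirroring the query's regexp_like clause.
PATTERNS = [
    re.compile(r"BEGIN RSA PRIVATE KEY"),
    re.compile(r"password", re.IGNORECASE),
    re.compile(r"ACCESS_KEY"),
    re.compile(r"LTAI[a-zA-Z0-9]{12,20}"),
]

def find_leaks(jsonl_text):
    hits = []
    for line in jsonl_text.splitlines():
        entry = json.loads(line)
        msg = entry.get("message", {})
        if entry.get("type") != "message" or msg.get("role") != "toolResult":
            continue
        for part in msg.get("content", []):
            text = part.get("text", "")
            if part.get("type") == "text" and any(p.search(text) for p in PATTERNS):
                hits.append((msg.get("toolName"), text[:60]))
    return hits
```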



&lt;p&gt;3.4 Audit scenario: Skill invocation audit&lt;br&gt;
When a skill file (such as SKILL.md) is read via the read tool, the call is recorded in the assistant message's content with type: "toolCall", name: "read", and arguments.path. Aggregating by path tells you which skills are invoked, how often, and when they were last used, for compliance and usage analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;type: message and message.role : assistant and message.stopReason : toolUse
  | extend content = cast(json_extract(message, '$.content')  as array&lt;span class="nt"&gt;&amp;lt;json&amp;gt;&lt;/span&gt;) 
  | project content, timestamp | unnest 
  | extend content_type = json_extract_scalar(content, '$.type'), content_name = json_extract_scalar(content, '$.name'), skill_path = json_extract_scalar(content, '$.arguments.path') 
  | project-away content 
  | where content_type = 'toolCall' and content_name = 'read' and skill_path like '%SKILL.md' 
  | stats cnt = count(*), latest_time = max(timestamp) by skill_path | sort cnt desc 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8otdyjv4mq7o8j22zy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8otdyjv4mq7o8j22zy1.png" alt=" " width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.5 Audit scenario: High-risk tool call monitoring&lt;br&gt;
OpenClaw's tool permission system (Tool Policy Pipeline + Owner-only encapsulation) already enforces control at runtime, but the observability layer should monitor independently of runtime protection—if a policy is misconfigured, the observability layer is your last chance to detect it. High-risk tools fall into two categories by scenario.&lt;/p&gt;

&lt;p&gt;Scenario 1: Tools prohibited by default in Gateway HTTP&lt;/p&gt;

&lt;p&gt;When invoked via the gateway's POST /tools/invoke, the following tools are denied by default, either because they are too dangerous or because they cannot complete over a non-interactive HTTP interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdm71by8944ce25xhno7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdm71by8944ce25xhno7.png" alt=" " width="789" height="320"&gt;&lt;/a&gt;&lt;br&gt;
whatsapp_login: interactive flow that requires scanning a QR code in the terminal; over HTTP it would hang with no response&lt;br&gt;
Scenario 2: Tools that require explicit approval from ACP&lt;/p&gt;

&lt;p&gt;ACP (Automation Control Plane) is the automation entry point. The following tools must not pass silently; each call requires explicit approval from the user before execution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdokqviinw0lpsg02k6jg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdokqviinw0lpsg02k6jg.png" alt=" " width="789" height="296"&gt;&lt;/a&gt;&lt;br&gt;
Monitoring invocations of the above tools (and their equivalent names in the logs) in session logs can surface abnormal or unauthorized behavior. If one of these tools is still invoked successfully in the Gateway HTTP scenario, a configuration bypass may exist and needs to be investigated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;type: message and message.role : assistant and message.stopReason : toolUse
  | extend content = cast(json_extract(message, '$.content')  as array&lt;span class="nt"&gt;&amp;lt;json&amp;gt;&lt;/span&gt;) 
  | project content, timestamp | unnest | extend content_type = json_extract_scalar(content, '$.type'), content_name = json_extract_scalar(content, '$.name'), content_arguments = json_extract(content, '$.arguments') 
  | project-away content 
  | where content_type = 'toolCall' and content_name in ('exec', 'write', 'edit', 'gateway', 'whatsapp_login', 'cron', 'sessions_spawn', 'sessions_send', 'spawn', 'shell', 'apply_patch')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
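&lt;p&gt;The same deny-list check is easy to run client-side as a sanity test. This hedged Python sketch flags any assistant toolCall whose name appears on the high-risk list used in the query above (an illustration only, not OpenClaw's own enforcement):&lt;/p&gt;

```python
import json

# Hedged sketch of the same high-risk check done locally: flag any
# assistant toolCall whose name is on the deny list from the query above.
HIGH_RISK = {"exec", "write", "edit", "gateway", "whatsapp_login", "cron",
             "sessions_spawn", "sessions_send", "spawn", "shell", "apply_patch"}

def high_risk_calls(jsonl_text):
    flagged = []
    for line in jsonl_text.splitlines():
        entry = json.loads(line)
        msg = entry.get("message", {})
        if entry.get("type") != "message" or msg.get("role") != "assistant":
            continue
        for part in msg.get("content", []):
            if part.get("type") == "toolCall" and part.get("name") in HIGH_RISK:
                flagged.append((part.get("name"), part.get("arguments")))
    return flagged
```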



&lt;p&gt;3.6 Audit Scenario: Cost Attribution&lt;br&gt;
Each assistant message carries usage (totalTokens, input, output, cacheRead, and cacheWrite) along with provider and model. Aggregating totalTokens by provider and model answers "where the usage goes." If the upstream provides usage.cost.total, you can aggregate it the same way for cost attribution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;type: message and message.role : assistant 
  | stats totalTokens= sum(cast("message.usage.totalTokens" as BIGINT)), inputTokens= sum(cast("message.usage.input" as BIGINT)), outputTokens= sum(cast("message.usage.output" as BIGINT)), cacheReadTokens= sum(cast("message.usage.cacheRead" as BIGINT)), cacheWriteTokens= sum(cast("message.usage.cacheWrite" as BIGINT)) by "message.provider", "message.model"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
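&lt;p&gt;As a local cross-check of the stats query above, the same aggregation can be sketched in a few lines of Python (field names follow the session log samples in this article):&lt;/p&gt;

```python
import json
from collections import defaultdict

# Hedged local equivalent of the stats query above: sum token usage per
# (provider, model) pair. Usage field names follow the session log samples.
def usage_by_model(jsonl_text):
    totals = defaultdict(lambda: defaultdict(int))
    fields = ("totalTokens", "input", "output", "cacheRead", "cacheWrite")
    for line in jsonl_text.splitlines():
        entry = json.loads(line)
        msg = entry.get("message", {})
        if entry.get("type") != "message" or msg.get("role") != "assistant":
            continue
        usage = msg.get("usage", {})
        key = (msg.get("provider"), msg.get("model"))
        for field in fields:
            totals[key][field] += int(usage.get(field) or 0)
    return totals
```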



&lt;p&gt;4. O&amp;amp;M Observation: Application Logs&lt;br&gt;
Application logs play a different role from session logs. Session logs record Agent actions (audit-oriented), while application logs record system health (O&amp;amp;M-oriented)—did the gateway start normally? Did a webhook report errors? Is the message queue backing up?&lt;/p&gt;

&lt;p&gt;4.1 Data Format&lt;br&gt;
OpenClaw Gateway uses the tslog library to write structured JSONL logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "0": "{\"subsystem\":\"gateway/channels/telegram\"}",
  "1": "webhook processed chatId=123456 duration=2340ms",
  "_meta": {
    "logLevelName": "INFO",
    "date": "2026-02-27T10:00:05.123Z",
    "name": "openclaw",
    "path": {
      "filePath": "src/telegram/webhook.ts",
      "fileLine": "142"
    }
  },
  "time": "2026-02-27T10:00:05.123Z"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key fields:&lt;br&gt;
● _meta.logLevelName: log level (TRACE / DEBUG / INFO / WARN / ERROR / FATAL)&lt;br&gt;
● _meta.path: source file path and line number, for pinpointing the log's origin&lt;br&gt;
● Numeric key "0": JSON-formatted bindings, usually containing subsystem (such as gateway/channels/telegram)&lt;br&gt;
● Numeric key "1" and subsequent keys: log message text&lt;/p&gt;

&lt;p&gt;Log files rotate daily (openclaw-YYYY-MM-DD.log), are cleaned up automatically every 24 hours, and are capped at 500 MB per file.&lt;/p&gt;
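&lt;p&gt;Because the bindings under key "0" are a JSON string nested inside JSON, a small parsing step is needed before the subsystem can be used as a dimension. A hedged sketch of that parsing, using the field names from the sample above:&lt;/p&gt;

```python
import json

# Hedged parser for the tslog shape above: the key "0" holds a JSON string
# of bindings (including subsystem), "1" holds the message text, and _meta
# carries the level and source location.
def parse_tslog_line(raw):
    rec = json.loads(raw)
    bindings = json.loads(rec.get("0", "{}"))  # nested JSON string
    return {
        "level": rec["_meta"]["logLevelName"],
        "subsystem": bindings.get("subsystem"),
        "message": rec.get("1", ""),
        "file": rec["_meta"].get("path", {}).get("filePath"),
    }
```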

&lt;p&gt;4.2 Ingest into Simple Log Service&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frceakh8joxvq91aajfxk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frceakh8joxvq91aajfxk.png" alt=" " width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For indexes, it is recommended to establish field indexes for _meta.logLevelName, _meta.date, _meta.path.filePath, "0" (subsystem bindings), and "1" (message text).&lt;/p&gt;

&lt;p&gt;4.3 Fault Dashboard by Subsystem&lt;br&gt;
Aggregating application logs by severity (WARN, ERROR, FATAL) and subsystem makes it easy to see which component each class of error concentrates in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;_meta.logLevelName: ERROR or _meta.logLevelName: WARN or _meta.logLevelName: FATAL
  | project subsystem = "0.subsystem", loglevel = "_meta.logLevelName" 
  | stats cnt = count(1) by loglevel, subsystem 
  | sort loglevel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftetouu9hw3dt4wcypb90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftetouu9hw3dt4wcypb90.png" alt=" " width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4.4 Typical Security Audit Scenarios and Log Samples&lt;br&gt;
Scenario 1: WebSocket unauthorized connection (unauthorized)&lt;/p&gt;

&lt;p&gt;Security audit value: When a WebSocket connection is denied during the authentication phase, a WARN log is generated, which helps you discover unauthorized access caused by incorrect, expired, or forged tokens. Key points when auditing:&lt;br&gt;
● subsystem: gateway/ws indicates that the log comes from the WS layer.&lt;br&gt;
● In the message, conn= is the connection ID, remote= is the client IP, client= is the client ID (such as openclaw-control-ui or webchat), and reason=token_mismatch indicates a token mismatch (expired, incorrect, or forged).&lt;br&gt;
● If the same remote triggers many token_mismatch rejections within a short period, it may be a dictionary attack or a credential-theft attempt. If a known legitimate client keeps failing, the issue is more likely configuration or token rotation, and should be investigated from the O&amp;amp;M side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "0": "{\"subsystem\":\"gateway/ws\"}",
  "1": "unauthorized conn=e32bf86b-c365-4669-a496-5a0be1b91694 remote=127.0.0.1 client=openclaw-control-ui webchat vdev reason=token_mismatch",
  "_meta": { "logLevelName": "WARN", "date": "2026-02-27T07:46:20.727Z" },
  "time": "2026-02-27T07:46:20.728Z"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
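&lt;p&gt;The "many rejections from one remote" heuristic is simple enough to sketch directly. The regex below is an illustration keyed to the message format of the sample above; the threshold of 5 is an arbitrary placeholder, not a recommended value:&lt;/p&gt;

```python
import re
from collections import Counter

# Hedged brute-force heuristic for this scenario: count token_mismatch
# rejections per remote IP and flag sources at or above a threshold.
# The remote=/reason= fields follow the WARN log sample above.
LINE_RE = re.compile(r"unauthorized .*remote=(\S+).*reason=token_mismatch")

def suspicious_remotes(messages, threshold=5):
    counts = Counter()
    for msg in messages:
        m = LINE_RE.search(msg)
        if m:
            counts[m.group(1)] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}
```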



&lt;p&gt;Scenario 2: HTTP tool calling denied or execution failed&lt;/p&gt;

&lt;p&gt;Security audit value: Failure and alert logs for POST /tools/invoke reveal who is attempting to execute prohibited tools or triggering permission or sandbox exceptions during execution. Key points when auditing:&lt;br&gt;
● subsystem: tools-invoke lets you quickly filter such events.&lt;br&gt;
● The exception type in the message (such as EACCES, ENOENT, or a path) distinguishes "unauthorized access to a sensitive path" from "configuration or path errors." For example, "open '/etc/shadow'" in the sample below clearly points to an attempt to read a sensitive file; combine this with session logs to identify the caller.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "0": "{\"subsystem\":\"tools-invoke\"}",
  "1": "tool execution failed: Error: EACCES: permission denied, open '/etc/shadow'",
  "_meta": { "logLevelName": "WARN", "date": "2026-02-27T10:00:07.000Z" },
  "time": "2026-02-27T10:00:07.000Z"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scenario 3: Connection or Request processing failed&lt;/p&gt;

&lt;p&gt;Security audit value: Connection resets and parsing errors can expose abnormal client behavior, malformed requests, or man-in-the-middle interference. Key points when auditing:&lt;br&gt;
● subsystem: gateway indicates the log comes from the gateway core (WS or request processing).&lt;br&gt;
● "request handler failed: Connection reset by peer" is usually peer disconnection or a network interruption; check whether the errors occur in bursts by time or conn (suspected scanning or DoS).&lt;br&gt;
● "parse/handle error: Invalid JSON" means the request body is invalid—possibly a maliciously crafted malformed payload or a compatibility issue. Many such errors from the same source within a short period should be triaged first as a possible attack or abnormal client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "0": "{\"subsystem\":\"gateway\"}",
  "1": "request handler failed: Connection reset by peer",
  "_meta": { "logLevelName": "ERROR", "date": "2026-02-27T10:00:08.000Z" },
  "time": "2026-02-27T10:00:08.000Z"
}

{
  "0": "{\"subsystem\":\"gateway\"}",
  "1": "parse/handle error: Invalid JSON",
  "_meta": { "logLevelName": "ERROR", "date": "2026-02-27T10:00:08.100Z" },
  "time": "2026-02-27T10:00:08.100Z"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scenario 4: Security audit category (device access upgrade, etc.)&lt;/p&gt;

&lt;p&gt;Security audit value: Device pairing and permission upgrades leave an audit trail of who upgraded from what role or permission to what role or permission, from which IP, and with what authentication type. Focus on the structured fields in the message:&lt;br&gt;
● reason=role-upgrade indicates the event is triggered by a role promotion.&lt;br&gt;
● device= is the device ID; ip= is the client IP, which you can compare against known management IPs.&lt;br&gt;
● roleFrom=[] roleTo=owner indicates an upgrade from no role to owner—a highly sensitive operation. auth=token is the authentication type used.&lt;br&gt;
● If the same IP or device upgrades frequently outside working hours, or the number of roleTo=owner entries grows abnormally, investigate possible unauthorized access or account compromise first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "0": "{\"subsystem\":\"gateway\"}",

  "1": "security audit: device access upgrade requested reason=role-upgrade device=abc-123 ip=192.168.1.1 auth=token roleFrom=[ ] roleTo=owner scopesFrom=[ ] scopesTo=[...] client=control conn=conn-1",

  "_meta": { "logLevelName": "WARN", "date": "2026-02-27T10:00:09.000Z" },
  "time": "2026-02-27T10:00:09.000Z"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scenario 5: FATAL and core exceptions&lt;/p&gt;

&lt;p&gt;Security audit value: FATAL means core features are unavailable—possibly due to tampered configuration, dependency failures, or critical runtime errors—and you should immediately check whether intrusion or misconfiguration is involved. When auditing: filter _meta.logLevelName = 'FATAL' in the error dashboard, and combine subsystem with the message in "1" to locate the failing component and cause. If a FATAL log contains keywords such as "bind", "config", or "listen", prioritize checking the exposed surface and configuration consistency. Configure real-time alerting (for example, evaluate every minute, fire when cnt &amp;gt; 0, and push to DingTalk or text message) to ensure an immediate response.&lt;/p&gt;
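&lt;p&gt;To make the alert rule concrete, here is a hedged sketch of the per-window condition (any FATAL fires; an ERROR burst threshold is an added assumption, not from the article). In practice the rule lives in SLS alerting; this only illustrates the logic:&lt;/p&gt;

```python
from collections import Counter

# Hedged sketch of the alert rule discussed above: fire on any FATAL in
# the evaluation window, or on an ERROR burst above a (hypothetical)
# threshold. Real alerting should be configured in SLS itself.
def evaluate_window(records, error_burst=10):
    levels = Counter(r.get("_meta", {}).get("logLevelName") for r in records)
    fire = levels["FATAL"] > 0 or levels["ERROR"] >= error_burst
    return fire, dict(levels)
```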

&lt;p&gt;5. Real-time Monitoring and Alerting: OTEL Telemetry&lt;br&gt;
Session logs and application logs capture discrete events and audit trails, which suit conditional retrieval and after-the-fact attribution. To track aggregate metrics, trends, and request traces (such as cost and usage trends, session health, and the duration and dependencies of a single request), you need OpenTelemetry (OTEL) metrics (counters, histograms, gauges) and traces (distributed traces, latency, and invocation relationships). Together with logs, they form the complete "logs + metrics + traces" observability capability.&lt;/p&gt;

&lt;p&gt;5.1 Access Simple Log Service (SLS)&lt;br&gt;
OpenClaw ships a built-in diagnostics-otel plugin (version 26.2.19 or later) that exports metrics, traces, and logs via the OpenTelemetry Protocol (OTLP) over HTTP (Protobuf).&lt;/p&gt;

&lt;p&gt;Enable the plugin&lt;br&gt;
Run openclaw plugins enable diagnostics-otel to enable the plugin, then verify with openclaw plugins list; the expected status is loaded.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdz1c02qhrqekf2j96r94.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdz1c02qhrqekf2j96r94.png" alt=" " width="800" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Configure ~/.openclaw/openclaw.json&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "plugins": {
    "allow": ["diagnostics-otel"],
    "entries": {
      "diagnostics-otel": { "enabled": true }
    }
  },
  "diagnostics": {
    "enabled": true,
    "otel": {
      "enabled": true,
      "endpoint": "http://127.0.0.1:4318",
      "protocol": "http/protobuf",
      "serviceName": "openclaw-gateway",
      "traces": true,
      "metrics": true,
      "logs": true,
      "sampleRate": 1,
      "flushIntervalMs": 60000
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create collection configuration&lt;br&gt;
In the SLS console, create the logstores otlp-logs and otlp-traces and the metricstore otlp-metrics, then create the corresponding collection configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
    "aggregators": [
        {
            "detail": {},
            "type": "aggregator_opentelemetry"
        }
    ],
    "inputs": [
        {
            "detail": {
                "Protocals": {
                    "HTTP": {
                        "Endpoint": "127.0.0.1:4318",
                        "ReadTimeoutSec": 10,
                        "ShutdownTimeoutSec": 5,
                        "MaxRecvMsgSizeMiB": 64
                    },
                    "GRPC": {
                        "MaxConcurrentStreams": 100,
                        "Endpoint": "127.0.0.1:4317",
                        "ReadBufferSize": 1024,
                        "MaxRecvMsgSizeMiB": 64,
                        "WriteBufferSize": 1024
                    }
                }
            },
            "type": "service_otlp"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5.2 What data is exported&lt;br&gt;
To meet observability requirements such as "usage and cost," "entry stability," and "queue and session health," OpenClaw exports Metrics and Traces via OTEL. The following sections describe them by requirement, with detail tables (metric name, type, and function).&lt;/p&gt;

&lt;p&gt;Cost and usage metrics&lt;br&gt;
These metrics relate directly to Large Language Model (LLM) invocation costs and are the core of cost control. By monitoring token consumption, estimated fees, run duration, and context usage, you can track the cost of each model invocation and spot waste caused by misconfiguration or inefficient use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr65qw53dg9julr4iep0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr65qw53dg9julr4iep0.png" alt=" " width="789" height="283"&gt;&lt;/a&gt;&lt;br&gt;
openclaw.cost.usd generates data only when the upstream model.usage management event provides costUsd.&lt;/p&gt;
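&lt;p&gt;Since openclaw.cost.usd is populated only when the upstream event supplies costUsd, a fallback estimate from token counts can be useful. The sketch below is a minimal illustration; the model names and per-1K-token prices are invented placeholders, not real provider pricing.&lt;/p&gt;

```python
# Hedged sketch: estimate LLM cost from token counts when the upstream
# model.usage event does not supply costUsd. The prices below are made-up
# placeholders, not real provider pricing.
PRICE_PER_1K = {
    # (input_usd, output_usd) per 1,000 tokens -- hypothetical values
    "example-model-a": (0.003, 0.015),
    "example-model-b": (0.0005, 0.0015),
}

def estimate_cost_usd(model, input_tokens, output_tokens):
    """Return an estimated cost in USD, or None for unknown models."""
    prices = PRICE_PER_1K.get(model)
    if prices is None:
        return None
    in_price, out_price = prices
    return (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price

print(estimate_cost_usd("example-model-a", 2000, 1000))  # ~0.021
```

An estimate like this can backfill the cost dashboard for providers that never report costUsd, as long as the pricing table is kept up to date.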

&lt;p&gt;Webhook processing metrics&lt;br&gt;
Webhooks are an important entry point for OpenClaw to interact with external systems. By monitoring the number of received requests, failure counts, and processing duration, you can detect abnormal external invocations in time and keep integrations stable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffur32wpz1i4wlmgpz0um.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffur32wpz1i4wlmgpz0um.png" alt=" " width="789" height="222"&gt;&lt;/a&gt;&lt;br&gt;
Message queue metrics&lt;br&gt;
The message queue is a transit hub for job processing. By tracking enqueue/dequeue counts, queue depth, and wait time, you can determine whether the system is congested or jobs are backlogged, which helps you adjust resources or locate bottlenecks.&lt;/p&gt;
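&lt;p&gt;As a minimal illustration of the backlog reasoning above, the following sketch flags lanes whose enqueue rate exceeds their dequeue rate; the lane names and rates are invented sample values, not real metric output.&lt;/p&gt;

```python
# Hedged sketch: a lane whose enqueue rate exceeds its dequeue rate is
# accumulating a backlog. In production these rates would come from the
# openclaw queue metrics; the values here are invented for illustration.

def backlog_growth(enqueued_per_s: float, dequeued_per_s: float) -> float:
    """Net queue growth in jobs/second; positive means the lane is falling behind."""
    return enqueued_per_s - dequeued_per_s

# lane -> (enqueue rate, dequeue rate), jobs per second
lane_rates = {"chat": (12.0, 9.5), "batch": (3.0, 3.2)}
congested = sorted(lane for lane, (enq, deq) in lane_rates.items()
                   if backlog_growth(enq, deq) > 0)
print(congested)  # ['chat']
```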

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lgs0vttp6cqssbahbr5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lgs0vttp6cqssbahbr5.png" alt=" " width="789" height="394"&gt;&lt;/a&gt;&lt;br&gt;
Session management metrics&lt;br&gt;
Changes in session status and the number of stuck sessions reflect interaction health. Monitoring metrics such as stuck sessions and retries lets you quickly spot conversations trapped in infinite loops or abnormal states, improving troubleshooting efficiency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7i5egepboil1tufbl6ii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7i5egepboil1tufbl6ii.png" alt=" " width="789" height="259"&gt;&lt;/a&gt;&lt;br&gt;
Trace Span&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxecx0lgfxtaypfw8usaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxecx0lgfxtaypfw8usaj.png" alt=" " width="789" height="235"&gt;&lt;/a&gt;&lt;br&gt;
5.3 Data value analysis&lt;br&gt;
Scenario: Usage and cost distribution&lt;br&gt;
Answer: Which models and providers account for most of the usage and spend? Is the recent token consumption trend normal, or is there a sudden surge? How does cumulative usage rank by model or provider? When token growth is abnormal, analyze further with session logs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Token consumption growth rate (alerts can be set: such as exceeding N tokens/min)sum(rate(openclaw_tokens[10m]))

# Token consumption trend (by model)
sum(rate(openclaw_tokens[5m])) by (openclaw_model)

# Cumulative Tokens (by Provider)
sum(openclaw_tokens) by (openclaw_provider)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0oj17r2vdrtaznzicce.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0oj17r2vdrtaznzicce.jpeg" alt=" " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scenario: Stuck sessions and overlong executions&lt;br&gt;
Answer: Are there currently stuck sessions or sessions making no progress? How often do they occur, and in which time periods? Does single-run Agent execution duration (P95/P99) exceed expectations, or are there long tails?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Stuck sessions (Alert: &amp;gt; 0)sum(rate(openclaw_session_stuck[5m]))

# Execution duration P95 (Alert: such as &amp;gt; 5 minutes)
histogram_quantile(0.95, sum(rate(openclaw_run_duration_ms_bucket[5m])) by (le))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
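&lt;p&gt;For readers unfamiliar with histogram_quantile, the sketch below re-implements its core idea (linear interpolation inside the bucket containing the target rank) over Prometheus-style cumulative buckets. The bucket bounds and counts are invented; the real openclaw_run_duration_ms buckets may differ.&lt;/p&gt;

```python
import bisect

# Hedged sketch of what histogram_quantile(0.95, ...) computes from
# cumulative "le" buckets. Bounds and counts below are invented examples.

def histogram_quantile(q, bounds, cumulative_counts):
    """Estimate the q-quantile by linear interpolation inside the bucket
    containing the target rank, mirroring the Prometheus approach."""
    total = cumulative_counts[-1]
    target = q * total
    i = bisect.bisect_left(cumulative_counts, target)  # bucket holding the rank
    lower = bounds[i - 1] if i else 0.0
    prev = cumulative_counts[i - 1] if i else 0
    width = bounds[i] - lower
    return lower + width * (target - prev) / (cumulative_counts[i] - prev)

# "le" bounds in ms; cumulative counts: 60 runs finished under 100 ms,
# 90 under 500 ms, and all 100 under 1000 ms.
print(histogram_quantile(0.95, [100, 500, 1000], [60, 90, 100]))  # 750.0
```

Because the estimate interpolates within a bucket, coarse buckets give coarse quantiles; choosing bucket bounds near your alert thresholds keeps P95/P99 alerts meaningful.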



&lt;p&gt;Scenario: Webhook error rate and processing latency&lt;br&gt;
Answer: What are the request volume and failure counts of webhooks for each channel, and is the error rate within an acceptable range? Have the quantiles (P95/P99) of single webhook processing duration and Agent execution duration deteriorated? How does the latency distribution differ by channel or by model? When the error rate or latency is abnormal, search application logs by the webhook subsystem for the specific fault.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Webhook Error Rate (Alert: such as &amp;gt; 5%)sum(rate(openclaw_webhook_error[5m])) / sum(rate(openclaw_webhook_received[5m]))

# Execution duration P99 (by model)histogram_quantile(0.99, sum(rate(openclaw_run_duration_ms_bucket[5m])) by (le, openclaw_model))

# Webhook processing duration P95 (by channel)histogram_quantile(0.95, sum(rate(openclaw_webhook_duration_ms_bucket[5m])) by (le, openclaw_channel))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
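&lt;p&gt;The 5% error-rate check in the first query can be sketched on plain counts as follows; the counts are invented, and in practice the division runs inside the query itself.&lt;/p&gt;

```python
# Hedged sketch: the webhook error-rate alert evaluated on raw counts.
# The counts are invented sample values for illustration only.

def webhook_error_rate(errors: int, received: int) -> float:
    """Fraction of failed webhook requests; 0.0 when nothing was received."""
    return 0.0 if received == 0 else errors / received

rate = webhook_error_rate(3, 100)
print(rate, rate > 0.05)  # 0.03 False -- below the 5% alert threshold
```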



&lt;p&gt;Scenario: Queue backlog and wait time&lt;br&gt;
Answer: Are the depth and enqueue/dequeue rates of each queue lane healthy? Is job wait time in the queue (P95/P99) lengthening, or is a backlog trending upward? Which lanes are most prone to congestion? Answering these helps you detect bottlenecks and adjust resources before user experience deteriorates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Queue depth (by lane)histogram_quantile(0.95, sum(rate(openclaw_queue_depth_bucket[5m])) by (le, openclaw_lane))

# Queue wait time P95 (by lane)
histogram_quantile(0.95, sum(rate(openclaw_queue_wait_ms_bucket[5m])) by (le, openclaw_lane))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;6.Multi-source interaction: Composite troubleshooting flow&lt;br&gt;
The previous sections demonstrated the independent value of each data pipeline. However, what truly embodies "keeping the Agent running under control" is how the multiple observability pipelines work together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkifie2yisvnxcqb5oaxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkifie2yisvnxcqb5oaxn.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key to this flow is that each data pipeline answers questions at a different layer, and none is dispensable:&lt;/p&gt;

&lt;p&gt;● Only OTEL without session logs: You know that cost is soaring, but not who caused it or what was done.&lt;br&gt;
● Only session logs without OTEL: You can audit behavior but cannot perceive overall system status.&lt;br&gt;
● Only application logs: You can see system errors but not the Agent's business behavior.&lt;/p&gt;

&lt;p&gt;7.Summary&lt;br&gt;
To answer "Is your OpenClaw truly running under control?", you need to answer four questions simultaneously: who is triggering the invocation, how much it costs, what operations were performed (especially high-risk tools), and whether the behavior is traceable and auditable. Relying solely on runtime protection (tool policies, loop detection, and so on) is insufficient to claim control. You must establish a continuous observability system and use data to answer the above questions.&lt;/p&gt;

&lt;p&gt;Based on Alibaba Cloud Simple Log Service (SLS), this topic unifies OpenClaw's three types of observable data—Session audit logs, application logs, and OTEL metrics and traces—into SLS to form a complete "logs + metrics + traces" capability. Session logs answer "What did the Agent do and how much did it cost." Application logs answer "Where is the system abnormal." OTEL answers "Current status and duration." By using LoongCollector file collection and OTLP direct ingestion, the system achieves a one-stop closed loop of collection, indexing, query, dashboard, and alerting. It also utilizes the audit, permission, and masking capabilities of SLS to meet compliance requirements.&lt;/p&gt;

&lt;p&gt;In practice, the three data pipelines should be used collaboratively. OTEL alerting detects anomalies. Application logs are used to narrow down the scope and locate the subsystem and session. Then, Session logs are used to reconstruct the complete behavior chain and take response measures. Only through the interaction of the three sources can a verifiable audit and O&amp;amp;M closed loop be formed—from "there is an anomaly" to "where the problem is" and then to "what exactly the Agent did"—truly allowing the Agent to run under control.&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Announcing One-time File Collection for LoongCollector</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Wed, 01 Apr 2026 01:56:45 +0000</pubDate>
      <link>https://dev.to/observabilityguy/announcing-one-time-file-collection-for-loongcollector-52bd</link>
      <guid>https://dev.to/observabilityguy/announcing-one-time-file-collection-for-loongcollector-52bd</guid>
      <description>&lt;p&gt;This article introduces LoongCollector’s new one time file collection feature for fast, reliable, and automated batch ingestion of historical or static files.&lt;br&gt;
Have you ever encountered such a scenario: You need to quickly migrate history logs, backfill data, or process a batch of static files, but are troubled by the inconvenience of traditional collection tools that "monitor constantly and only collect incremental data"? The one-time file collection launched by LoongCollector is a solution tailored for this type of requirement.&lt;/p&gt;

&lt;p&gt;LoongCollector is a next-generation data collector launched by Alibaba Cloud Simple Log Service (SLS) that combines performance, stability, and programmability, and is designed to build the next-generation observability pipeline. LoongCollector extends and integrates the observability technology stack, breaking the single-scenario limitation of traditional log collectors, and supports the collection, processing, routing, and sending of Logs, Metrics, Traces, Events, and Profiles.&lt;/p&gt;

&lt;p&gt;Commercial version: &lt;a href="https://www.alibabacloud.com/help/en/sls/what-is-sls-loongcollector/" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/help/en/sls/what-is-sls-loongcollector/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open source version: &lt;a href="https://github.com/alibaba/loongcollector" rel="noopener noreferrer"&gt;https://github.com/alibaba/loongcollector&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unlike regular continuous collection, a one-time file collection configuration scans the matching files once after it starts, finishes reading them, and then ends automatically, without manual supervision. It applies to scenarios such as historical file migration, data backfilling, and temporary batch processing, saving resources while ensuring complete data upload.&lt;/p&gt;

&lt;p&gt;1.Stable, controllable, and traceable cloud-based batch automated data collection&lt;br&gt;
Before the one-time file collection capability was released, LoongCollector (and its predecessor iLogtail) already provided a "history file collection" solution (reference: Import history logs). Compared with the old solution, the new one-time configuration is simpler and faster, offers stronger batch processing and a clearer lifecycle, and improves stability and observability through finer-grained checkpoints.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwec2l84ydkc8mnz01of.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwec2l84ydkc8mnz01of.png" alt=" " width="800" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The new version of one-time collection upgrades static data collection from "standalone manual operation" to "cloud-based batch automation," making it more stable, controllable, and traceable. How are these advantages specifically realized? Let us introduce them one by one below.&lt;/p&gt;

&lt;p&gt;1.1 Understanding the execution logic&lt;br&gt;
1.1.1 One-time collection configuration&lt;br&gt;
What is "one-time" collection configuration?&lt;br&gt;
The collection pipelines of LoongCollector can be divided into two categories:&lt;/p&gt;

&lt;p&gt;● Continuous: Runs constantly, continuously discovers and collects new content (typically such as input_file).&lt;/p&gt;

&lt;p&gt;● One-time: Executes only once after starting, and ends after the collection is completed (typically such as input_static_file_onetime).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0an4jm9ioo41oqg2lx8y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0an4jm9ioo41oqg2lx8y.png" alt=" " width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scenarios for the two types of pipelines can be summarized as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhn3i9qhvqtf9n70ayvi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhn3i9qhvqtf9n70ayvi4.png" alt=" " width="789" height="161"&gt;&lt;/a&gt;&lt;br&gt;
How to identify a one-time collection configuration&lt;br&gt;
On the client side, the switch that marks a pipeline as one-time is global.ExecutionTimeout.&lt;/p&gt;

&lt;p&gt;● When global.ExecutionTimeout exists in the configuration, LoongCollector treats the pipeline as one-time and computes its time-to-live (TTL).&lt;/p&gt;

&lt;p&gt;● In addition to global.ExecutionTimeout, the inputs plugin must also be a one-time input plugin (usually with a name ending in _onetime); otherwise, the configuration does not take effect. In this topic, we use the input_static_file_onetime plugin to perform one-time file collection.&lt;/p&gt;

&lt;p&gt;The comparison sample is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Normal file collection
enable: true
inputs:
  - Type: input_file
    FilePaths:
      - /var/log/*.log
flushers:
  - Type: flusher_stdout
    OnlyStdout: true
    Tags: true

# One-time file collection
enable: true
global:
  ExecutionTimeout: 3600
inputs:
  - Type: input_static_file_onetime
    FilePaths:
      - /var/log/history/*.log
flushers:
  - Type: flusher_stdout
    OnlyStdout: true
    Tags: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Execution window and expiration mechanism of one-time collection&lt;br&gt;
To understand the one-time collection pipeline end to end, align the configuration lifecycle on the server/console side with the execution and reliability mechanisms on the client side:&lt;/p&gt;

&lt;p&gt;● Server-side/Console side: Decides when the configuration is distributed and how long the configuration is retained (impacting "which machines can obtain the configuration and how long the machines can obtain the configuration").&lt;/p&gt;

&lt;p&gt;● Client side: Decides how the configuration runs after the configuration is obtained, how long the configuration runs, and how to resume collection from breakpoints (impacting "whether the collection can be completed, and whether data is missed or duplicated after a restart").&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0b3s1hqm05qrhid9z27v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0b3s1hqm05qrhid9z27v.png" alt=" " width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Server side: Distribution window, execution window, and retention period&lt;br&gt;
One-time collection configurations usually contain three key time points on the console side:&lt;/p&gt;

&lt;p&gt;● Configuration distribution window: The configuration is distributed only to machines that have reported heartbeats within a period after configuration creation (5 minutes; updating the configuration refreshes the window).&lt;br&gt;
● Configuration execution window: After the configuration takes effect, the maximum time it is allowed to run is its execution timeout (that is, global.ExecutionTimeout; default: 10 minutes; range: 10 minutes to 1 week).&lt;br&gt;
● Configuration retention period: The server retains the configuration for a period (7 days) for tracing or reuse.&lt;br&gt;
If machines are added to a group after configuration creation, they may miss the initial distribution window. When the data volume is large, increase ExecutionTimeout in advance so that collection is not cut off by the execution window limit before it completes.&lt;/p&gt;
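&lt;p&gt;The distribution-window rule can be sketched as a simple interval check; the exact eligibility logic inside SLS is an assumption here, shown only to make the 5-minute window concrete.&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Hedged sketch of the distribution-window rule: a machine receives a
# one-time configuration only if it reported a heartbeat within 5 minutes
# after the configuration was created (updating refreshes the window).
# The precise server-side check is an assumption for illustration.

DISTRIBUTION_WINDOW = timedelta(minutes=5)

def within_window(created_at: datetime, heartbeat_at: datetime) -> bool:
    """True if the heartbeat falls inside the post-creation distribution window."""
    return created_at <= heartbeat_at <= created_at + DISTRIBUTION_WINDOW

created = datetime(2026, 4, 1, 12, 0, 0)
print(within_window(created, created + timedelta(minutes=3)))  # True
print(within_window(created, created + timedelta(minutes=8)))  # False
```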

&lt;p&gt;Client side: Execution and expiration&lt;br&gt;
● Timeout range and default value: global.ExecutionTimeout is specified in seconds and limited to the range 600 to 604,800 (10 minutes to 1 week).&lt;br&gt;
● Expiration behavior: For one-time configurations, the client computes and records the expiration time (start + ExecutionTimeout). When the configuration expires, the client cleans up the expired configuration file and removes its status record.&lt;br&gt;
● Whether an update triggers a rerun (to avoid erroneous or duplicate collection): When a one-time configuration is updated, the client decides as follows. If global.ForceRerunWhenUpdate is true, any change forces a rerun. If it is false (the default), the client reruns only when the hash of inputs or ExecutionTimeout has changed; if neither changed, it keeps the original expiration time and does not rerun. Otherwise, the client treats the configuration as a new one-time configuration.&lt;br&gt;
One design goal of one-time collection is to avoid duplicate execution of the same configuration, so the update policy aims for controllable reruns.&lt;/p&gt;
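&lt;p&gt;The rerun decision can be sketched as follows. The field names mirror this article (ForceRerunWhenUpdate, inputs, ExecutionTimeout), but the hashing detail is an illustrative assumption, not LoongCollector's exact implementation.&lt;/p&gt;

```python
import hashlib
import json

# Hedged sketch of the rerun decision described above: force a rerun when
# ForceRerunWhenUpdate is true; otherwise rerun only if the hash over
# inputs + ExecutionTimeout changed. Hashing details are illustrative.

def config_key(config):
    """Hash only the parts whose change should trigger a rerun."""
    relevant = {"inputs": config["inputs"],
                "timeout": config["global"]["ExecutionTimeout"]}
    return hashlib.sha256(json.dumps(relevant, sort_keys=True).encode()).hexdigest()

def should_rerun(old, new):
    if new["global"].get("ForceRerunWhenUpdate", False):
        return True  # force rerun on any change
    return config_key(old) != config_key(new)

old = {"global": {"ExecutionTimeout": 3600},
       "inputs": [{"Type": "input_static_file_onetime",
                   "FilePaths": ["/var/log/history/*.log"]}]}
new = json.loads(json.dumps(old))            # deep copy via JSON round trip
new["flushers"] = [{"Type": "flusher_stdout"}]  # unrelated change
print(should_rerun(old, new))  # False: inputs and ExecutionTimeout unchanged
```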

&lt;p&gt;1.1.2 One-time file collection&lt;br&gt;
"Snapshot Semantics" of one-time file collection&lt;br&gt;
The core semantics of input_static_file_onetime can be summarized in three points:&lt;/p&gt;

&lt;p&gt;● Scan for files once at startup: The client scans the matching paths at startup and solidifies the list of files matching at that moment into the checkpoint. Files added later are not included in the current collection target.&lt;br&gt;
● Read only up to the file size at the startup moment: Each file records an initial size. Even if the file is appended to during collection, the client reads only up to that initial size (avoiding uncontrollable duplication or missed data caused by reading while writing).&lt;br&gt;
● Support repositioning after rotation: The file fingerprint contains dev, inode, sig_hash, and sig_size; sig_hash and sig_size come from a signature over up to the first 1,024 bytes of the file. When rotation changes the path, the client searches the folder by dev+inode and continues reading, avoiding missed data as much as possible.&lt;/p&gt;

&lt;p&gt;Reliability of one-time file collection (checkpoint mechanism)&lt;br&gt;
One-time file collection records configuration-level status and file-level progress through checkpoints, supporting restart, upgrade, and abnormal recovery while avoiding duplicate collection as much as possible.&lt;/p&gt;
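&lt;p&gt;The snapshot semantics can be sketched in a few lines: record the size and a head-of-file signature at scan time, then never read past the recorded size even if the file grows. This is an illustration of the behavior described above, not LoongCollector's actual code.&lt;/p&gt;

```python
import hashlib
import os
import tempfile

# Hedged sketch of snapshot semantics: freeze a file's size and a signature
# over its first 1024 bytes at scan time (mirroring sig_hash/sig_size), then
# read only up to that size even if the file is appended to later.

def snapshot(path):
    st = os.stat(path)
    with open(path, "rb") as f:
        head = f.read(1024)
    return {"size": st.st_size,
            "sig_size": len(head),
            "sig_hash": hashlib.sha1(head).hexdigest()}

def collect(path, snap):
    with open(path, "rb") as f:
        return f.read(snap["size"])  # never read past the snapshot size

with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".log") as f:
    f.write(b"line-1\n")
    path = f.name
snap = snapshot(path)
with open(path, "ab") as f:
    f.write(b"appended-later\n")     # arrives after the scan
print(collect(path, snap))           # b'line-1\n' -- appended bytes are ignored
os.unlink(path)
```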

&lt;p&gt;Configuration-level checkpoint&lt;br&gt;
This file records the core information of the one-time configuration (such as config_hash, expire_time, inputs_hash, and execution_timeout) and is used after a restart to recover the configuration's time-to-live (TTL) and to evaluate the update policy. The path is usually /etc/ilogtail/checkpoint/onetime_config_info.json.&lt;/p&gt;

&lt;p&gt;File-level checkpoint&lt;br&gt;
This file records the execution progress of one-time file collection and the status of each file. The path is usually located at: /etc/ilogtail/checkpoint/input_static_file/{config_name}@0.json.&lt;/p&gt;

&lt;p&gt;Field description (aligned with the actual stored JSON):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvobgsl8gvuqr3z24ef66.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvobgsl8gvuqr3z24ef66.png" alt=" " width="787" height="949"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "config_name" : "xxxx",
  "expire_time" : 1768550944,
  "file_count" : 1,
  "files" : 
    [
      {
        "dev" : 2051,
        "filepath" : "/var/log/tmpfs.log",
        "finish_time" : 1768550345,
        "inode" : 2888304,
        "size" : 1282,
        "start_time" : 1768550345,
        "status" : "finished"
      }
    ],
  "finish_time" : 1768550345,
  "input_index" : 0,
  "start_time" : 1768550344,
  "status" : "finished"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resource usage and throughput control&lt;br&gt;
One-time file collection is a native input plugin (implemented in C++). It shares the reader system with regular file collection and offers good throughput: theoretical single-threaded collection of single-line text logs can reach 300 MB/s. At the same time, "controllable" constraints are imposed on resource usage:&lt;/p&gt;

&lt;p&gt;● Single-threaded sequential execution: All input_static_file_onetime collection configurations are uniformly scheduled by the StaticFileServer module inside LoongCollector. The overall process is single-threaded loop processing (different inputs are assigned time slices in the loop) to avoid uncontrolled resource usage caused by excessive concurrency.&lt;/p&gt;

&lt;p&gt;● Send rate limiting (flusher_sls.MaxSendRate): Use the advanced parameter MaxSendRate of the SLS flusher to rate-limit sending, in B/s. When MaxSendRate &amp;gt; 0, the sending queue enables the rate limiter, reducing the impact on network bandwidth and SLS write quotas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi7t4p8zqiyguvx5bwlr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi7t4p8zqiyguvx5bwlr.png" alt=" " width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.Quick Start&lt;br&gt;
SLS has released the one-time file collection capability. You can try the new feature in three steps:&lt;/p&gt;

&lt;p&gt;1.Log on to the SLS console. On the Logtail configuration page, select "One-time Logtail Configuration" and click "Add Logtail Configuration".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lhkkc7wmsg9hkktgw9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lhkkc7wmsg9hkktgw9z.png" alt=" " width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.Select "One-time File Collection - Host".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr7v9345g8o0dbttgapd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr7v9345g8o0dbttgapd.png" alt=" " width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.Fill in the file collection configuration (consistent with the configuration of regular file collection). Configure processing plugins as needed and save. For more detailed descriptions and parameter explanations, refer to the official documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo40vzsdhwohs3r7uhztu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo40vzsdhwohs3r7uhztu.png" alt=" " width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After saving, you can see that the data is collected:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqcu6jkgfepph017815r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqcu6jkgfepph017815r.png" alt=" " width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also view the complete collection configuration in the configuration details:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrdyekaf3wygak6xyt1a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrdyekaf3wygak6xyt1a.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3. Best Practices&lt;br&gt;
3.1 Scenario 1: A large machine group backfilling a large volume of files&lt;/p&gt;

&lt;p&gt;Hypothetical scenario:&lt;br&gt;
● An accidental network disconnection lasted long enough to exceed LoongCollector's local fault-tolerance limit, so 1,000 nodes need to backfill data, about 10 GB per node.&lt;/p&gt;

&lt;p&gt;● The target Logstore has 256 shards. The write limit for each shard is about 5 MB/s.&lt;/p&gt;

&lt;p&gt;● The regular daily traffic of each machine is about 1 MB/s.&lt;/p&gt;

&lt;p&gt;If you apply the one-time file collection configuration directly with the default parameters, the following may occur:&lt;/p&gt;

&lt;p&gt;● The write rate surges instantly, triggering shard write quota errors.&lt;/p&gt;

&lt;p&gt;● Backfill traffic crowds out daily collection traffic.&lt;/p&gt;

&lt;p&gt;● A backlog at the sender causes the one-time job to fail to complete within the ExecutionTimeout.&lt;/p&gt;

&lt;p&gt;It is recommended to apply two-step control:&lt;/p&gt;

&lt;p&gt;Step 1: Rate limiting (MaxSendRate)&lt;br&gt;
Estimate roughly from the available quota: the remaining write capacity is about 256 × 5 - 1,000 × 1 = 280 MB/s. Averaged across the 1,000 machines, that is about 0.28 MB/s each (≈ 286 KB/s ≈ 286,720 B/s), rounded to about 290,000 B/s. You can therefore set MaxSendRate to about 290000 (B/s) for rate limiting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumakhroffqqqoc358ddn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumakhroffqqqoc358ddn.png" alt=" " width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step 2: Increase execution timeout (ExecutionTimeout)&lt;br&gt;
At a sending rate of 286 KB/s, backfilling 10 GB requires at least about 10 GB / 286 KB/s ≈ 36,663 s ≈ 10.2 h. It is recommended to set ExecutionTimeout to 86400 (about 1 day) to leave enough margin for collection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxl9vei66vn4bjb1k0925.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxl9vei66vn4bjb1k0925.png" alt=" " width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Summary: ExecutionTimeout: 86400 + MaxSendRate: 290000. This allows large-scale backfilling to be completed while minimizing the impact on daily online collection.&lt;/p&gt;
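As a quick sanity check, the two parameters above can be derived programmatically. This is only a sketch of the arithmetic in this scenario; the rounding conventions follow the figures quoted above (KB here means 1024 B):

```python
# Sanity-check the MaxSendRate / ExecutionTimeout estimates from the scenario above.
shards, per_shard_mbps = 256, 5      # Logstore: 256 shards, ~5 MB/s write limit each
nodes, daily_mbps = 1000, 1          # 1,000 nodes, ~1 MB/s of daily traffic each
backfill_gib = 10                    # ~10 GB of data to backfill per node

# Spare cluster-wide write capacity after daily traffic is served.
spare_mbps = shards * per_shard_mbps - nodes * daily_mbps        # 280 MB/s
per_node_kbs = spare_mbps / nodes * 1024                         # ~286.7 KB/s per node
max_send_rate = round(per_node_kbs * 1000 / 10_000) * 10_000     # ~290,000 B/s

# Time to push ~10 GB at that per-node rate, plus generous headroom.
backfill_seconds = backfill_gib * 1024 * 1024 / per_node_kbs     # ~36,600 s (~10.2 h)
execution_timeout = 86_400                                       # 1 day leaves ample margin
```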

&lt;p&gt;3.2 Scenario 2: Backfilling only data from a specific time range in a file&lt;br&gt;
Hypothetical scenario (setting quota aside and focusing only on avoiding duplication):&lt;/p&gt;

&lt;p&gt;● The edge zone experienced a prolonged network abnormality that exceeded LoongCollector's local fault-tolerance limit, resulting in the loss of approximately 12 hours of data.&lt;/p&gt;

&lt;p&gt;● There are multiple rotated files on the edge zone, and many files are only partially missing.&lt;/p&gt;

&lt;p&gt;● The log is single-line JSON, for example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{"timestamp":1768556120,"message":"hello world","level":"INFO"}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;One-time file collection is executed in units of file snapshots. If you simply recollect the files, time segments that have already been reported are likely to be collected again.&lt;/p&gt;

&lt;p&gt;Solution: Add the UNIX timestamp filter processing plugin processor_timestamp_filter_native (combined with processor_parse_json_native/processor_parse_timestamp_native if necessary) to the one-time collection pipeline to retain only events within the target time range, thereby achieving precise recollection.&lt;/p&gt;
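As a sketch, the pipeline for this scenario could chain the JSON parser and the timestamp filter roughly as follows. The plugin names are those mentioned above; the parameter names, field values, file path, and project/logstore names are illustrative assumptions and should be checked against the official plugin reference:

```yaml
# Illustrative one-time collection pipeline: recollect only the lost ~12 h window.
enable: true
inputs:
  - Type: input_file                         # one-time file collection source
    FilePaths:
      - /var/log/app/*.log
processors:
  - Type: processor_parse_json_native        # parse {"timestamp":...,"message":...,"level":...}
    SourceKey: content
  - Type: processor_timestamp_filter_native  # keep only events inside the lost window
    SourceKey: timestamp                     # field holding the UNIX timestamp (illustrative)
    StartTime: 1768510000                    # window start, UNIX seconds (illustrative)
    EndTime: 1768553200                      # window end, UNIX seconds (illustrative)
flushers:
  - Type: flusher_sls                        # project/logstore names are placeholders
    Project: my-project
    Logstore: my-logstore
```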

&lt;p&gt;The console configuration diagram is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgygf06pjqsvw7xuggmd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgygf06pjqsvw7xuggmd.png" alt=" " width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3atecm3d02yc2berthgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3atecm3d02yc2berthgr.png" alt=" " width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.3 Scenario 3: The one-time collection configuration needs to be modified (avoiding pollution of the target dataset)&lt;br&gt;
A one-time collection configuration is executed immediately upon dispatch. If the initial configuration contains a logic error, some unexpected data may already have been generated even if the configuration is updated right away, causing new and old data to mix and distorting subsequent analysis.&lt;/p&gt;

&lt;p&gt;Suggested practice:&lt;/p&gt;

&lt;p&gt;1. Create the one-time configuration for the first time, and observe that the output does not meet expectations.&lt;br&gt;
2. Update the one-time configuration (you can set ForceRerunWhenUpdate: true to trigger a forced rerun that interrupts the previous collection task), then verify whether the newly collected data format is correct. Repeat until the output meets expectations.&lt;br&gt;
3. Use a query statement to filter out the unexpected data, and clean it up through SLS soft delete (sample document: Simple Log Service soft delete).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foisizi52wxcg7mjga6iz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foisizi52wxcg7mjga6iz.png" alt=" " width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhynoep027qq5trgqql9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhynoep027qq5trgqql9.png" alt=" " width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9wvbe2yubt02g2n5ldv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9wvbe2yubt02g2n5ldv.png" alt=" " width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this way, you can retain only the collection result corresponding to the "final correct configuration" to avoid affecting subsequent analysis.&lt;/p&gt;

&lt;p&gt;4. Summary&lt;br&gt;
One-time file collection is suitable for scenarios such as historical data migration, recollection after network disconnection, and temporary batch processing. After the configuration is dispatched, it is executed based on the file snapshot taken at the start time. Checkpoints ensure recoverability and observability, and ExecutionTimeout plus MaxSendRate provide a double safety net of duration and traffic, so you can steadily backfill static data without disturbing continuous online collection. You are welcome to try it out and provide feedback!&lt;/p&gt;

</description>
      <category>automation</category>
      <category>dataengineering</category>
      <category>news</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Giant Network’s Supernatural Squad Partners with Alibaba Cloud to Create a New Paradigm for Cloud-Native Gaming</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Tue, 31 Mar 2026 05:26:08 +0000</pubDate>
      <link>https://dev.to/observabilityguy/giant-networks-supernatural-squad-partners-with-alibaba-cloud-to-create-a-new-paradigm-for-2g0g</link>
      <guid>https://dev.to/observabilityguy/giant-networks-supernatural-squad-partners-with-alibaba-cloud-to-create-a-new-paradigm-for-2g0g</guid>
      <description>&lt;p&gt;This article shows how Giant Network’s Supernatural Squad runs cloud native on Alibaba Cloud using ACK and OpenKruiseGame to achieve elastic, low latency, highly reliable gameplay at massive scale.&lt;/p&gt;

&lt;p&gt;Running on the cloud since day one. Reaching over 10 million DAU (daily active users) within a year of launch. Supporting millions of concurrent players at peak times with zero major failures. This isn't science fiction: it's the reality of the cloud-native implementation co-authored by Giant Network and Alibaba Cloud.&lt;/p&gt;

&lt;p&gt;The "Cloud-Native First" Strategy of Supernatural Squad&lt;br&gt;
In January 2025, Giant Network launched the multiplayer team-based adventure game Supernatural Squad. With its innovative "Chinese-style micro-horror + multiplayer cooperation" gameplay, it quickly became a smash hit. Recently, the game announced that its DAU surpassed 10 million, and that it had climbed to fourth place on the iOS Game Bestseller list. Most notably, since the day its servers opened, this game has never been deployed on physical machines or traditional virtual machines—it has run on a cloud-native architecture from day one.&lt;/p&gt;

&lt;p&gt;For most gaming companies, a "hit at launch" is a bittersweet challenge. Traffic surges arrive quickly and recede slowly, while traditional architectures are "clunky":&lt;/p&gt;

&lt;p&gt;● Game servers (such as battle and room servers) are deployed on fixed servers, and expansion takes days.&lt;/p&gt;

&lt;p&gt;● Resources must be reserved long-term to handle peaks, leading to significant waste during idle periods.&lt;/p&gt;

&lt;p&gt;● Version updates rely on scripts, making canary releases difficult; a single error often requires a "full-server rollback."&lt;/p&gt;

&lt;p&gt;● Logs are scattered and monitoring is fragmented, meaning fault isolation can take hours.&lt;/p&gt;

&lt;p&gt;● Security is weak, making the game vulnerable to DDoS attacks.&lt;/p&gt;

&lt;p&gt;● Data layer bottlenecks are prominent: issues like battle settlement delays, leaderboard lag, and player data loss occur frequently.&lt;/p&gt;

&lt;p&gt;The Supernatural Squad team knew that if they followed the old model, they might "fall on the road to success."&lt;/p&gt;

&lt;p&gt;So, they chose a more difficult but far-reaching path: fully embracing cloud-native.&lt;/p&gt;

&lt;p&gt;By deeply integrating ACK (Container Service for Kubernetes), ESS (Auto Scaling), NLB (Network Load Balancer), OpenKruiseGame (OKG), SLS (Log Service), ARMS (Application Real-Time Monitoring Service), Alibaba Cloud Native Protection, and cloud-native databases PolarDB and Tair (Redis-compatible), Giant Network built a next-generation game infrastructure. This system is highly elastic, highly available, low-cost, intelligent, secure, and high-performance. Today, with DAU exceeding 10 million, this technical framework has become a benchmark case for "cloud-native transformation" in the gaming industry.&lt;/p&gt;

&lt;p&gt;High Elasticity × Low Latency × Zero Failure: Decoding the Cloud-Native Foundation&lt;br&gt;
Supernatural Squad built an industry-leading cloud-native game server architecture based on Alibaba Cloud ACK and OpenKruiseGame (OKG). It achieves zero-downtime and seamless delivery through blue-green deployments and in-place upgrades. By utilizing OKG and multi-NLB resource pools, it covers all major lines (BGP, China Telecom, China Unicom, and China Mobile), achieving automated network mapping across multiple carriers. Combining HPA (Horizontal Pod Autoscaler) with OKG’s graceful shutdown mechanism, the game balances cost and user experience. Using the ACK Koordinator component, it implements CPU Burst and fine-grained QoS scheduling, significantly improving cluster resource utilization. Through the bidirectional perception of infrastructure and business status, it creates an "automated O&amp;amp;M closed-loop driven by business semantics," realizing a next-generation backend that is highly elastic, available, high-performance, and secure. While significantly reducing O&amp;amp;M (Operations &amp;amp; Maintenance) pressure, it has achieved institutionalized and sustainable cost optimization.&lt;/p&gt;

&lt;p&gt;At the network level, as a competitive mobile game extremely sensitive to latency, Supernatural Squad relied on Alibaba Cloud to build a next-generation cloud network architecture featuring "cloud-edge collaboration, multi-carrier compatibility, and elastic consolidation." Through OKG and NLB, it achieves concurrent access across four lines (China Telecom, China Unicom, China Mobile, and BGP), allowing players nationwide to automatically match with the optimal link. Its innovative "static network + dynamic computing" model achieves rapid expansion at 50 nodes per minute, launching thousands of battle servers within 15 minutes to completely eliminate queues. At the same time, by leveraging Alibaba Cloud Express Connect, core systems such as accounts and payments in on-premises data centers are directly connected to the Shanghai VPC intranet, constructing a hybrid cloud hub with millisecond synchronization and financial-grade security. Furthermore, by using Shared Bandwidth Packages to aggregate the project's public network egress, the architecture significantly reduces costs while simplifying O&amp;amp;M. This provides an elastic "bandwidth reservoir" for player interactions and high-frequency state synchronization, realizing a peak experience of zero lag and zero waiting for tens of millions of players competing together.&lt;/p&gt;

&lt;p&gt;At the data layer, cloud-native PolarDB and Tair (Redis-compatible) have built an elastic and stable player archiving solution, supporting high-concurrency login and read/write operations for tens of millions of players. Leveraging the storage-compute separation and elasticity of the PolarDB cloud-native database, the system supports automatic scaling during game events and enables second-level backups and rollbacks of player data, significantly reducing database O&amp;amp;M costs. Furthermore, PolarDB Serverless supports automatic scaling (up and down), allowing for second-level adjustments of computing resources based on real-time changes in user traffic. By automatically increasing resources during peak periods and reducing them during off-peak times, it ensures that the game environment always operates at its optimal state. Based on Alibaba Cloud Tair (Redis-compatible), the system supports ultra-high concurrency access. Serving as the core for real-time leaderboards, battle state caching, and matchmaking pools, it leverages multi-threading and persistent memory optimization to achieve a single-instance QPS of over one million, enabling millisecond-level ranking refreshes, instantaneous settlements, and seamless recovery from disconnections.&lt;/p&gt;

&lt;p&gt;As millions of players flood Supernatural Squad, DDoS attacks have become a critical risk impacting the user experience. To address this, Giant Network collaborated with Alibaba Cloud to build a high-performance, intelligent protection system based on a cloud-native security architecture. This solution leverages Alibaba Cloud's native anti-DDoS capabilities to achieve millisecond-level identification and precise scrubbing of terabit-level DDoS attacks through one-click integration, requiring no architectural modifications, and offering industry-leading protection. Even in high-concurrency scenarios such as version updates and major tournaments, the system maintains over 99.99% service availability, truly realizing the goal of "zero perception of attacks and zero interruption during switching." In the face of sudden traffic surges, the system supports the automatic elastic scaling of defense bandwidth and dynamic resource allocation, preventing service disruptions caused by capacity shortages. In addition, by integrating the Security Incident Center, the operations team can monitor attack events in real-time, analyze attack types and characteristics, and quickly deploy customized game protocol protection rules based on AI-driven policy recommendations, significantly enhancing response efficiency and defense accuracy. From efficient scrubbing to intelligent decision-making, Alibaba Cloud—with its core values of "stability, efficiency, and security"—has constructed a resilient digital shield for Supernatural Squad, ensuring smooth competitive play for tens of millions of players, while setting a new benchmark for cloud-native security in the gaming industry.&lt;/p&gt;

&lt;p&gt;For the competitive real-time interactive game Supernatural Squad, simply being operational is just the starting point; achieving clear visibility and accurate diagnosis is the key to guaranteeing a smooth experience for so many concurrent players. The operations team moved away from traditional fragmented monitoring tools in favor of a lightweight, standardized, and deeply integrated observability system built on Alibaba Cloud Simple Log Service (SLS), CloudMonitor (CMS) Prometheus Service, and Grafana Service.&lt;/p&gt;

&lt;p&gt;This system leverages Prometheus to collect resource levels and core business metrics, such as concurrent users and matchmaking duration, in real time at million-level PCU, ensuring precise monitoring without data loss during high-concurrency periods. It uses SLS to aggregate full-link logs, supporting second-level reconstruction of behavior paths by RequestID or player ID; combined with SQL analysis and custom rules, this enables map error statistics and tracking of abnormal operations. Finally, it employs Grafana to build a unified panoramic dashboard that integrates metrics and log data, with one-click navigation to SLS to view associated logs during alerts. The result is a closed loop where "metrics discover issues and logs locate root causes," compressing fault response times from hours to minutes and fully leveraging the advantages of cloud-native observability and collaboration.&lt;/p&gt;

&lt;p&gt;From "Running" to "Winning": Redefining Observability&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcj578vgd7fb11e5dvb3l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcj578vgd7fb11e5dvb3l.png" alt=" " width="789" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the face of the global gaming market's extreme pursuit of high concurrency, low latency, and rapid iteration, OpenKruiseGame (OKG), the cloud-native game server management solution "born for games" created by Alibaba Cloud, is becoming the core engine driving the industry's smooth architectural upgrades. Addressing the management challenges unique to heterogeneous gaming workloads, OKG provides a one-stop management system that spans fine-grained configuration, automated network access, and business state awareness. It not only significantly lowers the barrier to cloud-native transformation for game developers, but also, through its global multi-region consistent delivery capabilities, empowers developers to overcome geographical constraints and achieve rapid, agile deployment and global expansion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftg47uvesl8dviyr9g11n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftg47uvesl8dviyr9g11n.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cloud-native is no longer exclusive to Internet applications; it is the inevitable choice for next-generation gaming infrastructure.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Leap in Log Collection Efficiency: Comprehensive Upgrade from iLogtail to LoongCollector</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Wed, 18 Mar 2026 05:18:00 +0000</pubDate>
      <link>https://dev.to/observabilityguy/leap-in-log-collection-efficiency-comprehensive-upgrade-from-ilogtail-to-loongcollector-4pea</link>
      <guid>https://dev.to/observabilityguy/leap-in-log-collection-efficiency-comprehensive-upgrade-from-ilogtail-to-loongcollector-4pea</guid>
      <description>&lt;p&gt;This article introduces LoongCollector, a comprehensive upgrade from iLogtail that boosts log collection efficiency, pipeline flexibility, and overall performance in observable data management.&lt;/p&gt;

&lt;p&gt;Review of iLogtail Development History&lt;br&gt;
● In 2013, the first version of iLogtail was introduced alongside the Apsara 5K system. The Apsara 5K project is a distributed operating system project that schedules 5,000 computers to work together as a supercomputer. At that time, the purpose of developing iLogtail was simple: to collect log data distributed across thousands of machines into a central data warehouse for easy access and analysis. In this phase, iLogtail focuses on basic log collection. Its typical technical features include real-time log collection via inotify-driven change detection, Apsara log parsing and structuring, real-time transmission of logs to remote storage, and basic self-monitoring log reporting.&lt;/p&gt;

&lt;p&gt;● In 2015, Alibaba began to migrate its group businesses to the cloud, and by Double 11 in 2019 it announced that the core systems had been fully migrated. During this period, the users of iLogtail expanded from Alibaba Cloud to the entire group, which put forward higher requirements for the richness and stability of its processing capabilities. In this context, iLogtail developed strong capabilities such as multi-level feedback queues for fault isolation, checkpoint mechanisms to prevent log loss, multi-tenant management, and more comprehensive log processing.&lt;/p&gt;

&lt;p&gt;● In 2017, with the official commercialization of SLS and the launch of ACK, the number of iLogtail users grew rapidly, and new requirements sprang up. Meanwhile, the usage scenarios of iLogtail expanded from hosts to containers. To adapt to the new environment, iLogtail evolved a Go plugin system. With the support of this subsystem, iLogtail quickly added features such as container log collection and automatic tagging of Kubernetes metadata, and it also began to expand support for time series and tracing data.&lt;/p&gt;

&lt;p&gt;● In 2022, iLogtail was fully open-sourced and upgraded to v1.0.0. This marked the maturity of iLogtail, which had transformed from a single-purpose log collector into a full-featured observable data collector. The 1.0 series delivered complete support for common container runtimes, making it suitable for cloud-native environments. Thanks to the efforts of members of the open-source community, it enriched its output support for downstream ecosystems and, keeping pace with the times, added support for the fourth pillar of observability data: profiling.&lt;/p&gt;

&lt;p&gt;● In 2024, on its 2nd open-source anniversary, iLogtail released v2.0.0. This version combined community contributions and adapted to market changes. Compared with the 1.0 series, this version significantly improved in usability, performance, and reliability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zu6y8hgdzh41921d1jw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zu6y8hgdzh41921d1jw.png" alt=" " width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From iLogtail to LoongCollector: Beyond Renaming&lt;br&gt;
In 2025, iLogtail was officially upgraded to LoongCollector, marking a new era in the field of log collection and processing. LoongCollector has achieved comprehensive upgrades in log scenarios and has been deeply optimized in terms of functionality, performance, and stability, thus providing users with more efficient, flexible, and reliable log management solutions. Next, we will introduce the upgrades of LoongCollector in detail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqo9lw895s7dtmmqgmmsb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqo9lw895s7dtmmqgmmsb.png" alt=" " width="765" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consolidated Foundation - High Performance and Flexible Pipeline&lt;br&gt;
Overview&lt;br&gt;
An overall architecture upgrade was performed on iLogtail, especially for the main program in C++. By introducing the concept of pipeline, the input, processing, and output capabilities achieve completely plugin-based integration, supporting free combination of capabilities to meet the above requirements.&lt;/p&gt;

&lt;p&gt;In LoongCollector, each collection task corresponds to a collection configuration, which describes how to collect, process, and send the required observable data. In terms of code implementation, each configuration maps to a pipeline in memory, and its general form is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foht1lwauc20qa3nybebl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foht1lwauc20qa3nybebl.png" alt=" " width="800" height="163"&gt;&lt;/a&gt;&lt;/p&gt;
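In the open-source iLogtail/LoongCollector line, such a pipeline is typically described as a YAML collection configuration with inputs, processors, and flushers sections, mirroring the input → processing → output stages above. A minimal sketch follows; the file path, regex, and project/logstore names are placeholders, and the exact plugin fields may vary by version:

```yaml
# Minimal pipeline sketch: one native input, one native processor, one flusher.
enable: true
inputs:
  - Type: input_file                     # C++ native file input
    FilePaths:
      - /var/log/nginx/access.log
processors:
  - Type: processor_parse_regex_native   # structure each line with a regex (illustrative)
    SourceKey: content
    Regex: ^(\S+)\s+(\S+)$
    Keys: [client_ip, request]
flushers:
  - Type: flusher_sls                    # send to a Simple Log Service logstore (placeholders)
    Project: my-project
    Logstore: my-logstore
```

Each such configuration maps to one in-memory pipeline, which is why configurations can be loaded, replaced, and scheduled independently of one another.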

&lt;p&gt;LoongCollector supports various pipeline forms, aiming to meet the needs of different users for log collection and processing and flexibly adapt to various application scenarios. The specific supported pipeline forms include:&lt;/p&gt;

&lt;p&gt;● C++ Input plugin + C++ native plugin&lt;/p&gt;

&lt;p&gt;This combination allows users to take advantage of the high-performance features of C++ for log data input and processing. This solution is particularly suitable for scenarios where a large number of logs need to be processed in real time. It can significantly reduce latency and improve performance.&lt;/p&gt;

&lt;p&gt;● C++ Input plugin + SPL plugin&lt;/p&gt;

&lt;p&gt;The SPL (SLS Processing Language) plugin provides an intuitive and powerful way to analyze and process data. This combination not only improves the ability to handle complex data but also simplifies the user experience.&lt;/p&gt;

&lt;p&gt;● C++ Input plugin + Golang extended plugin&lt;/p&gt;

&lt;p&gt;By combining the C++ Input plugin with the Golang extended plugin, users can take full advantage of both. The C++ plugin provides high-performance collection capabilities during data collection, while the Golang plugin adds flexibility to data processing.&lt;/p&gt;

&lt;p&gt;● Golang Input plugin + Golang extended plugin&lt;/p&gt;

&lt;p&gt;The biggest advantage of the Golang Input plugin is its support for multiple data sources, including Systemd, Kafka, and Win event. This combination adapts to a wide variety of data sources to the greatest extent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foibf6p8ziw13owuygqif.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foibf6p8ziw13owuygqif.jpeg" alt=" " width="800" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hot Load Isolation for Pipeline&lt;br&gt;
LoongCollector uses a bus mode to divide threads by function. Specifically, according to the pipeline architecture, LoongCollector has three worker threads: the Input Runner thread, the Processor Runner thread, and the Flusher Runner thread, which are responsible for running the input plugins, processing plugins, and output plugins of all pipelines, respectively. These threads are connected through buffer queues. To ensure fairness and isolation between pipelines, LoongCollector further adopts the following designs:&lt;/p&gt;

&lt;p&gt;● Within each worker thread, each pipeline is allocated a corresponding time slice according to priority.&lt;/p&gt;

&lt;p&gt;● Each pipeline has its own independent processing and send queues.&lt;/p&gt;

&lt;p&gt;Based on the preceding description, the bus mode diagram of LoongCollector is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsr26t43o6d3gklgnv1dh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsr26t43o6d3gklgnv1dh.jpeg" alt=" " width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, the bus mode inevitably poses greater challenges to isolation in multi-tenancy scenarios. Isolation through dedicated threads is the simplest approach, but multiplying threads multiplies resource usage, which is unacceptable for an observable data collector.&lt;/p&gt;

&lt;p&gt;LoongCollector deeply optimizes the overall scheduling of collection configurations and the resource allocation of the Flusher thread, preserving multi-tenancy capability as far as possible within the bus mode.&lt;/p&gt;

&lt;p&gt;When collection configurations are changed, iLogtail uses a Stop The World approach: all collection configurations are suspended, reloaded, and then restarted. If multiple teams or businesses share the same iLogtail instance, they interfere with each other. For example, if Service A and Service B share one iLogtail instance, Service A's continuous debugging of its collection configuration inevitably affects Service B's collection during the debugging phase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy27w3iwvmk7wp7fxgt2a.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy27w3iwvmk7wp7fxgt2a.jpeg" alt=" " width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LoongCollector further optimizes pipeline lifecycle management. Only pipelines that have changed are replaced; unchanged pipelines keep running untouched. This minimizes the impact of collection configuration changes and avoids a global Stop The World.&lt;/p&gt;
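&lt;p&gt;The change-only reload can be sketched as a diff over configuration fingerprints (illustrative Go; the function and field names are assumptions, and LoongCollector's real diff covers more state):&lt;/p&gt;

```go
package main

import "fmt"

// diffPipelines compares old and new pipeline configs (name -> config
// hash) and returns only the names that must be stopped or (re)started.
// Unchanged pipelines keep running, avoiding a global Stop The World.
func diffPipelines(oldCfg, newCfg map[string]string) (changed []string) {
	for name, hash := range newCfg {
		if oldCfg[name] != hash {
			changed = append(changed, name) // added or modified
		}
	}
	for name := range oldCfg {
		if _, ok := newCfg[name]; !ok {
			changed = append(changed, name) // removed
		}
	}
	return changed
}

func main() {
	oldCfg := map[string]string{"svcA": "h1", "svcB": "h2"}
	newCfg := map[string]string{"svcA": "h1", "svcB": "h3"}
	fmt.Println(diffPipelines(oldCfg, newCfg)) // only svcB is reloaded
}
```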

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzaeghvwug1jwkuitn0b.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzaeghvwug1jwkuitn0b.jpeg" alt=" " width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Continuous Breakthrough - Steady Improvement in Collection Performance for Core Scenarios&lt;br&gt;
CPU Reduced by an Average of 35% and Memory by 10%&lt;br&gt;
In single-line log mode, resource usage under the same traffic is compared. Lower bars are better.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33z08ow1ec66q32yw1gj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33z08ow1ec66q32yw1gj.png" alt=" " width="361" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrbtjqbeoad49dfnbt7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrbtjqbeoad49dfnbt7b.png" alt=" " width="361" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that LoongCollector has clear advantages over iLogtail in both CPU and memory usage. In particular, LoongCollector's CPU usage is lower than iLogtail's by 0.15 CPU cores on average. In terms of memory, LoongCollector's advantage is not obvious in low-traffic scenarios, but it achieves approximately a 10% reduction in memory usage in high-traffic scenarios.&lt;/p&gt;

&lt;p&gt;Maximum Collection Rate in Core Scenarios Increased by an Average of 80%&lt;br&gt;
File collection&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvrwdrlxfjigd51ga57z.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvrwdrlxfjigd51ga57z.jpeg" alt=" " width="360" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9ugguomv3vikdrcgc18.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9ugguomv3vikdrcgc18.jpeg" alt=" " width="361" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in the figure, in typical file collection scenarios, LoongCollector significantly outperforms iLogtail, with an average improvement of 40% in single-threaded scenarios. In multi-threaded scenarios, the improvement is more significant, averaging 80%.&lt;/p&gt;

&lt;p&gt;Standard output&lt;br&gt;
● By refactoring the standard output collection plugin, a new plugin input_container_stdio is introduced. This plugin supports log rotation queues, significantly enhancing the stability of standard output collection.&lt;/p&gt;

&lt;p&gt;● In terms of performance, the new plugin performs well. In containerd scenarios, the collection performance improves by 200%. In Docker scenarios, the collection performance improves by 100%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgcb7jo7h31wwur0mw2j.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgcb7jo7h31wwur0mw2j.jpeg" alt=" " width="361" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● Regarding resource usage, in containerd scenarios, the new plugin reduces the CPU usage by 20% and memory by 25%. In Docker scenarios, the CPU usage is decreased by 25% and the memory by 20%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw2d5mpsd7oabs1v5jh8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw2d5mpsd7oabs1v5jh8.jpeg" alt=" " width="361" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmgg690thrvw5vnwdpcr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmgg690thrvw5vnwdpcr.jpeg" alt=" " width="361" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Upgraded Stability - Better Self-monitoring&lt;br&gt;
LoongCollector Instance Monitoring&lt;br&gt;
LoongCollector provides comprehensive instance monitoring features to ensure that users can grasp the usage of system resources and instance exception information. Through intuitive monitoring dashboards, users can view the usage of resources such as CPU, memory, and network in real time, making the running status of each instance transparent. The system can also automatically trigger alerts to promptly notify users of exceptions, helping them quickly locate and handle problems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffnodaf54upv2z5gmphia.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffnodaf54upv2z5gmphia.jpeg" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;File Collection Monitoring&lt;br&gt;
In file collection scenarios, LoongCollector is equipped with powerful data monitoring capabilities, including comprehensive monitoring of the collection directory usage and collection latency. Users can quickly check key information, such as the current number of files in each directory on the overview page, so as to find potential file backlog issues in time. Additionally, on the details page, the self-monitoring system also provides a more in-depth analysis feature to help users identify latency issues during file collection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj19jfeujld070xpx97cq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj19jfeujld070xpx97cq.png" alt=" " width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pipeline Details Monitoring&lt;br&gt;
For multi-collection configuration scenarios, the pipeline details monitoring feature of LoongCollector comprehensively displays key data such as the duration and exception information of each collection configuration. Users can clearly see the processing duration of each pipeline stage in this interface, easily identifying performance bottlenecks and potential optimization points. Meanwhile, self-monitoring also records exceptions that occur in each step to help users quickly locate problems and make adjustments. Through in-depth monitoring and analysis of the pipeline process, users can more effectively optimize log collection and processing strategies to improve the performance and reliability of the entire log processing system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbst1dbm2kqtxiksl0ki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbst1dbm2kqtxiksl0ki.png" alt=" " width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Network Exception Isolation - Tolerating Single-zone Network Exceptions&lt;br&gt;
Another isolation issue in the bus mode is isolating pipelines whose sends are failing. For example, if a network exception occurs in a region, every pipeline configured with an SLS output plugin that sends to that region experiences sending failures. Because the Flusher Runner thread is globally shared in the bus mode, it retrieves data from the send queues and pushes it to the sink queue regardless of whether a pipeline has sending exceptions. As a result, requests from failing pipelines are retried repeatedly, occupying limited network I/O resources and delaying the sends of healthy pipelines.&lt;/p&gt;

&lt;p&gt;To isolate pipeline sending exceptions in the bus mode, we add a traffic distribution mechanism to LoongCollector that controls how the Flusher Runner thread retrieves data from the send queues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwc41a77enuko4d0q6jta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwc41a77enuko4d0q6jta.png" alt=" " width="769" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Traffic control is performed based on three dimensions: Zone, Project, and Logstore.&lt;/p&gt;

&lt;p&gt;● Zone throttling: handles network/server issues&lt;/p&gt;

&lt;p&gt;● Project throttling: addresses project quota issues&lt;/p&gt;

&lt;p&gt;● Logstore throttling: manages shard quota issues&lt;/p&gt;

&lt;p&gt;Each throttler uses an adaptive throttling algorithm based on the network congestion control algorithm AIMD (Additive Increase, Multiplicative Decrease). When a sending failure occurs, the sending concurrency is reduced quickly; when sending succeeds, the concurrency is increased gradually. To avoid overreacting to network jitter, the sending status of a batch of data over a period is aggregated, preventing frequent concurrency fluctuations.&lt;/p&gt;

&lt;p&gt;This strategy ensures that when a network exception occurs in a sending target, the number of packets allowed to be sent to that target decreases quickly, minimizing the impact of the problematic target on other healthy targets. If the network is interrupted entirely, a hibernation period minimizes unnecessary sends, and data sending resumes promptly once the network is restored.&lt;/p&gt;

&lt;p&gt;In the following example, when LoongCollector simultaneously sends collected data to regions A and B, the network exception in region B does not affect the data collection and sending of region A.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0szzbveygl9tpqrctr32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0szzbveygl9tpqrctr32.png" alt=" " width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Automatic Detection of Network Quality - Coping with Network Fluctuations&lt;br&gt;
SLS endpoints are divided into internal endpoints and public endpoints. If the internal endpoint is fixed, data sending will be blocked once the internal network fails. Considering this situation, LoongCollector automatically detects the network quality of SLS endpoints. Once the network quality is detected to be poor, it will automatically switch to another endpoint.&lt;/p&gt;

&lt;p&gt;As shown in the figure, LoongCollector sends data over the internal network. If an internal network exception occurs, LoongCollector automatically switches to the public endpoint for data sending. Once the internal network is restored, LoongCollector automatically reverts to the internal endpoint. This ensures data sending stability in the case of a single-network exception. The traffic graph shows that endpoint switching causes no traffic fluctuation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5t9jc5s6l9bprd9fatw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5t9jc5s6l9bprd9fatw.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Seamless Migration - A Complete, Interruption-free Migration Solution&lt;br&gt;
Seamless Upgrade from iLogtail to LoongCollector in Host Scenarios&lt;br&gt;
For more information about the seamless upgrade from iLogtail to LoongCollector, see the upgrade documentation. Previous collection configurations and checkpoints are preserved, just as in an ordinary restart. LoongCollector is fully compatible with all iLogtail configurations.&lt;/p&gt;

&lt;p&gt;Upgrade in Kubernetes Scenarios Without Interruptions&lt;br&gt;
Component resource-level upgrade management&lt;/p&gt;

&lt;p&gt;● Uninstalling and reinstalling at the component level will definitely result in a period of service unavailability.&lt;/p&gt;

&lt;p&gt;● Resource-level control can minimize the duration of service unavailability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegsd91f2vja26ngx4ts6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegsd91f2vja26ngx4ts6.jpeg" alt=" " width="800" height="679"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adopting affinity control to implement seamless switching of the DaemonSet from logtail-ds to loongcollector-ds&lt;/p&gt;

&lt;p&gt;The following figure shows the effect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4swqp2eiupus444hpqmh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4swqp2eiupus444hpqmh.png" alt=" " width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ensuring zero data loss or interruption on a single node&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn25ic6p29vombn8e8z08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn25ic6p29vombn8e8z08.png" alt=" " width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Logtail-ds has a checkpoint mechanism. When a single-node logtail-ds stops, the offset information collected from the file is persisted to the node's checkpoint. When loongcollector-ds is enabled, it reads the offset from the checkpoint first and then continues to collect data from the offset. This ensures that data is not duplicated or lost.&lt;/p&gt;

&lt;p&gt;Kubernetes component update&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9oz51msenf7z7bz79vt.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9oz51msenf7z7bz79vt.jpeg" alt=" " width="612" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It can be seen that before and after the component upgrade, both standard output collection and file collection show very stable data trends, with no data interruption or duplication.&lt;/p&gt;

&lt;p&gt;New Tag Processing Capability - From Disorder to Unified Management&lt;br&gt;
Tags are important data for iLogtail to identify log metadata in log collection. However, iLogtail has some problems in the processing of tag data.&lt;/p&gt;

&lt;p&gt;● Inconsistent tag data sources: iLogtail adds most metadata to tags by default. For a small amount of other metadata (such as inode), it uses the Boolean parameter in the pipeline configuration to determine whether to add them to tags.&lt;/p&gt;

&lt;p&gt;● Users cannot rename or delete tags.&lt;/p&gt;

&lt;p&gt;● C++ and Go have completely different mechanisms for handling tags, with no unified solution.&lt;/p&gt;

&lt;p&gt;LoongCollector optimizes the overall tag processing to address the above issues.&lt;/p&gt;

&lt;p&gt;● Input-level tag data is separately processed and controlled by each Input plugin.&lt;/p&gt;

&lt;p&gt;● In container scenarios, all tags of the new standard output collection plugin are the same as those of the file collection.&lt;/p&gt;

&lt;p&gt;● The tag processing plugins can be used to add, delete, and rename metadata at the instance level. The tag processing features of C++ are the same as those of Go. This ensures that all pipeline types can use the full tag processing capabilities.&lt;/p&gt;

&lt;p&gt;The following sample configuration processes file input tags and instance-level tags.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "configName": "taiye-file-test-new",
  "inputs": [{
    "Type": "input_file",
    "FilePaths": [
      "/home/**/test.log"
    ],
    "EnableContainerDiscovery": true,
    "CollectingContainersMeta": true,
    "ContainerFilters": {
      "IncludeEnv": {
        "aliyun_logs_taiye-file-test": "/home/test.log"
      }
    },
    // Handle the Input tags
    "Tags": {
      "K8sNamespaceTagKey": "my-namespace",
      "ContainerIpTagKey": ""
    }
  }],
  "flushers": [{
    "Type": "flusher_sls",
    "Endpoint": "cn-hangzhou-intranet.log.aliyuncs.com",
    "Logstore": "taiye-file-test-new",
    "Region": "cn-hangzhou",
    "TelemetryType": "logs"
  }],
  "global": {
    // Rename the HOST_NAME tag
    "PipelineMetaTagKey": {
      "HOST_NAME": "taiye-123"
    },
    // Enable instance tag processing
    "EnableProcessorTag": true
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;__tag__:_namespace_&lt;/code&gt; and &lt;code&gt;__tag__:__hostname__&lt;/code&gt; tags are renamed correctly, and the &lt;code&gt;__tag__:_container_ip_&lt;/code&gt; tag is deleted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvssunhm3b8axatbfj3yo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvssunhm3b8axatbfj3yo.png" alt=" " width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's More - LoongCollector Will Bring a New Experience of Full-stack Collection of Observable Data&lt;br&gt;
LoongCollector, built on iLogtail's high-performance pipeline, integrates Prometheus metric collection and eBPF data collection into the pipeline to fully upgrade its collection capabilities and realize OneAgent-based observable data collection. Stay tuned for more upcoming features.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>loongcollector</category>
    </item>
    <item>
      <title>Alibaba Cloud Observability and Datadog Release OpenTelemetry Go Automatic Instrumentation Tool</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Wed, 18 Mar 2026 02:57:17 +0000</pubDate>
      <link>https://dev.to/observabilityguy/alibaba-cloud-observability-and-datadog-release-opentelemetry-go-automatic-instrumentation-tool-4gbc</link>
      <guid>https://dev.to/observabilityguy/alibaba-cloud-observability-and-datadog-release-opentelemetry-go-automatic-instrumentation-tool-4gbc</guid>
      <description>&lt;p&gt;Alibaba Cloud and Datadog jointly released an open-source OpenTelemetry Go tool enabling zero-code tracing via compile-time injection.&lt;/p&gt;

&lt;p&gt;In the cloud-native observability realm, OpenTelemetry has become the de facto standard. Compared to Java, which possesses mature bytecode enhancement technology, the Go language, as a static compiled language, has long lacked a mature, low-intrusion automatic instrumentation solution. Current existing solutions mainly include:&lt;/p&gt;

&lt;p&gt;1.eBPF: powerful, but operates mainly at the system call level; handling application-layer context (such as HTTP header propagation) is relatively complex.&lt;br&gt;
2.Manual instrumentation: requires significant code changes and carries high maintenance costs. You must modify business code and the way dependency libraries are invoked, explicitly adding tracing and metrics logic at each key point.&lt;/p&gt;

&lt;p&gt;To this end, the Alibaba Cloud Observability team and the Programming Language team explored the Go compile-time instrumentation solution and donated its core capabilities to the OpenTelemetry community, forming the opentelemetry-go-compile-instrumentation [1] project. With the joint efforts of companies such as Datadog and Quesma, we published the first Preview version V0.1.0 [2].&lt;/p&gt;

&lt;p&gt;How It Works&lt;br&gt;
The core of the automatic instrumentation tool is the Go compiler's -toolexec flag, which lets our tool intercept every build tool invocation. In this way, before the code is compiled, we have the opportunity to analyze and modify it. The process consists of two phases:&lt;/p&gt;

&lt;p&gt;1.Dependency Analysis&lt;br&gt;
Before compilation starts, the tool analyzes the build flow of the application (go build -n) and detects third-party libraries used in the project, such as net/http, grpc, and redis. Then, it automatically generates a file named otel.runtime.go and imports the corresponding Hook code (monitoring logic, referred to as Hook code hereafter) into the build dependencies.&lt;/p&gt;

&lt;p&gt;2.Code Injection&lt;br&gt;
When the compiler processes the target function, the tool uses -toolexec to intercept the compilation, and then modifies the code of the target function. It inserts a segment of Trampoline Code at the function entry, and the Trampoline Code jumps to the pre-written Hook function.&lt;/p&gt;

&lt;p&gt;● Before entering the function (Before): The Hook records the start time, fetches context information (such as HTTP headers), and starts a span.&lt;/p&gt;

&lt;p&gt;● Function execution: Execute the original business logic.&lt;/p&gt;

&lt;p&gt;● After exiting the function (After): The Hook catches the return value or Panic, ends the span, and records the duration.&lt;/p&gt;
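&lt;p&gt;The Before/After contract above can be illustrated with a runtime sketch. Note that this is only an analogy: the real tool injects equivalent trampoline logic at compile time, and the names here are assumptions:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"time"
)

// wrap emulates at runtime what the injected trampoline achieves at
// compile time: a Before hook, the original function body, then an
// After hook that runs even if the body panics.
func wrap(name string, fn func()) (panicked bool) {
	start := time.Now() // Before: record start time, start a span
	defer func() {
		if r := recover(); r != nil {
			panicked = true // After: catch the panic instead of crashing
		}
		// After: end the span and record the duration.
		fmt.Printf("%s took %v\n", name, time.Since(start))
	}()
	fn() // the original business logic runs unmodified
	return false
}

func main() {
	wrap("handleGreet", func() { fmt.Println("Hello, OpenTelemetry!") })
	wrap("flaky", func() { panic("boom") }) // hook observes the panic, program continues
}
```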

&lt;p&gt;The advantage of this method is near-zero runtime overhead (beyond the monitoring logic itself). Because the instrumentation is compiled directly into the binary, it requires neither switching between kernel space and user space, as eBPF does, nor load-time injection, as a Java agent does.&lt;/p&gt;

&lt;p&gt;HTTP Instrumentation Example&lt;br&gt;
Let's go through a simple HTTP example to see how it is used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;
package main

import ...

func main() {
    http.HandleFunc("/greet", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("Hello, OpenTelemetry!"))
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Manual Instrumentation&lt;br&gt;
You need to manually import the OpenTelemetry SDK, manually create a tracer, and manually start and end a span in the handler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;package main

import...

func initTracer() func(context.Context) error {
  /* ... dozens of lines of initialization code... */
}

func main() {
    // 1. Initialize the tracer.
    shutdown := initTracer()
    defer shutdown(context.Background())
    // 2. Wrap the handler.
    handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // 3. Manually extract the context and start the span.
        tracer := otel.Tracer("demo-server")
        ctx, span := tracer.Start(r.Context(), "GET /greet")
        _ = ctx // pass ctx to downstream calls to propagate the trace
        // 4. Ensure that the span ends.
        defer span.End()
        // 5. You may also need to manually record attributes.
        span.SetAttributes(attribute.String("http.method", "GET"))
        w.Write([]byte("Hello, OpenTelemetry!"))
    })
    // 6. ListenAndServe may also need to be wrapped...
    log.Fatal(http.ListenAndServe(":8080", handler))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For microservices with hundreds or thousands of APIs, the cost of this modification is catastrophic.&lt;/p&gt;

&lt;p&gt;Automatic Instrumentation&lt;br&gt;
Download the tool from the Release Page [2].&lt;br&gt;
Compile the application: &lt;code&gt;./otel-linux-amd64 go build -o myapp&lt;/code&gt;&lt;br&gt;
Configure and run:&lt;br&gt;
&lt;code&gt;export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"&lt;/code&gt;&lt;br&gt;
&lt;code&gt;export OTEL_SERVICE_NAME="my-app"&lt;/code&gt;&lt;br&gt;
&lt;code&gt;./myapp&lt;/code&gt;&lt;br&gt;
The compiler silently "weaves" the HTTP monitoring logic into the application binary. After an OpenTelemetry export endpoint (such as Jaeger or the console) is configured, you can run the generated server. When the /greet API is accessed, tracing data is automatically generated and reported, containing information such as the request URI, duration, and status code.&lt;/p&gt;

&lt;p&gt;From Commercial to Open Source&lt;br&gt;
In our deep practice with eBPF technology, we came to appreciate its power, but we also found it difficult to handle application-layer context perfectly. More importantly, we continuously received feedback from users troubled by tedious manual instrumentation and high maintenance costs.&lt;/p&gt;

&lt;p&gt;To solve this pain point, we began to explore a Go compile-time automatic instrumentation solution, shipped it in Application Real-Time Monitoring Service (ARMS) of Alibaba Cloud Observability [3], iterated on it continually in that most demanding proving ground, and gradually evolved it into a mature solution. It achieves tracing analysis with zero code modification and extends to rich advanced features such as metric statistics, runtime monitoring, and even continuous profiling. It can even instrument enterprise-internal SDKs via custom extension features [4].&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F210fxp15evljwlxntpj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F210fxp15evljwlxntpj4.png" alt=" " width="800" height="405"&gt;&lt;/a&gt;&lt;br&gt;
Tracing analysis&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgamm2p98hq5thzvptu2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgamm2p98hq5thzvptu2u.png" alt=" " width="800" height="477"&gt;&lt;/a&gt;&lt;br&gt;
Continuous profiling&lt;/p&gt;

&lt;p&gt;This solution has been validated by customers across many domains, including e-commerce, short dramas, AI video, and automotive. After seeing the value it brings to users and verifying its stability and feasibility, we decided to contribute its core capabilities to the OpenTelemetry community, in the hope that it becomes broadly accessible technology. We also collaborated with Datadog, a leading vendor in the observability space, to jointly promote the effort and ultimately facilitate the birth of this official project [1].&lt;/p&gt;

&lt;p&gt;Currently, the project is in an active development stage. We welcome everyone to try it out, provide feedback, and participate in contributions to jointly build a better cloud-native observability ecosystem.&lt;/p&gt;

&lt;p&gt;[1] OpenTelemetry Go compile instrumentation project: &lt;a href="https://github.com/open-telemetry/opentelemetry-go-compile-instrumentation" rel="noopener noreferrer"&gt;https://github.com/open-telemetry/opentelemetry-go-compile-instrumentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] Release link: &lt;a href="https://github.com/open-telemetry/opentelemetry-go-compile-instrumentation/releases/tag/v0.1.0" rel="noopener noreferrer"&gt;https://github.com/open-telemetry/opentelemetry-go-compile-instrumentation/releases/tag/v0.1.0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] Alibaba Cloud ARMS Go agent Commercial Edition: &lt;a href="https://www.alibabacloud.com/help/en/arms/application-monitoring/user-guide/monitoring-the-golang-applications/" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/help/en/arms/application-monitoring/user-guide/monitoring-the-golang-applications/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] Custom extension: &lt;a href="https://www.alibabacloud.com/help/en/arms/application-monitoring/use-cases/use-golang-agent-to-customize-scalability" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/help/en/arms/application-monitoring/use-cases/use-golang-agent-to-customize-scalability&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
    </item>
    <item>
      <title>RUM-integrated End-to-End Tracing: Breaking the Mobile Observability Black Hole</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Wed, 18 Mar 2026 02:35:26 +0000</pubDate>
      <link>https://dev.to/observabilityguy/rum-integrated-end-to-end-tracing-breaking-the-mobile-observability-black-hole-144e</link>
      <guid>https://dev.to/observabilityguy/rum-integrated-end-to-end-tracing-breaking-the-mobile-observability-black-hole-144e</guid>
      <description>&lt;p&gt;This article introduces RUM-powered end-to-end tracing that connects mobile and backend traces to break the mobile observability black hole.&lt;/p&gt;

&lt;p&gt;1.Background: The Mobile "Observability Black Hole"&lt;br&gt;
With the rapid adoption of microservices architectures, server-side observability has become increasingly mature. Distributed tracing systems such as Jaeger, Zipkin, and SkyWalking allow developers to clearly observe how a request enters the gateway and propagates through multiple microservices. However, when we attempt to extend this trace to the mobile client, a significant gap emerges.&lt;/p&gt;

&lt;p&gt;● Correlation challenges: The mobile client and the server operate as silos, each with its own logging system. The client records the request initiation time and outcome, whereas the server retains the complete trace. Yet there is no reliable linkage between the two. When failures occur, engineers must manually correlate data using timestamps. This approach is inefficient, error-prone, and nearly infeasible under high concurrency.&lt;/p&gt;

&lt;p&gt;● Unclear failure boundaries: A common scenario illustrates this issue: A user reports an API timeout, but server metrics show all requests returning a normal 200 status code. The root cause could lie in the user's local network, the carrier's transmission quality, or a transient backend fluctuation. Because mobile and server observability systems are separated, fault boundaries cannot be identified, often leading to blame-shifting between teams.&lt;/p&gt;

&lt;p&gt;● Inability to reproduce issues: Mobile network environments are more complex than server environments. DNS resolution may be hijacked, SSL handshakes may fail due to compatibility issues, and retries or timeouts under poor network conditions are common. In traditional solutions, this critical contextual data is lost once the request completes. When issues occur intermittently, developers are unable to reconstruct execution paths or identify root causes, leaving them to react passively to repeated user complaints.&lt;/p&gt;

&lt;p&gt;These limitations make end-to-end tracing increasingly essential. A robust solution must treat the mobile client as the true origin of the distributed trace, ensuring that every user-initiated request is fully captured, accurately correlated, and continuously traced down to the lowest-level database calls. In this article, we present a best-practice implementation that demonstrates how to connect mobile and backend traces using Alibaba Cloud Real User Monitoring (RUM). This approach enables true end-to-end tracing and improves the efficiency of network request troubleshooting.&lt;/p&gt;

&lt;p&gt;2.Core Solution: Technical Implementation of End-to-End Tracing&lt;br&gt;
Core Idea&lt;br&gt;
End-to-end tracing means making the client the first hop of a distributed trace, so that the client and the server share the same trace ID.&lt;/p&gt;

&lt;p&gt;In traditional architectures, tracing starts at the server gateway. When a request reaches the gateway, the Application Performance Monitoring (APM) agent assigns a trace ID and propagates it across subsequent microservice calls. With end-to-end tracing, the trace origin is moved to the user's device. The mobile SDK generates a trace ID and injects it into the request headers, allowing the entire request path from user interaction to the underlying database to be correlated by a single identifier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejegzfjyqzonwxa05q7l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejegzfjyqzonwxa05q7l.png" alt=" " width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Four Key Stages of the Implementation&lt;br&gt;
The implementation consists of four tightly connected stages.&lt;/p&gt;

&lt;p&gt;Stage 1: Client-side Trace Identifier Generation&lt;br&gt;
When a user initiates a network request, the client SDK intervenes before the request is sent:&lt;/p&gt;

&lt;p&gt;1.Request interception: The SDK captures outgoing requests using the interception mechanism of the network library, such as an OkHttp Interceptor.&lt;/p&gt;

&lt;p&gt;2.Span creation: A span is created for the request, generating two identifiers:&lt;/p&gt;

&lt;p&gt;Trace ID (a 32-character hexadecimal string): the unique identifier for the entire trace.&lt;br&gt;
Span ID (a 16-character hexadecimal string): the identifier for the current hop.&lt;br&gt;
3.Start time recording: The request start timestamp is recorded for subsequent latency analysis.&lt;/p&gt;
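
&lt;p&gt;As a sketch of what Stage 1 produces (this is an illustration, not the actual RUM SDK code), the two identifiers can be generated like this in Python; the helper names are hypothetical:&lt;/p&gt;

```python
import secrets

def new_trace_id() -> str:
    # W3C-style trace ID: 16 random bytes rendered as 32 hex characters
    return secrets.token_hex(16)

def new_span_id() -> str:
    # Span ID: 8 random bytes rendered as 16 hex characters
    return secrets.token_hex(8)

trace_id = new_trace_id()
span_id = new_span_id()
print(len(trace_id), len(span_id))  # → 32 16
```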

&lt;p&gt;Stage 2: Protocol Encoding and Header Injection&lt;br&gt;
The generated trace identifiers must be encoded in a format that the server can interpret. This requires a shared propagation protocol, such as W3C Trace Context or Apache SkyWalking (sw8).&lt;/p&gt;

&lt;p&gt;The client SDK injects the encoded trace data into the HTTP request headers, which are sent along with the request.&lt;/p&gt;
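
&lt;p&gt;With W3C Trace Context, for example, the injected header is a single traceparent string of the form version-traceid-parentid-flags. A minimal encoder is sketched below; the span ID value is a made-up example:&lt;/p&gt;

```python
def encode_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    # W3C Trace Context: version "00", then trace ID, parent span ID, and flags
    flags = "01" if sampled else "00"
    return "00-{}-{}-{}".format(trace_id, span_id, flags)

header = encode_traceparent("c7f332f53a9f42ffa21ef6c92f029c15", "00f067aa0ba902b7")
print(header)  # → 00-c7f332f53a9f42ffa21ef6c92f029c15-00f067aa0ba902b7-01
```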

&lt;p&gt;Stage 3: Network Transmission and Propagation&lt;br&gt;
The HTTP protocol inherently supports header propagation, which is the technical basis for trace context propagation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbiopcbx1yiar7gnaf8vb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbiopcbx1yiar7gnaf8vb.png" alt=" " width="789" height="198"&gt;&lt;/a&gt;&lt;br&gt;
Stage 4: Server-side Reception and Trace Continuation&lt;br&gt;
Once the request reaches the server, the APM agent continues the trace:&lt;/p&gt;

&lt;p&gt;1.Header parsing: Extract the trace ID and parent span ID from the traceparent or sw8 header.&lt;br&gt;
2.Context inheritance: Use the client-provided trace ID as the trace identifier instead of generating a new one.&lt;br&gt;
3.Child span creation: Create new spans for server-side processing, with their parent set to the client span.&lt;br&gt;
4.Propagation: Propagate the same trace ID in request headers when invoking downstream services.&lt;br&gt;
Through these four stages, every client-initiated request can be seamlessly linked with the server-side trace, forming a complete trace from the user's device to the database.&lt;/p&gt;

&lt;p&gt;Trace Propagation Protocols&lt;br&gt;
To ensure interoperability across systems, standardized trace propagation protocols are required. Two protocols are commonly used in practice.&lt;/p&gt;

&lt;p&gt;W3C Trace Context&lt;br&gt;
W3C Trace Context is an official W3C standard and provides the broadest compatibility.&lt;/p&gt;

&lt;p&gt;Header formats&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5iy0s6eihkd4zj3ms9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5iy0s6eihkd4zj3ms9y.png" alt=" " width="789" height="139"&gt;&lt;/a&gt;&lt;br&gt;
Fields&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdr92ionsu6it4oseoxqh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdr92ionsu6it4oseoxqh.png" alt=" " width="789" height="283"&gt;&lt;/a&gt;&lt;br&gt;
APM support&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flh4hpetjbeph229m52my.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flh4hpetjbeph229m52my.png" alt=" " width="789" height="224"&gt;&lt;/a&gt;&lt;br&gt;
Apache SkyWalking (sw8)&lt;br&gt;
The sw8 protocol is the native propagation protocol of Apache SkyWalking and carries richer contextual data.&lt;/p&gt;

&lt;p&gt;Header formats&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2d43zhezn23j42zvb0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2d43zhezn23j42zvb0q.png" alt=" " width="789" height="76"&gt;&lt;/a&gt;&lt;br&gt;
Fields&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqq0toepg38v9ojf51th2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqq0toepg38v9ojf51th2.png" alt=" " width="789" height="359"&gt;&lt;/a&gt;&lt;br&gt;
APM support&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F515vvgiozf6w113kco2e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F515vvgiozf6w113kco2e.png" alt=" " width="789" height="113"&gt;&lt;/a&gt;&lt;br&gt;
3.Case Study: End-to-End Troubleshooting of a Query API Timeout&lt;br&gt;
With the theory in place, this section walks through a real troubleshooting case to demonstrate how end-to-end tracing supports root cause analysis.&lt;/p&gt;

&lt;p&gt;Background&lt;br&gt;
We constructed a slow request scenario based on an open source code library. The architecture is shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbf7lixlexjbt9xfwfxvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbf7lixlexjbt9xfwfxvs.png" alt=" " width="727" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In daily use, we found a specific page loaded extremely slowly, resulting in a poor user experience. An initial assessment suggested that an API response was slow, but further analysis was required to identify where the latency occurred and why. We then leveraged the end-to-end tracing capabilities of Alibaba Cloud RUM to identify the root cause step by step.&lt;/p&gt;

&lt;p&gt;Step 1: Identify the Abnormal Request in the Cloud Monitor Console&lt;br&gt;
Log on to the Alibaba Cloud Management Console and go to Cloud Monitor 2.0 Console &amp;gt; Real User Monitoring &amp;gt; Your application &amp;gt; API Requests. This view provides performance statistics for all API requests.&lt;/p&gt;

&lt;p&gt;After sorting by Slow Response Percentage, we identified the problematic endpoint.&lt;/p&gt;

&lt;p&gt;The data shows that /java/products has an abnormally high response time, averaging over 40 seconds. This is far beyond normal expectations and sufficient to explain the slow page load.&lt;/p&gt;

&lt;p&gt;With the suspicious API identified, the next step is to examine its trace to determine where the time is being spent.&lt;/p&gt;

&lt;p&gt;Step 2: Track the Server-side Trace&lt;br&gt;
Click View Trace for the API operation to go to the trace details page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2jrs0813m1xj5wxdxkl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2jrs0813m1xj5wxdxkl.png" alt=" " width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the core value of end-to-end tracing: The complete trace from the mobile client to the backend service can be viewed in a single place.&lt;/p&gt;

&lt;p&gt;From the waterfall view, we can see that:&lt;/p&gt;

&lt;p&gt;● After the mobile client initiates the request, the trace continues seamlessly into the backend service.&lt;/p&gt;

&lt;p&gt;● The majority of the latency occurs in the /products endpoint.&lt;/p&gt;

&lt;p&gt;● The endpoint takes more than 40 seconds to return a response.&lt;/p&gt;

&lt;p&gt;For deeper analysis in server-side application monitoring, we record the trace ID: c7f332f53a9f42ffa21ef6c92f029c15.&lt;/p&gt;

&lt;p&gt;Step 3: Analyze the Server-side Trace&lt;br&gt;
Go to Application Monitoring &amp;gt; Backend application &amp;gt; Trace Explorer. Query the trace using the recorded trace ID.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zab3eyc9uuybbi441xf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zab3eyc9uuybbi441xf.png" alt=" " width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The backend trace reconstructs the execution flow of the /products API operation:&lt;/p&gt;

&lt;p&gt;● HikariDataSource.getConnection: executed 6 times, total 3 ms. Database connections are retrieved from the connection pool six times, taking 3 ms in total, indicating that this is not a bottleneck.&lt;/p&gt;

&lt;p&gt;● postgres: executed 6 times, total 2 ms. These are lightweight PostgreSQL operations and do not form a bottleneck.&lt;/p&gt;

&lt;p&gt;● SELECT postgres.products: Executed 1 + 5 times, total 42,290 ms (about 42.3 s). This is the key finding: The same SQL query related to products is executed five times, averaging about 8 seconds per execution.&lt;/p&gt;

&lt;p&gt;● This confirms that the latency is dominated by SQL execution rather than connection handling or network overhead.&lt;/p&gt;

&lt;p&gt;Step 4: Analyze the Slow SQL&lt;br&gt;
Click the final span, and view the executed SQL statements in the details panel on the right:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Initial query: Get all product data
SELECT * FROM products
-- N additional queries per product (N+1 pattern)
SELECT * FROM reviews, weekly_promotions WHERE productId = ?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The root cause begins to surface:&lt;/p&gt;

&lt;p&gt;Initial query: SELECT * FROM products is executed to retrieve all product records. This query completes quickly.&lt;br&gt;
Repeated per-product queries: An additional SELECT * FROM reviews, weekly_promotions WHERE productId = ? query is executed for each product.&lt;br&gt;
This is a classic N+1 query problem. Compounding the issue, weekly_promotions is a sleepy view, where heavy operations are performed for each query. Since a large number of products exist, the cumulative time consumed reaches 42 seconds.&lt;/p&gt;
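
&lt;p&gt;The standard fix for an N+1 pattern is to batch the per-product lookups into a single query. A minimal sketch using an in-memory SQLite database with hypothetical tables mirroring this case:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE reviews (productId INTEGER, rating INTEGER);
    INSERT INTO products VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO reviews VALUES (1, 5), (2, 4), (3, 3);
""")

# N+1 pattern: one query for products, then one extra query per product
products = conn.execute("SELECT id FROM products").fetchall()
n_plus_1_queries = 1 + len(products)

# Batched alternative: fetch reviews for every product in one round trip
ids = [row[0] for row in products]
placeholders = ",".join("?" * len(ids))
reviews = conn.execute(
    "SELECT productId, rating FROM reviews WHERE productId IN (%s)" % placeholders,
    ids,
).fetchall()

print(n_plus_1_queries, len(reviews))  # → 4 3
```

&lt;p&gt;With many products and a heavy view behind each per-product query, collapsing N+1 round trips into one (or using a JOIN with prefetching) removes the multiplied cost.&lt;/p&gt;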

&lt;p&gt;The thread name http-nio-7001-exec-3 is recorded for further verification using profiling data.&lt;/p&gt;

&lt;p&gt;Step 5: Validate the Conclusion with Profiling Data&lt;br&gt;
Go to Application Diagnostics &amp;gt; Continuous Profiling to view the profiling data of the backend service.&lt;/p&gt;

&lt;p&gt;Filter the data by the recorded thread, and the execution time distribution shows:&lt;/p&gt;

&lt;p&gt;● sun.nio.ch.Net.poll(FileDescriptor, int, long) accounts for nearly 100% of total time.&lt;/p&gt;

&lt;p&gt;● The thread is spending most of its time waiting for data from the PostgreSQL socket.&lt;/p&gt;

&lt;p&gt;The profiling results fully align with the trace analysis: The thread is blocked on slow SQL queries.&lt;/p&gt;

&lt;p&gt;Step 6: Summarize the Root Cause&lt;br&gt;
Based on the above investigation, the root cause is clear:&lt;/p&gt;

&lt;p&gt;Root cause: N+1 queries combined with a sleepy view&lt;/p&gt;

&lt;p&gt;1.The application code exhibits an N+1 query pattern:&lt;br&gt;
Initial query: SELECT * FROM products (1 execution)&lt;br&gt;
Per-product query: SELECT * FROM reviews, weekly_promotions WHERE productId = ? (N executions)&lt;/p&gt;

&lt;p&gt;2.weekly_promotions is a sleepy view with inherently time-consuming query logic.&lt;/p&gt;

&lt;p&gt;3.The combination causes the API response time to exceed 40 seconds.&lt;/p&gt;

&lt;p&gt;4.Summary&lt;br&gt;
End-to-end tracing eliminates the observability black hole between the client and the server. By injecting standardized trace headers on the mobile client, we establish a unified tracing workflow in which mobile requests and server-side traces share the same trace ID, enabling quick correlation. Issues can be accurately located, with latency clearly visible at every hop from the user's device to the database. This clearly defines fault boundaries and eliminates blame-shifting between client and server teams. As a result, performance improvements are driven by real trace data rather than assumptions.&lt;/p&gt;

&lt;p&gt;The Alibaba Cloud RUM SDK offers a non-intrusive solution to collect performance, stability, and user behavior data on Android. Developers can get started quickly by following the Android application integration guide. Beyond Android, RUM also supports Web, mini programs, iOS, and HarmonyOS, enabling unified monitoring and analysis across multiple platforms. For support, join the RUM Support Group (DingTalk Group ID: 67370002064).&lt;/p&gt;

&lt;p&gt;References&lt;br&gt;
Android application integration: &lt;a href="https://www.alibabacloud.com/help/en/arms/user-experience-monitoring/monitor-android-applications" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/help/en/arms/user-experience-monitoring/monitor-android-applications&lt;/a&gt;&lt;br&gt;
Java application instance monitoring: &lt;a href="https://www.alibabacloud.com/help/en/arms/application-monitoring/user-guide/instance-monitoring" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/help/en/arms/application-monitoring/user-guide/instance-monitoring&lt;/a&gt;&lt;br&gt;
Continuous profiling: &lt;a href="https://www.alibabacloud.com/help/en/arms/application-monitoring/user-guide/enable-continuous-profiling" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/help/en/arms/application-monitoring/user-guide/enable-continuous-profiling&lt;/a&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>rum</category>
    </item>
    <item>
      <title>Fast and Cost-effective: The New Version of SLS LogReduce, an Intelligent Engine That Discovers Patterns from Massive Logs</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Wed, 18 Mar 2026 02:16:46 +0000</pubDate>
      <link>https://dev.to/observabilityguy/fast-and-cost-effective-the-new-version-of-sls-logreduce-an-intelligent-engine-that-discovers-292j</link>
      <guid>https://dev.to/observabilityguy/fast-and-cost-effective-the-new-version-of-sls-logreduce-an-intelligent-engine-that-discovers-292j</guid>
      <description>&lt;p&gt;This article introduces the new version of Alibaba Cloud SLS LogReduce, an intelligent log analysis engine that discovers patterns from massive logs in real time with zero index overhead.&lt;/p&gt;

&lt;p&gt;Logs record the execution path of every request, every exception, and every line of code. However, when the daily log volume grows from tens of thousands to hundreds of millions of entries, traditional keyword search and manual filtering become inadequate. The new version of LogReduce is designed to solve this dilemma. It automatically discovers log categories and extracts log templates from massive logs, freeing engineers from "needle in a haystack" troubleshooting.&lt;/p&gt;

&lt;p&gt;1.Why Intelligent LogReduce Is Needed&lt;br&gt;
1.1 Cognitive Dilemma in "Log Floods"&lt;br&gt;
As distributed systems grow increasingly complex, a typical microservices architecture may contain dozens or even hundreds of service components, each continuously generating logs. Statistics show that a medium-sized internet application can generate several TB of logs per day. Facing such massive data, traditional log analysis methods face severe challenges:&lt;/p&gt;

&lt;p&gt;Information overload: When an alert is triggered, the engineer opens the log system and faces an overwhelming stream of logs. Which information is critical? Which is noise? The judgment relies entirely on experience.&lt;/p&gt;

&lt;p&gt;Keyword dependency: Traditional methods rely on preset keywords (such as ERROR and Exception) to filter logs. The problem is that unexpected abnormal patterns may be missed entirely.&lt;/p&gt;

&lt;p&gt;Context fragmentation: Even if suspicious logs are found, understanding their meanings still requires a large amount of context information. The same type of issue may appear thousands of times in slightly different forms, making it difficult to summarize manually.&lt;/p&gt;

&lt;p&gt;1.2 Evolution of SLS LogReduce&lt;br&gt;
Alibaba Cloud Simple Log Service (SLS) is a cloud-native observability and analysis platform for logs. It provides users with one-stop services such as log collection, storage, query, and analysis. As one of the core capabilities of log analysis, SLS launched the LogReduce feature (hereinafter referred to as the "old version LogReduce") early in its development to help users automatically extract patterns from massive logs.&lt;/p&gt;

&lt;p&gt;The old version LogReduce adopts a "LogReduce at ingestion" architecture. It pre-computes a clustering index when logs are ingested and maps each log to its corresponding pattern. The advantage of this method is comprehensive clustering, but it also incurs additional index storage costs, which can become a burden in large-scale log scenarios.&lt;/p&gt;

&lt;p&gt;The "new version LogReduce" introduced in this topic is an architectural upgrade to the old version and adopts a new "LogReduce at query" design. It no longer requires pre-establishing clustering indexes. Instead, it calculates log patterns in real time when a User initiates a query, thereby achieving zero additional index Traffic, more flexible Analysis capabilities, and better cost-efficiency.&lt;/p&gt;

&lt;p&gt;1.3 From "Viewing Logs" to "Understanding Logs"&lt;br&gt;
The core idea of the new version LogReduce is: Let machines automatically discover patterns in logs.&lt;/p&gt;

&lt;p&gt;LogReduce is based on a key insight: Although a system may generate a huge volume of logs, they often originate from a limited number of log output statements. The logs generated by each log output statement share the same format and can be represented by the same "log template."&lt;/p&gt;

&lt;p&gt;For example, the following three logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Got exception while serving block-123 to /10.251.203.149
Got exception while serving block-456 to /10.251.203.150
Got exception while serving block-789 to /10.251.203.151
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Can be summarized into one template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Got exception while serving &lt;span class="nt"&gt;&amp;lt;BLOCK_ID&amp;gt;&lt;/span&gt; to /&lt;span class="nt"&gt;&amp;lt;IP&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &amp;lt;BLOCK_ID&amp;gt; and &amp;lt;IP&amp;gt; are the variable parts that change from log to log, while the rest are constants that remain unchanged within the same class of logs.&lt;/p&gt;

&lt;p&gt;Through this abstraction, thousands of logs that would otherwise need to be reviewed one by one are compressed into a few log categories. Engineers can first locate issues at the log template level and then drill down to view specific log samples. This is exactly the cognitive upgrade brought by LogReduce.&lt;/p&gt;
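
&lt;p&gt;The idea can be illustrated with a toy extractor that collapses the three example logs above into one template. Real LogReduce infers variable positions statistically rather than from hand-written rules, so this is only a sketch:&lt;/p&gt;

```python
import re

def to_template(line: str) -> str:
    # Replace obvious variable tokens with placeholders (toy rules only)
    line = re.sub(r"\d+\.\d+\.\d+\.\d+", "<IP>", line)
    line = re.sub(r"block-\d+", "<BLOCK_ID>", line)
    return line

logs = [
    "Got exception while serving block-123 to /10.251.203.149",
    "Got exception while serving block-456 to /10.251.203.150",
    "Got exception while serving block-789 to /10.251.203.151",
]
templates = {to_template(line) for line in logs}
print(templates)  # → {'Got exception while serving <BLOCK_ID> to /<IP>'}
```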

&lt;p&gt;2.Core Design Concepts&lt;br&gt;
2.1 Zero Index Traffic: Lightweight Cost Advantage&lt;br&gt;
Compared with the old version LogReduce, the biggest architectural advantage of the new version is zero additional index traffic.&lt;/p&gt;

&lt;p&gt;The old version LogReduce must pre-compute clustering indexes during data ingestion, which means that every log incurs additional index storage costs. For a large Logstore, this cost can be considerable.&lt;/p&gt;

&lt;p&gt;The new version LogReduce adopts a completely different policy: It computes log templates in real time at query time, based on existing field indexes. This "LogReduce at query" method avoids the storage overhead of pre-indexing and allows clustering results to instantly reflect the latest log data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb4zlg28bnrfb4xoxrb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb4zlg28bnrfb4xoxrb6.png" alt=" " width="789" height="150"&gt;&lt;/a&gt;&lt;br&gt;
2.2 Intelligent Sampling: Balancing Precision and Performance&lt;br&gt;
When the log volume is particularly large (for example, tens of millions of logs within a time window), analyzing the entire dataset is neither practical nor necessary. The new version LogReduce has a built-in intelligent sampling policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Sampling policy: When the Log Volume exceeds the threshold, automatic downsampling is performed
const sampleQuery = logCount &amp;gt; 50000 
 ? `|sample-method='bernoulli' ${getSampleNumber(logCount, 50000)}`
 : ''
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sampling algorithm uses Bernoulli Sampling to ensure that each log record has an equal probability of being selected, thereby ensuring the representativeness of the sampling results. In the model building phase, the system samples up to 50,000 log records for pattern search. In the result matching phase, the system samples up to 200,000 log records for pattern matching and statistics.&lt;/p&gt;

&lt;p&gt;This stratified sampling design allows the system to maintain response times within seconds when processing massive amounts of data, without significantly impacting clustering effectiveness.&lt;/p&gt;
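
&lt;p&gt;Bernoulli sampling keeps each record independently with probability target/total, so every log has the same chance of being selected and the sample stays representative. A minimal sketch of the idea (not the SLS implementation):&lt;/p&gt;

```python
import random

def bernoulli_sample(rows, target, seed=None):
    # Keep each row independently with probability target/len(rows);
    # the output size is only approximately equal to `target`.
    p = min(1.0, target / len(rows))
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]

logs = list(range(1_000_000))
sample = bernoulli_sample(logs, 50_000, seed=42)
print(len(sample))  # roughly 50,000
```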

&lt;p&gt;2.3 Intelligent Variable Detection: Beyond Simple Pattern Matching&lt;br&gt;
One of the core challenges of LogReduce is to accurately distinguish between "variable parts" and "constant parts" in logs. The new version of LogReduce uses a more intelligent variable detection algorithm that can handle various complex scenarios:&lt;/p&gt;

&lt;p&gt;Numeric variables: Automatically detects numeric patterns such as numbers, IP addresses, and port numbers, and supports range statistics.&lt;/p&gt;

&lt;p&gt;Enumeration variables: For variables with limited values (such as status codes and Service Names), the system automatically calculates the Top N value distribution.&lt;/p&gt;

&lt;p&gt;Composite variables: For complex variable patterns (such as UUIDs and trace IDs), the system intelligently detects their boundaries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;//Variable summary statistics | extend var_summary = summary_log_variables(variables_arr, '{"topk": 10}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The variable summary (var_summary) not only records value samples of variables but also contains variable type inference (range / enum / gauge) and distribution statistics, laying a foundation for subsequent in-depth analysis.&lt;/p&gt;
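
&lt;p&gt;A toy version of this type inference shows the intent: numeric variables get a range summary, while low-cardinality string variables get a top-N enumeration. The real summary_log_variables operator is considerably richer than this sketch:&lt;/p&gt;

```python
from collections import Counter

def summarize_variable(values, topk=10):
    # Numeric values: summarize as a range; otherwise: top-N enum distribution
    try:
        nums = [float(v) for v in values]
        return {"type": "range", "min": min(nums), "max": max(nums)}
    except ValueError:
        return {"type": "enum", "topk": Counter(values).most_common(topk)}

print(summarize_variable(["200", "404", "200"]))
# → {'type': 'range', 'min': 200.0, 'max': 404.0}
print(summarize_variable(["OK", "TIMEOUT", "OK"]))
# → {'type': 'enum', 'topk': [('OK', 2), ('TIMEOUT', 1)]}
```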

&lt;p&gt;3.Technical Implementation Highlights&lt;br&gt;
3.1 SPL Operator-driven Clustering Pipeline&lt;br&gt;
The core computation logic of the new version of LogReduce is implemented using the SLS Processing Language (SPL), forming a complete clustering pipeline:&lt;/p&gt;

&lt;p&gt;3.1.1 Phase 1: Model Building&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;*
| stats content_arr = array_agg("Content")
| extend ret = get_log_patterns(
content_arr,
ARRAY['separator list'],     
cast(null as array(varchar)),
cast(null as array(varchar)),
'{"threshold": 3, "tolerance": 0.1, "maxDigitRatio": 0.1}'
)
| extend model_id = ret.model_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;get_log_patterns is the core pattern extraction operator. It accepts a collection of log content and automatically discovers log templates within it by using clustering algorithms. The algorithm parameters include:&lt;/p&gt;

&lt;p&gt;• threshold: The minimum support value for detecting whether a token at a specific position is a variable. The larger the threshold, the less likely the token is determined to be a variable.&lt;/p&gt;

&lt;p&gt;• tolerance: The tolerance for variable detection. The smaller the tolerance, the more likely frequently appearing tokens are determined to be constants. We recommend using the default value.&lt;/p&gt;

&lt;p&gt;• maxDigitRatio: The maximum ratio threshold of numeric characters.&lt;/p&gt;
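&lt;p&gt;To make the role of the threshold parameter concrete, here is a toy position-wise template miner (an illustrative sketch under the simplifying assumption that all logs have the same token count; the real get_log_patterns operator is far more sophisticated). A token position becomes a variable, rendered as &amp;lt;*&amp;gt;, once the number of distinct tokens observed there reaches the threshold:&lt;/p&gt;

```python
def mine_pattern(lines, threshold=3):
    """Toy position-wise template mining. Assumes every line has the same
    number of whitespace-separated tokens. A slot whose distinct-value
    count reaches `threshold` is treated as a variable and shown as <*>."""
    tokenized = [line.split() for line in lines]
    # Collect the set of distinct tokens seen at each position.
    slots = [set(column) for column in zip(*tokenized)]
    return " ".join(
        "<*>" if len(slot) >= threshold else next(iter(slot)) for slot in slots
    )

logs = ["connect from 10.0.0.%d port 22 ok" % i for i in range(5)]
# The IP slot has 5 distinct values (>= threshold), so it becomes <*>:
pattern = mine_pattern(logs, threshold=3)
```

A larger threshold demands more distinct values before a slot is declared a variable, which is exactly why the article notes that raising it makes tokens less likely to be classified as variables.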

&lt;p&gt;3.1.2 Phase 2: Pattern Matching&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;* 
| extend ret = match_log_patterns('${modelId}', "Content")
| extend pattern_id = ret.pattern_id,
  pattern = ret.pattern,
  pattern_regexp = ret.regexp,
  variables = ret.variables
| stats event_num = count(1), hist = histogram(time_bucket_id)
  by pattern_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;match_log_patterns matches each log record against the discovered patterns and extracts the following information:&lt;/p&gt;

&lt;p&gt;• pattern_id: The ID of the pattern.&lt;/p&gt;

&lt;p&gt;• pattern: The log template.&lt;/p&gt;

&lt;p&gt;• pattern_regexp: The regular expression of the pattern.&lt;/p&gt;

&lt;p&gt;• variables: The specific values of the variable parts.&lt;/p&gt;

&lt;p&gt;3.1.3 Phase 3: Comparative Analysis (Optional)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;| extend ret = merge_log_patterns('${modelId1}', '${modelId2}')
| extend model_id = ret.model_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For comparative analysis scenarios, merge_log_patterns can merge the clustering models of two time ranges, thereby comparing them in a unified pattern space to detect new, disappeared, or changed log patterns.&lt;/p&gt;
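&lt;p&gt;The comparison this enables can be sketched as a diff over two pattern-to-count maps (an illustrative sketch; the function name and change factor are ours, not part of SLS):&lt;/p&gt;

```python
def diff_patterns(current, baseline, change_factor=2):
    """Compare pattern -> count maps from two time ranges: report patterns
    that are new, that disappeared, or whose count changed by at least
    `change_factor`x (illustrative sketch of what merging models enables)."""
    new = {p: c for p, c in current.items() if p not in baseline}
    gone = {p: c for p, c in baseline.items() if p not in current}
    changed = {
        p: (baseline[p], c)
        for p, c in current.items()
        if p in baseline and max(c, baseline[p]) >= change_factor * min(c, baseline[p])
    }
    return new, gone, changed

new, gone, changed = diff_patterns(
    {"timeout <*>": 500, "started <*>": 10},
    {"started <*>": 12, "stopped <*>": 3},
)
```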

&lt;p&gt;3.2 Frontend Rendering: High-performance Big Data Display&lt;br&gt;
On the frontend, the core challenge for the LogReduce widget is how to efficiently render and interact with a large number of clustering results.&lt;/p&gt;

&lt;p&gt;3.2.1 Virtual Scrolling and Paging&lt;br&gt;
Clustering results may contain hundreds or even thousands of log patterns. The system uses pagination, rendering only 15 records per page, combined with virtual scrolling technology to ensure the interface remains smooth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Paging logic
const [currentPage, setCurrentPage] = useState&lt;span class="nt"&gt;&amp;lt;number&amp;gt;&lt;/span&gt;(1)
const pageSize = 15
const pagedResult = useMemo(() =&amp;gt; {
  const startIndex = (currentPage - 1) * pageSize
  return filteredResult.slice(startIndex, startIndex + pageSize)
}, [filteredResult, currentPage])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.2.2 Interaction Design for Highlighted Variables&lt;br&gt;
The variables in the log template need to be highlighted and allow you to view variable distribution upon clicking. The system implements a dedicated Highlight widget, which can:&lt;/p&gt;

&lt;p&gt;• Parse template strings and detect variable placeholders.&lt;/p&gt;

&lt;p&gt;• Generate an independent clickable area for each variable.&lt;/p&gt;

&lt;p&gt;• Display the distribution statistics of a variable after it is clicked (enumeration types display the Top N values; numeric types display the range distribution).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk38rsvlmb0z5a8ikav5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk38rsvlmb0z5a8ikav5g.png" alt=" " width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.2.3 Dual Column Chart in Comparative View&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07r3yohz4f3845ttvagj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07r3yohz4f3845ttvagj.png" alt=" " width="413" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In comparative analysis mode, each log pattern needs to display the data distribution of two time ranges simultaneously. The system uses a dual-color column chart to meet this requirement:&lt;/p&gt;

&lt;p&gt;• Dark columns: Log count in the current time range (experiment group).&lt;/p&gt;

&lt;p&gt;• Light columns: Log count in the comparison time range (comparison group).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bnbaqmfmpg6u5dp3vlo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bnbaqmfmpg6u5dp3vlo.png" alt=" " width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Through visual comparison, users can intuitively discover:&lt;/p&gt;

&lt;p&gt;• New log patterns. (Present in the experiment group, absent in the comparison group)&lt;/p&gt;

&lt;p&gt;• Disappeared log patterns. (Absent in the experiment group, present in the comparison group)&lt;/p&gt;

&lt;p&gt;• Log patterns with significant quantity changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c5idstss39p9cp8ecku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c5idstss39p9cp8ecku.png" alt=" " width="745" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.3 Reverse Regular Expression Lookup: Bridging the Last Mile from Analysis to Query&lt;br&gt;
After a problem pattern is discovered on the LogReduce page, how can you view all logs of this category?&lt;/p&gt;

&lt;p&gt;The new version of LogReduce solves this problem through regular expressions. Each log template automatically generates a corresponding regular expression (pattern_regexp). You can copy this regular expression and use the regexp_like operator to perform a precise query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;* | SELECT * FROM log WHERE regexp_like(Content, 'Copied regular expression')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This design seamlessly connects cluster analysis with raw log queries: as soon as a problem pattern is discovered, you can immediately drill down to view the specific log details.&lt;/p&gt;
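&lt;p&gt;The same drill-down can be reproduced client-side with any regular-expression engine. The sketch below uses Python's re module and a hypothetical pattern_regexp (in practice you copy the exact regexp from the console):&lt;/p&gt;

```python
import re

# Hypothetical regexp, as LogReduce might generate for the template
# "Got exception while serving <*> to /: Connection timeout".
# The real pattern_regexp is copied from the console, not hand-written:
pattern_regexp = r"Got exception while serving \S+ to /: Connection timeout"

logs = [
    "Got exception while serving 10.251.0.7 to /: Connection timeout",
    "Request served in 12ms",
]
# Client-side equivalent of regexp_like(Content, '<copied regexp>'):
matched = [line for line in logs if re.search(pattern_regexp, line)]
```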

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2dbsqab3aqowkuotj17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2dbsqab3aqowkuotj17.png" alt=" " width="769" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4.Typical Scenarios&lt;br&gt;
4.1 Scenario 1: Quickly Locate Fault Logs&lt;br&gt;
An e-commerce platform received a high volume of alerts during a promotional activity. The O&amp;amp;M engineer opens the LogReduce page:&lt;/p&gt;

&lt;p&gt;1.Set the time range to the 10 minutes after the alerts started.&lt;br&gt;
2.Filter out normal INFO-level logs with the search statement: * and not LEVEL: INFO.&lt;br&gt;
3.View the clustering results and discover a new pattern: Got exception while serving &amp;lt;*&amp;gt; to /: Connection timeout.&lt;br&gt;
4.Click the pattern to view the variable distribution, and discover that the variable values are concentrated in the 10.251.xxx.xxx network segment.&lt;br&gt;
5.Determine that the issue is likely a network problem in that segment, and immediately begin troubleshooting.&lt;br&gt;
The entire process takes less than 5 minutes, whereas traditional keyword search may require trying multiple keyword combinations, taking several times longer.&lt;/p&gt;

&lt;p&gt;4.2 Scenario 2: Post-Release Comparative Analysis&lt;br&gt;
The development team published a new version and needs to evaluate its impact on log patterns:&lt;/p&gt;

&lt;p&gt;1.Set the current time range to the one-hour period after the release.&lt;/p&gt;

&lt;p&gt;2.Set the comparison time range to the one-hour period before the release.&lt;/p&gt;

&lt;p&gt;3.View the comparison results, and pay attention to the following situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Newly appearing error log patterns&lt;/li&gt;
&lt;li&gt;Disappearing log patterns (some problems may have been fixed)&lt;/li&gt;
&lt;li&gt;Patterns with significantly changed quantities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;4.For suspicious new patterns, click to view log samples for further analysis.&lt;/p&gt;

&lt;p&gt;4.3 Scenario 3: Multi-module Group Analysis&lt;br&gt;
When a Logstore contains logs from multiple modules, you can use the group clustering feature:&lt;/p&gt;

&lt;p&gt;1.Select Component or ServiceName as the aggregation field.&lt;br&gt;
2.The system will first group by module, and then perform clustering independently within each group.&lt;br&gt;
3.By using the group view, you can quickly detect which module generated abnormal logs.&lt;/p&gt;

&lt;p&gt;This layered analysis method is particularly suitable for log analysis in large-scale systems, avoiding mutual interference between logs of different modules.&lt;/p&gt;

&lt;p&gt;5.Thoughts on Algorithm Design&lt;br&gt;
5.1 Why Choose "Clustering at Query Time"?&lt;br&gt;
When designing the new version of LogReduce, we faced a key architectural decision: pre-compute clustering indexes at write time, or compute in real time at query time?&lt;/p&gt;

&lt;p&gt;We eventually chose the latter, mainly based on the following considerations:&lt;/p&gt;

&lt;p&gt;Flexibility: The pre-computation method requires defining clustering fields and parameters in advance, which are difficult to change once configured. Computing at query time, in contrast, allows users to dynamically select clustering fields, filter conditions, and time ranges, providing greater flexibility.&lt;/p&gt;

&lt;p&gt;Cost-effectiveness: Not all logs require cluster analysis. The pre-computation method processes all logs uniformly, incurring unnecessary costs. Computing at query time is "pay-as-you-go," consuming resources only when analysis is truly needed.&lt;/p&gt;

&lt;p&gt;Algorithm evolution: Clustering algorithms are an area of continuous optimization. Computing at query time allows us to upgrade the algorithm at any time: new analyses automatically benefit from the latest algorithm improvements without reprocessing historical data.&lt;/p&gt;

&lt;p&gt;5.2 The Art of Sampling: How to Balance Efficiency and Precision&lt;br&gt;
Sampling is one of the key designs of the new version of LogReduce. A natural concern is: Will sampling miss important log patterns?&lt;/p&gt;

&lt;p&gt;Our policy is "phased sampling":&lt;/p&gt;

&lt;p&gt;Pattern discovery phase: 50,000 logs are sampled to discover patterns. Because the number of log patterns is usually far smaller than the number of logs (this is the basic assumption of LogReduce), a sample of 50,000 logs is usually sufficient to discover the vast majority of patterns.&lt;/p&gt;

&lt;p&gt;Pattern matching phase: 200,000 logs are sampled for statistics. The sampling in this phase mainly affects the precision of the count statistics, rather than pattern discovery.&lt;/p&gt;

&lt;p&gt;Variable statistics phase: For each pattern, the top 10 variable values are retained. This is sufficient for users to understand the distribution features of the variables.&lt;/p&gt;

&lt;p&gt;Practice has shown that this stratified sampling policy provides sufficiently accurate clustering results in the vast majority of scenarios, while maintaining second-level query responses.&lt;/p&gt;
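&lt;p&gt;The intuition can be quantified. Under Bernoulli sampling with keep probability p = sample_size / total_logs, a pattern that occurs k times is missed entirely with probability (1 - p)^k, which shrinks geometrically in k (a back-of-the-envelope sketch, not an SLS guarantee):&lt;/p&gt;

```python
def miss_probability(pattern_count, total_logs, sample_size):
    """Probability that a pattern occurring `pattern_count` times among
    `total_logs` records is entirely absent from a Bernoulli sample with
    per-record keep probability sample_size / total_logs."""
    keep_prob = sample_size / total_logs
    return (1 - keep_prob) ** pattern_count

# Even among 10 million logs, a pattern with 1,000 occurrences is almost
# certainly represented in a 50,000-record sample:
prob = miss_probability(1_000, 10_000_000, 50_000)
```

Only patterns with a handful of occurrences have a meaningful chance of being missed, which matches the "patterns are far fewer than logs" assumption above.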

&lt;p&gt;6.Summary and Outlook&lt;br&gt;
The new version of LogReduce represents a paradigm shift in the realm of log analysis: from passive keyword search to active pattern discovery; from manual, line-by-line troubleshooting to intelligent categorization.&lt;/p&gt;

&lt;p&gt;Its core value lies in:&lt;/p&gt;

&lt;p&gt;1.Improved efficiency: Compressing millions of logs into hundreds of log categories allows engineers to quickly grasp the full picture of their logs.&lt;/p&gt;

&lt;p&gt;2.Enhanced insight: Automatically detecting newly appearing or disappearing log patterns surfaces changes that are difficult to spot manually.&lt;/p&gt;

&lt;p&gt;3.Cost optimization: The design of zero extra index traffic ensures that cluster analysis is no longer a cost burden.&lt;/p&gt;

&lt;p&gt;4.Flexible analysis: Various analysis modes, such as comparative analysis and group clustering, meet the needs of different scenarios.&lt;/p&gt;

&lt;p&gt;Looking ahead, LogReduce has more possibilities:&lt;/p&gt;

&lt;p&gt;• Integration with outlier detection: Automatically detects log patterns with sudden increases or decreases in quantity to provide early warnings for potential issues.&lt;/p&gt;

&lt;p&gt;• Integration with LLMs: Uses large language models to understand log semantics, assist in analyzing log templates, and provide more intelligent pattern classification and problem diagnosis.&lt;/p&gt;

&lt;p&gt;• Integration with UModel: Associating entities with LogSets allows users to view LogReduce results and build a more complete observability knowledge graph.&lt;/p&gt;

&lt;p&gt;We believe that as these capabilities continue to evolve, log analysis will transform from a tedious operational task into an intelligent system-insight tool.&lt;/p&gt;

</description>
      <category>sls</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building a Navigation Map for Data Assets: Data Discovery and End-to-End Analysis in UModel</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Tue, 17 Mar 2026 03:13:37 +0000</pubDate>
      <link>https://dev.to/observabilityguy/building-a-navigation-map-for-data-assets-data-discovery-and-end-to-end-analysis-in-umodel-119f</link>
      <guid>https://dev.to/observabilityguy/building-a-navigation-map-for-data-assets-data-discovery-and-end-to-end-analysis-in-umodel-119f</guid>
      <description>&lt;p&gt;This article introduces UModel’s data discovery and end-to-end analysis capabilities, enabling unified metadata exploration, relationship mapping, and...&lt;br&gt;
1.Background Information&lt;br&gt;
Imagine you are standing in a vast library filled with tens of thousands of books, where the catalog of each book is scattered across different rooms and each room uses its own unique indexing system. If you want to find a book about service calls, you have to go back and forth between the APM room, Kubernetes room, and cloud resource room, and remember different search rules for each room.&lt;/p&gt;

&lt;p&gt;This is the real dilemma faced by many enterprises in the field of observability. UModel acts like an intelligent management system built for this chaotic library, which allows you to easily explore and understand the structure of the entire knowledge graph.&lt;/p&gt;

&lt;p&gt;1.1 What is UModel?&lt;br&gt;
UModel is a graph-based observable data modeling method designed to address core challenges in the collection, organization, and usage of observable data within enterprise-level environments. UModel employs a graph structure composed of nodes and links to describe the IT world, and implements unified representation, storage decoupling, and intelligent analysis of observable data through standardized data modeling.&lt;/p&gt;

&lt;p&gt;As the foundational data modeling framework for Alibaba Cloud's observable system, UModel provides enterprises with a set of common observable interaction languages that enable humans, programs, and AI to understand and analyze observable data, thereby building true full-stack observability capabilities.&lt;/p&gt;

&lt;p&gt;Core Concepts&lt;br&gt;
UModel employs fundamental graph theory concepts and uses nodes and links to form a directed graph for modeling IT systems:&lt;/p&gt;

&lt;p&gt;● Node: The core component is a Set (data collection), which represents a collection of homogeneous entities or data, such as EntitySet (entity set), MetricSet (metric set), and LogSet (log set). It also includes the Storage type for the Set, such as Simple Log Service (SLS), Prometheus, and MySQL.&lt;/p&gt;

&lt;p&gt;● Link: indicates the relationships between nodes, such as EntitySetLink (entity association), DataLink (data association), and StorageLink (storage association).&lt;/p&gt;

&lt;p&gt;● Field: defines constraints and properties for Sets and Links and encompasses over 20 configuration items, including names, types, constraint rules, and analysis features.&lt;/p&gt;

&lt;p&gt;1.2 What is a UModel Query?&lt;br&gt;
A UModel query is a dedicated interface in EntityStore for querying knowledge graph metadata. Using the .umodel query syntax, it enables exploration of EntitySet definitions, EntitySetLink relationships, and the complete knowledge graph structure. This provides robust support for data modeling analysis and schema management.&lt;/p&gt;

&lt;p&gt;Query Differentiation&lt;br&gt;
The following table describes the differences between UModel queries and queries of other types.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin8idwsnkbkmca0clqd5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin8idwsnkbkmca0clqd5.png" alt=" " width="789" height="150"&gt;&lt;/a&gt;&lt;br&gt;
The UModel query operates at the metadata layer, which helps users understand the structure and definitions of data models, rather than the specific runtime data.&lt;/p&gt;

&lt;p&gt;2.UModel Query&lt;br&gt;
2.1 Data Model&lt;br&gt;
Data Structure&lt;br&gt;
The data returned by a UModel query has a fixed five-field structure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv3jgawibi7qxgrrsery9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv3jgawibi7qxgrrsery9.png" alt=" " width="789" height="248"&gt;&lt;/a&gt;&lt;br&gt;
Note: metadata, schema, and spec are JSON-formatted strings. Use the json_extract_scalar function to extract values.&lt;/p&gt;
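&lt;p&gt;For intuition, the extraction behaves like the following minimal Python analogue (illustrative only; the real json_extract_scalar supports full JSONPath, and the sample metadata document here is ours):&lt;/p&gt;

```python
import json

def json_extract_scalar(doc, path):
    """Minimal analogue of the SLS json_extract_scalar function for simple
    '$.a.b' paths: returns the scalar at the path as a string, or None if
    the path points at a non-scalar value."""
    value = json.loads(doc)
    for key in path.lstrip("$.").split("."):
        value = value[key]
    return str(value) if isinstance(value, (str, int, float, bool)) else None

# A hypothetical metadata document shaped like the examples in this article:
metadata = '{"name": "apm.service", "domain": "apm", "description": {"zh_cn": "..."}}'
```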

&lt;p&gt;2.2 Query Syntax&lt;br&gt;
Basic Query Syntax&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Basic query format
.umodel | [SPL operations...]
-- Query with constraints
.umodel | where &lt;span class="nt"&gt;&amp;lt;condition&amp;gt;&lt;/span&gt; | limit &lt;span class="nt"&gt;&amp;lt;count&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Core Query Patterns&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;List Queries - metadata enumeration
Query all UModel data:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Query all UModel data (not recommended for production environments):
.umodel
-- Paginated query
.umodel | limit 0, 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Filter by type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Query all EntitySet definitions
.umodel | where kind = 'entity_set' | limit 0, 10
-- Query all EntitySetLink definitions
.umodel | where kind = 'entity_set_link' | limit 0, 10
-- Query all link types (relationship definitions)
.umodel | where __type__ = 'link' | limit 0, 10
-- Query all node types (entity definitions)
.umodel | where __type__ = 'node' | limit 0, 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Filter by property:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Query the definition of an entity with a specific name
.umodel | where json_extract_scalar(metadata, '$.name') = 'acs.ecs.instance' | limit 0, 10
-- Query all definitions in a specific domain
.umodel | where json_extract_scalar(metadata, '$.domain') = 'apm' | limit 0, 10
-- Query definitions across multiple domains
.umodel | where json_extract_scalar(metadata, '$.domain') in ('acs', 'apm', 'k8s') | limit 0, 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2.Graph Analysis - relationship exploration&lt;br&gt;
UModel supports metadata-driven graph computations for analyzing relationships between EntitySets:&lt;/p&gt;

&lt;p&gt;Basic graph query syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;.umodel | graph-match &lt;span class="nt"&gt;&amp;lt;path&amp;gt;&lt;/span&gt; project &lt;span class="nt"&gt;&amp;lt;output&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Concepts:&lt;/p&gt;

&lt;p&gt;In graph queries, two fundamental graph concepts are critical:&lt;/p&gt;

&lt;p&gt;Node type (label): represented as @ in UModel metadata graph queries. Example: apm@entity_set.&lt;br&gt;
Node ID: represented as &lt;strong&gt;entity_id&lt;/strong&gt; in UModel metadata graph queries, formatted as kind::domain::name. Example: entity_set::apm::apm.service.&lt;br&gt;
Path queries in graphs use ASCII characters to represent the direction of relationships.&lt;/p&gt;
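&lt;p&gt;The kind::domain::name format is straightforward to parse programmatically, which is handy when scripting against query results (a minimal sketch; the helper name is ours):&lt;/p&gt;

```python
def parse_entity_id(entity_id):
    """Split a UModel node ID of the form kind::domain::name into its parts."""
    kind, domain, name = entity_id.split("::")
    return {"kind": kind, "domain": domain, "name": name}

parsed = parse_entity_id("entity_set::apm::apm.service")
```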

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faq485b951573eo4cdb21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faq485b951573eo4cdb21.png" alt=" " width="789" height="150"&gt;&lt;/a&gt;&lt;br&gt;
Query EntitySet relationships:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Query all relationships for a specific EntitySet
.umodel
| graph-match (s:"acs@entity_set" {__entity_id__: 'entity_set::acs::acs.ecs.instance'})
              -[e]-(d)
  project s, e, d | limit 0, 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Directional relationship queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Incoming relationships (pointing to an EntitySet):
.umodel
| graph-match (s:"acs@entity_set" {__entity_id__: 'entity_set::acs::acs.ecs.instance'})
              &lt;span class="err"&gt;&amp;lt;&lt;/span&gt;--(d)
  project s, d | limit 0, 10
-- Outgoing relationships (originating from an EntitySet):
.umodel
| graph-match (s:"acs@entity_set" {__entity_id__: 'entity_set::acs::acs.ack.cluster'})
              --&amp;gt;(d)
  project s, d | limit 0, 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2.3 Advanced Queries&lt;br&gt;
JSON path extraction&lt;br&gt;
Since UModel data is stored in JSON format, JSON functions are required for field extraction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Extract basic information
.umodel
| extend
    entity_name = json_extract_scalar(metadata, '$.name'),
    entity_domain = json_extract_scalar(metadata, '$.domain'),
    entity_description = json_extract_scalar(metadata, '$.description.zh_cn')
| project entity_name, entity_domain, entity_description | limit 0, 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Composite filtering with multiple conditions&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Query with complex conditions
.umodel
| where kind = 'entity_set'
  and json_extract_scalar(metadata, '$.domain') in ('apm', 'k8s')
  and json_array_length(json_extract(spec, '$.fields')) &amp;gt; 5
| extend
    entity_name = json_extract_scalar(metadata, '$.name'),
    field_count = json_array_length(json_extract(spec, '$.fields'))
| sort field_count desc
| limit 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aggregate analysis&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Count the number of EntitySets by domain
.umodel
| where kind = 'entity_set'
| extend domain = json_extract_scalar(metadata, '$.domain')
| stats entity_count = count() by domain
| sort entity_count desc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2.4 Performance Optimization Recommendations&lt;br&gt;
Use Precise Filters&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Before optimization: broad scope
.umodel | where json_extract_scalar(metadata, '$.name') like '%service%'
-- After optimization: precise matching
.umodel | where kind = 'entity_set'
  and json_extract_scalar(metadata, '$.domain') = 'apm'
  and json_extract_scalar(metadata, '$.name') = 'apm.service'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pre-filtering&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Before optimization: late filtering
.umodel
| extend name = json_extract_scalar(metadata, '$.name')
| where name = 'apm.service'
-- After optimization: pre-filtering
.umodel
| where json_extract_scalar(metadata, '$.name') = 'apm.service'
| extend name = json_extract_scalar(metadata, '$.name')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Graph Query Optimization&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Before optimization: full graph search
.umodel | graph-match (s)-[e]-(d) project s, e, d
-- After optimization: specify the start point
.umodel
| graph-match (s:"apm@entity_set" {__entity_id__: 'entity_set::apm::apm.service'})
              -[e]-(d)
  project s, e, d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.Application Scenarios of UModel Queries&lt;br&gt;
UModel queries can address a wide range of practical challenges and provide robust support for data modeling, schema management, and knowledge graph analysis.&lt;/p&gt;

&lt;p&gt;3.1 Schema Exploration and Discovery&lt;br&gt;
Scenario Description&lt;br&gt;
In large-scale observability systems, hundreds of EntitySet definitions may be distributed across multiple domains. Users need to quickly identify what entity types are defined in the system and understand their basic information.&lt;/p&gt;

&lt;p&gt;Application Examples&lt;br&gt;
Explore all entity types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- List all EntitySets with their basic information
.umodel
| where kind = 'entity_set'
| extend
    entity_name = json_extract_scalar(metadata, '$.name'),
    entity_domain = json_extract_scalar(metadata, '$.domain'),
    description = json_extract_scalar(metadata, '$.description.zh_cn')
| project entity_name, entity_domain, description
| sort entity_domain, entity_name
| limit 0, 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;View by domain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- View all entity definitions within a specific domain, such as APM
.umodel
| where kind = 'entity_set'
  and json_extract_scalar(metadata, '$.domain') = 'apm'
| extend
    entity_name = json_extract_scalar(metadata, '$.name'),
    description = json_extract_scalar(metadata, '$.short_description.zh_cn')
| project entity_name, description
| limit 0, 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.2 Data Modeling and Analysis&lt;br&gt;
Scenario Description&lt;br&gt;
During data modeling optimization, you need to analyze information about existing EntitySets, including field complexity, primary key design, and index configuration, to identify the models that require optimization.&lt;/p&gt;

&lt;p&gt;Application Examples&lt;br&gt;
Analyze field complexity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Analyze the distribution of field counts across EntitySets by domain
.umodel
| where kind = 'entity_set'
| extend
    domain = json_extract_scalar(metadata, '$.domain'),
    entity_name = json_extract_scalar(metadata, '$.name'),
    field_count = json_array_length(json_extract(spec, '$.fields'))
| stats
    avg_fields = avg(field_count),
    max_fields = max(field_count),
    min_fields = min(field_count),
    entity_count = count()
  by domain
| sort entity_count desc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Identify complex entities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Find EntitySets with the highest number of fields (potential candidates for optimization)
.umodel
| where kind = 'entity_set'
| extend
    entity_name = json_extract_scalar(metadata, '$.name'),
    domain = json_extract_scalar(metadata, '$.domain'),
    field_count = json_array_length(json_extract(spec, '$.fields'))
| sort field_count desc
| limit 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.3 Relationship Graph Analysis&lt;br&gt;
Scenario Description&lt;br&gt;
Mapping the relationships between EntitySets is fundamental to building a complete knowledge graph. Graph queries enable the analysis of associations among entities, helping to uncover dependencies and connections within the data model.&lt;/p&gt;

&lt;p&gt;Application Examples&lt;br&gt;
Query all relationships of an entity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Query all relationships of a specific EntitySet, such as apm.service
.umodel
| graph-match (s:"apm@entity_set" {__entity_id__: 'entity_set::apm::apm.service'})
              -[e]-(d)
  project s, e, d
| limit 0, 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Analyze relationship type distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Count the occurrences of each relationship type
.umodel
| where kind = 'entity_set_link'
| extend
    link_name = json_extract_scalar(metadata, '$.name'),
    link_type = json_extract_scalar(metadata, '$.link_type')
| stats link_count = count() by link_type
| sort link_count desc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Find specific relationships:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Find all relationship definitions of the runs_on type
.umodel
| where kind = 'entity_set_link'
  and json_extract_scalar(metadata, '$.link_type') = 'runs_on'
| extend
    link_name = json_extract_scalar(metadata, '$.name'),
    source = json_extract_scalar(metadata, '$.source'),
    target = json_extract_scalar(metadata, '$.target')
| project link_name, source, target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.4 Metadata Quality Check&lt;br&gt;
Scenario Description&lt;br&gt;
Ensure the integrity and consistency of UModel metadata by identifying issues such as missing descriptions and undefined fields.&lt;/p&gt;

&lt;p&gt;Application Examples&lt;br&gt;
Check EntitySets with missing descriptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Find EntitySets without descriptions in Chinese
.umodel
| where kind = 'entity_set'
  and (json_extract_scalar(metadata, '$.description.zh_cn') = ''
       or json_extract_scalar(metadata, '$.description.zh_cn') is null)
| extend
    entity_name = json_extract_scalar(metadata, '$.name'),
    domain = json_extract_scalar(metadata, '$.domain')
| project entity_name, domain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the integrity of field definitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Identify EntitySets with no fields defined
.umodel
| where kind = 'entity_set'
  and (json_extract(spec, '$.fields') is null
       or json_array_length(json_extract(spec, '$.fields')) = 0)
| extend
    entity_name = json_extract_scalar(metadata, '$.name'),
    domain = json_extract_scalar(metadata, '$.domain')
| project entity_name, domain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.5 Cross-domain Association Analysis&lt;br&gt;
Scenario Description&lt;br&gt;
In complex observability systems, entities from different domains, such as APM, Kubernetes, and cloud resources, may have cross-domain relationships. UModel queries can be used to analyze these cross-domain association patterns and understand how domains are interconnected.&lt;/p&gt;

&lt;p&gt;Application Examples&lt;br&gt;
Find cross-domain relationships:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Identify EntitySetLinks that connect different domains
.umodel
| where kind = 'entity_set_link'
| extend
    link_name = json_extract_scalar(metadata, '$.name'),
    source_domain = json_extract_scalar(spec, '$.src.domain'),
    target_domain = json_extract_scalar(spec, '$.dest.domain')
| where source_domain != target_domain
| project link_name, source_domain, target_domain
| limit 0, 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Analyze inter-domain connectivity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Count the number of relationships between domains
.umodel
| where kind = 'entity_set_link'
| extend
    source_domain = json_extract_scalar(spec, '$.src.domain'),
    target_domain = json_extract_scalar(spec, '$.dest.domain')
| stats count = count() by source_domain, target_domain
| sort count desc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.6 Version and Evolution Analysis&lt;br&gt;
Scenario Description&lt;br&gt;
UModel schemas evolve as business develops. You need to track schema versioning and historical changes.&lt;/p&gt;

&lt;p&gt;Application Examples&lt;br&gt;
View schema version information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- View the schema versions of all EntitySets
.umodel
| where kind = 'entity_set'
| extend
    entity_name = json_extract_scalar(metadata, '$.name'),
    schema_version = json_extract_scalar(schema, '$.version'),
    schema_url = json_extract_scalar(schema, '$.url')
| project entity_name, schema_version, schema_url
| limit 0, 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.7 Fast Locating and Retrieval&lt;br&gt;
Scenario Description&lt;br&gt;
Quickly locate specific EntitySets or relationship definitions within a large volume of metadata. Fuzzy match and term query are supported.&lt;/p&gt;

&lt;p&gt;Application Examples&lt;br&gt;
Fuzzy search by name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Search for EntitySets with "service" in the name
.umodel
| where kind = 'entity_set'
  and json_extract_scalar(metadata, '$.name') like '%service%'
| extend
    entity_name = json_extract_scalar(metadata, '$.name'),
    domain = json_extract_scalar(metadata, '$.domain')
| project entity_name, domain
| limit 0, 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exact search for a specific entity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Find the complete definition of a specific EntitySet exactly
.umodel
| where json_extract_scalar(metadata, '$.name') = 'apm.service'
| limit 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4.Summary&lt;br&gt;
UModel query, as a dedicated interface in EntityStore for querying knowledge graph metadata, provides robust support capabilities for observability data modeling. You can use UModel queries to implement the following features:&lt;/p&gt;

&lt;p&gt;1.Schema structure exploration: allows you to quickly understand all defined entity and relationship types within the system.&lt;br&gt;
2.Data model analysis: enables you to deeply examine field designs, primary key configurations, complexity, and other aspects of EntitySets.&lt;br&gt;
3.Relationship graph construction: allows you to use graph queries to analyze associations between entities and comprehend the topological structure of the knowledge graph.&lt;br&gt;
4.Quality check: allows you to verify the integrity and consistency of metadata.&lt;br&gt;
5.Cross-domain analysis: allows you to investigate association patterns across different domains.&lt;br&gt;
6.Fast retrieval: enables you to rapidly locate the target definitions within large volumes of metadata.&lt;/p&gt;

&lt;p&gt;These capabilities make UModel Query an indispensable tool for data modeling analysis, schema management, and knowledge graph exploration, providing a solid foundation for building and maintaining high-quality observability data models.&lt;/p&gt;

</description>
      <category>umodel</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building a Unified Entity Search Engine by Using UModel for Observability Scenarios</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Tue, 17 Mar 2026 02:54:43 +0000</pubDate>
      <link>https://dev.to/observabilityguy/building-a-unified-entity-search-engine-by-using-umodel-for-observability-scenarios-k86</link>
      <guid>https://dev.to/observabilityguy/building-a-unified-entity-search-engine-by-using-umodel-for-observability-scenarios-k86</guid>
      <description>&lt;p&gt;This article introduces a unified entity search engine built on UModel and USearch, enabling efficient cross-domain querying, full-text search, and re.&lt;br&gt;
1.Background Information&lt;br&gt;
In observability systems, UModel defines a unified data model (schema). While a UModel query focuses on exploring knowledge graph metadata, an entity query is designed to query and retrieve specific entity instance data. Powered by the USearch engine, entity queries provide powerful capabilities such as full-text search, exact lookup, and conditional filtering, and support cross-domain and cross-entity-type joint queries.&lt;/p&gt;

&lt;p&gt;Unlike UModel queries, which deal with schema definitions, entity queries focus on runtime entity data, enabling users to quickly locate, retrieve, and analyze specific entity instances, such as service instances, pod instances, and host instances.&lt;/p&gt;

&lt;p&gt;1.1 Issues Resolved by Entity Queries&lt;br&gt;
In actual observability scenarios, we often need the following features:&lt;/p&gt;

&lt;p&gt;1.Quick entity locating: allows you to quickly find relevant entities based on keywords or property values.&lt;br&gt;
2.Cross-domain retrieval: allows you to perform unified searches across multiple domains, such as application performance management (APM), Kubernetes, and cloud resources.&lt;br&gt;
3.Precise query: allows you to query detailed information based on known entity IDs in batches.&lt;br&gt;
4.Conditional filtering: allows you to perform complex conditional filtering based on entity properties.&lt;br&gt;
5.Statistical analysis: allows you to aggregate, analyze, and compute entity data.&lt;/p&gt;

&lt;p&gt;By providing a unified interface through the USearch engine, Entity queries address the pain points of traditional multi-system querying and deliver efficient and flexible entity retrieval capabilities.&lt;/p&gt;

&lt;p&gt;1.2 Differences Among Three Query Types&lt;br&gt;
Three different types of queries exist in EntityStore.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Floqna89vl5a4bqzv64f4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Floqna89vl5a4bqzv64f4.png" alt=" " width="789" height="222"&gt;&lt;/a&gt;&lt;br&gt;
Entity query focuses specifically on entity instance data and is the most frequently used query method in daily O&amp;amp;M and troubleshooting.&lt;/p&gt;

&lt;p&gt;2.Introduction to Entity Query&lt;br&gt;
2.1 Data Model&lt;br&gt;
Three-layer storage architecture&lt;br&gt;
USearch adopts a layered storage structure to ensure logical isolation and efficient querying of data:&lt;/p&gt;

&lt;p&gt;1.Workspace layer: the top-level isolation unit. Workspaces are fully isolated from each other.&lt;br&gt;
2.Domain layer: logical grouping at the business level, such as APM, Kubernetes, and Container Compute Service (ACS).&lt;br&gt;
3.EntityType: specific entity types that contain the actual entity data, such as apm.service and k8s.pod.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Workspace: my-observability
├── Domain: apm
│   ├── EntityType: apm.service
│   ├── EntityType: apm.host  
│   └── EntityType: apm.instance
├── Domain: k8s
│   ├── EntityType: k8s.pod
│   ├── EntityType: k8s.node
│   └── EntityType: k8s.service
└── Domain: acs
    ├── EntityType: acs.ecs.instance
    └── EntityType: acs.rds.instance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Characteristics of data storage&lt;br&gt;
● Uniqueness guarantee: &lt;strong&gt;entity_id&lt;/strong&gt; is unique within the same EntityType.&lt;/p&gt;

&lt;p&gt;● Column-oriented storage: supports tabular data with multiple rows and columns, and supports SPL-based statistical analysis.&lt;/p&gt;

&lt;p&gt;● Index optimization: full-text indexing optimized for retrieval performance. Multi-keyword retrieval and ranking scoring are supported.&lt;/p&gt;

&lt;p&gt;● Time series support: allows you to query and filter data based on a time range and trace the status of entities and relationships at any point in time.&lt;/p&gt;
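&lt;p&gt;The three layers and the per-EntityType uniqueness of entity_id can be sketched as a nested map. This is illustrative only, not the actual storage engine:&lt;/p&gt;

```python
from collections import defaultdict

# workspace -> domain -> entity_type -> {entity_id: properties}
store = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def upsert(workspace, domain, entity_type, entity_id, props):
    """entity_id is unique within one EntityType, so a second write overwrites."""
    store[workspace][domain][entity_type][entity_id] = props

upsert("my-observability", "apm", "apm.service", "svc-1", {"name": "cart"})
upsert("my-observability", "apm", "apm.service", "svc-1", {"name": "cart", "language": "java"})
# The same id under a different EntityType is a different entity
upsert("my-observability", "k8s", "k8s.pod", "svc-1", {"name": "cart-pod"})
```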

&lt;p&gt;2.2 Core Features of USearch&lt;br&gt;
Search capability&lt;br&gt;
USearch provides powerful full-text search capabilities and supports the following features:&lt;/p&gt;

&lt;p&gt;● Multi-type joint search: performs joint queries across multiple domains and entity types, with unified scoring and ranking.&lt;/p&gt;

&lt;p&gt;● Multi-keyword search scoring: calculates relevance scores based on information such as term weights and field weights.&lt;/p&gt;

&lt;p&gt;● Intelligent word segmentation: performs automatic word segmentation and relevance scoring to improve retrieval accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Search for all entities that contain "cart" in any domain
.entity with(domain='*', name='*', query='cart') 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scanning capabilities&lt;br&gt;
In addition to search mode, USearch supports scanning mode, which reads raw data and allows further filtering and computation through SPL. This is suitable for scenarios that require complex data processing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Scan the number of applications in the China (Hong Kong) region within the APM domain
.entity with(domain='apm', name='apm.service')
| where region_id = 'cn-hongkong'
| stats count = count() 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2.3 Query Syntax&lt;br&gt;
Basic syntax structure&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;.entity with(
    domain='domain_pattern', -- Domain filter pattern
    name='type_pattern', -- Type filter pattern
    query='search_query', -- Query condition
    topk=10, -- Number of results to return
    ids=['id1','id2','id3'] -- Exact ID query
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
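&lt;p&gt;To keep ad hoc queries consistent, the with() clause can be assembled programmatically. The following is a hypothetical Python helper; entity_query is not part of any SLS SDK, it is only a string builder for illustration:&lt;/p&gt;

```python
def entity_query(domain=None, name=None, query=None, topk=None, ids=None):
    """Assemble a `.entity with(...)` clause; purely a string helper, not an SDK call."""
    parts = []
    if domain is not None:
        parts.append(f"domain='{domain}'")
    if name is not None:
        parts.append(f"name='{name}'")
    if query is not None:
        parts.append(f"query='{query}'")
    if topk is not None:
        parts.append(f"topk={topk}")
    if ids:
        parts.append("ids=[" + ",".join(f"'{i}'" for i in ids) + "]")
    return ".entity with(" + ", ".join(parts) + ")"
```

&lt;p&gt;For example, entity_query(domain='apm', name='apm.service', topk=10) produces the clause .entity with(domain='apm', name='apm.service', topk=10).&lt;/p&gt;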



&lt;p&gt;Parameter description&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwl6gxmt6z7s394br8ppc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwl6gxmt6z7s394br8ppc.png" alt=" " width="789" height="320"&gt;&lt;/a&gt;&lt;br&gt;
fnmatch syntax: You can use wildcards in patterns. An asterisk (*) matches any sequence of characters, and a question mark (?) matches exactly one character. For more information, see the fnmatch documentation.&lt;/p&gt;
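&lt;p&gt;Python's standard-library fnmatch module implements the same wildcard rules, so you can preview locally which domains or types a pattern will select (the sample names are invented):&lt;/p&gt;

```python
from fnmatch import fnmatch

domains = ["apm", "acs", "k8s"]
types = ["apm.service", "k8s.pod", "acs.ecs.instance"]

# '*' matches any sequence of characters; '?' matches exactly one character
a_domains = [d for d in domains if fnmatch(d, "a*")]            # ['apm', 'acs']
instance_types = [t for t in types if fnmatch(t, "*instance")]  # ['acs.ecs.instance']
k8s_types = [t for t in types if fnmatch(t, "k8s.*")]           # ['k8s.pod']
```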

&lt;p&gt;Domain and type filter patterns&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Examples of matching patterns
.entity with(domain='ac*')           -- Domains starting with "ac"
.entity with(domain='a*c')           -- Domains starting with "a" and ending with "c"
.entity with(name='*instance')       -- Types ending with "instance"
.entity with(name='k8s.*')           -- All types in the k8s domain
.entity with(domain='*', name='*')   -- All domains and types
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.Description of Query Patterns&lt;br&gt;
3.1 Exact ID Query&lt;br&gt;
When entity IDs are known, you can use the ids parameter for precise queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Query entities with specific IDs
.entity with(
    domain='apm',
    name='apm.service',
    ids=['4567bd905a719d197df','973ad511dad2a3f70a']
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Applicable scenarios:&lt;/p&gt;

&lt;p&gt;● Query detailed information based on entity IDs from alerts&lt;/p&gt;

&lt;p&gt;● Verify the existence and status of specific entities&lt;/p&gt;

&lt;p&gt;● Batch query information about entities with known entity IDs&lt;/p&gt;

&lt;p&gt;3.2 Full-text Retrieval Mode&lt;br&gt;
Basic full-text search&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Simple keyword search
.entity with(query='web application')
-- Multi-term OR relationship (default behavior)
.entity with(query='kubernetes docker container')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Search features:&lt;/p&gt;

&lt;p&gt;● Multiple words are connected by an OR relationship, which means that the condition is satisfied if any one of the words appears.&lt;/p&gt;

&lt;p&gt;● All fields, including system fields and custom fields, are searched.&lt;/p&gt;

&lt;p&gt;● Automatic word segmentation and relevance scoring are performed.&lt;/p&gt;

&lt;p&gt;Phrase search&lt;br&gt;
Words connected by hyphens (-) must be matched exactly within the same field:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Exact phrase match
.entity with(query='opentelemetry.io/name-fraud-detection')
-- Regular search (matches any individual word)
.entity with(query='opentelemetry.io/name cart')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Field-specific search&lt;br&gt;
Search within specific fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Search in the description field
.entity with(query='description:"error handling service"')
-- Search in custom properties
.entity with(query='cluster_name:production')
-- Search in labels
.entity with(query='labels.team:backend')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Logical condition combinations&lt;br&gt;
The AND, OR, and NOT logical operators are supported:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- AND: Both conditions must be satisfied.
.entity with(query='service_name:web AND status:running')
-- OR: Any condition is met.
.entity with(query='environment:prod OR environment:staging')
-- NOT: The condition on the left side is met, and the condition on the right side is not met.
.entity with(query='type:service NOT status:stopped')
-- Complex combination
.entity with(query='(cluster:prod OR cluster:staging) AND NOT status:maintenance')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
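&lt;p&gt;The operator semantics can be illustrated with plain Python over in-memory entities. The sample data and the match helper below are invented; this is not the USearch parser, only a sketch of what AND, OR, and NOT select:&lt;/p&gt;

```python
# Invented in-memory entities; illustrates operator semantics only
entities = [
    {"type": "service", "status": "running", "environment": "prod"},
    {"type": "service", "status": "stopped", "environment": "staging"},
    {"type": "host", "status": "running", "environment": "prod"},
]

def match(entity, **conditions):
    """True when every field:value condition holds for the entity."""
    return all(entity.get(field) == value for field, value in conditions.items())

# type:service NOT status:stopped
not_stopped = [e for e in entities
               if match(e, type="service") and not match(e, status="stopped")]

# environment:prod OR environment:staging
prod_or_staging = [e for e in entities
                   if match(e, environment="prod") or match(e, environment="staging")]
```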



&lt;p&gt;Special character handling:&lt;/p&gt;

&lt;p&gt;● Queries that contain special characters, such as vertical bars (|) and colons (:), must be enclosed in double quotation marks (").&lt;/p&gt;

&lt;p&gt;● Example: query='description:"ratio is 1:2"'&lt;/p&gt;

&lt;p&gt;3.3 Multi-type Joint Search&lt;br&gt;
Joint queries across multiple domains and entity types are supported, with unified scoring and ranking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Search for all entities that contain "cart" in any domain
.entity with(domain='*', name='*', query='cart')
-- Search for entities whose types contain "service" and properties contain "production" in all domains
.entity with(domain='*', name='*service*', query='production')
-- Search for entities whose properties contain "error" or "rate" in specific domains
.entity with(domain='apm', name='apm.*', query='error rate')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.4 Data Analysis with SPL&lt;br&gt;
In both search mode and scan mode, you can combine SPL for more advanced data processing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Retrieve the number of applications in different languages within the APM domain in the China (Hong Kong) region, sorted in descending order by the number of applications.
.entity with(domain='apm', name='apm.service')
| where region_id = 'cn-hongkong'
| stats count = count() by language
| project language, count
| sort count desc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4.Scoring and Sorting Mechanism&lt;br&gt;
4.1 Relevance Scoring&lt;br&gt;
USearch uses a multi-factor scoring algorithm:&lt;/p&gt;

&lt;p&gt;● Term frequency weight: the frequency at which the keyword appears in the document.&lt;/p&gt;

&lt;p&gt;● Field weight: the importance weight of each field. For example, the name field has a higher weight than the description field.&lt;/p&gt;

&lt;p&gt;● Document length: matches in shorter documents typically receive higher scores.&lt;/p&gt;

&lt;p&gt;● Inverse document frequency: rare terms are assigned higher weights.&lt;/p&gt;

&lt;p&gt;4.2 Sorting Rules&lt;br&gt;
By default, results are sorted in descending order of relevance score. When scores are equal, sorting falls back to the timestamp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Default relevance-based sorting
.entity with(query='web service error', topk=20)
-- Custom sorting using SPL
.entity with(query='kubernetes pod')
| sort __last_observed_time__ desc
| limit 50
-- Multi-field sorting
.entity with(domain='apm', name='apm.service')
| sort cluster asc, service_name asc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
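&lt;p&gt;The scoring factors above (term frequency, field weight, document length, inverse document frequency) can be combined into a toy function to see how they interact. The weights and the formula below are invented for illustration; the production USearch formula is not specified here:&lt;/p&gt;

```python
import math

# Invented field weights: a hit in `name` counts more than one in `description`
FIELD_WEIGHTS = {"name": 3.0, "description": 1.0}

def score(doc, term, doc_freq, total_docs):
    """Toy relevance: term frequency x field weight x IDF, damped by field length."""
    idf = math.log((total_docs + 1) / (doc_freq + 1)) + 1  # rare terms weigh more
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = doc.get(field, "").lower().split()
        if not words:
            continue
        tf = words.count(term) / len(words)  # one hit in a short field scores higher
        total += tf * weight * idf
    return total

docs = [
    {"name": "cart service", "description": "checkout backend"},
    {"name": "payment", "description": "handles cart totals and billing"},
]
scores = [score(d, "cart", doc_freq=2, total_docs=10) for d in docs]
```

&lt;p&gt;With these weights, the document whose name field contains "cart" outranks the one that only mentions it in a longer description.&lt;/p&gt;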



&lt;p&gt;5.Application Scenarios of Entity Queries&lt;br&gt;
5.1 Scenario 1: Quick Entity Locating and Retrieval&lt;br&gt;
Problem description: When an online alert fires or you need to look up a particular entity, you must quickly locate the relevant entity instance.&lt;/p&gt;

&lt;p&gt;Solution: Select an appropriate query method based on the scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Method 1: Perform a precise query by entity ID from the alert.
.entity with(
    domain='apm',
    name='apm.service',
    ids=['4567bd905a719d197df','973ad511dad2a3f70a']
)
-- Method 2: Perform a full-text search based on the keyword.
.entity with(query='user-service error', topk=10)
-- Method 3: Perform a field-specific exact match.
.entity with(query='service_name:user-service')
-- Method 4: Find services owned by a specific team by label.
.entity with(
    domain='apm',
    name='apm.service',
    query='labels.team:backend AND labels.language:java AND status:running'
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Outcome: You can quickly retrieve complete information about the problematic entities, including their status, properties, and labels. Multiple query methods are supported to meet diverse needs across different scenarios.&lt;/p&gt;

&lt;p&gt;5.2 Scenario 2: Cross-domain Joint Search&lt;br&gt;
Problem description: You need to search for entities that contain a specific keyword across multiple domains (APM, Kubernetes, and cloud resources) without switching between systems.&lt;/p&gt;

&lt;p&gt;Solution: Use multi-type joint search to perform queries across domains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Search for entities that contain "error" across all domains
.entity with(domain='*', name='*', query='error', topk=50)
-- Search for multiple entity types under domains with a specific prefix
.entity with(domain='apm*', name='*', query='error', topk=50)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Outcome: A unified interface is used to retrieve cross-domain entities, which breaks down data silos and improves query efficiency.&lt;/p&gt;

&lt;p&gt;5.3 Scenario 3: Conditional Filtering and Data Analysis&lt;br&gt;
Problem description: You need to find the entities that meet specific conditions and perform statistical analysis to uncover patterns or gain data insights.&lt;/p&gt;

&lt;p&gt;Solution: Integrate SPL for conditional filtering and aggregate analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Find APM services in Java and collect statistics by cluster.
.entity with(domain='apm', name='apm.service')
| where language='java'
| stats count=count() by cluster
-- Query services that run in the production or staging environment.
.entity with(query='(environment:prod OR environment:staging) AND status:running')
| stats count=count() by environment, cluster
-- Retrieve the number of ARMS production applications in the APM domain across different regions, sorted in descending order by the number of applications.
.entity with(domain='apm', query='environment:prod')
| where telemetry_client='ARMS'
| stats service_count = count() by service, region_id
| project region_id, service, service_count
| sort service_count desc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Outcome: You can quickly identify problematic entities, perform data aggregation and analysis, and uncover data patterns.&lt;/p&gt;

&lt;p&gt;6.Performance Optimization Recommendations&lt;br&gt;
6.1 Use Exact Match&lt;br&gt;
Field-specific queries are more efficient than full-text searches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- ❌ Full-text search (slow)
.entity with(query='production')
-- ✅ Field-specific query (fast)
.entity with(query='environment:production')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;6.2 Avoid Prefix Wildcards&lt;br&gt;
Suffix wildcards perform better than prefix ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- ❌ Prefix wildcard (slow)
.entity with(name='*service')
-- ✅ Suffix wildcard (fast)
.entity with(name='service*')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;6.3 Use Logical Operators Wisely&lt;br&gt;
Simple AND conditions outperform complex OR conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- ✅ Simple AND condition
.entity with(query='status:running AND cluster:prod')
-- ⚠️ Complex OR conditions (poor performance)
.entity with(query='name:a OR name:b OR name:c OR name:d')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;6.4 Set Appropriate topk&lt;br&gt;
Set the topk value based on the actual requirements to avoid returning unnecessary data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Retrieve only the top 10 results
.entity with(query='error', topk=10)
-- Increase the value only when more results are needed
.entity with(query='error', topk=100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;7.Summary&lt;br&gt;
As the core interface in EntityStore for querying entity instances, Entity query provides powerful retrieval and analytical capabilities for observability scenarios. Entity query allows you to implement the following features:&lt;/p&gt;

&lt;p&gt;1.Quick entity locating: You can use keywords, IDs, or conditions to find entities in an efficient manner.&lt;br&gt;
2.Cross-domain retrieval: You can query entity data across multiple domains by using a unified interface.&lt;br&gt;
3.Exact query: Exact query methods such as field-specific filtering and logical combinations are supported.&lt;br&gt;
4.Data analysis: You can combine with SPL to perform complex data filtering and statistical analysis.&lt;/p&gt;

&lt;p&gt;These capabilities make entity query an indispensable tool for daily O&amp;amp;M, troubleshooting, and data analysis, and provide a solid foundation for the effective use of observability data.&lt;/p&gt;

</description>
      <category>engine</category>
      <category>umodel</category>
    </item>
    <item>
      <title>Breaking Through the Key Bottlenecks in Observability: Ultimate Integration of Entities and Relationships</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Tue, 17 Mar 2026 02:37:33 +0000</pubDate>
      <link>https://dev.to/observabilityguy/breaking-through-the-key-bottlenecks-in-observability-ultimate-integration-of-entities-and-26pp</link>
      <guid>https://dev.to/observabilityguy/breaking-through-the-key-bottlenecks-in-observability-ultimate-integration-of-entities-and-26pp</guid>
      <description>&lt;p&gt;This article introduces EntityStore's graph querying in UModel, shifting observability from isolated entities to relationship-aware topology analysis.&lt;br&gt;
1.Bridging Observable Data with Relationship Graphs&lt;br&gt;
1.1 From Isolated Entities to Connected Networks&lt;br&gt;
In today's cloud-native worldview, we are accustomed to treating each component, such as service, container, middleware, and infrastructure, within a system as a separate entity for monitoring and management. We configure dashboards, set alerts, and track performance metrics for these individual components. However, this "individual-centric" perspective has a fundamental blind spot: it ignores the most essential characteristic of any system — connections. No entity exists in isolation. Instead, they form a vast, complex, and dynamically evolving graph of relationships through interactions such as calls, dependencies, and inclusion.&lt;/p&gt;

&lt;p&gt;Traditional monitoring and querying tools, whether based on SQL or Search Processing Language (SPL), are fundamentally designed to process two-dimensional, tabular data. While they excel at answering questions about individuals ("What is the CPU usage of this pod?"), they struggle when faced with relational inquiries, such as "Which downstream services will be affected by the failure of this service?" and "Which intermediate services must be traversed to reach the core database?". Answering such questions typically requires complex JOIN operations, multi-step queries, or even manual reconstruction using offline architecture diagrams. This approach is not only inefficient but often impractical in systems with deep dependency chains and intricate topologies. We may possess comprehensive data about every individual "point," yet lack a clear map of the critical "lines" connecting them.&lt;/p&gt;

&lt;p&gt;1.2 Our Approach: Integrating Graph Querying&lt;br&gt;
To address this challenge, our solution centers on treating the graph as a core component of the observability data model. We believe that the true nature of a system is inherently graph-like. Therefore, its querying and analysis should also be conducted in a way that best reflects this essence—through graph queries.&lt;/p&gt;

&lt;p&gt;To realize this vision, we have built EntityStore at the core of the UModel architecture. It features an innovative dual-storage design and maintains two dedicated logstores: &lt;strong&gt;entity&lt;/strong&gt; and &lt;strong&gt;topo&lt;/strong&gt;. The former stores detailed properties of individual entities, and the latter stores the topological relationships among entities. Together, they form a real-time, queryable digital twin graph of the entire observability system.&lt;/p&gt;

&lt;p&gt;Based on this foundation, we provide three progressively powerful graph querying capabilities, designed to meet diverse user needs, from beginners to experts:&lt;/p&gt;

&lt;p&gt;● graph-match: designed for common path-finding scenarios, with intuitive syntax that allows users to express queries like sentences (example: A calls C through B) to quickly identify specific call chains.&lt;/p&gt;

&lt;p&gt;● graph-call: encapsulates the most frequently used graph algorithms (such as neighbor discovery and direct relationship query) into functional APIs. Users can focus on intent (example: "Find all neighbors of A within 3 hops") without needing to understand underlying implementation details.&lt;/p&gt;

&lt;p&gt;● Cypher: incorporates the industry-standard graph query language and delivers the most comprehensive and powerful graph querying capabilities. It supports arbitrarily complex pattern matching, multi-hop traversals, and aggregation analysis, which makes it the ultimate tool for resolving complex graph problems.&lt;/p&gt;

&lt;p&gt;This integrated solution aims to deliver powerful graph analytics capabilities in a low-barrier, engineering-friendly manner to every O&amp;amp;M and development engineer.&lt;/p&gt;

&lt;p&gt;1.3 Core Value: Unlocking New Dimensions of System Insights&lt;br&gt;
The introduction of graph querying capabilities not only adds a new query syntax, but also unlocks an entirely new dimension for understanding and analyzing systems.&lt;/p&gt;

&lt;p&gt;● Global fault impact analysis (analysis of the impact scope of problems): When a fault occurs, a single graph query can rapidly trace all potential downstream propagation paths and identify affected business components. This enables real-time decision-making and helps prioritize incident response and mitigation efforts.&lt;/p&gt;

&lt;p&gt;● End-to-end root cause tracing: In contrast to impact analysis, when a backend service exhibits exceptions, graph traversal can move upstream to locate the originating business request or recent change, which enables precise root cause identification.&lt;/p&gt;

&lt;p&gt;● Architecture health and compliance auditing: Graph queries allow validation of runtime architectures against intended designs. For example, you can query "unauthorized cross-domain service calls" or "whether a core data service is relied on by unauthorized applications", which enables continuous architectural governance.&lt;/p&gt;

&lt;p&gt;● Security and permission path analysis: In security audits, you can trace the complete access path from a user to a resource, verifying that each layer of authorization complies with security policies and mitigating risks of data leakage.&lt;/p&gt;

&lt;p&gt;In summary, graph querying elevates our perception of systems from a mere collection of points to a structured network of interconnected components. It empowers us to ask and answer questions grounded in the actual relationships within the system, unlocking unprecedented depth of insight and enabling efficient fault diagnostics, architectural governance, and security assurance in increasingly complex environments.&lt;/p&gt;

&lt;p&gt;2.Concepts Related to Graph Queries&lt;br&gt;
2.1 Key Concepts&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fih4h8h1uvqmzd0mubp56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fih4h8h1uvqmzd0mubp56.png" alt=" " width="789" height="464"&gt;&lt;/a&gt;&lt;br&gt;
Collaboration relationships:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;UModel (knowledge graph)
├── EntitySet: apm.service (type definition)
│   ├── Entity: user-service (instance 1)
│   ├── Entity: payment-service (instance 2)
│   └── Entity: order-service (instance 3)
├── EntitySet: k8s.pod (type definition)
│   ├── Entity: web-pod-123 (instance 1)
│   └── Entity: api-pod-456 (instance 2)
└── EntitySetLink: service_runs_on_pod (relationship definition)
    ├── Relation: user-service -&amp;gt; web-pod-123
    └── Relation: payment-service -&amp;gt; api-pod-456
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;EntityStore uses Simple Log Service (SLS) Logstore resources to implement features such as data writing and consumption. When an EntityStore is created, the following Logstore assets are automatically created:&lt;/p&gt;

&lt;p&gt;● ${workspace}__entity: for writing entity data.&lt;/p&gt;

&lt;p&gt;● ${workspace}__topo: for writing relationship data.&lt;/p&gt;

&lt;p&gt;The graph query methods described in this article focus specifically on querying the relationship data written to ${workspace}__topo. They support capabilities such as multi-hop path analysis, entity adjacency analysis, and custom topology pattern recognition.&lt;/p&gt;
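&lt;p&gt;As a minimal sketch (the label and entity ID below are placeholders), a relationship query addresses the topo logstore through the .topo prefix and can then be piped into regular SPL operators:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// List the direct neighbors of one entity stored in ${workspace}__topo.
// Replace the label and __entity_id__ with values from your workspace.
.topo |
  graph-match (s:"apm@apm.service" {__entity_id__: 'YOUR_ENTITY_ID'})-[e]-(d)
  project s, e, d
| limit 10
&lt;/code&gt;&lt;/pre&gt;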

&lt;p&gt;Note: The graph query methods introduced in this article are based on the low-level querying layer of the Cloud Monitor 2.0 high-level PaaS API. They are intended for advanced users who require highly customized and flexible query patterns. If you need only simple association lookup, information query, and other capabilities, we recommend that you use the high-level PaaS API, which is more user-friendly.&lt;/p&gt;

&lt;p&gt;2.2 Overview&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ud5lkopmxplz6mlf50f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ud5lkopmxplz6mlf50f.png" alt=" " width="800" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.Graph Query Fundamentals&lt;br&gt;
Before you dive into the use of graph queries, it is essential to understand the foundational concepts. The core idea behind graph querying is to model data as a graph structure: Entities are represented as nodes, and relationships are represented as edges. Each node has a label and properties. The label identifies the type of the node, and properties store detailed information about the node. Similarly, each edge has a type and properties. The type indicates the category of the relationship, and properties store additional information about the relationship.&lt;/p&gt;

&lt;p&gt;3.1 Syntax for Describing Nodes and Edges&lt;br&gt;
In a graph query, a specific syntax is used to describe nodes and edges:&lt;/p&gt;

&lt;p&gt;● Node: represented by using parentheses (()).&lt;br&gt;
● Edge: represented by using square brackets ([]).&lt;br&gt;
● Format: (variable:label {property key-value pairs}) for nodes and [variable:type {property key-value pairs}] for edges&lt;/p&gt;

&lt;p&gt;Basic syntax examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Any node
()

// A node with a specific label
(:"apm@apm.service")           // The graph-match syntax
(:`apm@apm.service`)           // The Cypher syntax

// A node with a label and properties
(:"apm@apm.service" { __entity_type__: 'apm.service' })

// A named variable node
(s:"apm@apm.service" { __entity_id__: '123456' })

// Any edge
[]

// A named edge
[edge]

// The edge with a specific type
[e:calls { __type__: "calls" }]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Syntax differences:&lt;/p&gt;

&lt;p&gt;● graph-match: In the SPL context, special characters must be enclosed in double quotes (").&lt;/p&gt;

&lt;p&gt;● Cypher: As a standalone syntax, labels are wrapped in backticks (`).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// graph-match syntax
.topo | graph-match (s:"apm@apm.service" {__entity_id__: '123'})-[e]-(d)
        project s, e, d

// Cypher syntax (labels with special characters are wrapped in doubled backticks to escape them: ``apm@apm.service``)
.topo | graph-call cypher(`
    MATCH (s:``apm@apm.service`` {__entity_id__: '35af918180394ff853be6c9b458704ea'})-[e]-(d)
    RETURN s, e, d
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;3.2 Path Syntax and Direction&lt;br&gt;
In graph queries, ASCII characters are used to represent the direction of relationships:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7dldgw74lvb9idwwp7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7dldgw74lvb9idwwp7v.png" alt=" " width="789" height="150"&gt;&lt;/a&gt;&lt;br&gt;
3.3 Return Value Structure&lt;br&gt;
In the EntityStore model, node labels follow the domain@entity_type format. For example, apm@apm.service represents a node in the apm domain with the entity type apm.service. This labeling convention not only clearly indicates the domain and type of the node but also enables fast filtering and querying by domain. Node properties include built-in system properties (such as __entity_id__, __domain__, and __entity_type__) and custom properties (such as servicename and instanceid). The type of an edge is likewise represented by a string, such as calls, runs_on, or contains. Each edge also carries its own property information.&lt;/p&gt;

&lt;p&gt;3.3.1 Node in JSON Format&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "id": "apm@apm.service:347150ad7eaee43d2bd25d113f567569",
  "label": "apm@apm.service", 
  "properties": {
    "__domain__": "apm",
    "__entity_type__": "apm.service",
    "__entity_id__": "347150ad7eaee43d2bd25d113f567569",
    "__label__": "apm@apm.service"
  }
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;3.3.2 Edge in JSON Format&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "startNodeId": "apm@apm.service:347150ad7eaee43d2bd25d113f567569",
  "endNodeId": "apm@apm.service.host:34f627359470c9d36da593708e9f2db7",
  "type": "contains",
  "properties": {
    "__type__": "contains"
  }
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The essence of graph querying lies in pattern matching: Users describe a graph pattern, and the system searches a graph for all subgraphs that match this pattern. A graph pattern can be expressed by using a path expression. The most basic form is (Node)-[Edge]-&amp;gt;(Node), which represents traversing from a source node to a destination node through an edge. Path expressions can be extended into more complex patterns. For example, (A)-[e1]-&amp;gt;(B)-[e2]-&amp;gt;(C) represents a two-hop path from A to C through B, whereas (A)-[*1..3]-&amp;gt;(B) indicates a variable-length path from A to B (because the hop range is left-closed and right-open, this matches paths of 1 or 2 hops). This approach is intuitive and powerful, capable of describing relationships ranging from simple one-to-one connections to complex, multi-layered network paths.&lt;/p&gt;
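&lt;p&gt;The pattern forms described above can be written out as follows (a syntax sketch; labels and properties are omitted for brevity):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// One hop: traverse from a source node through an edge to a destination node.
(a)-[e]-&amp;gt;(b)

// Two hops: from a to c through b.
(a)-[e1]-&amp;gt;(b)-[e2]-&amp;gt;(c)

// Variable-length path from a to b. As noted in Section 6.2.1,
// the hop range is left-closed and right-open.
(a)-[*1..3]-&amp;gt;(b)
&lt;/code&gt;&lt;/pre&gt;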

&lt;p&gt;4.graph-match: Intuitive Path Querying&lt;br&gt;
graph-match is the most intuitive and user-friendly feature in graph querying. Its design philosophy is to allow users to express their query intent in a way that closely resembles natural language, and then the system automatically executes the query and returns the results. The syntax of graph-match is relatively simple. It consists of two parts: path description and result projection.&lt;/p&gt;

&lt;p&gt;A core characteristic of graph-match is that queries must start from a known starting point. This starting point requires both the label and the __entity_id__ property to be specified, which ensures that the system can quickly locate the exact entity. From a technical implementation perspective, this is a deliberate design choice: graph traversal is typically an operation with exponential complexity. Allowing queries to start from arbitrary patterns could lead to full graph scans, making performance unpredictable and unguaranteed. By requiring a specified starting point, the system can perform directed traversal from that point, effectively constraining the search space within a manageable scope.&lt;/p&gt;

&lt;p&gt;The path description syntax follows an intuitive directional expression: (A)-[e]-&amp;gt;(B) represents a directed edge from A to B. (A)&amp;lt;-[e]-(B) represents a directed edge from B to A. (A)-[e]-(B) represents a bidirectional edge (no direction enforced). You can assign variables to nodes and edges within the path. These variables can then be referenced in subsequent project statements. Paths can connect multiple nodes and edges to form a multi-hop traversal, such as (start)-[e1]-&amp;gt;(mid)-[e2]-&amp;gt;(end).&lt;/p&gt;

&lt;p&gt;A project statement is used to specify the content to be returned. The system can directly return the JSON objects of nodes or edges, or use dot notation to extract specific properties, such as node.__entity_type__ and edge.__type__. Project statements also support field renaming, which allows returned fields to have more user-friendly names. This flexible output mechanism enables graph-match to meet both the needs of rapid exploration (by returning complete objects) and data analysis (by extracting specific fields).&lt;/p&gt;

&lt;p&gt;4.1 Practical Application Examples&lt;br&gt;
4.1.1 End-to-end Path Querying&lt;br&gt;
Query the complete call chain starting from a specific operation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.topo |
  graph-match (s:"apm@apm.operation" {__entity_id__: '925f76b2a7943e910187fd5961125288'})
              &amp;lt;-[e1]-(v1)-[e2:calls]-&amp;gt;(v2)-[e3]-&amp;gt;(v3)
  project s, 
          "e1.__type__", 
          "v1.__label__", 
          "e2.__type__", 
          "v2.__label__", 
          "e3.__type__", 
          "v3.__label__", 
          v3&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Return results:&lt;br&gt;
● s: the start operation node&lt;br&gt;
● e1.__type__: the type of the first relationship&lt;br&gt;
● v1.__label__: the label of the intermediate node&lt;br&gt;
● v2, v3: the information about subsequent nodes&lt;/p&gt;

&lt;p&gt;4.1.2 Neighbor Node Statistics&lt;br&gt;
Count the distribution of neighbors for a specific service:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.topo |
  graph-match (s:"apm@apm.service" {__entity_id__: '0e73700c768a8e662165a8d4d46cd286'})
              -[e]-(d)   
  project eType="e.__type__", dLabel="d.__label__"
| stats cnt=count(1) by dLabel, eType
| sort cnt desc
| limit 20&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;4.1.3 Conditional Path Querying&lt;br&gt;
Find path destinations that meet specific criteria:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.topo |
  graph-match (s:"apm@apm.service.operation" {__entity_id__: '6f0bb4c892effff81538df574a5cfcd9'})
              &amp;lt;-[e1]-(v1)-[e2:runs_on]-&amp;gt;(v2)-[e3]-&amp;gt;(v3)
  project s, 
          "e1.__type__", 
          "v1.__label__", 
          "e2.__type__", 
          "v2.__label__", 
          "e3.__type__", 
          destId="v3.__entity_id__", 
          v3 
| where destId='9a3ad23aa0826d643c7b2ab7c6897591'
| project s, v3&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;4.1.4 Pod-to-node Relationship Chain&lt;br&gt;
Trace the full deployment hierarchy of a pod:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.topo |
  graph-match (pod:"k8s@k8s.pod" {__entity_id__: '347150ad7eaee43d2bd25d113f567569'})
              &amp;lt;-[r1:contains]-(node:"k8s@k8s.node")
              &amp;lt;-[r2:contains]-(cluster:"k8s@k8s.cluster")
  project 
    pod,
    node, 
    cluster,
    "r1.__type__",
    "r2.__type__"&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;4.1.5 Limitations of graph-match&lt;br&gt;
Despite its intuitive and user-friendly syntax, graph-match has several limitations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8rfmvu0cam4m4gaztn3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8rfmvu0cam4m4gaztn3.png" alt=" " width="789" height="283"&gt;&lt;/a&gt;&lt;br&gt;
5.graph-call: Functional Graph Operations&lt;br&gt;
graph-call provides a set of functional interfaces for graph querying. These functions encapsulate common graph operation patterns, enabling users to perform specific types of queries more efficiently. The design philosophy of graph-call is to provide declarative function APIs: Users need only specify their intent and parameters, while the system handles and optimizes the underlying traversal algorithms.&lt;/p&gt;

&lt;p&gt;getNeighborNodes is the most commonly used graph-call function. It is used to obtain the neighbor nodes of a specified node. The signature of the function is getNeighborNodes(type, depth, nodeList), where the type parameter controls the type of traversal, the depth parameter controls the depth of the traversal, and the nodeList parameter specifies the starting node list. Valid values of the type parameter: sequence (directed traversal, preserving edge direction), sequence_in (returns only paths leading into the starting node), sequence_out (returns only paths originating from the starting node), and full (undirected traversal that ignores edge direction). This classification allows users to select the most appropriate traversal policy based on their business requirements.&lt;/p&gt;

&lt;p&gt;The depth parameter specifies the number of hops for traversal. In practice, we recommend that you do not set this value too large. A depth of 3 to 5 levels is typically sufficient to cover most scenarios. Excessively deep traversals can lead to performance degradation and return overly broad results that may lack practical significance due to excessive indirect associations. The nodeList parameter accepts an array of node descriptors. Each descriptor follows the same syntax as in graph-match, requiring both a label and __entity_id__. getNeighborNodes performs traversal separately for each starting node and then merges the results before returning.&lt;/p&gt;
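&lt;p&gt;Putting the parameters together, the call takes the following shape (a sketch; the entity ID is a placeholder, and 'full' can be replaced with 'sequence', 'sequence_in', or 'sequence_out'):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// getNeighborNodes(type, depth, nodeList)
// type:  'sequence' | 'sequence_in' | 'sequence_out' | 'full'
// depth: number of hops; 3 to 5 is usually sufficient
.topo | graph-call getNeighborNodes(
  'full', 3,
  [(:"apm@apm.service" {__entity_id__: 'YOUR_ENTITY_ID'})]
)
&lt;/code&gt;&lt;/pre&gt;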

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid17n6zjbzuzblh3clyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid17n6zjbzuzblh3clyy.png" alt=" " width="800" height="583"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The returned result of getNeighborNodes contains four fields: srcNode (JSON object representing the source node), destNode (JSON object representing the destination node), relationType (relationship type), and srcPosition (source node position in the path, with -1 indicating a direct neighbor). The srcPosition field is particularly useful. It allows users to distinguish between direct and indirect relationships. During statistical analysis, results can be grouped by position to understand the distribution of relationships across different levels of the graph.&lt;/p&gt;

&lt;p&gt;The getDirectRelations function is used to batch query direct relationships between nodes. Unlike getNeighborNodes, getDirectRelations returns only direct connections and does not perform multi-hop traversals. This function is especially useful for batch checking relationships among multiple known nodes, such as checking whether a set of services has call relationships and verifying dependencies among a group of resources. The function takes a list of nodes as input and returns an array of relationships, with each relationship containing complete information about the nodes and edges.&lt;/p&gt;

&lt;p&gt;5.1 Practical Application Examples&lt;br&gt;
5.1.1 Obtain the Complete Neighbor Relationships of a Service&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Obtain all neighbors of a service within 2 hops.
.topo | graph-call getNeighborNodes(
  'full', 2,
  [(:"apm@apm.service" {__entity_id__: '0e73700c768a8e662165a8d4d46cd286'})]
)
| stats cnt=count(1) by relationType
| sort cnt desc&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;5.1.2 Upstream Impact Analysis for Failures&lt;br&gt;
Identify upstream services that may affect the destination service:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.topo | graph-call getNeighborNodes(
  'sequence_in', 3,
  [(:"apm@apm.service" {__entity_id__: '0e73700c768a8e662165a8d4d46cd286'})]
)
| where relationType in ('calls', 'depends_on')
| extend impact_level = CASE
    WHEN srcPosition = '-1' THEN 'direct'
    WHEN srcPosition = '-2' THEN 'secondary'
    ELSE 'indirect' END
| extend parsed_service_id = json_extract_scalar(srcNode, '$.id')
| project 
    upstream_service = parsed_service_id,
    impact_level,
    relation_type = relationType
| stats cnt=count(1) by impact_level, relation_type&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;5.1.3 Downstream Impact Analysis for Failures&lt;br&gt;
Identify downstream services affected by a failing service:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.topo | graph-call getNeighborNodes(
  'sequence_out', 3,
  [(:"apm@apm.service" {__entity_id__: 'failing-service-id'})]
)
| where relationType in ('calls', 'depends_on')
| extend affected_service = json_extract_scalar(destNode, '$.id')
| stats impact_count=count(1) by affected_service
| sort impact_count desc
| limit 20&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;5.1.4 Analysis of Cloud Resource Dependencies&lt;br&gt;
Analyze the network dependencies of an ECS instance:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.topo | graph-call getNeighborNodes(
  'sequence_out', 2,
  [(:"acs@acs.ecs.instance" {__entity_id__: 'i-bp1234567890'})]
)
| extend relation_category = CASE
    WHEN relationType in ('belongs_to', 'runs_in') THEN 'infrastructure'
    WHEN relationType in ('depends_on', 'uses') THEN 'dependency'
    WHEN relationType in ('connects_to', 'accesses') THEN 'network'
    ELSE 'other' END
| stats cnt=count(1) by relation_category
| sort cnt desc
| limit 0, 100&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;5.1.5 Query Direct Relationships between Nodes in Batches&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.topo | graph-call getDirectRelations(
  [
    (:"app@app.service" {__entity_id__: '347150ad7eaee43d2bd25d113f567569'}),
    (:"app@app.operation" {__entity_id__: '73ef19770998ff5d4c1bfd042bc00a0f'})
  ]
)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Example returned relationships:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "startNodeId": "app@app.service:347150ad7eaee43d2bd25d113f567569",
  "endNodeId": "app@app.operation:73ef19770998ff5d4c1bfd042bc00a0f", 
  "type": "contains",
  "properties": {"__type__": "contains"}
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The functional design of graph-call offers the advantage of clear query intent, enabling the system to optimize execution for specific patterns. However, this also means that it is only suitable for predefined query scenarios. For scenarios where you need to customize complex path patterns, Cypher remains the necessary choice. In practice, we recommend that you preferentially use the predefined functions of graph-call, and resort to the more flexible Cypher only when these predefined functions cannot meet your requirements.&lt;/p&gt;

&lt;p&gt;6.Cypher: A Powerful Declarative Query Language&lt;br&gt;
Cypher is the standard query language in the graph database domain. It is designed to combine the usability and declarative style of SQL with optimizations specifically tailored for graph structures. In EntityStore, Cypher provides the most powerful and flexible graph query capabilities, capable of handling a wide range of scenarios, from simple single-node queries to complex multi-hop traversals across large-scale networks.&lt;/p&gt;

&lt;p&gt;The syntax of Cypher follows a three-part structure: MATCH, WHERE, and RETURN. This structure is similar to the SELECT, FROM, and WHERE clauses of SQL but logically better aligned with the thinking pattern of graph queries. The MATCH clause describes the graph pattern to search for. The WHERE clause adds filtering conditions. The RETURN clause specifies what results to return. This structured syntax makes complex graph queries easy to read and maintain.&lt;/p&gt;

&lt;p&gt;The power of the MATCH clause lies in the graph pattern description it supports. You can define arbitrarily complex path patterns within MATCH, including multi-hop paths, optional paths, and path variables. The syntax for multi-hop paths is [*min..max], where the range is left-closed and right-open. For example, [*2..3] matches only 2-hop paths. This syntax design allows users to flexibly control traversal depth, striking a balance between precision and performance. The MATCH clause also supports combining multiple path patterns: You can define several patterns simultaneously, and the system will return all subgraphs that match any of the specified patterns.&lt;/p&gt;

&lt;p&gt;The WHERE clause supports rich filtering conditions. You can apply various predicates on node and edge properties, including equality, substring matching, prefix/suffix checks, and range queries. The WHERE clause also supports logical operators (AND, OR, and NOT) and complex expressions. Compared with graph-match, the WHERE clause of Cypher is more flexible. It not only filters final results but also allows constraints on intermediate nodes along a path. This is especially useful for queries with complex path patterns.&lt;/p&gt;
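&lt;p&gt;The predicate forms listed above can be combined in a single sketch (a hypothetical query for illustration; the property values are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Illustrative sketch: equality, prefix, substring, and suffix predicates with logical operators.
.topo | graph-call cypher(`
    MATCH (s:``apm@apm.service``)-[e:calls]-&amp;gt;(d)
    WHERE s.__domain__ = 'apm'
        AND d.__entity_type__ STARTS WITH 'apm.'
        AND (d.__label__ CONTAINS 'service' OR NOT d.__entity_id__ ENDS WITH '0')
    RETURN s.__entity_id__, d.__entity_id__, e.__type__
`)&lt;/code&gt;&lt;/pre&gt;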

&lt;p&gt;The RETURN clause provides fine-grained control over output. The system can return node objects, edge objects, and path objects, and extract specific property fields. The RETURN clause also supports aggregate functions (such as count, sum, and avg) and grouping operations, which enables Cypher to perform not only graph traversal but also graph analytics. Combined with the powerful data processing capabilities of SPL, the integration of Cypher and SPL enables a complete end-to-end workflow, from data querying to analytical computation.&lt;/p&gt;

&lt;p&gt;6.1 Basic query examples&lt;br&gt;
6.1.1 Single-node query&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Query all nodes of a specific type.
.topo | graph-call cypher(`
    MATCH (n {__entity_type__:"apm.service"})
    WHERE n.__domain__ STARTS WITH 'a' AND n.__entity_type__ = "apm.service"
    RETURN n
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Advantages over graph-match:&lt;br&gt;
● Complex filtering using the WHERE clause is supported.&lt;br&gt;
● MATCH can contain only nodes, without specifying relationships.&lt;br&gt;
● More property queries (such as __entity_type__ and __domain__) are supported.&lt;/p&gt;

&lt;p&gt;6.1.2 Relationship Query&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Query call relationships between services.
.topo | graph-call cypher(`
    MATCH (src:``apm@apm.service``)-[e:calls]-&amp;gt;(dest:``apm@apm.service``)
    WHERE src.cluster = 'production' AND dest.cluster = 'production'
    RETURN src.service, dest.service, e.__type__
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;6.2 Multi-hop Queries&lt;br&gt;
6.2.1 Basic Multi-hop Syntax&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Find call chains of 2 to 3 hops.
.topo | graph-call cypher(`
    MATCH (src {__entity_type__:"acs.service"})-[e:calls*2..4]-&amp;gt;(dest)
    WHERE dest.__domain__ = 'acs'
    RETURN src, dest, dest.__entity_type__
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note:&lt;br&gt;
● The multi-hop range is left-closed and right-open. For example, *2..4 indicates 2 hops or 3 hops.&lt;br&gt;
● *1..3 indicates 1 hop or 2 hops, but not 3 hops.&lt;/p&gt;

&lt;p&gt;6.2.2 Reachability Analysis&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Find reachable paths between services.
.topo | graph-call cypher(`
    MATCH (startNode:``apm@apm.service`` {service: 'gateway'})
          -[path:calls*1..4]-&amp;gt;
          (endNode:``apm@apm.service`` {service: 'database'})
    RETURN startNode.service, length(path) as hop_count, endNode.service
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;6.2.3 Impact Chain Analysis&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Analyze fault propagation paths.
.topo | graph-call cypher(`
    MATCH (failed:``apm@apm.test_service`` {status: 'error'})
          -[impact:depends_on*1..3]-&amp;gt;
          (affected)
    WHERE affected.__entity_type__ = 'apm.service'
    RETURN failed.service, 
           length(impact) as impact_distance,
           affected.service
    ORDER BY impact_distance ASC
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;6.2.4 Node Aggregation Statistics&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Count multi-hop call connections per source service.
.topo | graph-call cypher(`
    MATCH (src {__entity_type__:"apm.service"})-[e:calls*2..3]-&amp;gt;(dest)
    WHERE dest.__domain__ = 'apm'
    RETURN src, count(src) as connection_count
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Applicable scenarios:&lt;br&gt;
● Connected component analysis: Identify connected subgraphs in a graph&lt;br&gt;
● Centrality calculation: Identify key nodes in the network&lt;br&gt;
● Cluster detection: Detect clusters of tightly interconnected nodes&lt;/p&gt;

&lt;p&gt;6.2.5 Path Pattern Matching&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Find specific topological patterns.
.topo | graph-call cypher(`
    MATCH (src:``acs@acs.vpc.vswitch``)-[e1]-&amp;gt;(n1)&amp;lt;-[e2]-(n2)-[e3]-&amp;gt;(n3)
    WHERE NOT (src = n2 AND e1.__type__ = e2.__type__) 
        AND n1.__entity_type__ &amp;lt;&amp;gt; n3.__entity_type__ 
        AND NOT (src)&amp;lt;-[e1:``calls``]-(n1)
    RETURN src, e1.__type__, n1, e2.__type__, n2, e3.__type__, n3
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Applicable scenarios:&lt;br&gt;
● Security auditing: Detect abnormal network connection patterns&lt;br&gt;
● Compliance check: Verify the compliance of the network architecture&lt;br&gt;
● Pattern recognition: Identify specific system topology structures&lt;/p&gt;

&lt;p&gt;A key feature of Cypher is its support for queries based on custom entity properties. In graph-match, intermediate nodes can be filtered only by labels. In contrast, with Cypher, users can query and filter nodes based on any custom property of an entity. This feature allows Cypher to handle more fine-grained query requirements, such as finding all instances with CPU usage greater than 80%, or finding all resources belonging to a particular user.&lt;/p&gt;

&lt;p&gt;6.3 Custom Property Query Examples&lt;br&gt;
Querying based on custom entity properties is a core highlight of the full-featured Cypher. In standard queries, although entity details can be retrieved by using USearch, filtering by entity property during graph traversal is limited. The full-featured Cypher enables property-level querying and allows you to directly reference custom properties of entities within MATCH and WHERE clauses. The system automatically fetches detailed entity information from EntityStore and applies filters based on these properties. This design allows graph queries to go beyond mere traversal based on topological structure, but also enables intelligent filtering based on the actual properties of entities, greatly improving query accuracy.&lt;/p&gt;

&lt;p&gt;Multi-level path output is another key feature. In traditional graph queries, multi-hop queries usually return only the start and end points, and the intermediate path information may be lost. However, in scenarios such as troubleshooting and impact analysis, knowing the complete path is often more valuable than knowing just the start and end points. The full-featured Cypher supports returning path objects, which contain detailed information about all nodes and edges along the path. You can examine the complete link of data flows based on path objects. This capability is especially useful for analyzing fault propagation paths, tracing data flows, and understanding system architecture.&lt;/p&gt;

&lt;p&gt;6.3.1 Querying Based on Custom Entity Properties&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Use the custom properties of an entity to query data. (This is an example. The actual key-value properties are subject to the actual scenario.)
.topo | graph-call cypher(`
    MATCH (n:``acs@acs.alb.listener`` {listener_id: 'lsn-rxp57*****'})-[e]-&amp;gt;(d)
    WHERE d.vSwitchId CONTAINS 'vsw-bp1gvyids******' 
        AND d.user_id IN ['1654*******', '2'] 
        AND d.dns_name ENDS WITH '.com'
    RETURN n, e, d
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;6.3.2 Querying Based on Complex Property Conditions&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Use the complex properties of an entity to query data. (This is an example. The actual key-value properties are subject to the actual scenario.)
.topo | graph-call cypher(`
    MATCH (instance:``acs@acs.ecs.instance``)
    WHERE instance.instance_type STARTS WITH 'ecs.c6'
        AND instance.cpu_cores &amp;gt;= 4
        AND instance.memory_gb &amp;gt;= 8
        AND instance.status = 'Running'
    RETURN 
        instance.instance_id,
        instance.instance_type,
        instance.cpu_cores,
        instance.memory_gb,
        instance.availability_zone
    ORDER BY instance.cpu_cores DESC, instance.memory_gb DESC
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;6.4 Multi-level Path Output&lt;br&gt;
6.4.1 Return Complete Path Information&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Return the complete path information across multiple hops
.topo | graph-call cypher(`
    MATCH (n:``acs@acs.alb.listener``)-[e:``calls``*2..3]-()
    RETURN e
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Path result format:&lt;br&gt;
● An array of all edges in the path is returned.&lt;br&gt;
● Each edge contains the complete start and end nodes and property information.&lt;br&gt;
● Path length and path weight calculation are supported.&lt;/p&gt;
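&lt;p&gt;To make the path format concrete, the following is a minimal Python sketch (illustrative only; the field names startNodeId, endNodeId, and weight are assumptions modeled on the examples in this article) that treats a returned path as an array of edges and derives its length and total weight:&lt;/p&gt;

```python
# Illustrative model of a path result: an array of edges, where each edge
# carries its start node, end node, and optional properties.
# Field names are assumptions for illustration, not the actual wire format.

def path_length(edges):
    """A path's length is its number of edges (hops)."""
    return len(edges)

def path_weight(edges, default=1.0):
    """Sum per-edge weights; edges without a weight count as `default`."""
    return sum(e.get("weight", default) for e in edges)

path = [
    {"startNodeId": "alb:lsn-1", "endNodeId": "ecs:i-2", "type": "calls", "weight": 2.0},
    {"startNodeId": "ecs:i-2", "endNodeId": "rds:db-3", "type": "calls"},
]

print(path_length(path))  # 2
print(path_weight(path))  # 3.0
```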

&lt;p&gt;6.5 Fine-grained Link Control for Connectivity Search&lt;br&gt;
6.5.1 Connection Analysis Across Network Layers&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Find connection paths from ECS instances to load balancers
.topo | graph-call cypher(`
    MATCH (start_node:``acs@acs.ecs.instance``)
          -[e*2..3]-
          (mid_node {listener_name: 'entity-test-listener-zuozhi'})
          -[e2*1..2]-
          (end_node:``acs@acs.alb.loadbalancer``)
    WHERE start_node.__entity_id__ &amp;lt;&amp;gt; mid_node.__entity_id__ 
        AND start_node.__entity_type__ &amp;lt;&amp;gt; mid_node.__entity_type__
    RETURN 
        start_node.instance_name, 
        e, 
        mid_node.__entity_type__, 
        e2, 
        end_node.instance_name
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;6.5.2 Service Mesh Connection Analysis&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;pre&amp;gt;&amp;lt;code&amp;gt;-- Analyze traffic paths in a microservices mesh
.topo | graph-call cypher(`
    MATCH (client:``apm@apm.service)
          -[request:calls]-&amp;gt;
          (gateway:``apm@apm.gateway)
          -[route:routes_to]-&amp;gt;
          (service:``apm@apm.service)
          -[backend:calls]-&amp;gt;
          (database:``middleware@database)
    WHERE client.environment='production'
        AND request.protocol='HTTP'
        AND route.load_balancer_type='round_robin'
    RETURN 
        client.service,
        gateway.gateway_name,
        service.service,
        database.database_name,
        request.request_count,
        backend.connection_pool_size
`)&amp;lt;/code&amp;gt;&amp;lt;/pre&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;6.5.3 Cascading Failure Analysis&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Analyze the cascading impact of service failures
.topo | graph-call cypher(`
    MATCH (failed_service:``apm@apm.service`` {service: 'load-generator'})
    MATCH (failed_service)-[cascade_path*1..4]-&amp;gt;(affected_service:``apm@apm.service``)
    RETURN 
        failed_service.service as root_cause,
        length(cascade_path) as impact_depth,
        affected_service.service as affected_service,
        cascade_path as dependency_chain
    ORDER BY impact_depth ASC
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;7.Typical Application Scenarios&lt;br&gt;
Graph queries are widely used in actual O&amp;amp;M and analysis scenarios. The following sections present several typical application patterns to help you better understand how to apply graph query capabilities in practical scenarios.&lt;/p&gt;

&lt;p&gt;7.1 Analyze Service Call Chains&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Analyze the call patterns of a specific service
.topo |
  graph-match (s:"apm@apm.service" {__entity_id__: 'abcdefg123123'})
              -[e:calls]-(d:"apm@apm.service")
  project 
    source_service="s.service",
    target_service="d.service", 
    call_type="e.__type__"
| stats call_count=count(1) by source_service, target_service
| sort call_count desc&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;7.2 Permission Chain Tracing&lt;br&gt;
In complex systems, understanding how user permissions propagate to resources is crucial for security auditing and compliance checks:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Trace access paths from users to resources
.topo |
  graph-match (user:"identity@user" {__entity_id__: 'user-123'})
              -[auth:authenticated_to]-&amp;gt;(app:"apm@apm.service")
              -[access:accesses]-&amp;gt;(resource:"acs@acs.rds.instance")
  project 
    user_id="user.user_id",
    app_name="app.service",
    resource_id="resource.instance_id",
    auth_method="auth.auth_method",
    access_level="access.permission_level"&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;7.3 Data Integrity Check&lt;br&gt;
7.3.1 Check Data Integrity&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.topo | graph-call cypher(`
    MATCH (n)-[e]-&amp;gt;(m)
    RETURN 
        count(DISTINCT n) as unique_nodes,
        count(DISTINCT e) as unique_edges,
        count(DISTINCT e.__type__) as edge_types
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;7.3.2 Identify Dangling Relationships&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Find relationships pointing to non-existent entities
.let topoData = .topo | graph-call cypher(`
        MATCH ()-[e]-&amp;gt;()
        RETURN e
    `)
    | extend startNodeId = json_extract_scalar(e, '$.startNodeId'), endNodeId = json_extract_scalar(e, '$.endNodeId'), relationType = json_extract_scalar(e, '$.type')
    | project startNodeId, endNodeId, relationType;
--$topoData
.let entityData = .entity with(domain='*', type='*') 
| project __entity_id__, __entity_type__, __domain__
| extend matchedId = concat(__domain__, '@', __entity_type__, ':', __entity_id__)
| join -kind='left' $topoData on matchedId = $topoData.endNodeId
| project matchedId, startNodeId, endNodeId, relationType
| extend status = COALESCE(startNodeId, 'Dangling')
| where status = 'Dangling';
$entityData&lt;/code&gt;&lt;/pre&gt;
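&lt;p&gt;The join logic above can be sketched in plain Python (a simplified model of the check, not the actual SPL engine; the entity IDs and edge records are invented for illustration): collect the set of known entity IDs, then flag edges whose target is absent from that set.&lt;/p&gt;

```python
# Simplified model of the dangling-relationship check above.
# Entity IDs and edge records are invented for illustration.

entities = {
    "acs@acs.alb.listener:lsn-1",
    "acs@acs.ecs.instance:i-2",
}

edges = [
    {"startNodeId": "acs@acs.alb.listener:lsn-1", "endNodeId": "acs@acs.ecs.instance:i-2"},
    # This edge points to an entity that no longer exists -> dangling.
    {"startNodeId": "acs@acs.alb.listener:lsn-1", "endNodeId": "acs@acs.ecs.instance:i-gone"},
]

dangling = [e for e in edges if e["endNodeId"] not in entities]
print(len(dangling))  # 1
```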

&lt;p&gt;8.Data Integrity and Query Mode Selection&lt;br&gt;
When you use graph queries, data integrity is an issue that requires special attention. The graph query capability of EntityStore relies on three types of data: UModel (data model definitions), Entity (entity data), and Topo (topology relationship data). The data integrity of these three components directly affects query capabilities and results.&lt;/p&gt;

&lt;p&gt;8.1 Analysis of Data Missing Scenarios&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwli0v91cded3br01nsat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwli0v91cded3br01nsat.png" alt=" " width="789" height="359"&gt;&lt;/a&gt;&lt;br&gt;
8.2 Pure-topo Mode&lt;br&gt;
It should be noted that the full-featured Cypher requires complete data across UModel, Entity, and Topo. If entity data is incomplete, although topological queries can be performed, filtering based on custom properties does not work. To address this issue, the system provides a pure-topo mode:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Standard mode (complete data required)
.topo | graph-call cypher(`
    MATCH (n:``acs@acs.alb.listener`` {ListenerId: 'lsn-123'})-[e]-&amp;gt;(d)
    WHERE d.vSwitchId CONTAINS 'vsw-456'
    RETURN n, e, d
`)

-- pure-topo mode (relies only on relationship data)
.topo | graph-call cypher(`
    MATCH (n:``acs@acs.alb.listener``)-[e]-&amp;gt;(d)
    RETURN n, e, d
`, 'pure-topo')&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Characteristics of pure-topo mode:&lt;br&gt;
● Advantages: The query speed is faster without relying on entity data.&lt;br&gt;
● Limitations: Custom properties of entities cannot be used for filtering.&lt;br&gt;
● Applicable scenarios: This mode is applicable to scenarios such as topology analysis and relationship verification.&lt;/p&gt;

&lt;p&gt;8.3 Policy for Selecting Query Modes&lt;br&gt;
When all three aspects of data are complete, you can use all the features of the full-featured Cypher, including queries based on custom properties and multi-level path output. If entity data is incomplete but Topo data is complete, the pure-topo mode can be used for querying. This mode offers faster query performance but supports only topology-based queries, and filtering by entity property is not available. If Topo data is incomplete but entity data is complete, graph queries cannot be performed. This is because graph queries depend on relationships. Without relationship data, graphs cannot be formed.&lt;/p&gt;

&lt;p&gt;In practice, you should select an appropriate query method based on data integrity. If data integrity is sufficient, you can preferentially adopt the full-featured Cypher to enjoy the convenience of property-level queries. If performance is the primary concern and only topology information is required, you can use the pure-topo mode. To check data integrity, we recommend that you first run simple test queries to check data integrity before executing complex queries.&lt;/p&gt;
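&lt;p&gt;The selection policy above can be summarized as a small decision function (a sketch of the rules in this section, not an official API):&lt;/p&gt;

```python
def choose_query_mode(has_umodel, has_entity, has_topo):
    """Return the usable query mode given which data sets are complete."""
    if not has_topo:
        # Graph queries depend on relationship data; without Topo data,
        # no graph can be formed.
        return "unavailable"
    if has_umodel and has_entity:
        # All three data sets are complete: full-featured Cypher, including
        # property-based filtering and multi-level path output.
        return "full-featured"
    # Topo data alone still supports topology-only queries.
    return "pure-topo"

print(choose_query_mode(True, True, True))    # full-featured
print(choose_query_mode(True, False, True))   # pure-topo
print(choose_query_mode(True, True, False))   # unavailable
```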

&lt;p&gt;9.Performance Optimization and Best Practices&lt;br&gt;
Graph queries are powerful, but performance can become a bottleneck when dealing with large volumes of data. You can adopt proper methods and optimization policies to significantly improve query performance and ensure that the system remains stable and responsive even under heavy loads.&lt;/p&gt;

&lt;p&gt;9.1 Query Structure Optimization&lt;br&gt;
9.1.1 Proper Use of Indexes&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- ❌ Before optimization: full table scan
.topo | graph-call cypher(`
    MATCH (n) WHERE n.service = 'web-app'
    RETURN n
`)

-- ✅ After optimization: using label-based indexing
.topo | graph-call cypher(`
    MATCH (n:``apm@apm.service`` {service: 'web-app'})
    RETURN n
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;9.1.2 Early Filtering with Conditions&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- ❌ Before optimization: late filtering
.topo | graph-call cypher(`
    MATCH (start)-[*1..5]-&amp;gt;(endNode)
    WHERE start.environment = 'production' AND endNode.status = 'active'
    RETURN start, endNode
`)

-- ✅ After optimization: early filtering
.topo | graph-call cypher(`
    MATCH (start {environment: 'production'})-[*1..5]-&amp;gt;(endNode {status: 'active'})
    RETURN start, endNode
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;9.2 Query Scope Control&lt;br&gt;
Precise control over the query scope is the most critical optimization strategy:&lt;br&gt;
● Time range optimization: Use time fields to limit the query range.&lt;br&gt;
● Traversal depth limitation: Performance degrades significantly when the traversal depth exceeds five layers.&lt;br&gt;
● Exact starting point: Use a specific __entity_id__ instead of a fuzzy match.&lt;br&gt;
● Traversal type selection: Choose sequential or full traversal based on actual requirements.&lt;/p&gt;

&lt;p&gt;9.3 Result Set Control&lt;br&gt;
9.3.1 Paging and Limits&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Use LIMIT to control the number of results
.topo | graph-call cypher(`
    MATCH (service:``apm@apm.service``)-[calls:calls]-&amp;gt;(target)
    WHERE calls.request_count &amp;gt; 1000
    RETURN service.service, target.service, calls.request_count
    ORDER BY calls.request_count DESC
    LIMIT 50
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;9.3.2 Result Sampling&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Sample large result sets
.topo | graph-call cypher(`
    MATCH (n:``apm@apm.service``)
    RETURN n.service
    LIMIT 100
`)
| extend seed = random()
| where seed &amp;lt; 0.1&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;9.4 Multi-hop Traversal Optimization&lt;br&gt;
9.4.1 Control Hop Depth&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Avoid excessively deep traversals
.topo | graph-call cypher(`
    MATCH (start)-[path*1..3]-&amp;gt;(endNode)
    WHERE length(path) &amp;lt;= 2
    RETURN path
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;9.4.2 Use Directional Optimization&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Reduce search space by using the relationship direction
.topo | graph-call cypher(`
    MATCH (start)-[calls:calls*1..3]-&amp;gt;(endNode)  -- Explicit direction
    WHERE start.__entity_type__ = 'apm.service'
    RETURN start, endNode
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;9.5 Best Practice Recommendations&lt;br&gt;
● Filter with SPL: Filter out unwanted results promptly after the graph query.&lt;br&gt;
● Batch processing: For large-scale graph queries, process the data in batches.&lt;br&gt;
● Result caching: Cache the results of frequently queried paths.&lt;br&gt;
● Query splitting: Break complex queries into multiple simple ones, and then combine the results by using SPL.&lt;/p&gt;

&lt;p&gt;10.FAQ&lt;br&gt;
10.1 Edge Type Coincides with a Cypher Keyword&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.topo | graph-call cypher(`
    MATCH (s)-[e:``contains``]-&amp;gt;(d)
    WHERE s.__domain__ CONTAINS "apm"
    RETURN e
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The term contains is both a Cypher keyword and an edge type. In this case, as an edge label in Cypher syntax, it must be enclosed in backticks (`) to escape it. Furthermore, since the Cypher query is embedded within an SPL context, where backticks themselves need to be escaped, a single backtick in Cypher is represented as double backticks in SPL. Therefore, the edge type contains must be enclosed with double backticks to ensure correct parsing.&lt;/p&gt;

&lt;p&gt;10.2 Multi-hop Syntax Description&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Find call chains with 2 to 3 hops
.topo | graph-call cypher(`
    MATCH (src {__entity_type__:"acs.service"})-[e:calls*2..4]-&amp;gt;(dest)
    WHERE dest.__domain__ = 'acs'
    RETURN src, dest, dest.__entity_type__
`)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note:&lt;br&gt;
● The multi-hop range is left-closed and right-open. For example, *2..4 indicates 2 hops or 3 hops.&lt;br&gt;
● *1..3 indicates 1 hop or 2 hops, but not 3 hops.&lt;/p&gt;

&lt;p&gt;Verify this conclusion:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.topo | graph-call cypher(`
    MATCH (s)-[e*1..3]-&amp;gt;(d)
    RETURN length(e) as len
`, 'pure-topo')
| stats cnt=count(1) by len
| project len, cnt&lt;/code&gt;&lt;/pre&gt;
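&lt;p&gt;The left-closed, right-open convention described in the note can also be expressed directly (a sketch of the semantics above, not engine code): a *m..n pattern matches the hop counts in Python's range(m, n).&lt;/p&gt;

```python
def hops_matched(spec):
    """Map a '*m..n' variable-length pattern to the hop counts it matches.

    Per the left-closed, right-open convention described above,
    '*2..4' matches 2 or 3 hops but not 4.
    """
    m, n = (int(x) for x in spec.lstrip("*").split(".."))
    return list(range(m, n))

print(hops_matched("*2..4"))  # [2, 3]
print(hops_matched("*1..3"))  # [1, 2]
```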

&lt;p&gt;10.3 Cypher Relationship Abbreviation Not Supported&lt;br&gt;
✅ Supported syntax:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.topo | graph-call cypher(`
    MATCH (s)-[]-&amp;gt;(d)
    RETURN s
`, 'pure-topo')&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;❌ Unsupported syntax:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.topo | graph-call cypher('
MATCH (s)--&amp;gt;(d)
RETURN s
', 'pure-topo')&lt;/code&gt;&lt;/pre&gt;

</description>
      <category>ai</category>
      <category>integration</category>
    </item>
    <item>
      <title>Android Crash Monitoring: A Complete Troubleshooting Flow for Production Environment Crashes</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Fri, 13 Mar 2026 06:38:54 +0000</pubDate>
      <link>https://dev.to/observabilityguy/android-crash-monitoring-a-complete-troubleshooting-flow-for-production-environment-crashes-oo3</link>
      <guid>https://dev.to/observabilityguy/android-crash-monitoring-a-complete-troubleshooting-flow-for-production-environment-crashes-oo3</guid>
      <description>&lt;p&gt;This article demonstrates a complete production crash troubleshooting flow using Alibaba Cloud RUM, from alerting and stack analysis to user behavior tracing and root cause identification.&lt;br&gt;
I. Background: Why Is Crash Collection Necessary?&lt;br&gt;
Series Review: In the previous article In-depth Analysis of the Principles of Android Crash Capture and a Closed-loop Framework from Crash to Root Cause Identification, we analyzed the internal technical details of crash collection: from the UncaughtExceptionHandler mechanism in the Java layer, to signal handling and Minidump technology in the Native layer, to the symbolication principles for obfuscated stacks. That article should have given you a clear understanding of "how crashes are caught."&lt;/p&gt;

&lt;p&gt;However, theory alone is not enough. This article will reproduce a production environment case to show how an Android developer, when encountering an online crash problem, can perform crash analysis and positioning through exception data and context collected by Real User Monitoring (RUM). It will take you through the complete flow of crash troubleshooting: from receiving alerts, viewing the console, analyzing the stack, and tracking user behavior, to locating the root cause.&lt;/p&gt;

&lt;p&gt;1.1 Case background&lt;br&gt;
An app V3.5.0 was published, which mainly optimized the loading performance of the product list. However, on the third day after the version was published, the team started to receive a large number of user complaints about unexpected app exits and crashes.&lt;/p&gt;

&lt;p&gt;Severity:&lt;/p&gt;

&lt;p&gt;● More than a 10-fold increase in the crash rate&lt;/p&gt;

&lt;p&gt;● App store ratings dropped&lt;/p&gt;

&lt;p&gt;● The user uninstallation rate increased&lt;/p&gt;

&lt;p&gt;Final solution: Alibaba Cloud RUM SDK was integrated to collect crash data and locate the problem within two hours.&lt;/p&gt;

&lt;p&gt;II. Complete Troubleshooting Flow: From Alerting to Root Cause Positioning&lt;br&gt;
2.1 🔔 Step 1: Receive crash alerts&lt;br&gt;
After data integration, because alert rules are configured, the team's developers receive alert notifications as soon as the online crash rate rises significantly and can follow up on the problem immediately.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7d8fkawy5t7xoce05lsu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7d8fkawy5t7xoce05lsu.png" alt=" " width="688" height="561"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;app.name: xxx and crash | SELECT diff[1] AS "current value", diff[2] AS "yesterday's value", round(diff[3], 4) AS "ratio" FROM ( SELECT compare(cnt, 86400) AS diff FROM ( SELECT COUNT(*) AS cnt FROM log)) ORDER BY "current value" DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
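&lt;p&gt;The day-over-day comparison that this alert query performs can be sketched as follows (illustrative numbers; the SLS compare() function pairs the current window's count with the count from 86400 seconds earlier):&lt;/p&gt;

```python
def day_over_day(current_cnt, yesterday_cnt):
    """Mimic the shape of compare(cnt, 86400): current value, prior value, ratio."""
    ratio = round(current_cnt / yesterday_cnt, 4)
    return current_cnt, yesterday_cnt, ratio

# Hypothetical counts: 1300 crashes today vs. 100 yesterday.
current, prior, ratio = day_over_day(1300, 100)
print(ratio)  # 13.0 -- the crash count is 13x yesterday's, worth alerting on
```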



&lt;p&gt;2.2 📊 Step 2: View crash overview - Lock the exception type&lt;br&gt;
Operation path: Console homepage → RUM → Find the corresponding app → Exception statistics&lt;/p&gt;

&lt;p&gt;By analyzing the exception list displayed in the console, we found that IndexOutOfBoundsException accounted for the vast majority of crashes, making it clearly the main problem, and that it began to appear in large numbers after V3.5.0 was published.&lt;/p&gt;

&lt;p&gt;2.3 🔍 Step 3: Analyze the crash stack - Preliminary positioning&lt;br&gt;
Click to enter the IndexOutOfBoundsException details page for in-depth analysis. This confirmed our hypothesis: the crash occurred in the newly published V3.5.0, and the page where it occurred is ProductListActivity. The corresponding session ID is 98e9ce65-c51a-40c4-9232-4b69849e5985-01. This information is used for our subsequent analysis of user behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gjpqi8vwyeclgq0fcjw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gjpqi8vwyeclgq0fcjw.png" alt=" " width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;View the crash stack and analyze key information:&lt;/p&gt;

&lt;p&gt;● The crash occurred on line 50 of the ProductListAdapter.onBindViewHolder() method.&lt;/p&gt;

&lt;p&gt;● Fault reason: Attempted to access the 6th element (index 5) of the List, but the list has only 5 elements.&lt;/p&gt;

&lt;p&gt;● This is a typical RecyclerView data inconsistency problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hnsj7y44530tt2uw6dp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hnsj7y44530tt2uw6dp.png" alt=" " width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Preliminary assumptions:&lt;/p&gt;

&lt;p&gt;● It may be that the data update timing is incorrect.&lt;/p&gt;

&lt;p&gt;● It may be multi-threaded concurrent modification of data.&lt;/p&gt;

&lt;p&gt;● It may be caused by rapid user operations.&lt;/p&gt;

&lt;p&gt;However, the root cause cannot be determined solely by the stack. You need to view the specific operation path of the user.&lt;/p&gt;

&lt;p&gt;2.4 🎯 Step 4: Track user behavior - Find the trigger path&lt;br&gt;
Operation path: Crash details page → Select the session ID corresponding to the crash → View the session trace of the session ID&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg9s3ptm5aqnfzxwc8bn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg9s3ptm5aqnfzxwc8bn.png" alt=" " width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clicking into the session details to view the user behavior path, combined with the page where the crash occurred, we identified the following operation path.&lt;/p&gt;

&lt;p&gt;Operation path:&lt;/p&gt;

&lt;p&gt;● Go to the ProductListActivity page.&lt;/p&gt;

&lt;p&gt;● Quickly click the refresh button three times consecutively, triggering an asynchronous update of the list (Note: A network request actually occurs here. Because we are reproducing it locally, an asynchronous update is used.)&lt;/p&gt;

&lt;p&gt;● Online request timing issues:&lt;/p&gt;

&lt;p&gt;The first asynchronous request returns several items, and the user scrolls to the 6th one.&lt;br&gt;
A subsequent request returns only five items and updates the list data.&lt;/p&gt;

&lt;p&gt;● RecyclerView is still rendering the 6th position, but the data no longer exists.&lt;/p&gt;

&lt;p&gt;● Root cause: Data race caused by multiple asynchronous requests.&lt;/p&gt;
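&lt;p&gt;The timing above can be reproduced in a few lines (a language-neutral Python sketch of the race, not the app's actual code): the view still holds position 5 while the refreshed list has shrunk to 5 elements.&lt;/p&gt;

```python
# Sketch of the data race: the UI holds a stale position while the
# backing list is replaced by a shorter response.

products = [f"item-{i}" for i in range(6)]  # first response: 6 items
stale_position = 5                          # user has scrolled to the 6th item

# A later response arrives and replaces the data with only 5 items.
products.clear()
products.extend(f"item-{i}" for i in range(5))

try:
    products[stale_position]  # what onBindViewHolder effectively does
except IndexError as e:
    print(type(e).__name__)  # IndexError (IndexOutOfBoundsException in Java)
```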

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd34qv2l3vog7rb8c88mt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd34qv2l3vog7rb8c88mt.png" alt=" " width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.5 🌐 Step 5: Multidimensional Analysis - Validate Assumptions&lt;br&gt;
To further confirm the issue, you can perform multi-dimensional filtering and analysis on the crash data to analyze failure features and confirm the impact scope.&lt;/p&gt;

&lt;p&gt;2.5.1 Crash data structure&lt;br&gt;
The crash data collected by the SDK contains the following core fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "session.id": "session_abc123", // The session ID, which is used to associate the user behavior path.
  "timestamp": 169988400000, // The time when the crash occurred, in milliseconds.
  "exception.type": "crash", // The type of the exception.
  "exception.subtype": "java", // The subtype of the exception.
  "exception.name": "java.lang.NullPointerException", // The exception class name.
  "exception.message": "Attempt to invoke virtual method on a null object", // The error message.
  "exception.stack": "[{...}]", // The full stack (JSON array).
  "exception.thread_id": 1, // The ID of the crash thread.
  "view.id": "123-abc", // The ID of the page on which the crash occurred.
  "view.name": "NativeCrashActivity", // The name of the page on which the crash occurred.
  "user.tags": "{\"vip\":\"true\"}", // User tags (custom).
  "properties": "{\"version\":\"2.1.0\"}", // Custom properties.
  "net.type": "WIFI", // The network type of the user.
  "net.ip": "192.168.1.100", // The IP address of the client.
  "device.id": "123-1234", // The ID of the user device.
  "os.version": 14, // The version number of the user's system.
  "os.type": "Android" // The system type of the user.
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2.5.2 Overview of the crash dashboard&lt;br&gt;
Location: RUM &amp;gt; Experience dashboard &amp;gt; Exception analysis&lt;/p&gt;

&lt;p&gt;On the exception analysis dashboard, you can view the overall breakdown results of the application, including the total number of exceptions, exception trend, device distribution, exception type, and network distribution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4aw6tjxb810sa5ecqif.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4aw6tjxb810sa5ecqif.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.5.3 Network type distribution&lt;br&gt;
Because the actual list update operation is returned by a network request, we need to pay attention to the user's network type when a crash occurs in the online data. You can view the crash network distribution of V3.5.0 in the crash dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxqzmc7alp1j8l6r8644.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxqzmc7alp1j8l6r8644.png" alt=" " width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💡Conclusion: 90% of crashes occur on 3G/4G networks, and the crash rate on WiFi networks is very low. This confirms that the network (asynchronous requests) is the key factor.&lt;/p&gt;

&lt;p&gt;2.5.4 Device brand distribution&lt;br&gt;
View the distribution of device brands that crashed in V3.5.0 on the crash dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o7yfw1r5kn5iqlsjxw0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o7yfw1r5kn5iqlsjxw0.png" alt=" " width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💡Conclusion: All brands are affected. It is not a model-specific issue, but a code logic issue.&lt;/p&gt;

&lt;p&gt;2.5.5 Version comparison&lt;br&gt;
In addition to the crash dashboard, we can also run custom SQL analysis on the Log Explorer tab.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;app.name: xxx and crash | select "app.version", count(*) from log group by "app.version"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Operation: Compare the crash rates of V3.4.0 and V3.5.0.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflmkxm93io2d8ayv1vre.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflmkxm93io2d8ayv1vre.png" alt=" " width="584" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.6 💻 Step 6: Locate code issues&lt;br&gt;
View the problematic code&lt;br&gt;
Open ProductListActivity.java and find the refresh logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;private void loadProducts() {
    // ❌ Changes in v3.5.0: Optimize performance with asynchronous loading.
    new Thread(() -&amp;gt; {
        try {
            // Simulate a network request.
            List&lt;span class="nt"&gt;&amp;lt;Product&amp;gt;&lt;/span&gt; newProducts = ApiClient.getProducts(currentCategory);
            // ❌ Problem 1: The previous request was not canceled.
            // ❌ Problem 2: Directly clear and update data without considering that RecyclerView is rendering.
            runOnUiThread(() -&amp;gt; {
                productList.clear(); //💥Dangerous operation!
                productList.addAll(newProducts); //💥Data update.
                adapter.notifyDataSetChanged(); //💥Notification refresh.
            });
        } catch (Exception e) {
            e.printStackTrace();
        }
    }).start();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;@Override
public void onBindViewHolder(@NonNull ProductViewHolder holder, int position) {
    //💥Crash point: The position may exceed the range of products.
    Product product = products.get(position); //IndexOutOfBoundsException!
    holder.bind(product);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Find the root cause of the problem&lt;br&gt;
Purpose of V3.5.0 changes: Optimize performance and move network requests to subthreads.&lt;/p&gt;

&lt;p&gt;Introduced issues:&lt;/p&gt;

&lt;p&gt;● The previous request is not canceled: When the user quickly clicks the refresh button, multiple requests run at the same time.&lt;br&gt;
● Data race: When a later request returns, the data is cleared and replaced.&lt;br&gt;
● Inconsistent UI state: RecyclerView is still rendering a position whose backing data has been removed.&lt;/p&gt;

&lt;p&gt;III. Symbolication Configuration: Make the Stack "Speak Human Language"&lt;br&gt;
Through the preceding troubleshooting process, we successfully located the root cause of the crash: the ProductListAdapter.onBindViewHolder() method has an index out-of-bounds problem when handling data updates. But you may have a question: how did we get from the obfuscated stack to exactly this line of code, ProductListAdapter.java:50?&lt;/p&gt;

&lt;p&gt;In a real production environment, to protect code and optimize package size, release versions published to the app store are obfuscated by ProGuard or R8. This means the crash stack initially seen on the console is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;java.lang.IndexOutOfBoundsException: Index: 5, Size: 5
    at java.util.ArrayList.get(ArrayList.java:437)
    at com.shop.a.b.c.d.a(Proguard:58)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the reason why we need symbolication. Next, let's see how to configure symbolication in the RUM console.&lt;/p&gt;

&lt;p&gt;3.1 Symbolicate obfuscated Java/Kotlin stacks&lt;br&gt;
Step 1: Preserve the mapping.txt file&lt;br&gt;
After a release build completes, the mapping.txt file is located at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;app/build/outputs/mapping/release/mapping.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sample file content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;com.example.ui.MainActivity -&amp;gt; a.b.c.MainActivity:
    void updateUserProfile(com.example.model.User) -&amp;gt; a
    void onClick(android.view.View) -&amp;gt; b

com.example.model.User -&amp;gt; a.b.d.User:
    java.lang.String userName -&amp;gt; a
    void setUserName(java.lang.String) -&amp;gt; a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
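&lt;p&gt;To illustrate what symbolication does with this file, here is a minimal sketch (not the actual retrace implementation) that parses the class lines of a mapping.txt and resolves the class part of an obfuscated frame back to its original name:&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the class-name lookup a retrace tool performs:
// class lines start at column 0 and look like
// "original.Name -> obfuscated.Name:", while member lines are indented.
public class MappingLookupDemo {
    static Map<String, String> parseClassMapping(String mappingText) {
        Map<String, String> obfuscatedToOriginal = new HashMap<>();
        for (String line : mappingText.split("\n")) {
            if (!line.startsWith(" ") && line.endsWith(":") && line.contains(" -> ")) {
                String[] parts = line.substring(0, line.length() - 1).split(" -> ");
                obfuscatedToOriginal.put(parts[1], parts[0]);
            }
        }
        return obfuscatedToOriginal;
    }

    public static void main(String[] args) {
        String mapping = "com.example.ui.MainActivity -> a.b.c.MainActivity:\n"
                + "    void updateUserProfile(com.example.model.User) -> a\n";
        // The class part of an obfuscated frame such as "a.b.c.MainActivity.a"
        // resolves back to the original class name:
        System.out.println(parseClassMapping(mapping).get("a.b.c.MainActivity"));
        // prints com.example.ui.MainActivity
    }
}
```

&lt;p&gt;A real tool also uses the indented member lines (with line-number ranges) to recover method names, which is how the console turns Proguard:58 back into a readable frame.&lt;/p&gt;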



&lt;p&gt;Step 2: Upload the mapping file to the console&lt;br&gt;
Log on to the Cloud Monitor 2.0 console.&lt;br&gt;
Go to RUM &amp;gt; your connected application &amp;gt; Application Settings &amp;gt; File Management.&lt;br&gt;
Click Symbol Table File &amp;gt; Upload File.&lt;br&gt;
Upload the mapping.txt file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqtp29w3kxzyymf290ax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqtp29w3kxzyymf290ax.png" alt=" " width="800" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.2 Symbolicate native .so stacks&lt;br&gt;
After the build completes, the .so files that contain debug symbols are located at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;app/build/intermediates/cxx/release/xxx/obj/
├── arm64-v8a/
│   └── xxx-native.so   ← contains debug symbols
├── armeabi-v7a/
│   └── xxx-native.so
└── x86_64/
    └── xxx-native.so
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 3: Upload to the console&lt;br&gt;
Similar to the Java mapping file, upload the .so file of the corresponding architecture in the console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F814bthc2yo6bzlo8uoa9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F814bthc2yo6bzlo8uoa9.png" alt=" " width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.3 Verify symbolication&lt;br&gt;
Use the symbol table file for parsing: open Crash Details &amp;gt; Exception Details &amp;gt; Parse Stack &amp;gt; select the corresponding symbol table file (use the .so file for the native stack and the .txt file for the Java stack).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fud1rvymrcimkuk77ew0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fud1rvymrcimkuk77ew0g.png" alt=" " width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click OK to display the parsed stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5nr4oayh9r9klzfhpdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5nr4oayh9r9klzfhpdf.png" alt=" " width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Signs of successful symbolication:&lt;/p&gt;

&lt;p&gt;● The full class name and method name are displayed.&lt;/p&gt;

&lt;p&gt;● The source file path and line number are shown.&lt;/p&gt;

&lt;p&gt;● C++ function names are restored (demangled).&lt;/p&gt;

&lt;p&gt;IV. 📝 Case Summary: Key Value of RUM&lt;br&gt;
What key help does RUM provide in troubleshooting this crash?&lt;/p&gt;

&lt;p&gt;1.Complete stack information + symbolication&lt;br&gt;
● Without RUM: online applications show only the obfuscated stack, so you cannot tell where the crash occurred.&lt;/p&gt;

&lt;p&gt;● With RUM: After uploading the mapping file, you can accurately pinpoint ProductListAdapter.java:50.&lt;/p&gt;

&lt;p&gt;2.User behavior path tracing&lt;br&gt;
● Without RUM: we only know that "the user opened the list and it crashed" and cannot reproduce the issue.&lt;/p&gt;

&lt;p&gt;● With RUM: You can view the complete operation timeline and discover that the issue is triggered by "rapidly clicking refresh multiple times."&lt;/p&gt;

&lt;p&gt;3.Multi-dimensional data analysis&lt;br&gt;
● Without RUM: You do not know which users are affected or in what environment the crash occurred.&lt;/p&gt;

&lt;p&gt;● With RUM:&lt;/p&gt;

&lt;p&gt;You discover that 90% of crashes occur on 3G or 4G networks (network latency is the key factor).&lt;br&gt;
All device models are affected (hardware issues are ruled out).&lt;br&gt;
The issue first appeared in v3.5.0 (the version change is identified).&lt;/p&gt;

&lt;p&gt;4.Real-time alerting + quantified impact&lt;br&gt;
● Without RUM: You rely on user complaints, and discovery is lagged.&lt;/p&gt;

&lt;p&gt;● With RUM: You receive alerts immediately and start troubleshooting immediately.&lt;/p&gt;

&lt;p&gt;Application stability is the cornerstone of user experience. Through systematic crash collection and analysis, development teams can shift from "passive response" to "proactive prevention," continuously improving application quality and winning user trust. Alibaba Cloud RUM provides a non-intrusive SDK that collects application performance, stability, and user behavior data on Android; you can follow the integration document to try it out. Beyond Android, RUM also supports monitoring and analysis for platforms such as Web, mini programs, iOS, and HarmonyOS. For related questions, you can join the RUM support group (DingTalk group ID: 67370002064).&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Unified Cross-cloud Logging: Intelligently Importing S3 Data into SLS</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Fri, 13 Mar 2026 05:48:06 +0000</pubDate>
      <link>https://dev.to/observabilityguy/unified-cross-cloud-logging-intelligently-importing-s3-data-into-sls-1i08</link>
      <guid>https://dev.to/observabilityguy/unified-cross-cloud-logging-intelligently-importing-s3-data-into-sls-1i08</guid>
      <description>&lt;p&gt;This article introduces an intelligent, elastic solution for reliably and efficiently importing diverse S3 log data into Alibaba Cloud’s SLS across multicloud environments.&lt;/p&gt;

&lt;p&gt;1.Why You Need to Import Data from S3 to SLS&lt;br&gt;
With multicloud architectures becoming increasingly popular today, enterprises often face the following scenario: Services on AWS generate a large amount of log data stored in S3, but this data needs to be centralized for analysis and processing in Simple Log Service (SLS) on Alibaba Cloud.&lt;/p&gt;

&lt;p&gt;Typical scenarios include:&lt;/p&gt;

&lt;p&gt;● AWS service log analysis: CloudTrail audit logs, VPC Flow Logs, ALB access logs, and others need to be analyzed centrally.&lt;/p&gt;

&lt;p&gt;● Backflow of business data from outside China: logs generated by services outside China need to flow back to China for compliance audits and analysis.&lt;/p&gt;

&lt;p&gt;● Unified multicloud operations: enterprises adopting a multicloud strategy need to perform log analysis and alerting on a unified platform.&lt;/p&gt;

&lt;p&gt;SLS provides powerful real-time analysis capabilities, flexible query syntax, and a comprehensive alerting mechanism, making it an ideal choice for unified log management. However, efficiently and reliably importing massive amounts of logs from S3 into SLS remains a challenging technical problem.&lt;br&gt;
2.Technical Challenges: Behind the Seemingly Simple Data Transfer&lt;br&gt;
Challenge 1: Discovering Massive Numbers of Small Files in Real Time&lt;br&gt;
Many AWS services (such as CloudTrail and ALB) continuously write small files to S3, potentially generating hundreds or thousands of files per minute. How can you discover these new files quickly and import them in a timely manner?&lt;/p&gt;

&lt;p&gt;The core difficulty: the ListObjects API of S3 only returns objects in lexicographic order and does not support filtering by time. This means that finding the latest files may require traversing the entire folder tree.&lt;/p&gt;

&lt;p&gt;For example, assume an S3 bucket already contains hundreds of millions of historical files, and more than 1,000 files are added every minute. With full traversal, a single scan may take several minutes, which fails to meet timeliness requirements. With incremental traversal alone, data may be missed because of irregular file naming.&lt;/p&gt;
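&lt;p&gt;A minimal sketch of why incremental traversal alone can miss data (a sorted in-memory set stands in for S3's lexicographic listing; key names are illustrative): the scanner checkpoints the last key it saw and resumes after it, so a new file whose name sorts before the checkpoint is invisible to the incremental pass and must be caught by a periodic full scan.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch: incremental discovery resumes after a checkpoint key,
// mirroring how a scanner would use lexicographic listing on S3.
public class IncrementalScanDemo {
    private final TreeSet<String> bucket = new TreeSet<>(); // stands in for S3
    private String checkpoint = "";                          // last key seen

    void upload(String key) { bucket.add(key); }

    List<String> scanNewKeys() {
        // Only keys strictly after the checkpoint are discovered.
        List<String> discovered = new ArrayList<>(bucket.tailSet(checkpoint, false));
        if (!discovered.isEmpty()) checkpoint = discovered.get(discovered.size() - 1);
        return discovered;
    }

    public static void main(String[] args) {
        IncrementalScanDemo scanner = new IncrementalScanDemo();
        scanner.upload("logs/2026-03-13-00.gz");
        scanner.upload("logs/2026-03-13-01.gz");
        System.out.println(scanner.scanNewKeys()); // both keys discovered
        scanner.upload("logs/2026-03-13-02.gz");   // sorts after the checkpoint: found
        scanner.upload("logs/2026-03-12-23.gz");   // sorts before it: missed!
        System.out.println(scanner.scanNewKeys()); // only the -02 file shows up
    }
}
```

&lt;p&gt;The "-23" file dropped behind the checkpoint is exactly the case that the full-traversal fallback described below exists to cover.&lt;/p&gt;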

&lt;p&gt;Challenge 2: Handling Traffic Bursts Elastically&lt;br&gt;
Service traffic is often volatile. E-commerce promotions, marketing campaigns, and system failures can all cause log volume to surge instantly.&lt;/p&gt;

&lt;p&gt;Real-world scenario: an e-commerce customer normally generates 1 GB of logs per minute, but this figure soars to 10 GB or even higher during sales promotions. If the import capability cannot scale out quickly, data backlogs will occur, hurting the timeliness of real-time analysis and alerting.&lt;/p&gt;

&lt;p&gt;The greater challenge is that these traffic spikes are often unpredictable. The system needs to detect traffic changes automatically and complete a scale-out within minutes, which places high demands on the scheduling system.&lt;/p&gt;

&lt;p&gt;Challenge 3: Diverse Data Formats and Cost Control&lt;br&gt;
Log data in S3 varies widely:&lt;/p&gt;

&lt;p&gt;● Compression formats: gzip, snappy, lz4, zstd, and others.&lt;/p&gt;

&lt;p&gt;● Data formats: JSON, CSV, Parquet, plain text, and others.&lt;/p&gt;

&lt;p&gt;● Data quality: the data may contain dirty records and require field extraction and transformation.&lt;/p&gt;

&lt;p&gt;If you import data into SLS as is and process it afterward, extra storage and compute costs are incurred. The ideal solution is to complete data cleansing and transformation during the import process.&lt;/p&gt;

&lt;p&gt;3.Our Solution: Intelligent, Elastic, and Comprehensive&lt;br&gt;
In the scenario of migration from S3 to SLS, the biggest headache for O&amp;amp;M teams is "how to move data quickly and stably." Traditional solutions often face a dilemma: either fast but prone to missing data, or stable but slow as a snail.&lt;/p&gt;

&lt;p&gt;The SLS team's answer: why choose? This solution delivers both.&lt;/p&gt;

&lt;p&gt;Through an innovative two-stage parallel architecture:&lt;/p&gt;

&lt;p&gt;● Phase 1 (file discovery): combines real-time event capture with periodic full validation to ensure that not a single file is missed.&lt;/p&gt;

&lt;p&gt;● Phase 2 (data pull): dedicated transmission channels run at full speed, unaffected by file scanning.&lt;/p&gt;

&lt;p&gt;● Key innovation: the two phases run independently and in parallel, delivering both speed and stability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3a9ggwcj0w8y3yd299j.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3a9ggwcj0w8y3yd299j.jpeg" alt=" " width="800" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Real-time File Discovery: Near-Real-Time Discovery with Zero Data Loss&lt;br&gt;
Solution 1: Dual-mode Intelligent Traversal&lt;br&gt;
To address the file discovery difficulty, two complementary traversal modes are provided:&lt;/p&gt;

&lt;p&gt;Full traversal mode&lt;br&gt;
● Periodically (for example, every minute) performs a complete scan of the specified folder.&lt;/p&gt;

&lt;p&gt;● Ensures that no files are missed, which suits scenarios with extremely high data-integrity requirements.&lt;/p&gt;

&lt;p&gt;● Records imported files to avoid duplicate processing.&lt;/p&gt;

&lt;p&gt;Incremental traversal mode&lt;br&gt;
● An incremental discovery mechanism based on lexicographic order.&lt;/p&gt;

&lt;p&gt;● Each scan resumes from the last scanned position, so new files are discovered quickly.&lt;/p&gt;

&lt;p&gt;● Suitable for standard scenarios where files are named in chronological order; minute-level real-time import is achievable.&lt;/p&gt;

&lt;p&gt;Combining the two modes: incremental traversal ensures real-time performance, while full traversal acts as a fallback to guarantee integrity.&lt;/p&gt;

&lt;p&gt;Solution 2: SQS Event-driven Import&lt;br&gt;
For scenarios with extremely high real-time requirements, the import flow can be driven by Simple Queue Service (SQS) message queues:&lt;/p&gt;

&lt;p&gt;Configure S3 event notifications: when a new file is uploaded to S3, an event is automatically sent to SQS.&lt;br&gt;
Consume messages in real time: the import service retrieves file change notifications from SQS.&lt;br&gt;
Import precisely: the specified files are imported directly, with no traversal needed.&lt;br&gt;
This solution achieves minute-level import latency and is particularly suitable for:&lt;/p&gt;

&lt;p&gt;● Scenarios where the file creation order is irregular&lt;/p&gt;

&lt;p&gt;● Businesses with strict requirements for real-time performance&lt;/p&gt;

&lt;p&gt;● Complex scenarios where multiple folders need to be monitored simultaneously&lt;/p&gt;
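&lt;p&gt;The SQS-driven path can be sketched with a plain in-memory queue standing in for SQS (the keys are illustrative). Because S3 event notifications and standard SQS queues deliver at least once, the consumer also de-duplicates by object key:&lt;/p&gt;

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Illustrative sketch: S3 pushes one event per new object, the importer polls
// the queue and imports exactly the named file, and duplicate deliveries
// (at-least-once semantics) are skipped by tracking imported keys.
public class EventDrivenImportDemo {
    public static void main(String[] args) {
        Queue<String> sqs = new ArrayDeque<>();   // stands in for the SQS queue
        sqs.add("logs/app-0001.gz");
        sqs.add("logs/app-0002.gz");
        sqs.add("logs/app-0001.gz");              // duplicate delivery

        Set<String> imported = new HashSet<>();
        while (!sqs.isEmpty()) {
            String key = sqs.poll();
            if (imported.add(key)) {              // false when already imported
                System.out.println("importing " + key);
            }
        }
    }
}
```

&lt;p&gt;Compared with the traversal modes, no listing is needed at all: each message names the exact bucket and key to pull, which is why this path stays fast even when file names follow no pattern.&lt;/p&gt;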

&lt;p&gt;Comparison of solutions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf25s9m9hcyalyvw44iw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf25s9m9hcyalyvw44iw.png" alt=" " width="584" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gnbd109e9kw91xi83ku.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gnbd109e9kw91xi83ku.jpeg" alt=" " width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Intelligent automatic scaling: Automatically responds to traffic fluctuations&lt;br&gt;
We have implemented three elasticity mechanisms to handle traffic bursts:&lt;/p&gt;

&lt;p&gt;1.Adaptive adjustment based on sliding windows&lt;br&gt;
● Evaluates the data volume to be imported every 5 minutes.&lt;/p&gt;

&lt;p&gt;● Estimates the required concurrency from object metadata (size and quantity).&lt;/p&gt;

&lt;p&gt;● Automatically scales out or in so that the import speed matches the data generation speed.&lt;/p&gt;

&lt;p&gt;2.Long-tail optimization&lt;br&gt;
● Balances the number of files and the data volume assigned to different jobs as evenly as possible, avoiding latency caused by long-tail stragglers.&lt;/p&gt;

&lt;p&gt;3.Pre-provisioned concurrency for predictable peaks&lt;br&gt;
● Users can submit a ticket to set the import concurrency based on their business patterns.&lt;/p&gt;

&lt;p&gt;● For example, if a user predicts peak traffic for an upcoming campaign, the user can submit a ticket to SLS to preset the job concurrency.&lt;/p&gt;
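&lt;p&gt;The sliding-window estimate described above reduces to a simple calculation: derive the worker count needed to drain the observed backlog within the window, then clamp it. The per-worker throughput, window length, and cap below are illustrative assumptions, not SLS's actual parameters.&lt;/p&gt;

```java
// Illustrative sketch of sliding-window scaling: estimate backlog from object
// metadata, compute the workers needed to drain it within the window, clamp.
public class ScalingEstimateDemo {
    static int requiredWorkers(long backlogBytes, long perWorkerBytesPerSec,
                               long windowSeconds, int maxWorkers) {
        long perWorkerCapacity = perWorkerBytesPerSec * windowSeconds;
        // Ceiling division: workers needed to clear the backlog in one window.
        int needed = (int) ((backlogBytes + perWorkerCapacity - 1) / perWorkerCapacity);
        return Math.max(1, Math.min(needed, maxWorkers));
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        // Normal load: 5 GiB observed in a 5-minute window, 20 MiB/s per worker.
        System.out.println(requiredWorkers(5 * gib, 20L * 1024 * 1024, 300, 300));   // → 1
        // Burst: 600 GiB in the same window triggers a scale-out.
        System.out.println(requiredWorkers(600 * gib, 20L * 1024 * 1024, 300, 300)); // → 103
    }
}
```

&lt;p&gt;Evaluating this every window and scaling to the result is what lets the system track a 10x surge within minutes instead of backlogging.&lt;/p&gt;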

&lt;p&gt;The following figure shows a large-scale import scenario in which the system scales out and in quickly: it reaches a concurrency of 300 and imports file data at nearly 5.8 GB/s.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6adocl0iude7oawc83y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6adocl0iude7oawc83y.png" alt=" " width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comprehensive data processing capabilities&lt;br&gt;
Seamless support for multiple formats&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4m9k4bhoj3srwh1tqa84.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4m9k4bhoj3srwh1tqa84.png" alt=" " width="789" height="150"&gt;&lt;/a&gt;&lt;br&gt;
Processing before storage: Cost-saving and efficient&lt;/p&gt;

&lt;p&gt;The traditional approach is "store first, then process," which incurs unnecessary storage costs. We support processing before data is ingested into SLS, including:&lt;/p&gt;

&lt;p&gt;● Field extraction: extracts key fields from unstructured logs.&lt;/p&gt;

&lt;p&gt;● Data filtering: discards useless logs to reduce storage size.&lt;/p&gt;

&lt;p&gt;● Field transformation: format standardization, UNIX timestamp conversion, and so on.&lt;/p&gt;

&lt;p&gt;● Data masking: masks sensitive information.&lt;/p&gt;

&lt;p&gt;Example of data processing before storage&lt;br&gt;
Source text log&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl06rcwfl39dw3gdrmt9l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl06rcwfl39dw3gdrmt9l.png" alt=" " width="800" height="80"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ingestion processor rules&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;* | parse-csv -delim='\t' content as time,level,order_id,amount,currency,error_code,response_time,status_code,client_id,customer_email,id_card 
| project-away content
| extend customer_email = regexp_replace(customer_email, '([\s\S]+)@([\s\S]+)', '****@\2') 
| extend id_card = regexp_replace(id_card, '(\d{3,3})(\d+)(\d{3,3})', '\1*****\3')
| extend __time__ = cast(to_unixtime(cast(time as TIMESTAMP)) as bigint) - 28800
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
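&lt;p&gt;To see what the two masking steps in the rule above do, here is the same substitution mirrored with Java's regex engine (\d{3,3} in the rule is equivalent to \d{3}; the sample values are made up):&lt;/p&gt;

```java
// Illustrative sketch: reproduce the email and ID-card masking from the
// ingestion processor rule using Java regex replacement.
public class MaskingDemo {
    public static void main(String[] args) {
        String email = "alice@example.com";          // made-up sample value
        String idCard = "110101199001011234";        // made-up sample value

        // Keep only the domain part of the email address.
        String maskedEmail = email.replaceAll("([\\s\\S]+)@([\\s\\S]+)", "****@$2");
        // Keep the first and last three digits of the ID number.
        String maskedId = idCard.replaceAll("(\\d{3})(\\d+)(\\d{3})", "$1*****$3");

        System.out.println(maskedEmail); // ****@example.com
        System.out.println(maskedId);    // 110*****234
    }
}
```

&lt;p&gt;Because this runs during ingestion, the unmasked values never reach SLS storage at all.&lt;/p&gt;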



&lt;p&gt;Stored log example&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7mucc63eotp6yds9kfi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7mucc63eotp6yds9kfi.png" alt=" " width="798" height="1104"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4.Solution Value: Not Just Data Transport&lt;br&gt;
Guaranteed reliability&lt;br&gt;
● File-level status tracking: the import status of each file is clearly traceable.&lt;/p&gt;

&lt;p&gt;● Automatic retry mechanism: temporary failures are retried automatically, without manual intervention.&lt;/p&gt;

&lt;p&gt;● Integrity validation: supports import confirmation at the file level.&lt;/p&gt;

&lt;p&gt;● Alert monitoring: key metrics such as import latency and failure rate are monitored in real time.&lt;/p&gt;

&lt;p&gt;Cost optimization&lt;br&gt;
● On-demand elasticity: automatically adjusts resources based on actual traffic to avoid latency growth.&lt;/p&gt;

&lt;p&gt;● Preprocessing: reduces invalid data storage and lowers storage costs.&lt;/p&gt;

&lt;p&gt;● Incremental import: imports only new and changed files, avoiding duplicate imports.&lt;/p&gt;

&lt;p&gt;Out-of-the-box experience&lt;br&gt;
● Visual configuration: no-code setup; the configuration can be completed entirely in the console.&lt;/p&gt;

&lt;p&gt;● Preset templates: out-of-the-box configuration templates for common logs such as CloudTrail and JsonArray.&lt;/p&gt;

&lt;p&gt;● Comprehensive documentation: detailed configuration instructions and best-practice guides.&lt;/p&gt;

&lt;p&gt;5.Best Practice Suggestions&lt;br&gt;
Scenario 1: AWS service log import (dual-mode traversal recommended)&lt;br&gt;
Typical logs: CloudTrail, VPC Flow Logs, and S3 access logs, in scenarios where file names increase sequentially.&lt;/p&gt;

&lt;p&gt;Recommended configuration:&lt;/p&gt;

&lt;p&gt;● Configure the new file check cycle to one minute.&lt;/p&gt;

&lt;p&gt;● Automatically enable incremental traversal to ensure real-time performance.&lt;/p&gt;

&lt;p&gt;● Automatically enable full traversal to ensure Integrity.&lt;/p&gt;

&lt;p&gt;● Configure the write processor to extract key fields.&lt;/p&gt;

&lt;p&gt;Effect: end-to-end latency of 2 to 3 minutes with 100% data integrity.&lt;/p&gt;

&lt;p&gt;Scenario 2: Real-time application log analysis (SQS solution recommended)&lt;br&gt;
Typical scenarios: real-time application logs, where file generation rates and file names follow no pattern but rapid alerting is required.&lt;/p&gt;

&lt;p&gt;Recommended configuration:&lt;/p&gt;

&lt;p&gt;● Configure S3 event Notifications to SQS.&lt;/p&gt;

&lt;p&gt;● Use SQS-driven import.&lt;/p&gt;

&lt;p&gt;Effect: end-to-end latency within 2 minutes, meeting real-time alerting requirements.&lt;/p&gt;

&lt;p&gt;6.Summary&lt;br&gt;
Importing data from S3 to SLS may look like a simple data transport task, but it is actually a systems engineering challenge that requires careful design. We solved the file discovery problem through dual-mode intelligent traversal, handled traffic bursts automatically through three elasticity mechanisms, and reduced customer costs through the write processor.&lt;/p&gt;

&lt;p&gt;This isn’t just a data import tool; it’s a complete cross-cloud log integration solution. Whether it is standard service logs or complex application logs, we can provide efficient, reliable, and economical import capabilities.&lt;/p&gt;

&lt;p&gt;Start now: Log on to the SLS console, select "Import Data &amp;gt; S3 - Data Import", complete the configuration in three steps, and start your cross-cloud log analysis journey.&lt;/p&gt;

</description>
      <category>data</category>
      <category>cloudnative</category>
    </item>
  </channel>
</rss>
