ObservabilityGuy

Is Your OpenClaw Truly Under Control?

This article details how to build a comprehensive observability and security audit system for OpenClaw AI Agents using Alibaba Cloud Simple Log Service (SLS).
Based on OpenClaw and Alibaba Cloud Simple Log Service (SLS), you can ingest logs and OpenTelemetry (OTEL) telemetry into SLS to build an AI Agent observability system. This system helps achieve a closed loop of behavior audit, O&M observability, real-time alerting, and security audit.

1.Why Must We Ask: "Is the Agent Really Under Control?"
"Under control" involves at least four aspects: who triggers the invocation, what the costs are, what operations are performed (especially high-risk tools), and whether the behavior is traceable and auditable. If you cannot answer these questions, you cannot claim that the Agent is running under control.

This article focuses on how to use Alibaba Cloud SLS to answer these questions. Session logs answer "what was done and how much it cost." Application logs answer "where the system is abnormal." OTEL metrics and traces answer "what the current status and latency are." Multiple data pipelines collaborate to provide a well-documented answer to "Is the Agent really running under control?"

1.1 Security Attack Surface of AI Agents
There is a fundamental difference between AI Agents and traditional backend services: an Agent's behavior is non-deterministic. For the same user input, the model may generate completely different tool call sequences. This means you cannot predict all behavior paths through code review, unlike when you audit a REST API.

If observability is not implemented, you cannot answer "who is invoking your model, how much it costs, or whether malicious instructions have been injected." Therefore, you cannot claim that the Agent is running under control. Specific attack surfaces include:


These risks cannot be addressed solely by runtime protection in code (such as the tool policies and loop detectors built into OpenClaw). Runtime protection is the "city wall," while observability is the "sentry post." Only by continuously observing what the Agent is doing, who is invoking it, and how much it costs can you discover what the city wall failed to block.

1.2 Three Pillars of Observability: Different Data Answers Different Questions
Traditional observability is built on the three pillars of Logs, Metrics, and Traces. For AI Agents, these three assume different observability functions. Understanding what questions each can answer is the foundation for building the entire system later:


1.3 Why Choose Alibaba Cloud SLS
Alibaba Cloud Simple Log Service (SLS) is naturally suitable for this scenario:

● Native OTLP support: LoongCollector natively supports the OTLP protocol. It integrates seamlessly with OpenClaw's diagnostics-otel plugin and works out of the box.

● Rich operators and flexible queries: Built-in processing and analysis operators make it convenient to parse, filter, and aggregate nested JSON fields (such as message.content and message.usage.cost) in session logs. You can perform tool call statistics, cost attribution, and sensitive pattern matching with a few lines of Structured Process Language (SPL).

● Security and compliance capabilities: It supports log access audit, RAM permission control, sensitive data masking, and encrypted storage to meet audit trail and compliance requirements. Alerts can be pushed to DingTalk, text messages, and email for timely response to security events.

● Comprehensive log analysis: It provides a one-stop pipeline of collection → index → query → dashboard → alerting. For small Agent deployments, log volume is low and the pay-as-you-go cost is low; when traffic grows, the service scales automatically.

2.Panoramic Architecture
2.1 Data Pipeline

2.2 Data Source Mapping Table


Next, we expand on the data sources one by one, from ingesting the data to viewing it in concrete scenarios.

3.Behavior Audit: Session Logs
Session logs are the core data source for AI Agent security audits. They record every round of conversation, every tool call, and every token consumed, completely reconstructing what the Agent actually did.

3.1 Data Format
Each session corresponds to a .jsonl file. Each line is a JSON object, and the entry type is distinguished by the type field. The following is a log sequence generated in a typical conversation (taking a user request to read a system file as an example):

User message
{
  "type": "message",
  "id": "70f4d0c5",
  "parentId": "b5690259",
  "message": {
    "role": "user",
    "content": [{ "type": "text", "text": " Help me read the /etc/passwd file " }]
  }
}

Assistant response (including tool calling)

{
  "type": "message",
  "id": "3878c644",
  "parentId": "70f4d0c5",
  "message": {
    "role": "assistant",
    "content": [
    { 
      "type": "toolCall", "id": "call_d46c7e2b...", "name": "read", 
      "arguments": { "path": "/etc/passwd" } 
    }],
    "provider": "anthropic",
    "model": "claude-4-sonnet",
    "usage": { "totalTokens": 2350 },
    "stopReason": "toolUse"
  }
}

Tool execution result

{
  "type": "message",
  "id": "81fd9eca",
  "parentId": "3878c644",
  "message": {
    "role": "toolResult",
    "toolCallId": "call_d46c7e2b...",
    "toolName": "read",
    "content": [{ "type": "text", "text": "root:x:0:0:root:/root:/bin/bash\n..." }],
    "isError": false
  }
}

Assistant final response (stopReason is stop)

{
  "type": "message",
  "id": "a025ab9e",
  "parentId": "81fd9eca",
  "message": {
    "role": "assistant",
    "content": [{ "type": "text", "text": "The content of the file `/etc/passwd` is as follows (excerpt): root:x:0:0:..." }],
    "usage": { "totalTokens": 12741, "cost": { "total": 0.0401 } },
    "stopReason": "stop"
  }
}

From an audit perspective, the above sample (a round of user → assistant toolCall → toolResult → assistant stop) can already answer several key questions: Who (user) asked the Agent to do what (the read tool reads /etc/passwd), which model the Agent used (claude-4-sonnet), how much it cost ($0.0401), and what the result was (successfully read the content of /etc/passwd).
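The chain above can also be reconstructed programmatically. Below is a minimal Python sketch (not part of OpenClaw; the field names are assumed from the samples above) that walks a session's JSONL entries and extracts the audit answers: tool calls, model, and total cost.

```python
import json

# Minimal sketch (assumed field names from the samples above): walk a
# session's JSONL entries and summarize tool calls, model, and cost.
def summarize_session(jsonl_lines):
    summary = {"tool_calls": [], "model": None, "cost_usd": 0.0}
    for line in jsonl_lines:
        entry = json.loads(line)
        if entry.get("type") != "message":
            continue
        msg = entry.get("message", {})
        if msg.get("role") != "assistant":
            continue
        # Keep the last model seen; not every entry repeats it.
        summary["model"] = msg.get("model", summary["model"])
        cost = (msg.get("usage") or {}).get("cost") or {}
        summary["cost_usd"] += cost.get("total", 0.0)
        for part in msg.get("content", []):
            if part.get("type") == "toolCall":
                summary["tool_calls"].append((part.get("name"), part.get("arguments")))
    return summary
```

Feeding the four sample entries above through this helper yields exactly the audit summary described: the read tool on /etc/passwd, model claude-4-sonnet, and a total of $0.0401.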

3.2 Connect to Simple Log Service (SLS)
LoongCollector collection configuration

SLS index configuration
Configure the following field indexes for the session-audit Logstore in the SLS console:

3.3 Audit Scenario: Sensitive Data Leakage Detection
After the Agent reads files or executes commands via tools, the returned content is recorded in the toolResult entry. If the returned content contains sensitive data such as API keys, AKs, private keys, or passwords, it means that this data has entered the Agent's context—it may be "remembered" by the model and leaked in subsequent conversations.

type: message and message.role : toolResult 
  | extend content = cast(json_extract(message, '$.content')  as array<json>) 
  | project content | unnest 
  | extend content_type = json_extract_scalar(content, '$.type'), content_text = json_extract_scalar(content, '$.text') 
  | where content_type = 'text' | project content_text 
  | where content_text like '%BEGIN RSA PRIVATE KEY%' or content_text like '%password%' or content_text like '%ACCESS_KEY%' or regexp_like(content_text, 'LTAI[a-zA-Z0-9]{12,20}')
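For local testing or offline scans, the same checks can be sketched in Python. This is a hypothetical helper, not an OpenClaw or SLS API; it mirrors the patterns in the SPL query above, including the LTAI AccessKey-style regex.

```python
import re

# Sketch of the sensitive-pattern checks from the SPL query above
# (same substrings, same LTAI AccessKey-style regex), applied locally.
SENSITIVE_PATTERNS = [
    re.compile(r"BEGIN RSA PRIVATE KEY"),
    re.compile(r"password"),
    re.compile(r"ACCESS_KEY"),
    re.compile(r"LTAI[a-zA-Z0-9]{12,20}"),
]

def find_sensitive(text):
    """Return the patterns that match a toolResult text."""
    return [p.pattern for p in SENSITIVE_PATTERNS if p.search(text)]
```

Note that, like the SQL `like` clauses above, the substring checks here are case-sensitive; broaden them as needed for your environment.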

3.4 Audit Scenario: Skill Call Audit
When a skill file (such as SKILL.md) is read by the read tool, the call is recorded in the content of the assistant message with type: "toolCall", name: "read", and arguments.path. Based on the path, you can compute which skills are called, how often, and when they were last called, for compliance and usage analysis.

type: message and message.role : assistant and message.stopReason : toolUse
  | extend content = cast(json_extract(message, '$.content')  as array<json>) 
  | project content, timestamp | unnest 
  | extend content_type = json_extract_scalar(content, '$.type'), content_name = json_extract_scalar(content, '$.name'), skill_path = json_extract_scalar(content, '$.arguments.path') 
  | project-away content 
  | where content_type = 'toolCall' and content_name = 'read' and skill_path like '%SKILL.md' 
  | stats cnt = count(*), latest_time = max(timestamp) by skill_path | sort cnt desc 
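The same aggregation can be sketched client-side. The helper below is hypothetical (it uses the field names from section 3.1) and counts read-tool calls whose path ends in SKILL.md:

```python
import json
from collections import Counter

# Client-side sketch of the skill-audit aggregation above: count read
# calls whose path ends with SKILL.md, grouped by path.
def skill_call_stats(jsonl_lines):
    counts = Counter()
    for line in jsonl_lines:
        entry = json.loads(line)
        msg = entry.get("message", {})
        if entry.get("type") != "message" or msg.get("role") != "assistant":
            continue
        for part in msg.get("content", []):
            if part.get("type") != "toolCall" or part.get("name") != "read":
                continue
            path = (part.get("arguments") or {}).get("path", "")
            if path.endswith("SKILL.md"):
                counts[path] += 1
    return counts.most_common()  # sorted by call count, like "sort cnt desc"
```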

3.5 Audit Scenario: High-Risk Tool Call Monitoring
OpenClaw's tool permission system (Tool Policy Pipeline + Owner-only encapsulation) already enforces control at runtime, but the observability layer should monitor independently of runtime protection: if a policy is misconfigured, the observability layer is the last chance to catch a violation. Important tools fall into two categories, by scenario.

Scenario 1: Tools prohibited by default in Gateway HTTP

When invoked via the gateway's POST /tools/invoke, the following tools are denied by default, either because they are too dangerous or because they cannot complete on a non-interactive HTTP interface:


● whatsapp_login: an interactive flow that requires scanning a QR code in the terminal; over HTTP it hangs without responding.
Scenario 2: Tools that require explicit approval from ACP

ACP (Automation Control Plane) is the automation entry point. The following tools are not allowed to pass silently; they must be explicitly approved by the user before execution:


Monitoring the invocation of these tools (and their equivalent names in the logs) in session logs can surface abnormal or unauthorized behavior. If one of them is still invoked successfully in the Gateway HTTP scenario, a configuration bypass may exist and needs troubleshooting.

type: message and message.role : assistant and message.stopReason : toolUse
  | extend content = cast(json_extract(message, '$.content')  as array<json>) 
  | project content, timestamp | unnest | extend content_type = json_extract_scalar(content, '$.type'), content_name = json_extract_scalar(content, '$.name'), content_arguments = json_extract(content, '$.arguments') 
  | project-away content 
  | where content_type = 'toolCall' and content_name in ('exec', 'write', 'edit', 'gateway', 'whatsapp_login', 'cron', 'sessions_spawn', 'sessions_send', 'spawn', 'shell', 'apply_patch')
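A client-side equivalent of this monitor can be sketched as follows. The helper is hypothetical; the denylist mirrors the tool names in the query above.

```python
import json

# Sketch of the high-risk monitor above: flag any toolCall whose name
# is in the same denylist the SPL query checks.
HIGH_RISK_TOOLS = {
    "exec", "write", "edit", "gateway", "whatsapp_login", "cron",
    "sessions_spawn", "sessions_send", "spawn", "shell", "apply_patch",
}

def high_risk_calls(jsonl_lines):
    hits = []
    for line in jsonl_lines:
        entry = json.loads(line)
        msg = entry.get("message", {})
        if entry.get("type") != "message" or msg.get("role") != "assistant":
            continue
        for part in msg.get("content", []):
            if part.get("type") == "toolCall" and part.get("name") in HIGH_RISK_TOOLS:
                hits.append((part.get("name"), part.get("arguments")))
    return hits
```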

3.6 Audit Scenario: Cost Attribution
Each assistant message carries usage (containing totalTokens, input, output, cacheRead, and cacheWrite) as well as provider and model. Aggregating totalTokens by provider and model answers "where the usage is spent." If the upstream provides usage.cost.total, you can aggregate it the same way for cost attribution.

type: message and message.role : assistant 
  | stats totalTokens= sum(cast("message.usage.totalTokens" as BIGINT)), inputTokens= sum(cast("message.usage.input" as BIGINT)), outputTokens= sum(cast("message.usage.output" as BIGINT)), cacheReadTokens= sum(cast("message.usage.cacheRead" as BIGINT)), cacheWriteTokens= sum(cast("message.usage.cacheWrite" as BIGINT)) by "message.provider", "message.model"
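The same grouping can be sketched in Python (a hypothetical helper; field names follow the samples in section 3.1):

```python
import json
from collections import defaultdict

# Sketch of the cost-attribution grouping above: sum totalTokens per
# (provider, model), mirroring the "stats ... by" clause.
def usage_by_model(jsonl_lines):
    totals = defaultdict(int)
    for line in jsonl_lines:
        entry = json.loads(line)
        msg = entry.get("message", {})
        if entry.get("type") == "message" and msg.get("role") == "assistant":
            usage = msg.get("usage") or {}
            totals[(msg.get("provider"), msg.get("model"))] += usage.get("totalTokens", 0)
    return dict(totals)
```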

4.O&M Observation: Application Logs
Application logs play a different role from session logs. Session logs record Agent actions (audit-oriented), while application logs record the system's running status (O&M-oriented): Did the Gateway start normally? Did a webhook report errors? Is the message queue backlogged?

4.1 Data Format
OpenClaw Gateway uses the tslog library to write structured JSONL logs:

{
  "0": "{\"subsystem\":\"gateway/channels/telegram\"}",
  "1": "webhook processed chatId=123456 duration=2340ms",
  "_meta": {
    "logLevelName": "INFO",
    "date": "2026-02-27T10:00:05.123Z",
    "name": "openclaw",
    "path": {
      "filePath": "src/telegram/webhook.ts",
      "fileLine": "142"
    }
  },
  "time": "2026-02-27T10:00:05.123Z"
}

Key fields:
● _meta.logLevelName: log level (TRACE / DEBUG / INFO / WARN / ERROR / FATAL)
● _meta.path: source file path and line number, for pinpointing the emitting code
● Numeric key "0": bindings in JSON format, usually containing subsystem (such as gateway/channels/telegram)
● Numeric key "1" and subsequent keys: log message text

Log files rotate daily (openclaw-YYYY-MM-DD.log), are cleaned up automatically every 24 hours, and are capped at 500 MB per file.

4.2 Ingest into Simple Log Service

For indexes, it is recommended to create field indexes on _meta.logLevelName, _meta.date, _meta.path.filePath, "0" (subsystem bindings), and "1" (message text).

4.3 Fault Dashboard by Subsystem
Aggregating application logs at abnormal levels (WARN, ERROR, FATAL) by subsystem makes it easy to see which errors concentrate in which component.

_meta.logLevelName: ERROR or _meta.logLevelName: WARN or _meta.logLevelName: FATAL
  | project subsystem = "0.subsystem", loglevel = "_meta.logLevelName" 
  | stats cnt = count(1) by loglevel, subsystem 
  | sort loglevel
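A minimal local equivalent of this aggregation (a hypothetical helper; it parses the tslog "0" bindings shown in section 4.1):

```python
import json
from collections import Counter

# Local sketch of the fault dashboard above: count WARN/ERROR/FATAL
# application log lines by (level, subsystem). The subsystem lives in
# the "0" key, which tslog serializes as a JSON string.
def fault_counts(log_lines):
    counts = Counter()
    for line in log_lines:
        rec = json.loads(line)
        level = rec.get("_meta", {}).get("logLevelName")
        if level not in ("WARN", "ERROR", "FATAL"):
            continue
        bindings = json.loads(rec.get("0", "{}"))
        counts[(level, bindings.get("subsystem"))] += 1
    return counts
```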

4.4 Typical Security Audit Scenarios and Log Samples
Scenario 1: WebSocket unauthorized connection (unauthorized)

Security audit value: When a WebSocket connection is denied during the authentication phase, a WARN log is generated, which helps surface unauthorized access caused by token errors, expiration, or forgery. During auditing, note these points: subsystem: gateway/ws indicates the log comes from the WS layer. In the message content, conn= is the connection ID, remote= is the client IP, client= is the client ID (such as openclaw-control-ui or webchat), and reason=token_mismatch indicates a token mismatch (expired, incorrect, or forged). If the same remote triggers many token_mismatch attempts within a short period, it may be a dictionary attack or a stolen-token attempt. If the client is a known legitimate client but still fails frequently, the issue is likely configuration or token rotation, which you should troubleshoot from the O&M side.

{
  "0": "{\"subsystem\":\"gateway/ws\"}",
  "1": "unauthorized conn=e32bf86b-c365-4669-a496-5a0be1b91694 remote=127.0.0.1 client=openclaw-control-ui webchat vdev reason=token_mismatch",
  "_meta": { "logLevelName": "WARN", "date": "2026-02-27T07:46:20.727Z" },
  "time": "2026-02-27T07:46:20.728Z"
}
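To operationalize the "same remote, many token_mismatch attempts" heuristic, the key=value fields can be parsed from the message text. A minimal sketch (hypothetical helper; the field names follow the sample above):

```python
import re
from collections import Counter

# Sketch: parse key=value fields from unauthorized-connection messages
# (format as in the sample above) and count token_mismatch attempts per
# remote IP, to surface possible dictionary attacks.
KV = re.compile(r"(\w+)=(\S+)")

def mismatch_counts(messages):
    counts = Counter()
    for msg in messages:
        fields = dict(KV.findall(msg))
        if fields.get("reason") == "token_mismatch":
            counts[fields.get("remote")] += 1
    return counts
```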

Scenario 2: HTTP tool calling denied or execution failed

Security audit value: Failure or alert logs for POST /tools/invoke can reveal who is attempting to execute prohibited high-risk tools or who is triggering permission or sandbox exceptions during execution. During auditing, note these points: subsystem: tools-invoke lets you quickly filter such events. The exception type in the message content (such as EACCES, ENOENT, or a path) distinguishes "unauthorized access to sensitive paths" from "configuration or path errors". For example, "open '/etc/shadow'" in the sample below clearly points to an attempt to read a sensitive file; combine this with session logs to locate the caller.

{
  "0": "{\"subsystem\":\"tools-invoke\"}",
  "1": "tool execution failed: Error: EACCES: permission denied, open '/etc/shadow'",
  "_meta": { "logLevelName": "WARN", "date": "2026-02-27T10:00:07.000Z" },
  "time": "2026-02-27T10:00:07.000Z"
}

Scenario 3: Connection or Request processing failed

Security audit value: Connection resets and parsing errors can expose abnormal client behavior, malformed requests, or man-in-the-middle interference. During auditing, follow these points: subsystem: gateway indicates that the log comes from the gateway core (WS or request processing). The message content distinguishes between two categories. "request handler failed: Connection reset by peer" is mostly caused by peer disconnection or network interruption. You can check whether the errors occur in bursts based on time or conn (suspected scan or DoS attacks). "parse/handle error: Invalid JSON" indicates that the request body is invalid, which may be a maliciously constructed malformed package or a compatibility issue. When a large number of such errors appear from the same source within a short period, you should prioritize troubleshooting attacks or abnormal clients.

{
  "0": "{\"subsystem\":\"gateway\"}",
  "1": "request handler failed: Connection reset by peer",
  "_meta": { "logLevelName": "ERROR", "date": "2026-02-27T10:00:08.000Z" },
  "time": "2026-02-27T10:00:08.000Z"
}

{
  "0": "{\"subsystem\":\"gateway\"}",
  "1": "parse/handle error: Invalid JSON",
  "_meta": { "logLevelName": "ERROR", "date": "2026-02-27T10:00:08.100Z" },
  "time": "2026-02-27T10:00:08.100Z"
}

Scenario 4: Security audit category (device access upgrade, etc.)

Security audit value: Device pairing and permission upgrades leave an audit trail of "who, from what role or permission, upgraded to what role or permission, from which IP, and what authentication type". During auditing, focus on the structured fields in the message content: reason=role-upgrade indicates that the event is triggered by role promotion. device= indicates the device ID. ip= indicates the client IP, which can be used for comparison with known management IPs. roleFrom=[] roleTo=owner indicates an upgrade from no role to owner, which is a highly sensitive operation. auth=token indicates the Authentication Type used. If the same IP or device upgrades frequently during non-working hours, or if the number of entries with roleTo as owner increases abnormally, you should prioritize troubleshooting whether unauthorized access or account compromise has occurred.

{
  "0": "{\"subsystem\":\"gateway\"}",

  "1": "security audit: device access upgrade requested reason=role-upgrade device=abc-123 ip=192.168.1.1 auth=token roleFrom=[ ] roleTo=owner scopesFrom=[ ] scopesTo=[...] client=control conn=conn-1",

  "_meta": { "logLevelName": "WARN", "date": "2026-02-27T10:00:09.000Z" },
  "time": "2026-02-27T10:00:09.000Z"
}

Scenario 5: FATAL and core exceptions

Security audit value: FATAL indicates that core features are unavailable, which may be caused by tampered configurations, dependency failures, or critical runtime errors. You need to immediately troubleshoot whether the issue is related to intrusion or misconfiguration. During auditing: Filter _meta.logLevelName = 'FATAL' in the error dashboard. Combine subsystem and the message content of "1" to locate the specific component and the cause of the error. If FATAL is accompanied by keywords such as "bind", "config", or "listen", you need to prioritize troubleshooting the exposed surface and configuration consistency. It is recommended that you configure real-time alerting (such as every minute, cnt > 0, push to DingTalk or text messages) to ensure an immediate response.

5.Real-time monitoring and alerting: OTEL telemetry
Session logs and application logs mainly record discrete events and audit trails, which suits conditional retrieval and after-the-fact attribution. To track aggregate metrics, trends, and request traces (such as cost and usage trends, session health, and the duration and dependencies of a single request), you need OpenTelemetry (OTEL) Metrics (counters, histograms, gauges) and Traces (distributed traces, latency, and invocation relationships). Together with logs, they form the complete "logs + metrics + traces" observability capability.

5.1 Access Simple Log Service (SLS)
OpenClaw has a built-in diagnostics-otel plugin (version 26.2.19 or later). It supports exporting Metrics, Traces, and Logs via the OpenTelemetry Protocol (OTLP) over HTTP (Protobuf).

Enable the plugin
Run openclaw plugins enable diagnostics-otel to enable the plugin, then check its status with openclaw plugins list; the expected status is loaded.

Configure ~/.openclaw/openclaw.json

{
  "plugins": {
    "allow": ["diagnostics-otel"],
    "entries": {
      "diagnostics-otel": { "enabled": true }
    }
  },
  "diagnostics": {
    "enabled": true,
    "otel": {
      "enabled": true,
      "endpoint": "https://127.0.0.1:4318",
      "protocol": "http/protobuf",
      "serviceName": "openclaw-gateway",
      "traces": true,
      "metrics": true,
      "logs": true,
      "sampleRate": 1,
      "flushIntervalMs": 60000
    }
  }
}

Create collection configuration
In the SLS console, create the logstores otlp-logs and otlp-traces and the metricstore otlp-metrics, then create the corresponding collection configuration:

{
    "aggregators": [
        {
            "detail": {},
            "type": "aggregator_opentelemetry"
        }
    ],
    "inputs": [
        {
            "detail": {
                "Protocals": {
                    "HTTP": {
                        "Endpoint": "127.0.0.1:4318",
                        "ReadTimeoutSec": 10,
                        "ShutdownTimeoutSec": 5,
                        "MaxRecvMsgSizeMiB": 64
                    },
                    "GRPC": {
                        "MaxConcurrentStreams": 100,
                        "Endpoint": "127.0.0.1:4317",
                        "ReadBufferSize": 1024,
                        "MaxRecvMsgSizeMiB": 64,
                        "WriteBufferSize": 1024
                    }
                }
            },
            "type": "service_otlp"
        }
    ]
}

5.2 What data is exported
To answer observability questions such as "usage and cost," "entry-point stability," and "queue and session health," OpenClaw exports Metrics and Traces via OTEL. The following describes them by requirement, with a details table (metric name, type, and function) for each.

Cost and usage metrics
These metrics directly relate to Large Language Model (LLM) invocation costs and are the core of spend control. By monitoring token consumption, estimated fees, run duration, and context usage, you can track the cost of each model invocation and discover waste caused by misconfiguration or inefficient use.


openclaw.cost.usd produces data only when the upstream model.usage event provides costUsd.

Webhook processing metrics
Webhooks are an important entry point for OpenClaw to interact with external systems. By monitoring the number of received requests, error counts, and processing duration, you can detect abnormal external invocations in time and keep integrations stable.


Message queue metrics
The message queue is the transit hub for job processing. By tracking enqueue/dequeue counts, queue depth, and wait time, you can tell whether the system is congested or jobs are backlogged, which helps you adjust resources or troubleshoot bottlenecks.


Session management metrics
Changes in session status and the number of stuck sessions reflect interaction health. Monitoring metrics such as stuck sessions and retries lets you quickly discover conversations trapped in infinite loops or abnormal states, improving troubleshooting efficiency.


Trace Span


5.3 Data value analysis
Scenario: Usage and cost distribution
Answer: Which models and providers consume most of the usage and spend? Is the recent token consumption trend normal, or is there a sudden surge? How does cumulative usage rank by model or provider? When the token growth rate is abnormal, you can dig deeper with session logs.

# Token consumption growth rate (alerts can be set, such as exceeding N tokens/min)
sum(rate(openclaw_tokens[10m]))

# Token consumption trend (by model)
sum(rate(openclaw_tokens[5m])) by (openclaw_model)

# Cumulative Tokens (by Provider)
sum(openclaw_tokens) by (openclaw_provider)

Scenario: Stuck sessions and over-long executions
Answer: Are there currently stuck sessions or sessions making no progress? How frequent are they, and when do they occur? Does single-run Agent execution duration (P95/P99) exceed expectations, or are there long tails?

# Stuck sessions (Alert: > 0)
sum(rate(openclaw_session_stuck[5m]))

# Execution duration P95 (Alert: such as > 5 minutes)
histogram_quantile(0.95, sum(rate(openclaw_run_duration_ms_bucket[5m])) by (le))

Scenario: Webhook Error Rate and processing latency
Answer: What are the request volume and error counts of webhooks per channel, and is the error rate within an acceptable range? Have the quantiles (P95/P99) of per-webhook processing duration and Agent execution duration deteriorated? How does the latency distribution differ by channel or model? When the error rate or latency is abnormal, combine with application logs to find the specific failures via the webhook subsystem.

# Webhook Error Rate (Alert: such as > 5%)
sum(rate(openclaw_webhook_error[5m])) / sum(rate(openclaw_webhook_received[5m]))

# Execution duration P99 (by model)
histogram_quantile(0.99, sum(rate(openclaw_run_duration_ms_bucket[5m])) by (le, openclaw_model))

# Webhook processing duration P95 (by channel)
histogram_quantile(0.95, sum(rate(openclaw_webhook_duration_ms_bucket[5m])) by (le, openclaw_channel))

Scenario: Queue backlog and wait time
Answer: Are the depth and enqueue/dequeue rates of each queue lane healthy? Is job wait time (P95/P99) lengthening, or is a backlog trending up? Which lanes are most prone to congestion? This helps detect bottlenecks and adjust resources before user experience degrades.

# Queue depth P95 (by lane)
histogram_quantile(0.95, sum(rate(openclaw_queue_depth_bucket[5m])) by (le, openclaw_lane))

# Queue wait time P95 (by lane)
histogram_quantile(0.95, sum(rate(openclaw_queue_wait_ms_bucket[5m])) by (le, openclaw_lane))

6.Multi-source interaction: Composite troubleshooting flow
The previous sections demonstrated the independent value of each data pipeline. However, what truly embodies "keeping the Agent under control" is multiple observability pipelines working together.

The key to this flow lies in each Data Pipeline answering questions at different layers, and none is dispensable:

● Only OTEL without session logs: You know the cost is soaring, but not who did it or what was done.
● Only session logs without OTEL: You can audit behavior but cannot see overall status at a glance.
● Only application logs: You can see system errors but not the Agent's business behavior.

7.Summary
To answer "Is your OpenClaw truly running under control?", you need to answer four questions simultaneously: who is triggering the invocation, how much it costs, what operations were performed (especially high-risk tools), and whether the behavior is traceable and auditable. Relying solely on runtime protection (tool policies, loop detection, and so on) is insufficient to claim control. You must establish a continuous observability system and use data to answer the above questions.

Based on Alibaba Cloud Simple Log Service (SLS), this article unifies OpenClaw's three types of observability data (session audit logs, application logs, and OTEL metrics and traces) into SLS to form a complete "logs + metrics + traces" capability. Session logs answer "what did the Agent do and how much did it cost." Application logs answer "where is the system abnormal." OTEL answers "current status and duration." By using LoongCollector file collection and OTLP direct ingestion, the system achieves a one-stop closed loop of collection, indexing, query, dashboards, and alerting. It also leverages the audit, permission, and masking capabilities of SLS to meet compliance requirements.

In practice, the three data pipelines should be used collaboratively. OTEL alerting detects anomalies. Application logs are used to narrow down the scope and locate the subsystem and session. Then, Session logs are used to reconstruct the complete behavior chain and take response measures. Only through the interaction of the three sources can a verifiable audit and O&M closed loop be formed—from "there is an anomaly" to "where the problem is" and then to "what exactly the Agent did"—truly allowing the Agent to run under control.
