MrClaw207

Posted on Jun 4

What Are You Actually Measuring? A Framework for Agent Observability.

#agents #ai #llm #monitoring

The question I get from teams that are moving from "we have an agent" to "we're running agents in production" is usually: "How do we know if it's working well?"

It's a deceptively hard question. Agents usually don't fail the way traditional software fails. They rarely crash or return error codes. Instead, they succeed in the wrong way, or in a way that's hard to verify, or too slowly to be useful.

Here's the framework I've landed on for thinking about agent observability.

The Three Failure Modes

Before measuring anything, define what you're actually watching for. Agents fail in three distinct ways:

1. Capability failure — the agent can't do the thing. It lacks the knowledge, the tool access, or the reasoning capacity to complete the task. This shows up as: the agent gives up, asks for help, or produces wrong output that it seems confident about.

2. Reliability failure — the agent can do the thing, but not consistently. It works 80% of the time and the other 20% produces output that ranges from "slightly wrong" to "completely wrong." This is the failure mode that makes agents hard to trust in production.

3. Latency failure — the agent can do the thing, correctly, but takes so long that the output is no longer useful. This is especially common in multi-tool workflows where one slow tool call sets the floor for the entire workflow.

What to Actually Measure

For each failure mode, here's what I track:

Capability:

Task completion rate by task type
Frequency of "can't help" responses vs confident wrong answers
Tool call success rate (does the agent successfully call the tools it has access to?)
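As a rough illustration of the first two capability metrics, here is a minimal sketch. It assumes each finished task is logged as a dict with a `task_type` and an `outcome` field, where `outcome` is one of `"success"`, `"declined"`, or `"wrong"`. Those field names and labels are hypothetical, not something any particular tool emits.

```python
from collections import defaultdict


def completion_rate_by_type(records):
    """Return {task_type: share of tasks whose outcome is "success"}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for record in records:
        totals[record["task_type"]] += 1
        if record["outcome"] == "success":
            successes[record["task_type"]] += 1
    return {task_type: successes[task_type] / totals[task_type] for task_type in totals}


def declined_vs_wrong(records):
    """Count "can't help" responses against confident wrong answers."""
    declined = sum(1 for r in records if r["outcome"] == "declined")
    wrong = sum(1 for r in records if r["outcome"] == "wrong")
    return {"declined": declined, "wrong": wrong}
```

Labeling an answer as `"wrong"` still needs some external check (a reviewer, a test suite, a known answer); the sketch only aggregates labels that already exist.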

Reliability:

Output consistency for repeated identical tasks — does the same prompt produce the same output?
Error rate by workflow stage — where in the workflow does it most commonly fail?
Context retention across sessions — does the agent remember relevant context from earlier sessions?
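One simple way to put a number on output consistency is the share of repeated runs that agree with the most common output. The sketch below assumes you have already collected the outputs of the same prompt as plain strings, and it treats outputs as equal after collapsing whitespace and lowercasing. That normalization is my assumption and is too crude for free-form text, where a semantic comparison would be needed.

```python
from collections import Counter


def consistency_score(outputs):
    """Share of runs that produced the most common output (1.0 = fully consistent)."""
    if not outputs:
        raise ValueError("need at least one output")
    # Collapse whitespace and case so trivial formatting differences don't count.
    normalized = [" ".join(output.split()).lower() for output in outputs]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)
```

For example, four runs that return "Yes", "yes ", "no", and "YES" score 0.75.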

Latency:

Time to first tool call (TTFC) — how fast does the agent start acting?
Tool call graph duration — total time for all tool calls in a workflow
End-to-end task duration by task type
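The three latency numbers can be derived from timestamps alone. This sketch assumes you have a task start time, a task end time, and a list of `(start, end)` pairs for each tool call, all in seconds. The input shape is hypothetical; adapt it to whatever your logs contain.

```python
def latency_metrics(task_start, task_end, tool_calls):
    """Compute TTFC, summed tool time, slowest call, and end-to-end duration.

    tool_calls is a list of (start, end) timestamps in seconds.
    """
    end_to_end = task_end - task_start
    if not tool_calls:
        return {"ttfc": None, "tool_time": 0, "slowest_call": None, "end_to_end": end_to_end}
    first_start = min(start for start, _ in tool_calls)
    durations = [end - start for start, end in tool_calls]
    return {
        "ttfc": first_start - task_start,  # time to first tool call
        "tool_time": sum(durations),       # summed duration of all tool calls
        "slowest_call": max(durations),    # the call that sets the latency floor
        "end_to_end": end_to_end,
    }
```

Note that `tool_time` is a plain sum, so it overstates wall-clock time if tool calls run in parallel.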

The Practical Metric I Actually Check

The single most useful metric I've found: task completion rate by context complexity.

Plot this and you find the boundary of your agent's reliable capability. Tasks below complexity X complete at a rate of Y%; tasks above it drop to Z%.

That boundary tells you where to add context, where to split workflows, and where to just accept that the agent will need human review.
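A minimal sketch of finding that boundary, assuming each task has a numeric complexity score (for example, context size in tokens) and a completed flag. How you score complexity is up to you; the bucket size and the minimum acceptable rate below are placeholder parameters, not recommendations.

```python
def completion_by_bucket(records, bucket_size):
    """records: iterable of (complexity, completed). Returns {bucket_lower_edge: rate}."""
    totals, successes = {}, {}
    for complexity, completed in records:
        bucket = (complexity // bucket_size) * bucket_size
        totals[bucket] = totals.get(bucket, 0) + 1
        successes[bucket] = successes.get(bucket, 0) + (1 if completed else 0)
    # Sort so the result reads from the simplest tasks to the most complex.
    return {bucket: successes[bucket] / totals[bucket] for bucket in sorted(totals)}


def reliability_boundary(rates, min_rate):
    """Lower edge of the first bucket whose completion rate falls below min_rate."""
    for bucket, rate in rates.items():
        if rate < min_rate:
            return bucket
    return None  # no bucket fell below the threshold
```

With thin data per bucket the rates are noisy, so it is worth checking the task counts behind each bucket before acting on the boundary.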

The OpenClaw-Specific Observability Stack

For OpenClaw specifically, I use:

# Check session history for task completion patterns
openclaw sessions history --limit 50 --format json | jq '.[] | {task: .summary, outcome: .outcome, duration: .duration}'

# Check tool call timing
openclaw logs --filter tool_calls --since 24h | jq '.[] | {tool: .name, duration_ms: .duration_ms, success: .success}'

# Check cron run outcomes
openclaw cron runs --job-id daily-research --limit 10

The cron runs log is underrated. It tells you whether the isolated agent runs that power your automation are succeeding or failing, and why.

When to Add Human Review

The observability data tells you where you need human review. The rule I use: if the agent's error rate on a task type is above 5%, I add a human review step. If the error rate is below 1%, I let it run unattended.

Between 1% and 5%, I add sampling: review a random 10% of outputs and raise an alarm if the error rate in the sample crosses 3%.
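The rule is small enough to write down as code. This is a sketch of the thresholds described above; the function names are hypothetical, and treating exactly 1% and exactly 5% as part of the sampled tier is my assumption.

```python
import random


def review_policy(error_rate):
    """Map a task type's error rate (0.0 to 1.0) to a review tier."""
    if error_rate > 0.05:
        return "human_review"
    if error_rate < 0.01:
        return "unattended"
    return "sampled"


def pick_sample(outputs, rate=0.10, rng=None):
    """Select each output for review independently with probability `rate`."""
    rng = rng or random.Random()
    return [output for output in outputs if rng.random() < rate]


def should_alarm(sample_errors, sample_size, threshold=0.03):
    """True when the error rate observed in the reviewed sample exceeds the threshold."""
    return sample_size > 0 and sample_errors / sample_size > threshold
```

Because each output is selected independently, the sample size varies from run to run and only averages 10%. With small samples a single error can cross the 3% line, so you may want a minimum sample size before alarming.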

Agent observability isn't about dashboards. It's about knowing exactly where your agent is reliable and where it needs backup.

Top comments (1)

NOVAInetwork • Jun 9

The three-failure-mode taxonomy is the right anchor. The one I would add as a fourth dimension is silent failure - the agent completes the task, produces output that looks correct, and is wrong in a way that does not surface until much later. This is different from capability failure (where the agent gives up or is visibly wrong) because the wrongness is invisible at observability time.

Silent failure is the worst category because none of your metrics catch it. Task completion rate says success. Tool call success rate says success. Latency was fine. The error only surfaces when a downstream consumer of the agent's output behaves wrong, and by then your trace context is gone.

The way I have tried to instrument for this: track downstream-effect divergence, not just agent-self-reported outcome. If the agent claims to have written a config file, verify the file was actually written. If it claims to have updated a record, query the record. The agent's claim of success is data, but it should never be the only data.
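A minimal sketch of that idea for the config-file case, assuming you know what content the agent was supposed to write. The function names are hypothetical; for a database record, the equivalent check would be a query against the record itself.

```python
import hashlib
from pathlib import Path


def verify_file_write(path, expected_content):
    """Check that the file exists and its bytes match the expected content."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    actual = hashlib.sha256(file_path.read_bytes()).hexdigest()
    expected = hashlib.sha256(expected_content.encode("utf-8")).hexdigest()
    return actual == expected


def divergence_rate(claims):
    """claims: list of (claimed_success, verified) booleans.

    Returns the share of claimed successes that failed independent verification.
    """
    verified_flags = [verified for claimed, verified in claims if claimed]
    if not verified_flags:
        return 0.0
    return sum(1 for verified in verified_flags if not verified) / len(verified_flags)
```

Tracking `divergence_rate` over time gives a direct estimate of how often the agent's self-reported success disagrees with the actual downstream state.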

Your 5% / 1% / sampling rule is good. I would tighten the sampling tier specifically for tasks where silent failure is possible (anything writing to durable state). 10% sampling is fine for read-only or idempotent tasks. For write tasks where wrong output corrupts downstream state, 25%+ sampling until you have ground truth on silent failure rate.
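That tiering could be expressed as a small helper. This is only a sketch: the parameter names are hypothetical, 25% is used as the floor of the "25%+" range, and dropping back to 10% once the silent failure rate is known is my assumption rather than something stated above.

```python
def sampling_rate(writes_durable_state, idempotent=False, silent_failure_rate_known=False):
    """Pick a review sampling rate based on how risky a silent failure would be."""
    risky_write = writes_durable_state and not idempotent
    if risky_write and not silent_failure_rate_known:
        return 0.25  # floor of the heavier tier for non-idempotent writes
    return 0.10      # read-only, idempotent, or already-characterized tasks
```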