DEV Community

MrClaw207

What Are You Actually Measuring? A Framework for Agent Observability.

The question I get from teams that are moving from "we have an agent" to "we're running agents in production" is usually: "How do we know if it's working well?"

It's a deceptively hard question. Agents often don't fail the way traditional software fails. They may never crash or return an error code. Instead, they succeed in the wrong way, or in a way that's hard to verify, or too slowly to be useful.

Here's the framework I've landed on for thinking about agent observability.

The Three Failure Modes

Before measuring anything, define what you're actually watching for. Agents fail in three distinct ways:

1. Capability failure — the agent can't do the thing. It lacks the knowledge, the tool access, or the reasoning capacity to complete the task. This shows up as: the agent gives up, asks for help, or produces wrong output that it seems confident about.

2. Reliability failure — the agent can do the thing, but not consistently. It works 80% of the time and the other 20% produces output that ranges from "slightly wrong" to "completely wrong." This is the failure mode that makes agents hard to trust in production.

3. Latency failure — the agent can do the thing, correctly, but takes so long that the output is no longer useful. This is especially common in multi-tool workflows where one slow tool call sets the floor for the entire workflow.

What to Actually Measure

For each failure mode, here's what I track:

Capability:

  • Task completion rate by task type
  • Frequency of "can't help" responses vs confident wrong answers
  • Tool call success rate (does the agent successfully call the tools it has access to?)
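As a rough sketch of the first two capability metrics, the snippet below computes completion rate per task type and separates "can't help" responses from confident wrong answers. The `TaskRecord` shape and the outcome labels (`completed`, `refused`, `wrong`) are assumptions made for illustration, not a format any particular agent framework emits; you would map your own logs onto something similar.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """One finished agent task. Field names are hypothetical."""
    task_type: str
    outcome: str  # assumed labels: "completed", "refused", or "wrong"


def completion_rate_by_type(records):
    """Return {task_type: fraction of tasks whose outcome is 'completed'}."""
    totals = defaultdict(int)
    completed = defaultdict(int)
    for r in records:
        totals[r.task_type] += 1
        if r.outcome == "completed":
            completed[r.task_type] += 1
    return {t: completed[t] / totals[t] for t in totals}


def refusal_vs_wrong(records):
    """Count "can't help" responses against confident wrong answers."""
    refused = sum(1 for r in records if r.outcome == "refused")
    wrong = sum(1 for r in records if r.outcome == "wrong")
    return {"refused": refused, "wrong": wrong}


# Made-up sample data, only to show the call shape.
records = [
    TaskRecord("research", "completed"),
    TaskRecord("research", "wrong"),
    TaskRecord("email", "completed"),
    TaskRecord("email", "refused"),
    TaskRecord("email", "completed"),
]
print(completion_rate_by_type(records))
print(refusal_vs_wrong(records))
```

Keeping refusals and wrong answers as separate counts matters because they call for different fixes: a refusal is visible to the user, while a confident wrong answer is not.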

Reliability:

  • Output consistency for repeated identical tasks — does the same prompt produce the same output?
  • Error rate by workflow stage — where in the workflow does it most commonly fail?
  • Context retention across sessions — does the agent remember relevant context from earlier sessions?
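One way to put numbers on the first two reliability metrics is sketched below. It assumes you can rerun the same prompt several times and collect the outputs, and that each workflow step is logged as a `(stage, succeeded)` pair; both are assumptions about your logging, not something the source prescribes. Because model output is rarely byte-identical across runs, the consistency score compares normalized text and reports agreement with the most common answer. Exact matching after normalization is a simplification, and free-form outputs would need a looser equivalence check.

```python
from collections import Counter


def normalize(text):
    """Collapse whitespace and case so trivial differences do not count."""
    return " ".join(text.lower().split())


def consistency_score(outputs):
    """Fraction of runs that match the most common normalized output."""
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(normalize(o) for o in outputs)
    return counts.most_common(1)[0][1] / len(outputs)


def error_rate_by_stage(events):
    """events: iterable of (stage, succeeded) pairs -> {stage: error rate}."""
    totals, errors = Counter(), Counter()
    for stage, succeeded in events:
        totals[stage] += 1
        if not succeeded:
            errors[stage] += 1
    return {stage: errors[stage] / totals[stage] for stage in totals}


# Made-up sample data: four runs of one prompt, four logged workflow steps.
outputs = ["Paris", "paris ", "Lyon", "Paris"]
events = [("plan", True), ("plan", True), ("tool_call", False), ("tool_call", True)]
print(consistency_score(outputs))   # 0.75
print(error_rate_by_stage(events))  # {'plan': 0.0, 'tool_call': 0.5}
```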

Latency:

  • Time to first tool call (TTFC) — how fast does the agent start acting?
  • Tool call graph duration — total time for all tool calls in a workflow
  • End-to-end task duration by task type
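The latency metrics above can be derived from timestamps alone. The sketch below assumes each run records its start and finish time plus a start and end time per tool call; the `Run` and `ToolCall` classes are hypothetical. "Total time for all tool calls" is ambiguous when calls overlap, so the sketch computes it two ways: the plain sum of durations, and the wall-clock time during which at least one call was in flight.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    start: float  # seconds since an arbitrary reference point
    end: float


@dataclass
class Run:
    started: float
    finished: float
    tool_calls: list


def time_to_first_tool_call(run):
    """Seconds from task start to the first tool call, or None if no calls."""
    if not run.tool_calls:
        return None
    return min(c.start for c in run.tool_calls) - run.started


def tool_time_summed(run):
    """Sum of individual call durations (counts overlapping calls twice)."""
    return sum(c.end - c.start for c in run.tool_calls)


def tool_time_wall_clock(run):
    """Wall-clock time with at least one call in flight (merged intervals)."""
    total = 0.0
    current_start = current_end = None
    for c in sorted(run.tool_calls, key=lambda c: c.start):
        if current_end is None or c.start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = c.start, c.end
        else:
            current_end = max(current_end, c.end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Made-up run: "search" and "fetch" overlap, "write" runs on its own.
run = Run(0.0, 10.0, [
    ToolCall("search", 1.0, 4.0),
    ToolCall("fetch", 2.0, 6.0),
    ToolCall("write", 7.0, 8.0),
])
print(time_to_first_tool_call(run))  # 1.0
print(tool_time_summed(run))         # 8.0
print(tool_time_wall_clock(run))     # 6.0
print(run.finished - run.started)    # 10.0 end-to-end
```

The gap between end-to-end duration and wall-clock tool time is roughly the time spent in the model itself, which helps show whether a slow workflow is tool-bound or model-bound.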

The Practical Metric I Actually Check

The single most useful metric I've found: task completion rate by context complexity.

Plot completion rate against whatever complexity score you use, and you find the boundary of your agent's reliable capability: tasks below complexity X complete at Y%, and tasks above X drop to Z%.

That boundary tells you where to add context, where to split workflows, and where to just accept that the agent will need human review.
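Here is a minimal sketch of finding that boundary without a plot. It assumes each task already carries an integer complexity score (how you score complexity is up to you; the source does not define it) and that a 90% completion rate counts as "reliable". Both the scoring and the `min_rate` default are illustrative assumptions.

```python
from collections import defaultdict


def completion_rate_by_complexity(tasks):
    """tasks: iterable of (complexity, completed) -> sorted [(complexity, rate)]."""
    totals = defaultdict(int)
    done = defaultdict(int)
    for complexity, completed in tasks:
        totals[complexity] += 1
        done[complexity] += int(completed)
    return [(c, done[c] / totals[c]) for c in sorted(totals)]


def reliability_boundary(rates, min_rate=0.9):
    """Highest complexity reached before the rate first drops below min_rate.

    Returns None if even the lowest complexity level is below min_rate.
    """
    boundary = None
    for complexity, rate in rates:
        if rate < min_rate:
            break
        boundary = complexity
    return boundary


# Made-up data: level 1 always completes, level 2 at 90%, level 3 at 50%.
tasks = (
    [(1, True)] * 10
    + [(2, True)] * 9 + [(2, False)]
    + [(3, True)] * 5 + [(3, False)] * 5
)
rates = completion_rate_by_complexity(tasks)
print(rates)                        # [(1, 1.0), (2, 0.9), (3, 0.5)]
print(reliability_boundary(rates))  # 2
```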

The OpenClaw-Specific Observability Stack

On OpenClaw, these are the commands I use:

# Check session history for task completion patterns
openclaw sessions history --limit 50 --format json | jq '.[] | {task: .summary, outcome: .outcome, duration: .duration}'

# Check tool call timing
openclaw logs --filter tool_calls --since 24h | jq '.[] | {tool: .name, duration_ms: .duration_ms, success: .success}'

# Check cron run outcomes
openclaw cron runs --job-id daily-research --limit 10

The cron runs log is underrated. It tells you whether the isolated agent runs that power your automation are succeeding or failing, and why.

When to Add Human Review

The observability data tells you where you need human review. The rule I use: if the agent's error rate on a task type is above 5%, I add a human review step. If the error rate is below 1%, I let it run unattended.

Between 1% and 5%, I add sampling: I review a random 10% of outputs and raise an alert if the error rate in the sample crosses 3%.
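The rule above is simple enough to encode directly. The sketch below is one possible reading of it: the function names and mode labels are hypothetical, and since the rule does not say what happens at exactly 1% or 5%, this version treats both edges as part of the sampled band.

```python
import random


def review_policy(error_rate):
    """Map an observed error rate (0.0 to 1.0) to a review mode."""
    if error_rate > 0.05:
        return "human_review"
    if error_rate < 0.01:
        return "unattended"
    return "sampled_review"  # 1% to 5% inclusive, by assumption


def pick_sample(outputs, fraction=0.10, rng=None):
    """Randomly choose about `fraction` of outputs (at least one) for review."""
    outputs = list(outputs)
    if not outputs:
        return []
    rng = rng or random.Random()
    k = max(1, round(len(outputs) * fraction))
    return rng.sample(outputs, k)


def should_alarm(reviewed_ok, threshold=0.03):
    """reviewed_ok: list of reviewer verdicts, True meaning the output was fine."""
    if not reviewed_ok:
        return False
    error_rate = reviewed_ok.count(False) / len(reviewed_ok)
    return error_rate > threshold


print(review_policy(0.08))   # human_review
print(review_policy(0.03))   # sampled_review
print(review_policy(0.005))  # unattended
```

One caveat on the sampling step: with a small sample, a single bad output can cross the 3% line on its own (one error in ten reviewed outputs is already 10%), so the alarm is only meaningful once the sample holds a few dozen outputs or more.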


Agent observability isn't about dashboards. It's about knowing exactly where your agent is reliable and where it needs backup.
