The Observability Gap
For decades, the observability stack has assumed determinism. Prometheus scrapes CPU utilization. Jaeger traces request latency. PagerDuty fires when error rates exceed thresholds. The mental model is mechanical: if the database is slow, queries are slow; if the server crashes, requests fail. The “Three Pillars” (metrics, logs, and traces) capture the behavior of infrastructure.
This model works because deterministic systems have a knowable correct state. A 200 OK is correct. A 500 is not. The boundaries are crisp, and deviations are bugs.
Why AI Agents Break This Model
AI agents introduce properties that deterministic observability cannot capture:
- Non-determinism: The same prompt produces different outputs on successive calls. Traditional monitoring treats variance as noise; in agent systems, variance is the signal.
- Semantic correctness: A 200 OK with a hallucinated answer is worse than a 500 error. HTTP status codes carry zero information about output quality. An agent that confidently produces wrong code or wrong medical advice is more dangerous than one that crashes.
- Progressive degradation: As context windows fill, LLM output quality degrades gradually: responses get shorter, less accurate, and more repetitive. This is Context Rot. There is no error. There is no crash. There is only a slow rot that traditional monitoring cannot see.
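Context Rot can be made measurable by comparing quality early in a session against quality late in it. The sketch below is a minimal illustration, not the article's implementation; it assumes each turn has already been assigned a quality score in [0, 1] by some upstream evaluator.

```python
from statistics import mean

def context_rot_index(turn_scores: list[float], window: int = 5) -> float:
    """Compare mean quality of the last `window` turns to the first `window`.

    Returns a value in [0, 1]: 0 means no measurable degradation, values
    near 1 mean late-session quality has collapsed relative to the start.
    """
    if len(turn_scores) < 2 * window:
        return 0.0  # not enough history to detect a trend
    early = mean(turn_scores[:window])
    late = mean(turn_scores[-window:])
    if early <= 0:
        return 0.0
    return max(0.0, (early - late) / early)
```

Because the index is relative to the session's own baseline, it flags rot even in agents whose absolute quality differs, which is what "no error, no crash, only slow rot" demands.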
The “Laziness” Problem
During my work observing coding agents at scale, I discovered a failure mode that no existing observability tool detected: agent laziness. The agent would produce syntactically valid but substantively empty responses: placeholder functions, TODO comments instead of implementations, or answers that technically addressed the question while doing as little work as possible.
This is not a hallucination. It is not an error. It is a quality degradation that only becomes visible when you measure the gap between what was asked and what was delivered. This discovery led to the development of a Laziness Index that measures response length shrinkage, placeholder patterns, and delegation frequency. The metrics we were building for coding agents were actually capturing fundamental properties of human-agent collaboration.
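The three components named above (length shrinkage, placeholder patterns, delegation frequency) can be combined into a single score. This is an illustrative sketch, not the production index; the pattern lists and weights are assumptions, not calibrated values.

```python
import re

# Placeholder patterns that signal effort avoidance; illustrative, not exhaustive.
PLACEHOLDER_PATTERNS = [
    r"\bTODO\b",
    r"\bFIXME\b",
    r"\bpass\s*$",
    r"\.\.\.",
]
# Phrases that delegate the work back to the human.
DELEGATION_PATTERNS = [
    r"you (can|could|should) (implement|write|add)",
    r"left as an exercise",
]

def laziness_index(response: str, expected_length: int) -> float:
    """Score effort avoidance in an agent response, in [0, 1].

    Combines length shrinkage against an expected baseline, placeholder
    density, and delegation back to the human. Weights are illustrative.
    """
    shrinkage = max(0.0, 1.0 - len(response) / expected_length) if expected_length else 0.0
    placeholders = sum(bool(re.search(p, response, re.I | re.M)) for p in PLACEHOLDER_PATTERNS)
    delegations = sum(bool(re.search(p, response, re.I)) for p in DELEGATION_PATTERNS)
    placeholder_score = min(1.0, placeholders / 3)
    delegation_score = min(1.0, delegations / 2)
    return round(0.4 * shrinkage + 0.4 * placeholder_score + 0.2 * delegation_score, 3)
```

A lazy response ("def solve(): pass # TODO, you can implement the rest") scores high; a complete answer of the expected length with no placeholders scores near zero.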
The Framework: Behavioral Observability
Observability for non-deterministic systems requires a shift from infrastructure metrics to behavioral metrics. We must measure not what the system is doing (CPU, memory, latency) but what the system is achieving (correct outcomes, user satisfaction, progressive quality).
Core Principle: The Human as Sensor
In human-agent collaboration, the human’s behavior is the most reliable signal of agent quality. When a developer says “that’s wrong, fix it,” they are providing a ground-truth quality signal that no automated evaluation can match.
This is Correction-Based Observability: the systematic detection and scoring of human corrections to agent outputs as a proxy for output quality.
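At its simplest, correction detection is pattern matching over human turns. The sketch below uses regexes for clarity; a real detector would likely use a classifier. The phrase list is an assumption for illustration.

```python
import re

# Phrases that typically signal a human correcting the agent; illustrative.
CORRECTION_PATTERNS = [
    r"\bthat'?s (wrong|incorrect|not right)\b",
    r"\bfix (it|this|that)\b",
    r"\btry again\b",
    r"\byou (made|missed|forgot)\b",
]

def is_correction(human_message: str) -> bool:
    """Return True if the human turn reads as a correction of the agent."""
    return any(re.search(p, human_message, re.I) for p in CORRECTION_PATTERNS)

def hallucination_index(human_turns: list[str]) -> float:
    """Fraction of human turns that correct the agent, in [0, 1]."""
    if not human_turns:
        return 0.0
    return sum(map(is_correction, human_turns)) / len(human_turns)
```

The point is that the signal comes for free: the human was going to type "that's wrong, fix it" anyway; observability just has to listen.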
Seven Behavioral Metrics
- Hallucination Index: Rate of human corrections to agent outputs.
- Laziness Index: Response quality degradation and effort avoidance.
- Context Rot Index: Quality degradation over session length.
- Flow Score: Consecutive productive interactions without correction.
- Loop Rate: Consecutive correction cycles without progress.
- Session Health: Three-tier classification (Clean, Bumpy, Troubled).
- Cost Per Outcome: Token spend divided by tangible deliverables.
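Session Health, the three-tier classification above, can be derived from the other metrics. The thresholds below are illustrative defaults I am assuming for the sketch, not calibrated values from the framework.

```python
def session_health(hallucination_index: float, loop_rate: float,
                   context_rot: float) -> str:
    """Three-tier session classification from behavioral metrics.

    Illustrative rule: Troubled if any metric is badly out of range,
    Bumpy if any metric is elevated, Clean otherwise. All inputs in [0, 1].
    """
    metrics = (hallucination_index, loop_rate, context_rot)
    if any(m >= 0.4 for m in metrics):
        return "Troubled"
    if any(m >= 0.15 for m in metrics):
        return "Bumpy"
    return "Clean"
```

A worst-of-metrics rule (rather than an average) reflects that a single runaway loop can sink a session even when everything else looks healthy.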
Application Across High-Stakes Domains
The correction-based observability pattern applies to any human-agent collaboration in which the human can signal dissatisfaction.
Healthcare: Clinical Decision Support
A hallucinating coding agent produces a bug. A hallucinating clinical agent produces a misdiagnosis. In this domain, the framework uses tighter thresholds. A 15% hallucination rate in coding is a productivity issue; in healthcare, the threshold for a “Troubled” session is often a single override.
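The asymmetry between domains can be encoded as per-domain threshold configuration. This is a hypothetical sketch; the field names and the coding-domain values are assumptions, though the healthcare single-override rule follows the framing above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DomainThresholds:
    """Per-domain tolerance for corrections; values are illustrative."""
    troubled_correction_rate: float  # correction rate that flags a session Troubled
    troubled_override_count: int     # absolute override count that does the same

THRESHOLDS = {
    # Coding tolerates some correction churn as a productivity cost.
    "coding": DomainThresholds(troubled_correction_rate=0.15, troubled_override_count=10),
    # In clinical decision support, a single override escalates the session.
    "healthcare": DomainThresholds(troubled_correction_rate=0.02, troubled_override_count=1),
}

def is_troubled(domain: str, correction_rate: float, overrides: int) -> bool:
    t = THRESHOLDS[domain]
    return (correction_rate >= t.troubled_correction_rate
            or overrides >= t.troubled_override_count)
```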
Energy and Grid Management
In energy systems, the consequences of errors manifest in physical system behavior. The observability layer tracks Physical Constraint Violation Rates — recommendations that violate thermal limits or voltage bounds — which are physically impossible “hallucinations.”
Financial Services and Legal
- Finance: Measuring the Fair Lending Deviation Index to track if underwriter overrides vary by borrower demographics.
- Legal: Monitoring the Citation Hallucination Index to detect non-existent case law before it reaches a court filing.
Architecture: Privacy-First and Local-First
Behavioral observability data is sensitive. A correction log reveals what an expert (a doctor, an attorney, an engineer) had to fix.
The proposed architecture for this framework is local-first: all computation happens on the practitioner’s machine. No raw sessions or corrections leave the device. Team-level aggregation uses anonymized identities and transmits only aggregate metrics. This removes the primary barrier to AI observability: the fear that the tool will expose individual performance rather than system reliability.
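The local-first contract can be stated in code: the only payload that ever leaves the device carries a salted pseudonymous identity and per-metric aggregates, never raw sessions or correction text. This sketch assumes illustrative field names and a simple salted hash for anonymization.

```python
import hashlib
from statistics import mean

def anonymize(user_id: str, org_salt: str) -> str:
    """One-way pseudonymous identity: a salted hash, never the raw ID."""
    return hashlib.sha256((org_salt + user_id).encode()).hexdigest()[:16]

def team_payload(local_sessions: list[dict], user_id: str, org_salt: str) -> dict:
    """Build the only thing that leaves the device: aggregate metrics.

    Raw sessions and individual corrections stay local; the payload carries
    an anonymized identity plus per-metric means.
    """
    return {
        "member": anonymize(user_id, org_salt),
        "sessions": len(local_sessions),
        "mean_hallucination_index": mean(s["hallucination_index"] for s in local_sessions),
        "mean_flow_score": mean(s["flow_score"] for s in local_sessions),
    }
```

Because aggregation happens before transmission, a team dashboard can show system reliability trends without any path back to what a specific doctor, attorney, or engineer had to fix.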
Conclusion
Non-deterministic observability is not a product; it is a discipline. As we scale agentic architectures, we must stop measuring what agents consume (tokens, latency) and start measuring what they achieve.
While the foundational primitives for agent tracking exist in the open-source Agent Governance Toolkit (AGT), this behavioral framework represents a necessary evolution in how we ensure AI reliability at scale.
