The DEV community is buzzing about OpenTelemetry standardizing LLM tracing. That is a real win. Spans, traces, semantic conventions for gen AI — all of it matters. I have been watching this space for a while.
But I want to say something that production experience has drilled into me.
Observability without correction is a dashboard full of problems you are still solving manually.
What Tracing Gives You
OpenTelemetry for LLMs gives you visibility into:
- Latency per call
- Token consumption
- Span trees across your agent chain
- Model inputs and outputs at each step
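In OTel terms, those signals land as span attributes. Here is a minimal sketch of the attribute set a gen-AI span carries, using names in the style of the OpenTelemetry GenAI semantic conventions (treat the exact keys as illustrative, not authoritative):

```python
# Sketch: the kind of attributes you would attach to an LLM-call span
# via span.set_attribute(). Attribute names follow the style of the
# OTel GenAI semantic conventions; verify against the current spec.

def genai_span_attributes(model: str, input_tokens: int,
                          output_tokens: int, latency_ms: float) -> dict:
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Latency is normally just the span's duration; shown here
        # explicitly for clarity.
        "llm.latency_ms": latency_ms,
    }

attrs = genai_span_attributes("gpt-4o", 812, 204, 1350.0)
```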
That is genuinely useful. I am not dismissing it.
But here is what it does not give you:
- Detection that an output is hallucinated before it reaches your user
- Automatic retry with a corrected prompt when groundedness fails
- Cost circuit breakers that fire before your inference bill explodes
- Safety flags that block a response instead of just logging that it was bad
You are still the correction layer. You are the human staring at a Grafana dashboard at 2am deciding whether to roll back a prompt.
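That hand-rolled correction layer usually ends up looking something like this sketch. Everything here is hypothetical — `call_llm`, `groundedness_score`, the thresholds, and the per-call cost are stand-ins, not any real library's API:

```python
# A minimal sketch of the correction layer teams end up writing by hand:
# a groundedness gate, a cost circuit breaker, and a prompt-rewrite retry.
# All names and thresholds are illustrative assumptions.

def guarded_completion(prompt, call_llm, groundedness_score,
                       cost_per_call=0.01, budget=0.10, max_retries=2):
    spent = 0.0
    for _ in range(max_retries + 1):
        if spent + cost_per_call > budget:      # cost circuit breaker
            raise RuntimeError("budget exceeded before a grounded answer")
        output = call_llm(prompt)
        spent += cost_per_call
        if groundedness_score(output) >= 0.8:   # groundedness gate
            return output, spent
        # Groundedness failed: rewrite the prompt and retry.
        prompt = f"{prompt}\n\nAnswer ONLY from the provided context."
    raise RuntimeError("no grounded answer within retry budget")

# Demo with stub evaluators (no real model call):
outputs = iter(["vague guess", "grounded: cites the provided context"])
answer, cost = guarded_completion(
    "What does the policy say?",
    call_llm=lambda p: next(outputs),
    groundedness_score=lambda o: 1.0 if o.startswith("grounded") else 0.0,
)
```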
What Production AI Actually Looks Like
I spent the last two years building AI systems at scale inside regulated industries. Healthcare revenue cycle. Power grid intelligence. Genomics pipelines. These are not playgrounds.
The pattern I kept seeing: teams invest heavily in logging and tracing. They build beautiful dashboards. And then when the LLM misbehaves in production, the process to correct it is still manual, slow, and incident-driven.
The gap is not observability. The gap is autonomous correction at the output layer.
Nobody had shipped that as a product. So I built it.
ARGUS: Autonomous Runtime Guardian for Unified Systems
ARGUS is an open-source LLM observability platform that goes one layer further than tracing: it evaluates six dimensions of LLM output in real time.
For agentic systems specifically, ARGUS adds three more signals:
- ASF (Agent Success Fraction) — what percentage of agent tasks complete successfully
- ERR (Error Recovery Rate) — how well the agent recovers from tool failures
- CPCS (Cost Per Completed Subtask) — real cost accountability at the task level
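All three signals fall out of a plain task log. Here is a sketch of how I would compute them — the record shape (`completed`, `tool_failures`, `subtasks_completed`, `cost_usd`) is my own assumption for illustration, not ARGUS's actual schema:

```python
# Sketch: computing ASF, ERR, and CPCS from a list of agent task records.
# The record fields are assumed for illustration.

def agent_metrics(tasks):
    completed = [t for t in tasks if t["completed"]]
    failures = [t for t in tasks if t["tool_failures"] > 0]
    recovered = [t for t in failures if t["completed"]]
    subtasks_done = sum(t["subtasks_completed"] for t in tasks)
    total_cost = sum(t["cost_usd"] for t in tasks)
    return {
        # Agent Success Fraction: share of tasks that finished
        "ASF": len(completed) / len(tasks),
        # Error Recovery Rate: share of tool-failure tasks that still finished
        "ERR": len(recovered) / len(failures) if failures else 1.0,
        # Cost Per Completed Subtask
        "CPCS": total_cost / subtasks_done if subtasks_done else float("inf"),
    }

log = [
    {"completed": True,  "tool_failures": 1, "subtasks_completed": 4, "cost_usd": 0.20},
    {"completed": False, "tool_failures": 2, "subtasks_completed": 1, "cost_usd": 0.05},
    {"completed": True,  "tool_failures": 0, "subtasks_completed": 5, "cost_usd": 0.25},
]
m = agent_metrics(log)
```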
When a dimension fails a threshold, ARGUS does not just log it. It triggers a correction loop.
The Architecture in One Diagram
```
LLM Call → ARGUS Eval Layer → Pass/Fail per Dimension
                     ↓
               Fail Detected
                     ↓
         Autonomous Correction Loop
          (prompt rewrite + retry)
                     ↓
        Corrected Output → Your App
```
The observability layer is the open-source core (argus-ai on PyPI). The autonomous correction loop is the proprietary layer being built for enterprise deployment.
Why OpenTelemetry + ARGUS Is the Right Stack
I am not building against OpenTelemetry. The right mental model is:
- OpenTelemetry = infrastructure observability, distributed tracing, the plumbing
- ARGUS = semantic output evaluation, correction, quality guarantee
They compose. You can pipe ARGUS evaluation results as spans into your OTel collector. You get the full picture: infrastructure health AND output quality in the same trace.
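The glue is small: flatten the evaluation result into span attributes so the scores ride in the same trace as the infrastructure metrics. A sketch — the `argus.eval.*` attribute names are my own illustration, not an official convention, and in real code each key/value pair would go through `span.set_attribute()`:

```python
# Sketch: flatten semantic eval scores into OTel span attributes so
# output quality lands in the same trace as infra health.
# The "argus.eval.*" namespace is an assumed, illustrative convention.

def eval_to_span_attributes(eval_result: dict, prefix="argus.eval") -> dict:
    attrs = {}
    for dimension, score in eval_result["scores"].items():
        threshold = eval_result["thresholds"][dimension]
        attrs[f"{prefix}.{dimension}.score"] = score
        attrs[f"{prefix}.{dimension}.passed"] = score >= threshold
    return attrs

span_attrs = eval_to_span_attributes({
    "scores":     {"groundedness": 0.91, "safety": 0.62},
    "thresholds": {"groundedness": 0.80, "safety": 0.70},
})
```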
What I Learned the Hard Way
At R1 RCM I led the engineering work that contributed to a $4.1B acquisition. The AI systems underpinning that work processed millions of healthcare claims. When the LLM got it wrong, it was not just a metric. It was a denied claim, a delayed payment, a patient impact.
Tracing told us what happened. Correction stopped it from happening again.
That difference is what drove me to build ARGUS.
Get Started
```shell
pip install argus-ai
```
GitHub: github.com/anilatambharii/argus-ai
PyPI: pypi.org/project/argus-ai
If you are building LLM systems in production and want to collaborate, reach out. This is open-source and I want it to be the standard evaluation layer the community builds on.
25 years of production AI. All opinions are mine. All lessons were expensive.
