<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: thomas Pham</title>
    <description>The latest articles on DEV Community by thomas Pham (@guardianai).</description>
    <link>https://dev.to/guardianai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3794747%2Fcffa6366-d177-432c-9929-a3ba851545f0.png</url>
      <title>DEV Community: thomas Pham</title>
      <link>https://dev.to/guardianai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/guardianai"/>
    <language>en</language>
    <item>
      <title>Observing silent failures in LLM outputs over time</title>
      <dc:creator>thomas Pham</dc:creator>
      <pubDate>Thu, 26 Feb 2026 13:46:32 +0000</pubDate>
      <link>https://dev.to/guardianai/observing-silent-failures-in-llm-outputs-over-time-5d4m</link>
      <guid>https://dev.to/guardianai/observing-silent-failures-in-llm-outputs-over-time-5d4m</guid>
      <description>&lt;p&gt;Hello all,&lt;/p&gt;

&lt;p&gt;I’ve been working on a structural observability layer for AI systems called GuardianAI.&lt;/p&gt;

&lt;p&gt;The idea is simple:&lt;br&gt;
instead of trying to correct model outputs or evaluate reasoning, GuardianAI only observes how outputs evolve under constraints over time.&lt;/p&gt;

&lt;p&gt;It does not inspect content or replace the model.&lt;br&gt;
It only monitors trajectory behavior and emits read-only control states when constraints are breached.&lt;/p&gt;
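
&lt;p&gt;Conceptually, the observer sits beside the pipeline and is read-only by construction. Here is a minimal Python sketch of that pattern (names and types are illustrative, not the actual GuardianAI internals):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable

class ControlState(Enum):
    STABLE = "stable"
    BREACHED = "breached"

@dataclass(frozen=True)  # frozen: observations are immutable; the observer never writes back
class Observation:
    step: int
    state: ControlState

def observe(outputs: Iterable[str], contract: Callable[[str], bool]):
    """Emit one read-only control state per output.
    The outputs themselves pass through untouched."""
    for step, text in enumerate(outputs):
        state = ControlState.STABLE if contract(text) else ControlState.BREACHED
        yield Observation(step, state)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The point of the sketch: the observer only yields observations. It has no code path that could mutate, retry, or replace an output.&lt;/p&gt;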

&lt;p&gt;To test this, I built a deterministic contract lab where the model must output exact literals.&lt;br&gt;
This isolates stability rather than reasoning quality.&lt;/p&gt;
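
&lt;p&gt;A contract check of this kind can be as small as a string comparison. Here is a minimal sketch that separates formatting drift from strict semantic failure (the expected literal is a made-up example, not one from the lab):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;EXPECTED = '{"status": "ok", "code": 200}'  # the contracted literal (example value)

def classify(model_output: str):
    """Return 'pass', 'formatting_drift', or 'semantic_failure'."""
    if model_output == EXPECTED:
        return "pass"
    # Same characters once whitespace is stripped: the shape drifted, not the content.
    if "".join(model_output.split()) == "".join(EXPECTED.split()):
        return "formatting_drift"
    # Anything else is a strict semantic failure: the content itself is wrong.
    return "semantic_failure"
&lt;/code&gt;&lt;/pre&gt;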

&lt;p&gt;In a recent run:&lt;/p&gt;

&lt;p&gt;• 15 contract breaches were observed&lt;br&gt;
• 5 of those were strict semantic failures (not formatting drift)&lt;br&gt;
• relative to total outputs, this corresponds to roughly a 1–4% hard failure rate&lt;/p&gt;

&lt;p&gt;These are not hallucinations in the usual sense.&lt;br&gt;
They are silent decision errors that appear correct locally but violate the contract globally.&lt;/p&gt;
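
&lt;p&gt;To make "correct locally, breached globally" concrete, here is a toy example (hypothetical contract, not the lab's actual one): the output below is well-formed JSON with the right shape, so a local check passes, yet the value breaks the contract.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json

CONTRACT_CODE = 200  # the globally contracted value (illustrative)

def locally_valid(output: str):
    """Local check: well-formed JSON with an integer 'code' field."""
    try:
        return isinstance(json.loads(output).get("code"), int)
    except (json.JSONDecodeError, AttributeError):
        return False

def globally_valid(output: str):
    """Global check: the value must match the contract exactly.
    Assumes locally_valid(output) already passed."""
    return json.loads(output).get("code") == CONTRACT_CODE

sample = '{"code": 404}'
print(locally_valid(sample))   # True  -- looks correct in isolation
print(globally_valid(sample))  # False -- silent contract breach
&lt;/code&gt;&lt;/pre&gt;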

&lt;p&gt;In production pipelines, this kind of failure compounds over time: downstream steps consume a breached output as if it were valid, so the error propagates silently.&lt;/p&gt;

&lt;p&gt;The demo interface is just a visualization layer.&lt;br&gt;
The observer runs independently and can be tested directly.&lt;/p&gt;

&lt;p&gt;You can try the demo here:&lt;br&gt;
&lt;a href="https://app.guardianai.fr" rel="noopener noreferrer"&gt;https://app.guardianai.fr&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m interested in connecting with researchers or engineers who work on evaluation, reliability, or production AI pipelines.&lt;/p&gt;

&lt;p&gt;If you want to test GuardianAI directly outside the UI, feel free to reach out — I can provide endpoint access.&lt;/p&gt;

&lt;p&gt;Thom&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>computerscience</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
