OpenTelemetry Tells You What Your Agent Did. Not Whether It Was OK.

#aiagents #observability #opentelemetry #platformengineering

OpenTelemetry's GenAI conventions will tell you that your agent called Claude, spent 1,843 input tokens, took 900 milliseconds, and returned without an error. They will not tell you that the answer cited zero sources, that the loop spun nineteen times before it gave up, or that the model never saw the guardrail that was supposed to stop it. Those are the facts that decide whether an agent is safe to run unattended. No standard layer captures them.

So I built a small one. ballast sits on top of OpenTelemetry: OTel tells you what happened; ballast tells you whether it was acceptable.

The split

OTel already owns the telemetry substrate — provider, model, token counts, latency, status. That problem is solved, and solved as a standard. ballast doesn't touch it. What it adds is the reliability layer, expressed as ballast.* attributes and events riding on the same gen_ai.* spans:

prompt-contracts — a versioned schema on the input or output. A violation surfaces on the span instead of failing silently three calls later.
guardrails — did the output cite a source and a confidence level? And the part most guardrail tooling skips: did the model actually see the failure, or did the app swallow it?
bounded loops — an agent loop has four ways to stop: done, out of iterations, out of budget, or stalled. ballast records which one, so "it finished" and "it gave up" stop looking identical in your traces.
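To make the four stop conditions concrete, here is a minimal sketch of a bounded loop that always reports why it stopped. Every name in it (StopReason, LoopLimits, runBoundedLoop, the reason strings) is hypothetical and chosen for illustration; it is not ballast's API, only one way the idea could look.

```typescript
// Illustrative sketch only: these types and names are assumptions, not ballast's API.
type StopReason = 'done' | 'max_iterations' | 'budget_exhausted' | 'stalled';

interface LoopLimits {
  maxIterations: number;
  maxCostUsd: number;
  maxStalledSteps: number; // consecutive steps without progress before giving up
}

interface StepResult {
  done: boolean;       // the agent considers the task finished
  costUsd: number;     // cost of this step
  progressed: boolean; // whether this step changed anything useful
}

interface LoopOutcome {
  reason: StopReason;
  iterations: number;
  costUsd: number;
}

function runBoundedLoop(
  step: (iteration: number) => StepResult,
  limits: LoopLimits,
): LoopOutcome {
  let costUsd = 0;
  let stalledSteps = 0;
  for (let i = 0; i < limits.maxIterations; i++) {
    const result = step(i);
    costUsd += result.costUsd;
    const iterations = i + 1;
    if (result.done) return { reason: 'done', iterations, costUsd };
    if (costUsd >= limits.maxCostUsd) {
      return { reason: 'budget_exhausted', iterations, costUsd };
    }
    // Reset the stall counter on progress, otherwise count the idle step.
    stalledSteps = result.progressed ? 0 : stalledSteps + 1;
    if (stalledSteps >= limits.maxStalledSteps) {
      return { reason: 'stalled', iterations, costUsd };
    }
  }
  return { reason: 'max_iterations', iterations: limits.maxIterations, costUsd };
}

// A loop that never finishes: it stops at the iteration cap, not with 'done'.
const gaveUp = runBoundedLoop(
  () => ({ done: false, costUsd: 0.01, progressed: true }),
  { maxIterations: 19, maxCostUsd: 1, maxStalledSteps: 3 },
);
```

Here gaveUp.reason is 'max_iterations' after 19 iterations. Because the reason is part of the return value, a caller cannot confuse that outcome with a successful 'done'.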

You instrument an existing call by wrapping it. Nothing about your stack changes:

import { wrap, evidenceGuardrail } from '@michaeltuszynski/ballast';

// callYourModel() stands in for whatever client call you already make.
const answer = await wrap(
  { name: 'gen_ai.chat', system: 'anthropic', model: 'claude-sonnet-4-5' },
  async (ctx) => {
    const res = await callYourModel();
    // Record token usage and cost on the span.
    ctx.setUsage(res.inputTokens, res.outputTokens, res.costUsd);
    // Attach the evidence guardrail result to the same span.
    ctx.guardrail(evidenceGuardrail(res.text));
    return res.text;
  },
);

wrap opens a real OTel span, lets you record usage and reliability results onto it, and exports a protocol-conformant record to a runs.jsonl file. The ballast runs CLI command then reads that file back.
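The internals of evidenceGuardrail are not shown in this post. As a rough idea of what an evidence check might involve, here is a standalone sketch. The function name checkEvidence, the result shape, the failure codes, and both text conventions (bracketed references or URLs for sources, a "Confidence: ..." line for confidence) are assumptions made for illustration, not ballast's implementation.

```typescript
// Illustrative sketch only: names, result shape, and matching rules are assumptions.
interface GuardrailResult {
  name: string;
  passed: boolean;
  failures: string[];
  // Whether the failure was fed back to the model or only handled by the app.
  shownToModel: boolean;
}

function checkEvidence(text: string): GuardrailResult {
  const failures: string[] = [];
  // Assumed convention: a source is a URL or a bracketed reference such as [1].
  const citesSource = /https?:\/\/\S+|\[\d+\]/.test(text);
  // Assumed convention: confidence is stated as "Confidence: high|medium|low".
  const statesConfidence = /confidence:\s*(high|medium|low)/i.test(text);
  if (!citesSource) failures.push('no_source_cited');
  if (!statesConfidence) failures.push('no_confidence_level');
  // Nothing has been sent back to the model at this point, so default to false.
  return { name: 'evidence', passed: failures.length === 0, failures, shownToModel: false };
}

// Call this once the failure text has actually been included in a follow-up prompt.
function markShownToModel(result: GuardrailResult): GuardrailResult {
  return { ...result, shownToModel: true };
}

const unsupported = checkEvidence('The sky is blue.');
const supported = checkEvidence('Rates rose in 2023 [1]. Confidence: high');
```

Keeping shownToModel as an explicit field means a failure the app swallowed stays distinguishable from one the model had a chance to correct.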

I almost built the wrong thing

The first design had ballast defining its own trace schema — provider, model, tokens, the works. I had a second model review the spec before I wrote a line of code, and it caught the mistake in one paragraph: OpenTelemetry already standardizes all of that. Reinventing it would have put ballast in a fight it can't win against a convention with a working group behind it.

So the protocol got rebuilt on the OTel GenAI semantic conventions, and ballast's surface shrank to the one thing nobody standardizes: reliability semantics. That review is why the repo exists in the shape it does. The lesson generalizes — the substrate is rarely the greenfield you assume it is.

What it deliberately is not

ballast is narrow, and staying narrow is the point.

It's not an agent framework. No chains, no memory, no tool execution, no orchestration. Bring your own runtime — Claude Code, the raw SDK, LangChain — and wrap the calls. The moment a reliability layer grows an orchestration engine, it stops being a reliability layer.

It's not a tracing backend. If you only need raw LLM telemetry, use OpenTelemetry, Langfuse, or OpenLLMetry directly. ballast emits OTel; it doesn't replace your collector.

And it doesn't pretend to see everything. Wrapping arbitrary agent code means hidden retries, streaming partials, and tool calls can slip past the instrumentation. A reliability layer that reports an incomplete trace as complete is worse than no layer — it manufactures confidence. So every span carries a ballast.trace.completeness flag, and each adapter declares what it can actually observe. "Partial" is a first-class answer.
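One way a consumer could honor that flag is to refuse to report a pass when the trace is incomplete. The sketch below assumes the attribute holds the string 'complete' or 'partial'. Those values, the Verdict type, and the verdictFor function are hypothetical; only the attribute name comes from this post.

```typescript
// Illustrative sketch only: attribute values and the verdict model are assumptions.
type Verdict = 'pass' | 'fail' | 'inconclusive';

const COMPLETENESS_ATTR = 'ballast.trace.completeness';

function verdictFor(
  attributes: Record<string, unknown>,
  guardrailsPassed: boolean,
): Verdict {
  // An observed failure is a failure regardless of how much of the trace was visible.
  if (!guardrailsPassed) return 'fail';
  // A partial or missing completeness flag means a pass cannot be claimed.
  if (attributes[COMPLETENESS_ATTR] !== 'complete') return 'inconclusive';
  return 'pass';
}

const partialRun = verdictFor({ [COMPLETENESS_ATTR]: 'partial' }, true);
```

Here partialRun is 'inconclusive': the guardrails passed on what was observed, but the adapter declared it could not see everything.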

Where it came from

The contracts-guardrails-bounded-loops discipline isn't theoretical. It's what kept agent platforms I've run in production from drifting — the difference between an agent that ships a clean statement of work and one that quietly invents a clause nobody catches until a customer does. ballast is that discipline pulled out of internal tooling and rebuilt as something standards-based and small enough to drop into anyone's stack.

This is the MVP: a TypeScript SDK, the protocol, a local JSONL store, and a CLI viewer. The Python SDK and eval-as-gates — running a prompt across several models and gating on the result — are the next slices, and the schema already carries them.

The repo is MIT-licensed, has thirty tests, and is built on OTel. Clone it, run npm run example, and watch a span land in ballast runs. Then wrap one of your own calls and see what your traces haven't been telling you.

Top comments (1)

nexus-lab-zen • Jul 2

"A reliability layer that reports an incomplete trace as complete is worse than no layer — it manufactures confidence." That line, and making ballast.trace.completeness first-class instead of pretending the adapter sees everything, is the most honest design choice in here.

One failure class from our side that sits just outside ballast's current reach: claims with no span at all. We run a small multi-agent operation (7+ weeks, daily logs), and our worst incidents weren't badly-instrumented calls — they were an agent narrating a tool result that never happened. "The file was empty" with no read behind it; "committed" with no commit. Instrumentation can't catch a call that was never made. What worked for us was verifying the claim against physical state outside the narrating process — file mtime/size, HTTP status, registry state — before accepting "done". We wrote up the detector we shipped after the worst incident: dev.to/nexuslabzen/an-ai-on-our-te...

Which suggests a fifth axis for bounded loops: not just why the loop stopped, but whether the terminal "done" points at anything verifiable outside the trace. Your prompt-contracts machinery looks close to being able to express that — a contract on the completion claim itself, with an evidence pointer as a required field.

Staying narrow is the right call. OTel owns "what happened"; whether it was OK — and whether it happened at all — still needs owners.