A few months back I watched an external reviewer ask one question I could not answer.
"For the AI session that touched this medical device firmware on Tuesday, can you show me the inputs, the policy decisions, the outputs, and a signature that says nobody changed the record?"
We had beautiful dashboards. Token usage, latency, top tools, error rates. We did not have what she was asking for.
That conversation forced a vocabulary change. We started saying observability and evidence as two separate words for two separate things. This article is the version of that conversation I wish someone had given me a year earlier.
The short version
Observability is consumed by an engineer trying to understand what the system is doing. Rich, sometimes lossy, meant for a dashboard. Sampling is fine. Cardinality limits are fine.
Evidence is consumed by someone who was not there. A reviewer. A regulator. A future-you reading the file three months after the session. Evidence has to be complete for the session, normalized so the reader does not have to learn your stack, and tamper evident so they can trust no one rewrote history. Sampling is not fine. Format drift is not fine.
If you build only observability, you can debug yesterday. If you build evidence, you can defend yesterday.
A side by side
| Aspect | Observability | Evidence |
|---|---|---|
| Audience | Internal engineers | External reviewers and future you |
| Time horizon | Hours to days | Months to years |
| Sampling | Often, by tenant or rate | Never |
| Schema | Vendor specific or OTLP | Normalized, vendor neutral (AGEF) |
| Tamper evidence | Not required | Required |
| Storage | Hot, indexed | Warm, durable, lifecycle managed |
| Cost driver | Cardinality | Volume per session |
| Tooling | Datadog, Langfuse, Helicone | Akmon trust pipeline plus AGEF bundles |
| Failure mode | Slow dashboards | Failed audit |
Both belong in your stack. Neither replaces the other.
Why teams keep conflating them
Three reasons, all reasonable.
- The data starts in the same place. Both come from the agent runtime.
- The first audience is engineering. Observability comes first because it solves immediate pain. Evidence is for a later audience.
- Vendors blur the line. Many tools sell observability as "audit ready". Look at the schema, not the marketing.
You can use the same source events. You should not use the same destination format.
The data shape each one needs
Observability wants to slice and dice. The data is dimensional. Tag everything. Sample when you have to. A typical span has attributes for tenant, user, model, tool name, version, with counters for tokens in, tokens out, cost. High cardinality fields can be dropped or hashed.
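To make that shape concrete, here is a minimal Rust sketch of a dimensional record: flat attributes plus a high-cardinality field hashed at the edge. The field names mirror the list in the next section, not any particular vendor's schema, and the sketch assumes the `sha2` and `hex` crates.

```rust
// A dimensional record: everything is a tag, and the one high-cardinality
// field (the user id) is hashed rather than stored raw.
use sha2::{Digest, Sha256};
use std::collections::BTreeMap;

fn span_attributes(tenant: &str, user_id: &str, model: &str, tool: &str) -> BTreeMap<&'static str, String> {
    let mut attrs = BTreeMap::new();
    attrs.insert("tenant_id", tenant.to_string());
    // Hash instead of drop: keeps "who is affected?" answerable without raw ids.
    attrs.insert("user_id_hash", hex::encode(Sha256::digest(user_id.as_bytes())));
    attrs.insert("model_name", model.to_string());
    attrs.insert("tool_name", tool.to_string());
    attrs
}

fn main() {
    // Dropping or hashing a field here loses detail but keeps the dashboard fast.
    for (k, v) in span_attributes("acme", "user-8841", "example-model-v2", "shell") {
        println!("{k}={v}");
    }
}
```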
Evidence wants to be complete and verifiable. The data is hierarchical. Each session is a chain of typed events with a known closed kind list (SessionStart, UserTurn, ProviderCall, ToolCall, RetrievalCall, PermissionGate, AssistantTurn, SessionEnd). Inputs and outputs are content-addressed. Hashes link events to objects. The bundle has a verifiable hash chain today; a signed manifest is planned.
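The difference is easier to see in code. Below is a minimal sketch of that evidence shape: typed events chained by hash, with payloads stored as content-addressed objects. The struct and field names are illustrative, not the AGEF wire format; it assumes the `sha2` and `hex` crates.

```rust
// Evidence shape: each event names its parent by hash and its payload by
// content address, so changing any byte anywhere moves the chain head.
use sha2::{Digest, Sha256};

#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum Kind { SessionStart, UserTurn, ProviderCall, ToolCall, RetrievalCall, PermissionGate, AssistantTurn, SessionEnd }

struct Event {
    parent: Option<String>, // hash of the previous event; None for SessionStart
    kind: Kind,
    sequence: u64,
    object: String, // content address (hex SHA-256) of the input/output payload
}

fn content_address(payload: &[u8]) -> String {
    hex::encode(Sha256::digest(payload))
}

fn event_hash(ev: &Event) -> String {
    let mut h = Sha256::new();
    if let Some(p) = &ev.parent {
        h.update(p.as_bytes());
    }
    h.update(format!("{:?}|{}|{}", ev.kind, ev.sequence, ev.object).as_bytes());
    hex::encode(h.finalize())
}

fn main() {
    let start = Event { parent: None, kind: Kind::SessionStart, sequence: 0, object: content_address(b"") };
    let turn = Event {
        parent: Some(event_hash(&start)),
        kind: Kind::UserTurn,
        sequence: 1,
        object: content_address(b"user prompt: update the firmware build flags"),
    };
    // Tamper evidence: edit the prompt or the start event and this head moves.
    println!("head = {}", event_hash(&turn));
}
```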
The two shapes are related but not the same.
What gets recorded in each
A short tour of fields, with the question they answer.
Observability
- `latency_ms`: how slow was that step?
- `tokens_input`, `tokens_output`: what is this costing?
- `model_name`, `model_version`: did a recent change cause the regression?
- `error_class`, `error_message`: where are breakages clustered?
- `user_id_hash`, `tenant_id`: who is affected?
Evidence (AGEF v0.1)
- Manifest: `agef_version`, `producer`, `session.id`, `session.head`, timestamps, `hash_algorithm`, counts.
- Events: `parents`, `kind`, `emitted_at`, `sequence`, plus kind-specific fields.
- Objects: every input, output, prompt, message, side effect.
A reader of an observability stream cannot answer "show me the input that produced this denial". A reader of an AGEF bundle cannot easily answer "what is the p99 latency for this tool". Both are right.
A pipeline that produces both
You do not need two duplicate pipelines. You need one source and two sinks.
```
agent runtime -> Akmon (single Rust binary)
                   |
                   +--> OTLP / metrics --> observability backend
                   |
                   +--> .akmon/audit/<session>.jsonl --> trust pipeline
                        .akmon/evidence/<session>.json
                        AGEF bundle (Phase 4)
```
Akmon writes the audit chain and the evidence summary. Phase 4 export turns those into a portable AGEF bundle. Observability data continues to flow to your dashboard.
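As a sketch of what that fan-out looks like at the code level: one event, two projections. The printed metrics line stands in for an OTLP exporter, and the JSONL path follows the layout in the diagram. This is illustrative, not Akmon's internals; it assumes the `serde_json` crate.

```rust
// One source, two sinks: the same runtime event becomes a lossy metrics line
// and an unsampled, append-only line in the per-session audit file.
use serde_json::json;
use std::fs::OpenOptions;
use std::io::Write;

fn record(session: &str, kind: &str, latency_ms: u64, payload: &str) -> std::io::Result<()> {
    // Sink 1: observability. Dimensional; sampling or dropping fields is fine.
    println!("metric kind={kind} latency_ms={latency_ms}");

    // Sink 2: evidence. Complete for the session; never sampled.
    std::fs::create_dir_all(".akmon/audit")?;
    let mut f = OpenOptions::new()
        .create(true)
        .append(true)
        .open(format!(".akmon/audit/{session}.jsonl"))?;
    writeln!(f, "{}", json!({ "kind": kind, "payload": payload }))?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    record("session-001", "ToolCall", 142, "git diff --stat")
}
```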
How to roll this out without breaking the existing stack
If your team already has observability, do not rip it out. Add evidence next to it.
A migration that works in three steps:
- Two weeks: install Akmon. Run the trust pipeline on a small set of sessions. Confirm the three exit codes are `0`.
- Two weeks: add policy packs. Pin the tool surface. Run replay against a small canonical set in CI.
- Ongoing: expand to more repos. Map AGEF event kinds to control statements. Hand a sample to your reviewer and ask what is missing.
By the end of the third step, the conversation with the reviewer changes. Instead of "we have logs", the answer is "here is the bundle, here is the verifier, here is the policy that fired".
Five things teams get wrong
1. Sampling for evidence. If a session is missing, the answer to "what did the agent do at 9:14" is "we do not know". The worst possible answer.
2. Storing raw inputs without redaction in shared bundles. Sensitive data does not belong in an evidence bundle that is going to a third party. Redact at the boundary; keep the structural events.
3. Same retention as observability. Observability is hot. Evidence is warm. Use storage classes that match the access pattern.
4. No verification step. A bundle that is not verified is a bundle that might fail when it counts. Make `bundle import --verify-only` a required step before any handoff; a sketch of that gate follows this list.
5. Treating the format as an internal detail. Evidence is consumed by people outside your team. Pick an open format that those readers can still make sense of later, not just one that fits your stack today. AGEF is meant for that.
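Here is what that verification gate can look like as a pre-handoff check. The `bundle import --verify-only` subcommand is the one named above; the `akmon` binary name and the bundle filename are assumptions for the sketch.

```rust
// A pre-handoff gate: refuse to ship a bundle that does not verify.
use std::process::Command;

fn main() {
    // Binary name and bundle path are illustrative assumptions.
    let status = Command::new("akmon")
        .args(["bundle", "import", "--verify-only", "session-001.agef"])
        .status()
        .expect("failed to launch akmon");
    assert!(status.success(), "bundle failed verification; do not hand off");
}
```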
Why I picked AGEF for the evidence side
I wrote AGEF because I needed it. Observability formats were not built for the audience that consumes evidence. Vendor-specific traces leave the reader stuck in a dashboard. Custom JSON drifts in a quarter and rots in a year.
AGEF is small, opinionated, and portable. The full spec fits in a short document. The bundle is tar.zst with a manifest.json, events.bin (length-delimited canonical CBOR), and objects/<hex> files. The format is signed (planned). It works for any runtime that can emit it. It is the part of the stack I want to outlive any one tool, including mine.
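For a feel of how small the reading side can be, here is a sketch of a length-delimited reader. It assumes a big-endian u32 length prefix per frame, which is one common convention; the actual AGEF framing is defined by the spec, not this snippet.

```rust
// Read length-delimited frames of the kind events.bin contains; each frame
// would hold one canonical-CBOR event. Framing here is an assumption.
use std::io::{Cursor, Read};

fn read_frames(mut r: impl Read) -> std::io::Result<Vec<Vec<u8>>> {
    let mut frames = Vec::new();
    let mut len_buf = [0u8; 4];
    loop {
        match r.read_exact(&mut len_buf) {
            Ok(()) => {}
            // Clean EOF between frames means we are done.
            Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
        let len = u32::from_be_bytes(len_buf) as usize;
        let mut frame = vec![0u8; len];
        r.read_exact(&mut frame)?;
        frames.push(frame);
    }
    Ok(frames)
}

fn main() -> std::io::Result<()> {
    // Two toy frames: lengths 3 and 2, followed by their bytes.
    let data = [0, 0, 0, 3, b'a', b'b', b'c', 0, 0, 0, 2, b'x', b'y'];
    let frames = read_frames(Cursor::new(&data[..]))?;
    assert_eq!(frames.len(), 2);
    Ok(())
}
```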
Where to start
If you have observability already, you have most of the source data. The work is to project it into a normalized record per session, redact at the object level when you share, and verify the bundle on import. Akmon and AGEF give you that path without rewriting your dashboard.
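Object-level redaction is the piece that trips people up, so here is a minimal sketch of the idea: drop the sensitive bytes, keep the content address, and the structural event that points at it stays intact. The types are illustrative, assuming the `sha2` and `hex` crates.

```rust
// Redact the payload, keep the hash: the event chain still links up, and the
// redaction is visible to the reader rather than silent.
use sha2::{Digest, Sha256};

struct ObjectRef {
    hash: String,             // content address, always retained
    payload: Option<Vec<u8>>, // None once redacted for sharing
}

fn redact(obj: &mut ObjectRef) {
    obj.payload = None;
}

fn verify(obj: &ObjectRef) -> bool {
    match &obj.payload {
        Some(bytes) => hex::encode(Sha256::digest(bytes)) == obj.hash,
        None => true, // redacted: the address stands in; the event chain is checked elsewhere
    }
}

fn main() {
    let bytes = b"patient record 4412".to_vec();
    let mut obj = ObjectRef { hash: hex::encode(Sha256::digest(&bytes)), payload: Some(bytes) };
    assert!(verify(&obj));
    redact(&mut obj);
    assert!(verify(&obj));
}
```

The trade-off is that a reader cannot re-hash a redacted object; the chain of events still verifies, and the fact of redaction is itself on the record.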
If you do not have observability yet, start there. Get OpenTelemetry traces flowing for any agent runtime you care about. Then put Akmon in the middle and start writing evidence.
The repo is at github.com/radotsvetkov/akmon. The format is at github.com/radotsvetkov/agef. The site is at radotsvetkov.github.io/akmon.
When the next reviewer asks for the inputs, the policy decisions, the outputs, and a signature, you will have one answer instead of three.