DEV Community: Radoslav Tsvetkov

How to prove what your AI agent actually did (to someone who doesn't trust you)

Radoslav Tsvetkov — Sat, 13 Jun 2026 12:21:01 +0000

You let an AI agent edit your repo, run commands, call tools. The next morning someone asks: what did it actually do? And you realize the only honest answer you have is "trust me."

That answer does not survive contact with a client, a security reviewer, or an auditor. I got tired of it, so I built a tool that produces an answer a skeptic can check on their own machine. This is a walkthrough of how it works, with commands you can run in about two minutes.

The tool is soma, an open-source (Apache-2.0) runtime. But the interesting part is not the tool, it is the four ideas it stitches together. You could rebuild any of them yourself; the point of this post is the pattern.

The problem, stated precisely

Every big vendor shipped agent governance this year (Microsoft Agent 365, OpenAI Frontier, Google's agent platform). They are good at managing fleets. But the audit trail they produce lives inside the vendor's cloud. To believe the log, you have to believe the cloud that produced it.

That is fine until the person asking does not trust your cloud, your machine, or you. A freelancer cannot show a client their M365 tenant. A regulated team cannot put an American cloud in the trust path. An auditor wants to verify, not to take your word.

So the design goal is unusual: evidence that survives leaving your trust boundary. A record that a stranger can verify with tools they already have, no special software, no account.

The four ideas

1. Gate at the spawn boundary, not inside the agent

Agents are good at finding paths around guardrails you put inside the loop. So the check happens before the process exists.

# Install: one binary, zero dependencies to fetch.
git clone https://github.com/radotsvetkov/soma && cd soma
cargo build --release
alias soma=$PWD/target/release/soma

soma init --name demo --with-builtins

# Govern any agent CLI. The launch is checked before anything runs.
soma wrap --label fix-tests -- claude -p "fix the failing tests"

If the command matches a deny rule or the autonomy level forbids it, nothing spawns:

soma wrap -- sudo rm -rf /tmp/something
# refused, exit non-zero, and the refusal is recorded with the rule that fired

The key move: the agent never gets a chance to run, and the refusal itself is part of the record. You can prove later that something was blocked.

2. Hash-chain the log so tampering is detectable

Every action becomes a line in an append-only JSONL file. Each line carries the SHA-256 of the previous line. That makes the file a hash chain: change any line and every line after it no longer matches.

soma log verify
# recomputes the whole chain from scratch

Let's break it on purpose:

# edit one byte of history
sed -i.bak 's/governed/harmless/' .soma/events.jsonl

soma log verify
# TAMPERED at line 7: stored hash mismatch   (exits 1, names the exact line)

# restore it byte for byte
mv .soma/events.jsonl.bak .soma/events.jsonl
soma log verify
# chain valid again

This is the part that makes a demo land. You can see the tamper get caught at the exact line, and you can see the chain heal when you restore the byte.

3. Make the evidence verifiable without the tool

A hash chain you can only check with my tool is not much of a proof. So the export bundles everything a stranger needs, plus a VERIFY.md that spells out the checks in plain shell.

soma export
# produces a bundle in exports/ with manifest.json, events.jsonl, VERIFY.md

On the skeptic's machine, with no soma installed, two checks:

# 1. every file digest matches the manifest
shasum -a 256 events.jsonl   # compare against manifest.json

# 2. recompute the chain by hand: for each line, sha256 of the line with its
#    own hash field removed must equal the stored hash, and each "prev" must
#    equal the previous line's hash.

The verification cost is seconds, on their hardware, with shasum and a JSON reader. No trust in me required. That is the whole point.

4. Anchor the chain so even the operator cannot rewrite history

Here is the honest hole in everything above: I control the binary, the journal, and the machine. The hash chain proves the file was not casually edited. It does not prove I did not regenerate the whole thing.

That is what timestamp anchoring is for. soma submits the chain head to a public RFC 3161 timestamp authority and stores the response.

soma preset apply hybrid-default   # opt in to network (fresh projects are local-only)
soma anchor now                    # timestamps the chain head at a public TSA

Now anyone can verify the timestamp with stock openssl against the authority's certificate. After an anchor, I cannot backdate or rewrite history that crosses it without the public timestamps contradicting me.

What this proves: this exact journal state existed at time T, and everything after extends it. What it does not prove: that the events were true when written. Nothing prevents an operator writing fiction forward; everything prevents revising it after the fact. That is a much smaller and checkable claim than "trust my logs."

The honest limits (these are the point)

An evidence tool that oversells itself is broken on arrival, so:

wrap is not a sandbox. It gates the launch and records evidence. A wrapped agent can still do what your OS user can do. For hard isolation, run it under a container or a restricted user.
Anchoring proves no-rewrite, not truth. See above.
Single operator, single machine. No fleets, no agent identity, no RBAC. If you need to govern 10,000 agents across an org, the hyperscaler products are genuinely the right call. soma is for the case where you are the one who has to prove things.
Zero dependencies means hand-rolled crypto. SHA-256 (tested against the FIPS vectors), JSON, and HTTP are written out in src/. TLS deliberately is not; that goes through system curl. The trade is an audit surface you can read in a sitting, and cargo tree is one line. Reviewing the supply chain is reviewing src/.

Why bother

Nobody reads audit logs, which is the usual objection. True. The design does not optimize for reading; it optimizes for disputes. Nobody looks at the journal until something goes wrong, and then the only question is whether the record settles the argument or becomes another argument. A bundle that verifies in seconds on the skeptic's machine settles it.

If you run agents for clients, or compliance keeps asking you questions you can only answer with screenshots, this might be useful. The repo is github.com/radotsvetkov/soma, there is a desktop cockpit at github.com/radotsvetkov/cockpit, and I would genuinely like to hear where it breaks. The honest-limits list grows with every good objection.

What evidence do you wish you had the last time an agent touched something that mattered?

AGEF explained: a portable evidence format for AI agent sessions

Radoslav Tsvetkov — Thu, 14 May 2026 08:19:44 +0000

If you ship AI-assisted code in a regulated codebase and somebody asks "show me what the agent did", you have about a week before that question turns into a finding. The data exists somewhere. It is not in a single shape. It is rarely portable. It is almost never tamper evident.

AGEF is the spec I wrote to fix that. It stands for Agent Governance Evidence Format. The current text is v0.1.1 (pre-stable), with wire format agef_version: "0.1". The repo is at github.com/radotsvetkov/agef. Spec text is CC BY 4.0. Code is Apache-2.0. Akmon is the reference implementation; akmon-journal is a Substrate Profile, and Akmon Phase 4 brings full Bundle Profile support.

This article walks the spec end to end, with code, and with the design choices I made.

What AGEF actually is

AGEF defines how one AI agent session can be represented as a portable, tamper-evident bundle. A session is a logical run from SessionStart to SessionEnd. The bundle captures every event in order, with cryptographic linkage and content-addressed payloads.

The bundle is a tar.zst archive with three top-level paths:

manifest.json, a small UTF-8 JSON file with sorted keys and LF newlines.
events.bin, an ordered stream of length-delimited canonical CBOR event records.
objects/<hex>, a directory of content-addressed object files (one per hash).

That is the whole shape. Verifiers ignore unknown non-normative files unless explicitly configured to reject them.

The manifest

{
  "agef_version": "0.1",
  "producer": {"name": "akmon", "version": "2.0.0"},
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "head": "5c1f8b2a3d4e7f6a8b1d3e2c4f6a8b9d2c3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c",
    "created_at": "2026-05-06T09:14:02Z",
    "ended_at": "2026-05-06T09:14:18Z"
  },
  "hash_algorithm": "sha256",
  "object_count": 12,
  "event_count": 9
}

A reader who has never seen the producer can answer four things from this object: which version of the format, who wrote it, when, and how big it is. Writers must keep counts honest. Readers must reject malformed or incomplete required fields.

Default hash algorithm is sha256. Readers must support sha256 and may support blake3. Bundles that declare unsupported algorithms must be rejected. Hex appears only in the manifest (session.head) and in object filenames. Hashes inside CBOR-encoded events are 32-byte byte strings.

The events stream

events.bin is the substance. It is a sequence of records, each framed by a 4-byte unsigned big-endian length prefix followed by exactly that many bytes of canonical CBOR encoding one event.

Length-delimited framing was a deliberate trade. It supports partial recovery from truncation and lets a verifier scan deterministically. Producers commit to canonical CBOR per RFC 8949.

Each event encodes:

parents, an array of zero or more parent event hashes.
kind, one of the closed event kinds in v0.1.
emitted_at, a CBOR tag 1 timestamp (integer epoch seconds, or floating point for sub-second precision).
sequence, monotonic per session, starting at 0.

Linkage rules are strict in v0.1:

sequence starts at 0 and increases by exactly 1 per event.
Every event except SessionStart has at least one parent.
SessionStart has exactly zero parents.
Every non-SessionStart event in v0.1 has exactly one parent, and that parent is the hash of the immediately preceding event by sequence. Multi-parent events are reserved for future versions.
Event hashes are computed over canonical CBOR bytes of the full event envelope.
Event ordering in events.bin matches sequence.

Event kinds in v0.1 (closed set)

Readers must recognize exactly these kinds in v0.1. Unknown kinds are rejected.

SessionStart: opens the session, fields cwd_hash and config_hash.
UserTurn: a user prompt, field prompt_hash.
ProviderCall: a model provider call, field provider_id, an attempts[] array of AttemptRecord, optional stream_hash.
ToolCall: a tool execution, fields tool_id, input_hash, output_hash, optional side_effects_hash.
RetrievalCall: a retrieval, fields index_id, query_hash, results_hash.
PermissionGate: a policy decision, fields policy_id, decision (string, recommended lowercase verbs like allowed, denied, deferred), context_hash.
AssistantTurn: an assistant message, fields message_hash, optional tool_calls_hash.
SessionEnd: closes the session, optional summary_hash.

ProviderCall.attempts[] preserves chronological order. Each AttemptRecord has attempt_number (1-indexed), started_at, ended_at, a closed AttemptStatus, request_hash, optional response_hash, optional stream_hash, optional error_message. AttemptStatus is one of Success, RateLimited, NetworkError, ServerError, ClientError, Cancelled, or Other(string). v0.1 readers reject unknown variants.

Two design notes worth calling out.

First, PermissionGate.decision is open in v0.1. The spec recommends lowercase verbs but does not close the set. Closing it is a likely v1.0 change. If you are a producer, follow the recommendation now to make the v1 transition free.

Second, AttemptStatus and EventKind are intentionally closed. The cost is that adding a new variant is a normative spec change. The benefit is that two implementations cannot legitimately disagree on the meaning of a record.

The objects directory

Every object referenced by an event hash field exists as a file at objects/<hex> where <hex> is the lowercase hex digest for the active hash algorithm. Object bytes hash to their filename digest. Objects are opaque bytes; AGEF v0.1 does not require MIME metadata.

A typical bundle has many small objects (prompts, tool inputs, tool outputs, file diffs) and a few large ones (full file contents). Storage is efficient because identical content hashes once.

Hashing rules

default: sha256
optional: blake3

Readers must support sha256. They may support blake3. Within CBOR-encoded events, hashes must be encoded as CBOR byte strings (major type 2) of length 32 for both algorithms. Hex string representation is used only in manifest.json (session.head) and in object filenames. Once you internalize that, the rest of the spec falls into place.

Serialization rules in one paragraph

Event payloads in events.bin use canonical CBOR (RFC 8949). manifest.json is UTF-8 JSON with sorted keys and LF endings. Timestamps in CBOR-encoded events use CBOR tag 1. Producers may emit either integer epoch seconds or floating point; readers must accept both. The reference implementation akmon-journal v0.1 emits integer epoch seconds. Timestamps in manifest.json use RFC3339 strings. Implementations may use any internal storage format; AGEF rules apply only to the bytes emitted in events.bin and to the bytes used for event hashing.

Verification procedure

A conforming verifier runs this sequence:

Extract the archive.
Parse manifest.json. Reject on schema or version failure.
Read events.bin using the 4-byte length-delimited framing.
For each event: decode canonical CBOR, recompute event hash, verify sequence monotonicity, verify all parents resolve to previously seen events, verify referenced content hashes resolve to files in objects/.
For each referenced object: read bytes, hash with manifest.hash_algorithm, compare to filename digest.
Confirm manifest.event_count equals decoded event count, manifest.object_count equals object file count, manifest.session.head equals terminal event hash, and SessionStart is a reachable ancestor of head.

Default operation fails on the first invariant violation. Implementations may offer a "report-all" mode for diagnostics. They must not claim successful verification unless all required checks pass.

Conformance profiles

AGEF v0.1 defines two profiles:

Bundle Profile, an implementation that produces or consumes AGEF bundles.
Substrate Profile, an implementation that maintains an AGEF-compatible content-addressed event journal but does not necessarily emit bundles directly.

A Substrate Profile must be able to produce Bundle Profile output through an export pathway when required.

akmon-journal is currently a Substrate Profile. Akmon Phase 4 introduces Bundle Profile capability.

If you are an implementer, knowing which profile you are claiming is the most important conformance question.

Producing AGEF in practice

If you use Akmon, you already produce AGEF. Sessions land in .akmon/audit/<session-id>.jsonl and .akmon/evidence/<session-id>.json. With Phase 4 export, the same session is portable as a .akmon bundle.

If you build your own producer, the spec lists the libraries known to produce canonical CBOR with the right configuration: ciborium for Rust, fxamacker/cbor for Go, cbor2 for Python, cbor-x for JavaScript and TypeScript. Validate canonical encoding behavior with test vectors before claiming conformance.

Reading and verifying AGEF

For Akmon-produced bundles, the verification commands are part of the trust pipeline:

# Verify audit chain integrity for the live session journal
akmon audit verify .akmon/audit/<session-id>.jsonl

# Verify the evidence summary plus its linkage to the audit chain
akmon evidence verify .akmon/evidence/<session-id>.json

# Once Phase 4 export is on hand, verify a bundle on import
akmon bundle import evidence.akmon --verify-only

Independent verifier implementations are welcome. The spec is small enough to implement in a weekend if you start with a CBOR library that supports canonical encoding.

Security considerations

AGEF provides tamper evidence and portability. It does not provide identity attribution by itself. If producer trust matters in your environment, layer external signing on top of the bundle. Bundles may contain sensitive content; storage and sharing controls are the operator's job. Verification proves integrity, not semantic correctness.

For sharing, the redaction flow is the right answer. With Akmon:

akmon redact <session-id> \
  --output sanitized.akmon \
  --object <object-hash> \
  --reason "Removed customer name before audit handoff"

This produces a derivative bundle in which the targeted objects are replaced by canonical CBOR sentinels. The sentinel payload records the original hash, original size, the supplied reason, and a redacted_at timestamp. The chain still verifies because the sentinel hashes correctly, and the audit trail makes the redaction explicit.

Compatibility and evolution

v0.x is pre-stable. Breaking changes are permitted. Future versions should preserve forward migration guidance. v1.0 is the first stable major. New event kinds in future majors must be version-gated. v0.1 readers must not silently ignore unknown required semantics.

If you build against v0.1, plan for one migration to v1.0. If you are a careful producer, the migration is small.

Why a new format

Three reasons.

First, no shared shape exists today. Every framework, every vendor, every gateway writes its own. A reviewer cannot move between them. The cost of a new format is small compared to the cost of having no shared format.

Second, evidence has a different audience than observability. Observability is for engineers in the moment. Evidence is for reviewers, regulators, customers, and the version of you reading the file in a year. Different shape, different guarantees, different expectations.

Third, regulated engineering needs an answer that does not depend on a vendor dashboard. The portability requirement is real. AGEF is built for it.

What you can do this week

If you build AI tooling, three small steps:

Read SPEC.md end to end. It is short on purpose.
Implement a Bundle Profile verifier in your language of choice. Use the test vectors that ship in the examples/minimal-bundle folder.
Open an issue on the AGEF repo. Tell us what your runtime emits today and where the spec helps or gets in the way.

If you ship a coding agent, the smallest first step is one binary and the trust pipeline. The repo is at github.com/radotsvetkov/akmon. The format spec is at github.com/radotsvetkov/agef. The site is at radotsvetkov.github.io/akmon.

Standards become standards by being used. Implementations are welcome.

The trust pipeline: three commands to run before merging an AI-assisted change

Radoslav Tsvetkov — Thu, 14 May 2026 08:19:27 +0000

The most common failure mode I see when teams adopt AI coding agents is not a bad diff. It is a good diff that no one can defend. The agent ran. The session closed. Three days later, somebody asks how the change came to be, and there is nothing to point at.

This article is about closing that gap with three commands. Akmon's trust pipeline is audit verify, evidence verify, and slo verify. Each one is fast, deterministic, and gates cleanly in CI. With them, "the agent did it" stops being a hand wave and becomes an artifact.

The code in this post uses real commands from Akmon v2.0.0.

What "trust pipeline" means

Every Akmon session writes two files when it ends.

.akmon/audit/<session-id>.jsonl: a tamper-evident audit chain of every prompt, model response, tool call, and policy decision.
.akmon/evidence/<session-id>.json: a structured evidence summary, with replay metadata and a hash that links back to the audit chain.

Three commands take those files and produce signals you can trust.

# 1. Audit chain integrity.
akmon audit verify .akmon/audit/<session-id>.jsonl

# 2. Evidence schema and linkage to the audit chain.
akmon evidence verify .akmon/evidence/<session-id>.json

# 3. Reliability metrics against thresholds.
akmon slo verify .akmon/evidence/<session-id>.json --strict

Each command exits 0 for pass, 1 for failure, and (for SLO) 2 for invalid input or config. Three exit codes, three crisp signals.

Step 1. Audit chain verification

akmon audit verify walks the JSONL chain and checks the cryptographic linkage between events. If a single byte was edited or a record was dropped, it fails.

$ akmon audit verify .akmon/audit/2025-05-06_abcd.jsonl
audit chain valid (events: 47, head: 5c1f...)
$ echo $?
0

For CI:

$ akmon --output json audit verify .akmon/audit/2025-05-06_abcd.jsonl | jq '.valid'
true

Why this matters: the audit chain is the substrate. If the chain does not verify, nothing downstream is meaningful. Pin this command at the start of any review workflow.

Step 2. Evidence verification

akmon evidence verify checks the evidence summary's schema, the replay metadata shape, and the linkage to the audit chain. Schema means the file matches the documented evidence schema for v2.0.0. Linkage means the recorded audit hash matches the actual head of the audit chain.

$ akmon evidence verify .akmon/evidence/2025-05-06_abcd.json
evidence valid (linked audit head: 5c1f...)

This step catches a class of failure most teams underestimate: an evidence file that was generated correctly but later got out of sync with its audit chain (for example, because of a partial copy). Evidence verify is cheap. Run it on every artifact you intend to share.

Step 3. SLO verification

akmon slo verify evaluates run reliability metrics against thresholds. Examples include tool success rate, replay determinism, attempt counts, and policy gate denials. The CLI accepts thresholds inline or from a TOML file:

$ akmon slo verify .akmon/evidence/2025-05-06_abcd.json --strict
all SLO thresholds met

$ akmon slo verify run.json --thresholds .akmon/slo.toml
threshold breached: tool_success_rate (0.93 < 0.95)
$ echo $?
1

You can also pass a single threshold inline:

$ akmon --output json slo verify run.json --min-tool-success-rate 0.95 | jq '.passed'
false

Strict mode treats skipped checks as failures. That is the right setting for CI gating.

Bonus step. SLO trend

Single runs are noisy. Trend mode compares the current run against a baseline window, so you catch regressions you would miss in a one-shot check.

$ akmon slo trend .akmon/evidence/2025-05-06_abcd.json \
  --baseline-dir .akmon/evidence/history \
  --window 20 \
  --strict
regression: median tool latency increased from 142ms to 318ms over baseline

With JSON output, this is a clean alert you can post to Slack or PagerDuty. Plug it in once and forget it.

Putting the pipeline in CI

A practical GitHub Actions snippet:

- name: Run Akmon task headlessly
  run: |
    akmon --yes --output json --task "$AKMON_TASK" | tee run.json
    echo "session_id=$(jq -r '.session_id' run.json)" >> $GITHUB_OUTPUT

- name: Verify audit chain
  run: akmon audit verify .akmon/audit/${{ steps.run.outputs.session_id }}.jsonl

- name: Verify evidence
  run: akmon evidence verify .akmon/evidence/${{ steps.run.outputs.session_id }}.json

- name: Enforce SLO thresholds
  run: |
    akmon slo verify .akmon/evidence/${{ steps.run.outputs.session_id }}.json \
      --thresholds .akmon/slo.toml --strict

- name: Trend against baseline
  run: |
    akmon slo trend .akmon/evidence/${{ steps.run.outputs.session_id }}.json \
      --baseline-dir .akmon/evidence/history \
      --window 20 \
      --strict

If any step fails, the merge stops. If all four pass, the change has the artifacts behind it.

Picking the right SLO thresholds

Two patterns work in practice.

First, conservative thresholds for production policy profiles:

# .akmon/slo.toml
min_tool_success_rate = 0.97
max_replay_divergence_ratio = 0.0
max_provider_retry_attempts = 3
max_permission_denials = 0

Second, looser thresholds for exploratory work:

min_tool_success_rate = 0.85
max_replay_divergence_ratio = 0.05
max_provider_retry_attempts = 5

Keep the file in version control. When a threshold changes, the change is reviewable.

What this catches in real life

A quick list of incidents the pipeline has caught for me or my testers:

A flaky tool that was failing once in twenty calls. SLO trend caught it in a week.
A model that started retrying on transient 5xx errors after a vendor change. Audit chain still valid, evidence valid, but SLO max_provider_retry_attempts surfaced the new pattern.
A reviewer accidentally edited the JSONL file (added a newline). akmon audit verify failed loudly.
A copy-paste of the evidence file from one machine to another, where the audit file did not come along. Evidence verify caught the missing linkage.

None of these are exotic. Each one is the kind of failure that quietly degrades trust in AI-assisted work over a quarter.

Why this is different from a dashboard

Three reasons.

First, the pipeline runs in CI. Dashboards do not gate merges. Exit codes do.

Second, the artifacts are portable. The verifier on your machine is the verifier on the auditor's machine.

Third, the schema is fixed. Evidence verifies against a documented schema. SLO thresholds live in a TOML file. There is no dashboard to misconfigure between two clicks.

If you want both (the visual story and the gating), keep your existing observability. Akmon's trust pipeline runs alongside it.

Where this fits in the bigger picture

The trust pipeline is the first half of the answer. The second half is replay, which I cover in the next post in this series. Replay turns "the artifact verifies" into "we can re-execute the session and see if anything diverges". For now, the trust pipeline is enough to gate any AI-assisted change with confidence.

The repo is at github.com/radotsvetkov/akmon. The format spec is at github.com/radotsvetkov/agef. The site is at radotsvetkov.github.io/akmon.

If your team wants the AI productivity but cannot give up review discipline, three commands are a fair price.

MCP governance for an AI coding agent without breaking the audit chain

Radoslav Tsvetkov — Thu, 14 May 2026 08:19:15 +0000

The Model Context Protocol gave AI agents a clean way to reach into systems. In a year it has become the default tool surface for serious agents. That is mostly good news. The mostly is the operative word.

Without care, MCP servers fragment the audit story. Tool calls land in three places: in the agent's runtime, in the gateway, and in the MCP server's own logs. None of them line up. By the time a regulator asks what happened, you have three formats and zero answers.

This post walks how I wire MCP into Akmon. The result is a single audit chain that captures every MCP tool call as a ToolCall event, with deterministic linkage in the journal, and a reviewable AGEF bundle at the end.

The commands and flags here are real Akmon v2.0.0 surface. I will note where Phase 4 features apply.

Why MCP audit logging is harder than it looks

Three reasons.

First, MCP is a fan out. A single agent session can call three MCP servers, each with its own tools, its own version, and its own logging story. If you log per server, you have three formats. If you log only at the agent level, you lose tool-level detail.

Second, tool inputs are user data. Tool arguments are not metadata; they are the conversation. Logs leak content unless you redact. AGEF makes content addressing explicit, so the bundle can carry the inputs in a way that survives review.

Third, the chain matters. A flat list of tool calls does not let you say "this file change was caused by this tool call which was caused by that model response". You need the chain. Akmon's audit chain and AGEF events are designed for that.

Solve those three and your audit is fast. Skip them and your audit is a forensic project.

Wiring MCP into Akmon

Akmon registers MCP servers with the --mcp-server flag, repeatable.

akmon --mcp-server https://mcp.tools.internal/orders \
      --mcp-server https://mcp.tools.internal/calendar \
      --task "Open a ticket for the failing tests and link the related order"

Every MCP tool the agent calls becomes a ToolCall event in the audit chain. The event records tool_id, input_hash, output_hash, and an optional side_effects_hash. The hashes resolve to objects in the AGEF bundle, so the inputs and outputs are recoverable, content-addressed, and verifiable.

For interactive runs, the same flag works:

akmon chat --mcp-server https://mcp.tools.internal/orders

For headless CI, combine MCP servers with a budget cap and JSON output:

akmon --yes --output json \
      --mcp-server https://mcp.tools.internal/orders \
      --mcp-server https://mcp.tools.internal/calendar \
      --max-budget-usd 2.50 \
      --task "summarize failing tests and create a ticket"

If you have an organization that prefers a TOML manifest for MCP servers, encode them in a policy pack and pull from there. The pack belongs to the team. The merge order keeps it deterministic.

What gets recorded for each MCP tool call

A ToolCall event in AGEF v0.1 carries:

tool_id, a stable identifier for the tool.
input_hash, hash of the canonical CBOR encoding of the input.
output_hash, hash of the output.
side_effects_hash, optional, when the tool changed the world.

The full input and output bytes live in objects/<hex> files. The hash links the event to the bytes. A reviewer can recover the exact input and output by walking the hash to the file.

A small example, viewed through akmon inspect:

$ akmon inspect <session-id> --resolve
SessionStart  cwd=hash:8a91...
UserTurn      prompt=hash:7a91...
ProviderCall  provider=anthropic attempts=1 status=Success
AssistantTurn message=hash:b3c1...
ToolCall      tool=mcp.orders/lookup_order input=hash:9c1f... output=hash:c2a8...
PermissionGate policy=write_safe decision=allowed
ToolCall      tool=mcp.calendar/create_event input=hash:e7d4... output=hash:f1a3...
AssistantTurn message=hash:5b62...
SessionEnd    summary=hash:0a13...

This is what an MCP-rich session looks like in evidence form. Every event has a hash linkage. Every input and output resolves to a file in objects/.

Policy at the MCP boundary

Policy in Akmon is a deterministic merge of profile, packs, project policy, and CLI override. For MCP, that means three useful levers.

First, restrict the tool surface in the profile. prod already has explicit-deny posture for side effects. You can layer named MCP tools on top.

Second, write a pack that allows specific MCP servers and named tools.

# .akmon/policy-packs/org-mcp.toml
[mcp]
allowed_servers = [
  "https://mcp.tools.internal/orders",
  "https://mcp.tools.internal/calendar"
]

[mcp.tools]
"mcp.orders/lookup_order" = { allowed = true }
"mcp.orders/refund" = { allowed = true, requires_approval = true }
"mcp.calendar/create_event" = { allowed = true }
"mcp.calendar/delete_event" = { allowed = false }

Third, see what is actually applied:

akmon policy show-effective \
  --profile prod \
  --policy-pack .akmon/policy-packs/org-mcp.toml

Effective policy is print-able. That ends the "I think the rule is" conversation.

Redacting MCP-related content before sharing

Some MCP tool inputs or outputs cannot leave the building. Akmon's redact command produces a derivative bundle in which selected objects are replaced by canonical CBOR sentinels.

akmon redact <session-id> \
  --output sanitized.akmon \
  --object <input-hash> \
  --object <output-hash> \
  --reason "Removed customer name before audit handoff" \
  --format json

The chain still verifies because the sentinel hashes correctly. The redaction reason is recorded. The bundle can be shared with auditors without exposing the redacted content.

Verify before sharing:

akmon bundle import sanitized.akmon --verify-only

Writing an MCP server that plays well with Akmon

If you maintain an MCP server, three small choices make audit life easier downstream.

First, make tool inputs and outputs canonicalizable. Avoid embedding wall-clock times in fields the agent does not control. Avoid random IDs in the response when the response could be deterministic.

Second, expose a stable tool ID and a stable version. The tool_id in AGEF is whatever the runtime emits. Namespacing your tools (mcp.orders/lookup_order) keeps the bundle readable a year from now.

Third, surface side effects explicitly. If a tool has a side effect, return enough metadata for side_effects_hash to make sense (for example, the new resource ID, the affected entity, the change type).

These are small DX choices that compound. They do not require any spec change.

Replay across MCP servers

akmon replay runs a session against the recorded providers and tools. For MCP-heavy sessions, replay surfaces divergence when an MCP server changed behavior in a way you did not intend.

akmon replay <session-id> --mode strict --format json

In strict mode, replay treats mismatches more aggressively. That is the right mode for CI gating after a deployment that touched an MCP server. If the replay diverges, you know about it before the next session.

A practical migration plan

If you are adding MCP to an existing Akmon setup:

Start with one MCP server, registered via --mcp-server. Run a few sessions interactively. Inspect the journal.
Lock the allowed list in a pack. Run akmon policy show-effective to confirm.
Add a CI job that runs the trust pipeline (audit verify, evidence verify, slo verify) on every session.
Add slo trend against a baseline window of recent sessions. Alert on regression.
When you are ready to share with an external reviewer, redact and export with bundle export (Phase 4).

A short progression, weeks, not months.

Where this leaves you

After the migration, three things are true.

First, every MCP tool call is in the audit chain. The chain verifies. The evidence verifies.

Second, the policy is explicit and inspectable. Adding a new MCP server is a pack change reviewed in version control.

Third, you can hand any session to a reviewer with a single bundle. The reviewer needs the AGEF format, not your stack.

The repo is at github.com/radotsvetkov/akmon. The format spec is at github.com/radotsvetkov/agef. The site is at radotsvetkov.github.io/akmon. The next article in the series goes deep on replay and divergence detection, which is where MCP-heavy sessions earn their keep most often.

Replay AI coding sessions deterministically: the divergence detector for your repo

Radoslav Tsvetkov — Thu, 14 May 2026 08:18:57 +0000

There is a moment in every AI coding workflow where you wish you could roll the tape back. The agent did something on Tuesday. By Thursday the model has shifted. The tool surface has shifted. You cannot reproduce the issue, and you cannot prove the change was sound.

Replay is the answer. Akmon's akmon replay command takes a recorded session and re-executes it against the providers and tools that were recorded with it. The output is a pass or fail. In strict mode, it is a CI gate. In default mode, it is a divergence report.

This article walks the actual command, the modes, the exit codes, and the failure patterns I have caught with it. Everything here is real Akmon v2.0.0 surface.

Why replay matters more than logs

A log file tells you what happened. Replay tells you whether the same session would happen again. The two are not the same.

If a session passes replay, the artifact is meaningful, even months later. If it diverges, the diff is the bug report. Either way you know more than you did from the log.

Three concrete jobs replay does well.

Detect regressions in your tool surface (a tool change that breaks an old session you already shipped).
Detect regressions in your provider behavior (a model upgrade that changes the agent's plan).
Validate that a redacted bundle is still a valid session.

In a regulated codebase, the first two are the difference between trust and a quarterly fight.

The command

The replay command is small.

akmon replay <session-id> [OPTIONS]

Common options:

akmon replay <session-id> \
  --journal <path> \
  --mode <default|strict> \
  --persist --persist-to <path> \
  --format <human|json>

A first run, in default mode:

$ akmon replay 550e8400-e29b-41d4-a716-446655440000
replay completed: passed=true (events: 47)

In strict mode for tighter mismatch handling:

$ akmon replay 550e8400-e29b-41d4-a716-446655440000 --mode strict
replay completed: passed=true (strict)

In CI with JSON:

$ akmon replay 550e8400-e29b-41d4-a716-446655440000 --format json | jq '.passed'
true

The exit code map

Code	Meaning
`0`	Replay completed with no divergences (`passed: true`)
`1`	Replay completed with divergences (`passed: false`)
`2`	Usage error (invalid arguments or invalid flag combinations)
`3`	I/O or environment error (missing source session, malformed source, unwritable persist target)

For CI gating, 0 is success, 1 is a hard failure, 2 and 3 are operator errors that should not be ignored.

Persisting a replay

If you want to keep the replay output for later review (good practice when investigating an incident), persist it to a target journal.

akmon replay 550e8400-e29b-41d4-a716-446655440000 \
  --persist \
  --persist-to ./replay-journal

The persisted output is itself a journal. It carries its own evidence and is itself replayable. That sounds recursive. It is. It is also exactly what you want for a forensic record.

Patterns that catch real bugs

A few patterns from my own work and from testers.

Pattern 1. Run replay nightly against last week's sessions

If your team runs Akmon in CI for a week, you have a sample of real sessions. A nightly job that replays them in strict mode is the cheapest regression detector you can build.

for s in $(ls .akmon/audit/ | tail -n 50); do
  sid=$(basename "$s" .jsonl)
  akmon replay "$sid" --mode strict --format json | jq --arg sid "$sid" '.passed | tostring + " " + $sid'
done

If a tool change broke an old session, you know.

Pattern 2. Run replay on PRs that touch tools

Tools are how AI agents reach the world. A PR that touches a tool wrapper, an MCP server, or a CI runner is the most common place to break replay. Wire a job that replays a small set of canonical sessions whenever those paths change.

Pattern 3. Replay a redacted bundle

After running akmon redact to produce a sanitized derivative bundle, replay confirms the bundle is still a valid session.

akmon redact <session-id> \
  --output sanitized.akmon \
  --object <object-hash> \
  --reason "PII removal"

akmon bundle import sanitized.akmon --verify-only

Verify-only checks integrity. For a fuller round trip, replay the redacted session inside a sandbox.

Pattern 4. Replay across providers

If you switch model providers (say, from a cloud provider to Ollama for sensitive work), replay tells you whether the new provider produces an equivalent session for the same prompt and tools. This is not a guarantee of behavior, but it is the closest thing in practice.

What replay cannot do

Three honest disclaimers.

First, replay does not reach across the network for fresh model calls by default. The point is to compare against recorded behavior. If you re-execute against a live model, that is a different test, and it should be labeled differently in your CI.

Second, divergence in default mode is informational. The session might still be useful even if a tool returned a slightly different output. Strict mode is the right setting when the test is "is this session reproducible".

Third, replay does not inspect the meaning of the output. It checks structural and field-level equivalence. Semantics are still your job.

Reading a divergence report

When replay fails, the JSON output is the place to start.

$ akmon replay <session-id> --mode strict --format json | jq
{
  "session_id": "550e8400-...",
  "passed": false,
  "mode": "strict",
  "divergences": [
    {
      "event_index": 14,
      "event_kind": "ToolCall",
      "tool_id": "mcp.orders/lookup_order",
      "field": "output_hash",
      "expected": "9c1f...",
      "actual": "a72b...",
      "note": "side effects hash differs"
    }
  ]
}

The structure points you straight at the change. From there, the workflow is the usual one: inspect the source session, inspect the live behavior, decide whether the change was intentional, and either fix the bug or update the recorded baseline.

Putting replay in CI

A small GitHub Actions snippet:

- name: Replay session in strict mode
  run: |
    akmon replay ${{ inputs.session_id }} --mode strict --format json | tee replay.json
    test "$(jq -r '.passed' replay.json)" = "true"

Pair this with the trust pipeline (audit verify, evidence verify, slo verify) and you have a very small set of commands that gate every AI-assisted change in your repo.

Storage and retention

Sessions are content-addressed. Identical content hashes once. A typical session lands in a few hundred kilobytes of audit JSONL plus a small evidence JSON. Bundles are larger because they carry the resolved objects, but compression and content addressing keep them under a megabyte for most engineering tasks.

For retention, follow the longest of (your contractual minimum, your regulatory minimum, your incident review window). Keep the audit JSONL with the same lifecycle as your CI artifacts.

Where replay fits in the bigger story

The trust pipeline gives you crisp signals from a single session. Replay gives you crisp signals across time. Together they turn an AI coding agent into a system you can defend in front of a reviewer.

The repo is at github.com/radotsvetkov/akmon. The format spec is at github.com/radotsvetkov/agef. The site is at radotsvetkov.github.io/akmon.

The next article in this series is the redaction workflow, the part of the kit you reach for the day before an external review.

A pragmatic threat model for AI coding agents, with controls you can ship today

Radoslav Tsvetkov — Thu, 14 May 2026 08:18:44 +0000

There is a moment in every AI coding rollout where the question shifts from "can we make this work" to "what is the worst thing this can do". If you have not had that moment yet, this article will save you a quarter.

The OWASP Top 10 for Agentic Applications, published in late 2025, is the cleanest shared vocabulary we have for the failure modes. It is short, opinionated, and useful. This post takes each item, names the failure pattern in plain language, and pairs it with a control you can ship around an AI coding agent today.

The configuration shown uses Akmon's policy profiles, packs, and CLI flags. The pattern is general; if you use a different tool, the lessons translate.

How to read each section

For each item:

What it is, in one paragraph.
The failure story, the kind of incident this prevents.
The control, the actual lever, with code or commands.
The trade off, the thing the control costs you.

1. Prompt injection in tool inputs

What it is. A tool returns text. The text contains a hidden instruction. The agent reads the text and the next decision is reshaped.

The failure story. A scout dossier reads a third-party README. The README has a hidden instruction. The agent later writes a config that the README told it to write.

The control. Use the prod profile in production paths. Restrict the tool surface in a pack. Constrain web_fetch to allowed hosts only.

# .akmon/policy-packs/web.toml
[network]
web_fetch_allowed_hosts = ["docs.example.com", "api.internal"]
web_fetch_require_https = true

akmon --policy-profile prod --policy-pack .akmon/policy-packs/web.toml \
      --task "summarize the API spec at docs.example.com/spec.html"

The trade off. Some legitimate fetches will fail until you add the host. That is a feature.

2. Excessive agency

What it is. The agent has access to tools it does not need for the task. Breadth becomes surface area.

The failure story. A documentation task has access to a shell tool that runs migrations. The model invents a migration command on a misread.

The control. Profile-driven tool surface. Use --plan for read-only scoping before a real run. Add --add-dir to lock the sandbox.

akmon --policy-profile prod --plan \
      --task "list outdated dependencies and propose updates"

The trade off. Two-step workflow. Plan first, implement second. Worth it.

3. Sensitive information disclosure

What it is. Sensitive data ends up in the model context, the logs, or the agent output.

The failure story. A test fixture has a real customer record. The agent surfaces the record in a comment in a generated PR.

The control. Redact specific objects from the session before sharing. Use Ollama for sensitive paths so the prompt never leaves the machine.

akmon redact <session-id> \
  --output sanitized.akmon \
  --object <object-hash> \
  --reason "Customer record removed before audit handoff"

The trade off. Redaction adds friction at handoff. The friction is the point.

4. Improper output handling

What it is. The agent's output is rendered or executed somewhere it should not be.

The failure story. The agent writes a Markdown reply that includes a fake confirmation block. A downstream automation parses the block as a structured action.

The control. Force structured output where it matters. Use --output json in headless flows so the response is machine-parseable, and validate against your own schema downstream.

akmon --yes --output json --task "$task" | jq '.summary'

The trade off. Free-form prose has its place. For action-triggering paths, structured output is non negotiable.

5. Supply chain weaknesses

What it is. A dependency the agent uses changes in a way that affects behavior.

The failure story. An MCP server you use upgraded a tool. The output shape changed. The agent silently misroutes.

The control. Pin model and tool versions in AKMON.md and in policy packs. Run akmon replay in strict mode for a small set of canonical sessions on every PR that touches a tool wrapper.

akmon replay <baseline-session-id> --mode strict --format json | jq '.passed'

The trade off. A small set of canonical sessions has to be maintained. Treat them as part of your test suite.

6. Insecure plugin or tool design

What it is. A tool was designed without least privilege in mind.

The failure story. A generic http.fetch tool can hit any URL, including internal addresses.

The control. Restrict web_fetch to public allowed hosts and an HTTPS requirement. Use --add-dir to lock filesystem reads to the project root. Avoid generic shell tools in production profiles.

akmon --policy-profile prod \
      --add-dir ./src --add-dir ./docs \
      --task "patch the parser to accept ISO 8601 with offset"

The trade off. Some legitimate workflows need broader access. Use staging for those, not prod.

7. Excessive resource consumption

What it is. The agent loops, retries, or expands recursively. Tokens, dollars, and tool calls climb without a ceiling.

The failure story. A planning prompt recurses. Over a long evening it racks up provider charges.

The control. Use --max-budget-usd for headless runs. Use --fallback-model for graceful degradation. Use slo verify to alarm on retry attempts.

akmon --yes --max-budget-usd 2.50 --fallback-model "ollama:qwen-coder-7b" \
      --task "..."

akmon slo verify .akmon/evidence/<session>.json --strict

The trade off. Some legitimate runs will hit the budget. Calibrate per task class. Alarm and review.

8. Vector and embedding weaknesses

What it is. Retrieval introduces content from an index. If the index can be poisoned, your prompts are poisoned.

The failure story. A staging dataset got merged into the production index. An old test record contained a prompt injection. The production agent surfaced it on a real query.

The control. Provenance on every index entry. Use the spec workflow (akmon spec) to gate retrieval-heavy changes through a planning step. Treat RetrievalCall events as the audit trail.

akmon --index --policy-profile prod \
      --task "ground the implementation in the design doc"

The trade off. Provenance metadata is harder to retrofit than to design in. Start with the most sensitive index.

9. Misinformation and overreliance

What it is. The agent claims things confidently that are not true. The user trusts the agent.

The failure story. The agent invents a function in a library. The reviewer trusts it. CI catches it; the team's calibration drops.

The control. Require structured outputs for fact-bearing tasks. Use --architect for two-phase plan plus implementation, where the planner uses a stronger model. Layer human review for any change that touches public APIs.

akmon --architect --planner-model "anthropic:claude-sonnet-4-6" \
      --task "design and implement the JWT rotation flow"

The trade off. Two-phase costs more tokens. The reviewer can read the plan first, which is its own win.

10. Unbounded consumption of context

What it is. Context grows over a long session. Old, irrelevant content shapes new decisions.

The failure story. A multi-hour session keeps adding context until the model truncates from the middle. Behavior shifts in ways nobody can explain.

The control. Use the spec workflow to break work into discrete sessions. Use --continue and --session deliberately, not by habit. Inspect the session journal periodically.

akmon spec parser-iso8601-offset "Accept ISO 8601 timestamps with timezone offsets"
akmon spec parser-iso8601-offset design
akmon spec parser-iso8601-offset tasks
akmon spec parser-iso8601-offset implement

The trade off. Less ambient context. The win is reproducibility.

Putting it together

If you implemented all ten controls, you would have a system with:

A small, well-known tool surface, profile-driven.
Structured output where it matters.
Redaction available before any external handoff.
Hard caps on resource use.
Provenance on retrieval, with RetrievalCall events as evidence.
Replay-based regression detection in CI.

Most teams will not need all ten on day one. Pick the three that match your top risks. Get them in production. Watch them work. Add the rest as they earn their place.

Configuration to put in a pack

A small pack to start with, ready to drop in .akmon/policy-packs/:

# .akmon/policy-packs/baseline.toml
[network]
web_fetch_allowed_hosts = ["docs.example.com", "api.internal"]
web_fetch_require_https = true

[shell]
allowed_commands = ["cargo", "npm", "go", "make", "git"]
deny_commands = ["rm -rf", "sudo"]

[mcp]
allowed_servers = ["https://mcp.tools.internal/orders"]

[tools]
web_fetch_default = "deny"
shell_default = "ask"

Then run:

akmon --policy-profile prod \
      --policy-pack .akmon/policy-packs/baseline.toml \
      --task "..."

Inspect the merged effective policy:

akmon policy show-effective \
  --profile prod \
  --policy-pack .akmon/policy-packs/baseline.toml

The honest part

Controls do not make AI coding agents safe. They make AI coding agents survivable. There will still be model failures, tool bugs, and edge cases nobody saw. The job of the policy and evidence layer is to keep the consequences small and the explanations available.

If you want to dig deeper into the evidence side of the loop, the next post in this series breaks down the redaction workflow. The repo is at github.com/radotsvetkov/akmon. The format is at github.com/radotsvetkov/agef. The site is at radotsvetkov.github.io/akmon.

AI coding compliance for 2026: a working checklist for ISO 42001, the EU AI Act, SOC 2, and tool qualification

Radoslav Tsvetkov — Thu, 14 May 2026 08:18:30 +0000

If you ship AI-assisted code in 2026, three regulatory things have changed under your feet.

In December 2025, OWASP published the Top 10 for Agentic Applications. In April 2026, Microsoft released the Agent Governance Toolkit. In August 2026, the EU AI Act high-risk obligations take effect. ISO 42001 has become the AI management system standard auditors expect. NIST AI RMF is the framework most US agencies and primes will reference. The Colorado AI Act starts enforcement in June 2026. Tool qualification frameworks (DO-178C and DO-330 for avionics, IEC 62304 for medical devices, ISO 26262 for automotive, CMMC for defense) treat AI tooling with the same scrutiny they applied to legacy code generators.

That is a lot of paper. The good news is that most of it points at the same operational pattern. You need to know what your AI did, you need to enforce policy at the tool surface, you need evidence you can hand to a third party, and you need a retention story.

This post is a working checklist that maps each of those frameworks to actual Akmon commands. Map to your own controls as needed.

The shape of the work

Almost every AI compliance program asks for five things, in different language.

A documented inventory of AI systems, with risk classifications and owners.
A policy framework, ideally enforced at runtime, not only documented.
An audit trail of agent activity, with the integrity to be admissible.
A retention story for that audit trail.
A way to extract evidence on demand, in a format a non-engineer can read.

If you build those five, you are most of the way to compliance with most frameworks. The frameworks layer on specific controls that fit your risk profile.

ISO 42001, the management system standard

ISO 42001 is a management system standard, like ISO 27001 but for AI. It does not tell you how to build the agent. It tells you how to govern the work.

The relevant controls for AI coding agents:

A.5 AI policies. Document who owns what.
A.6 Internal organization. Roles and responsibilities for AI.
A.7 Operational planning and control. Including evidence of operations.
A.8 Performance evaluation. Continuous monitoring and review.

The translation to engineering work, with the Akmon command that produces the evidence:

Control	What you actually do	Akmon command
A.5	Maintain an AI inventory with risk classification	An internal doc, ideally rendered from the agents that emit AGEF
A.6	Assign owners for each agent	A field in the project's `AKMON.md` plus your CMDB
A.7	Run policy at the tool boundary, log every call	`akmon audit verify`, `akmon evidence verify`
A.8	Review evidence regularly, close the loop	`akmon slo verify`, `akmon slo trend`

The trust pipeline already does most of A.7 and A.8. The rest is documentation.

EU AI Act Article 12, the recording requirement

For high-risk AI systems, the EU AI Act requires automatic recording of events that are relevant to the operation of the system. This is not a logging best practice. It is an obligation.

The recording must:

Capture events automatically.
Cover the operating life of the system.
Be of sufficient detail to investigate incidents.
Be retained for a period appropriate to the system's purpose.

The translation to engineering work:

Requirement	What you actually do	Akmon evidence
Automatic recording	Record on every tool call, every model call, every policy decision	`.akmon/audit/<session>.jsonl`
Coverage	Make sure every session produces a record	Index by session ID, alarm on gaps
Sufficient detail	Inputs, outputs, decisions, parent and child IDs	AGEF event kinds, content-addressed objects
Retention	Store records for the contractual or legal minimum	Lifecycle policy on your storage

Article 12 does not say "use AGEF". It says automatic, sufficient, retained. AGEF satisfies those structurally.

NIST AI RMF, the risk management framework

NIST AI RMF is voluntary, US flavored, and well respected. The four functions are Govern, Map, Measure, Manage.

For AI coding agents:

Govern: assign accountability for AI tools used in development.
Map: classify the risk of each agent and document tools it can use.
Measure: monitor for misbehavior, track metrics.
Manage: respond to incidents, retire systems that fail.

If you have ISO 42001 covered, NIST AI RMF is mostly a vocabulary translation. Same data, different headers in the report.

SOC 2 for AI engineering controls

For SOC 2 audits, AI coding agents tend to fall under CC4 (monitoring) and CC7 (system operations). Some auditors are starting to ask about CC2.3 (communications about responsibilities) for AI-specific roles.

What auditors want to see:

A documented set of AI controls, with owners.
Evidence that the controls run, not just that they exist.
Logs that demonstrate operations and exceptions.
Incident history with root cause and remediation.

The Akmon trust pipeline maps cleanly. SLO verify gives you the "controls run". Evidence verify gives you the "logs that demonstrate operations". Replay gives you the "incident history" framing.

Tool qualification frameworks

This is where Akmon is most differentiated. Generic AI agents do not address tool qualification at all. Akmon does.

DO-178C and DO-330 (aerospace and avionics)

DO-330 covers tool qualification. AI tools used in development have to either qualify or be classified as not affecting the certified output. Akmon's evidence chain is the artifact you need to make the case. The tool qualification kit (TQK) typically wants a deterministic procedure and recorded artifacts. Replay against recorded providers and tools is the closest thing in the AI space to a deterministic procedure.

IEC 62304 (medical device software)

IEC 62304 cares about the software life cycle. AI assistance in development is part of that life cycle. The evidence Akmon produces fits the V&V records expected at most safety classifications. The redaction flow is critical for protected health information.

ISO 26262 (automotive)

For ASIL-rated software, traceability is mandatory. AGEF's content-addressed events plus replay give you a defensible answer when an auditor asks where a particular line came from. The spec workflow (akmon spec) is the right entry point for high-ASIL changes.

CMMC (defense)

CMMC level 2 and above care about access control and audit. Akmon's policy profiles, the explicit deny posture in prod, and the audit chain map to several practices in AC, AU, and CM domains. Local-first execution (Ollama or your hosted endpoint inside a controlled environment) keeps controlled unclassified information inside the boundary.

OWASP Top 10 for Agentic Applications

Published December 2025. The list is technical, not regulatory, but it has become the shared vocabulary for failure modes. If you have not mapped your agent's risks to the list, do it.

The most relevant items for an AI coding agent:

LLM01 prompt injection. Tool inputs that contain hidden instructions. Mitigated by policy at the tool boundary, the prod profile, and constrained web_fetch allow lists.
LLM03 sensitive information disclosure. Mitigated by akmon redact for outputs and by careful provider choice.
LLM06 excessive agency. Mitigated by the prod policy profile and by team-specific packs that lock the tool surface.
LLM08 vector and embedding weaknesses. Where your agent uses retrieval, the events emit RetrievalCall records, so you can audit what was retrieved.

A control map document lives in the docs. The mapping is concrete: each control points at a command or a configuration knob.

A short, copyable checklist

If you have one afternoon:

Make an inventory. List your agents, owners, and risk levels in a one page document.
Stand up Akmon in your repo. Choose staging profile. Add one organization pack.
Run a session, then run the trust pipeline. Confirm the three exit codes are 0.
Map three controls each to ISO 42001 A.7, EU AI Act Article 12, and SOC 2 CC7. Use the AGEF event kinds as evidence.
Set a retention policy. Pick a number, document it, automate the lifecycle.
Schedule a weekly evidence review. One person, one hour, one summary.

If you have a quarter:

Cover all repos with the same Akmon binary and policy framework.
Build a control map that ties each AGEF event kind and each command to a control across frameworks.
Add slo trend to detect regressions across recent sessions.
Add a customer-facing surface (a small page in your help center) that explains what your AI coding agent records.
Walk your auditor through a sample session. The first one will tell you what is missing.

What this gets you

What you get: an audit trail that survives a third-party review, a faster path through customer security questionnaires, and a much shorter incident loop.

What you do not get: a guarantee that the model behaves. The model is the model. The job of governance is to make the consequences of misbehavior bounded, observable, and provable.

If you want a place to start, install Akmon and run the trust pipeline on one session. The repo is at github.com/radotsvetkov/akmon. The format is at github.com/radotsvetkov/agef. The site is at radotsvetkov.github.io/akmon.

The next article in this series goes deep on the redaction workflow, which is the part of the kit you reach for the day before an external review.

Sanitizing AI coding sessions before external review: the redaction workflow that ships

Radoslav Tsvetkov — Thu, 14 May 2026 08:18:15 +0000

The day before an external review is the day you discover what is actually in your AI coding sessions. A real customer name buried in a tool input. A snippet of restricted code in a model response. A path that exposes more about your infrastructure than you wanted to share.

The wrong answer is to clean it up by hand. By-hand redaction breaks integrity. The right answer is a redaction workflow that produces a sanitized derivative bundle with sentinels in place of the redacted content, a recorded reason on every change, and a verification step before sharing.

This article walks Akmon's redact command end to end. The commands and the sentinel format are real Akmon v2.0.0 surface.

Why "I will edit it by hand" does not survive

Three reasons.

First, the audit chain is cryptographic. If you change a byte without producing a new chain, the chain stops verifying. Your auditor opens the file, runs audit verify, gets a failure, and the conversation is over.

Second, redaction without a reason is not auditable. The reviewer needs to know that this object was redacted on purpose, and why. A sentinel with a reason field is the record.

Third, ad hoc edits do not round-trip. If you produce a clean derivative bundle that re-imports cleanly and verifies offline, you have something portable. If you produce a one-off edit, you have something that might or might not survive transport.

The workflow below avoids all three failure modes.

The command

akmon redact <session-id> [OPTIONS]

A typical invocation:

akmon redact <session-id> \
  --output sanitized.akmon \
  --object <object-hash> \
  --reason "PII removal" \
  --format json

For multiple objects, repeat --object:

akmon redact <session-id> \
  --output sanitized.akmon \
  --object <hash1> \
  --object <hash2> \
  --object <hash3> \
  --reason "Removed customer names and one credential reference"

Find the right hashes with akmon inspect --resolve:

akmon inspect <session-id> --resolve | less

That is the loop: inspect, decide, redact.

The sentinel format

Every redacted object is replaced by a canonical CBOR sentinel. The payload looks like this in JSON form:

{
  "akmon_redacted": true,
  "original_hash": "<hex of original>",
  "original_size": 1024,
  "reason": "<text from --reason>",
  "redacted_at": "<RFC3339 timestamp>"
}

A reviewer reading the bundle can see that an object was redacted on purpose, can see why, and can confirm that the original size is what was expected. The chain still verifies because the sentinel hashes correctly.

This is a small but important design choice. The sentinel is not a placeholder. It is a record. The redaction is itself part of the evidence.

Verifying the sanitized bundle before sharing

Always verify before sending. The cheapest possible smoke test:

akmon bundle import sanitized.akmon --verify-only

This re-runs the AGEF verification on the derivative bundle. If anything is wrong (a missing object, a hash mismatch, a malformed manifest), the verifier fails loudly. Treat this as the gate before any handoff.

For deeper confidence, replay the sanitized session in a sandbox:

akmon replay <derivative-session-id> --mode strict --format json

A clean replay confirms that the bundle still tells a coherent story.

What to redact, and what not to

A practical rule of thumb. Redact what cannot leave the building. Keep what makes the session reviewable.

Redact: customer names, account numbers, secrets that leaked into a tool input, snippets of code that fall outside the disclosure scope.
Keep: tool IDs, policy decisions, structural events (SessionStart, UserTurn, ProviderCall, AssistantTurn, SessionEnd), provider identifiers, model versions, attempt statuses.

The structure of the session is what makes it auditable. The content is what makes it sensitive. Redaction is the line between them.

The exit code map

Code	Meaning
`0`	Derivative bundle written successfully
`2`	Usage error (output exists, invalid hash format, object not in session, missing required flag)
`3`	I/O or environment error (journal or session not found, write failure, unreadable referenced object)

For automation, treat anything other than 0 as a hard stop.

A small story from a tester

A friend who runs an embedded team for a medical device manufacturer ran me through their first audit using Akmon last quarter. The auditor asked for a representative session that touched the firmware build path. The team chose three sessions, ran inspect --resolve on each, and listed the objects to redact: two contained patient IDs from a test dataset, one contained an internal vendor name.

akmon redact $sid \
  --output "audit-pack/$sid.akmon" \
  --object $hash_patient_a \
  --object $hash_patient_b \
  --object $hash_vendor \
  --reason "PHI removal and vendor identification removal for IEC 62304 audit"

They verified each derivative bundle:

akmon bundle import "audit-pack/$sid.akmon" --verify-only

And handed the auditor the three bundles plus an executable verifier. The auditor verified each one offline and proceeded. The conversation took an hour, not a week. The team now treats the redaction step as a normal part of audit prep.

A common mistake to avoid

Do not redact more than you need to. Every redaction shrinks the auditor's ability to follow the session. If you redact AssistantTurn messages and ToolCall outputs aggressively, the bundle stops telling a story.

The right mental model: redact the smallest set of objects that removes the sensitive content. Keep everything else.

Stronger guarantees on top of redaction

Redaction provides selective content removal with documented reasons and verifiable structure. It does not provide identity attribution by itself. If your environment requires producer trust, layer external signing on top of the bundle. The AGEF spec is explicit that signing is expected in later versions; for now, signing is your job.

A practical approach: sign the bundle with your team's key, distribute the signature alongside, and document the verification procedure. The auditor verifies the signature first, then runs bundle import --verify-only.

Where this leaves you

After the redaction workflow is in place, three things are true.

First, you can hand any session to an external reviewer with a single bundle.

Second, the redaction is itself part of the evidence. The reviewer can see what was redacted and why.

Third, the integrity guarantees survive transport. The bundle verifies offline.

If you want to dig deeper into the policy side of the loop, the next post in this series goes deep on policy profiles and packs, the deterministic merge that controls what the agent can do.

The repo is at github.com/radotsvetkov/akmon. The format spec is at github.com/radotsvetkov/agef. The site is at radotsvetkov.github.io/akmon.

Observability and evidence in AI coding workflows: two log streams, two masters

Radoslav Tsvetkov — Thu, 14 May 2026 08:17:59 +0000

A few months back I watched an external reviewer ask one question I could not answer.

For the AI session that touched this medical device firmware on Tuesday, can you show me the inputs, the policy decisions, the outputs, and a signature that says nobody changed the record?

We had beautiful dashboards. Token usage, latency, top tools, error rates. We did not have what she was asking for.

That conversation forced a vocabulary change. We started saying observability and evidence as two separate words for two separate things. This article is the version of that conversation I wish someone had given me a year earlier.

The short version

Observability is consumed by an engineer trying to understand what the system is doing. Rich, sometimes lossy, meant for a dashboard. Sampling is fine. Cardinality limits are fine.

Evidence is consumed by someone who was not there. A reviewer. A regulator. A future-you reading the file three months after the session. Evidence has to be complete for the session, normalized so the reader does not have to learn your stack, and tamper evident so they can trust no one rewrote history. Sampling is not fine. Format drift is not fine.

If you build only observability, you can debug yesterday. If you build evidence, you can defend yesterday.

A side by side

Aspect	Observability	Evidence
Audience	Internal engineers	External reviewers and future you
Time horizon	Hours to days	Months to years
Sampling	Often, by tenant or rate	Never
Schema	Vendor specific or OTLP	Normalized, vendor neutral (AGEF)
Tamper evidence	Not required	Required
Storage	Hot, indexed	Warm, durable, lifecycle managed
Cost driver	Cardinality	Volume per session
Tooling	Datadog, Langfuse, Helicone	Akmon trust pipeline plus AGEF bundles
Failure mode	Slow dashboards	Failed audit

Both belong in your stack. Neither replaces the other.

Why teams keep conflating them

Three reasons, all reasonable.

The data starts in the same place. Both come from the agent runtime.
The first audience is engineering. Observability comes first because it solves immediate pain. Evidence is for a later audience.
Vendors blur the line. Many tools sell observability as "audit ready". Look at the schema, not the marketing.

You can use the same source events. You should not use the same destination format.

The data shape each one needs

Observability wants to slice and dice. The data is dimensional. Tag everything. Sample when you have to. A typical span has attributes for tenant, user, model, tool name, version, with counters for tokens in, tokens out, cost. High cardinality fields can be dropped or hashed.

Evidence wants to be complete and verifiable. The data is hierarchical. Each session is a chain of typed events with a known closed kind list (SessionStart, UserTurn, ProviderCall, ToolCall, RetrievalCall, PermissionGate, AssistantTurn, SessionEnd). Inputs and outputs are content-addressed. Hashes link events to objects. The bundle has a signed manifest in the future and a verifiable chain today.

The two shapes are related but not the same.

What gets recorded in each

A short tour of fields, with the question they answer.

Observability

latency_ms: how slow was that step?
tokens_input, tokens_output: what is this costing?
model_name, model_version: did a recent change cause the regression?
error_class, error_message: where are breakages clustered?
user_id_hash, tenant_id: who is affected?

Evidence (AGEF v0.1)

Manifest: agef_version, producer, session.id, session.head, timestamps, hash_algorithm, counts.
Events: parents, kind, emitted_at, sequence, plus kind-specific fields.
Objects: every input, output, prompt, message, side effect.

A reader of an observability stream cannot answer "show me the input that produced this denial". A reader of an AGEF bundle cannot easily answer "what is the p99 latency for this tool". Both are right.

A pipeline that produces both

You do not need two duplicate pipelines. You need one source and two sinks.

agent runtime  ->  Akmon (single Rust binary)
                          |
                          +--> OTLP / metrics --> observability backend
                          |
                          +--> .akmon/audit/<session>.jsonl  --> trust pipeline
                                .akmon/evidence/<session>.json
                                AGEF bundle (Phase 4)

Akmon writes the audit chain and the evidence summary. Phase 4 export turns those into a portable AGEF bundle. Observability data continues to flow to your dashboard.

How to roll this out without breaking the existing stack

If your team already has observability, do not rip it out. Add evidence next to it.

A migration that works in three steps:

Two weeks: install Akmon. Run the trust pipeline on a small set of sessions. Confirm the three exit codes are 0.
Two weeks: add policy packs. Pin the tool surface. Run replay against a small canonical set in CI.
Ongoing: expand to more repos. Map AGEF event kinds to control statements. Hand a sample to your reviewer and ask what is missing.

By the end of the third step, the conversation with the reviewer changes. Instead of "we have logs", the answer is "here is the bundle, here is the verifier, here is the policy that fired".

Five things teams get wrong

Sampling for evidence. If a session is missing, the answer to "what did the agent do at 9:14" is "we do not know". The worst possible answer.
Storing raw inputs without redaction in shared bundles. Sensitive data does not belong in an evidence bundle that is going to a third party. Redact at the boundary; keep the structural events.
Same retention as observability. Observability is hot. Evidence is warm. Use storage classes that match the access pattern.
No verification step. A bundle that is not verified is a bundle that might fail when it counts. Make bundle import --verify-only a required step before any handoff.
Treating the format as an internal detail. Evidence is consumed by people outside your team. Pick an open format that survives them, not just one that fits today. AGEF is meant for that.

Why I picked AGEF for the evidence side

I wrote AGEF because I needed it. Observability formats were not built for the audience that consumes evidence. Vendor-specific traces leave the reader stuck in a dashboard. Custom JSON drifts in a quarter and rots in a year.

AGEF is small, opinionated, and portable. The full spec fits in a short document. The bundle is tar.zst with a manifest.json, events.bin (length-delimited canonical CBOR), and objects/<hex> files. The format is signed (planned). It works for any runtime that can emit it. It is the part of the stack I want to outlive any one tool, including mine.

Where to start

If you have observability already, you have most of the source data. The work is to project it into a normalized record per session, redact at the object level when you share, and verify the bundle on import. Akmon and AGEF give you that path without rewriting your dashboard.

If you do not have observability yet, start there. Get OpenTelemetry traces flowing for any agent runtime you care about. Then put Akmon in the middle and start writing evidence.

The repo is at github.com/radotsvetkov/akmon. The format is at github.com/radotsvetkov/agef. The site is at radotsvetkov.github.io/akmon.

When the next reviewer asks for the inputs, the policy decisions, the outputs, and a signature, you will have one answer instead of three.

Why I built Akmon, the AI coding agent for regulated engineering

Radoslav Tsvetkov — Thu, 14 May 2026 08:17:35 +0000

For the last two years I have watched the same conversation happen in every regulated engineering team I work with. Someone tries the new AI coding agent. It writes a real diff. The team is impressed. Then somebody asks the question that ends the experiment.

If we merge this, what do we tell the auditor?

The agent did the work. The session was a TUI window that closed. The diff is in git. The reasoning is gone. Whatever policy the agent used, whatever model version it called, whatever tool it ran, none of that survived. In a hobby project that is fine. In avionics, in a medical device, in a SOC 2 release process, in a CMMC-bound codebase, that is a hard stop.

This post is the short story of why I started Akmon, what it actually is, and the small set of commands you can try this afternoon.

The problem in one sentence

AI coding agents produce code. They do not produce evidence. In regulated engineering, code without evidence is a liability.

By "evidence" I mean an artifact that lets a reviewer answer five questions without your help.

What did the agent read?
What tools did it call, with what inputs?
What changed on disk?
Which policy decision allowed each side effect?
Can we replay the session and confirm the artifact is intact?

These are not exotic asks. They are the same questions a careful human reviewer asks of a colleague's branch. The difference is that an agent does ten times the moves in one tenth the time, and the rest of the system has not caught up.

What I tried first

The first instinct, the same one I had, is to slap a logger on the agent. That helps for a week.

Then you discover three things.

First, logs are not evidence. Logs are a developer's artifact. Evidence is a reviewer's artifact. The shape is different, the audience is different, and the integrity guarantees are different.

Second, the chain is the point. A flat log file does not let you say "this file change was caused by that tool call which was caused by that model response". You need a chain of events, ordered, with parent linkage.

Third, the export is the work. The audit will not happen on your laptop. The bundle has to travel. It has to verify offline. It has to be portable across machines that have never seen your repo.

By the time I had stitched all of that together with shell scripts, I had a small, ugly piece of infrastructure. I deleted the scripts and started Akmon.

What Akmon is, in one paragraph

Akmon is a single Rust binary. It is an AI coding agent (interactive TUI plus a headless --task mode), built for environments where the agent's reasoning, tool calls, and file changes have to be reviewable later. Every session writes a tamper-evident, content-addressed audit chain to .akmon/audit/<session-id>.jsonl and a structured evidence summary to .akmon/evidence/<session-id>.json. Sessions can be replayed deterministically, compared with diff, redacted before external review, and exported as portable AGEF bundles.

Akmon supports Anthropic, OpenAI, OpenRouter, Groq, Azure OpenAI, Bedrock, OpenAI-compatible endpoints, and Ollama. Model choice is operator controlled. There is no hosted runtime to log into. You run it on your laptop, your CI runner, or your hardened SSH host.

License is Apache-2.0. The repo is at github.com/radotsvetkov/akmon. v2.0.0 is the current line.

Five commands you can try this afternoon

Install the binary, then walk this short pipeline.

# Run a small headless task in your project.
cd your-project
akmon --yes --output json --task "summarize failing tests and propose minimal fixes" | tee run.json

# Verify the tamper-evident audit chain for the session.
akmon audit verify .akmon/audit/<session-id>.jsonl

# Verify the evidence schema and the linkage to the audit chain.
akmon evidence verify .akmon/evidence/<session-id>.json

# Enforce a single-run SLO check.
akmon slo verify .akmon/evidence/<session-id>.json --strict

# Detect regressions vs a historical baseline.
akmon slo trend .akmon/evidence/<session-id>.json \
  --baseline-dir .akmon/evidence/history \
  --window 20 \
  --strict

Five commands, five exit codes, five crisp signals. None of them depend on a dashboard. All of them gate cleanly in CI.

The one design decision that matters most

Akmon ships as a single Rust binary on purpose. That choice does work most teams underestimate.

Runtime state is explicit. There is no plugin runtime to drift.
Behavior is reproducible across laptops, CI runners, SSH hosts, and air-gapped environments.
Troubleshooting tends to focus on policy, providers, repository state, or model behavior. Not host runtime mismatch.

If two machines run the same Akmon version, you can treat them as equivalent. That sounds boring. It is the difference between a tool you can support and one you cannot.

Policy is a deterministic merge

Policy in Akmon comes from four layers, in a fixed order:

Built-in profile (dev, staging, prod).
Policy packs (TOML or JSON files in .akmon/policy-packs/).
Project-local policy (.akmon/policy.toml or .akmon/policy.json).
CLI override (--policy-override).

Within each layer, list fields append and deduplicate while keeping the last occurrence, so higher precedence keeps later rule order. The effective policy is inspectable.

akmon policy show-effective --profile prod \
  --policy-pack .akmon/policy-packs/org.toml \
  --policy-pack .akmon/policy-packs/team.toml

Tickets that start with "I think the policy was" become tickets that end with "the policy was, here is the merged TOML".

Why an open evidence format

After Akmon's audit chain was working, the next gap was portability. A reviewer who needs to confirm a session should not need Akmon installed. They should need a verifier and a bundle.

That is what AGEF is. AGEF is the Agent Governance Evidence Format. It is an open spec, governed in its own repo at github.com/radotsvetkov/agef. A bundle is a tar.zst archive with manifest.json, an events.bin stream of length-delimited canonical CBOR records, and a directory of content-addressed objects/<hex> files. The current spec text is v0.1.1; the wire format version is 0.1. Spec text is CC BY 4.0. Code is Apache-2.0.

Akmon is the reference implementation. The journal substrate (akmon-journal) is a Substrate Profile under AGEF v0.1. Bundle Profile (full export and import) is part of Akmon Phase 4.

I will go deep on AGEF in the next post. The point for this one is that the format is real, written, and meant to outlive any one tool, including mine.

Who Akmon is for

Three concrete audiences, in order.

First, senior engineers in regulated codebases. Avionics (DO-178C, DO-330 tool qualification), medical devices (IEC 62304), automotive (ISO 26262, ASPICE), industrial control (IEC 61508), defense (CMMC). If your build process treats unreviewable AI output as inadmissible, Akmon is built for you.

Second, DevSecOps and platform engineers in finance. SOC 2 evidence, internal audit on AI use, vendor risk on model providers. The trust pipeline maps cleanly to these controls.

Third, AI tooling leads at companies that have rolled out AI coding tools and now have to answer the next layer of questions about provenance, regression detection, and review fatigue.

If none of that fits, you might still find the format interesting. AGEF is meant to be useful beyond Akmon.

What Akmon is not

A few things to keep the conversation honest.

Akmon is not the fastest autocomplete. It is the agent that records what it did. If your workflow does not need that, the trade may not be worth it.
Akmon is not a hosted SaaS runtime. You run the binary. Your data stays where you put it.
Akmon is not a guarantee that the model behaves. The model is the model. Akmon makes the consequences of misbehavior bounded, observable, and provable.

That last point matters. Governance is not the absence of bad behavior. It is the ability to prove what happened.

Where to start

If anything in this post hit close to home, the smallest first step is one minute long.

Install the binary.
Run akmon --yes --task "..." on a sandbox project.
Run the five-command trust pipeline above.
Open the AGEF section of the docs and read what a manifest.json and an events.bin look like. Pretend you are the reviewer.

If that bundle answers the five questions in the second section without your help, you have closed the gap I described at the start of this post.

The repo is at github.com/radotsvetkov/akmon. The format spec is at github.com/radotsvetkov/agef. The site lives at radotsvetkov.github.io/akmon. The next post in this series goes deep on AGEF.

Why I built Akmon, the AI coding agent for regulated engineering

Radoslav Tsvetkov — Wed, 06 May 2026 21:03:46 +0000

If we merge this, what do we tell the auditor?

This post is the short story of why I started Akmon, what it actually is, and the small set of commands you can try this afternoon.

The problem in one sentence

AI coding agents produce code. They do not produce evidence. In regulated engineering, code without evidence is a liability.

By "evidence" I mean an artifact that lets a reviewer answer five questions without your help.

What did the agent read?
What tools did it call, with what inputs?
What changed on disk?
Which policy decision allowed each side effect?
Can we replay the session and confirm the artifact is intact?

What I tried first

The first instinct, the same one I had, is to slap a logger on the agent. That helps for a week.

Then you discover three things.

First, logs are not evidence. Logs are a developer's artifact. Evidence is a reviewer's artifact. The shape is different, the audience is different, and the integrity guarantees are different.

Third, the export is the work. The audit will not happen on your laptop. The bundle has to travel. It has to verify offline. It has to be portable across machines that have never seen your repo.

By the time I had stitched all of that together with shell scripts, I had a small, ugly piece of infrastructure. I deleted the scripts and started Akmon.

What Akmon is, in one paragraph

License is Apache-2.0. The repo is at github.com/radotsvetkov/akmon. v2.0.0 is the current line.

Five commands you can try this afternoon

Install the binary, then walk this short pipeline.

# Run a small headless task in your project.
cd your-project
akmon --yes --output json --task "summarize failing tests and propose minimal fixes" | tee run.json

# Verify the tamper-evident audit chain for the session.
akmon audit verify .akmon/audit/<session-id>.jsonl

# Verify the evidence schema and the linkage to the audit chain.
akmon evidence verify .akmon/evidence/<session-id>.json

# Enforce a single-run SLO check.
akmon slo verify .akmon/evidence/<session-id>.json --strict

# Detect regressions vs a historical baseline.
akmon slo trend .akmon/evidence/<session-id>.json \
  --baseline-dir .akmon/evidence/history \
  --window 20 \
  --strict

Five commands, five exit codes, five crisp signals. None of them depend on a dashboard. All of them gate cleanly in CI.

The one design decision that matters most

Akmon ships as a single Rust binary on purpose. That choice does work most teams underestimate.

Runtime state is explicit. There is no plugin runtime to drift.
Behavior is reproducible across laptops, CI runners, SSH hosts, and air-gapped environments.
Troubleshooting tends to focus on policy, providers, repository state, or model behavior. Not host runtime mismatch.

If two machines run the same Akmon version, you can treat them as equivalent. That sounds boring. It is the difference between a tool you can support and one you cannot.

Policy is a deterministic merge

Policy in Akmon comes from four layers, in a fixed order:

Built-in profile (dev, staging, prod).
Policy packs (TOML or JSON files in .akmon/policy-packs/).
Project-local policy (.akmon/policy.toml or .akmon/policy.json).
CLI override (--policy-override).

Within each layer, list fields append and deduplicate while keeping the last occurrence, so higher precedence keeps later rule order. The effective policy is inspectable.

akmon policy show-effective --profile prod \
  --policy-pack .akmon/policy-packs/org.toml \
  --policy-pack .akmon/policy-packs/team.toml

Tickets that start with "I think the policy was" become tickets that end with "the policy was, here is the merged TOML".

Why an open evidence format

After Akmon's audit chain was working, the next gap was portability. A reviewer who needs to confirm a session should not need Akmon installed. They should need a verifier and a bundle.

Akmon is the reference implementation. The journal substrate (akmon-journal) is a Substrate Profile under AGEF v0.1. Bundle Profile (full export and import) is part of Akmon Phase 4.

I will go deep on AGEF in the next post. The point for this one is that the format is real, written, and meant to outlive any one tool, including mine.

Who Akmon is for

Three concrete audiences, in order.

Second, DevSecOps and platform engineers in finance. SOC 2 evidence, internal audit on AI use, vendor risk on model providers. The trust pipeline maps cleanly to these controls.

Third, AI tooling leads at companies that have rolled out AI coding tools and now have to answer the next layer of questions about provenance, regression detection, and review fatigue.

If none of that fits, you might still find the format interesting. AGEF is meant to be useful beyond Akmon.

What Akmon is not

A few things to keep the conversation honest.

Akmon is not the fastest autocomplete. It is the agent that records what it did. If your workflow does not need that, the trade may not be worth it.
Akmon is not a hosted SaaS runtime. You run the binary. Your data stays where you put it.
Akmon is not a guarantee that the model behaves. The model is the model. Akmon makes the consequences of misbehavior bounded, observable, and provable.

That last point matters. Governance is not the absence of bad behavior. It is the ability to prove what happened.

Where to start

If anything in this post hit close to home, the smallest first step is one minute long.

Install the binary.
Run akmon --yes --task "..." on a sandbox project.
Run the five-command trust pipeline above.
Open the AGEF section of the docs and read what a manifest.json and an events.bin look like. Pretend you are the reviewer.

If that bundle answers the five questions in the second section without your help, you have closed the gap I described at the start of this post.

The repo is at github.com/radotsvetkov/akmon.

The format spec is at github.com/radotsvetkov/agef.
The site lives at radotsvetkov.github.io/akmon.

AGEF explained: a portable evidence format for AI agent sessions

Radoslav Tsvetkov — Wed, 06 May 2026 20:57:19 +0000

This article walks the spec end to end, with code, and with the design choices I made.

What AGEF actually is

The bundle is a tar.zst archive with three top-level paths:

manifest.json, a small UTF-8 JSON file with sorted keys and LF newlines.
events.bin, an ordered stream of length-delimited canonical CBOR event records.
objects/<hex>, a directory of content-addressed object files (one per hash).

That is the whole shape. Verifiers ignore unknown non-normative files unless explicitly configured to reject them.

The manifest

{
  "agef_version": "0.1",
  "producer": {"name": "akmon", "version": "2.0.0"},
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "head": "5c1f8b2a3d4e7f6a8b1d3e2c4f6a8b9d2c3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c",
    "created_at": "2026-05-06T09:14:02Z",
    "ended_at": "2026-05-06T09:14:18Z"
  },
  "hash_algorithm": "sha256",
  "object_count": 12,
  "event_count": 9
}

The events stream

events.bin is the substance. It is a sequence of records, each framed by a 4-byte unsigned big-endian length prefix followed by exactly that many bytes of canonical CBOR encoding one event.

Length-delimited framing was a deliberate trade. It supports partial recovery from truncation and lets a verifier scan deterministically. Producers commit to canonical CBOR per RFC 8949.

Each event encodes:

parents, an array of zero or more parent event hashes.
kind, one of the closed event kinds in v0.1.
emitted_at, a CBOR tag 1 timestamp (integer epoch seconds, or floating point for sub-second precision).
sequence, monotonic per session, starting at 0.

Linkage rules are strict in v0.1:

sequence starts at 0 and increases by exactly 1 per event.
Every event except SessionStart has at least one parent.
SessionStart has exactly zero parents.
Every non-SessionStart event in v0.1 has exactly one parent, and that parent is the hash of the immediately preceding event by sequence. Multi-parent events are reserved for future versions.
Event hashes are computed over canonical CBOR bytes of the full event envelope.
Event ordering in events.bin matches sequence.

Event kinds in v0.1 (closed set)

Readers must recognize exactly these kinds in v0.1. Unknown kinds are rejected.

SessionStart: opens the session, fields cwd_hash and config_hash.
UserTurn: a user prompt, field prompt_hash.
ProviderCall: a model provider call, field provider_id, an attempts[] array of AttemptRecord, optional stream_hash.
ToolCall: a tool execution, fields tool_id, input_hash, output_hash, optional side_effects_hash.
RetrievalCall: a retrieval, fields index_id, query_hash, results_hash.
PermissionGate: a policy decision, fields policy_id, decision (string, recommended lowercase verbs like allowed, denied, deferred), context_hash.
AssistantTurn: an assistant message, fields message_hash, optional tool_calls_hash.
SessionEnd: closes the session, optional summary_hash.

Two design notes worth calling out.

The objects directory

A typical bundle has many small objects (prompts, tool inputs, tool outputs, file diffs) and a few large ones (full file contents). Storage is efficient because identical content hashes once.

Hashing rules

default: sha256
optional: blake3

Serialization rules in one paragraph

Verification procedure

A conforming verifier runs this sequence:

Extract the archive.
Parse manifest.json. Reject on schema or version failure.
Read events.bin using the 4-byte length-delimited framing.
For each event: decode canonical CBOR, recompute event hash, verify sequence monotonicity, verify all parents resolve to previously seen events, verify referenced content hashes resolve to files in objects/.
For each referenced object: read bytes, hash with manifest.hash_algorithm, compare to filename digest.
Confirm manifest.event_count equals decoded event count, manifest.object_count equals object file count, manifest.session.head equals terminal event hash, and SessionStart is a reachable ancestor of head.

Default operation fails on the first invariant violation. Implementations may offer a "report-all" mode for diagnostics. They must not claim successful verification unless all required checks pass.

Conformance profiles

AGEF v0.1 defines two profiles:

Bundle Profile, an implementation that produces or consumes AGEF bundles.
Substrate Profile, an implementation that maintains an AGEF-compatible content-addressed event journal but does not necessarily emit bundles directly.

A Substrate Profile must be able to produce Bundle Profile output through an export pathway when required.

akmon-journal is currently a Substrate Profile. Akmon Phase 4 introduces Bundle Profile capability.

If you are an implementer, knowing which profile you are claiming is the most important conformance question.

Producing AGEF in practice

Reading and verifying AGEF

For Akmon-produced bundles, the verification commands are part of the trust pipeline:

# Verify audit chain integrity for the live session journal
akmon audit verify .akmon/audit/<session-id>.jsonl

# Verify the evidence summary plus its linkage to the audit chain
akmon evidence verify .akmon/evidence/<session-id>.json

# Once Phase 4 export is on hand, verify a bundle on import
akmon bundle import evidence.akmon --verify-only

Independent verifier implementations are welcome. The spec is small enough to implement in a weekend if you start with a CBOR library that supports canonical encoding.

Security considerations

For sharing, the redaction flow is the right answer. With Akmon:

akmon redact <session-id> \
  --output sanitized.akmon \
  --object <object-hash> \
  --reason "Removed customer name before audit handoff"

Compatibility and evolution

If you build against v0.1, plan for one migration to v1.0. If you are a careful producer, the migration is small.

Why a new format

Three reasons.

Third, regulated engineering needs an answer that does not depend on a vendor dashboard. The portability requirement is real. AGEF is built for it.

What you can do this week

If you build AI tooling, three small steps:

Read SPEC.md end to end. It is short on purpose.
Implement a Bundle Profile verifier in your language of choice. Use the test vectors that ship in the examples/minimal-bundle folder.
Open an issue on the AGEF repo. Tell me what your runtime emits today and where the spec helps or gets in the way.

If you ship a coding agent, the smallest first step is one binary and the trust pipeline. The repo is at github.com/radotsvetkov/akmon.

The format spec is at github.com/radotsvetkov/agef.
The site is at radotsvetkov.github.io/akmon.

Standards become standards by being used. Implementations are welcome.