I gave a Grok CLI agent swarm one instruction around 1am and went to bed. By 5:30 it had closed 37 items off my operator backlog and landed 20 commits on main: 2,838 lines added, 112 removed, across 51 files. Good night’s work for a process I wasn’t awake for.
Then I tried to audit it, and discovered my own ingest pipeline had quietly turned the entire run into garbage. The schema mismatch is mundane. The lesson underneath it is not.
The setup
I run a personal automation stack with an operator backlog: a Postgres-backed queue of tickets tagged by whether an agent can safely act on them unattended. My usual agents drain that queue against a set of contracts: an investigator role that diagnoses, and a worker role that ships with test gates. I wanted to see how a Grok CLI swarm would handle the same contracts.
The kickoff was one line: keep working through the queue, do the autonomous-safe work. The first session fanned out a dozen subagents in about three seconds. Each read the same prompt contracts, and each claimed tickets with a lease so they wouldn’t collide.
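A lease-based claim of this kind might look like the sketch below. It is an illustration only: it uses SQLite so it runs anywhere, and the tickets table, its columns, and LEASE_SECONDS are hypothetical names, not the stack's real schema. A Postgres-backed queue would more likely lean on SELECT ... FOR UPDATE SKIP LOCKED for the same effect.

```python
import sqlite3
import time

LEASE_SECONDS = 300  # hypothetical lease length


def claim_ticket(conn, agent_id, now=None):
    """Atomically claim one open, unleased (or lease-expired) ticket.

    Returns the ticket id, or None if nothing is claimable.
    """
    now = time.time() if now is None else now
    with conn:  # one transaction per claim attempt
        row = conn.execute(
            "SELECT id FROM tickets "
            "WHERE autonomous_safe = 1 AND status = 'open' "
            "AND (lease_expires IS NULL OR lease_expires < ?) "
            "ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        # Repeat the lease check in the UPDATE: if another agent claimed the
        # row in between, this matches zero rows and the claim fails cleanly.
        cur = conn.execute(
            "UPDATE tickets SET leased_by = ?, lease_expires = ? "
            "WHERE id = ? AND (lease_expires IS NULL OR lease_expires < ?)",
            (agent_id, now + LEASE_SECONDS, row[0], now),
        )
        return row[0] if cur.rowcount == 1 else None
```

The expiry is the point of a lease rather than a plain lock: if an agent dies mid-ticket, the ticket becomes claimable again once the lease runs out.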
Every session writes a transcript to disk. My stack ingests those transcripts into Postgres so I can answer questions like “what did agent #7 actually run.” Standard local-first observability: the model provider doesn’t hold my data, I do.
The morning
I went to look at the per-session breakdown and the numbers were obviously wrong:
SELECT count(*) AS sessions,
       count(model) AS with_model,
       sum(turn_count) AS turns,
       count(*) FILTER (WHERE is_subagent) AS subagents
FROM grok_sessions;

 sessions | with_model | turns | subagents
----------+------------+-------+-----------
       95 |          0 |    99 |         0
Ninety-five sessions, zero with a model recorded, ninety-nine total turns, zero subagents. That immediately failed a sanity check: six sessions had subagent-<uuid> in their project name, and a dozen subagents launching in three seconds do not average one turn each. The data was lying.
The transcript format
The importer was written against a Claude-Code-shaped assumption. It expected a role field on each record, tool calls embedded inside content blocks, and a model field somewhere on the message. Here is what Grok CLI actually writes, one JSON object per line:
$ cat chat_history.jsonl | jq -r '.type' | sort | uniq -c
  47 assistant
  47 reasoning
   1 system
  51 tool_result
   3 user
Three things break immediately.
There is no role. The discriminator is type, and the values include reasoning and tool_result, record types the parser had never heard of. Anything that wasn’t system or assistant got bucketed into user. So a 149-record transcript with 47 real assistant turns counted as roughly three turns, the actual user messages. Everything else disappeared.
Tool calls aren’t embedded in content blocks. They’re a top-level array on assistant records:
$ cat chat_history.jsonl | jq -c 'select(.type=="assistant")
    | {model_id, tools: (.tool_calls | length), first: .tool_calls[0].name}' | head -3
{"model_id":"grok-build","tools":2,"first":"read_file"}
{"model_id":"grok-build","tools":3,"first":"read_file"}
{"model_id":"grok-build","tools":2,"first":"grep"}
The parser scanned content blocks looking for tool-use records that never existed, so every session appeared to have made zero tool calls.
The model was right there. Every assistant record carried "model_id": "grok-build". The importer had a state.model variable. It declared it, threaded it through the insert path, and never assigned it. A dead variable writing NULL ninety-five times.
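A parser that respects all three observations could be sketched as follows. The field names type, tool_calls, and model_id come from the transcripts shown above; the function name, the returned dictionary, and the choice to count one turn per assistant record are assumptions made for illustration.

```python
import json
from collections import Counter

# Record types seen in the transcript above.
KNOWN_TYPES = {"system", "user", "assistant", "reasoning", "tool_result"}


def summarize_transcript(lines):
    """Summarize JSONL transcript lines, dispatching on `type`, not `role`."""
    counts = Counter()
    unknown = Counter()
    tool_calls = 0
    model = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        rtype = record.get("type")
        if rtype not in KNOWN_TYPES:
            # Surface surprises instead of silently bucketing them as user.
            unknown[str(rtype)] += 1
            continue
        counts[rtype] += 1
        if rtype == "assistant":
            # Tool calls are a top-level array on assistant records.
            tool_calls += len(record.get("tool_calls") or [])
            # Actually assign the model instead of leaving it NULL.
            model = record.get("model_id") or model
    return {
        "turns": counts["assistant"],  # assumption: one turn per assistant record
        "by_type": dict(counts),
        "tool_calls": tool_calls,
        "model": model,
        "unknown_types": dict(unknown),
    }
```

Reporting unknown_types separately is the design choice that matters: a format change then shows up as a non-empty dictionary rather than as plausible-looking zeros.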
The real failure
The real failure wasn’t misreading the transcript. It was assuming the transcript was the source of truth. The parser bug merely exposed the design flaw.
In hindsight the warning signs were obvious. Every session directory already contained multiple independent records of what happened: transcripts, event streams, session metadata, and git state. I had accidentally built an audit pipeline that ignored corroborating evidence and trusted the narrative, so when the narrative parser failed, the entire picture failed with it.
The session metadata alone contained everything I was missing:
{
  "current_model_id": "grok-build",
  "session_kind": "subagent",
  "num_chat_messages": 149,
  "created_at": "2026-06-05T00:28:06.848Z",
  "last_active_at": "2026-06-05T00:33:48.184Z"
}
Every missing column was already available. session_kind identifies subagents, num_chat_messages is the real turn count, created_at and last_active_at are the real timestamps, and additional metadata links the session back to the correct repository context. Meanwhile my importer was deriving timestamps from file modification times and inferring project identity from directory names.
The fix is not a patch to the parse loop. It’s architectural: trust authoritative session metadata for session-level facts, use transcripts only for message bodies, handle the record types that actually exist, and corroborate one source against another.
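As a rough sketch of that split, the function below takes session-level columns from the metadata and uses the transcript only as a cross-check. The metadata keys are the ones shown above; the function name, the output column names, and the assumption that num_chat_messages should equal the number of JSONL records (as it does for the 149-record example) are illustrative, not confirmed behavior.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO-8601 timestamp with a trailing Z into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def session_row(meta, transcript_record_count):
    """Build session-level columns from metadata; corroborate with the transcript."""
    created = parse_ts(meta["created_at"])
    last_active = parse_ts(meta["last_active_at"])
    warnings = []
    if meta["num_chat_messages"] != transcript_record_count:
        # Two sources disagree: flag it rather than pick a winner silently.
        warnings.append(
            f"metadata says {meta['num_chat_messages']} messages, "
            f"transcript has {transcript_record_count} records"
        )
    if last_active < created:
        warnings.append("last_active_at precedes created_at")
    return {
        "model": meta["current_model_id"],
        "is_subagent": meta["session_kind"] == "subagent",
        "turn_count": meta["num_chat_messages"],
        "created_at": created,
        "last_active_at": last_active,
        "duration_seconds": (last_active - created).total_seconds(),
        "warnings": warnings,
    }
```

A non-empty warnings list is worth storing alongside the row, so a disagreement between sources is visible at query time instead of being resolved invisibly at import time.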
Why I’m telling on myself
Here’s the uncomfortable bit. I could not answer “what did the swarm actually do” from the swarm’s own records. I had to reconstruct the night from two sources the agents didn’t write: the backlog table, which recorded what changed state, and git history, which recorded what actually landed. Those two agreed, and they’re trustworthy precisely because the agents couldn’t rewrite them afterward.
$ git log --since="2026-06-04 23:30" --until="2026-06-05 08:00" \
    --pretty=tformat: --numstat \
  | awk 'NF==3 {a+=$1; d+=$2; f++}
         END {printf "%d files, +%d/-%d\n", f, a, d}'
51 files, +2838/-112
That number I believe. The transcript-derived “99 turns” I don’t.
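For anyone reproducing this kind of tally, here is a small sketch that parses the same --numstat output in Python. The function name and return shape are hypothetical. It reports numstat rows and distinct paths separately, because a counter like f++ counts a file once per commit that touched it, so the two figures can differ when several commits edit the same file. It also skips the "-" placeholders git prints for binary files.

```python
def summarize_numstat(text):
    """Tally `git log --numstat` output: lines added/deleted, rows, distinct paths."""
    added = deleted = rows = 0
    paths = set()
    for line in text.splitlines():
        parts = line.split("\t")  # numstat columns are tab-separated
        if len(parts) != 3:
            continue  # blank separator lines between commits
        a, d, path = parts
        rows += 1
        paths.add(path)
        # Binary files report "-" instead of line counts.
        if a != "-":
            added += int(a)
        if d != "-":
            deleted += int(d)
    return {
        "added": added,
        "deleted": deleted,
        "rows": rows,
        "distinct_files": len(paths),
    }
```

Feeding it the captured output of the git log command above (for example via subprocess.run with capture_output=True) gives a tally that can be compared against the backlog table's state changes.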
The lesson
This has nothing to do with Grok specifically. Autonomous agents are trivial to launch and genuinely hard to audit, and the gap between those two facts is where teams are going to get hurt over the next few years. An agent telling you what it did is not evidence. It’s a claim, and it’s a claim sourced from the least trustworthy possible place: the thing being audited.
I filed two tickets against my own importer. One fixes the parser; the other forces a re-parse of the 95 corrupted rows, since the incremental loader skips files whose size hasn’t changed and a logic fix doesn’t change the file. They’ll get fixed. But the design flaw matters more than the bug: verify agent work from sources the agent cannot write to, and preferably from multiple sources that must agree. Otherwise you’re not auditing behavior, you’re auditing a story about behavior.
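One way to make a logic fix trigger a re-parse is to fold a parser version into whatever the loader uses to decide a file is unchanged. The sketch below assumes the loader stores a per-file fingerprint; PARSER_VERSION, fingerprint, and needs_reparse are hypothetical names, not the importer's real API.

```python
import os

PARSER_VERSION = 2  # hypothetical; bump whenever the parsing logic changes


def fingerprint(path, parser_version=PARSER_VERSION):
    """Identify a (file contents, parser logic) pair cheaply."""
    return (os.stat(path).st_size, parser_version)


def needs_reparse(path, stored_fingerprint, parser_version=PARSER_VERSION):
    """True if the file grew or shrank, or was last parsed by older logic."""
    return stored_fingerprint != fingerprint(path, parser_version)
```

With size alone as the key, rows written by a buggy parser stay wrong until the file changes; with the version in the key, bumping PARSER_VERSION invalidates every stored row at once.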
If your only record of an agent’s actions comes from a channel the agent controls, you don’t have observability. You have a press release with a timestamp.
The swarm had a productive night. I just had to prove it the hard way.
Top comments (1)
The last line is the exact problem statement FIPSign was built to solve.
The gap you're describing isn't observability — it's auditability. Observability tells
you what happened. Auditability tells you that what happened is verifiable and
tamper-proof. They solve different problems, and most agent stacks only have the first.
The source-of-truth problem you hit — "I had to reconstruct the night from two sources
the agents didn't write" — is the right instinct but it's architectural duct tape. Git
history and the backlog table worked this time because the agents couldn't write to
them. But that's a constraint on your specific setup, not a guarantee.
The primitive that closes this at the protocol level: each agent action produces a
signed token from an external source the agent doesn't control. sub: "agent_swarm_7",
scoped to the operation, short TTL, revocable. If the token verifies, the action is
authentic. If it doesn't, something was tampered with or the agent was compromised. The
record isn't a claim — it's a cryptographic proof.
Where it gets more interesting for a swarm setup: a Private CA that issues a
certificate per agent. Now you can verify not just "this token is valid" but "this
token was issued by an agent that belongs to this authorized fleet." Each agent has a
cryptographic identity issued by a CA you control — and if an agent is compromised, you
revoke its certificate instantly, not wait for tokens to expire.
I've been building exactly this with FIPSign — post-quantum signing API on ML-DSA-65
(NIST FIPS 204). An agent is just another sub. Sign the action, verify anywhere, revoke
if something looks off.
Your pipeline lied because it was the only witness. Signed tokens give you a second
witness that can't be coached.