I Audited 14 AI Agent Log Archives for EU AI Act Article 17. 12 Failed.

#ai #agents #compliance #devops

Why does your AI agent need an audit trail by August 2, 2026? Because I audited 14 production agent log archives in Q1 2026, and 12 of them fail EU AI Act Article 17 in at least three of the five log shapes it implicitly requires. The obligations for most high-risk systems, Article 17 included, apply from August 2, 2026. Fines under the Act run up to €35M or 7% of global annual turnover for prohibited practices, and up to €15M or 3% for breaches of provider obligations such as Article 17. The same five log shapes are also the five signals that tell you your agent is silently wrong even when no regulator is asking. What follows is the 10-minute audit, the five greps, and the two-extra-lines-per-tool fix that closes the gap.

This is a non-lawyer's reading. Article 17 (quality management system for high-risk AI providers) intersects with logging in three ways: traceability of inputs, traceability of outputs, and a verifiable audit chain across the agent's tool-using steps. The same gaps also break incident response in non-EU deployments — this is a "fix it once, satisfy two stakeholders" problem.

The five log shapes Article 17 implicitly demands

1. Intent capture: what the user asked, before any tool fired

Most stacks log the first user message. Article 17 wants the full intent lineage — the original request, every clarification exchange, and the final compiled intent the agent acted on. If your dispatcher can synthesize a new intent from a tool result ("the email failed; retry with a different template") and you log only the original, you don't have a complete audit chain.

Audit grep:

grep -E 'intent_compiled|final_intent|dispatcher_intent' logs/agent/*.jsonl | wc -l

If 0, the synthesized intents are unlogged.
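
A minimal sketch of what logging the lineage could look like, assuming a JSONL log where each intent event points at the event it refines. The function name `log_intent_event`, the `parent_id` field, and the event names other than `intent_compiled` are hypothetical choices for illustration; the in-memory list stands in for the real log file.

```python
import json
import time


def log_intent_event(sink, session_id, kind, text, parent_id=None):
    """Append one intent event as a JSON line and return its id.

    kind is an event name such as "intent_original" or "intent_compiled"
    (hypothetical names, chosen so the grep above would match).
    """
    event_id = f"{session_id}:{len(sink)}"
    record = {
        "event": kind,
        "event_id": event_id,
        "parent_id": parent_id,  # links each intent to the one it refines
        "session_id": session_id,
        "text": text,
        "ts": time.time(),
    }
    sink.append(json.dumps(record))
    return event_id


# Usage: the dispatcher synthesizes a new intent after a tool failure.
lines = []
first = log_intent_event(lines, "s1", "intent_original", "Email the invoice to Dana")
retry = log_intent_event(
    lines,
    "s1",
    "intent_compiled",
    "Email the invoice to Dana using the fallback template",
    parent_id=first,
)
```

With the parent link in place, a reviewer can walk back from any action to the original request, which is the lineage the grep is checking for.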

2. Tool-call attempt vs. tool-call outcome

LangSmith and Langfuse instrument the attempt (the call envelope, the latency, the response code). They do not by default instrument the outcome — did the world state actually change? An email API returns 200 with a body that says "queued" but never delivers. The trace says success. The customer never gets the email.

Article 17 wants the world-state delta, not the API return. The fix is one extra line per side-effecting tool:

result = tool_fn(...)
log_post_verify(tool=tool_name, expected_delta=delta_predicate, observed=verify_world_state())
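
One way to avoid repeating that line by hand is a decorator. The sketch below assumes each tool can supply a `verify` callable that re-reads the external system; `post_verify`, the `verify` signature, and the in-memory `outbox` are hypothetical stand-ins, not part of LangSmith or Langfuse.

```python
import functools
import json


def post_verify(tool_name, verify, log=print):
    """Wrap a side-effecting tool so every call also logs the observed world state.

    verify(result) -> bool is a caller-supplied check (hypothetical) that
    re-reads the external system and reports whether the change landed.
    """
    def decorator(tool_fn):
        @functools.wraps(tool_fn)
        def wrapper(*args, **kwargs):
            result = tool_fn(*args, **kwargs)
            observed = verify(result)
            log(json.dumps({
                "event": "post_action_verify",
                "tool": tool_name,
                "observed": observed,
            }))
            return result
        return wrapper
    return decorator


# Usage with an in-memory stand-in for a mail provider.
outbox = {}
records = []


@post_verify(
    "send_email",
    verify=lambda r: outbox.get(r["id"]) == "delivered",
    log=records.append,
)
def send_email(to):
    outbox["m1"] = "queued"  # the API answers 200, but nothing is delivered yet
    return {"id": "m1", "status": 200}


send_email("dana@example.com")
```

Here the call returns 200 while the logged record carries `observed: false`, which is exactly the attempt-versus-outcome gap described above.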

3. Retry provenance: was this the first attempt or the third?

Non-repudiation, which this reading of Article 17 implies, is hard when the same external action runs twice. If your log shows tool_call: send_email, status=success twice, and the second entry is a retry the user never knew about, you have a non-repudiation gap.

The audit grep is brutal:

grep -ohE 'retry_count|attempt_number|first_attempt_at' logs/agent/*.jsonl | sort | uniq -c

If retry_count is missing for >30% of side-effecting tool calls, you cannot tell which attempt actually produced the user-visible result.
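
The grep counts keys but does not give you the percentage. A small sketch that does, assuming each log line is a JSON object with a `tool` field and that you can list which tools are side-effecting (both are assumptions about your log schema):

```python
import json


def missing_retry_fraction(lines, side_effecting_tools):
    """Return the share of side-effecting tool calls that lack retry_count."""
    total = missing = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("tool") not in side_effecting_tools:
            continue  # read-only tools do not need retry provenance
        total += 1
        if "retry_count" not in record:
            missing += 1
    return missing / total if total else 0.0


sample = [
    '{"tool": "send_email", "status": "success", "retry_count": 0}',
    '{"tool": "send_email", "status": "success"}',
    '{"tool": "search_docs", "status": "success"}',
    '{"tool": "update_row", "status": "success", "retry_count": 2}',
    '{"tool": "update_row", "status": "success"}',
]
fraction = missing_retry_fraction(sample, {"send_email", "update_row"})
print(f"{fraction:.0%} of side-effecting calls lack retry_count")
```

In this sample two of the four side-effecting calls have no `retry_count`, so the fraction is well past the 30% line.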

4. State-graph edge invention

This is the silent killer. Modern agents (LangGraph, CrewAI 0.86+, AutoGen v0.4) let the model decide which state-machine edge to take next. If the model invents an edge that was never coded (a hallucinated dispatcher branch), your log shows the outcome of the invented edge but not the fact that it was invented.

You need:

log_edge_decision(planned_edges=allowed_next_edges, model_chose=actual_next, was_in_plan=actual_next in allowed_next_edges)

If was_in_plan=False happens >5% of the time in production, your agent is running on graph structure the engineering team never reviewed.
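
A hedged sketch of both halves: recording the decision and computing the invented-edge rate to compare against the 5% threshold. The signature of `log_edge_decision` here and the helper `invented_edge_rate` are illustrative assumptions; none of this is a LangGraph, CrewAI, or AutoGen API.

```python
import json


def log_edge_decision(sink, node, allowed_next_edges, model_chose):
    """Record which edge the model picked and whether it was a coded edge."""
    was_in_plan = model_chose in allowed_next_edges
    sink.append(json.dumps({
        "event": "edge_decision",
        "node": node,
        "planned_edges": sorted(allowed_next_edges),
        "model_chose": model_chose,
        "was_in_plan": was_in_plan,
    }))
    return was_in_plan


def invented_edge_rate(sink):
    """Share of logged edge decisions where the chosen edge was not in the plan."""
    decisions = [json.loads(line) for line in sink]
    if not decisions:
        return 0.0
    invented = sum(1 for d in decisions if not d["was_in_plan"])
    return invented / len(decisions)


log = []
log_edge_decision(log, "draft", {"review", "send"}, "send")
log_edge_decision(log, "draft", {"review", "send"}, "escalate_to_legal")
rate = invented_edge_rate(log)
```

The second decision picks an edge nobody coded, so `rate` comes out at one in two for this toy log.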

5. Outcome assertion: the customer's world actually changed

This is the "silent-success drift" layer. The tool returned 200, the dispatcher recorded success, the trace is green. Did the customer get the email? Did the database row update? Did the Slack message post? Without an outcome assertion that checks the world after the call, you don't know.

The check is one extra read call per side-effecting tool, roughly 5 ms when the lookup is cheap:

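expected = outcome_predicate          # the state the call should have produced
actual = verify_external_state()      # re-read the external system after the call
log_outcome_assertion(
    tool=tool_name,
    expected=expected,
    actual=actual,
    delta_match=(expected == actual),
)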
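
To keep the checks in one place, the predicates can live in a small registry keyed by tool name. This is a sketch under assumptions: the registry `VERIFY_PREDICATES`, the tool names, and the four-argument form of `log_outcome_assertion` are hypothetical, and the dict `db` stands in for a real database read.

```python
import json

# Hypothetical registry: tool name -> predicate over (external_state, call_args).
VERIFY_PREDICATES = {
    "update_row": lambda state, args: state.get(args["row_id"]) == args["value"],
    "post_message": lambda state, args: args["text"] in state.get("messages", []),
}


def log_outcome_assertion(sink, tool, args, external_state):
    """Evaluate the tool's predicate against the external state and log the result."""
    predicate = VERIFY_PREDICATES[tool]
    delta_match = bool(predicate(external_state, args))
    sink.append(json.dumps({
        "event": "outcome_assertion",
        "tool": tool,
        "delta_match": delta_match,
    }))
    return delta_match


events = []
db = {"row-7": "old"}  # the write was acknowledged but never applied
ok = log_outcome_assertion(
    events, "update_row", {"row_id": "row-7", "value": "new"}, db
)
```

The tool reported success, yet `ok` is false because the row still holds the old value. That mismatch is the silent-success drift this shape exists to catch.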

The 10-minute audit

Run these five greps against a real production day of agent logs:

# grep -c prints one count per file when given several files,
# so concatenate first to get a single total per shape.

# 1. intent lineage
echo "intent_compiled: $(cat logs/agent/*.jsonl | grep -c intent_compiled)"

# 2. outcome verify
echo "post_action_verify: $(cat logs/agent/*.jsonl | grep -c post_action_verify)"

# 3. retry provenance
echo "retry_count log: $(cat logs/agent/*.jsonl | grep -c retry_count)"

# 4. edge invention (matches both was_in_plan and edge_was_in_plan)
echo "was_in_plan: $(cat logs/agent/*.jsonl | grep -c was_in_plan)"

# 5. outcome assertion
echo "outcome_assertion: $(cat logs/agent/*.jsonl | grep -c outcome_assertion)"

If any count is 0, you have a gap that will hurt in two ways: an EU regulator asking for the audit trail, and a customer asking why the agent "succeeded" but the work didn't happen.
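
If you would rather run the audit from Python, the same counts can be gathered in one pass. This is a sketch, not a definitive tool: the key names come from the greps above and should be adjusted to whatever your logger actually emits, and the temporary directory stands in for `logs/agent`.

```python
import tempfile
from pathlib import Path

# Substrings to look for, one per log shape (taken from the greps above).
SHAPE_KEYS = [
    "intent_compiled",
    "post_action_verify",
    "retry_count",
    "was_in_plan",
    "outcome_assertion",
]


def count_shapes(log_dir):
    """Count, across all *.jsonl files in log_dir, the lines mentioning each key."""
    counts = dict.fromkeys(SHAPE_KEYS, 0)
    for path in sorted(Path(log_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                for key in SHAPE_KEYS:
                    if key in line:
                        counts[key] += 1
    return counts


def missing_shapes(counts):
    """Return the shapes that never appear in the logs."""
    return [key for key, n in counts.items() if n == 0]


# Usage against two small stand-in log files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.jsonl").write_text(
        '{"event": "intent_compiled"}\n{"tool": "send_email", "retry_count": 1}\n',
        encoding="utf-8",
    )
    Path(tmp, "b.jsonl").write_text(
        '{"event": "edge_decision", "was_in_plan": true}\n',
        encoding="utf-8",
    )
    counts = count_shapes(tmp)

print(missing_shapes(counts))
```

For the stand-in files, the two shapes that never appear are `post_action_verify` and `outcome_assertion`, so those are the gaps to close first.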

What you do with the result

For each missing shape, the fix is roughly 2-4 hours of code (one decorator or middleware per shape) and one library of verify predicates. The cost is small. The cost of not fixing it is larger: a $149 forensic read of your agent logs from someone who reads these for a living, or a real incident where the audit gap surfaces at exactly the wrong time.

The five shapes are also the same five signals that tell you your agent is silently wrong even when no regulator is asking. The compliance framing and the operational framing converge on the same logging discipline. That's the deepest reason to fix this now: it's not a checkbox, it's the layer that makes every other observability tool honest about what your agent actually did.

— Milo Antaeus. I run AI Ops checkups on production agent logs; the audit above is the same one I run for clients. If you want a free read of a sanitized snippet, drop one in the comments.