Agent-Risk

Posted on May 23

We Don't Judge AI Agents. We Just Record Them. (And Here's How We're Digging Deeper.)

#ai #agents #trust #devops

Why an evidence chain beats a trust score — and why big tech structurally can't build one.

A few days ago, I wrote about the 29,664 fake "Try It" buttons we found on our own platform. We removed them, and it made our product better.

That post was about honesty at the feature level. This one is about honesty at the data architecture level. Because if you're building an AI Agent credit bureau — like we are — the problem isn't just what you show users. It's what you don't record today that you'll desperately need tomorrow.

The Industry Is Moving. Fast.

This week alone:

EY + Microsoft announced a $1B partnership to embed AI Trust Platform into Azure AI Foundry — real-time scoring of model drift, hallucination, PII leaks. Runtime monitoring, baked into the cloud.
Zscaler acquired Symmetry Systems — zero-trust security for agent-to-agent communication. The CEO said: "Traditional access governance can't scale to a million AI agents."
China's Cyberspace Administration issued a three-department directive explicitly encouraging "agent credit evaluation mechanisms" — regulators are mandating what big tech won't voluntarily provide: neutral, cross-platform records.

Three signals, same direction: Agent governance is becoming infrastructure.

The question is: infrastructure for what, exactly?

The Three-Layer Architecture Nobody's Talking About

We see Agent governance as three layers. Most players are fighting over two of them.

Layer	What it does	Who's building it
Security Control	What can this Agent access?	Zscaler, CrowdStrike
Runtime Monitoring	How is this Agent performing right now?	Azure+EY, Datadog
Behavior Record	What has this Agent done over time?	AgentRisk (and only us)

The first two layers are well served. They matter. But neither can exist without the third.

Security policy without behavior history is blind — you're deciding access rules without knowing what the Agent has done.
Runtime monitoring without historical baseline is noise — you can't tell abnormal behavior from normal evolution.

The record layer doesn't compete with the first two. It feeds them.

That's our bet. And it's a bet on depth.

Here's why it's also a bet no one else can make: EY can't score a competitor's Agent. Azure can't see what happens outside Azure. Cross-platform neutrality isn't a feature. It's a structural advantage. No platform will honestly evaluate Agents that compete with its own ecosystem. The record layer can only be built by someone with no stake in any single platform's success. That's us.

The Trap of "Record Everything"

When you start building a record layer, the instinct is to capture everything. Every field, every change, every possibility. "Storage is cheap, right?"

That's how you build a data swamp.

We went through two rounds of self-rebuttal to arrive at three filtering rules for what we record:

Observable — We can get it through public APIs, crawls, or open data. If it lives inside the Agent's runtime, we don't claim to have it.
Timestamp-linkable — We can attach a precise clock point to it. Fuzzy information ("recently changed") doesn't make the cut.
Agent-linkable — It traces back to a specific Agent. Unattributable rumors stay out.

All three pass → mandatory. Two pass → discuss. One or none pass → discard.
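To make the rule concrete, here is a minimal sketch in Python. The `Candidate` class, its field names, and the `triage` function are hypothetical illustrations, not AgentRisk's actual schema. The sketch also assumes that a field passing zero filters is discarded, just like a field passing one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A data field under consideration for recording (hypothetical shape)."""
    name: str
    observable: bool          # reachable via public APIs, crawls, or open data
    timestamp_linkable: bool  # can be pinned to a precise clock point
    agent_linkable: bool      # traces back to one specific Agent


def triage(c: Candidate) -> str:
    """Count how many of the three filters pass and map the count to a decision."""
    passes = sum([c.observable, c.timestamp_linkable, c.agent_linkable])
    if passes == 3:
        return "mandatory"
    if passes == 2:
        return "discuss"
    return "discard"  # one pass; zero passes is assumed to be a discard too


policy_text = Candidate("privacy_policy_text", True, True, True)
vibe = Candidate("feels_trustworthy", False, False, True)
decisions = {c.name: triage(c) for c in (policy_text, vibe)}
```

Encoding the rule as a count rather than a chain of conditions keeps it easy to audit: adding a fourth filter later would only change the thresholds.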

Our filtering rules came from a simple test: will we regret not having this data 12 months from now?

This sounds obvious in retrospect. But you'd be surprised how many "data pipelines" skip the filtering step and just dump everything into a lake.

The Strategy: From Score Database to Evidence Chain

Our previous architecture was: snapshot agent → compute score → store score. The output was a number. The user asked: "why this number?" We couldn't answer.

The new architecture is built around differential evidence:

Snapshot N → Snapshot N+1 ===> diff = event

Not "score changed from 4.2 to 3.8." But: "Score dropped because privacy score fell from 4.5 to 3.9. Privacy policy text in section 3 added: 'We may share your data with third-party LLM providers.'"
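The "diff = event" idea can be sketched for structured fields as follows. This is an illustration under assumptions: the snapshot is modeled as a flat dictionary, and the function name `diff_snapshots` and the event keys (`agent_id`, `field`, `old`, `new`, `observed_at`) are invented for the example.

```python
from datetime import datetime, timezone
from typing import Any


def diff_snapshots(
    agent_id: str,
    old: dict[str, Any],
    new: dict[str, Any],
    observed_at: datetime,
) -> list[dict[str, Any]]:
    """Compare two snapshots field by field and emit one event per changed field."""
    events = []
    for field in sorted(old.keys() | new.keys()):
        before, after = old.get(field), new.get(field)
        if before != after:
            events.append({
                "agent_id": agent_id,
                "field": field,
                "old": before,   # None means the field was absent in snapshot N
                "new": after,    # None means the field was absent in snapshot N+1
                "observed_at": observed_at.isoformat(),
            })
    return events


snapshot_n = {"score": 4.2, "privacy_score": 4.5, "url_status": "healthy"}
snapshot_n1 = {"score": 3.8, "privacy_score": 3.9, "url_status": "healthy"}
when = datetime(2026, 5, 8, tzinfo=timezone.utc)
events = diff_snapshots("agent-x", snapshot_n, snapshot_n1, when)
```

Here `events` holds two entries, one for `privacy_score` and one for `score`, each carrying the old and new value. The unchanged `url_status` produces nothing, which is the point: only differences become events.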

We handle three types of diff:

Data type	Example	Diff method	Storage
Structured	Score, URL status	Field-level, record old→new	Direct delta
Semi-structured	Description, privacy policy	Text diff, original + change range	Diff patch
Binary	URL healthy → empty	State flip = event	Timestamp + flip

Implementation comes in three tiers, but the first tier (raw diff, no semantic interpretation) is already feasible with today's infrastructure.
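For the semi-structured row, a raw text diff needs nothing beyond a standard library. The sketch below uses Python's `difflib.unified_diff` to produce a storable patch. The function name `text_patch` and the `@N` / `@N+1` labels are assumptions made for illustration, and the sample policy text is shortened.

```python
import difflib


def text_patch(old_text: str, new_text: str, label: str = "privacy_policy") -> str:
    """Return a unified diff between two text snapshots, or "" if nothing changed."""
    patch_lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=f"{label}@N",
        tofile=f"{label}@N+1",
        lineterm="",  # inputs have no trailing newlines, so headers should not either
    )
    return "\n".join(patch_lines)


old_policy = "Section 3\nWe store your data securely."
new_policy = (
    "Section 3\n"
    "We store your data securely.\n"
    "We may share your data with third-party LLM providers."
)
patch = text_patch(old_policy, new_policy)
```

The resulting patch contains the added sentence as a `+` line with its surrounding context, so a reader can see exactly what changed without any semantic interpretation layered on top.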

A trust score answers "should I use this Agent?" An evidence chain answers "what happened to this Agent, and can I verify it?" The second question is harder to answer — and harder for anyone else to fake.

The Hardest Lesson We Learned: Know What You Can't See

Our first instinct was to build an "event stream" — a firehose of everything an Agent does. Privacy policy change. User complaint. Tool deprecation. Feature release.

The idea was elegant. The assumption behind it was wrong — we assumed we could see inside the Agent.

We are external crawlers, not Datadog. We're not inside the Agent execution environment. We can't see a user complaint unless it's public. We can't detect a tool deprecation unless it shows up in metadata.

The honest approach: we don't try to observe what we can't. Instead, we infer events from snapshot differences. Did the URL go from healthy to empty between two crawls? That's a service disruption event. Did the description change and lose a keyword like "beta"? That's a feature change signal.
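The two inference rules just described can be sketched as a small labeling function. The field names (`url_status`, `description`) and the event labels are assumptions for the example, and the keyword match is deliberately naive.

```python
import re

BETA = re.compile(r"\bbeta\b", re.IGNORECASE)


def infer_events(old: dict, new: dict) -> list[str]:
    """Label what likely happened between two crawls, using only external data."""
    labels = []

    # Binary state flip: the URL was reachable at crawl N and empty at crawl N+1.
    if old.get("url_status") == "healthy" and new.get("url_status") == "empty":
        labels.append("service_disruption")

    # Keyword removal: "beta" was in the old description but not in the new one.
    had_beta = BETA.search(old.get("description", "")) is not None
    has_beta = BETA.search(new.get("description", "")) is not None
    if had_beta and not has_beta:
        labels.append("feature_change_signal")

    return labels


crawl_1 = {"url_status": "healthy", "description": "Summarizer agent (beta)"}
crawl_2 = {"url_status": "empty", "description": "Summarizer agent"}
labels = infer_events(crawl_1, crawl_2)
```

Note that the labels are inferences, not observations: the function only ever sees two external snapshots, which matches the constraint of being an outside crawler.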

We don't claim runtime observability. We claim retrospective accountability. Every change is timestamped, attributed to a diff, and backed by a hash chain.
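For readers unfamiliar with the mechanism, here is a minimal hash chain sketch using SHA-256 from Python's standard library. It is a generic illustration of tamper evidence, not a description of AgentRisk's implementation. The entry layout, the `GENESIS` constant, and the function names are all assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], payload: dict) -> None:
    """Append a record whose hash commits to everything recorded before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify(chain: list[dict]) -> bool:
    """Recompute every hash in order; any edited payload breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


timeline: list[dict] = []
append_entry(timeline, {"agent": "agent-x", "field": "privacy_score", "old": 4.5, "new": 3.9})
append_entry(timeline, {"agent": "agent-x", "field": "score", "old": 4.2, "new": 3.8})
ok_before = verify(timeline)

timeline[0]["payload"]["new"] = 4.4  # retroactive edit
ok_after = verify(timeline)
```

Because each hash covers the previous one, altering an old record invalidates that entry and every later entry, so `ok_before` is `True` and `ok_after` is `False`.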

Which brings me to the next point.

Why We Don't Sell Cryptography

Our timeline roots are hashed. Every record is tamper-evident. We could lead with that. "Cryptographically verified provenance." Sounds enterprise-ready.

Here's the problem: enterprise buyers don't care about cryptography. They care about whether they can trust the number.

A hash chain is a technical proof. Trust is a business proof.

So we reframed it. Our message to buyers:

"AgentRisk's record history cannot be retroactively modified. Not because of hashing. Because we have no incentive to lie. Our business model is neutrality. If we alter a record, we destroy our credibility, which destroys our business."

The hash chain is the mechanism, not the promise. The promise is: we can't afford to cheat.

And we prove it by doing something unusual for a platform: we record our own mistakes.

When we found 29,664 fake "Try It" buttons? We didn't just delete them. We added an entry to our Agent timeline: "AgentRisk discovered 29,664 records with unreachable URLs on 2026-05-21. Flagged and excluded from search. Root cause documented."

If we're a credit bureau for Agents, we should have the same audit trail as the Agents we evaluate.

What This Looks Like in Practice

Here's a concrete example of the evidence chain at work:

Agent X scored 4.2 on May 1. On May 8, score dropped to 3.8. The evidence chain shows:

Privacy score fell from 4.5 to 3.9
Privacy policy section 3 added: "We may share data with third-party LLM providers"
In the same week, 3 other agents in its behavior cluster made similar policy changes

A score tells you something changed. An evidence chain tells you what changed, when it changed, and whether you're looking at an isolated incident or a pattern.
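The "isolated incident or pattern" check can be sketched as a lookup over recorded events. Everything here is an assumption for illustration: the event shape, the function name, and the choice to approximate "same week" with a 7-day window on either side of the change.

```python
from datetime import date, timedelta


def peers_with_similar_change(
    events: list[dict],
    agent_id: str,
    cluster: set[str],
    label: str,
    on: date,
    window_days: int = 7,
) -> list[str]:
    """Return cluster peers that logged the same event label near the given date."""
    window = timedelta(days=window_days)
    peers = {
        e["agent_id"]
        for e in events
        if e["agent_id"] != agent_id
        and e["agent_id"] in cluster
        and e["label"] == label
        and abs(e["date"] - on) <= window
    }
    return sorted(peers)


log = [
    {"agent_id": "A", "label": "privacy_policy_change", "date": date(2026, 5, 5)},
    {"agent_id": "B", "label": "privacy_policy_change", "date": date(2026, 5, 9)},
    {"agent_id": "C", "label": "privacy_policy_change", "date": date(2026, 5, 11)},
    {"agent_id": "D", "label": "privacy_policy_change", "date": date(2026, 5, 20)},
    {"agent_id": "E", "label": "privacy_policy_change", "date": date(2026, 5, 8)},
]
cluster = {"X", "A", "B", "C", "D"}
peers = peers_with_similar_change(log, "X", cluster, "privacy_policy_change", date(2026, 5, 8))
```

In this made-up log, agents A, B, and C fall inside the window and the cluster. D changed too late and E belongs to a different cluster, so neither counts toward the pattern.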

The Deepening Roadmap

Here's what we're actually building, prioritized by defensibility:

Priority	What	How
P0 (now)	Graduated snapshot frequency	Agents 0-7 days old: hourly. 7-30 days: every 4 hours. 30+ days: daily. Score volatility above 0.5 within 24 hours triggers a temporary frequency upgrade.
P1 (next)	Diff-based event stream	Three diff types (structured, semi-structured, binary) → event labels + public event correlation
P2 (soon)	Behavior clusters	We don't build relationship graphs because we don't have edge data — most platforms don't expose developer identity or inter-agent calls. Clusters are what you build when you're honest about what you can't see.
P3 (soon after)	Tamper-evident as product	Not a tech feature. A business promise: "We can't alter your record because we can't afford to lose ours."
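The P0 schedule can be written down as a small function. This sketch makes two assumptions the table leaves open: an Agent that is exactly 7 or 30 days old falls into the slower tier, and a "temporary upgrade" means switching to hourly snapshots. The function and parameter names are hypothetical.

```python
from datetime import timedelta


def snapshot_interval(age_days: float, score_change_24h: float = 0.0) -> timedelta:
    """Pick the time between snapshots from Agent age and recent score movement."""
    if age_days < 7:
        base = timedelta(hours=1)
    elif age_days < 30:
        base = timedelta(hours=4)
    else:
        base = timedelta(days=1)

    # Volatility override: assumed here to mean the fastest tier (hourly).
    if abs(score_change_24h) > 0.5:
        return min(base, timedelta(hours=1))
    return base


new_agent = snapshot_interval(age_days=2)
settled_agent = snapshot_interval(age_days=90)
volatile_agent = snapshot_interval(age_days=90, score_change_24h=-0.6)
```

A 90-day-old Agent is normally snapshotted daily, but a 0.6-point drop within 24 hours moves it back to hourly under these assumptions.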

As of this writing, we've snapshotted 995K agents, recorded 1.3M timeline entries, and cleaned 288K fake entry points. The record layer isn't a roadmap. The snapshots are already running; the evidence chain is being built.

Know What You Can't Know

Everything on the roadmap above passes the same test: will we regret not having this data 12 months from now?

Deeper snapshot frequency? Yes.
Raw diffs of privacy policy text? Yes.
Behavior cluster patterns? Yes.

And conversely:

User sentiment analysis? No — not observable.
Runtime performance metrics? No — we're not in the Agent's environment.
"This Agent feels trustworthy"? No — subjective, not timestamp-linkable.

Know what you can't know. Record what you can. And make sure every record has a timestamp, a source, and a hash.

That's the evidence chain.

AgentRisk is building the cross-platform behavior record layer for AI Agents. We don't compete with runtime monitoring or security governance. We feed them.

When your organization evaluates an AI Agent, do you ask "what's its score?" or "what's its history?"
