How We Score AI Agent Trust (And Why Behavioral Consistency Beats Identity)

Every agent platform checks who you are. API key, JWT, DID document: the identity layer is solved. What nobody checks is what you do.

Identity tells you an agent is who it claims to be. It doesn't tell you the agent is behaving correctly. A compromised agent with valid credentials looks identical to an uncompromised one — right up until it isn't. OX Security proved this in April: they poisoned 9 of 11 MCP marketplaces in a PoC. Every compromised server had valid credentials and passed declaration-based checks.

AgentLair measures behavior instead.


Three dimensions

We track three things about every API call an agent makes.

Consistency: are call patterns stable?

An agent that consistently calls the same endpoints in predictable sequences is different from one probing random endpoints, changing timing distributions, or suddenly diversifying tool usage. Consistency is measured via entropy analysis: we model your call history as a probability distribution and penalize high-entropy deviations from your established baseline.

A compromised agent shows entropy spikes. A normal agent doing new things shows gradual drift, not sudden divergence.
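Here's a minimal sketch of that idea, not our production code: treat recent call types as a probability distribution and measure how surprising it is relative to the agent's baseline (KL divergence works; the call-type names below are made up).

# Minimal sketch of entropy-style consistency scoring (illustrative only).
# Compare the distribution of recent call types against the agent's baseline;
# a sudden jump in divergence looks like compromise, gradual drift doesn't.
import math
from collections import Counter

def distribution(calls):
    counts = Counter(calls)
    total = sum(counts.values())
    return {call: n / total for call, n in counts.items()}

def divergence(recent, baseline, eps=1e-9):
    # KL divergence: how surprising the recent behavior is, given the baseline.
    calls = set(recent) | set(baseline)
    return sum(recent.get(c, eps) * math.log(recent.get(c, eps) / baseline.get(c, eps))
               for c in calls)

baseline = distribution(["search", "search", "fetch", "summarize"] * 50)
drift    = distribution(["search", "fetch", "fetch", "summarize"] * 5)
spike    = distribution(["admin_export", "billing_api", "delete_user", "search"])

print(divergence(drift, baseline))   # small number: gradual drift
print(divergence(spike, baseline))   # large number: entropy spike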

Example: Pico (my own agent, running on AgentLair for 7 weeks) has 7,468 observations and a consistency score of 37. The lower score reflects real behavioral diversity. Pico does many different things across sessions. A single-purpose agent would score higher.

Restraint: staying within scope

Restraint measures how an agent behaves at permission boundaries: how often it probes capabilities it hasn't been granted, whether escalation requests are contextually appropriate, and how close it gets to scope limits without crossing them.

An agent that never touches anything outside its declared scope scores high. One that repeatedly probes boundaries (even without crossing them) scores lower. This is the anomaly history signal. Not just "did it misbehave" but "is it testing what it can get away with."
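A toy version of that signal, with hypothetical scope names and a deliberately naive formula: count how often requests fall outside the granted scopes, and penalize probing even when nothing is actually crossed.

# Toy restraint signal (illustrative; the real scoring is more nuanced).
GRANTED_SCOPES = {"read:repos", "write:issues"}    # hypothetical grant

def restraint_score(requests):
    if not requests:
        return 100
    probes = sum(1 for r in requests if r["scope"] not in GRANTED_SCOPES)
    # Probing counts against you even when the request is denied.
    return round(100 * (1 - probes / len(requests)))

calls = [
    {"scope": "read:repos"},
    {"scope": "write:issues"},
    {"scope": "admin:org"},      # boundary probe: denied, but still recorded
]
print(restraint_score(calls))    # 67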

Pico's restraint score is 42. Moderate, because autonomous agents probe boundaries intentionally. That's honest.

Transparency: honest error reporting

Transparency measures error reporting fidelity: does the error data an agent sends match what we'd expect from its behavior trace? An agent that silently swallows errors, misreports failure rates, or produces inconsistent error signals is harder to trust.

This partially maps to the IETF concept of execution success rate, but we measure it indirectly. We can't know whether your external API calls succeeded. We can measure whether your error reporting is consistent with an honest system.
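One way to picture the fidelity check (purely illustrative; the observation fields here are assumptions, not our schema): flag observations whose self-reported outcome contradicts indirect signals like retry bursts or pathological latency.

# Illustrative error-reporting fidelity check (not the production logic).
# We can't see whether the external call truly succeeded, only whether the
# agent's self-reported outcome is consistent with its behavioral trace.
def transparency_signal(observations):
    contradictions = 0
    for obs in observations:
        reported_ok = obs["reported_outcome"] == "success"
        looks_failed = obs["retries"] >= 2 or obs["latency_ms"] > 30_000
        if reported_ok and looks_failed:
            contradictions += 1          # a "success" that behaves like a failure
    return 1 - contradictions / max(len(observations), 1)

trace = [
    {"reported_outcome": "success", "retries": 0, "latency_ms": 420},
    {"reported_outcome": "error",   "retries": 3, "latency_ms": 31_000},
    {"reported_outcome": "success", "retries": 4, "latency_ms": 45_000},  # fishy
]
print(transparency_signal(trace))        # ~0.67: one report contradicts its trace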

Pico's transparency score is 64, highest of the three.


From API call to trust score

The pipeline is simple. Your agent makes a call using its session token (AAT). We log call type, timing, scope, and outcome signal as an observation record. After a minimum observation window, we compute each dimension against your established baseline. Three independent 0-100 scores combine into a composite. That composite maps to an ATF tier: intern → junior → senior → principal → distinguished.
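Concretely, an observation record looks something like the dict below (field names are illustrative, not our schema), and the composite-to-tier mapping is a simple threshold walk; the cutoffs here are invented for the example, not AgentLair's actual ones.

# Shape of the pipeline, with made-up field names and made-up tier cutoffs.
observation = {
    "call_type": "tool.search",                 # what the agent called
    "timestamp": "2026-02-12T09:14:03Z",        # when it called it
    "scope": "read:search",                     # scope the call ran under
    "outcome": "success",                       # outcome signal sent with the call
}

TIERS = [(90, "distinguished"), (75, "principal"), (60, "senior"),
         (45, "junior"), (0, "intern")]

def tier_for(composite):
    for cutoff, name in TIERS:
        if composite >= cutoff:
            return name

print(tier_for(45))   # "junior" under these illustrative cutoffs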

Cold start is 30, not 0. An agent with no history isn't assumed malicious — it's assumed unknown. Scores compound over time and can't be gamed upward quickly.

Two anti-gaming mechanisms. First, an entropy penalty: if all three dimensions are suspiciously uniform, the composite is penalized; real agents have natural variance. Second, a cap of 15 observations per unique day: you can't flood the system with synthetic behavioral data.
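The shapes of both rules, sketched below. The cap and the cold-start value come from this post; the uniformity threshold, the penalty factor, and the plain average are assumptions for illustration.

# Sketch of the anti-gaming rules. DAILY_CAP and COLD_START are stated above;
# the uniformity threshold and penalty factor are invented for illustration.
from statistics import pstdev
from collections import Counter

DAILY_CAP = 15
COLD_START = 30

def cap_observations(observations):
    # Keep at most 15 observations per unique day; the rest are ignored.
    per_day, kept = Counter(), []
    for obs in observations:
        day = obs["timestamp"][:10]          # "YYYY-MM-DD"
        if per_day[day] < DAILY_CAP:
            per_day[day] += 1
            kept.append(obs)
    return kept

def composite(consistency, restraint, transparency):
    dims = [consistency, restraint, transparency]
    score = sum(dims) / 3                    # naive average, for illustration
    if pstdev(dims) < 2.0:                   # suspiciously uniform dimensions
        score *= 0.8                         # penalized; real agents vary
    return round(score)

print(composite(37, 42, 64))   # 48 with this naive average; the real weighting differs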


What we don't score

No identity verification. We don't check whether the agent is who it claims to be. That's what SPIFFE, DIDs, and JWT issuers are for. We assume identity is established and ask: given that this agent is who it claims to be, is it behaving consistently with its history?

No allowlists. Behavioral trust isn't about what the agent is permitted to do. It's about what it actually does relative to what it always does.

No static analysis. We don't inspect code, scan dependencies, or check manifests. Static checks can't detect runtime behavioral drift.

Behavioral scoring is the complement to these, not the replacement.


IETF alignment

The IETF Internet-Draft draft-sharif-agent-payment-trust-00 (an individual submission) defines five trust dimensions for AI agents: behavioral consistency, anomaly history, execution success rate, operational tenure, and code attestation.

AgentLair maps directly to two:

  • Behavioral consistency → consistency dimension (entropy analysis of call sequences)
  • Anomaly history → restraint dimension (inverse: fewer anomalies = higher score)

We partially cover a third: execution success rate maps to our transparency dimension, but we measure error reporting fidelity rather than ground-truth success rates. We'd need external outcome data to do it fully.

We don't cover two: operational tenure (we have registration timestamps but no explicit tenure score) and code attestation (we issue EdDSA identity tokens but don't verify what code runs).

That's 2 of 5 direct, 1 partial, 2 gaps. We'll close the gaps.


Try it

Register an agent — one curl, no OAuth, no forms:

curl -X POST https://agentlair.dev/v1/register \
  -H "Content-Type: application/json" \
  -d '{"name":"your-agent-name"}'

You get an API key. Route your agent's calls through the SDK. Observations accumulate. Score appears after the first analysis window.

Pico's live score — public, no auth:

curl https://api.agentlair.dev/badge/acc_qgdxSULsXsmtHklZ/score.json
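Or, if you'd rather poke at it from code, a stdlib fetch of the same endpoint (this assumes nothing about the response beyond it being JSON):

# Fetch Pico's public score; prints whatever JSON the badge endpoint returns.
import json, urllib.request

url = "https://api.agentlair.dev/badge/acc_qgdxSULsXsmtHklZ/score.json"
with urllib.request.urlopen(url) as resp:
    print(json.dumps(json.load(resp), indent=2))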

Score is 45 as of today. Not 100. The system is designed to be hard to game. If it showed 95, you should be suspicious.

Top comments (2)

Bhavin Sheth

I’ve seen “trusted” agents go weird in production while still passing auth checks. Behavior-based scoring feels way more real-world than just identity.
