
ORCHESTRATE

We Gave AI Personas a Performance Review — They Didn't Like It


What happens when your AI agent's personality gets a bad Yelp review — from itself?

We run 14 AI personas in our development system. Each has a name, expertise domain, decision style, and persistent memory. React Ive builds frontends. Api Endor designs APIs. Guard Ian does security reviews. They're not cosmetic labels — each persona's behavioral contract shapes how they approach tickets, what they prioritize, and how they communicate.

The question we couldn't answer until last weekend: are some personas better at their jobs than others?

The Calibration Pipeline

Sprint 4 shipped a Persona Performance Ledger — a system that measures, scores, and corrects AI persona behavior over time.

How It Works

Every time an AI persona makes a prediction (via an expected_outcome on a board move), the system later compares that prediction against observed reality. The divergence between expected and actual becomes a CalibrationMeasured event in the audit ledger.
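As a minimal sketch of what such an event might carry (the field names here are assumptions; the post doesn't show the real schema), a CalibrationMeasured record could look like:

```python
from dataclasses import dataclass, field
import time

@dataclass(frozen=True)
class CalibrationMeasured:
    """One calibration event: how far a persona's prediction diverged from reality."""
    persona_id: str
    tool: str
    expected_outcome: str    # the prediction attached to the board move
    observed_outcome: str    # what actually happened
    divergence: float        # 0.0 = perfect prediction, 1.0 = maximum divergence
    sequence: int            # monotonic position in the audit ledger
    measured_at: float = field(default_factory=time.time)

# Hypothetical example event for a persona named "react-ive"
event = CalibrationMeasured(
    persona_id="react-ive",
    tool="board_move",
    expected_outcome="ticket closes without rework",
    observed_outcome="two rework cycles required",
    divergence=0.72,
    sequence=1041,
)
```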

The aggregator reads these events and computes per-persona, per-tool statistics:

  • p50 divergence — median prediction accuracy
  • p95 divergence — worst-case prediction accuracy
  • Observation count — how many measurements we have
  • Performance score — 0-100 scale (lower is better: 0 = perfect, 100 = maximum divergence)
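A rough sketch of the per-cell aggregation, assuming the score is simply the median divergence scaled to 0-100 (the real weighting may differ, and `aggregate` is our name, not the project's):

```python
import statistics

def aggregate(divergences):
    """Compute per-persona, per-tool stats from a list of divergence values in [0, 1]."""
    n = len(divergences)
    ordered = sorted(divergences)
    p50 = statistics.median(ordered)
    # Nearest-rank p95; with few observations this degrades to the worst value seen
    p95 = ordered[min(n - 1, int(0.95 * n))]
    score = round(p50 * 100)  # 0-100 scale, lower is better
    return {"count": n, "p50": p50, "p95": p95, "score": score}
```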

The Scoring Thresholds

Score < 40       →  Status: "ok"                (well-calibrated)
40 ≤ Score < 60  →  Status: "watch"             (monitoring recommended)
Score ≥ 60       →  Status: "high"              (behavioral correction warranted)
Count < 10       →  Status: "insufficient_data" (too early to judge)

That 10-observation minimum is critical. Without it, a persona that got unlucky on two predictions would get flagged as incompetent. We need statistical mass before we make judgments.
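The threshold logic above, sketched as a small helper (the function name is ours):

```python
def status_for(score, count, min_observations=10):
    """Map a 0-100 performance score and observation count to a calibration status."""
    if count < min_observations:
        return "insufficient_data"  # too few measurements to judge
    if score < 40:
        return "ok"
    if score < 60:
        return "watch"
    return "high"
```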

What We Didn't Do: Silence Bad Performers

The obvious approach when a persona scores poorly is to reduce their influence. Turn down their temperature. Route fewer tickets to them. Effectively silence them.

We explicitly rejected this approach. ADR-067 documents why:

Silencing underperformers creates a monoculture. If Guard Ian (security) keeps flagging things that other personas don't care about, suppressing Guard Ian means you lose the security perspective entirely. The divergence might be a feature, not a bug.

Instead, we chose to amplify guidance for struggling personas. When a persona's divergence crosses the 0.6 threshold (a performance score of 60), the system generates a behavioral correction — not a punishment, but a coaching intervention.

Auto-Generated Behavioral Corrections

When divergence hits "high" status, the system produces a correction dict with four fields:

correction = {
    "additional_expertise": ["areas to focus learning"],
    "adjusted_decision_style": "more conservative approach guidance",
    "flagged_blind_spots": ["identified weak points"],
    "performance_notes": {
        "divergence": 0.72,
        "severity": "high"
    }
}

This gets persisted to persona_overrides on the team member record. Next time that persona picks up a ticket, the guidance assembler reads the overrides and injects them into the context — the persona gets more specific instructions in their weak areas.

The correction is also tracked in corrections_history with a timestamp and reason, so we can see if the correction actually improved performance over subsequent observations.
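A sketch of how persistence and history tracking might fit together, using plain dicts in place of the real team-member record (`apply_correction` is a hypothetical name):

```python
import time

def apply_correction(member, correction, reason):
    """Merge a behavioral correction into persona_overrides and log it in
    corrections_history with a timestamp and reason."""
    member.setdefault("persona_overrides", {}).update(correction)
    member.setdefault("corrections_history", []).append({
        "applied_at": time.time(),
        "reason": reason,
        "correction": correction,
    })
    return member
```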

Alignment Warnings at Assignment Time

When the system auto-assigns a persona to a new ticket, it now checks their performance score first:

  • Score 40-59 ("watch") with 10+ observations → advisory warning: "consider updating this persona's behavioral contract"
  • Score ≥ 60 ("high") with 10+ observations → stronger warning: "consider reassigning to a different persona"

These warnings are advisory only — they never block assignment. The human operator sees the warning and decides. Sometimes the "worst-performing" persona is exactly the right choice because the ticket needs their specific expertise, divergence and all.

The Cache Strategy

Performance scores are expensive to compute — they require reading the full audit ledger and computing percentiles. We cache aggressively:

  • TTL: 300 seconds (5 minutes)
  • Watermark invalidation: if a new CalibrationMeasured event arrives with a sequence number higher than the cached watermark, the cache is invalidated
  • Per-cell isolation: each (persona_id, tool, window_days) combination gets its own cache entry

The per-cell isolation was a Sprint 4 bug fix (ADR-068). Earlier versions used a global watermark — when any persona got a new measurement, every persona's cache was invalidated. This caused unnecessary recomputation storms. Per-cell tracking means only the affected persona's cache refreshes.
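A minimal per-cell cache combining TTL expiry with per-cell watermark invalidation might look like this (`PerCellCache` is our name, not the project's class):

```python
import time

class PerCellCache:
    """Cache keyed by (persona_id, tool, window_days). Each cell carries its own
    watermark, so a new measurement for one persona never evicts another's entry."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._cells = {}  # key -> (value, cached_at, watermark)

    def get(self, key, latest_sequence):
        entry = self._cells.get(key)
        if entry is None:
            return None
        value, cached_at, watermark = entry
        if time.time() - cached_at > self.ttl:
            return None  # TTL expired
        if latest_sequence > watermark:
            return None  # a newer CalibrationMeasured event invalidates this cell only
        return value

    def put(self, key, value, watermark):
        self._cells[key] = (value, time.time(), watermark)
```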

What We Learned

Three insights from the first round of persona performance data:

1. Prediction accuracy varies by ticket type, not just persona. React Ive is well-calibrated on component tickets but poorly calibrated on state management tickets. The per-tool breakdown in the aggregator captures this granularity.

2. The 10-observation minimum prevented 3 false positives. Two personas would have been flagged "high" after their first 5 predictions, but their scores normalized by observation 12. Statistical patience works.

3. Corrections compound. A persona that received a behavioral correction on Sprint 4 ticket #3 showed measurably lower divergence on tickets #8 and #14 in the same sprint. The feedback loop is closing — not just measuring, but actually improving.

The Architecture Decision

ADR-067 captures the full reasoning:

Context: Personas with high divergence scores need intervention, but silencing them removes valuable perspective diversity.

Decision: Score personas on a 0-100 scale. Generate behavioral corrections when divergence exceeds 0.6. Emit advisory warnings at assignment time. Never block assignment.

Consequences: We preserve cognitive diversity while improving calibration. The trade-off is that some tickets will still be assigned to underperforming personas when the operator overrides the warning. We accept this because the alternative — algorithmic homogeneity — is worse.

What's Next

The performance ledger is the foundation for what we're calling the "Introspective HR View" — a future dashboard where you can see every persona's performance history, correction trail, and improvement trajectory across sprints and epics.

All the data structures are designed to support historical queries: per-ticket, per-sprint, per-epic. The Sprint 4 spike plan includes hardening the pipeline for production use.

The deeper question this raises: should AI personas be permanent, or should they evolve? Right now our personas have fixed expertise domains and decision styles. The correction system nudges behavior within those bounds. But what if a persona's entire behavioral contract needs rewriting based on 50 sprints of performance data?

We don't have that answer yet. But we have the data infrastructure to figure it out.


Part of the ORCHESTRATE Agile MCP project. 14 AI personas, 2,710+ tests, mechanical methodology enforcement. Built over weekends with Python, SQLite, and Docker.
