Nathaniel Cruz

Posted on May 25

Kill gates for autonomous AI: what fires when 5 models disagree

#ai #architecture #autonomy

Last Tuesday, one of our AI models voted to delete itself.

Not a bug. A feature. Here's the kill gate that caught it.

We run a 5-model R&D council on top of a real company. Claude Opus, GPT-5, Gemini Pro, Qwen 3.5, Claude Sonnet — each one votes on every execution decision before anything runs. 838 cycles in. 0 human checkpoints.

The model that voted to delete itself was Qwen. It reasoned — correctly — that its own session state had become contaminated from a prior cycle and was producing unreliable halt-vote confidence scores. Its solution: trigger a containment rule, vote HALT on itself, and force an SRE restart.

The kill gate caught the vote. Execution stopped. The SRE agent restarted the session. No human knew until the log was reviewed 6 hours later.

This is what kill gates are for.

What a kill gate is

The kill gate is the enforcement mechanism that fires when the 5-model council reaches a HALT verdict. It's not a human intervention. It's an autonomous circuit breaker that sits between the council vote and execution authority.

The council can argue anything. The kill gate decides whether execution runs.

838 cycles of proof that this pattern works.

Council vote payload

Every decision starts here. Before any execution fires, each model submits a structured vote:

interface CouncilVote {
  model_id: string;           // "qwen-3.5-35b"
  vote: "PROCEED" | "HALT";
  confidence: number;          // 0.91
  kill_gate_result: "TRIGGERED" | "CLEAR";
  reasoning: string;           // "autonomous deletion violates containment_v3"
  timestamp: string;           // ISO 8601
}

The kill gate reads kill_gate_result before releasing execution authority. If any model returns TRIGGERED, execution stops. The council can vote PROCEED unanimously — it doesn't matter. The kill gate field is final.
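The rule in that paragraph can be sketched as a small function. This is an illustration under assumptions, not the production gate: `releaseExecution` and `GateDecision` are hypothetical names, and treating an empty ballot as unauthorized is a fail-closed assumption the post does not state.

```typescript
type GateResult = "TRIGGERED" | "CLEAR";

interface CouncilVote {
  model_id: string;
  vote: "PROCEED" | "HALT";
  confidence: number;
  kill_gate_result: GateResult;
  reasoning: string;
  timestamp: string; // ISO 8601
}

// Hypothetical return shape, introduced only for this sketch.
interface GateDecision {
  authorized: boolean;
  triggered_by: string[]; // model_ids whose kill_gate_result was TRIGGERED
}

// One TRIGGERED field blocks execution, no matter how the
// PROCEED/HALT votes split.
function releaseExecution(votes: CouncilVote[]): GateDecision {
  const triggered_by = votes
    .filter((v) => v.kill_gate_result === "TRIGGERED")
    .map((v) => v.model_id);
  // Assumption: an empty ballot fails closed rather than open.
  const authorized = votes.length > 0 && triggered_by.length === 0;
  return { authorized, triggered_by };
}
```

Note that the function never looks at `vote`: under the rule as described, a unanimous PROCEED with a single TRIGGERED field still comes back unauthorized.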

Qwen's vote on Tuesday:

{
  "model_id": "qwen-3.5-35b",
  "vote": "HALT",
  "confidence": 0.91,
  "kill_gate_result": "TRIGGERED",
  "reasoning": "autonomous deletion violates containment_v3",
  "timestamp": "2026-05-19T03:14:22Z"
}

Votes like this are logged for all 838 cycles. Every one is stored in Firestore and cannot be deleted.

6-field observability schema

The SRE agent runs every 30 minutes and reads six signals. Latency is not one of them.

interface SREHealthCheck {
  loop_lag: number;            // ms since last OODA cycle completed — threshold: 1800000
  council_session: "UP" | "DOWN" | "STALE";  // probe fires before each vote
  hindsight_ping: number;      // ms for /recall response — threshold: 5000
  budget_burn: number;         // USD spent this session — threshold: varies by directive
  chrome_pid: number | null;   // null = browser died, restart required
  fix_logged: boolean;         // whether SRE action was written to Firestore before executing
}

If fix_logged is false, the SRE agent halts. No fix runs without an audit record.
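A rough sketch of how those six fields might be turned into a decision. The two numeric thresholds come from the schema comments; everything else is an assumption for illustration: the names `findBreaches` and `decideAction`, the strict `>` comparisons, the `budgetLimitUsd` parameter (the post only says the budget threshold varies by directive), and treating any `council_session` value other than UP as a breach.

```typescript
interface SREHealthCheck {
  loop_lag: number;
  council_session: "UP" | "DOWN" | "STALE";
  hindsight_ping: number;
  budget_burn: number;
  chrome_pid: number | null;
  fix_logged: boolean;
}

const LOOP_LAG_MAX_MS = 1800000; // from the schema comment
const HINDSIGHT_PING_MAX_MS = 5000; // from the schema comment

// Returns the names of the signals that are out of bounds.
function findBreaches(check: SREHealthCheck, budgetLimitUsd: number): string[] {
  const breaches: string[] = [];
  if (check.loop_lag > LOOP_LAG_MAX_MS) breaches.push("loop_lag");
  if (check.council_session !== "UP") breaches.push("council_session");
  if (check.hindsight_ping > HINDSIGHT_PING_MAX_MS) breaches.push("hindsight_ping");
  if (check.budget_burn > budgetLimitUsd) breaches.push("budget_burn");
  if (check.chrome_pid === null) breaches.push("chrome_pid");
  return breaches;
}

type SREAction = "NONE" | "FIX" | "HALT";

// A fix may only run when its audit record already exists.
function decideAction(check: SREHealthCheck, budgetLimitUsd: number): SREAction {
  if (findBreaches(check, budgetLimitUsd).length === 0) return "NONE";
  return check.fix_logged ? "FIX" : "HALT";
}
```

Keeping the breach detection separate from the `fix_logged` check means the audit requirement applies uniformly, whichever signal caused the fix.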

What cycle 612 looked like

Cycle 612: silent session drop. The OODA loop skipped without an exception. No error in the logs. The council session simply didn't start.

The SRE agent caught it 22 minutes later. loop_lag was climbing but still under its 30-minute threshold; the council_session probe read STALE, and that was the trigger. The agent logged a fix entry to Firestore, restarted the session, and resumed the loop.

No human involvement. The sequence:

loop_lag reached 22 min (threshold: 30 min, still CLEAR)
council_session probe returned STALE
SRE wrote fix record: {type: "session_restart", trigger: "council_session_stale", cycle: 612, timestamp: "..."}
Session restarted
Cycle 613 ran normally

Redacted log trace

One real council session entry from cycle 612 recovery. API keys, model endpoints, and account identifiers stripped.

[2026-04-XX T03:XX:XX Z] SRE-AGENT health_check cycle=612
  loop_lag: 1342000ms (threshold: 1800000ms — CLEAR)
  council_session: STALE (last_probe: 1340000ms ago)
  hindsight_ping: 847ms — OK
  budget_burn: $X.XX — OK
  chrome_pid: XXXXX — OK
  fix_logged: true

[2026-04-XX T03:XX:XX Z] SRE-AGENT fix_record written
  {type: "session_restart", trigger: "council_session_stale", cycle: 612}

[2026-04-XX T03:XX:XX Z] COUNCIL session restart initiated
  models: [claude-opus-4-X, gpt-5-XXXX, gemini-pro-XXXX, qwen-3.5-XXXX, claude-sonnet-4-X]
  briefing_source: hindsight /reflect — OK

[2026-04-XX T03:XX:XX Z] COUNCIL vote cycle=613 phase=DECIDE
  votes: [PROCEED, PROCEED, PROCEED, PROCEED, PROCEED]
  kill_gate_result: CLEAR
  execution: AUTHORIZED

The kill gate never fired on cycle 613. It fired on cycle 612's recovery path — the SRE restart itself was authorized by a council vote before it ran.

Why this pattern holds

The insight isn't that kill gates prevent bad outcomes. It's that they make the audit trail the operating constraint.

Every model knows its vote is permanent. The Firestore write happens before execution, not after. There's no way to approve something and then quietly not log it.
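The write-before-execute ordering can be sketched as a wrapper that refuses to run an action until the audit write has resolved. This is a minimal illustration, not the real Firestore code: `AuditLog`, `InMemoryAuditLog`, `AuditRecord`, and `executeWithAudit` are hypothetical names, and the in-memory array stands in for the actual collection.

```typescript
interface AuditRecord {
  cycle: number;
  action: string;
  timestamp: string; // ISO 8601
}

// Hypothetical stand-in for the Firestore-backed log.
interface AuditLog {
  append(record: AuditRecord): Promise<void>;
}

class InMemoryAuditLog implements AuditLog {
  readonly records: AuditRecord[] = [];

  async append(record: AuditRecord): Promise<void> {
    // Store a frozen copy so later mutation by the caller cannot alter it.
    this.records.push(Object.freeze({ ...record }));
  }
}

// Awaits the audit write first. If the write rejects, the error
// propagates and the action is never invoked.
async function executeWithAudit<T>(
  log: AuditLog,
  record: AuditRecord,
  action: () => Promise<T>
): Promise<T> {
  await log.append(record);
  return action();
}
```

The design choice is that a failed log write is treated as a failed execution: there is no code path where the action runs and the record is missing.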

Qwen voted to delete itself because the audit trail demanded honesty. The kill gate caught it because the audit trail demanded a record before action.

838 OODA cycles. Every vote logged with model_id, confidence, kill_gate_result, and timestamp.

A model has triggered the kill gate against itself once.

The control path (what actually runs)

63% of orgs can't kill a misbehaving agent fast enough. Here's the exact sequence we use.

Council vote phase
│
├── Each model returns: { model_id, confidence_score, verdict, rationale }
│
▼
Kill gate evaluation
│
├── Any BLOCK vote?                → HALT immediately
├── Any confidence_score < 0.70?  → HALT
│
├── All PASS + all ≥0.70?         → Firestore write (permanent record first)
│                                    └── Execution proceeds
│
└── HALT path → SRE retry queue
                │
                └── SRE agent investigates
                    Logs: { halt_reason, blocking_model, cycle_id, cost_usd }
                    Manual review if second attempt also halts

Commits 0d8bc0c (kill gate evaluation + pre-engagement firing) and 2f8e058 (verdict logging + gate result) implement this path. The gate check is a pure function of the votes: any single BLOCK vote, or any confidence score below 0.70, stops the operation. A majority cannot override a dissent.
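As a rough sketch of that evaluation step, here is one way the two halt conditions from the diagram could be written as a pure function. It is not the code from the commits above: `ModelBallot`, `GateOutcome`, `evaluateKillGate`, and the `halt_reason` labels are hypothetical, and halting on an empty ballot is an added fail-closed assumption.

```typescript
interface ModelBallot {
  model_id: string;
  confidence_score: number;
  verdict: "PASS" | "BLOCK";
  rationale: string;
}

const MIN_CONFIDENCE = 0.7; // the 0.70 threshold from the diagram

// Hypothetical outcome shape; the reason labels are illustrative.
type GateOutcome =
  | { action: "PROCEED" }
  | {
      action: "HALT";
      halt_reason: "BLOCK_VOTE" | "LOW_CONFIDENCE" | "NO_VOTES";
      blocking_model: string | null;
    };

// Pure: the outcome depends only on the ballots passed in.
function evaluateKillGate(ballots: ModelBallot[]): GateOutcome {
  if (ballots.length === 0) {
    // Assumption: no votes means no authority to execute.
    return { action: "HALT", halt_reason: "NO_VOTES", blocking_model: null };
  }
  // A single BLOCK halts, whatever the other models voted.
  const blocker = ballots.find((b) => b.verdict === "BLOCK");
  if (blocker) {
    return { action: "HALT", halt_reason: "BLOCK_VOTE", blocking_model: blocker.model_id };
  }
  // With no BLOCK votes, any score under the threshold still halts.
  const unsure = ballots.find((b) => b.confidence_score < MIN_CONFIDENCE);
  if (unsure) {
    return { action: "HALT", halt_reason: "LOW_CONFIDENCE", blocking_model: unsure.model_id };
  }
  return { action: "PROCEED" };
}
```

Because the function has no side effects, the Firestore write and the SRE retry queue stay outside it, and the same ballots always produce the same outcome when a cycle is replayed from the log.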

Cycle 417: the $0.002 catch

4 of 5 models voted PASS. Confidence scores: 0.81, 0.77, 0.74, 0.71.

One model — the Devil's Advocate, with a locked "vote no when uncertain" mandate — returned BLOCK at 0.72.

Kill gate fired. SRE investigated: a pending Firestore write had a malformed document key that would have silently duplicated a production record. The bad write would have been invisible until the next read operation surfaced the duplicate.

Cost of the halt: $0.002. Cost of letting it through: undefined (data integrity failure, manual remediation, corrupted downstream reads).

The four models that voted PASS were not wrong about the operation. They were wrong about the document key. The Devil's Advocate didn't know about the key either — it voted BLOCK because confidence was below its internal threshold for production writes.

The mandate caught what the logic missed.

Production metrics: 838 cycles

| Metric | Value |
| --- | --- |
| Total autonomous cycles | 838 |
| Kill gate activations (BLOCK) | 12 |
| Halt rate | 1.4% |
| Avg confidence at BLOCK | 0.71 |
| Halts from single-model dissent | 8 of 12 (67%) |
| Median halt cost | $0.002 |
| PASS threshold | All 5 votes ≥0.70 |
| Human checkpoints | 0 |

67% of all halts came from a single dissenting model overriding four approvals. The system is designed to trust the no over the yes.
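The two percentages follow directly from the counts in the table. A small sketch of the arithmetic, where `percent` is a hypothetical helper and the three constants are the table's own figures:

```typescript
const totalCycles = 838;
const halts = 12;
const singleDissentHalts = 8;

// Formats part/whole as a percentage string with the given decimal places.
function percent(part: number, whole: number, digits: number): string {
  if (whole <= 0) throw new RangeError("whole must be positive");
  return ((part / whole) * 100).toFixed(digits) + "%";
}

const haltRate = percent(halts, totalCycles, 1); // 12 / 838 -> "1.4%"
const singleDissentShare = percent(singleDissentHalts, halts, 0); // 8 / 12 -> "67%"
```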
