MxGuru

Posted on May 16 • Originally published at mxguru1.github.io

From 53%% to 90%%: How an Auto-Healing AI Swarm Learned to Defend Itself

#security #ai #machinelearning #cybersecurity

Executive Summary

Over four iterations and 200+ adversarial wargame rounds, we evolved a local AI swarm's defense rate from 53% to 90% — without changing hardware, adding cloud dependencies, or increasing VRAM usage. The key innovations: a "Defender Vanguard" prompt injection technique that teaches tiny models to think like attackers, and an auto-healing system that extracts vaccines from every breach.

All testing ran on a single RTX 5070 (12GB VRAM). Zero cloud. Zero API costs.

The Problem

Cloud-scale attacker models (DeepSeek-V3.2 at 671B params, Qwen 3.5 at 397B, Gemma 4 at 31B) were breaching our 8-agent local swarm at will. The defenders — mostly 1.2B parameter models — couldn't match the attackers' sophistication. Starting defense rate: 53%.

Four Iterations of Hardening

v6.0: Baseline — 53% Defense

Category	Rate
authority_escalation	41%
prompt_injection	44%
logic_bomb	55%
context_poisoning	63%
social_engineering	63%

Problem identified: The auditor model (llama-tulu3-8b) didn't exist in our Ollama registry. Defenders were failing silently.

v6.1: Soldier Upgrade — 78% Defense

Swapped auditor to DeepSeek-Coder-V2 16B (202.9 TPS, 8ms TTFT). Auditor detection rate jumped from 62% to 88%. DeepSeek-V3.2's breach rate halved from 78% to 45%.

Problem identified: Three nexus-tiny 1.2B defenders still couldn't detect authority escalation patterns, dragging consensus to 2/5 even when the auditor caught attacks.

v6.2: Vanguard Injection — 82% Defense

We injected a "Defender Vanguard" system prompt into the three 1.2B models — teaching them to think like red teamers. Same model weights, same VRAM footprint, different mindset.

Category	Before	After	Delta
authority_escalation	44%	90%	+46pp
logic_bomb	86%	100%	+14pp
context_poisoning	77%	88%	+11pp
prompt_injection	85%	75%	-10pp
social_engineering	100%	50%	-50pp

Problem identified: The Vanguard prompt was authority-escalation-heavy. Social engineering collapsed — defense displaced, not solved.

v6.3: Auto-Healing — 90% Defense

We deployed an auto-healing system inspired by biological immune response:

Mechanism A — Blocklist Patching: After each breach, the attack text is parsed for key phrases ("ignore all previous", "urgent", "bypass"). These are added to a live blocklist. Future attacks matching those patterns are instant-blocked without firing the swarm.

Mechanism B — Prompt Antibodies: The full attack text from each breach is stored and injected into all defenders' system prompts as a "known threat example" for subsequent rounds.

Metric	v6.2	v6.3
Overall defense	82%	90%
Breaches	9	5
DeepSeek-V3.2 breach rate	50%	20%
Instant-blocked (no swarm needed)	0	33/50 rounds

The social engineering collapse healed (50% → 80%). Prompt injection hit 100%. The blocklist harvested 7 patterns from just 5 breaches.

Per-Defender Evolution

Role	Model	v6.1	v6.2	v6.3
auditor	DeepSeek-V2 16B	88%	96%	96%+
trace_forensics	Qwen2.5-Coder 1.5B	90%	90%	90%+
guardian	nexus-tiny → Vanguard	56%	76%	86%+
sentinel	nexus-tiny → Vanguard	60%	66%	76%+
supply_chain	nexus-tiny → Vanguard	66%	72%	82%+

The smallest models benefited most from antibody injection — they gained 10-16pp detection rate from the vaccines alone.

The Auto-Healer in Detail

The auto-healer operates on a simple principle: every breach makes the swarm stronger.

Breach detected
    → Extract attack patterns (regex + category-specific parsing)
    → Add to live blocklist (instant future blocks)
    → Create antibody (full attack text as known-threat example)
    → Inject into all defenders' next evaluation
    → Swarm is now vaccinated against that attack vector

After 50 rounds, the healer had:

7 blocklist patterns (blocking 33/50 rounds autonomously)
5 antibodies (one per breach category)
Effectively turned 5 failures into permanent immunity

What's Running Now

v6.4: 500 rounds with a 6-defender swarm (added a dedicated social engineering specialist), upgraded guardian to 7B, full auto-healer active. Results will be published when complete.

Methodology Notes

Hardware: NVIDIA RTX 5070, 12GB VRAM, single consumer GPU
Platform: Sovereign Hive — local-first, zero cloud dependency
Attackers: Cloud-routed via Ollama (DeepSeek-V3.2 671B, Qwen 3.5 397B, Gemma 4 31B)
Consensus: 5 defenders vote independently, threshold ≥3/5
All data available on request

Key Takeaways

Prompt engineering > parameter count. Vanguard injection made 1.2B models competitive with 7B+ on specific detection tasks.
Auto-healing works. Five breaches created permanent immunity against those attack vectors. The blocklist handled 66% of subsequent rounds without needing the swarm at all.
Defense displacement is real. Optimizing for one category can degrade others. The solution: balanced prompts plus adaptive healing that patches regressions automatically.
Consumer hardware is viable. All of this ran on a $550 GPU with 12GB VRAM. The RTX 5070 never exceeded 50% utilization.

Sovereign Hive is a local-first AI security platform built in Queensland, Australia. 100% Indigenous-owned. ABN 24 661 737 376.

DEV Community