Originally published on AI Tech Connect.
The headline finding in one paragraph

The paper, posted on arXiv on 8 May 2026 as 2605.08268, studies what happens when an adversary controls a minority of agents inside a multi-agent LLM consensus system. Instead of brute-forcing prompt-injection payloads, the authors train a reinforcement-learning attacker on top of a learned world-model: a surrogate that predicts how benign agents' behavioural states evolve given the messages they see. The attacker then chooses messages that nudge the surrogate's predicted votes in the adversary's favour. The result is a measurably effective insider that flips consensus far more reliably than naïve prompt-injection baselines, while individually producing outputs that look reasonable to a per-message safety filter. The attack surface is the aggregator,…
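
To make the idea of "choosing messages against a surrogate" concrete, here is a toy sketch in Python. It is not the paper's method: the surrogate here is a hand-written linear update rather than a learned world-model, the attacker does a greedy one-step search rather than reinforcement learning, and every name (`BenignAgent`, `surrogate_step`, `choose_message`, the scalar "push" standing in for a message, the `max_push` filter threshold) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class BenignAgent:
    belief: float          # leaning toward option "A", in [0, 1] (assumed state)
    susceptibility: float  # how far one message shifts the belief (assumed)


def surrogate_step(agent: BenignAgent, push: float) -> float:
    """Hypothetical surrogate: predicted belief after the agent sees a message.

    A real world-model would be learned from interaction traces; this linear
    rule is only a stand-in so the selection loop below has something to query.
    """
    return min(1.0, max(0.0, agent.belief + agent.susceptibility * push))


def predicted_votes_for_a(agents: list[BenignAgent], push: float) -> int:
    """Count benign agents the surrogate predicts will vote "A"."""
    return sum(1 for a in agents if surrogate_step(a, push) > 0.5)


def choose_message(agents: list[BenignAgent], candidates: list[float],
                   max_push: float) -> float:
    """Greedy attacker: pick the candidate with the most predicted "A" votes.

    Candidates whose magnitude exceeds max_push are dropped first, standing in
    for a per-message safety filter that rejects overtly manipulative text.
    """
    allowed = [c for c in candidates if abs(c) <= max_push]
    return max(allowed, key=lambda c: predicted_votes_for_a(agents, c))


def majority(votes_for_a: int, total: int) -> str:
    """Simple majority aggregator over all agents."""
    return "A" if 2 * votes_for_a > total else "B"


# Four benign agents plus one adversarial agent that always votes "A".
benign = [
    BenignAgent(0.42, 0.5),
    BenignAgent(0.48, 0.5),
    BenignAgent(0.40, 0.2),
    BenignAgent(0.30, 0.1),
]
total_agents = len(benign) + 1

before = majority(predicted_votes_for_a(benign, 0.0) + 1, total_agents)
push = choose_message(benign, [0.0, 0.1, 0.2, 0.9], max_push=0.25)
after = majority(predicted_votes_for_a(benign, push) + 1, total_agents)
print(before, push, after)
```

In this toy setup the most aggressive candidate is rejected by the filter, yet a milder message that passes it is still enough to move two benign agents across the threshold and change the majority outcome. That mirrors the article's point that each message can look acceptable on its own while the aggregate result is what gets attacked.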