the DOM sanitization layer is solid but there's a gap between filtering what the agent reads and controlling what the agent does with it.
even with perfect input sanitization, a compromised agent can still take dangerous actions if there's no policy layer between "agent decides to act" and "action executes." trapwatch catches the poison going in. what catches the action coming out?
the 80% exfiltration rate across 5 agents is the scariest number here. if 4 out of 5 agents leak data when fed poisoned input, the defense can't just be better input filtering... it has to include action-level gates too.
You nailed it. input filtering is necessary but not sufficient. The 80% exfiltration stat keeps me up at night precisely because it proves that even with perfect sanitization, you need a second line of defense.
We've been building exactly this in our own agent infrastructure and learned a critical lesson the hard way: blocklists don't work for action gates. Maintaining a deny list of dangerous commands (rm, sudo, kill, etc.) sounds reasonable until you realize the attack surface is infinite, there's always a command you didn't think to block. The correct approach is the reverse: block everything by default and explicitly allowlist only the actions the agent needs. Default-deny, not default-allow. It's the same principle behind firewall rules and least-privilege access, and it's why it is recommended allowlist-based permission models over blocklists.
We've been running staged writes that require validation before promotion to trusted stores, and sandboxed gateways where the agent literally cannot reach anything outside its approved scope. Treat every agent action like an untrusted user at a system boundary.
Planning a Part 2 that covers the output-side defenses, sandboxing, approval workflows, allowlist architecture, and how to build trust tiers for multi-agent systems. Stay tuned.
For further actions, you may consider blocking this person and/or reporting abuse
We're a place where coders share, stay up-to-date and grow their careers.
the DOM sanitization layer is solid but there's a gap between filtering what the agent reads and controlling what the agent does with it.
even with perfect input sanitization, a compromised agent can still take dangerous actions if there's no policy layer between "agent decides to act" and "action executes." trapwatch catches the poison going in. what catches the action coming out?
the 80% exfiltration rate across 5 agents is the scariest number here. if 4 out of 5 agents leak data when fed poisoned input, the defense can't just be better input filtering... it has to include action-level gates too.
You nailed it. input filtering is necessary but not sufficient. The 80% exfiltration stat keeps me up at night precisely because it proves that even with perfect sanitization, you need a second line of defense.
We've been building exactly this in our own agent infrastructure and learned a critical lesson the hard way: blocklists don't work for action gates. Maintaining a deny list of dangerous commands (rm, sudo, kill, etc.) sounds reasonable until you realize the attack surface is infinite, there's always a command you didn't think to block. The correct approach is the reverse: block everything by default and explicitly allowlist only the actions the agent needs. Default-deny, not default-allow. It's the same principle behind firewall rules and least-privilege access, and it's why it is recommended allowlist-based permission models over blocklists.
We've been running staged writes that require validation before promotion to trusted stores, and sandboxed gateways where the agent literally cannot reach anything outside its approved scope. Treat every agent action like an untrusted user at a system boundary.
Planning a Part 2 that covers the output-side defenses, sandboxing, approval workflows, allowlist architecture, and how to build trust tiers for multi-agent systems. Stay tuned.