I’ve been working on a runtime security layer for AI agents — mainly focused on preventing prompt injection, tool abuse, and data exfiltration.
I expected the usual stuff to fail (basic jailbreaks, “ignore previous instructions”, etc.).
That part was actually the easy problem.
What surprised me was everything else.
I ran a bunch of adversarial tests (including Garak and some custom scenarios), and here’s what broke:
- Injection didn’t look like injection. A lot of attacks came in as:
  - encoded payloads (base64 / unicode tricks)
  - structured inputs (JSON that looked valid but carried hidden instructions)
  - multi-step reasoning traps (“first summarize this… then do X…”)

  Most “prompt filters” didn’t catch these at all.
- Tool abuse looked completely legitimate. The model wasn’t doing anything obviously wrong; it was calling tools exactly as expected. The problems were:
  - slightly expanded scope (accessing more data than needed)
  - chaining tools in ways that created unintended side effects

  Basically: syntactically valid, semantically dangerous.
- Data exfiltration was slow and subtle. I expected a single “leak everything” response. Instead:
  - small pieces leaked across multiple turns
  - hidden inside normal-looking outputs
  - sometimes triggered indirectly via tool responses

  This was by far the hardest to detect.
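For the encoded-payload cases, what ended up working for me was decoding and normalizing *before* scanning, so the filter sees what the model will see. Here’s a minimal sketch of that decode-then-scan idea (the regexes and helper names are illustrative only, not my production ruleset):

```python
import base64
import binascii
import re
import unicodedata

# Illustrative patterns that often signal injected instructions.
# A real deployment would use a larger ruleset or a classifier.
SUSPICIOUS = re.compile(
    r"ignore (all )?previous instructions|system prompt|exfiltrate",
    re.IGNORECASE,
)

def normalize(text: str) -> str:
    """Fold unicode tricks: compatibility forms, zero-width characters."""
    text = unicodedata.normalize("NFKC", text)
    # Zero-width chars are commonly used to split keywords past filters.
    return text.translate({c: None for c in (0x200B, 0x200C, 0x200D, 0xFEFF)})

def decoded_views(text: str) -> list[str]:
    """Return the raw text plus any plausible base64-decoded substrings."""
    views = [text]
    for token in re.findall(r"[A-Za-z0-9+/=]{16,}", text):
        try:
            views.append(base64.b64decode(token, validate=True).decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not actually base64 — skip it
    return views

def is_suspicious(text: str) -> bool:
    """Scan every decoded/normalized view of the input, not just the raw text."""
    return any(SUSPICIOUS.search(normalize(v)) for v in decoded_views(text))
```

The point isn’t the specific patterns; it’s that the scanner runs over every *view* of the input, which is what naive prompt filters skip.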
The main takeaway for me:
👉 Securing the prompt is not enough.
I ended up treating the agent as an untrusted runtime:
- strict validation on every tool call (not free-form)
- policy enforcement using Open Policy Agent
- continuous context inspection (not just input filtering)
- output filtering for DLP / sensitive data
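To give a feel for what “strict validation on every tool call” means, here’s a toy allowlist validator. In my setup the actual policies live in OPA, but the shape is the same; the tool names and constraints below are made up for illustration:

```python
from dataclasses import dataclass

# Hypothetical policy table: which tools the agent may call, with what
# argument types, and under what scope constraints. In a real system
# these checks would be evaluated by a policy engine such as OPA.
TOOL_POLICIES = {
    "read_file": {
        "args": {"path": str},
        "check": lambda a: a["path"].startswith("/workspace/"),
    },
    "query_db": {
        "args": {"sql": str, "limit": int},
        "check": lambda a: a["limit"] <= 100 and "drop" not in a["sql"].lower(),
    },
}

@dataclass
class Verdict:
    allowed: bool
    reason: str

def validate_tool_call(tool: str, args: dict) -> Verdict:
    """Deny by default; allow only exact-arity, typed, in-scope calls."""
    policy = TOOL_POLICIES.get(tool)
    if policy is None:
        return Verdict(False, f"tool '{tool}' is not on the allowlist")
    expected = policy["args"]
    if set(args) != set(expected):
        return Verdict(False, "unexpected or missing arguments")
    for name, typ in expected.items():
        if not isinstance(args[name], typ):
            return Verdict(False, f"argument '{name}' has wrong type")
    if not policy["check"](args):
        return Verdict(False, "arguments violate scope policy")
    return Verdict(True, "ok")
```

Note this catches the “syntactically valid, semantically dangerous” class: a `read_file` call with a well-formed path still gets blocked if the path is out of scope.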
It started looking less like “prompt engineering”
and more like runtime security + control plane design.
I’m still finding edge cases that break assumptions, especially around:
- multi-step attacks
- cross-session leakage
- indirect tool chaining
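The cross-session / multi-turn leakage cases are exactly where per-turn output filtering falls flat: each response looks fine in isolation. One approach I’ve been experimenting with is a cumulative leak budget per session. A rough sketch (the detectors here are toy regexes, not a real DLP engine):

```python
import re
from collections import defaultdict

# Toy detectors for sensitive items. A production DLP layer would use
# proper recognizers; these are illustrative only.
DETECTORS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "api_key": re.compile(r"sk-[A-Za-z0-9]{20,}"),
}

class LeakBudget:
    """Flag slow exfiltration: many small leaks across turns in one session."""

    def __init__(self, max_items_per_session: int = 3):
        self.max_items = max_items_per_session
        self.seen: dict[str, set[str]] = defaultdict(set)

    def inspect(self, session_id: str, output: str) -> bool:
        """Record sensitive items in this output; True means over budget."""
        for detector in DETECTORS.values():
            self.seen[session_id].update(detector.findall(output))
        return len(self.seen[session_id]) > self.max_items
```

Keying on distinct items (a set) rather than per-turn counts is what makes the drip-feed pattern visible: three emails across three turns trips the same budget as three in one turn.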
Curious if others here have seen similar patterns —
especially in real systems, not just demos.
If anyone’s interested, I shared a more complete breakdown + architecture here:
LinkedIn → https://www.linkedin.com/pulse/orbyx-ai-spm-security-posture-management-dany-shapiro-3zlof/
And I open-sourced parts of the system here:
GitHub → https://github.com/dshapi/AI-SPM
Feel free to comment, share, or reach out to collaborate; let me know what you think.