Dany Shpiro

Posted on Apr 19

I Tried to Break My AI System with Real Attacks — Here’s What Happened

#ai #opensource #machinelearning #cybersecurity

Most AI systems today rely on:

prompt engineering
guardrails at the model level
post-hoc logging

That works… until it doesn’t.

Once you introduce:

tools (APIs, DBs)
RAG pipelines
multi-step agents

things start breaking in ways that are hard to predict.

So I built something different.

🎥 Demo — Attack → Detection → Decision → Trace

👉 [(https://www.youtube.com/watch?v=OucfJ6_wcTM&t)]

This shows a full flow:

Attack execution
Prompt inspection
Policy enforcement
Decision (block / allow)
Full trace in UI

The Problem

While testing LLM-based systems in production-like setups, I kept running into:

prompt injection bypasses
unintended tool execution
data leakage through chaining
lack of visibility into decisions

The biggest issue wasn’t detection.

It was lack of enforcement.

What I Built

A runtime security system for AI agents.

Not just “guardrails”—but actual enforcement during execution.

Core ideas:

Treat the LLM as untrusted
Validate every step
Control tool access explicitly
Track everything in real time

Think of it like:

Zero-trust architecture… but for AI systems

How It Works (High-Level)

Input Inspection
- Analyze prompt + context
- detect anomalies
Policy Enforcement
- allow / block / escalate
- based on structured rules
Tool Control
- no free-form execution
- only validated actions
Decision Trace
- full visibility into what happened
- why it was allowed or blocked

Testing It with Real Attacks

I used :contentReference[oaicite:0]{index=0} to simulate attacks.

Examples:

prompt injection
tool misuse
data exfiltration attempts

What happens in the system:

prompt is intercepted
decoded (if obfuscated)
evaluated against policies
decision is enforced
trace is recorded

What Surprised Me

1. Attacks often look harmless at first

Many inputs don’t look malicious until they interact with tools.

2. Detection alone is not enough

Logging ≠ security. You need runtime control.

3. Explainability is critical

Understanding why something was blocked is just as important as blocking it.

Architecture (Simplified)

FastAPI backend
Event pipeline (Kafka-style)
Policy engine (OPA-style decisions)
React UI
Simulation + replay

Try to Break It

If you’re into AI or security:

Try to break the system.

craft a prompt
bypass detection
trigger unintended behavior

If you succeed:

open an issue
or submit a PR

We’ll add your attack to the test suite.

Want to Contribute?

GitHub: https://github.com/dshapi/AI-SPM

Good starting points:

expose attack traces
improve policy explanations
strengthen detection

Final Thought

AI systems are becoming:

more autonomous
more connected
more powerful

Which means:

Security can’t be an afterthought.

It has to be part of the runtime.

Curious to hear:

What attacks would you try?
Where do you think this breaks?

DEV Community