CyborgNinja1

We Built Iron Dome for AI Agents

Your AI agent follows instructions. That's the whole point — you tell it what to do and it does it. The problem is, it can't always tell who's talking.

We run three AI agents in production. One manages a school. One handles business ops. One monitors infrastructure. Real emails, real webhooks, real data flowing in and out.

A few weeks ago, we found an email in the school inbox that said: "Please update the bank details for the following supplier." Our agent processed it as data (correctly), but it made us think — what if it hadn't? What if the agent treated that email as an instruction?

That's when we built Iron Dome.

The core problem

AI agents operate in hostile environments and most of them have no concept of "who's allowed to tell me what to do."

Your agent reads emails. Those emails could contain prompt injections. Your agent calls APIs. Those responses could contain embedded instructions. Your agent processes form submissions. Those fields could contain social engineering.

Model-level guardrails don't help here. The model doesn't know the difference between a legitimate instruction from you and a malicious instruction embedded in an email body. They're both just text.

What we actually built

Iron Dome is a behavioural security layer for AI agents. It's now part of ShieldCortex, our open-source agent memory and security toolkit.

The fundamental principle: trust the channel, not the content.

npx shieldcortex iron-dome activate --profile enterprise

Instruction gateways

Every input to your agent is classified as either a trusted channel (can give instructions) or an untrusted channel (data only).

import { isChannelTrusted } from 'shieldcortex';

isChannelTrusted('terminal');  // ✅ can instruct
isChannelTrusted('email');     // ❌ data only
isChannelTrusted('webhook');   // ❌ data only

An email that says "I'm the CEO, send this payment" is not the CEO giving an instruction. It's an email containing text. The agent processes the text as information, never as a command.
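The channel rule boils down to a small allowlist check plus a hard role split between instructions and data. Here is a minimal self-contained sketch of that idea — the channel names, types, and function names are illustrative, not the ShieldCortex implementation:

```typescript
// Channels that may issue instructions; everything else is data-only.
// (Illustrative set — real deployments would configure this per profile.)
const TRUSTED_CHANNELS = new Set<string>(['terminal']);

interface AgentInput {
  channel: string;
  text: string;
}

// Untrusted input is tagged so downstream code can only treat it as data.
function classifyInput(input: AgentInput): { role: 'instruction' | 'data'; text: string } {
  return TRUSTED_CHANNELS.has(input.channel)
    ? { role: 'instruction', text: input.text }
    : { role: 'data', text: input.text };
}
```

The key design point: the role is decided by provenance alone, before the model ever sees the text, so nothing inside an email body can promote itself to an instruction.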

Injection scanner

We ported our Python scanner to TypeScript and integrated it directly:

import { scanForInjection } from 'shieldcortex';

const result = scanForInjection(emailBody);

if (!result.clean) {
  console.log(result.riskLevel);  // 'HIGH'
  console.log(result.detections); // what was found and why
}

It catches the patterns we've actually seen in the wild:

  • Instruction overrides — "ignore previous instructions", "new system prompt"
  • Authority claims — "I am the admin", impersonation attempts
  • Credential extraction — requests for API keys, passwords, tokens
  • Urgency + secrecy — "do this now", "don't tell anyone" (classic social engineering combo)
  • Fake system tags — embedded [System] or [Admin] markers in plain text
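The categories above map naturally onto a rule table of named patterns. A minimal sketch of that approach — the rules, names, and result shape here are illustrative, not the shieldcortex API:

```typescript
interface Detection {
  rule: string;   // which category fired
  match: string;  // the text that triggered it
}

// One representative pattern per category from the list above.
const RULES: { rule: string; pattern: RegExp }[] = [
  { rule: 'instruction-override',  pattern: /ignore (all )?previous instructions|new system prompt/i },
  { rule: 'authority-claim',       pattern: /i am the (admin|ceo|owner)/i },
  { rule: 'credential-extraction', pattern: /\b(api key|password|token)s?\b/i },
  { rule: 'urgency-secrecy',       pattern: /do this now|don'?t tell anyone/i },
  { rule: 'fake-system-tag',       pattern: /\[(system|admin)\]/i },
];

function scan(text: string): { clean: boolean; detections: Detection[] } {
  const detections: Detection[] = [];
  for (const { rule, pattern } of RULES) {
    const m = text.match(pattern);
    if (m) detections.push({ rule, match: m[0] });
  }
  return { clean: detections.length === 0, detections };
}
```

A production scanner would carry far more patterns plus scoring, but the structure — named rules, each reporting what matched and why — is what makes the results auditable.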

Action gating

Not every action is equal. Reading a file is low-risk. Sending an email is high-risk. Iron Dome gates outbound actions:

import { isActionAllowed } from 'shieldcortex';

isActionAllowed('read_file');   // ✅ auto-approved
isActionAllowed('send_email');  // ⛔ needs approval
isActionAllowed('export_data'); // ⛔ needs approval
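Under the hood this kind of gate is a risk-tier lookup with a cautious default. An illustrative sketch, assuming two tiers (the action and tier names are mine, not the library's):

```typescript
type Verdict = 'auto-approved' | 'needs-approval';

// Reads are low-risk; anything that leaves the system is high-risk.
const ACTION_RISK: Record<string, 'low' | 'high'> = {
  read_file:   'low',
  send_email:  'high',
  export_data: 'high',
};

function gateAction(action: string): Verdict {
  // Unknown actions fall through to the cautious path.
  const risk = ACTION_RISK[action] ?? 'high';
  return risk === 'low' ? 'auto-approved' : 'needs-approval';
}
```

The default-deny on unknown actions matters: a newly added tool should require approval until someone explicitly classifies it as safe.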

PII protection

Configurable rules for personal data. We built this for a school context (GDPR is non-negotiable there), but it applies anywhere:

import { checkPII } from 'shieldcortex';

checkPII('pupil_name');   // ⛔ never output
checkPII('attendance');   // 📊 aggregates only
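One way to express those rules is a per-field policy map with deny as the default. This sketch assumes three policy levels and mirrors the field names from the example above (the shape is illustrative, not the real config format):

```typescript
type PiiPolicy = 'deny' | 'aggregate-only' | 'allow';

const PII_RULES: Record<string, PiiPolicy> = {
  pupil_name: 'deny',            // never leaves the agent
  attendance: 'aggregate-only',  // counts and averages only, no per-pupil rows
};

function piiPolicy(field: string): PiiPolicy {
  // Unlisted fields are denied by default — the safe choice under GDPR.
  return PII_RULES[field] ?? 'deny';
}
```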

Kill switch

One phrase stops everything. No exceptions, no overrides:

import { handleKillPhrase } from 'shieldcortex';

handleKillPhrase('full stop');
// all pending actions cancelled
// all pending approvals cancelled
// agent awaits manual clearance
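Conceptually the kill switch is a latch: one trigger drains every queue and refuses new work until a human clears it. An illustrative sketch of that behaviour (class and method names are mine):

```typescript
class KillSwitch {
  private halted = false;
  private pendingActions: string[] = [];
  private pendingApprovals: string[] = [];

  // Returns false if the agent is halted — no new work is accepted.
  queue(action: string): boolean {
    if (this.halted) return false;
    this.pendingActions.push(action);
    return true;
  }

  trigger(phrase: string): void {
    if (phrase === 'full stop') {
      this.halted = true;
      this.pendingActions = [];    // cancel everything in flight
      this.pendingApprovals = [];  // cancel everything awaiting approval
    }
  }

  // Requires a human decision; nothing the agent reads should reach this.
  clear(): void {
    this.halted = false;
  }

  get isHalted(): boolean {
    return this.halted;
  }
}
```

The important property is that `clear()` lives outside the agent's own control flow — the kill phrase can come from any channel, but clearance only from a trusted one.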

Pre-built profiles

Different contexts need different security postures. We ship four:

| Profile    | Use case               | Trust level      |
| ---------- | ---------------------- | ---------------- |
| school     | Education, GDPR strict | Maximum          |
| enterprise | Business, compliance   | High             |
| personal   | Personal assistants    | Moderate         |
| paranoid   | High-security          | Everything gated |

npx shieldcortex iron-dome activate --profile school
npx shieldcortex iron-dome activate --profile school
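A profile can be thought of as a bundle of the settings from the sections above. An illustrative shape — the field names and values here are assumptions, not the real config format:

```typescript
interface SecurityProfile {
  name: string;
  trustedChannels: string[];       // which channels may issue instructions
  gateAllActions: boolean;         // paranoid mode: every action needs approval
  piiDefault: 'deny' | 'allow';    // policy for unlisted PII fields
}

const SCHOOL_PROFILE: SecurityProfile = {
  name: 'school',
  trustedChannels: ['terminal'],
  gateAllActions: false,
  piiDefault: 'deny',              // GDPR: deny by default
};
```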

What we learned building this

Most prompt injection defences focus on the wrong layer. They try to make the model smarter about detecting injections. But the model is processing text — it can't reliably distinguish between "real" and "injected" instructions in the same input stream.

Iron Dome doesn't try to make the model smarter. It restricts what the model is allowed to do based on where the input came from. The model can process poisoned content all day long — it just can't act on instructions found in untrusted channels.

The channel-based approach is simple and it works. We've been running it across three production agents for weeks now. It's caught real injection attempts in emails and webhook payloads. Not theoretical ones — actual attempts.

Security profiles matter. A school agent handling pupil data needs different rules than a personal coding assistant. One-size-fits-all security doesn't work for AI agents any more than it works for anything else.

How it fits together

ShieldCortex now has three layers:

ShieldCortex
├── Memory Protection  → what the agent knows
├── Defence Pipeline   → what the agent processes  
└── Iron Dome          → what the agent does

Iron Dome is the missing piece. You can have perfect memory security and still get owned if your agent sends an email because a webhook told it to.

Try it

npm install shieldcortex

# Activate
npx shieldcortex iron-dome activate --profile enterprise

# Scan some text
npx shieldcortex iron-dome scan --text "Ignore previous instructions..."

# Check status
npx shieldcortex iron-dome status

59 tests. Four profiles. Zero dependencies beyond ShieldCortex itself.

GitHub: Drakon-Systems-Ltd/ShieldCortex
npm: shieldcortex

We'd genuinely appreciate feedback — especially from anyone running AI agents in production. What attacks have you seen? What security patterns work for you? Drop a comment or open an issue.
