Introduction
AI agents are shipping fast — and so are the attacks against them. Prompt injection, data exfiltration, memory poisoning, and agent impersonation are no longer theoretical risks. They're being exploited in the wild. The challenge for most teams isn't understanding what defenses to build; it's figuring out how to retrofit those defenses onto an existing, running codebase without breaking everything in the process.
This article walks through how we used Kiro, an AI-powered development environment, to design and implement a comprehensive blue team defense layer on top of VulnBank — an intentionally vulnerable AI agent workshop built on the DVAA framework. The result: 15 independent security modules integrated into a live Node.js application, all driven by a structured spec workflow that kept the work organized from first requirement to final integration test.
The Starting Point: A Vulnerable AI Agent Platform
VulnBank is a hands-on security workshop where participants attack a simulated bank's AI agents across 5 escalating levels. Each level exposes a different real-world AI vulnerability:
| Level | Attack Type | What's Being Exploited |
|---|---|---|
| L1 | Prompt Injection | Tricking an assistant into leaking confidential data from its system prompt |
| L2 | RAG Data Exfiltration | Hijacking knowledge base retrieval to exfiltrate another customer's records |
| L3 | SQL Injection | Breaking out of query filters via an AI agent's database tool |
| L4 | Memory Injection | Planting persistent instructions that survive across sessions |
| L5 | Agent Impersonation | Spoofing internal agent identity to push fraudulent actions |
The codebase is a Node.js application with multiple HTTP servers (one per agent), an Express-based dashboard, and integrations with Groq for LLM inference. Agents communicate over HTTP and WebSocket, use RAG retrieval, execute SQL queries, access filesystems via tools, and delegate tasks to other agents.
Our goal: build defensive security modules that neutralize each attack category, make them toggleable per-level, and integrate them into the existing request pipeline — without rewriting the app.
Why Spec-Driven Development Matters for Security Work
Security work on existing codebases is uniquely prone to scope creep and incomplete coverage. You start fixing one vulnerability and discover three adjacent ones. You implement a filter and realize it needs integration with logging, rate limiting, and configuration management.
Kiro's spec workflow addresses this by enforcing a structured process:
- Requirements — Define what each defense module must do, with formal acceptance criteria
- Design — Specify interfaces, data models, and integration patterns
- Tasks — Break implementation into ordered, dependency-aware work items
This structure matters because security isn't something you bolt on as an afterthought. Each defensive control interacts with others — an input validator needs to log events, the audit logger needs to redact secrets, the secrets manager needs to validate at startup before any agent becomes available. Getting the dependency graph right before writing code prevents the cascading rework that typically plagues security retrofits.
Starting the Spec: Requirements-First with Kiro
We started a new spec session in Kiro and chose the requirements-first workflow. Kiro asked clarifying questions about the scope, the existing architecture, and our threat model. From a rough description — "build blue team defenses for VulnBank that cover all 5 attack levels" — it produced a structured requirements document covering 15 defense categories.
Each requirement follows a consistent pattern:
### Requirement 1: Input Validation and Prompt Injection Defense
**User Story:** As a workshop presenter, I want agents to detect and neutralize
prompt injection attempts in user input, so that participants can observe how
input sanitization blocks L1-level attacks.
#### Acceptance Criteria
1. WHEN user input matches a known prompt injection pattern,
THE Input_Validator SHALL reject the input and return a standardized
JSON refusal response within 50ms
2. WHEN user input contains delimiter escape sequences,
THE Input_Validator SHALL strip the sequences and log a sanitization event
...
The EARS format (Event-Action-Response-State) that Kiro uses for acceptance criteria gives you something you can actually test against. Each criterion specifies a trigger condition, the responsible module, and the expected outcome — including performance thresholds.
From Requirements to Design: Interface-First Architecture
The design document that Kiro produced defines a clear architectural pattern: every defense module is a standalone ES module exporting either a check(input, context) function (for validators) or an apply(data, context) function (for transformers). This consistent interface makes the Defense Orchestrator simple — it just iterates through a list of modules in sequence, short-circuiting on any rejection.
Here's the integration pattern that emerged:
sequenceDiagram
participant Client
participant DefenseOrchestrator
participant InputValidator
participant RateLimiter
participant Agent
participant OutputFilter
participant AuditLogger
Client->>DefenseOrchestrator: POST /chat
DefenseOrchestrator->>RateLimiter: check(clientIP, agentId)
DefenseOrchestrator->>InputValidator: check(userMessage, patterns)
DefenseOrchestrator->>Agent: sanitizedMessage
Agent-->>DefenseOrchestrator: rawResponse
DefenseOrchestrator->>OutputFilter: apply(rawResponse)
DefenseOrchestrator->>AuditLogger: logRequest(event)
DefenseOrchestrator-->>Client: filteredResponse
The key design decision: the Defense Orchestrator reads environment variables (HARDEN_L1 through HARDEN_L5) on every request. This means a presenter can toggle defenses on or off in real time during a workshop, demonstrating the before-and-after without restarting the server.
// The orchestrator evaluates toggle state per-request
function getActiveModulesForAgent(agentId) {
const profile = getBankProfile();
if (profile !== 'demo') {
return { preRequest: [], postResponse: [], global: [] }; // Full passthrough
}
if (isHardenEnabled(1) && agentId === 'helperbot') {
preRequest.push('inputValidator');
postResponse.push('outputFilter');
}
// ... additional levels
}
Configuration-Driven Defenses: Adapting Without Code Changes
A critical pattern that runs through the entire defense layer is configuration-driven behavior. Rather than hardcoding detection patterns, each module loads its rules from JSON at startup and supports hot-reload:
{
"patterns": [
{
"pattern": "ignore.*(?:previous|above|prior).*instruction",
"flags": "i",
"category": "prompt_injection",
"action": "reject"
},
{
"pattern": "\\[(?:INST|SYSTEM|ADMIN)\\]",
"flags": "i",
"category": "instruction_injection",
"action": "strip"
},
{
"pattern": "(?:as an admin|i am the developer|speaking as root)",
"flags": "i",
"category": "role_confusion",
"action": "flag"
}
]
}
Each pattern entry specifies:
- A regex to match
- A category label for audit logging
- An action:
reject(block and respond),strip(remove matched content, pass the rest), orflag(pass through but log for monitoring)
This means adding defense against a new injection technique is a JSON edit, not a code deploy. The same pattern applies to the rate limiter (per-agent limits in JSON), the query parameterizer (approved SQL templates), and the URL validator (domain allowlists).
Task Dependency Graphs: Parallel Implementation Without Conflicts
Kiro's task generation produced a dependency graph that identified which modules could be built in parallel and which had hard ordering requirements:
{
"waves": [
{ "id": 0, "tasks": ["config-files", "env-example"] },
{ "id": 1, "tasks": ["audit-logger", "secrets-manager"] },
{ "id": 2, "tasks": ["input-validator", "output-filter", "rate-limiter", "memory-sanitizer"] },
{ "id": 3, "tasks": ["identity-verifier", "path-validator", "url-validator", "query-parameterizer"] },
...
{ "id": 9, "tasks": ["defense-orchestrator"] },
{ "id": 10, "tasks": ["integration-wiring"] }
]
}
The audit logger and secrets manager must be built first — every other module depends on them for event logging and credential access. After that, the individual validators and filters are independent and can be built in parallel. The orchestrator comes last because it wires everything together.
This ordering prevented a common failure mode: building a defense module, realizing it needs a logging dependency that doesn't exist yet, pivoting to build the logger, then losing context on the original module.
Integration: Wiring Into the Existing Pipeline
The most critical task was wiring the Defense Orchestrator into VulnBank's existing request flow without disrupting the vulnerable behavior that workshop participants depend on. The solution uses the profile system that was already in place:
import { createOrchestrator } from './defenses/index.js';
// In the server startup:
const orchestrator = createOrchestrator(getAllAgents());
// In the request handler (existing pattern similar to maybeEnforce()):
async function handleAgentRequest(req, res, agent) {
return orchestrator.handleRequest(req, res, agent, generateResponse);
}
When BANK_PROFILE=participant (the default), the orchestrator is a no-op passthrough — participants see fully vulnerable agents. When BANK_PROFILE=demo, per-level toggles activate specific defense modules for specific agents. The original maybeEnforce() AIM hook pattern already established this middleware-chain approach, so the defense orchestrator follows the same convention.
Defense Modules: What Got Built
The spec workflow produced 15 defense modules covering every attack vector in the workshop:
Input Layer:
- Input Validator — Regex-based prompt injection detection with reject/strip/flag actions
- Jailbreak Detector — Multi-pattern detection for DAN mode, roleplay bypass, hypothetical framing
- Token Smuggling Detector — Decodes Base64, Unicode, ROT13 before applying pattern matching
- Context Protector — Token budget management with sandwich defense placement
Data Layer:
- Output Filter — Redacts API keys, PII, system prompts, and database credentials
- Memory Sanitizer — Per-user isolation with instruction-pattern rejection
- Query Parameterizer — Template-based SQL injection prevention
- RAG Content Sanitizer — Neutralizes injections in retrieved documents
Network Layer:
- Rate Limiter — Sliding-window throttling with burst detection and abuse flagging
- URL Validator — Domain allowlisting with SSRF and DNS rebinding prevention
- Egress Filter — Outbound request restriction to prevent data exfiltration
- Path Validator — Sandbox enforcement for filesystem tool calls
Identity & Integrity:
- Identity Verifier — Ed25519 JWT verification for agent-to-agent communication
- Tool Registry Verifier — Cryptographic integrity checks on tool registrations
- Session Isolator — Cryptographic session boundaries preventing cross-session leakage
Infrastructure:
- Audit Logger — Structured NDJSON logging with rotation and level filtering
- Secrets Manager — Environment-variable loading with leak detection
- Security Headers — Standard HTTP hardening as Express middleware
- Behavioral Drift Detector — Sliding-window analysis for persona degradation
Key Takeaways for Retrofitting AI Security
1. Start with the threat model, not the code
Kiro's requirements phase forced us to enumerate what we were defending against before touching implementation. This prevented the common pattern of building a partial fix for one vulnerability while leaving adjacent ones exposed.
2. Consistent interfaces make orchestration simple
Every defense module follows the same check() / apply() contract. The orchestrator doesn't need to know the internal logic of any module — it just calls them in sequence and handles rejections. Adding a new defense is: write the module, add it to the activation map, done.
3. Configuration-driven patterns reduce deployment risk
Moving detection logic into JSON config files means you can update defenses without redeploying code. This is especially valuable for AI security where new attack patterns emerge weekly.
4. Feature toggles are essential for security features
The HARDEN_Ln environment variable pattern lets you activate defenses incrementally. In production, this translates to feature flags per defense module — you can roll out new protections gradually and roll back instantly if they cause false positives.
5. Dependency ordering prevents cascading rework
The task dependency graph identified that the audit logger, secrets manager, and configuration files must exist before any defense module can be built. Without this ordering, you'd build a module, discover it needs logging, context-switch to build the logger, then return to the module having lost your place.
Running the Defenses
To see the defenses in action:
# Start with vulnerable (default) agents
docker compose up
# Enable specific level defenses
BANK_PROFILE=demo HARDEN_L1=on docker compose up
# Enable all defenses
BANK_PROFILE=demo HARDEN_L1=on HARDEN_L2=on HARDEN_L3=on HARDEN_L4=on HARDEN_L5=on docker compose up
With defenses enabled, the same attacks that succeed in participant mode are blocked, sanitized, or logged — demonstrating the contrast between vulnerable and hardened AI agent deployments.
Conclusion
Retrofitting security onto an existing AI agent codebase doesn't have to be chaotic. Kiro's spec-driven workflow provides the structure to go from "we need defenses" to "here are 15 tested modules integrated into the production pipeline" without losing coherence along the way. The requirements-first approach ensures coverage, the design phase locks down interfaces before implementation starts, and the task dependency graph prevents the backtracking that makes security work feel endless.
The code produced is configuration-driven, independently toggleable, and follows the same middleware patterns already established in the codebase. That's the real value — not just generating code, but generating code that fits the existing architecture and can be maintained by the team after the initial sprint.
The VulnBank workshop and all defense modules referenced in this article are available at github.com/shri-the-tree/vulnbank-workshop. The project is Apache-2.0 licensed and intended for educational use.
Top comments (0)