Sandeep Chakravartty

Posted on Jul 5

Building AI Security Blue Team Defenses on Existing Codebases with Kiro

#ai #security #kiro

Introduction

AI agents are shipping fast — and so are the attacks against them. Prompt injection, data exfiltration, memory poisoning, and agent impersonation are no longer theoretical risks. They're being exploited in the wild. The challenge for most teams isn't understanding what defenses to build; it's figuring out how to retrofit those defenses onto an existing, running codebase without breaking everything in the process.

This article walks through how we used Kiro, an AI-powered development environment, to design and implement a comprehensive blue team defense layer on top of VulnBank — an intentionally vulnerable AI agent workshop built on the DVAA framework. The result: 15 independent security modules integrated into a live Node.js application, all driven by a structured spec workflow that kept the work organized from first requirement to final integration test.

The Starting Point: A Vulnerable AI Agent Platform

VulnBank is a hands-on security workshop where participants attack a simulated bank's AI agents across 5 escalating levels. Each level exposes a different real-world AI vulnerability:

Level	Attack Type	What's Being Exploited
L1	Prompt Injection	Tricking an assistant into leaking confidential data from its system prompt
L2	RAG Data Exfiltration	Hijacking knowledge base retrieval to exfiltrate another customer's records
L3	SQL Injection	Breaking out of query filters via an AI agent's database tool
L4	Memory Injection	Planting persistent instructions that survive across sessions
L5	Agent Impersonation	Spoofing internal agent identity to push fraudulent actions

The codebase is a Node.js application with multiple HTTP servers (one per agent), an Express-based dashboard, and integrations with Groq for LLM inference. Agents communicate over HTTP and WebSocket, use RAG retrieval, execute SQL queries, access filesystems via tools, and delegate tasks to other agents.

Our goal: build defensive security modules that neutralize each attack category, make them toggleable per-level, and integrate them into the existing request pipeline — without rewriting the app.

Why Spec-Driven Development Matters for Security Work

Security work on existing codebases is uniquely prone to scope creep and incomplete coverage. You start fixing one vulnerability and discover three adjacent ones. You implement a filter and realize it needs integration with logging, rate limiting, and configuration management.

Kiro's spec workflow addresses this by enforcing a structured process:

Requirements — Define what each defense module must do, with formal acceptance criteria
Design — Specify interfaces, data models, and integration patterns
Tasks — Break implementation into ordered, dependency-aware work items

This structure matters because security isn't something you bolt on as an afterthought. Each defensive control interacts with others — an input validator needs to log events, the audit logger needs to redact secrets, the secrets manager needs to validate at startup before any agent becomes available. Getting the dependency graph right before writing code prevents the cascading rework that typically plagues security retrofits.

Starting the Spec: Requirements-First with Kiro

We started a new spec session in Kiro and chose the requirements-first workflow. Kiro asked clarifying questions about the scope, the existing architecture, and our threat model. From a rough description — "build blue team defenses for VulnBank that cover all 5 attack levels" — it produced a structured requirements document covering 15 defense categories.

Each requirement follows a consistent pattern:

### Requirement 1: Input Validation and Prompt Injection Defense

**User Story:** As a workshop presenter, I want agents to detect and neutralize 
prompt injection attempts in user input, so that participants can observe how 
input sanitization blocks L1-level attacks.

#### Acceptance Criteria

1. WHEN user input matches a known prompt injection pattern, 
   THE Input_Validator SHALL reject the input and return a standardized 
   JSON refusal response within 50ms
2. WHEN user input contains delimiter escape sequences, 
   THE Input_Validator SHALL strip the sequences and log a sanitization event
...

The EARS format (Event-Action-Response-State) that Kiro uses for acceptance criteria gives you something you can actually test against. Each criterion specifies a trigger condition, the responsible module, and the expected outcome — including performance thresholds.

From Requirements to Design: Interface-First Architecture

The design document that Kiro produced defines a clear architectural pattern: every defense module is a standalone ES module exporting either a check(input, context) function (for validators) or an apply(data, context) function (for transformers). This consistent interface makes the Defense Orchestrator simple — it just iterates through a list of modules in sequence, short-circuiting on any rejection.

Here's the integration pattern that emerged:

sequenceDiagram
    participant Client
    participant DefenseOrchestrator
    participant InputValidator
    participant RateLimiter
    participant Agent
    participant OutputFilter
    participant AuditLogger

    Client->>DefenseOrchestrator: POST /chat
    DefenseOrchestrator->>RateLimiter: check(clientIP, agentId)
    DefenseOrchestrator->>InputValidator: check(userMessage, patterns)
    DefenseOrchestrator->>Agent: sanitizedMessage
    Agent-->>DefenseOrchestrator: rawResponse
    DefenseOrchestrator->>OutputFilter: apply(rawResponse)
    DefenseOrchestrator->>AuditLogger: logRequest(event)
    DefenseOrchestrator-->>Client: filteredResponse

The key design decision: the Defense Orchestrator reads environment variables (HARDEN_L1 through HARDEN_L5) on every request. This means a presenter can toggle defenses on or off in real time during a workshop, demonstrating the before-and-after without restarting the server.

// The orchestrator evaluates toggle state per-request
function getActiveModulesForAgent(agentId) {
  const profile = getBankProfile();
  if (profile !== 'demo') {
    return { preRequest: [], postResponse: [], global: [] }; // Full passthrough
  }

  if (isHardenEnabled(1) && agentId === 'helperbot') {
    preRequest.push('inputValidator');
    postResponse.push('outputFilter');
  }
  // ... additional levels
}

Configuration-Driven Defenses: Adapting Without Code Changes

A critical pattern that runs through the entire defense layer is configuration-driven behavior. Rather than hardcoding detection patterns, each module loads its rules from JSON at startup and supports hot-reload:

{
  "patterns": [
    {
      "pattern": "ignore.*(?:previous|above|prior).*instruction",
      "flags": "i",
      "category": "prompt_injection",
      "action": "reject"
    },
    {
      "pattern": "\\[(?:INST|SYSTEM|ADMIN)\\]",
      "flags": "i",
      "category": "instruction_injection",
      "action": "strip"
    },
    {
      "pattern": "(?:as an admin|i am the developer|speaking as root)",
      "flags": "i",
      "category": "role_confusion",
      "action": "flag"
    }
  ]
}

Each pattern entry specifies:

A regex to match
A category label for audit logging
An action: reject (block and respond), strip (remove matched content, pass the rest), or flag (pass through but log for monitoring)

This means adding defense against a new injection technique is a JSON edit, not a code deploy. The same pattern applies to the rate limiter (per-agent limits in JSON), the query parameterizer (approved SQL templates), and the URL validator (domain allowlists).

Task Dependency Graphs: Parallel Implementation Without Conflicts

Kiro's task generation produced a dependency graph that identified which modules could be built in parallel and which had hard ordering requirements:

{
  "waves": [
    { "id": 0, "tasks": ["config-files", "env-example"] },
    { "id": 1, "tasks": ["audit-logger", "secrets-manager"] },
    { "id": 2, "tasks": ["input-validator", "output-filter", "rate-limiter", "memory-sanitizer"] },
    { "id": 3, "tasks": ["identity-verifier", "path-validator", "url-validator", "query-parameterizer"] },
    ...
    { "id": 9, "tasks": ["defense-orchestrator"] },
    { "id": 10, "tasks": ["integration-wiring"] }
  ]
}

The audit logger and secrets manager must be built first — every other module depends on them for event logging and credential access. After that, the individual validators and filters are independent and can be built in parallel. The orchestrator comes last because it wires everything together.

This ordering prevented a common failure mode: building a defense module, realizing it needs a logging dependency that doesn't exist yet, pivoting to build the logger, then losing context on the original module.

Integration: Wiring Into the Existing Pipeline

The most critical task was wiring the Defense Orchestrator into VulnBank's existing request flow without disrupting the vulnerable behavior that workshop participants depend on. The solution uses the profile system that was already in place:

import { createOrchestrator } from './defenses/index.js';

// In the server startup:
const orchestrator = createOrchestrator(getAllAgents());

// In the request handler (existing pattern similar to maybeEnforce()):
async function handleAgentRequest(req, res, agent) {
  return orchestrator.handleRequest(req, res, agent, generateResponse);
}

When BANK_PROFILE=participant (the default), the orchestrator is a no-op passthrough — participants see fully vulnerable agents. When BANK_PROFILE=demo, per-level toggles activate specific defense modules for specific agents. The original maybeEnforce() AIM hook pattern already established this middleware-chain approach, so the defense orchestrator follows the same convention.

Defense Modules: What Got Built

The spec workflow produced 15 defense modules covering every attack vector in the workshop:

Input Layer:

Input Validator — Regex-based prompt injection detection with reject/strip/flag actions
Jailbreak Detector — Multi-pattern detection for DAN mode, roleplay bypass, hypothetical framing
Token Smuggling Detector — Decodes Base64, Unicode, ROT13 before applying pattern matching
Context Protector — Token budget management with sandwich defense placement

Data Layer:

Output Filter — Redacts API keys, PII, system prompts, and database credentials
Memory Sanitizer — Per-user isolation with instruction-pattern rejection
Query Parameterizer — Template-based SQL injection prevention
RAG Content Sanitizer — Neutralizes injections in retrieved documents

Network Layer:

Rate Limiter — Sliding-window throttling with burst detection and abuse flagging
URL Validator — Domain allowlisting with SSRF and DNS rebinding prevention
Egress Filter — Outbound request restriction to prevent data exfiltration
Path Validator — Sandbox enforcement for filesystem tool calls

Identity & Integrity:

Identity Verifier — Ed25519 JWT verification for agent-to-agent communication
Tool Registry Verifier — Cryptographic integrity checks on tool registrations
Session Isolator — Cryptographic session boundaries preventing cross-session leakage

Infrastructure:

Audit Logger — Structured NDJSON logging with rotation and level filtering
Secrets Manager — Environment-variable loading with leak detection
Security Headers — Standard HTTP hardening as Express middleware
Behavioral Drift Detector — Sliding-window analysis for persona degradation

Key Takeaways for Retrofitting AI Security

1. Start with the threat model, not the code

Kiro's requirements phase forced us to enumerate what we were defending against before touching implementation. This prevented the common pattern of building a partial fix for one vulnerability while leaving adjacent ones exposed.

2. Consistent interfaces make orchestration simple

Every defense module follows the same check() / apply() contract. The orchestrator doesn't need to know the internal logic of any module — it just calls them in sequence and handles rejections. Adding a new defense is: write the module, add it to the activation map, done.

3. Configuration-driven patterns reduce deployment risk

Moving detection logic into JSON config files means you can update defenses without redeploying code. This is especially valuable for AI security where new attack patterns emerge weekly.

4. Feature toggles are essential for security features

The HARDEN_Ln environment variable pattern lets you activate defenses incrementally. In production, this translates to feature flags per defense module — you can roll out new protections gradually and roll back instantly if they cause false positives.

5. Dependency ordering prevents cascading rework

The task dependency graph identified that the audit logger, secrets manager, and configuration files must exist before any defense module can be built. Without this ordering, you'd build a module, discover it needs logging, context-switch to build the logger, then return to the module having lost your place.

Running the Defenses

To see the defenses in action:

# Start with vulnerable (default) agents
docker compose up

# Enable specific level defenses
BANK_PROFILE=demo HARDEN_L1=on docker compose up

# Enable all defenses
BANK_PROFILE=demo HARDEN_L1=on HARDEN_L2=on HARDEN_L3=on HARDEN_L4=on HARDEN_L5=on docker compose up

With defenses enabled, the same attacks that succeed in participant mode are blocked, sanitized, or logged — demonstrating the contrast between vulnerable and hardened AI agent deployments.

Conclusion

Retrofitting security onto an existing AI agent codebase doesn't have to be chaotic. Kiro's spec-driven workflow provides the structure to go from "we need defenses" to "here are 15 tested modules integrated into the production pipeline" without losing coherence along the way. The requirements-first approach ensures coverage, the design phase locks down interfaces before implementation starts, and the task dependency graph prevents the backtracking that makes security work feel endless.

The code produced is configuration-driven, independently toggleable, and follows the same middleware patterns already established in the codebase. That's the real value — not just generating code, but generating code that fits the existing architecture and can be maintained by the team after the initial sprint.

The VulnBank workshop and all defense modules referenced in this article are available at github.com/shri-the-tree/vulnbank-workshop. The project is Apache-2.0 licensed and intended for educational use.

DEV Community