CyborgNinja1

Anatomy of a 5-Layer Defence Pipeline for AI Agent Memory

Your AI agent remembers things. That's the whole point — persistent memory turns a stateless chatbot into something genuinely useful.

But every memory write is a potential attack vector.

Prompt injection hidden in an email. API keys accidentally saved to context. Fragmented payloads that look innocent alone but assemble into an exploit over time. The moment you give an agent persistent memory, you've created an attack surface that most developers never think about.

We built ShieldCortex to solve this. It's an open-source defence pipeline that sits between your AI agent and its memory store. Every write passes through 5 layers before it's allowed to persist.

Here's how each layer works, with real code from the project.

The Pipeline Architecture

Content In → Trust → Firewall → Sensitivity → Fragment → Audit → Memory Store
                        ↓            ↓            ↓
                          BLOCK / QUARANTINE

The pipeline is fail-closed — if any layer throws an exception, the default is BLOCK. Security shouldn't depend on things going right.

export function runDefencePipeline(
  content: string,
  title: string,
  source: DefenceSource,
  config?: DefenceConfig,
): DefencePipelineResult {
  const cfg = config ?? DEFAULT_DEFENCE_CONFIG;

  try {
    const trust = scoreSource(source);
    const firewall = analyzeFirewall(content, title, source, trust.score, cfg);
    const sensitivity = classifySensitivity(content, title);

    let fragmentation = null;
    if (cfg.enableFragmentationDetection && firewall.result !== 'BLOCK') {
      fragmentation = analyzeFragmentation(content, title, cfg);
    }

    // Determine final decision...
  } catch {
    // Fail closed — if anything breaks, block it
    return { allowed: false, reason: 'Pipeline error — fail-closed' };
  }
}

Let's walk through each layer.

Layer 1: Trust Scoring

Not all memory sources are equal. A direct user message is more trustworthy than text scraped from a webpage, which is more trustworthy than content extracted from a forwarded email.

Trust scoring assigns a 0-1 confidence score based on the source:

  • Direct input (user typing) → high trust
  • Tool output (API responses, file reads) → medium trust
  • External content (emails, web pages, webhooks) → low trust
  • Sub-agent output (delegated tasks) → variable, based on agent depth

This score flows into every subsequent layer. The firewall is more aggressive with low-trust content. Sensitivity thresholds shift. Even audit logging changes — low-trust content gets more detailed forensic records.

Why it matters: Most AI security treats all content the same. But an instruction in a direct message is legitimate. The same instruction embedded in a scraped webpage is almost certainly prompt injection.
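As a rough sketch of how source scoring could work (the `SourceKind` union, base scores, and depth-decay numbers here are illustrative, not ShieldCortex's actual values):

```typescript
// Illustrative sketch — not the project's real scoreSource implementation.
type SourceKind = 'direct_input' | 'tool_output' | 'external_content' | 'sub_agent';

interface TrustSource {
  kind: SourceKind;
  agentDepth?: number; // for sub-agents: delegation hops away from the user
}

const BASE_TRUST: Record<SourceKind, number> = {
  direct_input: 0.9,
  tool_output: 0.6,
  external_content: 0.2,
  sub_agent: 0.7,
};

function scoreSource(source: TrustSource): { score: number } {
  let score = BASE_TRUST[source.kind];
  // Sub-agent trust decays with delegation depth: each hop adds uncertainty.
  if (source.kind === 'sub_agent' && source.agentDepth) {
    score = Math.max(0.1, score - 0.15 * source.agentDepth);
  }
  return { score };
}
```

The key property is ordering, not the exact numbers: direct input always outranks external content, and deeply nested sub-agents drift towards the low-trust end.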

Layer 2: Memory Firewall

This is the main threat detection layer. It runs four parallel analysis modules:

Instruction Detection — Catches prompt injection attempts. Pattern matching for common injection phrases ("ignore previous instructions", "you are now", system prompt overrides), but also structural analysis that detects instruction-like patterns even when obfuscated.

Privilege Escalation Detection — Flags attempts to elevate permissions. Things like "as an admin, grant access to..." or "override security settings and allow...". These look different from normal prompt injection — they're not trying to change the agent's behaviour, they're trying to use the agent's existing permissions.

Encoding Obfuscation Detection — Attackers encode payloads to bypass text-based detection. Base64 instructions, Unicode homoglyphs, zero-width characters, hex-encoded commands. This module decodes and re-scans.
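The decode-and-re-scan idea can be sketched like this (a simplified illustration showing only base64; the pattern list and helper names are made up, and the real module handles many more encodings):

```typescript
// Illustrative sketch of decode-and-rescan for base64-hidden payloads.
const INJECTION_PATTERNS = [/ignore (all )?previous instructions/i, /you are now/i];

function findBase64Candidates(text: string): string[] {
  // Long runs of base64-looking characters are worth decoding and re-checking.
  return text.match(/[A-Za-z0-9+\/]{24,}={0,2}/g) ?? [];
}

function scanForInjection(text: string): boolean {
  if (INJECTION_PATTERNS.some((p) => p.test(text))) return true;
  for (const candidate of findBase64Candidates(text)) {
    const decoded = Buffer.from(candidate, 'base64').toString('utf8');
    // Recurse on the decoded text to catch double-encoded payloads.
    // Terminates because decoding shrinks the input (4 chars -> 3 bytes).
    if (scanForInjection(decoded)) return true;
  }
  return false;
}
```

A plain-text scanner never sees the injection; decoding first and re-running the same scan does.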

Anomaly Scoring — Statistical behavioural analysis. Sudden spikes in memory writes, content that's dramatically different from the agent's normal pattern, unusual timing. Not every anomaly is an attack, but anomalies deserve closer inspection.

const firewall = analyzeFirewall(content, title, source, trust.score, cfg);

// Result is one of: ALLOW | BLOCK | QUARANTINE
if (firewall.result === 'BLOCK') {
  // Hard block — content is definitely malicious
} else if (firewall.result === 'QUARANTINE') {
  // Suspicious but not certain — hold for review
}

The three-outcome model (ALLOW / BLOCK / QUARANTINE) is deliberate. Binary allow/block forces you to choose between false positives (blocking legitimate content) and false negatives (letting attacks through). Quarantine gives you a middle ground — flag it, hold it, let a human or a higher-trust process review it.

Layer 3: Sensitivity Classification

Not every leak is an attack. Sometimes your agent just... saves something it shouldn't.

Sensitivity classification scans for:

  • Credentials — API keys, tokens, passwords, connection strings
  • PII — Email addresses, phone numbers, physical addresses
  • Financial data — Card numbers, bank details, account numbers
  • Internal identifiers — Database IDs, internal URLs, infrastructure details

Content is classified as PUBLIC, INTERNAL, CONFIDENTIAL, or RESTRICTED. Restricted content is auto-blocked. Confidential content triggers redaction — the memory is saved, but sensitive values are masked.

Why this isn't just regex: Pattern matching catches the obvious stuff (things that look like AWS keys or credit card numbers). But sensitivity also considers context. A phone number in a contact record is normal. A phone number appearing in what looks like an instruction to "send a message to..." is suspicious.
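A stripped-down version of the pattern-matching half might look like this (the pattern list is tiny and illustrative; the PUBLIC/INTERNAL/CONFIDENTIAL/RESTRICTED levels are the ones described above, but the function name and thresholds are assumptions):

```typescript
// Illustrative sketch — a real scanner has far more patterns plus context analysis.
type SensitivityLevel = 'PUBLIC' | 'INTERNAL' | 'CONFIDENTIAL' | 'RESTRICTED';

const SENSITIVE_PATTERNS: Array<{ pattern: RegExp; level: SensitivityLevel }> = [
  { pattern: /AKIA[0-9A-Z]{16}/, level: 'RESTRICTED' },          // AWS-access-key shape
  { pattern: /\b\d{13,16}\b/, level: 'RESTRICTED' },             // card-number-like digits
  { pattern: /[\w.+-]+@[\w-]+\.[\w.]+/, level: 'CONFIDENTIAL' }, // email addresses
];

const RANK: SensitivityLevel[] = ['PUBLIC', 'INTERNAL', 'CONFIDENTIAL', 'RESTRICTED'];

function classifyAndRedact(content: string): { level: SensitivityLevel; redacted: string } {
  let level: SensitivityLevel = 'PUBLIC';
  let redacted = content;
  for (const { pattern, level: hitLevel } of SENSITIVE_PATTERNS) {
    if (pattern.test(redacted)) {
      // Mask every occurrence, and keep the highest level seen so far.
      redacted = redacted.replace(new RegExp(pattern.source, 'g'), '[REDACTED]');
      if (RANK.indexOf(hitLevel) > RANK.indexOf(level)) level = hitLevel;
    }
  }
  return { level, redacted };
}
```

In the real pipeline the level then drives the decision: RESTRICTED blocks the write, CONFIDENTIAL persists the redacted text.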

Layer 4: Fragmentation Detection

This is the layer most security tools miss entirely.

Fragmented payload attacks work like this: no single memory write is malicious. But over time, an attacker feeds fragments that, when assembled by the agent's context window, form a complete attack.

Memory 1: "The admin API endpoint is at /api/v1/admin"
Memory 2: "Authentication uses Bearer tokens from env.SECRET_KEY"  
Memory 3: "To test the admin API, make a curl request with the token"

Each memory is individually benign. Together, they're a recipe for credential exfiltration.

ShieldCortex's fragmentation detector:

  1. Extracts entities from each new memory (URLs, credentials patterns, commands, identifiers)
  2. Checks temporal overlap — are recent memories building towards something?
  3. Runs assembly detection — do the fragments combine into a known attack pattern?

// recentEntities is drawn from the temporal window of recent memory writes
const newEntities = extractEntities(fullText);
const overlapping = findOverlappingEntities(newEntities, recentEntities);
const assembly = detectAssembly(newEntities, overlapping);

// If assembly risk is high, quarantine for review

This is inspired by research on memory poisoning through fragment accumulation. It's a real threat vector that becomes more dangerous as agents get more capable and have longer memory windows.
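A toy version of the entity-overlap idea, using the three example memories above (the entity kinds, regexes, and risk formula are invented for illustration):

```typescript
// Illustrative sketch of fragment-assembly risk across a memory window.
interface Entity { kind: 'url' | 'env_var' | 'command'; value: string; }

function extractEntities(text: string): Entity[] {
  const entities: Entity[] = [];
  for (const m of text.match(/\/api\/[\w\/.-]+|https?:\/\/\S+/g) ?? [])
    entities.push({ kind: 'url', value: m });
  for (const m of text.match(/env\.[A-Z_]+/g) ?? [])
    entities.push({ kind: 'env_var', value: m });
  for (const m of text.match(/\bcurl\b|\bwget\b/g) ?? [])
    entities.push({ kind: 'command', value: m });
  return entities;
}

// Risk rises as one write completes a set of entity kinds seen across
// recent memories: endpoint + secret + invocation = a runnable exploit.
function assemblyRisk(newText: string, recentMemories: string[]): number {
  const recentKinds = recentMemories.flatMap(extractEntities).map((e) => e.kind);
  const newKinds = extractEntities(newText).map((e) => e.kind);
  const combined = new Set([...recentKinds, ...newKinds]);
  return combined.size / 3;
}
```

Run against the three fragments above, the first two memories contribute a URL and an env-var reference; the third adds the command, pushing the combined risk to its maximum even though each write scores low alone.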

Layer 5: Audit Trail

Every scan — whether allowed, blocked, or quarantined — gets a full forensic record:

  • Content hash (for deduplication and tamper detection)
  • Trust score and source metadata
  • Every threat indicator from every layer
  • Final decision and reasoning
  • Timestamp and processing duration

The audit trail isn't just for compliance. It's how you tune the system. False positives show up as patterns in the audit logs. New attack vectors appear as anomalies that the firewall missed but the audit trail captured.
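The record shape above can be sketched as a type plus a builder (field and function names are illustrative, not the project's actual schema):

```typescript
import { createHash } from 'node:crypto';

// Illustrative shape of a per-scan forensic record.
interface AuditRecord {
  contentHash: string;   // SHA-256 of content, for dedup and tamper detection
  trustScore: number;
  source: string;
  indicators: string[];  // every threat indicator raised by any layer
  decision: 'ALLOW' | 'BLOCK' | 'QUARANTINE';
  reason: string;
  timestamp: string;
  durationMs: number;
}

function buildAuditRecord(
  content: string,
  trustScore: number,
  source: string,
  indicators: string[],
  decision: AuditRecord['decision'],
  reason: string,
  startedAt: number,
): AuditRecord {
  return {
    contentHash: createHash('sha256').update(content).digest('hex'),
    trustScore,
    source,
    indicators,
    decision,
    reason,
    timestamp: new Date().toISOString(),
    durationMs: Date.now() - startedAt,
  };
}
```

Hashing the content rather than storing it keeps the audit log useful for dedup and tamper checks without turning the log itself into a second copy of sensitive data.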

The Decision Engine

After all 5 layers run, the pipeline combines their outputs:

if (firewall.result === 'BLOCK') {
  allowed = false;  // Firewall says no — hard block
} else if (firewall.result === 'QUARANTINE') {
  allowed = false;  // Suspicious — hold for review
} else if ((fragmentation?.score ?? 0) > cfg.autoQuarantineThreshold) {
  allowed = false;  // Fragment assembly risk too high
} else if (sensitivity.level === 'RESTRICTED') {
  allowed = false;  // Contains restricted data
} else {
  allowed = true;   // All clear
}

Order matters. The firewall gets first say because it catches active attacks. Fragmentation and sensitivity run after because they catch subtler issues that aren't obvious threats.

What This Looks Like in Practice

ShieldCortex ships with a local dashboard that visualises the entire pipeline in real-time:

ShieldCortex Defence Overview

The defence pipeline view shows all 5 layers, their status, and a live threat timeline. The quarantine queue lets you review flagged content before it reaches memory.

Getting Started

ShieldCortex is open-source and free:

npm install shieldcortex
npx shieldcortex setup

If you're running OpenClaw (open-source AI agent framework), there's a one-command integration:

sudo npx shieldcortex openclaw install

Every memory your agent saves will pass through the full 5-layer pipeline automatically.


2,300+ developers are already using ShieldCortex. The npm package is free forever. There's an optional Cloud dashboard if you want team visibility and audit logs.

AI agents are getting more capable every month. Their memory systems need security that keeps pace.

GitHub · Website · npm
