The security problem with AI agents
AI agents are powerful because they do things — they read files, run commands, send messages, search your data. That power comes with a question most agent frameworks don't answer well:
What stops the agent from doing things it shouldn't?
Most agent systems bolt on safety as an afterthought: a prompt that says "be careful," maybe a regex filter on outputs, and hope for the best. That's not security. That's a suggestion.
OpenPawz takes a different approach. We treat agent security as a systems engineering problem — not a prompt engineering one. The result is a multi-layer defense-in-depth architecture enforced at the Rust engine level, where the agent has zero ability to bypass controls regardless of what any prompt says.
Star the repo — it's open source
Zero attack surface by default
OpenPawz exposes zero network ports in its default configuration. There is no HTTP server, no WebSocket endpoint, and no listening socket for an attacker to target. The only communication path is Tauri's in-process IPC — a direct Rust-to-WebView bridge that never touches the network.
Four optional listeners exist (webhook server, WebChat, WhatsApp bridge, n8n engine), but all are:
- Disabled by default
- Bound to `127.0.0.1` — unreachable from the network even when enabled
- Individually authenticated — bearer tokens, session cookies, IP rate limiting
Binding to 0.0.0.0 is a manual opt-in that triggers a security warning and recommends TLS wrapping via Tailscale Funnel.
The WebView enforces a strict Content Security Policy: `default-src 'self'`, `script-src 'self'`, `object-src 'none'`, `frame-ancestors 'none'`. No external scripts, no iframe embedding, no cross-origin form submission.
Human-in-the-Loop: every side-effect needs permission
The core design principle: agents never touch the OS directly. Every tool call flows through the Rust tool executor, which classifies it by risk before deciding whether to proceed.
Auto-approved (no modal)
Read-only and informational tools run without interruption — read_file, web_search, memory_search, soul_read, self_info, email_read, slack_read, create_task, and others. No friction for safe operations.
Requires approval (modal shown)
Side-effect tools pause execution and show a risk-classified modal to the user:
| Risk Level | Behavior | Example |
|---|---|---|
| Critical | Auto-denied by default; red modal requiring the user to type "ALLOW" | `sudo rm -rf /`, `curl \| sh` |
| High | Orange warning modal | `chmod 777`, `kill -9` |
| Medium | Yellow caution modal | `npm install`, outbound HTTP requests |
| Low | Standard approval | Unknown exec commands |
| Safe | Auto-approved via allowlist (90+ default patterns) | `git status`, `ls`, `cat` |
Danger pattern detection
30+ patterns across multiple categories are caught before they can execute:
- Privilege escalation — `sudo`, `su`, `doas`, `pkexec`, `runas`
- Destructive deletion — `rm -rf /`, `rm -rf ~`, `rm -rf /*`
- Permission exposure — `chmod 777`, `chmod -R 777`
- Disk destruction — `dd if=`, `mkfs`, `fdisk`
- Remote code execution — `curl | sh`, `wget | bash`
- Process termination — `kill -9 1`, `killall`
- Firewall manipulation — `iptables -F`, `ufw disable`
- Network exfiltration — piping file contents to `curl`, `scp` outbound, `/dev/tcp`
Users can add custom regex rules for both allow and deny lists. The session override feature ("allow all" for a timed window) still blocks privilege escalation commands — you can't override the most dangerous class.
Agent governance: four policy presets
Not every agent should have the same power. OpenPawz provides per-agent tool access control with four built-in presets and support for custom policies:
| Preset | Mode | What it does |
|---|---|---|
| Unrestricted | unrestricted | Full tool access, no constraints |
| Standard | denylist | All tools available, but high-risk tools always require human approval |
| Read-Only | allowlist | Only safe read/search/list operations (28 tools) |
| Sandbox | allowlist | Only 5 tools: `web_search`, `web_read`, `memory_store`, `memory_search`, `self_info` |
Policies are enforced at two levels simultaneously:
- Frontend: `checkToolPolicy()` evaluates per-tool decisions and strips unauthorized tools from the request
- Backend: `ChatRequest.tool_filter` carries the allowed tool list to the Rust engine — the agent literally cannot see tools it doesn't have access to
This means a sandboxed research agent physically cannot call exec or write_file, regardless of what its prompt says. The tools don't exist in its schema.
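The two-level filtering can be sketched as follows. This Python sketch makes assumptions about shapes: the preset contents come from the table above, but the function names and dictionary layout are illustrative, not the actual `checkToolPolicy()`/`tool_filter` API.

```python
# Illustrative presets; the Sandbox tool list matches the table above.
PRESETS = {
    "sandbox": {"mode": "allowlist",
                "tools": {"web_search", "web_read", "memory_store",
                          "memory_search", "self_info"}},
    "unrestricted": {"mode": "unrestricted", "tools": None},
}

def build_tool_filter(preset_name: str, all_tools: set[str]) -> set[str]:
    """Frontend side: compute the allowed-tool list the request will carry."""
    preset = PRESETS[preset_name]
    if preset["mode"] == "unrestricted":
        return set(all_tools)
    return all_tools & preset["tools"]   # strip everything not allowlisted

def visible_schema(all_tools: set[str], tool_filter: set[str]) -> set[str]:
    """Engine side: the agent's schema only ever contains filtered tools."""
    return {t for t in all_tools if t in tool_filter}
```

The key property is that the backend never advertises a stripped tool, so no prompt can re-enable it.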
Memory encryption: three independent defense layers
Project Engram — the memory system — applies defense-in-depth to all stored agent memories (episodic, semantic, and procedural). Even if an attacker gains access to the SQLite database file, the data remains protected.
Layer 1: Per-agent HKDF key derivation
A single master key lives in the OS keychain (paw-memory-vault). From it, three independent key families are derived via HKDF-SHA256 domain separation:
| Domain | HKDF Salt | Purpose |
|---|---|---|
| Agent encryption | `engram-agent-key-v1` | Per-agent AES-256-GCM memory encryption |
| Snapshot HMAC | `engram-snapshot-hmac-v1` | Tamper detection for working memory snapshots |
| Capability signing | `engram-platform-cap-v1` | HMAC-SHA256 signing of capability tokens |
Every agent gets a unique derived key. Cross-agent decryption is mathematically impossible without the master key. Compromising one agent's derived key does not expose any other agent's memories.
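HKDF domain separation is a standard construction (RFC 5869), and a minimal stdlib sketch shows why derived keys are independent. The salts below come from the table above; how the per-agent `info` parameter is built (here, `agent:<id>`) is an assumption.

```python
import hmac, hashlib

def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """Minimal RFC 5869 HKDF-SHA256 (extract-then-expand), stdlib only."""
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()   # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                                    # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

master = b"master-key-from-os-keychain"   # stand-in for the paw-memory-vault key
key_agent_a = hkdf_sha256(master, b"engram-agent-key-v1", b"agent:a")
key_agent_b = hkdf_sha256(master, b"engram-agent-key-v1", b"agent:b")
key_snapshot = hkdf_sha256(master, b"engram-snapshot-hmac-v1", b"agent:a")
```

Different `info` values yield unrelated agent keys, and different salts yield unrelated key families, so neither leaks anything about the other without the master key.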
Layer 2: SQL scope filtering
Every memory query includes scope constraints at the SQL level — agent_id, project_id, squad_id. Even without encryption, the query layer enforces isolation.
Layer 3: Signed capability tokens
Every gated_search() call (the unified memory retrieval entry point) performs 4-step cryptographic verification:
1. HMAC signature integrity — token verified against the platform signing key
2. Identity binding — the token's `agent_id` must match the requesting agent
3. Scope ceiling check — requested search scope cannot exceed the token's `max_scope`
4. Membership verification — for squad/project scopes, the agent must actually belong to that squad or project
This prevents confused-deputy attacks where an agent could be tricked into reading another agent's memories.
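The four checks can be sketched end to end. This is an illustrative Python model, not the engine's code: the token fields and the membership representation (a plain set of scope names) are simplifying assumptions.

```python
import hmac, hashlib
from dataclasses import dataclass

SCOPE_RANK = {"targeted": 1, "squad": 2, "project": 3, "global": 4}
PLATFORM_KEY = b"platform-signing-key"   # stand-in for the derived capability key

@dataclass
class CapabilityToken:
    agent_id: str
    max_scope: str
    signature: bytes

def sign(agent_id: str, max_scope: str) -> bytes:
    return hmac.new(PLATFORM_KEY, f"{agent_id}|{max_scope}".encode(),
                    hashlib.sha256).digest()

def gated_search_check(token, requester_id, requested_scope, memberships) -> bool:
    # 1. HMAC signature integrity (constant-time comparison)
    if not hmac.compare_digest(token.signature, sign(token.agent_id, token.max_scope)):
        return False
    # 2. Identity binding: a stolen token is useless to another agent
    if token.agent_id != requester_id:
        return False
    # 3. Scope ceiling: requested scope cannot exceed max_scope
    if SCOPE_RANK[requested_scope] > SCOPE_RANK[token.max_scope]:
        return False
    # 4. Membership verification for squad/project scopes
    if requested_scope in ("squad", "project") and requested_scope not in memberships:
        return False
    return True
```

Step 2 is what defeats the confused deputy: even a validly signed token only works for the agent it names.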
Automatic PII detection and field-level encryption
Before any memory is stored, it passes through a two-layer PII scanner with 17 regex pattern types:
Layer 1 (regex patterns): Social Security Numbers, credit card numbers, email addresses, phone numbers, physical addresses, person names, government IDs, JWT tokens, AWS access keys, private keys, IBANs, IPv4 addresses, API keys, passwords, and dates of birth.
Layer 2 (LLM-assisted): A secondary scanner catches context-dependent PII that static regex cannot detect — phrases like "my mother's maiden name is Smith" or "I was born in Springfield." The LLM returns structured JSON with PII type classifications and confidence scores.
Content is classified into three tiers:
| Tier | Content | Treatment |
|---|---|---|
| Cleartext | No PII detected | Stored as-is |
| Sensitive | PII detected (email, name, phone, IP) | AES-256-GCM encrypted |
| Confidential | High-sensitivity PII (SSN, credit card, JWT, AWS key, private key) | AES-256-GCM encrypted |
Encrypted content uses the format `enc:v1:base64(nonce ‖ ciphertext ‖ tag)`. A fresh 96-bit nonce is generated per encryption operation. Decryption is transparent on retrieval using the per-agent derived key.
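The wire format itself is easy to model. The sketch below frames and unframes the `enc:v1` layout with the AES-256-GCM sizes named above (96-bit nonce, 128-bit tag); the cipher step is deliberately omitted since the Python stdlib has no AES-GCM, so treat this as a format illustration only.

```python
import base64, os

def frame(nonce: bytes, ciphertext: bytes, tag: bytes) -> str:
    """Produce enc:v1:base64(nonce || ciphertext || tag)."""
    assert len(nonce) == 12 and len(tag) == 16   # 96-bit nonce, 128-bit GCM tag
    return "enc:v1:" + base64.b64encode(nonce + ciphertext + tag).decode()

def unframe(blob: str) -> tuple[bytes, bytes, bytes]:
    """Split a framed value back into (nonce, ciphertext, tag)."""
    prefix, version, payload = blob.split(":", 2)
    assert (prefix, version) == ("enc", "v1")
    raw = base64.b64decode(payload)
    return raw[:12], raw[12:-16], raw[-16:]
```

Versioning the prefix (`v1`) leaves room to rotate the format later without guessing how old rows were encoded.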
Key rotation
An automated key rotation scheduler runs on a configurable interval (default: 90 days) and re-encrypts all agent memories with fresh HKDF-derived keys. The rotation is atomic — if any re-encryption fails, the entire batch rolls back. No data is left in a half-migrated state.
Inter-agent memory bus: scoped, signed, rate-limited
When multiple agents need to share information, the Memory Bus provides pub/sub memory sharing with publish-side authentication to prevent memory poisoning.
Capability tokens
Every agent holds an AgentCapability signed with HMAC-SHA256 against a platform-held secret key. The token specifies:
- Max publication scope — Targeted (specific agents), Squad, Project, or Global
- Importance ceiling — the maximum importance an agent can self-assign (0.0–1.0)
- Write permission — whether the agent can publish at all
- Rate limit — maximum publications per consolidation cycle
The scope hierarchy is a strict linear lattice:
Targeted (rank 1) < Squad (rank 2) < Project (rank 3) < Global (rank 4)
An agent with max_scope = Squad can publish to targeted agents or its squad, but cannot publish to the project or global scope. Ceiling enforcement uses a simple rank comparison — no ambiguity, no escalation path.
Trust-weighted contradiction resolution
When two agents publish contradictory facts on the same topic, the system resolves it based on:
`effective_importance = raw_importance × agent_trust_score`
The memory with the higher effective importance is retained. Trust scores are per-agent (0.0–1.0) and adjustable at runtime. This prevents a compromised or low-trust agent from overwriting facts established by high-trust agents through recency alone.
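A toy resolution pass makes the formula concrete. Field names and the example values here are illustrative assumptions, not Memory Bus internals.

```python
# Pick the memory with the highest raw_importance * agent_trust_score.
def resolve(candidates: list[dict], trust: dict[str, float]) -> dict:
    return max(candidates, key=lambda m: m["importance"] * trust[m["agent_id"]])

memories = [
    {"agent_id": "intern",  "fact": "server is up",   "importance": 0.9},
    {"agent_id": "monitor", "fact": "server is down", "importance": 0.6},
]
trust = {"intern": 0.3, "monitor": 0.9}
winner = resolve(memories, trust)
```

Even though the low-trust agent claimed maximum importance, 0.9 × 0.3 = 0.27 loses to 0.6 × 0.9 = 0.54, so the high-trust fact is retained.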
Publish-side defenses
| Defense | Detail |
|---|---|
| Scope enforcement | Publication scope clamped to agent's maximum |
| Importance ceiling | Publication importance clamped to agent's ceiling |
| Per-agent rate limiting | Publish count tracked per GC window; exceeded limits return an error |
| Injection scanning | All publication content scanned for prompt injection patterns before entering the bus |
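The first three defenses in the table reduce to clamping and counting, sketched below. The capability dictionary layout and error messages are assumptions; only the clamping semantics come from the text above.

```python
SCOPE_RANK = {"targeted": 1, "squad": 2, "project": 3, "global": 4}

def clamp_publication(cap: dict, requested_scope: str,
                      requested_importance: float, publish_count: int):
    """Apply publish-side defenses: write permission, rate limit,
    scope clamp, and importance-ceiling clamp."""
    if not cap["can_write"]:
        raise PermissionError("agent has no write permission")
    if publish_count >= cap["rate_limit"]:
        raise PermissionError("rate limit exceeded for this GC window")
    scope = requested_scope
    if SCOPE_RANK[scope] > SCOPE_RANK[cap["max_scope"]]:
        scope = cap["max_scope"]            # clamp to the agent's ceiling
    importance = min(requested_importance, cap["importance_ceiling"])
    return scope, importance
```

Clamping (rather than rejecting) means an over-ambitious publish still lands, just within the agent's authority.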
Threat model
| Attack | Mitigation |
|---|---|
| Agent floods bus with poisoned memories | Rate limit + injection scan on publish |
| Low-trust agent overwrites high-trust facts | Trust-weighted contradiction resolution |
| Agent publishes beyond its authority | Scope ceiling enforcement |
| Forged capability token | HMAC-SHA256 verification against platform secret |
| Cross-agent memory reads via confused deputy | Signed read-path tokens with identity binding + membership verification |
Multi-agent orchestration: delegation with guardrails
OpenPawz supports three distinct agent-to-agent communication patterns, each with its own security model:
1. Orchestrator projects (boss/worker hierarchy)
A boss agent receives a project goal and team roster, then delegates tasks to worker agents:
| Control | Detail |
|---|---|
| Per-agent capabilities filter | Each sub-agent gets a capabilities list restricting which tools it can access — tools not on the list are physically removed from the agent's schema |
| HIL on exfiltration tools | `email_send`, `slack_send`, `webhook_send`, `rest_api_call`, `exec`, `write_file`, `delete_file` always require user approval — even under orchestrator delegation |
| Max tool rounds | Global cap (default 20) bounds every agent loop |
| Max concurrent runs | Default 4 simultaneous agent runs across the entire engine |
| Worker exit conditions | Workers stop on report_progress(done), max tool rounds, or error — they cannot run indefinitely |
2. The Foreman Protocol (architect/worker split)
For MCP tool execution, the Foreman Protocol splits agent work into two roles:
- Architect (cloud LLM): Plans and reasons — decides what to do
- Foreman (local/cheap model): Executes how — handles MCP tool calls
Critical security constraints:
- No recursion — the Foreman cannot spawn sub-workers or delegate further
- 8-round cap — max 8 tool call rounds per delegation
- Direct MCP execution — Foreman calls MCP servers via JSON-RPC directly
3. Squads (peer-to-peer collaboration)
Flat peer groups with channel-based messaging. No boss/worker hierarchy, but scoped by squad membership.
4. Direct agent messaging
Any agent can message any other agent via the agent_send_message tool. Broadcast messages are visible to all agents. Channel-based filtering available.
Anti-forensic protections
The memory store mitigates vault-size oracle attacks — a side-channel where an attacker infers how many memories are stored by watching the SQLite file size. This is the same threat class addressed by KDBX (KeePass) inner-content padding.
| Mitigation | Detail |
|---|---|
| Bucket padding | Database padded to 512KB boundaries via padding table — an observer can only determine a coarse size bucket, not exact memory count |
| Secure erasure | Two-phase delete: content fields overwritten with empty values, then row deleted — prevents plaintext recovery from freed pages or WAL replay |
| 8KB page size | `PRAGMA page_size = 8192` reduces file-size measurement granularity |
| Secure delete | `PRAGMA secure_delete = ON` zeroes freed B-tree pages at the SQLite layer |
| Incremental auto-vacuum | Prevents immediate file-size shrinkage after deletions (which would reveal deletion count) |
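The bucket padding in the first row is a one-line ceiling computation. This sketch shows only the arithmetic; how the padding table actually grows the file is an implementation detail not modeled here.

```python
BUCKET = 512 * 1024   # 512 KB padding boundary

def padded_size(actual_bytes: int) -> int:
    """Round the vault file up to the next 512 KB bucket, so an observer
    learns only a coarse size class, never the exact memory count."""
    return -(-actual_bytes // BUCKET) * BUCKET   # ceiling division
```

Every file between 1 byte and 512 KB looks identical from the outside, which is the same idea as KDBX inner-content padding.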
Working memory snapshot integrity
Snapshots of an agent's working memory (saved on agent switch or session end) include an HMAC-SHA256 integrity tag computed from a dedicated HKDF-derived key. On restore, the HMAC is verified — tampered snapshots are rejected and logged.
Credential security
No cryptographic key is ever stored on the filesystem. Everything lives in the OS keychain:
| Key | Keychain Entry | Purpose |
|---|---|---|
| DB encryption key | `paw-db-encryption` | AES-256-GCM database field encryption |
| Skill vault key | `paw-skill-vault` | AES-256-GCM skill credential encryption |
| Memory vault key | `paw-memory-vault` | Master key for HKDF per-agent memory encryption |
| Lock screen hash | `paw-lock-screen` | SHA-256 hashed passphrase |
There is no device.json, no key file, and no config file containing secrets. If the OS keychain is unavailable, the app refuses to store credentials rather than falling back to plaintext. No silent degradation.
API key zeroing in memory
API keys in provider structs are wrapped in Zeroizing<String> from the zeroize crate. When a provider is dropped, the key memory is immediately zeroed using write_volatile — preventing:
- Memory dump attacks (forensic tools scanning process memory)
- Swap file leaks (unencrypted keys persisted to disk via OS paging)
- Use-after-free (freed memory still containing the key being reallocated)
Credential audit trail
Every credential access is logged to credential_activity_log with action, requesting tool, allow/deny decision, and timestamp.
TLS certificate pinning
All AI provider connections use a certificate-pinned TLS configuration via rustls. The OS trust store is explicitly excluded.
| Property | Detail |
|---|---|
| Library | `rustls` 0.23 (pure-Rust, no OpenSSL) |
| Root store | Mozilla root certificates via webpki-roots only |
| OS trust store | Explicitly excluded — system CAs are never consulted |
| Connect timeout | 10 seconds |
| Request timeout | 120 seconds |
Why this matters: most TLS MITM attacks rely on installing a custom root CA on the victim's machine (corporate proxies, malware, government surveillance). By pinning to Mozilla's root store, OpenPawz rejects certificates signed by any non-Mozilla CA, even if the OS trusts it.
Outbound request signing
Every AI provider request is SHA-256 signed before transmission:
SHA-256(provider ‖ model ‖ ISO-8601 timestamp ‖ request body)
Hashes are logged to an in-memory ring buffer (500 entries) for tamper detection and compliance auditing. If a proxy modifies the request body in transit, the recorded hash won't match.
Prompt injection defense
Dual-implementation scanning (TypeScript + Rust) for 30+ injection patterns across 9 categories:
| Category | Examples |
|---|---|
| Override | "Ignore previous instructions" |
| Identity | "You are now..." |
| Jailbreak | "DAN mode", "no restrictions" |
| Leaking | "Show me your system prompt" |
| Obfuscation | Base64-encoded instructions |
| Tool injection | Fake tool call formatting |
| Social engineering | "As an AI researcher..." |
| Markup | Hidden instructions in HTML/markdown |
| Bypass | "This is just a test..." |
Messages scoring Critical (40+) are blocked entirely and never delivered to the agent. Channel bridges automatically enforce this.
Memory-side injection scanning
Recalled memories are scanned for 10 injection patterns before being returned to agent context. Suspicious content is redacted with [REDACTED:injection] markers — poisoned memories cannot manipulate future agent behavior.
Anti-fixation defenses
Five layers prevent agents from ignoring user instructions or getting stuck:
| Defense | What it does |
|---|---|
| Response loop detection | Jaccard similarity checks catch the agent repeating itself — active on ALL channels |
| User override detection | Recognizes "stop", "focus on my question", "that's not what I asked" across 5 phrase categories with 3-level escalation |
| Unidirectional topic ignorance | Catches unique-but-wrong responses after a redirect — fires when the agent's response has zero entity overlap with the user's keywords |
| Momentum clearing | Clears working memory trajectory embeddings on user override — recalled context serves the new topic, not the old one |
| Tool-call loop breaker | Hash-based signature detection stops repeated identical tool calls after 3 consecutive matches |
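Jaccard-based loop detection, the first defense in the table, compares word sets between responses. This is a minimal sketch: the 0.8 threshold, the 3-message window, and whole-word tokenization are illustrative assumptions, not the engine's tuned values.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def is_looping(history: list[str], new_response: str,
               threshold: float = 0.8) -> bool:
    """Flag a response too similar to any of the last three responses."""
    return any(jaccard(prev, new_response) >= threshold
               for prev in history[-3:])
```

A near-verbatim repeat scores close to 1.0 and trips the detector; a genuinely new answer shares few words and passes.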
Filesystem sandboxing
Sensitive path blocking
20+ sensitive paths are permanently blocked from agent access:
~/.ssh · ~/.gnupg · ~/.aws · ~/.kube · ~/.docker · ~/.password-store · /etc · /root · /proc · /sys · /dev · filesystem root · home directory root
Per-project scope
When a project is active, all file operations are constrained to the project root. Directory traversal sequences (../) are detected and blocked. Violations are logged to the security audit.
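Both checks (sensitive-path blocklist and per-project scope) hinge on resolving the path before comparing, so `../` sequences cannot escape. The blocklist subset and function shape below are illustrative, not the engine's Rust implementation.

```python
from pathlib import Path

# Illustrative subset of the blocked sensitive paths listed above.
BLOCKED = [Path.home() / ".ssh", Path.home() / ".aws", Path("/etc"), Path("/proc")]

def is_allowed(requested: str, project_root: str) -> bool:
    """Resolve symlinks and '..' first, then enforce both the
    sensitive-path blocklist and the per-project scope."""
    path = Path(requested).resolve()        # neutralises ../ traversal
    root = Path(project_root).resolve()
    if any(path == b or b in path.parents for b in BLOCKED):
        return False                        # sensitive path: always denied
    return path == root or root in path.parents   # must stay under project root
```

Resolving before comparing is the whole trick: `/tmp/proj/../../etc/passwd` normalises to `/etc/passwd` and is rejected on both grounds.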
Source code introspection block
Agents cannot read their own engine source files — any read_file call targeting paths containing src-tauri/src/engine/ or files ending in .rs is rejected. This prevents agents from discovering internal security mechanisms.
Container sandbox
Docker-based execution isolation via the bollard crate:
| Measure | Default |
|---|---|
| Capabilities | cap_drop ALL |
| Network | Disabled |
| Memory limit | 256 MB |
| CPU shares | 512 |
| Timeout | 30 seconds |
| Output limit | 50 KB |
Four presets: Minimal (alpine, 128MB, no network), Development (node:20-alpine, 512MB), Python (python:3.12-alpine, 512MB), Restricted (alpine, 64MB, 10s timeout).
GDPR Article 17 — Right to erasure
The engine_memory_purge_user command performs complete data erasure for a user:
- All memory content rows deleted
- All vector embeddings deleted
- Search index entries removed
- Graph edges removed
- Padding table repacked to prevent file-size leakage
- `PRAGMA secure_delete` ensures freed pages are zeroed
- Returns a count of erased records for compliance reporting
The twelve layers at a glance
| # | Layer | What it protects against |
|---|---|---|
| 1 | Zero open ports | Remote network attacks |
| 2 | Human-in-the-Loop | Unauthorized side-effects |
| 3 | Agent policies | Over-privileged agents |
| 4 | Per-agent HKDF encryption | Cross-agent data access |
| 5 | PII detection + field encryption | Data exposure at rest |
| 6 | Signed capability tokens | Scope escalation, confused deputy attacks |
| 7 | Trust-weighted memory bus | Memory poisoning between agents |
| 8 | TLS certificate pinning | MITM on provider connections |
| 9 | Prompt injection scanning | Prompt manipulation (inbound + recalled) |
| 10 | Anti-fixation defenses | Agent ignoring user instructions |
| 11 | Filesystem sandboxing | Credential theft, path traversal |
| 12 | Anti-forensic vault padding | File-size side-channel leakage |
Read the full security docs
The complete security reference — including risk classification tables, allowlist/denylist patterns, and every configuration option — lives in the repo:
- SECURITY.md — Security overview and threat model
- Security Reference — Full technical reference
- ENGRAM.md — Memory architecture whitepaper
If you find a vulnerability, please report it responsibly via the contact information in the repo rather than opening a public issue.
Star the repo if you want to track progress. 🙏
