The security problem with AI agents
AI agents are powerful because they do things — they read files, run commands, send messages, search your data. That power comes with a question most agent frameworks don't answer well:
What stops the agent from doing things it shouldn't?
Most agent systems bolt on safety as an afterthought: a prompt that says "be careful," maybe a regex filter on outputs, and hope for the best. That's not security. That's a suggestion.
OpenPawz takes a different approach. We treat agent security as a systems engineering problem — not a prompt engineering one. The result is a multi-layer defense-in-depth architecture enforced at the Rust engine level, where the agent has zero ability to bypass controls regardless of what any prompt says.
Star the repo — it's open source
Zero attack surface by default
OpenPawz exposes zero network ports in its default configuration. There is no HTTP server, no WebSocket endpoint, and no listening socket for an attacker to target. The only communication path is Tauri's in-process IPC — a direct Rust-to-WebView bridge that never touches the network.
Four optional listeners exist (webhook server, WebChat, WhatsApp bridge, n8n engine), but all are:
- Disabled by default
- Bound to `127.0.0.1` — unreachable from the network even when enabled
- Individually authenticated — bearer tokens, session cookies, IP rate limiting
Binding to 0.0.0.0 is a manual opt-in that triggers a security warning and recommends TLS wrapping via Tailscale Funnel.
The WebView enforces a strict Content Security Policy: `default-src 'self'`, `script-src 'self'`, `object-src 'none'`, `frame-ancestors 'none'`. No external scripts, no iframe embedding, no cross-origin form submission.
Human-in-the-Loop: every side-effect needs permission
The core design principle: agents never touch the OS directly. Every tool call flows through the Rust tool executor, which classifies it by risk before deciding whether to proceed.
Auto-approved (no modal)
Read-only and informational tools run without interruption — read_file, web_search, memory_search, soul_read, self_info, email_read, slack_read, create_task, and others. No friction for safe operations.
Requires approval (modal shown)
Side-effect tools pause execution and show a risk-classified modal to the user:
| Risk Level | Behavior | Example |
|---|---|---|
| Critical | Auto-denied by default; red modal requiring the user to type "ALLOW" | `sudo rm -rf /`, `curl \| sh` |
| High | Orange warning modal | `chmod 777`, `kill -9` |
| Medium | Yellow caution modal | `npm install`, outbound HTTP requests |
| Low | Standard approval | Unknown exec commands |
| Safe | Auto-approved via allowlist (90+ default patterns) | `git status`, `ls`, `cat` |
Danger pattern detection
30+ patterns across multiple categories are caught before they can execute:
- Privilege escalation — `sudo`, `su`, `doas`, `pkexec`, `runas`
- Destructive deletion — `rm -rf /`, `rm -rf ~`, `rm -rf /*`
- Permission exposure — `chmod 777`, `chmod -R 777`
- Disk destruction — `dd if=`, `mkfs`, `fdisk`
- Remote code execution — `curl | sh`, `wget | bash`
- Process termination — `kill -9 1`, `killall`
- Firewall manipulation — `iptables -F`, `ufw disable`
- Network exfiltration — piping file contents to `curl`, `scp` outbound, `/dev/tcp`
Users can add custom regex rules for both allow and deny lists. The session override feature ("allow all" for a timed window) still blocks privilege escalation commands — you can't override the most dangerous class.
Agent governance: four policy presets
Not every agent should have the same power. OpenPawz provides per-agent tool access control with four built-in presets and support for custom policies:
| Preset | Mode | What it does |
|---|---|---|
| Unrestricted | unrestricted | Full tool access, no constraints |
| Standard | denylist | All tools available, but high-risk tools always require human approval |
| Read-Only | allowlist | Only safe read/search/list operations (28 tools) |
| Sandbox | allowlist | Only 5 tools: `web_search`, `web_read`, `memory_store`, `memory_search`, `self_info` |
Policies are enforced at two levels simultaneously:
- Frontend: `checkToolPolicy()` evaluates per-tool decisions and strips unauthorized tools from the request
- Backend: `ChatRequest.tool_filter` carries the allowed tool list to the Rust engine — the agent literally cannot see tools it doesn't have access to
This means a sandboxed research agent physically cannot call exec or write_file, regardless of what its prompt says. The tools don't exist in its schema.
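The two-level filtering can be sketched as follows. This Python sketch makes assumptions about shapes: the preset contents come from the table above, but the function names and dictionary layout are illustrative, not the actual `checkToolPolicy()`/`tool_filter` API.

```python
# Illustrative presets; the Sandbox tool list matches the table above.
PRESETS = {
    "sandbox": {"mode": "allowlist",
                "tools": {"web_search", "web_read", "memory_store",
                          "memory_search", "self_info"}},
    "unrestricted": {"mode": "unrestricted", "tools": None},
}

def build_tool_filter(preset_name: str, all_tools: set[str]) -> set[str]:
    """Frontend side: compute the allowed-tool list the request will carry."""
    preset = PRESETS[preset_name]
    if preset["mode"] == "unrestricted":
        return set(all_tools)
    return all_tools & preset["tools"]   # strip everything not allowlisted

def visible_schema(all_tools: set[str], tool_filter: set[str]) -> set[str]:
    """Engine side: the agent's schema only ever contains filtered tools."""
    return {t for t in all_tools if t in tool_filter}
```

The key property is that the backend never advertises a stripped tool, so no prompt can re-enable it.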
Memory encryption: three independent defense layers
Project Engram — the memory system — applies defense-in-depth to all stored agent memories (episodic, semantic, and procedural). Even if an attacker gains access to the SQLite database file, the data remains protected.
Layer 1: Per-agent HKDF key derivation
A single master key lives in the OS keychain (paw-memory-vault). From it, three independent key families are derived via HKDF-SHA256 domain separation:
| Domain | HKDF Salt | Purpose |
|---|---|---|
| Agent encryption | `engram-agent-key-v1` | Per-agent AES-256-GCM memory encryption |
| Snapshot HMAC | `engram-snapshot-hmac-v1` | Tamper detection for working memory snapshots |
| Capability signing | `engram-platform-cap-v1` | HMAC-SHA256 signing of capability tokens |
Every agent gets a unique derived key. Cross-agent decryption is mathematically impossible without the master key. Compromising one agent's derived key does not expose any other agent's memories.
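HKDF domain separation is a standard construction (RFC 5869), and a minimal stdlib sketch shows why derived keys are independent. The salts below come from the table above; how the per-agent `info` parameter is built (here, `agent:<id>`) is an assumption.

```python
import hmac, hashlib

def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """Minimal RFC 5869 HKDF-SHA256 (extract-then-expand), stdlib only."""
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()   # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                                    # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

master = b"master-key-from-os-keychain"   # stand-in for the paw-memory-vault key
key_agent_a = hkdf_sha256(master, b"engram-agent-key-v1", b"agent:a")
key_agent_b = hkdf_sha256(master, b"engram-agent-key-v1", b"agent:b")
key_snapshot = hkdf_sha256(master, b"engram-snapshot-hmac-v1", b"agent:a")
```

Different `info` values yield unrelated agent keys, and different salts yield unrelated key families, so neither leaks anything about the other without the master key.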
Layer 2: SQL scope filtering
Every memory query includes scope constraints at the SQL level — agent_id, project_id, squad_id. Even without encryption, the query layer enforces isolation.
Layer 3: Signed capability tokens
Every gated_search() call (the unified memory retrieval entry point) performs 4-step cryptographic verification:
1. HMAC signature integrity — token verified against the platform signing key
2. Identity binding — the token's `agent_id` must match the requesting agent
3. Scope ceiling check — requested search scope cannot exceed the token's `max_scope`
4. Membership verification — for squad/project scopes, the agent must actually belong to that squad or project
This prevents confused-deputy attacks where an agent could be tricked into reading another agent's memories.
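The four checks can be sketched end to end. This is an illustrative Python model, not the engine's code: the token fields and the membership representation (a plain set of scope names) are simplifying assumptions.

```python
import hmac, hashlib
from dataclasses import dataclass

SCOPE_RANK = {"targeted": 1, "squad": 2, "project": 3, "global": 4}
PLATFORM_KEY = b"platform-signing-key"   # stand-in for the derived capability key

@dataclass
class CapabilityToken:
    agent_id: str
    max_scope: str
    signature: bytes

def sign(agent_id: str, max_scope: str) -> bytes:
    return hmac.new(PLATFORM_KEY, f"{agent_id}|{max_scope}".encode(),
                    hashlib.sha256).digest()

def gated_search_check(token, requester_id, requested_scope, memberships) -> bool:
    # 1. HMAC signature integrity (constant-time comparison)
    if not hmac.compare_digest(token.signature, sign(token.agent_id, token.max_scope)):
        return False
    # 2. Identity binding: a stolen token is useless to another agent
    if token.agent_id != requester_id:
        return False
    # 3. Scope ceiling: requested scope cannot exceed max_scope
    if SCOPE_RANK[requested_scope] > SCOPE_RANK[token.max_scope]:
        return False
    # 4. Membership verification for squad/project scopes
    if requested_scope in ("squad", "project") and requested_scope not in memberships:
        return False
    return True
```

Step 2 is what defeats the confused deputy: even a validly signed token only works for the agent it names.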
Automatic PII detection and field-level encryption
Before any memory is stored, it passes through a two-layer PII scanner with 17 regex pattern types:
Layer 1 (regex patterns): Social Security Numbers, credit card numbers, email addresses, phone numbers, physical addresses, person names, government IDs, JWT tokens, AWS access keys, private keys, IBANs, IPv4 addresses, API keys, passwords, and dates of birth.
Layer 2 (LLM-assisted): A secondary scanner catches context-dependent PII that static regex cannot detect — phrases like "my mother's maiden name is Smith" or "I was born in Springfield." The LLM returns structured JSON with PII type classifications and confidence scores.
Content is classified into three tiers:
| Tier | Content | Treatment |
|---|---|---|
| Cleartext | No PII detected | Stored as-is |
| Sensitive | PII detected (email, name, phone, IP) | AES-256-GCM encrypted |
| Confidential | High-sensitivity PII (SSN, credit card, JWT, AWS key, private key) | AES-256-GCM encrypted |
Encrypted content uses the format `enc:v1:base64(nonce ‖ ciphertext ‖ tag)`. A fresh 96-bit nonce is generated per encryption operation. Decryption is transparent on retrieval using the per-agent derived key.
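The wire format itself is easy to model. The sketch below frames and unframes the `enc:v1` layout with the AES-256-GCM sizes named above (96-bit nonce, 128-bit tag); the cipher step is deliberately omitted since the Python stdlib has no AES-GCM, so treat this as a format illustration only.

```python
import base64, os

def frame(nonce: bytes, ciphertext: bytes, tag: bytes) -> str:
    """Produce enc:v1:base64(nonce || ciphertext || tag)."""
    assert len(nonce) == 12 and len(tag) == 16   # 96-bit nonce, 128-bit GCM tag
    return "enc:v1:" + base64.b64encode(nonce + ciphertext + tag).decode()

def unframe(blob: str) -> tuple[bytes, bytes, bytes]:
    """Split a framed value back into (nonce, ciphertext, tag)."""
    prefix, version, payload = blob.split(":", 2)
    assert (prefix, version) == ("enc", "v1")
    raw = base64.b64decode(payload)
    return raw[:12], raw[12:-16], raw[-16:]
```

Versioning the prefix (`v1`) leaves room to rotate the format later without guessing how old rows were encoded.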
Key rotation
An automated key rotation scheduler runs on a configurable interval (default: 90 days) and re-encrypts all agent memories with fresh HKDF-derived keys. The rotation is atomic — if any re-encryption fails, the entire batch rolls back. No data is left in a half-migrated state.
Inter-agent memory bus: scoped, signed, rate-limited
When multiple agents need to share information, the Memory Bus provides pub/sub memory sharing with publish-side authentication to prevent memory poisoning.
Capability tokens
Every agent holds an AgentCapability signed with HMAC-SHA256 against a platform-held secret key. The token specifies:
- Max publication scope — Targeted (specific agents), Squad, Project, or Global
- Importance ceiling — the maximum importance an agent can self-assign (0.0–1.0)
- Write permission — whether the agent can publish at all
- Rate limit — maximum publications per consolidation cycle
The scope hierarchy is a strict linear lattice:
Targeted (rank 1) < Squad (rank 2) < Project (rank 3) < Global (rank 4)
An agent with max_scope = Squad can publish to targeted agents or its squad, but cannot publish to the project or global scope. Ceiling enforcement uses a simple rank comparison — no ambiguity, no escalation path.
Trust-weighted contradiction resolution
When two agents publish contradictory facts on the same topic, the system resolves it based on:
`effective_importance = raw_importance × agent_trust_score`
The memory with the higher effective importance is retained. Trust scores are per-agent (0.0–1.0) and adjustable at runtime. This prevents a compromised or low-trust agent from overwriting facts established by high-trust agents through recency alone.
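A toy resolution pass makes the formula concrete. Field names and the example values here are illustrative assumptions, not Memory Bus internals.

```python
# Pick the memory with the highest raw_importance * agent_trust_score.
def resolve(candidates: list[dict], trust: dict[str, float]) -> dict:
    return max(candidates, key=lambda m: m["importance"] * trust[m["agent_id"]])

memories = [
    {"agent_id": "intern",  "fact": "server is up",   "importance": 0.9},
    {"agent_id": "monitor", "fact": "server is down", "importance": 0.6},
]
trust = {"intern": 0.3, "monitor": 0.9}
winner = resolve(memories, trust)
```

Even though the low-trust agent claimed maximum importance, 0.9 × 0.3 = 0.27 loses to 0.6 × 0.9 = 0.54, so the high-trust fact is retained.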
Publish-side defenses
| Defense | Detail |
|---|---|
| Scope enforcement | Publication scope clamped to agent's maximum |
| Importance ceiling | Publication importance clamped to agent's ceiling |
| Per-agent rate limiting | Publish count tracked per GC window; exceeded limits return an error |
| Injection scanning | All publication content scanned for prompt injection patterns before entering the bus |
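The first three defenses in the table reduce to clamping and counting, sketched below. The capability dictionary layout and error messages are assumptions; only the clamping semantics come from the text above.

```python
SCOPE_RANK = {"targeted": 1, "squad": 2, "project": 3, "global": 4}

def clamp_publication(cap: dict, requested_scope: str,
                      requested_importance: float, publish_count: int):
    """Apply publish-side defenses: write permission, rate limit,
    scope clamp, and importance-ceiling clamp."""
    if not cap["can_write"]:
        raise PermissionError("agent has no write permission")
    if publish_count >= cap["rate_limit"]:
        raise PermissionError("rate limit exceeded for this GC window")
    scope = requested_scope
    if SCOPE_RANK[scope] > SCOPE_RANK[cap["max_scope"]]:
        scope = cap["max_scope"]            # clamp to the agent's ceiling
    importance = min(requested_importance, cap["importance_ceiling"])
    return scope, importance
```

Clamping (rather than rejecting) means an over-ambitious publish still lands, just within the agent's authority.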
Threat model
| Attack | Mitigation |
|---|---|
| Agent floods bus with poisoned memories | Rate limit + injection scan on publish |
| Low-trust agent overwrites high-trust facts | Trust-weighted contradiction resolution |
| Agent publishes beyond its authority | Scope ceiling enforcement |
| Forged capability token | HMAC-SHA256 verification against platform secret |
| Cross-agent memory reads via confused deputy | Signed read-path tokens with identity binding + membership verification |
Multi-agent orchestration: delegation with guardrails
OpenPawz supports three distinct agent-to-agent communication patterns, each with its own security model:
1. Orchestrator projects (boss/worker hierarchy)
A boss agent receives a project goal and team roster, then delegates tasks to worker agents:
| Control | Detail |
|---|---|
| Per-agent capabilities filter | Each sub-agent gets a capabilities list restricting which tools it can access — tools not on the list are physically removed from the agent's schema |
| HIL on exfiltration tools | `email_send`, `slack_send`, `webhook_send`, `rest_api_call`, `exec`, `write_file`, `delete_file` always require user approval — even under orchestrator delegation |
| Max tool rounds | Global cap (default 20) bounds every agent loop |
| Max concurrent runs | Default 4 simultaneous agent runs across the entire engine |
| Worker exit conditions | Workers stop on report_progress(done), max tool rounds, or error — they cannot run indefinitely |
2. The Foreman Protocol (architect/worker split)
For MCP tool execution, the Foreman Protocol splits agent work into two roles:
- Architect (cloud LLM): Plans and reasons — decides what to do
- Foreman (local/cheap model): Executes how — handles MCP tool calls
Critical security constraints:
- No recursion — the Foreman cannot spawn sub-workers or delegate further
- 8-round cap — max 8 tool call rounds per delegation
- Direct MCP execution — Foreman calls MCP servers via JSON-RPC directly
3. Squads (peer-to-peer collaboration)
Flat peer groups with channel-based messaging. No boss/worker hierarchy, but scoped by squad membership.
4. Direct agent messaging
Any agent can message any other agent via the agent_send_message tool. Broadcast messages are visible to all agents. Channel-based filtering available.
Anti-forensic protections
The memory store mitigates vault-size oracle attacks — a side-channel where an attacker infers how many memories are stored by watching the SQLite file size. This is the same threat class addressed by KDBX (KeePass) inner-content padding.
| Mitigation | Detail |
|---|---|
| Bucket padding | Database padded to 512KB boundaries via padding table — an observer can only determine a coarse size bucket, not exact memory count |
| Secure erasure | Two-phase delete: content fields overwritten with empty values, then row deleted — prevents plaintext recovery from freed pages or WAL replay |
| 8KB page size | `PRAGMA page_size = 8192` reduces file-size measurement granularity |
| Secure delete | `PRAGMA secure_delete = ON` zeroes freed B-tree pages at the SQLite layer |
| Incremental auto-vacuum | Prevents immediate file-size shrinkage after deletions (which would reveal deletion count) |
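The bucket padding in the first row is a one-line ceiling computation. This sketch shows only the arithmetic; how the padding table actually grows the file is an implementation detail not modeled here.

```python
BUCKET = 512 * 1024   # 512 KB padding boundary

def padded_size(actual_bytes: int) -> int:
    """Round the vault file up to the next 512 KB bucket, so an observer
    learns only a coarse size class, never the exact memory count."""
    return -(-actual_bytes // BUCKET) * BUCKET   # ceiling division
```

Every file between 1 byte and 512 KB looks identical from the outside, which is the same idea as KDBX inner-content padding.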
Working memory snapshot integrity
Snapshots of an agent's working memory (saved on agent switch or session end) include an HMAC-SHA256 integrity tag computed from a dedicated HKDF-derived key. On restore, the HMAC is verified — tampered snapshots are rejected and logged.
Credential security
No cryptographic key is ever stored on the filesystem. Everything lives in the OS keychain:
| Key | Keychain Entry | Purpose |
|---|---|---|
| DB encryption key | `paw-db-encryption` | AES-256-GCM database field encryption |
| Skill vault key | `paw-skill-vault` | AES-256-GCM skill credential encryption |
| Memory vault key | `paw-memory-vault` | Master key for HKDF per-agent memory encryption |
| Lock screen hash | `paw-lock-screen` | SHA-256 hashed passphrase |
There is no device.json, no key file, and no config file containing secrets. If the OS keychain is unavailable, the app refuses to store credentials rather than falling back to plaintext. No silent degradation.
API key zeroing in memory
API keys in provider structs are wrapped in Zeroizing<String> from the zeroize crate. When a provider is dropped, the key memory is immediately zeroed using write_volatile — preventing:
- Memory dump attacks (forensic tools scanning process memory)
- Swap file leaks (unencrypted keys persisted to disk via OS paging)
- Use-after-free (freed memory still containing the key being reallocated)
Credential audit trail
Every credential access is logged to credential_activity_log with action, requesting tool, allow/deny decision, and timestamp.
TLS certificate pinning
All AI provider connections use a certificate-pinned TLS configuration via rustls. The OS trust store is explicitly excluded.
| Property | Detail |
|---|---|
| Library | `rustls` 0.23 (pure-Rust, no OpenSSL) |
| Root store | Mozilla root certificates via webpki-roots only |
| OS trust store | Explicitly excluded — system CAs are never consulted |
| Connect timeout | 10 seconds |
| Request timeout | 120 seconds |
Why this matters: most TLS MITM attacks rely on installing a custom root CA on the victim's machine (corporate proxies, malware, government surveillance). By pinning to Mozilla's root store, OpenPawz rejects certificates signed by any non-Mozilla CA, even if the OS trusts it.
Outbound request signing
Every AI provider request is SHA-256 signed before transmission:
SHA-256(provider ‖ model ‖ ISO-8601 timestamp ‖ request body)
Hashes are logged to an in-memory ring buffer (500 entries) for tamper detection and compliance auditing. If a proxy modifies the request body in transit, the recorded hash won't match.
Prompt injection defense
Dual-implementation scanning (TypeScript + Rust) for 30+ injection patterns across 9 categories:
| Category | Examples |
|---|---|
| Override | "Ignore previous instructions" |
| Identity | "You are now..." |
| Jailbreak | "DAN mode", "no restrictions" |
| Leaking | "Show me your system prompt" |
| Obfuscation | Base64-encoded instructions |
| Tool injection | Fake tool call formatting |
| Social engineering | "As an AI researcher..." |
| Markup | Hidden instructions in HTML/markdown |
| Bypass | "This is just a test..." |
Messages scoring Critical (40+) are blocked entirely and never delivered to the agent. Channel bridges automatically enforce this.
Memory-side injection scanning
Recalled memories are scanned for 10 injection patterns before being returned to agent context. Suspicious content is redacted with [REDACTED:injection] markers — poisoned memories cannot manipulate future agent behavior.
Anti-fixation defenses
Five layers prevent agents from ignoring user instructions or getting stuck:
| Defense | What it does |
|---|---|
| Response loop detection | Jaccard similarity checks catch the agent repeating itself — active on ALL channels |
| User override detection | Recognizes "stop", "focus on my question", "that's not what I asked" across 5 phrase categories with 3-level escalation |
| Unidirectional topic ignorance | Catches unique-but-wrong responses after a redirect — fires when the agent's response has zero entity overlap with the user's keywords |
| Momentum clearing | Clears working memory trajectory embeddings on user override — recalled context serves the new topic, not the old one |
| Tool-call loop breaker | Hash-based signature detection stops repeated identical tool calls after 3 consecutive matches |
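Jaccard-based loop detection, the first defense in the table, compares word sets between responses. This is a minimal sketch: the 0.8 threshold, the 3-message window, and whole-word tokenization are illustrative assumptions, not the engine's tuned values.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def is_looping(history: list[str], new_response: str,
               threshold: float = 0.8) -> bool:
    """Flag a response too similar to any of the last three responses."""
    return any(jaccard(prev, new_response) >= threshold
               for prev in history[-3:])
```

A near-verbatim repeat scores close to 1.0 and trips the detector; a genuinely new answer shares few words and passes.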
Filesystem sandboxing
Sensitive path blocking
20+ sensitive paths are permanently blocked from agent access:
~/.ssh · ~/.gnupg · ~/.aws · ~/.kube · ~/.docker · ~/.password-store · /etc · /root · /proc · /sys · /dev · filesystem root · home directory root
Per-project scope
When a project is active, all file operations are constrained to the project root. Directory traversal sequences (../) are detected and blocked. Violations are logged to the security audit.
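Both checks (sensitive-path blocklist and per-project scope) hinge on resolving the path before comparing, so `../` sequences cannot escape. The blocklist subset and function shape below are illustrative, not the engine's Rust implementation.

```python
from pathlib import Path

# Illustrative subset of the blocked sensitive paths listed above.
BLOCKED = [Path.home() / ".ssh", Path.home() / ".aws", Path("/etc"), Path("/proc")]

def is_allowed(requested: str, project_root: str) -> bool:
    """Resolve symlinks and '..' first, then enforce both the
    sensitive-path blocklist and the per-project scope."""
    path = Path(requested).resolve()        # neutralises ../ traversal
    root = Path(project_root).resolve()
    if any(path == b or b in path.parents for b in BLOCKED):
        return False                        # sensitive path: always denied
    return path == root or root in path.parents   # must stay under project root
```

Resolving before comparing is the whole trick: `/tmp/proj/../../etc/passwd` normalises to `/etc/passwd` and is rejected on both grounds.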
Source code introspection block
Agents cannot read their own engine source files — any read_file call targeting paths containing src-tauri/src/engine/ or files ending in .rs is rejected. This prevents agents from discovering internal security mechanisms.
Container sandbox
Docker-based execution isolation via the bollard crate:
| Measure | Default |
|---|---|
| Capabilities | cap_drop ALL |
| Network | Disabled |
| Memory limit | 256 MB |
| CPU shares | 512 |
| Timeout | 30 seconds |
| Output limit | 50 KB |
Four presets: Minimal (alpine, 128MB, no network), Development (node:20-alpine, 512MB), Python (python:3.12-alpine, 512MB), Restricted (alpine, 64MB, 10s timeout).
GDPR Article 17 — Right to erasure
The engine_memory_purge_user command performs complete data erasure for a user:
- All memory content rows deleted
- All vector embeddings deleted
- Search index entries removed
- Graph edges removed
- Padding table repacked to prevent file-size leakage
- `PRAGMA secure_delete` ensures freed pages are zeroed
- Returns a count of erased records for compliance reporting
The twelve layers at a glance
| # | Layer | What it protects against |
|---|---|---|
| 1 | Zero open ports | Remote network attacks |
| 2 | Human-in-the-Loop | Unauthorized side-effects |
| 3 | Agent policies | Over-privileged agents |
| 4 | Per-agent HKDF encryption | Cross-agent data access |
| 5 | PII detection + field encryption | Data exposure at rest |
| 6 | Signed capability tokens | Scope escalation, confused deputy attacks |
| 7 | Trust-weighted memory bus | Memory poisoning between agents |
| 8 | TLS certificate pinning | MITM on provider connections |
| 9 | Prompt injection scanning | Prompt manipulation (inbound + recalled) |
| 10 | Anti-fixation defenses | Agent ignoring user instructions |
| 11 | Filesystem sandboxing | Credential theft, path traversal |
| 12 | Anti-forensic vault padding | File-size side-channel leakage |
Read the full security docs
The complete security reference — including risk classification tables, allowlist/denylist patterns, and every configuration option — lives in the repo:
- SECURITY.md — Security overview and threat model
- Security Reference — Full technical reference
- ENGRAM.md — Memory architecture whitepaper
If you find a vulnerability, please report it responsibly via the contact information in the repo rather than opening a public issue.
Star the repo if you want to track progress. 🙏
