<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gotham64</title>
    <description>The latest articles on DEV Community by Gotham64 (@gotham64).</description>
    <link>https://dev.to/gotham64</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3795880%2F70da0bd9-e176-49a3-a167-a6ec577108ed.jpg</url>
      <title>DEV Community: Gotham64</title>
      <link>https://dev.to/gotham64</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gotham64"/>
    <language>en</language>
    <item>
      <title>Benchmarks, Zero Guesswork: Why OpenPawz measures every hot path in the AI engine</title>
      <dc:creator>Gotham64</dc:creator>
      <pubDate>Wed, 18 Mar 2026 06:27:47 +0000</pubDate>
      <link>https://dev.to/gotham64/benchmarks-zero-guesswork-why-openpawz-measures-every-hot-path-in-the-ai-engine-2f48</link>
      <guid>https://dev.to/gotham64/benchmarks-zero-guesswork-why-openpawz-measures-every-hot-path-in-the-ai-engine-2f48</guid>
      <description>&lt;h2&gt;
  
  
  The performance problem nobody measures
&lt;/h2&gt;

&lt;p&gt;Every AI agent platform talks about speed. Fast responses. Low latency. Real-time agents.&lt;/p&gt;

&lt;p&gt;But ask a simple question — &lt;em&gt;how long does it take to create a session? Search memory? Encrypt a credential? Scan for prompt injection?&lt;/em&gt; — and you get silence. No numbers. No baselines. No way to tell if the last update made things faster or slower.&lt;/p&gt;

&lt;p&gt;This matters more than most teams realize:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;What goes wrong&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;A refactor ships without benchmarks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Session creation silently doubles from 20µs to 40µs. Nobody notices — until 10,000 users do.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory search "feels slow"&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Is it the embedding model? The vector index? The SQLite query? Without measurements, you're guessing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;A dependency update lands&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Did the new &lt;code&gt;rusqlite&lt;/code&gt; version change query performance? Did the &lt;code&gt;aes-gcm&lt;/code&gt; update affect encrypt/decrypt throughput?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;You scale up&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50 sessions work fine. 500 sessions work fine. 5,000 sessions? You have no idea where the cliff is.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The problem isn't that platforms are slow. It's that &lt;strong&gt;nobody is measuring&lt;/strong&gt;, so nobody knows. Performance regressions are invisible until they become user complaints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenPawz&lt;/strong&gt; runs &lt;strong&gt;140+ benchmarks&lt;/strong&gt; across &lt;strong&gt;8 dedicated suites&lt;/strong&gt; on every critical path in the engine. Not integration tests pretending to check performance. Real statistical benchmarks with variance analysis, regression detection, and historical comparison.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openpawz.ai" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffg59rxvva9kct0mcfdx7.png" alt="OpenPawz"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OpenPawz/openpawz" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Star the repo — it's open source&lt;/a&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  What gets measured and why
&lt;/h2&gt;

&lt;p&gt;The benchmark suite isn't a token gesture. It covers every layer of the engine — from the operations users trigger directly to the internal machinery that makes those operations possible.&lt;/p&gt;

&lt;p&gt;Here's the breakdown by suite:&lt;/p&gt;

&lt;h3&gt;
  
  
  Sessions — the foundation of every conversation
&lt;/h3&gt;

&lt;p&gt;Every interaction with an AI agent starts with a session. Creating one, loading messages, listing history, managing tasks. If these operations are slow, everything built on top of them is slow.&lt;/p&gt;

&lt;p&gt;The session benchmarks measure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Creating sessions and messages&lt;/strong&gt; — the write path users hit on every single interaction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Listing at scale&lt;/strong&gt; — 10 sessions, 100 sessions, 500 sessions. Where does performance degrade?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message depth&lt;/strong&gt; — fetching 50 messages is trivial. Fetching 1,000 with HMAC chain verification? That's where you find bottlenecks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task and agent file I/O&lt;/strong&gt; — the async operations that happen behind every agent action&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why it matters: &lt;strong&gt;session operations are the critical path.&lt;/strong&gt; A user sends a message, the platform creates a message record, verifies the chain, updates the session. If any of those steps is slow, the user perceives the entire agent as slow — even before the LLM has responded.&lt;/p&gt;
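
&lt;p&gt;To make the chain-verification cost concrete, here's a minimal hash-chain sketch in Rust. It uses the standard library's &lt;code&gt;DefaultHasher&lt;/code&gt; as a stand-in for the engine's HMAC-SHA256, and the struct and function names are illustrative, not the actual API:&lt;/p&gt;

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Each message stores a hash of (previous hash, content), forming a chain.
// The real engine uses HMAC-SHA256; DefaultHasher is a std-only stand-in.
struct Message {
    content: String,
    chain_hash: u64,
}

fn link(prev: u64, content: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    content.hash(&mut h);
    h.finish()
}

fn append(log: &mut Vec<Message>, content: &str) {
    let prev = log.last().map_or(0, |m| m.chain_hash);
    log.push(Message {
        chain_hash: link(prev, content),
        content: content.to_string(),
    });
}

// Verifying N messages means recomputing N links from the start of the
// chain -- the cost that grows with message depth.
fn verify(log: &[Message]) -> bool {
    let mut prev = 0u64;
    for m in log {
        if m.chain_hash != link(prev, &m.content) {
            return false;
        }
        prev = m.chain_hash;
    }
    true
}
```

&lt;p&gt;Fetching a 1,000-message session means recomputing 1,000 links, which is exactly why the benchmarks measure fetch-with-verification at depth rather than just raw reads.&lt;/p&gt;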




&lt;h3&gt;
  
  
  Memory — search has to be instant
&lt;/h3&gt;

&lt;p&gt;OpenPawz uses a hybrid memory system: BM25 for keyword search, HNSW vectors for semantic search, and a deduplication layer to prevent memory bloat. Each of these has radically different performance characteristics.&lt;/p&gt;

&lt;p&gt;The memory benchmarks test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BM25 search&lt;/strong&gt; at different corpus sizes — how does keyword search scale from 100 to 2,000 documents?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HNSW insert and search&lt;/strong&gt; — vector indexing is notoriously sensitive to dimensionality and dataset size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content overlap detection&lt;/strong&gt; — the dedup engine that decides whether a new memory is actually new&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brute-force vs. HNSW comparison&lt;/strong&gt; — at what dataset size does the approximate index beat linear scan?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why it matters: &lt;strong&gt;memory search happens on every agent turn.&lt;/strong&gt; The agent checks what it knows before responding. If memory retrieval adds 50ms instead of 5ms, that's 50ms per turn, per user, compounding across every conversation.&lt;/p&gt;
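
&lt;p&gt;For intuition about what the BM25 benchmarks are timing, here's the standard BM25 term-scoring formula as a standalone Rust function. This is textbook BM25 with conventional parameters, not OpenPawz's implementation:&lt;/p&gt;

```rust
// Textbook BM25 score for a single query term against a single document.
// k1 and b are the conventional defaults.
fn bm25_score(tf: f64, df: f64, n_docs: f64, doc_len: f64, avg_len: f64) -> f64 {
    let (k1, b) = (1.2, 0.75);
    // Rarer terms (small df) get a larger inverse-document-frequency weight.
    let idf = ((n_docs - df + 0.5) / (df + 0.5) + 1.0).ln();
    // Term frequency is dampened and normalized by document length.
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avg_len))
}
```

&lt;p&gt;The per-term arithmetic is cheap; the cost the benchmarks track comes from evaluating it across every candidate document, which is why corpus size (100 vs. 2,000) is the variable under test.&lt;/p&gt;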




&lt;h3&gt;
  
  
  Engram — the cognitive layer
&lt;/h3&gt;

&lt;p&gt;Engram is the knowledge graph that sits on top of raw memory. Entities, relationships, propositions. It powers the agent's ability to reason about what it knows rather than just recall it.&lt;/p&gt;

&lt;p&gt;The benchmarks cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entity and edge upserts&lt;/strong&gt; — how fast can the knowledge graph absorb new information?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subgraph queries&lt;/strong&gt; — retrieving all edges connected to an entity, at varying graph sizes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proposition decomposition&lt;/strong&gt; — breaking complex statements into atomic facts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory fusion&lt;/strong&gt; — merging overlapping memories into coherent summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SCC certificate hashing&lt;/strong&gt; — the capability system that validates tool access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why it matters: &lt;strong&gt;graph operations compound.&lt;/strong&gt; An agent processing a long conversation might upsert dozens of entities and edges per turn. If each upsert takes 100µs instead of 10µs, you've added milliseconds of invisible overhead that stacks up fast.&lt;/p&gt;
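
&lt;p&gt;As a sketch of the upsert-and-query pattern being benchmarked (with illustrative names and shapes, not the actual Engram API), a minimal in-memory graph looks like this:&lt;/p&gt;

```rust
use std::collections::HashMap;

// A toy knowledge graph: entities by id, plus an adjacency list of edges.
#[derive(Default)]
struct Graph {
    entities: HashMap<String, String>,             // id -> label
    edges: HashMap<String, Vec<(String, String)>>, // id -> (relation, target)
}

impl Graph {
    // Upsert: insert or overwrite. The benchmarks time exactly this path
    // at growing graph sizes, because agents call it dozens of times per turn.
    fn upsert_entity(&mut self, id: &str, label: &str) {
        self.entities.insert(id.to_string(), label.to_string());
    }

    fn upsert_edge(&mut self, from: &str, rel: &str, to: &str) {
        let list = self.edges.entry(from.to_string()).or_default();
        let pair = (rel.to_string(), to.to_string());
        if !list.contains(&pair) {
            list.push(pair); // duplicate edges are a no-op
        }
    }

    // Subgraph query: all edges leaving an entity.
    fn neighbors(&self, id: &str) -> Vec<(String, String)> {
        self.edges.get(id).cloned().unwrap_or_default()
    }
}
```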




&lt;h3&gt;
  
  
  Security — crypto can't be the bottleneck
&lt;/h3&gt;

&lt;p&gt;The security suite benchmarks the operations that protect user data: AES-256-GCM encryption, key derivation, PII detection, and injection scanning.&lt;/p&gt;

&lt;p&gt;What gets measured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encrypt and decrypt&lt;/strong&gt; at different payload sizes — 64 bytes, 1 KB, 64 KB, 1 MB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key derivation (Argon2)&lt;/strong&gt; — the intentionally slow operation that protects master keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PII detection&lt;/strong&gt; — scanning messages for emails, phone numbers, SSNs before they reach the LLM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Injection scanning&lt;/strong&gt; — detecting prompt injection attempts in user input&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why it matters: &lt;strong&gt;security operations run on every message.&lt;/strong&gt; PII detection scans every outbound message. Injection scanning checks every inbound message. If either of these adds perceptible latency, teams are tempted to disable them. Benchmarks ensure they stay fast enough that there's never a reason to turn them off.&lt;/p&gt;
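
&lt;p&gt;To show why scan cost is worth benchmarking, here's a toy detector for one PII shape (a US SSN, &lt;code&gt;ddd-dd-dddd&lt;/code&gt;). The real engine uses a battery of regex patterns; this std-only sketch just illustrates that each pattern is a linear pass over every message:&lt;/p&gt;

```rust
// Toy SSN detector: slide an 11-character window over the text and check
// for the digit/dash shape ddd-dd-dddd. Illustrative only -- not the
// engine's actual PII patterns.
fn contains_ssn(text: &str) -> bool {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(11).any(|w| {
        w.iter().enumerate().all(|(i, c)| match i {
            3 | 6 => *c == '-',          // dashes at fixed positions
            _ => c.is_ascii_digit(),     // digits everywhere else
        })
    })
}
```

&lt;p&gt;Multiply that pass by every pattern and every message, and a few microseconds per scan is the difference between "always on" and "tempting to disable."&lt;/p&gt;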




&lt;h3&gt;
  
  
  Audit — compliance at zero cost
&lt;/h3&gt;

&lt;p&gt;Every operation in OpenPawz generates an audit trail. The audit benchmarks ensure that logging doesn't slow down the operations being logged.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Append events&lt;/strong&gt; — how fast can audit records be written?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query by time range&lt;/strong&gt; — retrieving audit history for a specific period&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query by event type&lt;/strong&gt; — filtering for specific operation categories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why it matters: &lt;strong&gt;audit logging is fire-and-forget.&lt;/strong&gt; If appending an audit record takes longer than the operation it's recording, the tail is wagging the dog. Benchmarks keep audit overhead invisible.&lt;/p&gt;




&lt;h3&gt;
  
  
  Reasoning — model-aware pricing and routing
&lt;/h3&gt;

&lt;p&gt;The reasoning benchmarks cover the pricing engine and cost calculations that determine which model handles which request.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Price-per-token lookups&lt;/strong&gt; across all supported models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost calculations&lt;/strong&gt; for conversations of varying length&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model registry operations&lt;/strong&gt; — looking up capabilities, context windows, routing metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why it matters: &lt;strong&gt;routing decisions happen before every LLM call.&lt;/strong&gt; The engine evaluates which model to use, what it will cost, and whether budget constraints allow it. These lookups need to be sub-microsecond so they never delay the actual inference call.&lt;/p&gt;
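
&lt;p&gt;The shape of that hot path is just a table lookup plus a little arithmetic. A sketch, with hypothetical per-million-token prices rather than OpenPawz's actual registry:&lt;/p&gt;

```rust
use std::collections::HashMap;

// Cost of a conversation: look up the model's (input, output) prices in
// dollars per million tokens, then multiply. The prices and the function
// shape are illustrative, not the engine's real pricing API.
fn conversation_cost(
    prices: &HashMap<&str, (f64, f64)>,
    model: &str,
    input_tokens: u64,
    output_tokens: u64,
) -> Option<f64> {
    let (pin, pout) = prices.get(model)?;
    Some(input_tokens as f64 / 1e6 * pin + output_tokens as f64 / 1e6 * pout)
}
```

&lt;p&gt;A &lt;code&gt;HashMap&lt;/code&gt; lookup and two multiplications sit comfortably in the sub-microsecond range the article calls for; the benchmarks exist to keep it that way as the registry grows.&lt;/p&gt;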




&lt;h3&gt;
  
  
  Platform — the connective tissue
&lt;/h3&gt;

&lt;p&gt;Config, flows, squads, canvas, projects, telemetry. These are the platform features that tie everything together. Individually they seem simple. Collectively, they define whether the platform feels snappy or sluggish.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Config read/write&lt;/strong&gt; — key-value settings the engine checks constantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flow operations&lt;/strong&gt; — saving, loading, listing workflow graphs at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Squad management&lt;/strong&gt; — creating teams of agents, checking membership&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Canvas components&lt;/strong&gt; — the visual workspace that agents and users share&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project management&lt;/strong&gt; — grouping agents, sessions, and resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telemetry recording&lt;/strong&gt; — performance data collection that must not affect performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why it matters: &lt;strong&gt;platform operations are invisible until they're slow.&lt;/strong&gt; Nobody notices config lookups that take 2µs. Everyone notices when they take 200µs and the settings panel lags.&lt;/p&gt;




&lt;h2&gt;
  
  
  The tooling: Criterion.rs and statistical rigor
&lt;/h2&gt;

&lt;p&gt;OpenPawz doesn't use hand-rolled timing loops or &lt;code&gt;Instant::now()&lt;/code&gt; wrappers. The entire suite runs on &lt;a href="https://bheisler.github.io/criterion.rs/book/" rel="noopener noreferrer"&gt;Criterion.rs&lt;/a&gt; — the de facto standard statistical benchmarking framework in the Rust ecosystem.&lt;/p&gt;

&lt;p&gt;What Criterion provides that ad-hoc timing doesn't:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Warm-up phase&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Eliminates cold-cache artifacts from results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Statistical sampling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs each benchmark enough times to calculate confidence intervals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Regression detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compares against the last run and flags performance changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Outlier classification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Identifies and categorizes anomalous measurements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTML reports&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visual charts showing distribution, comparison, and trend data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every benchmark run produces a &lt;code&gt;target/criterion/&lt;/code&gt; directory with HTML reports you can open in a browser. You see exactly how performance changed, not just a single number.&lt;/p&gt;
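
&lt;p&gt;For readers who haven't used Criterion, here's a stripped-down sketch of the warm-up-then-sample loop at its core, in plain std Rust. Real Criterion layers bootstrapped confidence intervals, outlier classification, and result storage on top of this:&lt;/p&gt;

```rust
use std::hint::black_box;
use std::time::Instant;

// Warm up, then time the operation many times and report mean and standard
// deviation -- a distribution instead of a single misleading number.
fn sample<F: FnMut()>(mut op: F, warmup: usize, samples: usize) -> (f64, f64) {
    for _ in 0..warmup {
        op(); // warm caches, branch predictors, allocator pools
    }
    let times: Vec<f64> = (0..samples)
        .map(|_| {
            let start = Instant::now();
            op();
            start.elapsed().as_secs_f64()
        })
        .collect();
    let mean = times.iter().sum::<f64>() / samples as f64;
    let var = times.iter().map(|t| (t - mean).powi(2)).sum::<f64>() / samples as f64;
    (mean, var.sqrt())
}
```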




&lt;h2&gt;
  
  
  What makes a good benchmark suite
&lt;/h2&gt;

&lt;p&gt;Building 140+ benchmarks taught us a few things about what makes benchmarks actually useful versus benchmarks that just exist to check a box.&lt;/p&gt;

&lt;h3&gt;
  
  
  Measure the real path, not a mock
&lt;/h3&gt;

&lt;p&gt;Every benchmark in the suite creates a real SQLite database, inserts real data, and runs real queries. No mocking the storage layer. No skipping serialization. If the production code path touches SQLite, the benchmark touches SQLite.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test at multiple scales
&lt;/h3&gt;

&lt;p&gt;A single benchmark at one size tells you almost nothing. Memory search at 100 documents? Fast. Memory search at 2,000 documents? Maybe still fast, maybe not. The suite deliberately tests operations at multiple scales — 10, 50, 100, 200, 500, 1000, 2000 — so you see the scaling curve, not just a single point.&lt;/p&gt;

&lt;h3&gt;
  
  
  Separate the hot paths
&lt;/h3&gt;

&lt;p&gt;Not every function deserves a benchmark. The suite focuses on operations that happen per-turn, per-message, or per-session — the hot paths that users experience directly. A one-time migration function that runs on startup? Don't benchmark it. A PII scanner that runs on every outbound message? Absolutely benchmark it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Make regression detection automatic
&lt;/h3&gt;

&lt;p&gt;Criterion stores historical results. Run the benchmarks before and after a change, and you get a clear report: &lt;em&gt;session/create: +3.2%, message/add: -1.1%, memory/bm25_search/1000: +0.4%&lt;/em&gt;. No manual comparison needed. No spreadsheets. The tooling tells you what changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running the suite
&lt;/h2&gt;

&lt;p&gt;The benchmarks live in a dedicated crate — &lt;code&gt;openpawz-bench&lt;/code&gt; — separate from the application code. This keeps benchmark dependencies out of the production binary and gives the suite its own compilation target.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run all benchmarks&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;src-tauri &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cargo bench &lt;span class="nt"&gt;-p&lt;/span&gt; openpawz-bench

&lt;span class="c"&gt;# Run a specific suite&lt;/span&gt;
cargo bench &lt;span class="nt"&gt;-p&lt;/span&gt; openpawz-bench &lt;span class="nt"&gt;--bench&lt;/span&gt; session_bench

&lt;span class="c"&gt;# Run benchmarks matching a pattern&lt;/span&gt;
cargo bench &lt;span class="nt"&gt;-p&lt;/span&gt; openpawz-bench &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="s2"&gt;"memory/bm25"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Results land in &lt;code&gt;target/criterion/&lt;/code&gt; with full HTML reports. Open &lt;code&gt;target/criterion/report/index.html&lt;/code&gt; for an overview of every benchmark, or drill into any individual measurement for distribution charts and regression comparisons.&lt;/p&gt;


&lt;h2&gt;
  
  
  The eight suites at a glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Suite&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;th&gt;Key operations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;session_bench&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sessions, messages, tasks, agent files&lt;/td&gt;
&lt;td&gt;Create, list, fetch at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;platform_bench&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Config, flows, squads, canvas, projects, telemetry&lt;/td&gt;
&lt;td&gt;CRUD at varying DB sizes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;memory_bench&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BM25 search, HNSW indexing, dedup, content overlap&lt;/td&gt;
&lt;td&gt;Search and insert at multiple corpus sizes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;engram_bench&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Knowledge graph — entities, edges, subgraph queries&lt;/td&gt;
&lt;td&gt;Upserts, traversals, graph scaling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;cognitive_bench&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Proposition decomposition, memory fusion, SCC, tool metadata&lt;/td&gt;
&lt;td&gt;Parsing, merging, hashing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;security_bench&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AES-256-GCM, PII detection, injection scanning&lt;/td&gt;
&lt;td&gt;Encrypt/decrypt at varying payloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;audit_bench&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Audit trail append and query&lt;/td&gt;
&lt;td&gt;Write throughput, time-range queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;reasoning_bench&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pricing engine, cost calculations, model registry&lt;/td&gt;
&lt;td&gt;Per-token lookups, conversation costing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;140+ benchmarks. Eight suites. Every hot path in the engine.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part of the engine architecture
&lt;/h2&gt;

&lt;p&gt;The benchmarks aren't a separate project. They're part of the same Cargo workspace as the engine itself:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Crate&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openpawz-core&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The pure Rust engine library — everything the benchmarks test&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openpawz-bench&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Criterion.rs benchmark suite — depends directly on &lt;code&gt;openpawz-core&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openpawz&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tauri desktop app&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openpawz-cli&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Terminal binary&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The benchmarks import &lt;code&gt;openpawz-core&lt;/code&gt; as a library and call the same public API that the desktop app and CLI use. No internal test hooks. No special benchmark-only codepaths. What gets benchmarked is what ships.&lt;/p&gt;

&lt;p&gt;This also means the benchmarks serve as a living compatibility check. If a public API changes, the benchmarks fail to compile. If a function signature changes, the benchmark that calls it catches it immediately.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why this matters for users
&lt;/h2&gt;

&lt;p&gt;You don't need to run these benchmarks yourself (though you're welcome to). They exist so that every release ships with confidence that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Nothing got slower&lt;/strong&gt; — regression detection catches performance changes before they merge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The fast paths stay fast&lt;/strong&gt; — session creation, memory search, encryption, audit logging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale is understood&lt;/strong&gt; — we know where the performance cliffs are, and they're documented&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security isn't sacrificed for speed&lt;/strong&gt; — PII detection and injection scanning stay enabled because they're fast enough to never be a concern&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Performance isn't a feature you add later. It's a property of the codebase that you either measure or you hope for. OpenPawz measures.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone and run the full suite&lt;/span&gt;
git clone https://github.com/OpenPawz/openpawz.git
&lt;span class="nb"&gt;cd &lt;/span&gt;openpawz/src-tauri
cargo bench &lt;span class="nt"&gt;-p&lt;/span&gt; openpawz-bench

&lt;span class="c"&gt;# Open the HTML reports&lt;/span&gt;
open target/criterion/report/index.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Every benchmark runs against a fresh in-memory SQLite database. No external services. No network calls. No setup beyond having Rust installed.&lt;/p&gt;


&lt;h2&gt;
  
  
  Read the full docs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/docs/benchmarks.md" rel="noopener noreferrer"&gt;Benchmark Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/ARCHITECTURE.md" rel="noopener noreferrer"&gt;ARCHITECTURE.md&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star the &lt;a href="https://github.com/OpenPawz/openpawz" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you want to track progress. 🙏&lt;/p&gt;

&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://openpawz.ai/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Fopengraph-image%3Fb0e520dc590f72f0" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://openpawz.ai/" rel="noopener noreferrer" class="c-link"&gt;
            OpenPawz — Your AI, Your Rules
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            A native desktop AI platform that runs fully offline, connects to any provider, and puts you in control. Private by default. Powerful by design.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Ffavicon.ico%3Ffavicon.0b3bf435.ico"&gt;
          openpawz.ai
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;





</description>
      <category>ai</category>
      <category>rust</category>
      <category>performance</category>
      <category>opensource</category>
    </item>
    <item>
      <title>OpenPawz CLI: Your multi-agent AI platform belongs in the terminal</title>
      <dc:creator>Gotham64</dc:creator>
      <pubDate>Wed, 11 Mar 2026 00:21:59 +0000</pubDate>
      <link>https://dev.to/gotham64/openpawz-cli-your-multi-agent-ai-platform-belongs-in-the-terminal-14mm</link>
      <guid>https://dev.to/gotham64/openpawz-cli-your-multi-agent-ai-platform-belongs-in-the-terminal-14mm</guid>
      <description>&lt;h2&gt;
  
  
  The GUI trap
&lt;/h2&gt;

&lt;p&gt;Every AI agent platform ships a GUI. Chat windows, node editors, drag-and-drop flows, settings panels. And, in most cases, only a GUI.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No headless operation.&lt;/strong&gt; You can't run agents on a server without a display.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No scripting.&lt;/strong&gt; Automating agent management requires either a REST API you have to host or brittle UI automation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No CI integration.&lt;/strong&gt; Checking agent status, cleaning up sessions, or validating configuration in a pipeline? Open a browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No composability.&lt;/strong&gt; You can't pipe agent output into &lt;code&gt;jq&lt;/code&gt;, &lt;code&gt;grep&lt;/code&gt;, or another tool. The data is trapped behind a window.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI power users — the people building real workflows, deploying to production, managing dozens of agents — live in the terminal. Forcing them into a GUI for every interaction is a productivity tax they shouldn't have to pay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenPawz&lt;/strong&gt; ships a native Rust CLI that talks directly to the same engine library as the desktop app. No REST API. No network layer. No second-class citizen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openpawz.ai" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffg59rxvva9kct0mcfdx7.png" alt="OpenPawz"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OpenPawz/openpawz" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Star the repo — it's open source&lt;/a&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture: one engine, two interfaces
&lt;/h2&gt;

&lt;p&gt;Most platforms that offer both a GUI and a CLI do it wrong. The CLI is an afterthought — a thin HTTP client that hits the same server the GUI talks to. It adds latency, requires the server to be running, and breaks when the API changes.&lt;/p&gt;

&lt;p&gt;OpenPawz does it differently. The engine is a pure Rust library (&lt;code&gt;openpawz-core&lt;/code&gt;) with zero GUI or framework dependencies. Both the Tauri desktop app and the CLI binary depend on this same library directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────┐      ┌─────────────────────┐
│   openpawz (Tauri)   │      │   openpawz (CLI)    │
│   Desktop GUI app    │      │   Terminal binary   │
└──────────┬───────────┘      └──────────┬──────────┘
           │                             │
           │  use openpawz_core::*       │  use openpawz_core::*
           │                             │
           └──────────┐   ┌──────────────┘
                      │   │
              ┌───────▼───▼───────┐
              │   openpawz-core   │
              │                   │
              │  Sessions (SQLite)│
              │  Memory engine    │
              │  Audit log        │
              │  Key vault        │
              │  Provider registry│
              │  PII detection    │
              │  Crypto (AES-256) │
              └───────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Three crates in a Cargo workspace:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Crate&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Dependencies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openpawz-core&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pure business logic — sessions, memory, audit, crypto, providers&lt;/td&gt;
&lt;td&gt;rusqlite, aes-gcm, keyring, reqwest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openpawz&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tauri desktop app — GUI frontend&lt;/td&gt;
&lt;td&gt;openpawz-core + tauri&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openpawz-cli&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Terminal binary — clap interface&lt;/td&gt;
&lt;td&gt;openpawz-core + clap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The CLI and desktop app compile the &lt;strong&gt;exact same engine code.&lt;/strong&gt; Not a reimplementation. Not an API wrapper. The same Rust functions, the same SQLite database, the same cryptographic stack.&lt;/p&gt;


&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Shared state, zero sync
&lt;/h3&gt;

&lt;p&gt;The CLI and desktop app read and write the same SQLite database:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Data directory&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;macOS&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/Library/Application Support/com.openpawz.app/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linux&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.local/share/com.openpawz.app/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windows&lt;/td&gt;
&lt;td&gt;&lt;code&gt;%APPDATA%\com.openpawz.app\&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Create an agent from the CLI. It appears in the desktop app instantly. Delete a session from the GUI. The CLI sees it's gone. No sync protocol, no eventual consistency, no conflicts — one database, two access paths.&lt;/p&gt;
&lt;h3&gt;
  
  
  Zero network overhead
&lt;/h3&gt;

&lt;p&gt;The CLI calls Rust functions directly — &lt;code&gt;store.list_sessions()&lt;/code&gt;, &lt;code&gt;store.store_memory()&lt;/code&gt;, &lt;code&gt;store.list_all_agents()&lt;/code&gt;. No HTTP server to start. No port to bind. No JSON serialization round-trip between client and server.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This calls openpawz-core directly — no server needed&lt;/span&gt;
openpawz session list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Compare that to every other platform's CLI, which does:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLI → HTTP request → Server → Database → Response → JSON parse → Display
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;OpenPawz:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLI → Database → Display
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
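
&lt;p&gt;That in-process path can be sketched in a few lines. The names here are illustrative, not the actual &lt;code&gt;openpawz-core&lt;/code&gt; API; the point is that the command dispatch is a direct function call, with no socket and no serialization:&lt;/p&gt;

```rust
// A stand-in for the engine's store type -- the CLI holds it in-process.
struct Store {
    sessions: Vec<String>,
}

impl Store {
    fn list_sessions(&self) -> &[String] {
        &self.sessions
    }
}

// Dispatch parsed argv straight to library calls: CLI -> Database -> Display.
fn run(store: &Store, args: &[&str]) -> Result<String, String> {
    match args {
        ["session", "list"] => Ok(store.list_sessions().join("\n")),
        _ => Err(format!("unknown command: {}", args.join(" "))),
    }
}
```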

&lt;h3&gt;
  
  
  Same security guarantees
&lt;/h3&gt;

&lt;p&gt;The CLI inherits the full cryptographic stack from &lt;code&gt;openpawz-core&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AES-256-GCM&lt;/strong&gt; encryption for sensitive data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS keychain&lt;/strong&gt; integration (macOS Keychain, Linux Secret Service, Windows Credential Manager) via the key vault&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HKDF-SHA256&lt;/strong&gt; key derivation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zeroizing&lt;/strong&gt; memory — keys are wiped on drop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HMAC-SHA256&lt;/strong&gt; chained audit log — every operation is tamper-evident&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PII auto-detection&lt;/strong&gt; — 17 regex patterns catch sensitive data before it leaves the system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS CSPRNG&lt;/strong&gt; via &lt;code&gt;getrandom&lt;/code&gt; — no userspace RNG, ever&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you run &lt;code&gt;openpawz setup&lt;/code&gt; and enter an API key, it's stored in your OS keychain with the same protection as the desktop app. Not in a dotfile. Not in plaintext. The same vault.&lt;/p&gt;


&lt;h2&gt;
  
  
  Six commands, full coverage
&lt;/h2&gt;

&lt;p&gt;The CLI covers the operations that matter for daily use and scripting:&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;code&gt;setup&lt;/code&gt; — Interactive provider configuration
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openpawz setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Walks you through choosing a provider (Anthropic, OpenAI, Google, Ollama, OpenRouter), entering credentials, and writing the engine config. Ollama requires no API key — it detects local models automatically.&lt;/p&gt;

&lt;p&gt;Defaults are production-ready: &lt;code&gt;max_tool_rounds: 10&lt;/code&gt;, &lt;code&gt;daily_budget_usd: 5.0&lt;/code&gt;, &lt;code&gt;tool_timeout_secs: 30&lt;/code&gt;, &lt;code&gt;max_concurrent_runs: 4&lt;/code&gt;. Change any of them later via &lt;code&gt;config set&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;code&gt;status&lt;/code&gt; — Engine diagnostics
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openpawz status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;One command that tells you if everything is working: provider configuration, memory config, data directory, session count. JSON output for monitoring:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openpawz status &lt;span class="nt"&gt;--output&lt;/span&gt; json | jq &lt;span class="s1"&gt;'.provider'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;agent&lt;/code&gt; — Full agent lifecycle
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openpawz agent list                           &lt;span class="c"&gt;# Table of all agents&lt;/span&gt;
openpawz agent create &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"Researcher"&lt;/span&gt;     &lt;span class="c"&gt;# Create with auto-generated ID&lt;/span&gt;
openpawz agent get agent-a1b2c3d4             &lt;span class="c"&gt;# View files and metadata&lt;/span&gt;
openpawz agent delete agent-a1b2c3d4          &lt;span class="c"&gt;# Remove agent and all files&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;session&lt;/code&gt; — Chat history management
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openpawz session list &lt;span class="nt"&gt;--limit&lt;/span&gt; 20              &lt;span class="c"&gt;# Recent sessions&lt;/span&gt;
openpawz session &lt;span class="nb"&gt;history&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;                 &lt;span class="c"&gt;# Color-coded chat history&lt;/span&gt;
openpawz session rename &amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"Q4 Analysis"&lt;/span&gt;    &lt;span class="c"&gt;# Rename for clarity&lt;/span&gt;
openpawz session cleanup                      &lt;span class="c"&gt;# Purge empty sessions (&amp;gt;1hr old)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;history&lt;/code&gt; command color-codes messages by role: cyan for user, yellow for assistant, gray for system, magenta for tool calls. You get the full conversation context without opening the GUI.&lt;/p&gt;
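&lt;p&gt;The mapping is roughly this shape (the codes shown are the standard ANSI foregrounds, not necessarily the exact ones the CLI emits):&lt;/p&gt;

```rust
// Illustrative role-to-ANSI-color mapping; role names and codes assumed
// (36 cyan, 33 yellow, 90 gray, 35 magenta), not read from OpenPawz source.
fn role_color(role: String) -> u8 {
    match role.as_str() {
        "user" => 36,      // cyan
        "assistant" => 33, // yellow
        "system" => 90,    // bright black, renders as gray
        "tool" => 35,      // magenta
        _ => 0,            // default terminal color
    }
}
```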
&lt;h3&gt;
  
  
  &lt;code&gt;config&lt;/code&gt; — Direct config editing
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openpawz config get                           &lt;span class="c"&gt;# Pretty-printed JSON&lt;/span&gt;
openpawz config &lt;span class="nb"&gt;set &lt;/span&gt;default_model gpt-4o      &lt;span class="c"&gt;# Change a value&lt;/span&gt;
openpawz config &lt;span class="nb"&gt;set &lt;/span&gt;daily_budget_usd 10.0     &lt;span class="c"&gt;# Smart parsing: numbers, bools, strings&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;memory&lt;/code&gt; — Agent memory operations
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openpawz memory list &lt;span class="nt"&gt;--limit&lt;/span&gt; 50
openpawz memory store &lt;span class="s2"&gt;"Deploy target: AWS us-east-1"&lt;/span&gt; &lt;span class="nt"&gt;--category&lt;/span&gt; fact &lt;span class="nt"&gt;--importance&lt;/span&gt; 8
openpawz memory delete a1b2c3d4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Every command supports three output formats: &lt;code&gt;--output human&lt;/code&gt; (tables, default), &lt;code&gt;--output json&lt;/code&gt; (structured, for scripts), and &lt;code&gt;--output quiet&lt;/code&gt; (IDs only, for piping).&lt;/p&gt;


&lt;h2&gt;
  
  
  Scripting and CI patterns
&lt;/h2&gt;

&lt;p&gt;The three output formats make the CLI composable with standard Unix tools:&lt;/p&gt;
&lt;h3&gt;
  
  
  Export all sessions
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openpawz session list &lt;span class="nt"&gt;--output&lt;/span&gt; json &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; sessions.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Iterate over agents
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openpawz agent list &lt;span class="nt"&gt;--output&lt;/span&gt; quiet | &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;read id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== &lt;/span&gt;&lt;span class="nv"&gt;$id&lt;/span&gt;&lt;span class="s2"&gt; ==="&lt;/span&gt;
  openpawz agent get &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$id&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; json
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  CI health check
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;openpawz status &lt;span class="nt"&gt;--output&lt;/span&gt; json | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s1"&gt;'"provider": "configured"'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✓ Engine ready"&lt;/span&gt;
&lt;span class="k"&gt;else
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✗ Run: openpawz setup"&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Batch memory import
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;facts.txt | &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; line&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;openpawz memory store &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--category&lt;/span&gt; fact &lt;span class="nt"&gt;--importance&lt;/span&gt; 7
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Cron cleanup
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In your crontab — clean empty sessions nightly&lt;/span&gt;
0 3 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /usr/local/bin/openpawz session cleanup &lt;span class="nt"&gt;--output&lt;/span&gt; quiet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why not a REST API?
&lt;/h2&gt;

&lt;p&gt;The obvious alternative to a native CLI is to expose a REST API from the desktop app and have the CLI hit it. Here's why that's worse in every dimension:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;REST API CLI&lt;/th&gt;
&lt;th&gt;Native library CLI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requires desktop app running&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every call&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Serialization overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JSON encode/decode per request&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auth surface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP auth, CORS, tokens&lt;/td&gt;
&lt;td&gt;OS filesystem permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Port conflicts&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;td&gt;Impossible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Offline/headless&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Broken if app is closed&lt;/td&gt;
&lt;td&gt;Always works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code duplication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server endpoints mirror library calls&lt;/td&gt;
&lt;td&gt;Zero — same code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API keys in transit, network exposure&lt;/td&gt;
&lt;td&gt;Direct function calls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The native approach is simpler, faster, more secure, and has fewer failure modes. The REST approach adds complexity for the sole benefit of language-agnostic access — which doesn't matter when your engine is already a Rust library.&lt;/p&gt;


&lt;h2&gt;
  
  
  Ergonomics matter
&lt;/h2&gt;

&lt;p&gt;The CLI isn't just functional — it's designed to feel good in daily use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradient ASCII banner.&lt;/strong&gt; The startup screen renders "OPEN PAWZ" in a warm orange gradient (ANSI 256-color codes 208→217) with the tagline "🐾 Multi-Agent AI from the Terminal." It's not gratuitous — it makes the tool instantly recognizable in a terminal full of monochrome output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Color-coded output.&lt;/strong&gt; Session history uses distinct colors per role so you can scan a conversation at a glance. Status output highlights warnings. Tables align cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smart value parsing.&lt;/strong&gt; &lt;code&gt;config set daily_budget_usd 10.0&lt;/code&gt; automatically parses &lt;code&gt;10.0&lt;/code&gt; as a number, not a string. &lt;code&gt;true&lt;/code&gt; becomes a boolean. &lt;code&gt;"hello"&lt;/code&gt; stays a string. You don't need to think about JSON types.&lt;/p&gt;
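&lt;p&gt;The parsing rule is easy to picture — try boolean, then number, then fall back to string. A minimal sketch with illustrative names, not the actual OpenPawz types:&lt;/p&gt;

```rust
use std::str::FromStr;

// Illustrative value parser: bool beats number beats string.
#[derive(Debug, PartialEq)]
enum ConfigValue {
    Bool(bool),
    Number(f64),
    Text(String),
}

fn parse_value(raw: String) -> ConfigValue {
    if let Ok(b) = bool::from_str(raw.trim()) {
        return ConfigValue::Bool(b);
    }
    if let Ok(n) = f64::from_str(raw.trim()) {
        return ConfigValue::Number(n);
    }
    ConfigValue::Text(raw)
}
```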

&lt;p&gt;&lt;strong&gt;Truncation.&lt;/strong&gt; Long values in tables are truncated to terminal width with ellipsis. No line wrapping, no broken formatting.&lt;/p&gt;
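&lt;p&gt;Truncation is the same idea in miniature — a sketch that counts characters rather than true display cells:&lt;/p&gt;

```rust
// Illustrative width-capped cell: keep max-1 chars and append an ellipsis.
fn truncate_cell(s: String, max: usize) -> String {
    if s.chars().count() > max {
        let keep: String = s.chars().take(max.saturating_sub(1)).collect();
        format!("{keep}…")
    } else {
        s
    }
}
```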


&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;src-tauri
cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; openpawz-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Binary lands at &lt;code&gt;target/release/openpawz&lt;/code&gt;. Move it to your PATH:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS / Linux&lt;/span&gt;
&lt;span class="nb"&gt;cp &lt;/span&gt;target/release/openpawz ~/.local/bin/

&lt;span class="c"&gt;# Or system-wide&lt;/span&gt;
&lt;span class="nb"&gt;sudo cp &lt;/span&gt;target/release/openpawz /usr/local/bin/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Packaging for Homebrew, AUR, Snap, and Flatpak is in progress under &lt;a href="https://github.com/OpenPawz/openpawz/tree/main/packaging" rel="noopener noreferrer"&gt;&lt;code&gt;packaging/&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part of the platform
&lt;/h2&gt;

&lt;p&gt;The CLI is one access path to the full OpenPawz engine. Everything you can manage through the GUI — agents, sessions, memory, configuration — you can manage from the terminal with the same guarantees:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Protocol/Feature&lt;/th&gt;
&lt;th&gt;CLI Access&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The Librarian Method&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agents discovered via CLI use the same tool index&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The Foreman Protocol&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Worker delegation happens in-engine — CLI-created agents benefit automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The Conductor Protocol&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Flows compiled by the Conductor execute the same regardless of where they were triggered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audit log&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every CLI operation is recorded in the HMAC-SHA256 chained audit log&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key vault&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API keys entered via &lt;code&gt;setup&lt;/code&gt; go to the OS keychain — same vault as the desktop app&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory engine&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;memory store&lt;/code&gt; and &lt;code&gt;memory list&lt;/code&gt; hit the same Engram memory system&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The CLI doesn't give you less. It gives you the same platform in the interface you're most productive in.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build and install&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;src-tauri &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; openpawz-cli
&lt;span class="nb"&gt;cp &lt;/span&gt;target/release/openpawz ~/.local/bin/

&lt;span class="c"&gt;# Configure your provider&lt;/span&gt;
openpawz setup

&lt;span class="c"&gt;# Check everything works&lt;/span&gt;
openpawz status

&lt;span class="c"&gt;# Start managing agents&lt;/span&gt;
openpawz agent list
openpawz session list
openpawz memory list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If you're already using the OpenPawz desktop app, the CLI sees all your existing data immediately. No migration. No import. Same database.&lt;/p&gt;


&lt;h2&gt;
  
  
  Read the full docs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/docs/cli.md" rel="noopener noreferrer"&gt;CLI Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/ARCHITECTURE.md" rel="noopener noreferrer"&gt;ARCHITECTURE.md&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star the &lt;a href="https://github.com/OpenPawz/openpawz" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you want to track progress. 🙏&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://openpawz.ai/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Fopengraph-image%3Fb0e520dc590f72f0" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://openpawz.ai/" rel="noopener noreferrer" class="c-link"&gt;
            OpenPawz — Your AI, Your Rules
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            A native desktop AI platform that runs fully offline, connects to any provider, and puts you in control. Private by default. Powerful by design.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Ffavicon.ico%3Ffavicon.0b3bf435.ico"&gt;
          openpawz.ai
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;





</description>
      <category>ai</category>
      <category>cli</category>
      <category>rust</category>
      <category>agents</category>
    </item>
    <item>
      <title>The Foreman Protocol: How OpenPawz gives AI agents bidirectional access to community driven services</title>
      <dc:creator>Gotham64</dc:creator>
      <pubDate>Mon, 09 Mar 2026 07:03:45 +0000</pubDate>
      <link>https://dev.to/gotham64/the-foreman-protocol-how-openpawz-gives-ai-agents-bidirectional-access-to-community-driven-services-3ien</link>
      <guid>https://dev.to/gotham64/the-foreman-protocol-how-openpawz-gives-ai-agents-bidirectional-access-to-community-driven-services-3ien</guid>
      <description>&lt;h2&gt;
  
  
  The hidden cost of AI tool execution
&lt;/h2&gt;

&lt;p&gt;When an AI agent sends a Slack message, the Slack API itself is free. But the cloud model has to process the tool schema, reason about parameters, format structured JSON, wait for the result, and summarize it back to the user. Every one of those steps consumes tokens at your provider's rate.&lt;/p&gt;

&lt;p&gt;Now multiply that across an automation that sends 50 messages, creates 10 tickets, and updates 5 spreadsheets. The &lt;strong&gt;cloud API costs dominate&lt;/strong&gt; — not because the tools are expensive, but because the &lt;em&gt;reasoning about how to call them&lt;/em&gt; is expensive.&lt;/p&gt;

&lt;p&gt;And it gets worse. As integrations grow, the cloud LLM has to hold more tool schemas in its context window in order to call them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenPawz tools — manageable context overhead&lt;/li&gt;
&lt;li&gt;Built-in tools — significant context consumed by schemas alone&lt;/li&gt;
&lt;li&gt;Community integrations — impossible to load into any context window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Formatting a JSON-RPC call is not a task that needs GPT-4 or Claude Opus. &lt;strong&gt;OpenPawz&lt;/strong&gt; solves this with the &lt;strong&gt;Foreman Protocol&lt;/strong&gt; — splitting the agent into two roles, each doing only what it's suited for.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openpawz.ai" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffg59rxvva9kct0mcfdx7.png" alt="OpenPawz"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OpenPawz/openpawz" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Star the repo — it's open source&lt;/a&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  The invention: Architect plans, Foreman executes
&lt;/h2&gt;

&lt;p&gt;The Foreman Protocol splits the agent into two roles:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Does&lt;/th&gt;
&lt;th&gt;Costs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architect&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud LLM (GPT, Claude, Gemini)&lt;/td&gt;
&lt;td&gt;Plans, reasons, talks to user — decides &lt;em&gt;what&lt;/em&gt; needs to happen&lt;/td&gt;
&lt;td&gt;Per-token (paid)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Foreman&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any cheap/free model&lt;/td&gt;
&lt;td&gt;Interfaces with services — handles &lt;em&gt;how&lt;/em&gt; it happens&lt;/td&gt;
&lt;td&gt;Free (local models) or cheap (cloud models)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Architect never sees MCP schemas. The Foreman never reasons about user intent. Each model does only what it's suited for.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bidirectional, not a pipeline
&lt;/h2&gt;

&lt;p&gt;The Foreman is not a one-way executor in a predefined sequence. It is a &lt;strong&gt;bidirectional bridge&lt;/strong&gt; between your agent and every connected service. It can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read&lt;/strong&gt; — Query a database, list Slack channels, fetch open Jira tickets, check GitHub PR status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write&lt;/strong&gt; — Send a message, create a ticket, update a spreadsheet, post to a webhook&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Both in one task&lt;/strong&gt; — Read the open tickets, then post a summary to Slack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it doesn't need to be part of a flow or automation chain. The agent can reach into any connected service &lt;strong&gt;at any point in a conversation&lt;/strong&gt;, for any reason — to answer a question, check a fact, or pull context before making a decision. No predetermined sequence. No predefined trigger.&lt;/p&gt;

&lt;p&gt;This is what makes it fundamentally different from automation platforms like Zapier, n8n, and Make, where you build &lt;em&gt;flows&lt;/em&gt; — predefined sequences of steps. With the Foreman Protocol, the agent decides what information it needs and what actions to take &lt;strong&gt;in real time&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Examples
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Reading (querying information):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"What are the open tickets assigned to me in Jira?"&lt;/em&gt;&lt;br&gt;
→ Architect decides it needs Jira data → Foreman queries Jira via MCP → returns ticket list → Architect summarizes for user&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Writing (taking action):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Send 'hello' to #general on Slack"&lt;/em&gt;&lt;br&gt;
→ Architect decides to post a message → Foreman calls Slack via MCP → message sent → Architect confirms&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Both in one conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Summarize my open GitHub PRs and post the summary to #engineering on Slack"&lt;/em&gt;&lt;br&gt;
→ Architect plans two steps → Foreman reads from GitHub, then writes to Slack → Architect presents the result&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Ad-hoc access (no flow, no sequence):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"How many unread messages do I have in Slack?"&lt;/em&gt;&lt;br&gt;
→ The agent just reaches into Slack, checks, and answers. No automation. No workflow. Just a question answered from a live data source.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why self-describing MCP is the key
&lt;/h2&gt;

&lt;p&gt;The Foreman Protocol would not work without self-describing tool schemas. Here's why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional tool execution:&lt;/strong&gt; The LLM must have the tool's schema in its context to know how to call it. With thousands of potential integrations, you can't fit all their schemas into any context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With MCP:&lt;/strong&gt; The Foreman connects to the MCP server and asks &lt;em&gt;"What tools do you have?"&lt;/em&gt; The server responds with complete schemas — parameter names, types, descriptions, examples. The Foreman uses these to find and execute the right operation.&lt;/p&gt;

&lt;p&gt;No pre-training. No static configuration. No context window overflow.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Any new integration is accessible immediately&lt;/strong&gt; — install it, the Foreman can execute it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero configuration per service&lt;/strong&gt; — no prompt engineering, no few-shot examples, no fine-tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Any model works&lt;/strong&gt; — the Foreman just needs to follow JSON-RPC formatting, which any code-capable model can do&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reads are as natural as writes&lt;/strong&gt; — querying a database and sending a Slack message go through the same execution path&lt;/li&gt;
&lt;/ul&gt;
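&lt;p&gt;The discovery handshake itself is one small JSON-RPC 2.0 message. Sketched by hand (the &lt;code&gt;tools/list&lt;/code&gt; method name comes from the MCP spec; the id is arbitrary):&lt;/p&gt;

```rust
// Build the raw MCP discovery request the Foreman sends on connect.
fn tools_list_request(id: u64) -> String {
    format!(r#"{{"jsonrpc":"2.0","id":{id},"method":"tools/list","params":{{}}}}"#)
}
```

&lt;p&gt;The server's reply carries the full schema for every tool — which is all the Foreman needs to pick and call the right one.&lt;/p&gt;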




&lt;h2&gt;
  
  
  The cost structure inverts
&lt;/h2&gt;

&lt;p&gt;In a traditional agent architecture, the cloud model handles everything — intent, planning, tool formatting, execution, response. You pay cloud rates for all of it.&lt;/p&gt;

&lt;p&gt;With the Foreman Protocol, the cloud model &lt;strong&gt;only handles intent and planning.&lt;/strong&gt; All service interaction — every read and every write — is delegated to a worker model, so you pay premium rates only for the tokens that genuinely require frontier intelligence.&lt;/p&gt;

&lt;p&gt;The savings scale with usage. The more your agents interact with connected services, the more you save — because every tool call that would have burned premium tokens is handled by the cheapest capable model in the stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  vs. automation platforms
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;AI-Driven?&lt;/th&gt;
&lt;th&gt;Tool Execution Cost&lt;/th&gt;
&lt;th&gt;Integrations&lt;/th&gt;
&lt;th&gt;Bidirectional?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenPawz (Foreman)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes — natural language&lt;/td&gt;
&lt;td&gt;Free (local) or cheap (cloud)&lt;/td&gt;
&lt;td&gt;Community Services&lt;/td&gt;
&lt;td&gt;Yes — read + write, any time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Zapier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Per-task pricing&lt;/td&gt;
&lt;td&gt;7,000&lt;/td&gt;
&lt;td&gt;No — predefined flows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Make&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Per-operation pricing&lt;/td&gt;
&lt;td&gt;2,000&lt;/td&gt;
&lt;td&gt;No — predefined flows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;n8n Standalone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No — manual workflows&lt;/td&gt;
&lt;td&gt;Free (self-hosted)&lt;/td&gt;
&lt;td&gt;400+ built-in&lt;/td&gt;
&lt;td&gt;No — predefined flows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OpenPawz is the only platform where you can say &lt;em&gt;"What are my open PRs on GitHub, and post a summary to #engineering on Slack"&lt;/em&gt; and have it work — with natural language, AI-driven execution, bidirectional service access, and free local tool execution across 25,000+ integrations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key design decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Interception, not routing
&lt;/h3&gt;

&lt;p&gt;The Foreman is wired into the main agent loop's &lt;code&gt;execute_tool()&lt;/code&gt; path. Any &lt;code&gt;mcp_*&lt;/code&gt; tool call is automatically intercepted — the Architect doesn't need to know the Foreman exists. Zero changes to agent prompts or system instructions. Works with any cloud provider. Transparent fallback if no worker model is configured.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Mini agent loop (8 rounds max)
&lt;/h3&gt;

&lt;p&gt;The Foreman runs a constrained agent loop — up to 8 rounds of tool calls. This handles multi-step tasks (query a database → format results → post to Slack) and multi-read scenarios (check Jira + check GitHub + check Slack) without risking infinite loops.&lt;/p&gt;
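&lt;p&gt;The shape of that constraint is just a capped loop. A toy sketch (the real logic lives in &lt;code&gt;run_worker_loop()&lt;/code&gt;):&lt;/p&gt;

```rust
// Illustrative bounded loop: step() returns true when the task is done;
// the hard cap prevents runaway tool-call chains.
const MAX_ROUNDS: usize = 8;

fn run_bounded(step: fn(usize) -> bool) -> bool {
    for round in 0..MAX_ROUNDS {
        if step(round) {
            return true; // finished within budget
        }
    }
    false // budget exhausted; caller reports failure
}
```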

&lt;h3&gt;
  
  
  3. No recursion
&lt;/h3&gt;

&lt;p&gt;The Foreman cannot spawn sub-workers or delegate to other agents. It receives a task, executes MCP tools, and returns a result. This prevents runaway delegation chains.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Direct MCP execution
&lt;/h3&gt;

&lt;p&gt;The Foreman calls MCP servers directly via JSON-RPC — it doesn't go back through the engine's &lt;code&gt;execute_tool()&lt;/code&gt; path. This prevents the worker's MCP calls from being intercepted again (infinite loop) and keeps the execution path simple.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Graceful fallback
&lt;/h3&gt;

&lt;p&gt;If no &lt;code&gt;worker_model&lt;/code&gt; is configured, MCP tool calls execute directly via JSON-RPC as before. The Foreman Protocol is additive — it improves cost efficiency but is never required.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;The core flow in simplified Rust:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In execute_tool() — MCP path&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="nf"&gt;.starts_with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"mcp_"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Try Foreman delegation first&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;delegate_to_worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine_state&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Foreman handled it&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// Fallback: direct JSON-RPC execution&lt;/span&gt;
    &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="nf"&gt;.execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;engine/tools/worker_delegate.rs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Core — &lt;code&gt;delegate_to_worker()&lt;/code&gt;, &lt;code&gt;run_worker_loop()&lt;/code&gt;, &lt;code&gt;execute_worker_tool()&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;engine/tools/mod.rs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MCP interception point in &lt;code&gt;execute_tool()&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;engine/mcp/registry.rs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MCP tool schema discovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;engine/mcp/client.rs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;JSON-RPC tool execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;commands/ollama.rs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Worker model management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  Model requirements
&lt;/h2&gt;

&lt;p&gt;The Foreman can run &lt;strong&gt;any model from any provider:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local Example (Ollama — free):&lt;/strong&gt; The default &lt;code&gt;qwen2.5-coder:7b&lt;/code&gt; requires ~5 GB disk and runs on 8+ GB RAM (CPU) or 5+ GB VRAM (GPU). On Apple Silicon (M1+), inference is fast enough that tool execution feels instant. Zero API cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud (any provider — cheap):&lt;/strong&gt; Use a cheap model from your existing provider — &lt;code&gt;gemini-2.0-flash&lt;/code&gt;, &lt;code&gt;gpt-4o-mini&lt;/code&gt;, &lt;code&gt;claude-haiku-4-5&lt;/code&gt;, &lt;code&gt;deepseek-chat&lt;/code&gt;. No local hardware needed. The worker model can use a different provider than the Architect.&lt;/p&gt;

&lt;p&gt;The worker Modelfile for Ollama:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; qwen2.5-coder:7b&lt;/span&gt;
SYSTEM """You are a precise tool executor. Given a task and available MCP tools,
execute the correct tool call and return the result. Be concise."""
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Low temperature ensures structured, deterministic tool calls. The 7B model is large enough for reliable JSON-RPC formatting but small enough to run on consumer hardware.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part of a trinity
&lt;/h2&gt;

&lt;p&gt;The Foreman Protocol works with two complementary OpenPawz innovations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/reference/librarian-method.mdx" rel="noopener noreferrer"&gt;&lt;strong&gt;The Librarian Method&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Which&lt;/em&gt; tool to use among many?&lt;/td&gt;
&lt;td&gt;Intent-driven discovery via semantic embeddings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The Foreman Protocol&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;How&lt;/em&gt; to execute tools cheaply?&lt;/td&gt;
&lt;td&gt;Worker model delegation via self-describing MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/reference/conductor-protocol.mdx" rel="noopener noreferrer"&gt;&lt;strong&gt;The Conductor Protocol&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;What's the optimal execution plan?&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;AI-compiled flow strategies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Together: the &lt;strong&gt;Librarian&lt;/strong&gt; finds the right tool, the &lt;strong&gt;Foreman&lt;/strong&gt; executes it at near-zero cost, and the &lt;strong&gt;Conductor&lt;/strong&gt; orchestrates everything into minimal LLM calls.&lt;/p&gt;

&lt;p&gt;In practice, an agent can discover and execute any of 25,000+ integrations at near-zero cost — something no other AI agent platform achieves.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Option A: Local worker (Ollama — free)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen2.5-coder:7b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings → Advanced → Ollama&lt;/strong&gt; and click &lt;strong&gt;Setup Worker Agent&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;Settings → Models → Model Routing&lt;/strong&gt;, set Worker Model to &lt;code&gt;worker-qwen&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Option B: Cloud worker (any provider — cheap)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings → Models → Model Routing&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Set your Boss Model (e.g. &lt;code&gt;gemini-3.1-pro-preview&lt;/code&gt;, &lt;code&gt;gpt-4o&lt;/code&gt;, &lt;code&gt;claude-opus-4-6&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Set your Worker Model to a cheaper model from the same or different provider (e.g. &lt;code&gt;gemini-2.0-flash&lt;/code&gt;, &lt;code&gt;gpt-4o-mini&lt;/code&gt;, &lt;code&gt;claude-haiku-4-5&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Use it
&lt;/h3&gt;

&lt;p&gt;Just chat normally. When your agent calls any MCP tool, the Foreman handles execution automatically:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Generate a QR code for &lt;a href="https://openpawz.ai" rel="noopener noreferrer"&gt;https://openpawz.ai&lt;/a&gt;"&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Architect identifies the task → Librarian finds n8n QR code node → Foreman executes via MCP → QR code returned — tool execution handled by the worker model, not the expensive Architect.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Read the full spec
&lt;/h2&gt;

&lt;p&gt;The complete technical reference — including architecture diagrams, cost analysis, and implementation details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/reference/foreman-protocol.mdx" rel="noopener noreferrer"&gt;The Foreman Protocol — Full Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/ARCHITECTURE.md" rel="noopener noreferrer"&gt;ARCHITECTURE.md&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star the &lt;a href="https://github.com/OpenPawz/openpawz" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you want to track progress. 🙏&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://openpawz.ai/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Fopengraph-image%3Fb0e520dc590f72f0" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://openpawz.ai/" rel="noopener noreferrer" class="c-link"&gt;
            OpenPawz — Your AI, Your Rules
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            A native desktop AI platform that runs fully offline, connects to any provider, and puts you in control. Private by default. Powerful by design.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Ffavicon.ico%3Ffavicon.0b3bf435.ico"&gt;
          openpawz.ai
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;





</description>
      <category>ai</category>
      <category>mcp</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The Librarian Method: How OpenPawz solves tool bloat — and why memory matters</title>
      <dc:creator>Gotham64</dc:creator>
      <pubDate>Sat, 07 Mar 2026 18:01:22 +0000</pubDate>
      <link>https://dev.to/gotham64/the-librarian-method-how-openpawz-solves-tool-bloat-and-why-memory-matters-4gpp</link>
      <guid>https://dev.to/gotham64/the-librarian-method-how-openpawz-solves-tool-bloat-and-why-memory-matters-4gpp</guid>
      <description>&lt;h2&gt;
  
  
  The tool bloat problem nobody talks about
&lt;/h2&gt;

&lt;p&gt;Every AI agent platform hits the same wall: &lt;strong&gt;tool bloat.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An agent that can send emails, manage files, query databases, search the web, post to Slack, and call APIs needs a growing pile of tool definitions in its context window. Connect it to external systems, automation platforms, or MCP servers, and you've consumed a meaningful chunk of context before the agent has even started reasoning about the user's request.&lt;/p&gt;

&lt;p&gt;The conventional fixes all have critical flaws:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Load all tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Breaks down as tool count grows — schemas and descriptions crowd out actual reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pre-filter by keyword&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fragile. “Send a message to John” — email? Slack? SMS? WhatsApp? Telegram?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Category menus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pushes routing burden onto the user&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Static tool sets per agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limits what each agent can do — defeats the point of a general platform&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fundamental issue: &lt;strong&gt;the system decides which tools are relevant before the LLM has understood the user's intent.&lt;/strong&gt; That's solving the wrong problem. Only the LLM knows what the user is actually asking for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenPawz&lt;/strong&gt; solves this with the &lt;strong&gt;Librarian Method&lt;/strong&gt; — a technique that inverts tool discovery entirely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openpawz.ai" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffg59rxvva9kct0mcfdx7.png" alt="OpenPawz"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OpenPawz/openpawz" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Star the repo — it's open source&lt;/a&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  The invention: let the agent ask the librarian
&lt;/h2&gt;

&lt;p&gt;The metaphor is literal. A library patron (the agent) walks up to a librarian and describes what they need. The librarian finds the right books. The patron never needs to know the filing system.&lt;/p&gt;

&lt;p&gt;Three roles make this work:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Patron&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The LLM reasoning over the user's request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Librarian&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An embedding-powered retrieval layer that maps intent to tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Library&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A searchable tool index built from tool definitions and domains&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Patron understands intent. The Librarian searches for matching tools by semantic similarity. The Library stores every tool as an embedding vector organized by capability domain.&lt;/p&gt;

&lt;p&gt;That means the agent only sees the tools it needs &lt;strong&gt;after&lt;/strong&gt; it understands the task.&lt;/p&gt;




&lt;h2&gt;
  
  
  How it works — round by round
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Round 1: User says "Email John about the quarterly report"

  Agent has: a small core set of tools
  Agent understands intent: needs email capabilities
  Agent calls: request_tools("email sending capabilities")

  Librarian embeds the request
  Semantic search runs against the tool index
  Top matches: email_send, email_read
  Domain expansion pulls in closely related tools

Round 2: Tools are hot-loaded into the current turn

  Agent now has: core tools + email tools
  Agent calls: email_send({to: "john@...", subject: "Quarterly Report", ...})
  Done ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The agent used &lt;strong&gt;only the tools it needed&lt;/strong&gt; instead of dragging every available tool definition into the prompt.&lt;/p&gt;
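&lt;p&gt;The discovery step above can be sketched as cosine-similarity ranking over a pre-embedded tool index. The tiny 3-dimensional vectors and the function signatures below are stand-ins, since real embeddings come from a model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Score the agent's embedded request against pre-embedded tool
// descriptions and return the top-k tool names.
fn cosine(a: &amp;amp;[f32], b: &amp;amp;[f32]) -&amp;gt; f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::&amp;lt;f32&amp;gt;().sqrt();
    let nb = b.iter().map(|x| x * x).sum::&amp;lt;f32&amp;gt;().sqrt();
    dot / (na * nb)
}

fn request_tools(query: &amp;amp;[f32], index: &amp;amp;[(String, Vec&amp;lt;f32&amp;gt;)], k: usize) -&amp;gt; Vec&amp;lt;String&amp;gt; {
    let mut scored: Vec&amp;lt;(f32, String)&amp;gt; = index
        .iter()
        .map(|(name, v)| (cosine(query, v), name.clone()))
        .collect();
    scored.sort_by(|a, b| b.0.total_cmp(&amp;amp;a.0));
    scored.into_iter().take(k).map(|(_, name)| name).collect()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With an index of email, database, and search tools, a query vector near the email embeddings returns &lt;code&gt;email_send&lt;/code&gt; first; domain expansion then widens that hit into the related cluster.&lt;/p&gt;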


&lt;h2&gt;
  
  
  Five design decisions that make it work
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Agent-driven discovery
&lt;/h3&gt;

&lt;p&gt;The LLM forms the search query — not a brittle pre-filter guessing from the raw user message.&lt;/p&gt;

&lt;p&gt;When a user says &lt;em&gt;"Can you check if the deployment went through?"&lt;/em&gt;, a keyword filter might latch onto &lt;code&gt;deploy&lt;/code&gt; or &lt;code&gt;check&lt;/code&gt; and surface the wrong tools. The agent understands the real intent is monitoring and calls something closer to &lt;code&gt;request_tools("deployment status monitoring CI/CD")&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is a far better search query because it comes &lt;strong&gt;after reasoning&lt;/strong&gt;, not before it.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Domain expansion
&lt;/h3&gt;

&lt;p&gt;When the Librarian finds one strong match, it can also bring along closely related tools from the same domain.&lt;/p&gt;

&lt;p&gt;If the agent finds &lt;code&gt;email_send&lt;/code&gt;, it probably also needs &lt;code&gt;email_read&lt;/code&gt;, contact lookup, or attachment handling. Related capabilities travel together so the agent doesn't need to repeatedly rediscover the same cluster.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Round carryover
&lt;/h3&gt;

&lt;p&gt;Tools loaded in one reasoning round remain available in the next round of the same turn.&lt;/p&gt;

&lt;p&gt;The agent doesn't lose access to the tools it just discovered, but the set also doesn't accumulate forever across unrelated turns.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Fallback layers
&lt;/h3&gt;

&lt;p&gt;If semantic search is weak, the system still has multiple ways to recover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Exact name match&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Domain match&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain list return&lt;/strong&gt; so the agent can refine its own request&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The agent always gets something actionable back.&lt;/p&gt;
&lt;h3&gt;
  
  
  5. Memory-aware execution
&lt;/h3&gt;

&lt;p&gt;Tool discovery alone is not enough.&lt;/p&gt;

&lt;p&gt;Once an agent finds and uses the right tool, it still needs to remember what happened, what worked, what failed, and what should be reused later. That is where &lt;strong&gt;Engram&lt;/strong&gt; enters the picture.&lt;/p&gt;

&lt;p&gt;The Librarian answers: &lt;strong&gt;Which tool should I use?&lt;/strong&gt;&lt;br&gt;
Engram answers: &lt;strong&gt;What should I remember from using it?&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Tool discovery alone is not enough
&lt;/h2&gt;

&lt;p&gt;Most agent systems stop too early.&lt;/p&gt;

&lt;p&gt;They focus on tool routing, but real agent behavior has three distinct problems:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;What it asks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool discovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Which capability should the agent use right now?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What should the agent retain across turns and sessions?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Expertise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How does repeated success become something better than a prompt?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OpenPawz treats these as separate layers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The Librarian Method&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Discover the right tools on demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Project Engram&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Give the agent structured, persistent memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The Forge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Turn repeated procedural success into earned expertise&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That stack is the real idea.&lt;/p&gt;


&lt;h2&gt;
  
  
  Engram: memory that behaves like cognition, not a key-value dump
&lt;/h2&gt;

&lt;p&gt;Most AI memory systems are still basically: &lt;strong&gt;store blobs, search blobs, inject blobs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That works up to a point, but it has obvious failure modes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flat memory model&lt;/th&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Store everything the same way&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No difference between facts, episodes, and procedures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Always retrieve&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wastes latency and pollutes context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Never forget&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Outdated information lingers forever&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No structure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Repeated experiences never become organized knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No budget awareness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Memory recall competes blindly with the context window&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Project Engram&lt;/strong&gt; is OpenPawz’s memory architecture for persistent agents.&lt;/p&gt;

&lt;p&gt;Instead of treating memory like a bag of documents, Engram models memory as a living system with multiple layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sensory input&lt;/strong&gt; for what just happened&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Working memory&lt;/strong&gt; for what the agent is actively thinking about&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term memory&lt;/strong&gt; for what should persist across sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph relationships&lt;/strong&gt; so memories are connected, not isolated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consolidation and decay&lt;/strong&gt; so the memory store improves over time instead of just growing forever&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  The memory model: from raw input to durable knowledge
&lt;/h2&gt;

&lt;p&gt;At a high level, Engram works like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj5r3dng9tf1alfz43vh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj5r3dng9tf1alfz43vh.png" alt="Engram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A user message enters a &lt;strong&gt;sensory buffer&lt;/strong&gt;. Important items get promoted into &lt;strong&gt;working memory&lt;/strong&gt;. Useful outcomes are captured into &lt;strong&gt;long-term memory&lt;/strong&gt;. Later, when the agent needs context again, Engram decides what to retrieve and what to ignore.&lt;/p&gt;

&lt;p&gt;That sounds simple, but the key is that memory is not just being stored — it is being &lt;strong&gt;ranked, consolidated, and filtered&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Three tiers, three jobs
&lt;/h2&gt;

&lt;p&gt;Engram uses a three-tier memory architecture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sensory Buffer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What just happened&lt;/td&gt;
&lt;td&gt;Holds raw turn-level input before selection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Working Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What the agent is actively thinking about&lt;/td&gt;
&lt;td&gt;Maintains a priority-limited attention set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Long-Term Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What should survive across sessions&lt;/td&gt;
&lt;td&gt;Stores episodic, semantic, and procedural memory&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That separation matters because not every piece of information deserves the same lifespan.&lt;/p&gt;

&lt;p&gt;A tool result from a minute ago may belong in working memory.&lt;br&gt;
A learned user preference may belong in long-term semantic memory.&lt;br&gt;
A successful multi-step workflow may belong in procedural memory.&lt;/p&gt;

&lt;p&gt;Flat memory systems blur all of that together. Engram does not.&lt;/p&gt;
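&lt;p&gt;A minimal sketch of that separation, with illustrative names rather than the Engram API: each memory item carries its tier, and promotion only ever moves it one step toward long-term storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// The three tiers as an enum; promotion never skips working memory.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier {
    Sensory,  // what just happened
    Working,  // active attention set
    LongTerm, // survives across sessions
}

fn promote(tier: Tier) -&amp;gt; Tier {
    match tier {
        Tier::Sensory =&amp;gt; Tier::Working,
        Tier::Working | Tier::LongTerm =&amp;gt; Tier::LongTerm,
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;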


&lt;h2&gt;
  
  
  Retrieval should be gated, not automatic
&lt;/h2&gt;

&lt;p&gt;Another failure mode in agent memory systems is that they retrieve memory for &lt;strong&gt;everything&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But not every query needs memory.&lt;/p&gt;

&lt;p&gt;If the user asks for a calculation, a greeting, or something already covered in the active conversation, memory search is wasteful. It adds latency and pollutes the prompt with irrelevant context.&lt;/p&gt;

&lt;p&gt;So Engram adds a &lt;strong&gt;retrieval gate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69f1638psiwinzacf8t1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69f1638psiwinzacf8t1.png" alt="Retrieval Gate"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The gate decides whether retrieval is needed at all. That means the system is not just asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What memories match this query?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is first asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Should I even search memory right now?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That distinction matters more than it sounds.&lt;/p&gt;
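&lt;p&gt;A toy version of that gate, with heuristics that are purely illustrative (Engram's real gate logic is more involved):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Decide whether memory search is worth running at all: skip queries
// the active conversation already covers, plus greetings and
// pure arithmetic.
fn should_retrieve(query: &amp;amp;str, covered_in_context: bool) -&amp;gt; bool {
    if covered_in_context {
        return false;
    }
    let q = query.trim().to_lowercase();
    let greeting = ["hi", "hello", "thanks"].contains(&amp;amp;q.as_str());
    let arithmetic = !q.is_empty()
        &amp;amp;&amp;amp; q.chars().all(|c| c.is_ascii_digit() || "+-*/. =?".contains(c));
    !(greeting || arithmetic)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A greeting or a calculation returns &lt;code&gt;false&lt;/code&gt; immediately; a question about past decisions falls through to the full hybrid search.&lt;/p&gt;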


&lt;h2&gt;
  
  
  Search is hybrid, not naive
&lt;/h2&gt;

&lt;p&gt;Engram does not rely on a single retrieval method.&lt;/p&gt;

&lt;p&gt;It combines multiple signals:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Strength&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Full-text search&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Good for exact terms, identifiers, names, phrases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector similarity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Good for meaning, paraphrase, conceptual recall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Graph traversal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Good for connected ideas, related facts, causal links&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That lets the system answer different kinds of questions more intelligently.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A factual query may weight lexical matches more heavily.&lt;/li&gt;
&lt;li&gt;A conceptual query may weight semantic similarity more heavily.&lt;/li&gt;
&lt;li&gt;A broader exploratory query may benefit from graph expansion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what makes Engram more than “RAG but local.” It is memory retrieval shaped by query intent.&lt;/p&gt;
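&lt;p&gt;Sketched with made-up weights (the real blend is Engram's to tune), the intent-weighted combination looks roughly like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Blend a lexical score and a semantic score, weighted by query kind.
// The weights are illustrative assumptions, not Engram's tuning.
enum QueryKind {
    Factual,    // exact terms, names, identifiers
    Conceptual, // meaning and paraphrase
}

fn hybrid_score(kind: &amp;amp;QueryKind, lexical: f32, semantic: f32) -&amp;gt; f32 {
    let (wl, ws) = match kind {
        QueryKind::Factual =&amp;gt; (0.7, 0.3),
        QueryKind::Conceptual =&amp;gt; (0.3, 0.7),
    };
    wl * lexical + ws * semantic
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Graph traversal enters as a third signal on top of this blend, expanding strong hits into their connected neighbors.&lt;/p&gt;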


&lt;h2&gt;
  
  
  Memory should not just grow forever
&lt;/h2&gt;

&lt;p&gt;A real memory system needs a theory of forgetting.&lt;/p&gt;

&lt;p&gt;Without that, every stored fact competes forever for retrieval and context budget. Quality degrades because stale, duplicate, or low-value memories remain in circulation.&lt;/p&gt;

&lt;p&gt;That is why Engram treats &lt;strong&gt;forgetting as a feature&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  What forgetting means here
&lt;/h3&gt;

&lt;p&gt;Forgetting in Engram is not random deletion. It is controlled memory maintenance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;duplicates can be merged,&lt;/li&gt;
&lt;li&gt;contradictions can be resolved,&lt;/li&gt;
&lt;li&gt;stale low-value memories can fade,&lt;/li&gt;
&lt;li&gt;important memories can persist longer,&lt;/li&gt;
&lt;li&gt;and quality can be measured before and after cleanup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one of the most important differences between a memory architecture and a document pile.&lt;/p&gt;

&lt;p&gt;A pile only gets larger.&lt;br&gt;
A memory system should get &lt;strong&gt;cleaner&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Memory is a graph, not a folder
&lt;/h2&gt;

&lt;p&gt;Long-term memory in Engram is not just a list of rows. It is a graph.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Edge Type&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;RelatedTo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;General association&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CausedBy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Causal relationship&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Supports&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Supporting evidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Contradicts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Conflicting knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PartOf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Component or hierarchy relationship&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FollowedBy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Temporal sequence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DerivedFrom&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Origin or lineage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SimilarTo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Semantic similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That structure matters because recall should not always stop at direct matches.&lt;/p&gt;

&lt;p&gt;Sometimes the most useful thing is not the first memory you find — it is the memory &lt;strong&gt;connected&lt;/strong&gt; to the first memory.&lt;/p&gt;

&lt;p&gt;That is where graph-based retrieval becomes meaningful. The agent can move from direct hits to adjacent context instead of pretending every useful insight must be textually similar to the exact query.&lt;/p&gt;
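&lt;p&gt;The edge table above maps naturally onto a small enum, and one-hop expansion over (from, edge, to) triples is enough to move from a direct hit to its neighbors. A sketch with illustrative types, not the real Engram schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// The eight edge types from the table, plus one-hop expansion:
// given a directly matched memory id, collect the ids it points to.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Edge {
    RelatedTo, CausedBy, Supports, Contradicts,
    PartOf, FollowedBy, DerivedFrom, SimilarTo,
}

fn one_hop(edges: &amp;amp;[(u32, Edge, u32)], start: u32) -&amp;gt; Vec&amp;lt;u32&amp;gt; {
    edges
        .iter()
        .filter(|(from, _, _)| *from == start)
        .map(|(_, _, to)| *to)
        .collect()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;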


&lt;h2&gt;
  
  
  Procedural memory is where things get interesting
&lt;/h2&gt;

&lt;p&gt;Most memory systems focus on &lt;strong&gt;facts&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Engram also stores &lt;strong&gt;procedures&lt;/strong&gt; — not just what is true, but &lt;em&gt;how to do things&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That means OpenPawz can remember:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how a deployment was fixed,&lt;/li&gt;
&lt;li&gt;how a file transformation worked,&lt;/li&gt;
&lt;li&gt;how an API issue was resolved,&lt;/li&gt;
&lt;li&gt;how a workflow was built successfully.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This turns memory from passive recall into active reuse.&lt;/p&gt;

&lt;p&gt;A fact helps the agent answer.&lt;br&gt;
A procedure helps the agent act.&lt;/p&gt;

&lt;p&gt;That is the bridge from memory into expertise.&lt;/p&gt;


&lt;h2&gt;
  
  
  THE FORGE: specialists should earn expertise
&lt;/h2&gt;

&lt;p&gt;Most AI platforms create “specialists” by stuffing a domain document into the prompt and calling it expertise.&lt;/p&gt;

&lt;p&gt;That is not expertise. It is a cheat sheet.&lt;/p&gt;

&lt;p&gt;A prompt-based specialist has obvious problems:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt specialist&lt;/th&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claims expertise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;But has never been tested&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Answers confidently&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Even when the knowledge is stale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Looks specialized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;But has no measurable boundary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Can be copied instantly&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The whole “specialist” is often just a file&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;FORGE&lt;/strong&gt; is OpenPawz’s answer.&lt;/p&gt;

&lt;p&gt;It extends Engram’s procedural memory so that repeatable workflows can move through a lifecycle:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0jjkcp9ikb40bhjj28l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0jjkcp9ikb40bhjj28l.png" alt="Lifecycles Workflows"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That means procedures are not all equal.&lt;/p&gt;

&lt;p&gt;Some are just memories.&lt;br&gt;
Some are developing skills.&lt;br&gt;
Some are skills the system can treat as trusted and reusable.&lt;/p&gt;
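&lt;p&gt;That lifecycle can be sketched in code. This is a hedged TypeScript illustration, not OpenPawz's actual API: the stage names, thresholds, and promotion rules below are assumptions made for the example.&lt;/p&gt;

```typescript
// Hypothetical sketch of the FORGE lifecycle described above.
// Stage names and promotion thresholds are illustrative, not the real schema.
type ForgeStage = "memory" | "developing" | "trusted";

interface Procedure {
  name: string;
  stage: ForgeStage;
  successes: number; // validated runs
  failures: number;
}

// Promote a procedure when its track record clears a threshold,
// demote it when failures accumulate (drift).
function nextStage(p: Procedure): ForgeStage {
  const total = p.successes + p.failures;
  const rate = total > 0 ? p.successes / total : 0;
  if (p.stage === "memory" && p.successes >= 3 && rate >= 0.8) return "developing";
  if (p.stage === "developing" && p.successes >= 10 && rate >= 0.9) return "trusted";
  if (p.stage === "trusted" && rate < 0.7) return "developing"; // drift detected
  return p.stage;
}
```

&lt;p&gt;The point is not the specific numbers; it is that promotion is earned from recorded outcomes rather than declared in a prompt.&lt;/p&gt;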


&lt;h2&gt;
  
  
  The moat is not the prompt
&lt;/h2&gt;

&lt;p&gt;This is the deeper idea behind FORGE:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can copy a prompt file.&lt;br&gt;
You cannot copy accumulated, verified training cycles overnight.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a very different kind of defensibility.&lt;/p&gt;

&lt;p&gt;If a system has gone through repeated tasks, retained successful procedures, linked them into skill relationships, measured confidence, and re-trained when things drift, then its expertise is not just text in a system prompt anymore.&lt;/p&gt;

&lt;p&gt;It is embedded into the behavior of the system through memory, validation, and reuse.&lt;/p&gt;

&lt;p&gt;That is much harder to fake.&lt;/p&gt;


&lt;h2&gt;
  
  
  How FORGE fits into Engram
&lt;/h2&gt;

&lt;p&gt;FORGE does not create a separate storage system. It extends the memory system already there.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engram capability&lt;/th&gt;
&lt;th&gt;FORGE uses it for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Procedural memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stores the procedures that can be trained and certified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory edges&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Builds skill trees and prerequisite relationships&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trust / confidence signals&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distinguishes stronger skills from weaker ones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Decay and consolidation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Detects drift, staleness, and retraining candidates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Meta-cognition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Helps the agent know what it knows and what it does not&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is an important design choice.&lt;/p&gt;

&lt;p&gt;FORGE is not “yet another layer with duplicated storage.” It is training logic built on top of the same memory substrate.&lt;/p&gt;
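&lt;p&gt;One way to picture "training logic on the same substrate" is metadata riding on an existing procedural-memory record. The interface and field names below are assumptions for illustration, not Engram's real schema:&lt;/p&gt;

```typescript
// Illustrative shape only: FORGE training state attached to an existing
// Engram procedural-memory record instead of a second storage system.
// All field names here are hypothetical.
interface ProceduralMemory {
  id: string;
  steps: string[];
  confidence: number;   // Engram's existing trust signal
  lastAccessed: number; // feeds decay/consolidation
  edges: string[];      // links to prerequisite skills
  forge?: {             // FORGE adds metadata, not storage
    certified: boolean;
    trainingRuns: number;
    lastValidated: number;
  };
}

// A procedure becomes a retraining candidate when it is certified but
// has not been validated within the staleness window.
function needsRetraining(m: ProceduralMemory, now: number, staleAfterMs: number): boolean {
  if (!m.forge?.certified) return false;
  return now - m.forge.lastValidated > staleAfterMs;
}
```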


&lt;h2&gt;
  
  
  What this changes in practice
&lt;/h2&gt;

&lt;p&gt;Once you combine these pieces, the agent stops behaving like a thin wrapper around a prompt.&lt;/p&gt;
&lt;h3&gt;
  
  
  Without this stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tools are either overloaded or under-available&lt;/li&gt;
&lt;li&gt;Memory retrieval is noisy or missing&lt;/li&gt;
&lt;li&gt;Learned workflows disappear between sessions&lt;/li&gt;
&lt;li&gt;Specialists are mostly branding&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  With this stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The agent discovers capabilities at the moment of need&lt;/li&gt;
&lt;li&gt;Useful outcomes persist across sessions&lt;/li&gt;
&lt;li&gt;Memory becomes cleaner instead of just larger&lt;/li&gt;
&lt;li&gt;Procedures can compound into reusable skills&lt;/li&gt;
&lt;li&gt;Specialization can be measured instead of merely declared&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the real architecture shift.&lt;/p&gt;


&lt;h2&gt;
  
  
  This is the stack, not just a trick
&lt;/h2&gt;

&lt;p&gt;The Librarian Method is useful on its own. But the bigger story is not just “better tool retrieval.”&lt;/p&gt;

&lt;p&gt;It is this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Question it answers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The Librarian Method&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Which tool should the agent use?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Project Engram&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What should the agent remember?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The Forge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Which remembered procedures count as real expertise?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That progression matters.&lt;/p&gt;

&lt;p&gt;Tool retrieval solves &lt;strong&gt;capability access&lt;/strong&gt;.&lt;br&gt;
Memory solves &lt;strong&gt;continuity&lt;/strong&gt;.&lt;br&gt;
FORGE solves &lt;strong&gt;compounding competence&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is what makes OpenPawz more interesting than a standard tools-plus-prompt system.&lt;/p&gt;


&lt;h2&gt;
  
  
  A concrete example
&lt;/h2&gt;

&lt;p&gt;Imagine a user says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Check the GitHub issue, figure out why the workflow failed, and send me a summary.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A conventional agent might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;load too many tools,&lt;/li&gt;
&lt;li&gt;search memory poorly,&lt;/li&gt;
&lt;li&gt;and start from zero every time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An OpenPawz agent can do something more structured:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the Librarian Method to discover GitHub and messaging tools&lt;/li&gt;
&lt;li&gt;Execute the workflow investigation&lt;/li&gt;
&lt;li&gt;Store the findings in Engram&lt;/li&gt;
&lt;li&gt;Recall related failures later through hybrid search and graph links&lt;/li&gt;
&lt;li&gt;Reuse a previously successful troubleshooting procedure&lt;/li&gt;
&lt;li&gt;Eventually treat that procedure as validated expertise through FORGE&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is not just calling tools.&lt;br&gt;
That is &lt;strong&gt;finding, remembering, and learning&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why this is different from current approaches
&lt;/h2&gt;

&lt;p&gt;A lot of systems optimize one piece in isolation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;What it gets right&lt;/th&gt;
&lt;th&gt;What it misses&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool retrieval only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Better capability routing&lt;/td&gt;
&lt;td&gt;No persistent memory, no compounding expertise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Basic RAG memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Better recall than no memory&lt;/td&gt;
&lt;td&gt;Flat storage, no forgetting, weak procedural learning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prompt specialists&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast to ship&lt;/td&gt;
&lt;td&gt;No verification, no boundaries, no moat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fine-tuning alone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compresses behavior into weights&lt;/td&gt;
&lt;td&gt;Harder to inspect, slower to update, weaker explicit skill tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenPawz stack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tool discovery + memory + earned expertise&lt;/td&gt;
&lt;td&gt;Still unproven at scale; bets that agents should improve over time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is the deeper thesis:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The future agent is not just one that can call tools.&lt;br&gt;
It is one that can &lt;strong&gt;find&lt;/strong&gt;, &lt;strong&gt;remember&lt;/strong&gt;, and &lt;strong&gt;earn&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;At a high level, these ideas show up across the engine like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tool_index&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Semantic tool retrieval and domain expansion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;request_tools&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Agent-facing meta-tool for hot-loading capabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;chat / agent loop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Carry discovered tools across reasoning rounds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;engram/*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Persistent memory, recall, consolidation, graph traversal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;procedural memory + FORGE metadata&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Verified skills, certification state, lineage, and re-training hooks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The conceptual flow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Agent requests the capabilities it actually needs&lt;br&gt;
&lt;code&gt;let tools = request_tools("workflow troubleshooting + GitHub + message follow-up", state);&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Agent uses the discovered tools&lt;br&gt;
&lt;code&gt;let result = run_with_tools(tools, user_request).await?;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Engram stores the useful outcome as memory&lt;br&gt;
&lt;code&gt;engram.capture(result).await?;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Repeated successful procedures can later be evaluated by FORGE&lt;br&gt;
&lt;code&gt;forge.evaluate_procedure_history().await?;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Different layers. One system.&lt;/p&gt;


&lt;h2&gt;
  
  
  The bigger vision
&lt;/h2&gt;

&lt;p&gt;The OpenPawz thesis is not that tools are enough.&lt;/p&gt;

&lt;p&gt;It is that useful agents need three properties at the same time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic capability access&lt;/strong&gt;&lt;br&gt;
The agent should not carry every tool all the time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Structured long-term memory&lt;/strong&gt;&lt;br&gt;
The agent should not forget everything between tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compounding skill formation&lt;/strong&gt;&lt;br&gt;
The agent should not repeat the same learning curve forever.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is why the Librarian Method, Engram, and FORGE belong together.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The Librarian Method is part of OpenPawz. The bigger idea is to pair it with memory and skill growth instead of treating tools as the whole system.&lt;/p&gt;

&lt;p&gt;Ask an agent to do something capability-heavy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Check my GitHub notifications, summarize anything important, and message me if there’s a failing workflow.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent can discover the right tools for the task.&lt;/p&gt;

&lt;p&gt;Then, if the same pattern happens again later, Engram can help it start with memory instead of amnesia.&lt;/p&gt;

&lt;p&gt;And if that procedure becomes well-tested and repeatable, FORGE is the layer that can eventually treat it as earned expertise rather than one-off luck.&lt;/p&gt;


&lt;h2&gt;
  
  
  Read the full specs
&lt;/h2&gt;

&lt;p&gt;The technical references live in the repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/reference/librarian-method.mdx" rel="noopener noreferrer"&gt;The Librarian Method — Full Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/ARCHITECTURE.md" rel="noopener noreferrer"&gt;ARCHITECTURE.md&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star the &lt;a href="https://github.com/OpenPawz/openpawz" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you want to track progress. 🙏&lt;/p&gt;

&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://openpawz.ai/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Fopengraph-image%3Fb0e520dc590f72f0" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://openpawz.ai/" rel="noopener noreferrer" class="c-link"&gt;
            OpenPawz — Your AI, Your Rules
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            A native desktop AI platform that runs fully offline, connects to any provider, and puts you in control. Private by default. Powerful by design.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Ffavicon.ico%3Ffavicon.0b3bf435.ico"&gt;
          openpawz.ai
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;





</description>
      <category>ai</category>
      <category>agents</category>
      <category>memory</category>
      <category>opensource</category>
    </item>
    <item>
      <title>OpenPawz Conductor Protocol</title>
      <dc:creator>Gotham64</dc:creator>
      <pubDate>Fri, 06 Mar 2026 17:54:44 +0000</pubDate>
      <link>https://dev.to/gotham64/openpawz-conductor-protocol-5bb9</link>
      <guid>https://dev.to/gotham64/openpawz-conductor-protocol-5bb9</guid>
      <description>&lt;h2&gt;
  
  
  Every workflow engine executes the same way
&lt;/h2&gt;

&lt;p&gt;Why your workflow engine is stuck in 2D — and how AI-compiled execution fixes it&lt;/p&gt;

&lt;p&gt;n8n, Zapier, Make, Airflow, Prefect — they all do the same thing: walk the graph, node by node, in topological order. Node A finishes, pass data to Node B, Node B finishes, pass data to Node C. Sequential. Synchronous. One step at a time.&lt;/p&gt;

&lt;p&gt;This worked fine when nodes were cheap API calls. But AI workflows are fundamentally different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent nodes are expensive.&lt;/strong&gt; Each one is an LLM call — 2–10 seconds of latency and real token cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chains get long.&lt;/strong&gt; A real pipeline might have 8–20 nodes: trigger → parse → agent analysis → condition → agent rewrite → tool call → agent review → output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Branches are wasted.&lt;/strong&gt; When two independent branches can run in parallel, sequential execution waits for each to finish before starting the next.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cycles are impossible.&lt;/strong&gt; Every platform requires DAGs — directed acyclic graphs. No loops, no feedback, no iterative refinement. Two agents debating until they agree? Can't express that.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The math is brutal
&lt;/h3&gt;

&lt;p&gt;A 10-node flow with 6 agent steps, each averaging 4 seconds of LLM latency:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Execution&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;LLM Calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;n8n / Zapier / Make&lt;/td&gt;
&lt;td&gt;Sequential walk&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24s+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenPawz (Conductor)&lt;/td&gt;
&lt;td&gt;Compiled strategy&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4–8s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2–3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Conductor doesn't skip work. It does the same work &lt;em&gt;smarter&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openpawz.ai" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffg59rxvva9kct0mcfdx7.png" alt="OpenPawz"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OpenPawz/openpawz" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Star the repo — it's open source&lt;/a&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  The invention: compile the graph, don't walk it
&lt;/h2&gt;

&lt;p&gt;The Conductor Protocol treats flow graphs not as programs to execute, but as &lt;strong&gt;blueprints of intent&lt;/strong&gt; that are compiled into optimized execution strategies before a single node runs.&lt;/p&gt;

&lt;p&gt;Traditional platforms interpret flows imperatively — "do this, then this, then this." The Conductor interprets flows declaratively — "here is what needs to happen; let me figure out the fastest way."&lt;/p&gt;

&lt;p&gt;Five optimization primitives make this possible: &lt;strong&gt;Collapse, Extract, Parallelize, Converge,&lt;/strong&gt; and &lt;strong&gt;Tesseract.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Primitive 1: Collapse — merge adjacent agents into one LLM call
&lt;/h2&gt;

&lt;p&gt;Adjacent agent nodes with compatible configurations merge into a single LLM call.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before (traditional):
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent 1: "Summarize this data"        → LLM call (4s)  → result
Agent 2: "Extract key metrics from…"  → LLM call (4s)  → result
Agent 3: "Write a report based on…"   → LLM call (4s)  → result
Total: 3 LLM calls, ~12 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  After (Conductor Collapse):
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Collapsed prompt:
  "Step 1: Summarize this data
   ---STEP_BOUNDARY---
   Step 2: Extract key metrics from the summary above
   ---STEP_BOUNDARY---
   Step 3: Write a report based on the metrics above"
→ 1 LLM call (5s) → parsed back into 3 node outputs
Total: 1 LLM call, ~5 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Two agent nodes can be collapsed when they share the same model and temperature, have no tool invocations configured, and form a direct chain with no branching. The Conductor detects these chains automatically and builds merged prompts. After execution, &lt;code&gt;parseCollapsedOutput()&lt;/code&gt; splits the response back into individual node results using step boundary delimiters.&lt;/p&gt;
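&lt;p&gt;A minimal sketch of the merge-and-split round trip, in the spirit of the &lt;code&gt;parseCollapsedOutput()&lt;/code&gt; step described above. The real implementation lives in the OpenPawz engine; the fallback behavior here is an assumption:&lt;/p&gt;

```typescript
// Sketch: build one merged prompt from N agent prompts, then split the
// single LLM response back into N node outputs at the step boundaries.
const STEP_BOUNDARY = "---STEP_BOUNDARY---";

function buildCollapsedPrompt(prompts: string[]): string {
  return prompts
    .map((p, i) => `Step ${i + 1}: ${p}`)
    .join(`\n${STEP_BOUNDARY}\n`);
}

function parseCollapsedOutput(response: string, nodeCount: number): string[] {
  const parts = response.split(STEP_BOUNDARY).map((s) => s.trim());
  if (parts.length !== nodeCount) {
    // Fallback (assumed): treat the whole response as the final node's
    // output so a malformed reply doesn't crash the flow.
    return [...Array(nodeCount - 1).fill(""), response.trim()];
  }
  return parts;
}
```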


&lt;h2&gt;
  
  
  Primitive 2: Extract — bypass the LLM entirely
&lt;/h2&gt;

&lt;p&gt;Not every node in an AI workflow needs artificial intelligence. Tool calls, HTTP requests, code execution, data transforms — these are fully deterministic. The Conductor classifies each node and routes deterministic work to direct execution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Node Classification&lt;/th&gt;
&lt;th&gt;Execution Path&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM call via engine&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;agent&lt;/code&gt;, &lt;code&gt;squad&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Direct&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bypass LLM — execute via Rust backend&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;tool&lt;/code&gt;, &lt;code&gt;code&lt;/code&gt;, &lt;code&gt;http&lt;/code&gt;, &lt;code&gt;mcp-tool&lt;/code&gt;, &lt;code&gt;loop&lt;/code&gt;, &lt;code&gt;memory&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Passthrough&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No execution — data forwarding only&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;trigger&lt;/code&gt;, &lt;code&gt;output&lt;/code&gt;, &lt;code&gt;error&lt;/code&gt;, &lt;code&gt;group&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In a 10-node flow with 4 agent nodes and 6 direct/passthrough nodes, only the 4 agent nodes are ever routed through the LLM — or fewer, if some of them can be collapsed.&lt;/p&gt;
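&lt;p&gt;The classification table above reads naturally as a routing function. The node-type strings follow the table; the dispatch itself and the default for unknown types are illustrative:&lt;/p&gt;

```typescript
// Route each node type to its execution path, per the table above.
type ExecutionPath = "agent" | "direct" | "passthrough";

const AGENT_NODES = new Set(["agent", "squad"]);
const DIRECT_NODES = new Set(["tool", "code", "http", "mcp-tool", "loop", "memory"]);
const PASSTHROUGH_NODES = new Set(["trigger", "output", "error", "group"]);

function classifyNode(nodeType: string): ExecutionPath {
  if (AGENT_NODES.has(nodeType)) return "agent";
  if (DIRECT_NODES.has(nodeType)) return "direct";
  if (PASSTHROUGH_NODES.has(nodeType)) return "passthrough";
  return "agent"; // assumed default: unknown nodes take the safe (LLM) path
}
```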


&lt;h2&gt;
  
  
  Primitive 3: Parallelize — run independent branches concurrently
&lt;/h2&gt;

&lt;p&gt;When a flow fans out — one node feeding into multiple downstream branches that don't depend on each other — the Conductor detects independent branches via depth analysis and union-find grouping, then runs them simultaneously.&lt;/p&gt;
&lt;h3&gt;
  
  
  Sequential (traditional):
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trigger → classify → summarize → fetch metrics → parse data → output
Total: 6 steps, ~16 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Conductor (parallel):
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase 0: trigger (passthrough)
Phase 1: classify (single agent)
Phase 2: summarize ‖ fetch metrics ‖ parse data  ← all three concurrent
Phase 3: output (passthrough)
Total: 4 phases, ~8 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The grouping algorithm uses &lt;code&gt;groupByDepth()&lt;/code&gt; to assign each node a depth level based on its longest path from roots, then &lt;code&gt;splitIntoIndependentGroups()&lt;/code&gt; uses union-find to identify which nodes within the same depth level share dependencies.&lt;/p&gt;
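&lt;p&gt;A hedged sketch of that two-step grouping: longest-path depth assignment, then union-find over same-depth nodes. The function names mirror the article, but the algorithmic details below are assumptions, not the engine's actual code:&lt;/p&gt;

```typescript
// Step 1: assign each node its longest distance from a root node.
type Edge = { from: string; to: string };

function groupByDepth(nodes: string[], edges: Edge[]): Map<string, number> {
  const depth = new Map(nodes.map((n) => [n, 0]));
  // Repeated relaxation of longest-path depths; fine for DAG-sized flow graphs.
  for (let i = 0; i < nodes.length; i++) {
    for (const { from, to } of edges) {
      const d = (depth.get(from) ?? 0) + 1;
      if (d > (depth.get(to) ?? 0)) depth.set(to, d);
    }
  }
  return depth;
}

// Step 2: union-find over nodes at the same depth, joining nodes that
// share an upstream dependency (grouping semantics assumed).
function splitIntoIndependentGroups(sameDepth: string[], edges: Edge[]): string[][] {
  const parent = new Map(sameDepth.map((n) => [n, n]));
  const find = (x: string): string => {
    while (parent.get(x) !== x) x = parent.get(x)!;
    return x;
  };
  const upstreamOwner = new Map<string, string>();
  for (const n of sameDepth) {
    for (const { from, to } of edges) {
      if (to !== n) continue;
      const owner = upstreamOwner.get(from);
      if (owner) parent.set(find(n), find(owner)); // union
      else upstreamOwner.set(from, n);
    }
  }
  const groups = new Map<string, string[]>();
  for (const n of sameDepth) {
    const r = find(n);
    groups.set(r, [...(groups.get(r) ?? []), n]);
  }
  return [...groups.values()];
}
```

&lt;p&gt;On the fan-out example above, &lt;code&gt;summarize&lt;/code&gt;, &lt;code&gt;fetch metrics&lt;/code&gt;, and &lt;code&gt;parse data&lt;/code&gt; land at the same depth and fall into one group fed by &lt;code&gt;classify&lt;/code&gt;, so they become a single concurrent phase.&lt;/p&gt;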


&lt;h2&gt;
  
  
  Primitive 4: Converge — cycles that no other platform can express
&lt;/h2&gt;

&lt;p&gt;This is the primitive that has &lt;strong&gt;no equivalent in any existing workflow platform.&lt;/strong&gt; n8n, Zapier, Make, Airflow, Prefect — they all require DAGs. Cycles are errors. Feedback loops are impossible.&lt;/p&gt;

&lt;p&gt;But some of the most powerful AI patterns are inherently cyclic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Debate and consensus:&lt;/strong&gt; Two agents argue until they reach agreement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterative refinement:&lt;/strong&gt; A writer and editor pass drafts back and forth until quality stabilizes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-correction:&lt;/strong&gt; An agent checks its own output, finds errors, fixes them, checks again&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-perspective analysis:&lt;/strong&gt; Three analysts each review the others' findings and update their own&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Conductor enables these through &lt;strong&gt;bidirectional edges&lt;/strong&gt; and convergent mesh execution.&lt;/p&gt;
&lt;h3&gt;
  
  
  How convergent meshes work
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;The Conductor detects cycles in the flow graph (nodes connected via bidirectional or reverse edges)&lt;/li&gt;
&lt;li&gt;Overlapping cycles merge into &lt;strong&gt;mesh groups&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Each mesh group executes in iterative rounds:

&lt;ul&gt;
&lt;li&gt;Round 1: Each node executes with its initial input&lt;/li&gt;
&lt;li&gt;Round 2: Each node re-executes with shared context from all other nodes' Round 1 outputs&lt;/li&gt;
&lt;li&gt;Round N: Continue until outputs &lt;strong&gt;converge&lt;/strong&gt; or max iterations are reached&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convergence detection&lt;/strong&gt; uses Jaccard similarity — when consecutive outputs from the same node are ≥85% similar, that node has stabilized&lt;/li&gt;
&lt;li&gt;When all nodes converge (or max iterations hit, default: 5), the mesh completes&lt;/li&gt;
&lt;/ol&gt;
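&lt;p&gt;The convergence check itself is small. Here is a sketch using Jaccard similarity over word sets with the 0.85 threshold from the steps above; the tokenization choice (lowercased whitespace words) is an assumption:&lt;/p&gt;

```typescript
// Jaccard similarity: |A ∩ B| / |A ∪ B| over word sets.
function jaccard(a: string, b: string): number {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (setA.size === 0 && setB.size === 0) return 1;
  let intersection = 0;
  for (const w of setA) if (setB.has(w)) intersection++;
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}

const CONVERGENCE_THRESHOLD = 0.85;

// A node has stabilized when consecutive round outputs are ≥85% similar.
function hasConverged(prevOutput: string, currOutput: string): boolean {
  return jaccard(prevOutput, currOutput) >= CONVERGENCE_THRESHOLD;
}
```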
&lt;h3&gt;
  
  
  Example: Writer–Editor debate
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Round 1:
  Writer: produces initial draft
  Editor: reviews draft, suggests changes

Round 2:
  Writer: revises based on editor feedback
  Editor: reviews revision — "much better, minor grammar fix"

Round 3:
  Writer: applies grammar fix
  Editor: reviews — "looks good, approved" ← 92% similar to Round 2

Convergence detected (0.92 &amp;gt; 0.85 threshold). Mesh complete.
Output: final approved draft flows to downstream nodes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In n8n, you'd need to manually build a loop with external state management and hope it terminates. In Zapier, it's simply impossible.&lt;/p&gt;


&lt;h2&gt;
  
  
  Primitive 5: Tesseract — hyper-dimensional flows
&lt;/h2&gt;

&lt;p&gt;Primitives 1–4 operate on a flat graph. But the Conductor already works in higher dimensions implicitly. When a convergent mesh iterates, each round is a distinct state. When parallel branches run independently before merging, they occupy separate "spaces" that collapse at a join point.&lt;/p&gt;

&lt;p&gt;The Tesseract primitive makes these hidden dimensions &lt;strong&gt;explicit and controllable.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Four dimensions of a workflow
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;Represents&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;1st&lt;/strong&gt; (X)&lt;/td&gt;
&lt;td&gt;Sequence&lt;/td&gt;
&lt;td&gt;Step ordering, causality&lt;/td&gt;
&lt;td&gt;A → B → C&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;2nd&lt;/strong&gt; (Y)&lt;/td&gt;
&lt;td&gt;Parallelism&lt;/td&gt;
&lt;td&gt;Concurrent branches&lt;/td&gt;
&lt;td&gt;A → {B ‖ C} → D&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;3rd&lt;/strong&gt; (Z)&lt;/td&gt;
&lt;td&gt;Depth&lt;/td&gt;
&lt;td&gt;Iteration layers, sub-flow nesting&lt;/td&gt;
&lt;td&gt;Mesh round 1 → 2 → 3 (helix, not loop)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;4th&lt;/strong&gt; (W)&lt;/td&gt;
&lt;td&gt;Phase&lt;/td&gt;
&lt;td&gt;Behavioral mode shifts&lt;/td&gt;
&lt;td&gt;Exploration → Refinement → Convergence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A standard flow is a 2D projection (X × Y). A convergent mesh is a 3D helix (X × Y × Z). A tesseract flow is the full 4D object — independent workflow cells operating across all four dimensions, connecting only at &lt;strong&gt;event horizons&lt;/strong&gt; where they synchronize and merge.&lt;/p&gt;
&lt;h3&gt;
  
  
  Event horizons
&lt;/h3&gt;

&lt;p&gt;An event horizon is where multiple tesseract cells collapse into a single output. It's the 4D equivalent of a join node, but richer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All cells must reach the horizon before the flow continues — hard synchronization&lt;/li&gt;
&lt;li&gt;Phase transitions happen at horizons — the W coordinate shifts&lt;/li&gt;
&lt;li&gt;Depth resets at horizons — completed iterations crystallize into a single state&lt;/li&gt;
&lt;li&gt;Context merges according to configurable policy (&lt;code&gt;concat&lt;/code&gt;, &lt;code&gt;synthesize&lt;/code&gt;, &lt;code&gt;vote&lt;/code&gt;, &lt;code&gt;last-wins&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
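&lt;p&gt;The merge policies in that last bullet can be sketched as a reduction over cell outputs. &lt;code&gt;synthesize&lt;/code&gt; would need an LLM call, so it is omitted here; the other three policies below are straightforward, and the tie-breaking rule for &lt;code&gt;vote&lt;/code&gt; is an assumption:&lt;/p&gt;

```typescript
// Collapse multiple tesseract-cell outputs into one, per merge policy.
type MergePolicy = "concat" | "vote" | "last-wins";

function mergeAtHorizon(outputs: string[], policy: MergePolicy): string {
  switch (policy) {
    case "concat":
      return outputs.join("\n\n");
    case "last-wins":
      return outputs[outputs.length - 1] ?? "";
    case "vote": {
      // Majority vote over identical outputs; ties fall back to first seen.
      const counts = new Map<string, number>();
      for (const o of outputs) counts.set(o, (counts.get(o) ?? 0) + 1);
      let best = outputs[0] ?? "";
      let bestCount = 0;
      for (const [o, c] of counts) {
        if (c > bestCount) { best = o; bestCount = c; }
      }
      return best;
    }
  }
}
```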
&lt;h3&gt;
  
  
  Why this matters
&lt;/h3&gt;

&lt;p&gt;Consider a complex research pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cell A (Exploration):&lt;/strong&gt; Three research agents independently search different domains, iterating with a supervisor (Z=0..3). Phase W=0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cell B (Analysis):&lt;/strong&gt; Two analyst agents debate findings, refining their synthesis (Z=0..2). Phase W=1.&lt;/p&gt;

&lt;p&gt;These cells work independently — different topics, different models, different iteration depths. At the event horizon, their outputs merge: research feeds the analysts, analysis redirects the researchers, and the system transitions to W=2 (convergence phase) where all agents work toward a unified output.&lt;/p&gt;

&lt;p&gt;No other automation platform can represent this. It requires reasoning about time (iteration depth), behavioral mode (phase), and spatial independence (parallel cells) — all simultaneously.&lt;/p&gt;


&lt;h2&gt;
  
  
  Four edge types
&lt;/h2&gt;

&lt;p&gt;Part of the Conductor's power comes from OpenPawz's edge types, which are richer than what any other workflow platform offers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Edge Kind&lt;/th&gt;
&lt;th&gt;Direction&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Enables&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Forward&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A → B&lt;/td&gt;
&lt;td&gt;Normal data flow&lt;/td&gt;
&lt;td&gt;Standard pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reverse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A ← B&lt;/td&gt;
&lt;td&gt;Data pull — B requests from A&lt;/td&gt;
&lt;td&gt;Lazy evaluation, on-demand data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bidirectional&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A ↔ B&lt;/td&gt;
&lt;td&gt;Mutual data exchange&lt;/td&gt;
&lt;td&gt;Cycles, debates, iterative refinement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Error&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A --err→ B&lt;/td&gt;
&lt;td&gt;Failure routing&lt;/td&gt;
&lt;td&gt;Graceful degradation, fallback chains&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;n8n, Zapier, and Make support only forward edges. OpenPawz's reverse and bidirectional edges enable workflow patterns that are &lt;strong&gt;structurally impossible&lt;/strong&gt; on other platforms.&lt;/p&gt;
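&lt;p&gt;To make the table concrete, here is a sketch of the four edge kinds as a discriminated type, with a helper showing the directed data-flow arrows each kind implies (and why bidirectional edges create cycles). The shapes are illustrative, not the engine's schema:&lt;/p&gt;

```typescript
// The four edge kinds from the table, as a discriminated union.
type EdgeKind = "forward" | "reverse" | "bidirectional" | "error";

interface FlowEdge {
  kind: EdgeKind;
  source: string;
  target: string;
}

// Normalize an edge into the directed data-flow arrows it implies.
function dataFlowArrows(e: FlowEdge): Array<[string, string]> {
  switch (e.kind) {
    case "forward":
    case "error":
      return [[e.source, e.target]];
    case "reverse":
      return [[e.target, e.source]]; // B pulls from A on demand
    case "bidirectional":
      return [[e.source, e.target], [e.target, e.source]]; // a cycle
  }
}
```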


&lt;h2&gt;
  
  
  Performance benchmarks
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flow Pattern&lt;/th&gt;
&lt;th&gt;Nodes&lt;/th&gt;
&lt;th&gt;Sequential&lt;/th&gt;
&lt;th&gt;Conductor&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;th&gt;LLM Calls Saved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linear chain (3 agents)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;20–45s&lt;/td&gt;
&lt;td&gt;4–9s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4–5×&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2 (collapse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fan-out (parallel branches)&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;35–70s&lt;/td&gt;
&lt;td&gt;5–10s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5–7×&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 (collapse + parallel)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bidirectional debate&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;∞ (impossible)&lt;/td&gt;
&lt;td&gt;15–25s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;∞&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A (new capability)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production pipeline&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;80–160s&lt;/td&gt;
&lt;td&gt;8–18s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8–10×&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12+ (all primitives)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tesseract research pipeline&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;∞ (impossible)&lt;/td&gt;
&lt;td&gt;20–40s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;∞&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A (new capability)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gains compound: Collapse reduces total LLM calls, Extract eliminates unnecessary ones, Parallelize runs the remaining work concurrently.&lt;/p&gt;


&lt;h2&gt;
  
  
  vs. every other platform
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;n8n&lt;/th&gt;
&lt;th&gt;Zapier&lt;/th&gt;
&lt;th&gt;Make&lt;/th&gt;
&lt;th&gt;Airflow&lt;/th&gt;
&lt;th&gt;OpenPawz Conductor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Execution model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sequential DAG walk&lt;/td&gt;
&lt;td&gt;Sequential DAG walk&lt;/td&gt;
&lt;td&gt;Sequential DAG walk&lt;/td&gt;
&lt;td&gt;Task scheduler (DAG)&lt;/td&gt;
&lt;td&gt;AI-compiled strategy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cycles / feedback loops&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Error&lt;/td&gt;
&lt;td&gt;Error&lt;/td&gt;
&lt;td&gt;Error&lt;/td&gt;
&lt;td&gt;Error&lt;/td&gt;
&lt;td&gt;Convergent Mesh&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM call optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Collapse (N agents → 1 call)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deterministic bypass&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All nodes same path&lt;/td&gt;
&lt;td&gt;All nodes same path&lt;/td&gt;
&lt;td&gt;All nodes same path&lt;/td&gt;
&lt;td&gt;All nodes same path&lt;/td&gt;
&lt;td&gt;Extract (skip LLM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auto-parallelism&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual split/merge&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Manual router&lt;/td&gt;
&lt;td&gt;Executor-level&lt;/td&gt;
&lt;td&gt;Automatic depth analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bidirectional edges&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4D hyper-dimensional flows&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Tesseract + event horizons&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-healing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Retry only&lt;/td&gt;
&lt;td&gt;Retry only&lt;/td&gt;
&lt;td&gt;Retry only&lt;/td&gt;
&lt;td&gt;Error diagnosis + fix proposals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Debug step-through&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Log-based&lt;/td&gt;
&lt;td&gt;Full breakpoints + cursor&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  The fundamental difference
&lt;/h3&gt;

&lt;p&gt;Traditional platforms treat workflows as &lt;strong&gt;imperative programs&lt;/strong&gt; — a fixed sequence of steps the computer follows literally. The Conductor treats workflows as &lt;strong&gt;declarative blueprints&lt;/strong&gt; — a description of what needs to happen, which the system compiles into the most efficient execution plan.&lt;/p&gt;

&lt;p&gt;This is the same conceptual leap that separated SQL from procedural database queries, or React's declarative UI from imperative DOM manipulation. You describe &lt;em&gt;what&lt;/em&gt;, not &lt;em&gt;how&lt;/em&gt;. The runtime figures out &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Natural language to compiled flow
&lt;/h2&gt;

&lt;p&gt;Traditional workflow platforms require dragging nodes, configuring each one, and wiring connections. The Conductor sits at the end of a pipeline that eliminates this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Natural language input&lt;/strong&gt; — User describes a workflow in plain English&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NLP parsing&lt;/strong&gt; — Text-to-flow parser identifies node types, relationships, and configurations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph construction&lt;/strong&gt; — Complete &lt;code&gt;FlowGraph&lt;/code&gt; built with nodes, edges, and positions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conductor compilation&lt;/strong&gt; — Graph analyzed and compiled into an optimized &lt;code&gt;ExecutionStrategy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt; — Strategy runs with all five primitives applied&lt;/li&gt;
&lt;/ol&gt;
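&lt;p&gt;Step 4 is the interesting one. The depth analysis at the heart of compilation can be sketched in a few lines of TypeScript. Note that the &lt;code&gt;FlowGraph&lt;/code&gt; and &lt;code&gt;ExecutionStrategy&lt;/code&gt; shapes below are illustrative stand-ins, not the actual OpenPawz interfaces, and the sketch handles acyclic graphs only (cycles go through the Convergent Mesh instead):&lt;/p&gt;

```typescript
// Illustrative shapes only -- not the real OpenPawz interfaces.
interface FlowNode { id: string; kind: "trigger" | "agent" | "tool" | "condition"; }
interface FlowEdge { from: string; to: string; }
interface FlowGraph { nodes: FlowNode[]; edges: FlowEdge[]; }
// phases[i] holds the ids of nodes that can run concurrently in phase i.
interface ExecutionStrategy { phases: string[][]; }

// Assign each node a depth (longest distance from a root), then group
// equal-depth nodes into concurrent phases.
function compile(graph: FlowGraph): ExecutionStrategy {
  const depth = new Map<string, number>();
  const incoming = (id: string) => graph.edges.filter(e => e.to === id);
  const resolve = (id: string): number => {
    if (depth.has(id)) return depth.get(id)!;
    const parents = incoming(id);
    const d = parents.length === 0 ? 0 : 1 + Math.max(...parents.map(e => resolve(e.from)));
    depth.set(id, d);
    return d;
  };
  graph.nodes.forEach(n => resolve(n.id));
  const phases: string[][] = [];
  for (const n of graph.nodes) {
    const d = depth.get(n.id)!;
    (phases[d] ??= []).push(n.id);
  }
  return { phases };
}
```

&lt;p&gt;Running it on the webhook example below yields exactly the four phases described there.&lt;/p&gt;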

&lt;p&gt;A user types:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"When a webhook fires, have an agent classify the data, then in parallel: summarize it and store it in Airtable, and if it's urgent, post to Slack #alerts"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The parser builds a 7-node flow graph. The Conductor compiles it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 0:&lt;/strong&gt; Trigger (passthrough)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1:&lt;/strong&gt; Agent classify (single LLM call)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2:&lt;/strong&gt; Agent summarize ‖ Airtable store ‖ Condition check — all concurrent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 3:&lt;/strong&gt; Slack post (direct, no LLM)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Airtable and Slack operations execute via Extract — direct MCP calls, zero LLM cost. The agent steps that need intelligence get Collapsed where possible. Independent branches Parallelize automatically.&lt;/p&gt;
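&lt;p&gt;Executing a compiled strategy is then mostly plumbing: phases run in order, and the nodes inside each phase run concurrently. A rough sketch, where the node runner signature is an assumption for illustration:&lt;/p&gt;

```typescript
type NodeRunner = (nodeId: string) => Promise<string>;

// Run phases sequentially; nodes inside a phase run concurrently.
async function runStrategy(phases: string[][], run: NodeRunner): Promise<string[]> {
  const results: string[] = [];
  for (const phase of phases) {
    // Promise.all preserves input order, so results stay deterministic.
    const out = await Promise.all(phase.map(run));
    results.push(...out);
  }
  return results;
}
```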


&lt;h2&gt;
  
  
  Self-healing flows
&lt;/h2&gt;

&lt;p&gt;When a node fails, the Conductor doesn't just retry blindly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Classifies the error&lt;/strong&gt; — timeout, rate-limit, auth, network, invalid-input, config, code-error, api-error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generates a diagnosis&lt;/strong&gt; explaining what went wrong&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proposes fixes&lt;/strong&gt; with confidence scores — e.g., "increase timeout to 60s (0.85)" or "check API key in vault (0.92)"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retries with backoff&lt;/strong&gt; — configurable max retries and exponential delay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routes to error handlers&lt;/strong&gt; — if retry fails, error edges route to fallback nodes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This turns brittle automation into resilient pipelines. A rate-limited API call doesn't crash the flow — it backs off, retries, and if it still fails, routes to a fallback path.&lt;/p&gt;
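&lt;p&gt;The classify-and-retry half of this can be sketched as follows. The error classes come from the list above; the regex patterns and backoff parameters are illustrative, not the engine's actual rules:&lt;/p&gt;

```typescript
type ErrorClass = "timeout" | "rate-limit" | "auth" | "network" | "unknown";

// Very rough classifier over an error message (illustrative patterns only).
function classifyError(message: string): ErrorClass {
  if (/timed? ?out/i.test(message)) return "timeout";
  if (/429|rate.?limit/i.test(message)) return "rate-limit";
  if (/401|403|unauthorized|api key/i.test(message)) return "auth";
  if (/ECONNREFUSED|ENOTFOUND|network/i.test(message)) return "network";
  return "unknown";
}

// Retry with exponential backoff; rethrows after maxRetries so an
// error edge can route execution to a fallback node.
async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 3, baseMs = 100): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      await new Promise(res => setTimeout(res, baseMs * 2 ** attempt));
    }
  }
}
```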


&lt;h2&gt;
  
  
  Part of a trinity
&lt;/h2&gt;

&lt;p&gt;The Conductor Protocol works with two complementary OpenPawz innovations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/reference/librarian-method.mdx" rel="noopener noreferrer"&gt;&lt;strong&gt;The Librarian Method&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Which&lt;/em&gt; tool to use among many?&lt;/td&gt;
&lt;td&gt;Intent-driven discovery via semantic embeddings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/reference/foreman-protocol.mdx" rel="noopener noreferrer"&gt;&lt;strong&gt;The Foreman Protocol&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;How&lt;/em&gt; to execute tools cheaply?&lt;/td&gt;
&lt;td&gt;Worker model delegation via self-describing MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The Conductor Protocol&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;What's the optimal execution plan?&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;AI-compiled flow strategies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In a single flow execution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;Conductor&lt;/strong&gt; compiles the graph into an optimized strategy&lt;/li&gt;
&lt;li&gt;Agent nodes that need tools use the &lt;strong&gt;Librarian&lt;/strong&gt; to discover which ones are relevant&lt;/li&gt;
&lt;li&gt;Tool calls are delegated to the &lt;strong&gt;Foreman&lt;/strong&gt; for cheap or free execution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result: a 20-node flow that would take 2+ minutes on n8n executes in under 20 seconds on OpenPawz, with lower cost and capabilities that other platforms cannot express at all.&lt;/p&gt;


&lt;h2&gt;
  
  
  Read the full spec
&lt;/h2&gt;

&lt;p&gt;The complete technical reference — including TypeScript interfaces, compilation algorithms, and Tesseract implementation details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/reference/conductor-protocol.mdx" rel="noopener noreferrer"&gt;The Conductor Protocol — Full Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/ARCHITECTURE.md" rel="noopener noreferrer"&gt;ARCHITECTURE.md&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star the &lt;a href="https://github.com/OpenPawz/openpawz" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you want to track progress. 🙏&lt;/p&gt;

&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://openpawz.ai/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Fopengraph-image%3Fb0e520dc590f72f0" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://openpawz.ai/" rel="noopener noreferrer" class="c-link"&gt;
            OpenPawz — Your AI, Your Rules
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            A native desktop AI platform that runs fully offline, connects to any provider, and puts you in control. Private by default. Powerful by design.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Ffavicon.ico%3Ffavicon.0b3bf435.ico"&gt;
          openpawz.ai
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;





</description>
      <category>ai</category>
      <category>workflows</category>
      <category>automation</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How OpenPawz secures AI agents: Defense layers from memory encryption to multi-agent governance</title>
      <dc:creator>Gotham64</dc:creator>
      <pubDate>Wed, 04 Mar 2026 18:38:37 +0000</pubDate>
      <link>https://dev.to/gotham64/how-openpawz-secures-ai-agents-defense-layers-from-memory-encryption-to-multi-agent-governance-2jnn</link>
      <guid>https://dev.to/gotham64/how-openpawz-secures-ai-agents-defense-layers-from-memory-encryption-to-multi-agent-governance-2jnn</guid>
      <description>&lt;h2&gt;
  
  
  The security problem with AI agents
&lt;/h2&gt;

&lt;p&gt;AI agents are powerful because they &lt;em&gt;do things&lt;/em&gt; — they read files, run commands, send messages, search your data. That power comes with a question most agent frameworks don't answer well:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What stops the agent from doing things it shouldn't?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most agent systems bolt on safety as an afterthought: a prompt that says "be careful," maybe a regex filter on outputs, and hope for the best. That's not security. That's a suggestion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenPawz&lt;/strong&gt; takes a different approach. We treat agent security as a systems engineering problem — not a prompt engineering one. The result is a &lt;strong&gt;multi-layer defense-in-depth architecture&lt;/strong&gt; enforced at the Rust engine level, where the agent has zero ability to bypass controls regardless of what any prompt says.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openpawz.ai" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffg59rxvva9kct0mcfdx7.png" alt="OpenPawz"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OpenPawz/openpawz" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Star the repo — it's open source&lt;/a&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Zero attack surface by default
&lt;/h2&gt;

&lt;p&gt;OpenPawz exposes &lt;strong&gt;zero network ports&lt;/strong&gt; in its default configuration. There is no HTTP server, no WebSocket endpoint, and no listening socket for an attacker to target. The only communication path is Tauri's in-process IPC — a direct Rust-to-WebView bridge that never touches the network.&lt;/p&gt;

&lt;p&gt;Four optional listeners exist (webhook server, WebChat, WhatsApp bridge, n8n engine), but all are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Disabled by default&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bound to &lt;code&gt;127.0.0.1&lt;/code&gt;&lt;/strong&gt; — unreachable from the network even when enabled&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Individually authenticated&lt;/strong&gt; — bearer tokens, session cookies, IP rate limiting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Binding to &lt;code&gt;0.0.0.0&lt;/code&gt; is a manual opt-in that triggers a security warning and recommends TLS wrapping via Tailscale Funnel.&lt;/p&gt;

&lt;p&gt;The WebView enforces a strict Content Security Policy: &lt;code&gt;default-src 'self'&lt;/code&gt;, &lt;code&gt;script-src 'self'&lt;/code&gt;, &lt;code&gt;object-src 'none'&lt;/code&gt;, &lt;code&gt;frame-ancestors 'none'&lt;/code&gt;. No external scripts, no iframe embedding, no cross-origin form submission.&lt;/p&gt;




&lt;h2&gt;
  
  
  Human-in-the-Loop: every side-effect needs permission
&lt;/h2&gt;

&lt;p&gt;The core design principle: &lt;strong&gt;agents never touch the OS directly.&lt;/strong&gt; Every tool call flows through the Rust tool executor, which classifies it by risk before deciding whether to proceed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto-approved (no modal)
&lt;/h3&gt;

&lt;p&gt;Read-only and informational tools run without interruption — &lt;code&gt;read_file&lt;/code&gt;, &lt;code&gt;web_search&lt;/code&gt;, &lt;code&gt;memory_search&lt;/code&gt;, &lt;code&gt;soul_read&lt;/code&gt;, &lt;code&gt;self_info&lt;/code&gt;, &lt;code&gt;email_read&lt;/code&gt;, &lt;code&gt;slack_read&lt;/code&gt;, &lt;code&gt;create_task&lt;/code&gt;, and others. No friction for safe operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Requires approval (modal shown)
&lt;/h3&gt;

&lt;p&gt;Side-effect tools pause execution and show a risk-classified modal to the user:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Risk Level&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Critical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auto-denied by default; red modal requiring the user to type "ALLOW"&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sudo rm -rf /&lt;/code&gt;, &lt;code&gt;curl | sh&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Orange warning modal&lt;/td&gt;
&lt;td&gt;&lt;code&gt;chmod 777&lt;/code&gt;, &lt;code&gt;kill -9&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yellow caution modal&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;npm install&lt;/code&gt;, outbound HTTP requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standard approval&lt;/td&gt;
&lt;td&gt;Unknown exec commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Safe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auto-approved via allowlist (90+ default patterns)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;git status&lt;/code&gt;, &lt;code&gt;ls&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
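&lt;p&gt;A toy version of the risk classifier, using a handful of the patterns from the table. The real engine ships 90+ allowlist patterns and 30+ danger patterns; these few are only for illustration:&lt;/p&gt;

```typescript
type Risk = "critical" | "high" | "medium" | "low" | "safe";

// Order matters: the most dangerous classes are checked first, and
// anything unrecognized falls through to "low" (standard approval).
const CRITICAL = [/\bsudo\b/, /\brm\s+-rf\s+\//, /curl[^|]*\|\s*(sh|bash)/];
const HIGH = [/\bchmod\s+(-R\s+)?777\b/, /\bkill\s+-9\b/];
const SAFE = [/^git status\b/, /^ls\b/, /^cat\b/];
const MEDIUM = [/\bnpm\s+install\b/, /\bcurl\b/, /\bwget\b/];

function classifyCommand(cmd: string): Risk {
  if (CRITICAL.some(p => p.test(cmd))) return "critical";
  if (HIGH.some(p => p.test(cmd))) return "high";
  if (SAFE.some(p => p.test(cmd))) return "safe";
  if (MEDIUM.some(p => p.test(cmd))) return "medium";
  return "low";
}
```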

&lt;h3&gt;
  
  
  Danger pattern detection
&lt;/h3&gt;

&lt;p&gt;30+ patterns across multiple categories are caught before they can execute:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privilege escalation&lt;/strong&gt; — &lt;code&gt;sudo&lt;/code&gt;, &lt;code&gt;su&lt;/code&gt;, &lt;code&gt;doas&lt;/code&gt;, &lt;code&gt;pkexec&lt;/code&gt;, &lt;code&gt;runas&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destructive deletion&lt;/strong&gt; — &lt;code&gt;rm -rf /&lt;/code&gt;, &lt;code&gt;rm -rf ~&lt;/code&gt;, &lt;code&gt;rm -rf /*&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission exposure&lt;/strong&gt; — &lt;code&gt;chmod 777&lt;/code&gt;, &lt;code&gt;chmod -R 777&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk destruction&lt;/strong&gt; — &lt;code&gt;dd if=&lt;/code&gt;, &lt;code&gt;mkfs&lt;/code&gt;, &lt;code&gt;fdisk&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote code execution&lt;/strong&gt; — &lt;code&gt;curl | sh&lt;/code&gt;, &lt;code&gt;wget | bash&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process termination&lt;/strong&gt; — &lt;code&gt;kill -9 1&lt;/code&gt;, &lt;code&gt;killall&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Firewall manipulation&lt;/strong&gt; — &lt;code&gt;iptables -F&lt;/code&gt;, &lt;code&gt;ufw disable&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network exfiltration&lt;/strong&gt; — piping file contents to &lt;code&gt;curl&lt;/code&gt;, &lt;code&gt;scp&lt;/code&gt; outbound, &lt;code&gt;/dev/tcp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Users can add custom regex rules for both allow and deny lists. The session override feature ("allow all" for a timed window) still blocks privilege escalation commands — you can't override the most dangerous class.&lt;/p&gt;




&lt;h2&gt;
  
  
  Agent governance: four policy presets
&lt;/h2&gt;

&lt;p&gt;Not every agent should have the same power. OpenPawz provides per-agent tool access control with four built-in presets and support for custom policies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Preset&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unrestricted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;unrestricted&lt;/td&gt;
&lt;td&gt;Full tool access, no constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;denylist&lt;/td&gt;
&lt;td&gt;All tools available, but high-risk tools always require human approval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read-Only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;allowlist&lt;/td&gt;
&lt;td&gt;Only safe read/search/list operations (28 tools)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sandbox&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;allowlist&lt;/td&gt;
&lt;td&gt;Only 5 tools: &lt;code&gt;web_search&lt;/code&gt;, &lt;code&gt;web_read&lt;/code&gt;, &lt;code&gt;memory_store&lt;/code&gt;, &lt;code&gt;memory_search&lt;/code&gt;, &lt;code&gt;self_info&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Policies are enforced at two levels simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: &lt;code&gt;checkToolPolicy()&lt;/code&gt; evaluates per-tool decisions and strips unauthorized tools from the request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: &lt;code&gt;ChatRequest.tool_filter&lt;/code&gt; carries the allowed tool list to the Rust engine — the agent literally cannot see tools it doesn't have access to&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means a sandboxed research agent physically cannot call &lt;code&gt;exec&lt;/code&gt; or &lt;code&gt;write_file&lt;/code&gt;, regardless of what its prompt says. The tools don't exist in its schema.&lt;/p&gt;
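&lt;p&gt;A minimal sketch of how an allowlist policy strips tools before the request ever reaches the model. The names and shapes here are illustrative, not the actual &lt;code&gt;checkToolPolicy()&lt;/code&gt; implementation:&lt;/p&gt;

```typescript
interface ToolPolicy {
  mode: "unrestricted" | "allowlist" | "denylist";
  // Allowed tools (allowlist mode) or approval-gated tools (denylist mode).
  tools: string[];
}

// Returns the tool names the agent is permitted to see. Under an
// allowlist, everything else simply does not exist in the agent's schema.
function filterTools(available: string[], policy: ToolPolicy): string[] {
  if (policy.mode !== "allowlist") return available;
  return available.filter(t => policy.tools.includes(t));
}

// The Sandbox preset from the table above.
const sandbox: ToolPolicy = {
  mode: "allowlist",
  tools: ["web_search", "web_read", "memory_store", "memory_search", "self_info"],
};
```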




&lt;h2&gt;
  
  
  Memory encryption: three independent defense layers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Project Engram&lt;/strong&gt; — the memory system — applies defense-in-depth to all stored agent memories (episodic, semantic, and procedural). Even if an attacker gains access to the SQLite database file, the data remains protected.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Per-agent HKDF key derivation
&lt;/h3&gt;

&lt;p&gt;A single master key lives in the OS keychain (&lt;code&gt;paw-memory-vault&lt;/code&gt;). From it, three independent key families are derived via &lt;strong&gt;HKDF-SHA256 domain separation&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;HKDF Salt&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent encryption&lt;/td&gt;
&lt;td&gt;&lt;code&gt;engram-agent-key-v1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-agent AES-256-GCM memory encryption&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snapshot HMAC&lt;/td&gt;
&lt;td&gt;&lt;code&gt;engram-snapshot-hmac-v1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tamper detection for working memory snapshots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capability signing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;engram-platform-cap-v1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;HMAC-SHA256 signing of capability tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every agent gets a &lt;strong&gt;unique derived key&lt;/strong&gt;. Cross-agent decryption is computationally infeasible without the master key. Compromising one agent's derived key does not expose any other agent's memories.&lt;/p&gt;
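&lt;p&gt;Node's built-in &lt;code&gt;hkdfSync&lt;/code&gt; is enough to illustrate the scheme. The salt string comes from the table above; the 32-byte key size and the use of the agent id as the HKDF &lt;code&gt;info&lt;/code&gt; parameter are assumptions made for this sketch:&lt;/p&gt;

```typescript
import { hkdfSync } from "node:crypto";

// Derive a 32-byte AES-256 key for one agent from the vault master key.
// Domain separation comes from the salt; per-agent separation comes
// from the info parameter (the agent id).
function deriveAgentKey(masterKey: Buffer, agentId: string): Buffer {
  return Buffer.from(
    hkdfSync("sha256", masterKey, "engram-agent-key-v1", agentId, 32)
  );
}
```

&lt;p&gt;Derivation is deterministic per agent but produces unrelated keys across agents, which is exactly the isolation property described above.&lt;/p&gt;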

&lt;h3&gt;
  
  
  Layer 2: SQL scope filtering
&lt;/h3&gt;

&lt;p&gt;Every memory query includes scope constraints at the SQL level — &lt;code&gt;agent_id&lt;/code&gt;, &lt;code&gt;project_id&lt;/code&gt;, &lt;code&gt;squad_id&lt;/code&gt;. Even without encryption, the query layer enforces isolation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Signed capability tokens
&lt;/h3&gt;

&lt;p&gt;Every &lt;code&gt;gated_search()&lt;/code&gt; call (the unified memory retrieval entry point) performs 4-step cryptographic verification:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;HMAC signature integrity&lt;/strong&gt; — token verified against the platform signing key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity binding&lt;/strong&gt; — the token's &lt;code&gt;agent_id&lt;/code&gt; must match the requesting agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope ceiling check&lt;/strong&gt; — requested search scope cannot exceed the token's &lt;code&gt;max_scope&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Membership verification&lt;/strong&gt; — for squad/project scopes, the agent must actually belong to that squad or project&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This prevents confused-deputy attacks where an agent could be tricked into reading another agent's memories.&lt;/p&gt;
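&lt;p&gt;The first three verification steps can be sketched with plain HMAC primitives. The token shape here is illustrative, not the real Engram structure, and step 4 (membership verification) is omitted because it requires a database lookup:&lt;/p&gt;

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Illustrative token shape -- not the real Engram structure.
interface CapabilityToken { agentId: string; maxScope: number; sig: string; }

const RANK = { targeted: 1, squad: 2, project: 3, global: 4 };

function sign(agentId: string, maxScope: number, secret: Buffer): string {
  return createHmac("sha256", secret).update(`${agentId}:${maxScope}`).digest("hex");
}

// Steps 1-3 of the verification chain: signature integrity, identity
// binding, scope ceiling.
function verify(token: CapabilityToken, requesterId: string,
                requestedScope: number, secret: Buffer): boolean {
  const expected = Buffer.from(sign(token.agentId, token.maxScope, secret), "hex");
  const actual = Buffer.from(token.sig, "hex");
  if (actual.length !== expected.length || !timingSafeEqual(actual, expected)) return false;
  if (token.agentId !== requesterId) return false; // identity binding
  return requestedScope <= token.maxScope;         // scope ceiling
}
```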




&lt;h2&gt;
  
  
  Automatic PII detection and field-level encryption
&lt;/h2&gt;

&lt;p&gt;Before any memory is stored, it passes through a &lt;strong&gt;two-layer PII scanner&lt;/strong&gt; with 17 regex pattern types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 (regex patterns):&lt;/strong&gt; Social Security Numbers, credit card numbers, email addresses, phone numbers, physical addresses, person names, government IDs, JWT tokens, AWS access keys, private keys, IBANs, IPv4 addresses, API keys, passwords, and dates of birth, among others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 (LLM-assisted):&lt;/strong&gt; A secondary scanner catches context-dependent PII that static regex cannot detect — phrases like "my mother's maiden name is Smith" or "I was born in Springfield." The LLM returns structured JSON with PII type classifications and confidence scores.&lt;/p&gt;

&lt;p&gt;Content is classified into three tiers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Treatment&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cleartext&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No PII detected&lt;/td&gt;
&lt;td&gt;Stored as-is&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sensitive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PII detected (email, name, phone, IP)&lt;/td&gt;
&lt;td&gt;AES-256-GCM encrypted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Confidential&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-sensitivity PII (SSN, credit card, JWT, AWS key, private key)&lt;/td&gt;
&lt;td&gt;AES-256-GCM encrypted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Encrypted content uses the format &lt;code&gt;enc:v1:base64(nonce ‖ ciphertext ‖ tag)&lt;/code&gt;. A fresh 96-bit nonce is generated per encryption operation. Decryption is transparent on retrieval using the per-agent derived key.&lt;/p&gt;
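&lt;p&gt;The storage format maps directly onto Node's AES-256-GCM API. A sketch of the round trip, with error handling and associated data omitted:&lt;/p&gt;

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// enc:v1:base64(nonce || ciphertext || tag), fresh 96-bit nonce per call.
function encryptField(plaintext: string, key: Buffer): string {
  const nonce = randomBytes(12); // 96-bit GCM nonce
  const cipher = createCipheriv("aes-256-gcm", key, nonce);
  const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const packed = Buffer.concat([nonce, ct, cipher.getAuthTag()]);
  return `enc:v1:${packed.toString("base64")}`;
}

function decryptField(stored: string, key: Buffer): string {
  const packed = Buffer.from(stored.slice("enc:v1:".length), "base64");
  const nonce = packed.subarray(0, 12);
  const tag = packed.subarray(packed.length - 16);       // 16-byte GCM tag
  const ct = packed.subarray(12, packed.length - 16);
  const decipher = createDecipheriv("aes-256-gcm", key, nonce);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ct), decipher.final()]).toString("utf8");
}
```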

&lt;h3&gt;
  
  
  Key rotation
&lt;/h3&gt;

&lt;p&gt;An automated key rotation scheduler runs on a configurable interval (default: 90 days) and re-encrypts all agent memories with fresh HKDF-derived keys. The rotation is atomic — if any re-encryption fails, the entire batch rolls back. No data is left in a half-migrated state.&lt;/p&gt;




&lt;h2&gt;
  
  
  Inter-agent memory bus: scoped, signed, rate-limited
&lt;/h2&gt;

&lt;p&gt;When multiple agents need to share information, the &lt;strong&gt;Memory Bus&lt;/strong&gt; provides pub/sub memory sharing with publish-side authentication to prevent memory poisoning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Capability tokens
&lt;/h3&gt;

&lt;p&gt;Every agent holds an &lt;code&gt;AgentCapability&lt;/code&gt; signed with HMAC-SHA256 against a platform-held secret key. The token specifies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Max publication scope&lt;/strong&gt; — Targeted (specific agents), Squad, Project, or Global&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Importance ceiling&lt;/strong&gt; — the maximum importance an agent can self-assign (0.0–1.0)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write permission&lt;/strong&gt; — whether the agent can publish at all&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limit&lt;/strong&gt; — maximum publications per consolidation cycle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scope hierarchy is a strict linear lattice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Targeted (rank 1) &amp;lt; Squad (rank 2) &amp;lt; Project (rank 3) &amp;lt; Global (rank 4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;An agent with &lt;code&gt;max_scope = Squad&lt;/code&gt; can publish to targeted agents or its squad, but &lt;strong&gt;cannot&lt;/strong&gt; publish to the project or global scope. Ceiling enforcement uses a simple rank comparison — no ambiguity, no escalation path.&lt;/p&gt;
&lt;h3&gt;
  
  
  Trust-weighted contradiction resolution
&lt;/h3&gt;

&lt;p&gt;When two agents publish contradictory facts on the same topic, the system resolves it based on:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;effective_importance = raw_importance × agent_trust_score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The memory with the higher effective importance is retained. Trust scores are per-agent (0.0–1.0) and adjustable at runtime. This prevents a compromised or low-trust agent from overwriting facts established by high-trust agents through recency alone.&lt;/p&gt;
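&lt;p&gt;The resolution rule is small enough to state directly in code (the &lt;code&gt;Fact&lt;/code&gt; shape is illustrative):&lt;/p&gt;

```typescript
interface Fact { content: string; rawImportance: number; publisherTrust: number; }

// effective_importance = raw_importance x agent_trust_score; the
// higher-scoring fact wins, regardless of which one arrived last.
function resolveContradiction(a: Fact, b: Fact): Fact {
  const ea = a.rawImportance * a.publisherTrust;
  const eb = b.rawImportance * b.publisherTrust;
  return eb > ea ? b : a;
}
```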
&lt;h3&gt;
  
  
  Publish-side defenses
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Defense&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope enforcement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Publication scope clamped to agent's maximum&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Importance ceiling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Publication importance clamped to agent's ceiling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per-agent rate limiting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Publish count tracked per GC window; exceeded limits return an error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Injection scanning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All publication content scanned for prompt injection patterns before entering the bus&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
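&lt;p&gt;The first two defenses are pure clamping, which a sketch makes concrete. Rate limiting and injection scanning are separate steps in the real bus and are omitted here; shapes are illustrative:&lt;/p&gt;

```typescript
type Scope = "targeted" | "squad" | "project" | "global";
const SCOPE_RANK: Record<Scope, number> = { targeted: 1, squad: 2, project: 3, global: 4 };

interface Publication { scope: Scope; importance: number; }
interface PublisherCaps { maxScope: Scope; importanceCeiling: number; }

// Clamp a publication to the publisher's capability: scope is lowered
// to the agent's ceiling, importance is capped at its ceiling.
function clampPublication(pub: Publication, caps: PublisherCaps): Publication {
  const scope = SCOPE_RANK[pub.scope] > SCOPE_RANK[caps.maxScope] ? caps.maxScope : pub.scope;
  return { scope, importance: Math.min(pub.importance, caps.importanceCeiling) };
}
```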
&lt;h3&gt;
  
  
  Threat model
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attack&lt;/th&gt;
&lt;th&gt;Mitigation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent floods bus with poisoned memories&lt;/td&gt;
&lt;td&gt;Rate limit + injection scan on publish&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-trust agent overwrites high-trust facts&lt;/td&gt;
&lt;td&gt;Trust-weighted contradiction resolution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent publishes beyond its authority&lt;/td&gt;
&lt;td&gt;Scope ceiling enforcement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forged capability token&lt;/td&gt;
&lt;td&gt;HMAC-SHA256 verification against platform secret&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-agent memory reads via confused deputy&lt;/td&gt;
&lt;td&gt;Signed read-path tokens with identity binding + membership verification&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  Multi-agent orchestration: delegation with guardrails
&lt;/h2&gt;

&lt;p&gt;OpenPawz supports four distinct agent-to-agent communication patterns, each with its own security model:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Orchestrator projects (boss/worker hierarchy)
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;boss agent&lt;/strong&gt; receives a project goal and team roster, then delegates tasks to &lt;strong&gt;worker agents&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Control&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per-agent capabilities filter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Each sub-agent gets a &lt;code&gt;capabilities&lt;/code&gt; list restricting which tools it can access — tools not on the list are physically removed from the agent's schema&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HIL on exfiltration tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;email_send&lt;/code&gt;, &lt;code&gt;slack_send&lt;/code&gt;, &lt;code&gt;webhook_send&lt;/code&gt;, &lt;code&gt;rest_api_call&lt;/code&gt;, &lt;code&gt;exec&lt;/code&gt;, &lt;code&gt;write_file&lt;/code&gt;, &lt;code&gt;delete_file&lt;/code&gt; always require user approval — even under orchestrator delegation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max tool rounds&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Global cap (default 20) bounds every agent loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max concurrent runs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Default 4 simultaneous agent runs across the entire engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Worker exit conditions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Workers stop on &lt;code&gt;report_progress(done)&lt;/code&gt;, max tool rounds, or error — they cannot run indefinitely&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  2. The Foreman Protocol (architect/worker split)
&lt;/h3&gt;

&lt;p&gt;For MCP tool execution, the Foreman Protocol splits agent work into two roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Architect&lt;/strong&gt; (cloud LLM): Plans and reasons — decides &lt;em&gt;what&lt;/em&gt; to do&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Foreman&lt;/strong&gt; (local/cheap model): Executes &lt;em&gt;how&lt;/em&gt; — handles MCP tool calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Critical security constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No recursion&lt;/strong&gt; — the Foreman cannot spawn sub-workers or delegate further&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8-round cap&lt;/strong&gt; — max 8 tool call rounds per delegation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct MCP execution&lt;/strong&gt; — Foreman calls MCP servers via JSON-RPC directly&lt;/li&gt;
&lt;/ul&gt;
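&lt;p&gt;The 8-round cap amounts to a hard loop bound the Foreman cannot extend (a minimal Python sketch; &lt;code&gt;execute_tool_call&lt;/code&gt; is a stand-in for the real JSON-RPC dispatch):&lt;/p&gt;

```python
MAX_ROUNDS = 8  # per-delegation cap from the constraints above

def run_foreman(execute_tool_call, planned_calls):
    """Bounded execution loop: no recursion, at most MAX_ROUNDS rounds."""
    results = []
    for round_no, call in enumerate(planned_calls):
        if round_no >= MAX_ROUNDS:
            break  # hard stop — the Foreman cannot extend its own budget
        results.append(execute_tool_call(call))
    return results
```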
&lt;h3&gt;
  
  
  3. Squads (peer-to-peer collaboration)
&lt;/h3&gt;

&lt;p&gt;Flat peer groups with channel-based messaging. No boss/worker hierarchy, but scoped by squad membership.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Direct agent messaging
&lt;/h3&gt;

&lt;p&gt;Any agent can message any other agent via the &lt;code&gt;agent_send_message&lt;/code&gt; tool. Broadcast messages are visible to all agents, and channel-based filtering is available.&lt;/p&gt;


&lt;h2&gt;
  
  
  Anti-forensic protections
&lt;/h2&gt;

&lt;p&gt;The memory store mitigates &lt;strong&gt;vault-size oracle attacks&lt;/strong&gt; — a side-channel where an attacker infers how many memories are stored by watching the SQLite file size. This is the same threat class addressed by KDBX (KeePass) inner-content padding.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mitigation&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bucket padding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database padded to 512KB boundaries via padding table — an observer can only determine a coarse size bucket, not exact memory count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Secure erasure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Two-phase delete: content fields overwritten with empty values, then row deleted — prevents plaintext recovery from freed pages or WAL replay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;8KB page size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;PRAGMA page_size = 8192&lt;/code&gt; reduces file-size measurement granularity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Secure delete&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;PRAGMA secure_delete = ON&lt;/code&gt; zeroes freed B-tree pages at the SQLite layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Incremental auto-vacuum&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prevents immediate file-size shrinkage after deletions (which would reveal deletion count)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
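&lt;p&gt;The bucket math itself is simple (a Python sketch; the padding-table mechanics that actually grow the SQLite file are omitted):&lt;/p&gt;

```python
BUCKET = 512 * 1024  # 512KB padding boundary from the table above

def padded_size(raw_bytes, bucket=BUCKET):
    """Round the vault size up to the next bucket boundary, so an
    observer learns only a coarse size bucket, not the memory count."""
    buckets = -(-raw_bytes // bucket)  # ceiling division
    return max(buckets, 1) * bucket
```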
&lt;h3&gt;
  
  
  Working memory snapshot integrity
&lt;/h3&gt;

&lt;p&gt;Snapshots of an agent's working memory (saved on agent switch or session end) include an &lt;strong&gt;HMAC-SHA256 integrity tag&lt;/strong&gt; computed from a dedicated HKDF-derived key. On restore, the HMAC is verified — tampered snapshots are rejected and logged.&lt;/p&gt;
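&lt;p&gt;The tag-and-verify step looks roughly like this (a Python sketch; the HKDF derivation of the dedicated key is omitted):&lt;/p&gt;

```python
import hashlib
import hmac

def tag_snapshot(key, snapshot_bytes):
    """Compute the HMAC-SHA256 integrity tag for a serialized snapshot."""
    return hmac.new(key, snapshot_bytes, hashlib.sha256).hexdigest()

def verify_snapshot(key, snapshot_bytes, tag):
    """Constant-time comparison; tampered snapshots are rejected."""
    expected = tag_snapshot(key, snapshot_bytes)
    return hmac.compare_digest(expected, tag)
```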


&lt;h2&gt;
  
  
  Credential security
&lt;/h2&gt;

&lt;p&gt;No cryptographic key is ever stored on the filesystem. Everything lives in the &lt;strong&gt;OS keychain&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Key&lt;/th&gt;
&lt;th&gt;Keychain Entry&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DB encryption key&lt;/td&gt;
&lt;td&gt;&lt;code&gt;paw-db-encryption&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AES-256-GCM database field encryption&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill vault key&lt;/td&gt;
&lt;td&gt;&lt;code&gt;paw-skill-vault&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AES-256-GCM skill credential encryption&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory vault key&lt;/td&gt;
&lt;td&gt;&lt;code&gt;paw-memory-vault&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Master key for HKDF per-agent memory encryption&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lock screen hash&lt;/td&gt;
&lt;td&gt;&lt;code&gt;paw-lock-screen&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SHA-256 hashed passphrase&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;There is no &lt;code&gt;device.json&lt;/code&gt;, no key file, and no config file containing secrets. If the OS keychain is unavailable, the app &lt;strong&gt;refuses to store credentials&lt;/strong&gt; rather than falling back to plaintext. No silent degradation.&lt;/p&gt;
&lt;h3&gt;
  
  
  API key zeroing in memory
&lt;/h3&gt;

&lt;p&gt;API keys in provider structs are wrapped in &lt;code&gt;Zeroizing&amp;lt;String&amp;gt;&lt;/code&gt; from the &lt;code&gt;zeroize&lt;/code&gt; crate. When a provider is dropped, the key memory is immediately zeroed using &lt;code&gt;write_volatile&lt;/code&gt; — preventing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory dump attacks (forensic tools scanning process memory)&lt;/li&gt;
&lt;li&gt;Swap file leaks (unencrypted keys persisted to disk via OS paging)&lt;/li&gt;
&lt;li&gt;Use-after-free (freed memory still containing the key being reallocated)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Credential audit trail
&lt;/h3&gt;

&lt;p&gt;Every credential access is logged to &lt;code&gt;credential_activity_log&lt;/code&gt; with action, requesting tool, allow/deny decision, and timestamp.&lt;/p&gt;


&lt;h2&gt;
  
  
  TLS certificate pinning
&lt;/h2&gt;

&lt;p&gt;All AI provider connections use a &lt;strong&gt;certificate-pinned TLS configuration&lt;/strong&gt; via &lt;code&gt;rustls&lt;/code&gt;. The OS trust store is explicitly excluded.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Library&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rustls&lt;/code&gt; 0.23 (pure-Rust, no OpenSSL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Root store&lt;/td&gt;
&lt;td&gt;Mozilla root certificates via &lt;code&gt;webpki-roots&lt;/code&gt; only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS trust store&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Explicitly excluded&lt;/strong&gt; — system CAs are never consulted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Connect timeout&lt;/td&gt;
&lt;td&gt;10 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Request timeout&lt;/td&gt;
&lt;td&gt;120 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Why this matters: most TLS MITM attacks rely on installing a custom root CA on the victim's machine (corporate proxies, malware, government surveillance). By pinning to Mozilla's root store, OpenPawz rejects certificates signed by any CA outside that store, even if the OS trusts it.&lt;/p&gt;
&lt;h3&gt;
  
  
  Outbound request signing
&lt;/h3&gt;

&lt;p&gt;Every AI provider request is fingerprinted with a SHA-256 hash before transmission:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SHA-256(provider ‖ model ‖ ISO-8601 timestamp ‖ request body)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Hashes are logged to an in-memory ring buffer (500 entries) for tamper detection and compliance auditing. If a proxy modifies the request body in transit, the recorded hash won't match.&lt;/p&gt;
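&lt;p&gt;A Python sketch of the fingerprint and the ring buffer (the field separator is an assumption — the scheme above doesn't specify one):&lt;/p&gt;

```python
import hashlib

def request_fingerprint(provider, model, timestamp_iso, body):
    """SHA-256 over provider, model, ISO-8601 timestamp, and request body."""
    material = "\x1f".join([provider, model, timestamp_iso, body])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

RING_SIZE = 500  # in-memory ring buffer capacity from the article

class HashRing:
    """Keeps only the newest RING_SIZE fingerprints for audit comparison."""
    def __init__(self, size=RING_SIZE):
        self.size, self.entries = size, []

    def record(self, digest):
        self.entries.append(digest)
        self.entries = self.entries[-self.size:]
```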


&lt;h2&gt;
  
  
  Prompt injection defense
&lt;/h2&gt;

&lt;p&gt;Dual-implementation scanning (TypeScript + Rust) for 30+ injection patterns across 9 categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Override&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Ignore previous instructions"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Identity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"You are now..."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jailbreak&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"DAN mode", "no restrictions"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Leaking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Show me your system prompt"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Obfuscation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Base64-encoded instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool injection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fake tool call formatting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Social engineering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"As an AI researcher..."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Markup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hidden instructions in HTML/markdown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bypass&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"This is just a test..."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Messages scoring &lt;strong&gt;Critical&lt;/strong&gt; (40+) are blocked entirely and never delivered to the agent. Channel bridges automatically enforce this.&lt;/p&gt;
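&lt;p&gt;The scoring approach can be sketched with a few illustrative patterns (the real scanners cover 30+ patterns across 9 categories; the per-pattern weights here are assumptions, only the 40-point Critical cutoff comes from the article):&lt;/p&gt;

```python
import re

PATTERNS = [
    (re.compile(r"ignore (all |previous |prior )*instructions", re.I), "override", 25),
    (re.compile(r"you are now", re.I), "identity", 20),
    (re.compile(r"\bDAN mode\b|no restrictions", re.I), "jailbreak", 25),
    (re.compile(r"system prompt", re.I), "leaking", 20),
]
CRITICAL = 40  # messages at or above this score are blocked entirely

def scan(message):
    """Sum the weights of matched patterns; block at the Critical tier."""
    score = sum(w for rx, _category, w in PATTERNS if rx.search(message))
    return score, score >= CRITICAL
```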
&lt;h3&gt;
  
  
  Memory-side injection scanning
&lt;/h3&gt;

&lt;p&gt;Recalled memories are scanned for 10 injection patterns before being returned to agent context. Suspicious content is redacted with &lt;code&gt;[REDACTED:injection]&lt;/code&gt; markers — poisoned memories cannot manipulate future agent behavior.&lt;/p&gt;


&lt;h2&gt;
  
  
  Anti-fixation defenses
&lt;/h2&gt;

&lt;p&gt;Five layers prevent agents from ignoring user instructions or getting stuck:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Defense&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Response loop detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Jaccard similarity checks catch the agent repeating itself — active on ALL channels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User override detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Recognizes "stop", "focus on my question", "that's not what I asked" across 5 phrase categories with 3-level escalation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unidirectional topic ignorance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Catches unique-but-wrong responses after a redirect — fires when the agent's response has zero entity overlap with the user's keywords&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Momentum clearing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clears working memory trajectory embeddings on user override — recalled context serves the new topic, not the old one&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool-call loop breaker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hash-based signature detection stops repeated identical tool calls after 3 consecutive matches&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
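&lt;p&gt;The response-loop check from the first row can be sketched with token-set Jaccard similarity (the 0.9 threshold is an assumption — the article doesn't give the cutoff):&lt;/p&gt;

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two responses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

LOOP_THRESHOLD = 0.9  # assumed cutoff for "the agent is repeating itself"

def is_looping(new_response, recent_responses):
    return any(jaccard(new_response, r) >= LOOP_THRESHOLD
               for r in recent_responses)
```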


&lt;h2&gt;
  
  
  Filesystem sandboxing
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Sensitive path blocking
&lt;/h3&gt;

&lt;p&gt;20+ sensitive paths are permanently blocked from agent access:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;~/.ssh&lt;/code&gt; · &lt;code&gt;~/.gnupg&lt;/code&gt; · &lt;code&gt;~/.aws&lt;/code&gt; · &lt;code&gt;~/.kube&lt;/code&gt; · &lt;code&gt;~/.docker&lt;/code&gt; · &lt;code&gt;~/.password-store&lt;/code&gt; · &lt;code&gt;/etc&lt;/code&gt; · &lt;code&gt;/root&lt;/code&gt; · &lt;code&gt;/proc&lt;/code&gt; · &lt;code&gt;/sys&lt;/code&gt; · &lt;code&gt;/dev&lt;/code&gt; · filesystem root · home directory root&lt;/p&gt;
&lt;h3&gt;
  
  
  Per-project scope
&lt;/h3&gt;

&lt;p&gt;When a project is active, all file operations are constrained to the project root. Directory traversal sequences (&lt;code&gt;../&lt;/code&gt;) are detected and blocked. Violations are logged to the security audit.&lt;/p&gt;
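&lt;p&gt;Combined, the blocklist, traversal detection, and project scoping look roughly like this (a simplified Python sketch with an abbreviated blocklist; the real checks live in the Rust engine):&lt;/p&gt;

```python
from pathlib import Path

BLOCKED = ["~/.ssh", "~/.gnupg", "~/.aws", "~/.kube", "~/.docker",
           "~/.password-store", "/etc", "/root", "/proc", "/sys", "/dev"]

def is_allowed(path, project_root):
    """Deny blocked prefixes, traversal sequences, and any resolved
    path that escapes the active project root."""
    if ".." in Path(path).parts:
        return False  # directory traversal detected
    resolved = Path(path).expanduser().resolve()
    for entry in BLOCKED:
        blocked = Path(entry).expanduser().resolve()
        if resolved == blocked or blocked in resolved.parents:
            return False
    root = Path(project_root).resolve()
    return resolved == root or root in resolved.parents
```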
&lt;h3&gt;
  
  
  Source code introspection block
&lt;/h3&gt;

&lt;p&gt;Agents cannot read their own engine source files — any &lt;code&gt;read_file&lt;/code&gt; call targeting paths containing &lt;code&gt;src-tauri/src/engine/&lt;/code&gt; or files ending in &lt;code&gt;.rs&lt;/code&gt; is rejected. This prevents agents from discovering internal security mechanisms.&lt;/p&gt;


&lt;h2&gt;
  
  
  Container sandbox
&lt;/h2&gt;

&lt;p&gt;Docker-based execution isolation via the &lt;code&gt;bollard&lt;/code&gt; crate:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Measure&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Capabilities&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cap_drop ALL&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;Disabled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory limit&lt;/td&gt;
&lt;td&gt;256 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU shares&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timeout&lt;/td&gt;
&lt;td&gt;30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output limit&lt;/td&gt;
&lt;td&gt;50 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Four presets: Minimal (alpine, 128MB, no network), Development (node:20-alpine, 512MB), Python (python:3.12-alpine, 512MB), Restricted (alpine, 64MB, 10s timeout).&lt;/p&gt;


&lt;h2&gt;
  
  
  GDPR Article 17 — Right to erasure
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;engine_memory_purge_user&lt;/code&gt; command performs complete data erasure for a user:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All memory content rows deleted&lt;/li&gt;
&lt;li&gt;All vector embeddings deleted&lt;/li&gt;
&lt;li&gt;Search index entries removed&lt;/li&gt;
&lt;li&gt;Graph edges removed&lt;/li&gt;
&lt;li&gt;Padding table repacked to prevent file-size leakage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PRAGMA secure_delete&lt;/code&gt; ensures freed pages are zeroed&lt;/li&gt;
&lt;li&gt;Returns a count of erased records for compliance reporting&lt;/li&gt;
&lt;/ul&gt;
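&lt;p&gt;The content-erasure core of that flow can be sketched with stdlib SQLite (table and column names here are illustrative, not the actual schema; embedding, index, graph, and padding cleanup are omitted):&lt;/p&gt;

```python
import sqlite3

def purge_user(db_path, user_id):
    """Two-phase erasure: overwrite content fields, then delete rows.
    Returns the erased-row count for compliance reporting."""
    con = sqlite3.connect(db_path)
    con.execute("PRAGMA secure_delete = ON")  # zero freed B-tree pages
    cur = con.execute(
        "UPDATE memories SET content = '' WHERE user_id = ?", (user_id,))
    erased = cur.rowcount
    con.execute("DELETE FROM memories WHERE user_id = ?", (user_id,))
    con.commit()
    con.close()
    return erased
```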


&lt;h2&gt;
  
  
  The twelve layers at a glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it protects against&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Zero open ports&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Remote network attacks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Human-in-the-Loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unauthorized side-effects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Agent policies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Over-privileged agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Per-agent HKDF encryption&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cross-agent data access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;PII detection + field encryption&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data exposure at rest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Signed capability tokens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scope escalation, confused deputy attacks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Trust-weighted memory bus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Memory poisoning between agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TLS certificate pinning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MITM on provider connections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Prompt injection scanning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prompt manipulation (inbound + recalled)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Anti-fixation defenses&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent ignoring user instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Filesystem sandboxing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Credential theft, path traversal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Anti-forensic vault padding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File-size side-channel leakage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  Read the full security docs
&lt;/h2&gt;

&lt;p&gt;The complete security reference — including risk classification tables, allowlist/denylist patterns, and every configuration option — lives in the repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/SECURITY.md" rel="noopener noreferrer"&gt;SECURITY.md&lt;/a&gt; — Security overview and threat model&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/reference/security.mdx" rel="noopener noreferrer"&gt;Security Reference&lt;/a&gt; — Full technical reference&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/ENGRAM.md" rel="noopener noreferrer"&gt;ENGRAM.md&lt;/a&gt; — Memory architecture whitepaper&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you find a vulnerability, please report it responsibly via the contact information in the repo rather than opening a public issue.&lt;/p&gt;

&lt;p&gt;Star the &lt;a href="https://github.com/OpenPawz/openpawz" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you want to track progress. 🙏&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://openpawz.ai/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Fopengraph-image%3Fb0e520dc590f72f0" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://openpawz.ai/" rel="noopener noreferrer" class="c-link"&gt;
            OpenPawz — Your AI, Your Rules
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            A native desktop AI platform that runs fully offline, connects to any provider, and puts you in control. Private by default. Powerful by design.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Ffavicon.ico%3Ffavicon.0b3bf435.ico"&gt;
          openpawz.ai
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;





</description>
      <category>ai</category>
      <category>security</category>
      <category>agents</category>
      <category>encryption</category>
    </item>
    <item>
      <title>Pawz Engram biologically-inspired memory architecture for persistent AI agents</title>
      <dc:creator>Gotham64</dc:creator>
      <pubDate>Mon, 02 Mar 2026 06:43:33 +0000</pubDate>
      <link>https://dev.to/gotham64/pawz-engram-biologically-inspired-memory-architecture-for-persistent-ai-agents-8fn</link>
      <guid>https://dev.to/gotham64/pawz-engram-biologically-inspired-memory-architecture-for-persistent-ai-agents-8fn</guid>
      <description>&lt;h2&gt;
  
  
  What is Engram?
&lt;/h2&gt;

&lt;p&gt;Most “agent memory” today is basically: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OpenPawz/openpawz" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpjb4vk4jl4ksgf8c97a.png" alt="Bad Memory Patterns"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Engram&lt;/strong&gt; is our attempt to treat memory like a &lt;em&gt;cognitive system&lt;/em&gt; instead of a dumping ground.&lt;/p&gt;

&lt;p&gt;It’s a three-tier architecture inspired by how humans handle information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sensory Buffer (Tier 0)&lt;/strong&gt;: short-lived, raw input for a single turn&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Working Memory (Tier 1)&lt;/strong&gt;: what the agent is “currently aware of” under a strict token budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-Term Memory Graph (Tier 2)&lt;/strong&gt;: persistent episodic + semantic + procedural memory with typed edges&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Engram is implemented in &lt;strong&gt;OpenPawz&lt;/strong&gt;, an open-source &lt;strong&gt;Tauri v2&lt;/strong&gt; desktop AI platform. Everything runs &lt;strong&gt;local-first&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OpenPawz/openpawz/blob/main/ENGRAM.md" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Read the full whitepaper: ENGRAM (Project Engram)&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Why build this?
&lt;/h2&gt;

&lt;p&gt;Flat memory stores tend to fail the same way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;everything competes equally (no prioritization)&lt;/li&gt;
&lt;li&gt;nothing fades (stale facts stick around forever)&lt;/li&gt;
&lt;li&gt;“facts”, “events”, and “how-to” get mixed into one blob pile&lt;/li&gt;
&lt;li&gt;retrieval runs even when it shouldn’t (latency + context pollution)&lt;/li&gt;
&lt;li&gt;security is often an afterthought&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Engram’s core bet is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Intelligent memory is not more memory — it’s better memory, injected at the right time and the right amount.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Engram loop (the part we care about)
&lt;/h2&gt;

&lt;p&gt;Engram is built around a reinforcing loop:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OpenPawz/openpawz" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwjd0smrd8mhtvt8cxix.png" alt="Engram Outline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gate&lt;/strong&gt;: decide if memory retrieval is needed at all (skip trivial queries)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieve&lt;/strong&gt;: hybrid search (BM25 + vectors + graph signals)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cap&lt;/strong&gt;: budget-first context assembly (no overflow, no dilution)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill&lt;/strong&gt;: store “how to do things” as procedural memory that compounds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate&lt;/strong&gt;: track quality (NDCG, precision@k, latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forget&lt;/strong&gt;: measured decay + fusion + rollback if quality drops&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/OpenPawz/openpawz" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zjgdqwq515p3anr3am1.png" alt="Project Engram Memory"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1) Three tiers, three time scales
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tier 0: Sensory Buffer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FIFO ring buffer for &lt;em&gt;this&lt;/em&gt; turn’s raw inputs (messages, tool outputs, recalled items)&lt;/li&gt;
&lt;li&gt;drained into the prompt and then discarded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tier 1: Working Memory&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;priority-evicted slots with a hard token budget&lt;/li&gt;
&lt;li&gt;snapshots persist across agent switching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tier 2: Long-Term Memory Graph&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Episodic&lt;/strong&gt;: what happened (sessions, outcomes, task results)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic&lt;/strong&gt;: what is true (subject–predicate–object triples)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Procedural&lt;/strong&gt;: how to do things (step-by-step skills with success/failure tracking)&lt;/li&gt;
&lt;li&gt;memories connect via typed edges (RelatedTo, Contradicts, Supports, FollowedBy, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2) Hybrid retrieval (BM25 + vectors + graph) with fusion
&lt;/h3&gt;

&lt;p&gt;Engram fuses multiple signals rather than betting on one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BM25&lt;/strong&gt; for exactness and keyword reliability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector similarity&lt;/strong&gt; when embeddings are available (optional)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph spreading activation&lt;/strong&gt; to pull adjacent context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then it merges rankings with &lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt; and can apply &lt;strong&gt;MMR&lt;/strong&gt; for diversity.&lt;/p&gt;
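&lt;p&gt;RRF itself is a few lines (a Python sketch; k=60 is the conventional constant from the original RRF paper, not necessarily Engram's tuned value):&lt;/p&gt;

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum of 1/(k + rank) over
    each ranked list the document appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```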

&lt;h3&gt;
  
  
  3) Retrieval intelligence (a.k.a. don’t retrieve blindly)
&lt;/h3&gt;

&lt;p&gt;Engram uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;strong&gt;Retrieval Gate&lt;/strong&gt;: Skip / Retrieve / DeepRetrieve / Refuse / Defer&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;Quality Gate&lt;/strong&gt; (CRAG-style tiers): Correct / Ambiguous / Incorrect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So weak results get corrected or rejected instead of injected as noise.&lt;/p&gt;

&lt;h3&gt;
  
  
  4) Measured forgetting + safe rollback
&lt;/h3&gt;

&lt;p&gt;Forgetting is first-class:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;decay follows a dual-layer model (fast-fade short memory vs slow-fade long memory)&lt;/li&gt;
&lt;li&gt;near-duplicates are merged (“fusion”)&lt;/li&gt;
&lt;li&gt;garbage collection is transactional: if retrieval quality drops beyond a threshold, Engram rolls back&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means storage stays lean &lt;em&gt;without&lt;/em&gt; silently losing what matters.&lt;/p&gt;
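&lt;p&gt;The dual-layer decay can be sketched as a mix of two exponential curves (half-lives and the 50/50 mix here are illustrative assumptions, not Engram's tuned parameters):&lt;/p&gt;

```python
import math

def retention(age_days, fast_half_life=2.0, slow_half_life=60.0, mix=0.5):
    """Blend a fast-fading short-memory component with a slow-fading
    long-memory component; returns a retention score in (0, 1]."""
    fast = math.exp(-math.log(2) * age_days / fast_half_life)
    slow = math.exp(-math.log(2) * age_days / slow_half_life)
    return mix * fast + (1 - mix) * slow
```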

&lt;h3&gt;
  
  
  5) Security by default
&lt;/h3&gt;

&lt;p&gt;Engram encrypts sensitive fields before they hit disk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;automatic PII detection&lt;/li&gt;
&lt;li&gt;AES-256-GCM field-level encryption&lt;/li&gt;
&lt;li&gt;local-first storage design (no cloud vector DB dependency)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sensory buffer + working memory caches&lt;/li&gt;
&lt;li&gt;graph store + typed edges&lt;/li&gt;
&lt;li&gt;hybrid search + reranking&lt;/li&gt;
&lt;li&gt;consolidation + fusion + decay + rollback&lt;/li&gt;
&lt;li&gt;encryption + redaction defenses&lt;/li&gt;
&lt;li&gt;observability (metrics + tracing)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What’s next
&lt;/h2&gt;

&lt;p&gt;A few “high-leverage” additions we are actively working toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;proposition-level storage (atomic facts)&lt;/li&gt;
&lt;li&gt;a stronger vector index backend (HNSW)&lt;/li&gt;
&lt;li&gt;community / GraphRAG summaries for “global” queries&lt;/li&gt;
&lt;li&gt;skill verification + compositional skills&lt;/li&gt;
&lt;li&gt;evaluation harnesses (dilution testing + regression gates)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Read the full whitepaper
&lt;/h2&gt;

&lt;p&gt;If any of this resonates, the whitepaper covers the full architecture, modules, schema, and research mapping.&lt;br&gt;
And if you want to contribute, issues + PRs are welcome.&lt;br&gt;
Star the &lt;a href="https://github.com/OpenPawz/openpawz" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you want to track progress. 🙏&lt;br&gt;
&lt;a href="https://openpawz.ai" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffg59rxvva9kct0mcfdx7.png" alt="OpenPawz"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://openpawz.ai/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Fopengraph-image%3Fb0e520dc590f72f0" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://openpawz.ai/" rel="noopener noreferrer" class="c-link"&gt;
            OpenPawz — Your AI, Your Rules
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            A native desktop AI platform that runs fully offline, connects to any provider, and puts you in control. Private by default. Powerful by design.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Ffavicon.ico%3Ffavicon.0b3bf435.ico"&gt;
          openpawz.ai
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;




</description>
      <category>ai</category>
      <category>agents</category>
      <category>memory</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How We Built OpenPawz — A Native AI Workflow Engine for Developers</title>
      <dc:creator>Gotham64</dc:creator>
      <pubDate>Fri, 27 Feb 2026 05:29:57 +0000</pubDate>
      <link>https://dev.to/gotham64/how-we-built-openpawz-a-native-ai-workflow-engine-for-developers-4871</link>
      <guid>https://dev.to/gotham64/how-we-built-openpawz-a-native-ai-workflow-engine-for-developers-4871</guid>
      <description>&lt;p&gt;

&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://openpawz.ai/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Fopengraph-image%3Fb0e520dc590f72f0" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://openpawz.ai/" rel="noopener noreferrer" class="c-link"&gt;
            OpenPawz — Your AI, Your Rules
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            A native desktop AI platform that runs fully offline, connects to any provider, and puts you in control. Private by default. Powerful by design.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenpawz.ai%2Ffavicon.ico%3Ffavicon.0b3bf435.ico"&gt;
          openpawz.ai
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;




&lt;p&gt;&lt;strong&gt;💡 Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Over the past few months we've been building OpenPawz — a native agent and workflow automation system that runs on GitHub and in local environments. The goal? Give developers a way to define powerful automation in code rather than on legacy hosted platforms.&lt;/p&gt;

&lt;p&gt;I want to share why this tool matters, how it works, and how you can use it or contribute.&lt;/p&gt;

&lt;p&gt;Follow the project on &lt;a href="https://github.com/OpenPawz/openpawz" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💻 What Is OpenPawz?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenPawz is a developer-first automation and agent workflow engine designed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run workflows locally or in CI&lt;/li&gt;
&lt;li&gt;Integrate easily with GitHub Actions&lt;/li&gt;
&lt;li&gt;Empower developers to write custom agents&lt;/li&gt;
&lt;li&gt;Enable cross-project automation without vendor lock-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as workflow-as-code that scales from your laptop to larger automated pipelines.&lt;/p&gt;
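&lt;p&gt;The post doesn't show OpenPawz's actual syntax, but a workflow-as-code definition could look something like the hypothetical Python sketch below. The &lt;code&gt;Workflow&lt;/code&gt; class and &lt;code&gt;step&lt;/code&gt; decorator are illustrative names, not OpenPawz's real API:&lt;/p&gt;

```python
# Hypothetical sketch of a workflow-as-code definition.
# "Workflow" and "step" are illustrative names, not OpenPawz's real API.

class Workflow:
    def __init__(self, name):
        self.name = name
        self.steps = []

    def step(self, func):
        # Register a function as a workflow step, in declaration order.
        self.steps.append(func)
        return func

    def run(self):
        # Execute each step in order, threading a shared context dict through.
        context = {}
        for func in self.steps:
            print(f"[{self.name}] running step: {func.__name__}")
            func(context)
        return context


wf = Workflow("nightly-build")

@wf.step
def fetch_sources(ctx):
    ctx["sources"] = ["repo-a", "repo-b"]

@wf.step
def build(ctx):
    ctx["built"] = [name + ".tar.gz" for name in ctx["sources"]]

result = wf.run()
print(result["built"])  # ['repo-a.tar.gz', 'repo-b.tar.gz']
```

&lt;p&gt;Because the definition is ordinary code, the same file runs unchanged on a laptop or inside a CI job, which is the whole point of the workflow-as-code approach.&lt;/p&gt;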

&lt;p&gt;&lt;strong&gt;📈 What We’ve Learned So Far&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Over the last few weeks, the project has gained traction through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub views and unique visitors&lt;/li&gt;
&lt;li&gt;Referrals from HN and other tech sites&lt;/li&gt;
&lt;li&gt;Early adopters exploring workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Seeing people not just star the repo, but dive into workflow files and examples has been really exciting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧠 Why This Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Developers today are tired of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hosted “black box” automation tools&lt;/li&gt;
&lt;li&gt;Rigid, proprietary workflow formats&lt;/li&gt;
&lt;li&gt;Paying for orchestration they can describe in code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenPawz aims to flip that by keeping everything open, transparent, and extensible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧱 How It Works (Overview)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At its core, OpenPawz:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parses workflow definitions from code&lt;/li&gt;
&lt;li&gt;Executes agents and actions&lt;/li&gt;
&lt;li&gt;Provides logs and feedback in your environment&lt;/li&gt;
&lt;li&gt;Integrates with GitHub Actions for CI/CD workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it flexible whether you’re experimenting locally or building a production pipeline.&lt;/p&gt;
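&lt;p&gt;As a rough mental model of that parse / execute / log cycle (not OpenPawz's actual internals — the definition format, field names, and action registry below are assumptions for illustration):&lt;/p&gt;

```python
import json

# Rough mental model of a parse / execute / log cycle.
# The definition schema and action names here are illustrative assumptions,
# not OpenPawz's actual format.

definition = json.loads("""
{
  "name": "ci-checks",
  "steps": [
    {"action": "echo", "args": {"message": "linting"}},
    {"action": "echo", "args": {"message": "testing"}}
  ]
}
""")

# A registry mapping action names to plain Python callables.
ACTIONS = {
    "echo": lambda args: print(args["message"]),
}

def run(defn):
    # Execute each step and record what ran, as feedback for the environment.
    log = []
    for step in defn["steps"]:
        action = ACTIONS[step["action"]]
        action(step["args"])
        log.append(step["action"])
    return log

executed = run(definition)
print(f"workflow {definition['name']!r} ran {len(executed)} steps")
```

&lt;p&gt;The same loop works identically whether it's driven by hand on a laptop or invoked from a GitHub Actions job, which is what makes the engine portable across environments.&lt;/p&gt;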

&lt;p&gt;&lt;strong&gt;🤝 How You Can Help&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want to get involved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⭐ Star the repo — it helps others discover it&lt;/li&gt;
&lt;li&gt;🐛 Report or fix issues — especially “good first issues”&lt;/li&gt;
&lt;li&gt;📄 Improve the docs&lt;/li&gt;
&lt;li&gt;🧪 Try an integration and share feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open source thrives on participation and real-world use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📌 Final Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project is still early, but the trajectory has been great — thanks to everyone who’s already visited, forked, or shared feedback.&lt;/p&gt;

&lt;p&gt;If you’re curious about open alternatives to hosted automation, want to contribute to the future of developer-centric workflows, or just have questions — let’s build together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyinam5htzkbkiu5u6p5r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyinam5htzkbkiu5u6p5r.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5d7x9nubmk86r8l0p69f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5d7x9nubmk86r8l0p69f.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71dero6b0p1nzmcgjthx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71dero6b0p1nzmcgjthx.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>developers</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
