The performance problem nobody measures
Every AI agent platform talks about speed. Fast responses. Low latency. Real-time agents.
But ask a simple question — how long does it take to create a session? Search memory? Encrypt a credential? Scan for prompt injection? — and you get silence. No numbers. No baselines. No way to tell if the last update made things faster or slower.
This matters more than most teams realize:
| Scenario | What goes wrong |
|---|---|
| A refactor ships without benchmarks | Session creation silently doubles from 20µs to 40µs. Nobody notices — until 10,000 users do. |
| Memory search "feels slow" | Is it the embedding model? The vector index? The SQLite query? Without measurements, you're guessing. |
| A dependency update lands | Did the new rusqlite version change query performance? Did the aes-gcm update affect encrypt/decrypt throughput? |
| You scale up | 50 sessions work fine. 500 sessions work fine. 5,000 sessions? You have no idea where the cliff is. |
The problem isn't that platforms are slow. It's that nobody is measuring, so nobody knows. Performance regressions are invisible until they become user complaints.
OpenPawz runs 140+ benchmarks across 8 dedicated suites on every critical path in the engine. Not integration tests pretending to check performance. Real statistical benchmarks with variance analysis, regression detection, and historical comparison.
Star the repo — it's open source
What gets measured and why
The benchmark suite isn't a token gesture. It covers every layer of the engine — from the operations users trigger directly to the internal machinery that makes those operations possible.
Here's the breakdown by suite:
Sessions — the foundation of every conversation
Every interaction with an AI agent starts with a session. Creating one, loading messages, listing history, managing tasks. If these operations are slow, everything built on top of them is slow.
The session benchmarks measure:
- Creating sessions and messages — the write path users hit on every single interaction
- Listing at scale — 10 sessions, 100 sessions, 500 sessions. Where does performance degrade?
- Message depth — fetching 50 messages is trivial. Fetching 1,000 with HMAC chain verification? That's where you find bottlenecks.
- Task and agent file I/O — the async operations that happen behind every agent action
Why it matters: session operations are the critical path. A user sends a message, the platform creates a message record, verifies the chain, updates the session. If any of those steps is slow, the user perceives the entire agent as slow — even before the LLM has responded.
Memory — search has to be instant
OpenPawz uses a hybrid memory system: BM25 for keyword search, HNSW vectors for semantic search, and a deduplication layer to prevent memory bloat. Each of these has radically different performance characteristics.
The memory benchmarks test:
- BM25 search at different corpus sizes — how does keyword search scale from 100 to 2,000 documents?
- HNSW insert and search — vector indexing is notoriously sensitive to dimensionality and dataset size
- Content overlap detection — the dedup engine that decides whether a new memory is actually new
- Brute-force vs. HNSW comparison — at what dataset size does the approximate index beat linear scan?
Why it matters: memory search happens on every agent turn. The agent checks what it knows before responding. If memory retrieval adds 50ms instead of 5ms, that's 50ms per turn, per user, compounding across every conversation.
Engram — the cognitive layer
Engram is the knowledge graph that sits on top of raw memory. Entities, relationships, propositions. It powers the agent's ability to reason about what it knows rather than just recall it.
The benchmarks cover:
- Entity and edge upserts — how fast can the knowledge graph absorb new information?
- Subgraph queries — retrieving all edges connected to an entity, at varying graph sizes
- Proposition decomposition — breaking complex statements into atomic facts
- Memory fusion — merging overlapping memories into coherent summaries
- SCC certificate hashing — the capability system that validates tool access
Why it matters: graph operations compound. An agent processing a long conversation might upsert dozens of entities and edges per turn. If each upsert takes 100µs instead of 10µs, you've added milliseconds of invisible overhead that stacks up fast.
Security — crypto can't be the bottleneck
The security suite benchmarks the operations that protect user data: AES-256-GCM encryption, key derivation, PII detection, and injection scanning.
What gets measured:
- Encrypt and decrypt at different payload sizes — 64 bytes, 1 KB, 64 KB, 1 MB
- Key derivation (Argon2) — the intentionally-slow operation that protects master keys
- PII detection — scanning messages for emails, phone numbers, SSNs before they reach the LLM
- Injection scanning — detecting prompt injection attempts in user input
Why it matters: security operations run on every message. PII detection scans every outbound message. Injection scanning checks every inbound message. If either of these adds perceptible latency, teams are tempted to disable them. Benchmarks ensure they stay fast enough that there's never a reason to turn them off.
Audit — compliance at zero cost
Every operation in OpenPawz generates an audit trail. The audit benchmarks ensure that logging doesn't slow down the operations being logged.
- Append events — how fast can audit records be written?
- Query by time range — retrieving audit history for a specific period
- Query by event type — filtering for specific operation categories
Why it matters: audit logging is fire-and-forget. If appending an audit record takes longer than the operation it's recording, the tail is wagging the dog. Benchmarks keep audit overhead invisible.
Reasoning — model-aware pricing and routing
The reasoning benchmarks cover the pricing engine and cost calculations that determine which model handles which request.
- Price-per-token lookups across all supported models
- Cost calculations for conversations of varying length
- Model registry operations — looking up capabilities, context windows, routing metadata
Why it matters: routing decisions happen before every LLM call. The engine evaluates which model to use, what it will cost, and whether budget constraints allow it. These lookups need to be sub-microsecond so they never delay the actual inference call.
Platform — the connective tissue
Config, flows, squads, canvas, projects, telemetry. These are the platform features that tie everything together. Individually they seem simple. Collectively, they define whether the platform feels snappy or sluggish.
- Config read/write — key-value settings the engine checks constantly
- Flow operations — saving, loading, listing workflow graphs at scale
- Squad management — creating teams of agents, checking membership
- Canvas components — the visual workspace that agents and users share
- Project management — grouping agents, sessions, and resources
- Telemetry recording — performance data collection that must not affect performance
Why it matters: platform operations are invisible until they're slow. Nobody notices config lookups that take 2µs. Everyone notices when they take 200µs and the settings panel lags.
The tooling: Criterion.rs and statistical rigor
OpenPawz doesn't use hand-rolled timing loops or Instant::now() wrappers. The entire suite runs on Criterion.rs — the same statistical benchmarking framework used by the Rust compiler itself.
What Criterion provides that ad-hoc timing doesn't:
| Feature | Why it matters |
|---|---|
| Warm-up phase | Eliminates cold-cache artifacts from results |
| Statistical sampling | Runs each benchmark enough times to calculate confidence intervals |
| Regression detection | Compares against the last run and flags performance changes |
| Outlier classification | Identifies and categorizes anomalous measurements |
| HTML reports | Visual charts showing distribution, comparison, and trend data |
Every benchmark run produces a target/criterion/ directory with HTML reports you can open in a browser. You see exactly how performance changed, not just a single number.
What makes a good benchmark suite
Building 140+ benchmarks taught us a few things about what makes benchmarks actually useful versus benchmarks that just exist to check a box.
Measure the real path, not a mock
Every benchmark in the suite creates a real SQLite database, inserts real data, and runs real queries. No mocking the storage layer. No skipping serialization. If the production code path touches SQLite, the benchmark touches SQLite.
Test at multiple scales
A single benchmark at one size tells you almost nothing. Memory search at 100 documents? Fast. Memory search at 2,000 documents? Maybe still fast, maybe not. The suite deliberately tests operations at multiple scales — 10, 50, 100, 200, 500, 1000, 2000 — so you see the scaling curve, not just a single point.
Separate the hot paths
Not every function deserves a benchmark. The suite focuses on operations that happen per-turn, per-message, or per-session — the hot paths that users experience directly. A one-time migration function that runs on startup? Don't benchmark it. A PII scanner that runs on every outbound message? Absolutely benchmark it.
Make regression detection automatic
Criterion stores historical results. Run the benchmarks before and after a change, and you get a clear report: session/create: +3.2%, message/add: -1.1%, memory/bm25_search/1000: +0.4%. No manual comparison needed. No spreadsheets. The tooling tells you what changed.
Running the suite
The benchmarks live in a dedicated crate — openpawz-bench — separate from the application code. This keeps benchmark dependencies out of the production binary and gives the suite its own compilation target.
# Run all benchmarks
cd src-tauri && cargo bench -p openpawz-bench
# Run a specific suite
cargo bench -p openpawz-bench --bench session_bench
# Run benchmarks matching a pattern
cargo bench -p openpawz-bench -- "memory/bm25"
Results land in target/criterion/ with full HTML reports. Open target/criterion/report/index.html for an overview of every benchmark, or drill into any individual measurement for distribution charts and regression comparisons.
The eight suites at a glance
| Suite | Focus | Key operations |
|---|---|---|
| session_bench | Sessions, messages, tasks, agent files | Create, list, fetch at scale |
| platform_bench | Config, flows, squads, canvas, projects, telemetry | CRUD at varying DB sizes |
| memory_bench | BM25 search, HNSW indexing, dedup, content overlap | Search and insert at multiple corpus sizes |
| engram_bench | Knowledge graph — entities, edges, subgraph queries | Upserts, traversals, graph scaling |
| cognitive_bench | Proposition decomposition, memory fusion, SCC, tool metadata | Parsing, merging, hashing |
| security_bench | AES-256-GCM, PII detection, injection scanning | Encrypt/decrypt at varying payloads |
| audit_bench | Audit trail append and query | Write throughput, time-range queries |
| reasoning_bench | Pricing engine, cost calculations, model registry | Per-token lookups, conversation costing |
140+ benchmarks. Eight suites. Every hot path in the engine.
Part of the engine architecture
The benchmarks aren't a separate project. They're part of the same Cargo workspace as the engine itself:
| Crate | Role |
|---|---|
openpawz-core |
The pure Rust engine library — everything the benchmarks test |
openpawz-bench |
Criterion.rs benchmark suite — depends directly on openpawz-core
|
openpawz |
Tauri desktop app |
openpawz-cli |
Terminal binary |
The benchmarks import openpawz-core as a library and call the same public API that the desktop app and CLI use. No internal test hooks. No special benchmark-only codepaths. What gets benchmarked is what ships.
This also means the benchmarks serve as a living compatibility check. If a public API changes, the benchmarks fail to compile. If a function signature changes, the benchmark that calls it catches it immediately.
Why this matters for users
You don't need to run these benchmarks yourself (though you're welcome to). They exist so that every release ships with confidence that:
- Nothing got slower — regression detection catches performance changes before they merge
- The fast paths stay fast — session creation, memory search, encryption, audit logging
- Scale is understood — we know where the performance cliffs are, and they're documented
- Security isn't sacrificed for speed — PII detection and injection scanning stay enabled because they're fast enough to never be a concern
Performance isn't a feature you add later. It's a property of the codebase that you either measure or you hope for. OpenPawz measures.
Try it
# Clone and run the full suite
git clone https://github.com/OpenPawz/openpawz.git
cd openpawz/src-tauri
cargo bench -p openpawz-bench
# Open the HTML reports
open target/criterion/report/index.html
Every benchmark runs against a fresh in-memory SQLite database. No external services. No network calls. No setup beyond having Rust installed.
Read the full docs
Star the repo if you want to track progress. 🙏

Top comments (0)