Gotham64

Posted on Mar 18

Benchmarks, Zero Guesswork: Why OpenPawz measures every hot path in the AI engine

#rust #ai #performance #opensource

The performance problem nobody measures

Every AI agent platform talks about speed. Fast responses. Low latency. Real-time agents.

But ask a simple question — how long does it take to create a session? Search memory? Encrypt a credential? Scan for prompt injection? — and you get silence. No numbers. No baselines. No way to tell if the last update made things faster or slower.

This matters more than most teams realize:

Scenario	What goes wrong
A refactor ships without benchmarks	Session creation silently doubles from 20µs to 40µs. Nobody notices — until 10,000 users do.
Memory search "feels slow"	Is it the embedding model? The vector index? The SQLite query? Without measurements, you're guessing.
A dependency update lands	Did the new `rusqlite` version change query performance? Did the `aes-gcm` update affect encrypt/decrypt throughput?
You scale up	50 sessions work fine. 500 sessions work fine. 5,000 sessions? You have no idea where the cliff is.

The problem isn't that platforms are slow. It's that nobody is measuring, so nobody knows. Performance regressions are invisible until they become user complaints.

OpenPawz runs 140+ benchmarks across 8 dedicated suites on every critical path in the engine. Not integration tests pretending to check performance. Real statistical benchmarks with variance analysis, regression detection, and historical comparison.

Star the repo — it's open source

What gets measured and why

The benchmark suite isn't a token gesture. It covers every layer of the engine — from the operations users trigger directly to the internal machinery that makes those operations possible.

Here's the breakdown by suite:

Sessions — the foundation of every conversation

Every interaction with an AI agent starts with a session. Creating one, loading messages, listing history, managing tasks. If these operations are slow, everything built on top of them is slow.

The session benchmarks measure:

Creating sessions and messages — the write path users hit on every single interaction
Listing at scale — 10 sessions, 100 sessions, 500 sessions. Where does performance degrade?
Message depth — fetching 50 messages is trivial. Fetching 1,000 with HMAC chain verification? That's where you find bottlenecks.
Task and agent file I/O — the async operations that happen behind every agent action

Why it matters: session operations are the critical path. A user sends a message, the platform creates a message record, verifies the chain, updates the session. If any of those steps is slow, the user perceives the entire agent as slow — even before the LLM has responded.

Memory — search has to be instant

OpenPawz uses a hybrid memory system: BM25 for keyword search, HNSW vectors for semantic search, and a deduplication layer to prevent memory bloat. Each of these has radically different performance characteristics.

The memory benchmarks test:

BM25 search at different corpus sizes — how does keyword search scale from 100 to 2,000 documents?
HNSW insert and search — vector indexing is notoriously sensitive to dimensionality and dataset size
Content overlap detection — the dedup engine that decides whether a new memory is actually new
Brute-force vs. HNSW comparison — at what dataset size does the approximate index beat linear scan?

Why it matters: memory search happens on every agent turn. The agent checks what it knows before responding. If memory retrieval adds 50ms instead of 5ms, that's 50ms per turn, per user, compounding across every conversation.

Engram — the cognitive layer

Engram is the knowledge graph that sits on top of raw memory. Entities, relationships, propositions. It powers the agent's ability to reason about what it knows rather than just recall it.

The benchmarks cover:

Entity and edge upserts — how fast can the knowledge graph absorb new information?
Subgraph queries — retrieving all edges connected to an entity, at varying graph sizes
Proposition decomposition — breaking complex statements into atomic facts
Memory fusion — merging overlapping memories into coherent summaries
SCC certificate hashing — the capability system that validates tool access

Why it matters: graph operations compound. An agent processing a long conversation might upsert dozens of entities and edges per turn. If each upsert takes 100µs instead of 10µs, you've added milliseconds of invisible overhead that stacks up fast.

Security — crypto can't be the bottleneck

The security suite benchmarks the operations that protect user data: AES-256-GCM encryption, key derivation, PII detection, and injection scanning.

What gets measured:

Encrypt and decrypt at different payload sizes — 64 bytes, 1 KB, 64 KB, 1 MB
Key derivation (Argon2) — the intentionally-slow operation that protects master keys
PII detection — scanning messages for emails, phone numbers, SSNs before they reach the LLM
Injection scanning — detecting prompt injection attempts in user input

Why it matters: security operations run on every message. PII detection scans every outbound message. Injection scanning checks every inbound message. If either of these adds perceptible latency, teams are tempted to disable them. Benchmarks ensure they stay fast enough that there's never a reason to turn them off.

Audit — compliance at zero cost

Every operation in OpenPawz generates an audit trail. The audit benchmarks ensure that logging doesn't slow down the operations being logged.

Append events — how fast can audit records be written?
Query by time range — retrieving audit history for a specific period
Query by event type — filtering for specific operation categories

Why it matters: audit logging is fire-and-forget. If appending an audit record takes longer than the operation it's recording, the tail is wagging the dog. Benchmarks keep audit overhead invisible.

Reasoning — model-aware pricing and routing

The reasoning benchmarks cover the pricing engine and cost calculations that determine which model handles which request.

Price-per-token lookups across all supported models
Cost calculations for conversations of varying length
Model registry operations — looking up capabilities, context windows, routing metadata

Why it matters: routing decisions happen before every LLM call. The engine evaluates which model to use, what it will cost, and whether budget constraints allow it. These lookups need to be sub-microsecond so they never delay the actual inference call.

Platform — the connective tissue

Config, flows, squads, canvas, projects, telemetry. These are the platform features that tie everything together. Individually they seem simple. Collectively, they define whether the platform feels snappy or sluggish.

Config read/write — key-value settings the engine checks constantly
Flow operations — saving, loading, listing workflow graphs at scale
Squad management — creating teams of agents, checking membership
Canvas components — the visual workspace that agents and users share
Project management — grouping agents, sessions, and resources
Telemetry recording — performance data collection that must not affect performance

Why it matters: platform operations are invisible until they're slow. Nobody notices config lookups that take 2µs. Everyone notices when they take 200µs and the settings panel lags.

The tooling: Criterion.rs and statistical rigor

OpenPawz doesn't use hand-rolled timing loops or Instant::now() wrappers. The entire suite runs on Criterion.rs — the same statistical benchmarking framework used by the Rust compiler itself.

What Criterion provides that ad-hoc timing doesn't:

Feature	Why it matters
Warm-up phase	Eliminates cold-cache artifacts from results
Statistical sampling	Runs each benchmark enough times to calculate confidence intervals
Regression detection	Compares against the last run and flags performance changes
Outlier classification	Identifies and categorizes anomalous measurements
HTML reports	Visual charts showing distribution, comparison, and trend data

Every benchmark run produces a target/criterion/ directory with HTML reports you can open in a browser. You see exactly how performance changed, not just a single number.

What makes a good benchmark suite

Building 140+ benchmarks taught us a few things about what makes benchmarks actually useful versus benchmarks that just exist to check a box.

Measure the real path, not a mock

Every benchmark in the suite creates a real SQLite database, inserts real data, and runs real queries. No mocking the storage layer. No skipping serialization. If the production code path touches SQLite, the benchmark touches SQLite.

Test at multiple scales

A single benchmark at one size tells you almost nothing. Memory search at 100 documents? Fast. Memory search at 2,000 documents? Maybe still fast, maybe not. The suite deliberately tests operations at multiple scales — 10, 50, 100, 200, 500, 1000, 2000 — so you see the scaling curve, not just a single point.

Separate the hot paths

Not every function deserves a benchmark. The suite focuses on operations that happen per-turn, per-message, or per-session — the hot paths that users experience directly. A one-time migration function that runs on startup? Don't benchmark it. A PII scanner that runs on every outbound message? Absolutely benchmark it.

Make regression detection automatic

Criterion stores historical results. Run the benchmarks before and after a change, and you get a clear report: session/create: +3.2%, message/add: -1.1%, memory/bm25_search/1000: +0.4%. No manual comparison needed. No spreadsheets. The tooling tells you what changed.

Running the suite

The benchmarks live in a dedicated crate — openpawz-bench — separate from the application code. This keeps benchmark dependencies out of the production binary and gives the suite its own compilation target.

# Run all benchmarks
cd src-tauri && cargo bench -p openpawz-bench

# Run a specific suite
cargo bench -p openpawz-bench --bench session_bench

# Run benchmarks matching a pattern
cargo bench -p openpawz-bench -- "memory/bm25"

Results land in target/criterion/ with full HTML reports. Open target/criterion/report/index.html for an overview of every benchmark, or drill into any individual measurement for distribution charts and regression comparisons.

The eight suites at a glance

Suite	Focus	Key operations
session_bench	Sessions, messages, tasks, agent files	Create, list, fetch at scale
platform_bench	Config, flows, squads, canvas, projects, telemetry	CRUD at varying DB sizes
memory_bench	BM25 search, HNSW indexing, dedup, content overlap	Search and insert at multiple corpus sizes
engram_bench	Knowledge graph — entities, edges, subgraph queries	Upserts, traversals, graph scaling
cognitive_bench	Proposition decomposition, memory fusion, SCC, tool metadata	Parsing, merging, hashing
security_bench	AES-256-GCM, PII detection, injection scanning	Encrypt/decrypt at varying payloads
audit_bench	Audit trail append and query	Write throughput, time-range queries
reasoning_bench	Pricing engine, cost calculations, model registry	Per-token lookups, conversation costing

140+ benchmarks. Eight suites. Every hot path in the engine.

Part of the engine architecture

The benchmarks aren't a separate project. They're part of the same Cargo workspace as the engine itself:

Crate	Role
`openpawz-core`	The pure Rust engine library — everything the benchmarks test
`openpawz-bench`	Criterion.rs benchmark suite — depends directly on `openpawz-core`
`openpawz`	Tauri desktop app
`openpawz-cli`	Terminal binary

The benchmarks import openpawz-core as a library and call the same public API that the desktop app and CLI use. No internal test hooks. No special benchmark-only codepaths. What gets benchmarked is what ships.

This also means the benchmarks serve as a living compatibility check. If a public API changes, the benchmarks fail to compile. If a function signature changes, the benchmark that calls it catches it immediately.

Why this matters for users

You don't need to run these benchmarks yourself (though you're welcome to). They exist so that every release ships with confidence that:

Nothing got slower — regression detection catches performance changes before they merge
The fast paths stay fast — session creation, memory search, encryption, audit logging
Scale is understood — we know where the performance cliffs are, and they're documented
Security isn't sacrificed for speed — PII detection and injection scanning stay enabled because they're fast enough to never be a concern

Performance isn't a feature you add later. It's a property of the codebase that you either measure or you hope for. OpenPawz measures.

Try it

# Clone and run the full suite
git clone https://github.com/OpenPawz/openpawz.git
cd openpawz/src-tauri
cargo bench -p openpawz-bench

# Open the HTML reports
open target/criterion/report/index.html

Every benchmark runs against a fresh in-memory SQLite database. No external services. No network calls. No setup beyond having Rust installed.

Read the full docs

Star the repo if you want to track progress. 🙏

OpenPawz — Your AI, Your Rules

A native desktop AI platform that runs fully offline, connects to any provider, and puts you in control. Private by default. Powerful by design.

openpawz.ai

DEV Community