I'm a backend engineer. When I started using OpenClaw, I kept hitting the same reliability problems every backend engineer recognizes: the agent would forget who I was after a session reset. Safety rules would vanish after context compaction. I'd told it my preferences three times, and it still asked "Python or Go?"
So I did what made sense: I wrote a protocol specification, built a reference implementation, and ran a benchmark. Here's what I learned.
The problem isn't retrieval. It's input.
OpenClaw's memory pipeline looks like this:
Agent decides what to remember → calls memory tools → stores unstructured text → retrieves via vector similarity
Every step depends on the agent making the right decision at the right time. When context compacts, the agent may forget to call memory tools entirely. The memory was potentially retrievable, but the agent never initiated the search.
The community has built excellent solutions on the retrieval side — LanceDB with hybrid vector + BM25 search, cross-encoder reranking, knowledge graphs. But the input side was still wide open: what should be remembered in the first place?
SheetMemory: rule-first memory extraction
The architecture is straightforward: every user message passes through a Perceptor — a pure-regex signal detector — before anything reaches a model. High-confidence signals (rules, explicit preferences, corrections, memory requests) are classified and written directly. No LLM involved. For lower-confidence or ambiguous input, an optional local LLM classifier can map text to the schema.
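To make that concrete, here is a minimal sketch of what a regex-only Perceptor pass could look like, using the six detector names from the protocol. The patterns, the Signal type, and the static 0.9 confidence are my illustrations, not the plugin's actual rules:

```python
import re
from dataclasses import dataclass

@dataclass
class Signal:
    detector: str      # which of the six detectors fired
    confidence: float  # rule-based, so assigned statically in this sketch
    span: str          # the text that matched

# Illustrative patterns only -- the real Perceptor ships its own rule set.
DETECTORS = {
    "rule":            re.compile(r"\b(always|never|must not|do not)\b", re.I),
    "preference":      re.compile(r"\bi (prefer|like|use|want)\b", re.I),
    "correction":      re.compile(r"\b(no wait|actually|that's wrong)\b", re.I),
    "memory_request":  re.compile(r"\b(remember|don't forget)\b", re.I),
    "time_commitment": re.compile(r"\b(by|before|next) (monday|friday|tomorrow)\b", re.I),
    "identity":        re.compile(r"\bmy (name|email|role) is\b", re.I),
}

def perceive(message: str) -> list[Signal]:
    """Pure-regex pass on the message_received hook: no model call."""
    return [
        Signal(name, 0.9, m.group(0))
        for name, pattern in DETECTORS.items()
        if (m := pattern.search(message))
    ]
```

A message like "btw my email is james.chen@gmail.com" fires the identity detector and gets written directly, no model in the loop.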
I wrote a protocol specification, SheetMemory, that defines:
- A 6-detector Perceptor — correction, rule, preference, memory request, time commitment, identity. Runs in <1ms per message on the message_received hook.
- A 7-type schema (entity, event, fact, rule, impression, plan, reflex) — every memory gets a type, confidence score, importance rating, and optional expiration (see the record sketch after this list)
- QUERY / UPSERT / FORGET primitives with deterministic behavior rules
- Hard constraints: critical memories are immune to decay, expire_at records are forcibly archived, user corrections override everything
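To make the schema concrete, here is what a stored record could look like. The field names paraphrase the protocol; PROTOCOL.md in the repo is the normative reference:

```python
# Hypothetical record under the 7-type schema; field names paraphrase the
# protocol -- PROTOCOL.md is the normative reference.
record = {
    "type": "rule",            # entity | event | fact | rule | impression | plan | reflex
    "content": "Never force-push to main.",
    "confidence": 0.95,
    "importance": 10,          # critical: immune to decay per the hard constraints
    "keywords": ["git", "main", "force-push"],
    "expire_at": None,         # optional; expired records are forcibly archived
}
```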
The key architectural decision: retrieval uses deterministic field filters (type=rule AND confidence>=0.7 AND keywords LIKE '%contract%'), not vector similarity. Semantic search is an optional post-processing step, not the primary engine.
The implementation is an OpenClaw plugin — SQLite storage, Weibull time-based decay, and local-model classification via subagent.run with minimal context.
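For reference, Weibull decay means a record's weight follows the survival curve exp(-(t/scale)^shape). A minimal sketch, with placeholder scale and shape values rather than the plugin's tuned parameters:

```python
import math

def decay_weight(age_days: float, scale: float = 30.0, shape: float = 1.5) -> float:
    """Weibull survival function exp(-(t/scale)**shape).

    With shape > 1 the curve is flat at first and drops later --
    a gentler start than exponential decay, which is steepest at t = 0.
    """
    return math.exp(-((age_days / scale) ** shape))

def effective_importance(importance: float, age_days: float, critical: bool) -> float:
    # Hard constraint from the protocol: critical memories never decay.
    return importance if critical else importance * decay_weight(age_days)
```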
The benchmark
I built a 25-case dirty-input test set covering 11 challenge dimensions (a sample case sketch follows the list):
- noise-wrapped: key information buried in casual chatter
- implicit: information conveyed without direct statement
- fragmented: broken grammar, telegraphic sentences
- negation/correction: "No wait, that's wrong, actually it's..."
- multi-intent: event + plan, fact + plan intertwined in one message
- boundary-ambiguous: impression vs preference, rule vs fact
- code-switching: Chinese with embedded English
- hypothetical: "If the review passes next Monday, we'll..."
- sarcasm: "Oh absolutely love it when requirements change 2 days before deadline"
- third-party: "I heard from Dave that ops had a massive incident..."
- very short: "btw my email is james.chen@gmail.com"
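For a sense of the fixture shape, here is an illustrative case. The JSON layout is my own invention; the real fixtures ship with the benchmark script in the repo:

```python
# Illustrative case shape -- the real fixtures ship with
# scripts/bench-structured-memory.py in the repo.
case = {
    "dimension": "negation/correction",
    "input": "Deploy is Friday. No wait, that's wrong, actually it's Thursday 3pm.",
    "expected": {
        "type": "event",
        "content": "Deploy is Thursday 3pm",  # the correction must win
    },
}
```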
Chinese and English versions, identical structure. Tested against 4 models on a MacBook Pro (32GB RAM, macOS Tahoe 26.2, Ollama):
| Model | ZH accuracy (25 cases) | EN accuracy (25 cases) | Parse rate | Avg latency |
|---|---|---|---|---|
| qwen2.5:3b | 64% | 64% | 100% | 1.5s |
| qwen:7b | 68% | 56% | 92% | 3.4s |
| llama3.2:3b | — | 33% | 47% (EN) | 1.5s |
| gemma-26b (GGUF) | — | — | 71% | 30s+ |
The full benchmark script is at scripts/bench-structured-memory.py in the repo. You can run it against your own models.
What I learned
1. Thinking models overthink classification
Gemma-26b struggled with this specific task. Its chain-of-thought mechanism consumed 200–500 tokens of internal monologue before writing a single character of JSON output. Sample from the raw API response:
* Text: "我们公司的核心产品是一个AI编码助手..." ("Our company's core product is an AI coding assistant...")
* *Is it a fact?* It is a fact, but "event" is more specific...
* *Wait, let's check the importance scale again.*
* 10 = Critical (identity, core goals, safety rules)
* *Decision:* 8.
* *Self-Correction on importance:* Let's re-evaluate...
After 1024 tokens of output budget, it was still debating. The content field was truncated mid-JSON. Average latency: 30+ seconds per message.
This is not a judgment on the model itself — Gemma is excellent at reasoning-heavy tasks. But memory classification is a structured-output task with a fixed schema. It doesn't need chain-of-thought. The thinking mechanism, which is the model's strength in other contexts, becomes overhead here.
Rule of thumb: match the model's strengths to the task. A 3B non-thinking model with 1.5s latency can outperform a 26B thinking model for classification.
2. Parse reliability matters more than accuracy
qwen:7b scored 68% on Chinese — the highest in the matrix. But its 92% parse rate means it dropped 2 out of 25 messages entirely. Both failures were the same pattern: the model invented a type that doesn't exist in the schema, and the parser rejected it.
qwen2.5:3b scored 64% — 4 points lower — but with 100% parse rate across both languages. Every message produced a valid, typed record. A record classified as the wrong type is still retrievable. A record that was never written at all is gone forever.
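Here is that behavior as a sketch: a strict validator along these lines (the function is my illustration) rejects an invented type instead of coercing it, so the record is never written:

```python
SCHEMA_TYPES = {"entity", "event", "fact", "rule", "impression", "plan", "reflex"}

def validate(record: dict) -> dict | None:
    """Strict parse: an invented type means the record is dropped, not coerced."""
    if record.get("type") not in SCHEMA_TYPES:
        return None  # never written -- this is the qwen:7b failure mode
    return record
```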
100% parse rate > 4% accuracy gain. This is the most important finding in the benchmark.
3. Language-native models stay in their lane
qwen:7b is 68% on Chinese and 56% on English — a 12-point gap. llama3.2:3b, an English-native model, couldn't handle Chinese at all (only a 21% parse rate) and dropped to 47% parse rate on English due to JSON truncation (English keywords are longer, eating more token budget).
qwen2.5:3b is the only model with parity across languages. For a plugin that targets the global OpenClaw community, this stability matters.
4. Small models can't discriminate importance
The 3B model's importance scores collapsed to binary: either 4 ("moderate") or 7 ("important"). It never assigned 1–3 or 8–10. The 7B model showed more variance but drifted unpredictably — it rated burnout signals lower than meeting reminders.
This is why the protocol assigns importance to the Perceptor, not the LLM. Rules can reliably detect high-importance signals. The classifier polishes the summary; the Perceptor owns the importance.
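As a sketch of that division of labor (the mapping values are mine; the protocol fixes who assigns importance, not these exact numbers):

```python
# Illustrative detector -> importance mapping; the exact values are mine,
# the division of labor (rules assign importance) is the protocol's.
PERCEPTOR_IMPORTANCE = {
    "identity": 10,         # critical: who the user is
    "rule": 10,             # critical: safety rules and hard constraints
    "correction": 9,        # corrections override everything
    "memory_request": 8,
    "time_commitment": 7,
    "preference": 6,
}

def build_record(detector: str, summary: str, confidence: float) -> dict:
    return {
        "content": summary,                            # the LLM polishes the summary
        "importance": PERCEPTOR_IMPORTANCE[detector],  # the Perceptor owns importance
        "confidence": confidence,
    }
```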
Why deterministic retrieval?
Vector search dominates the memory-plugin ecosystem. It's powerful for semantic similarity. But it has fundamental limits:
- Non-deterministic results. Re-index or update an embedding model, and the same query returns different rankings.
- Poor debuggability. You can't inspect why a memory ranked #3 instead of #1 without analyzing embedding vectors.
- Platform-locked binaries. Embedding models are large, architecture-specific, and fragile across updates.
SheetMemory's primary retrieval engine:
```sql
SELECT * FROM memory_records
WHERE type = 'rule'
  AND confidence >= 0.7
  AND keywords LIKE '%contract%';
```
Deterministic, debuggable, and runs on the same SQLite file everywhere. Vector search is demoted to an optional rerank pass on the top-15 field-filtered candidates.
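Stitched together, retrieval could look like this two-stage sketch. The function names, USE_RERANK flag, and optional_rerank placeholder are mine; only the SQL filter mirrors the query above:

```python
import sqlite3

USE_RERANK = False  # semantic rerank is optional, never the primary engine

def optional_rerank(rows: list) -> list:
    # Placeholder: plug an embedding reranker in here if you want one.
    return rows

def retrieve(db_path: str, mem_type: str, keyword: str, min_conf: float = 0.7) -> list:
    """Stage 1: deterministic field filter over the same SQLite file everywhere."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT * FROM memory_records "
        "WHERE type = ? AND confidence >= ? AND keywords LIKE ? "
        "ORDER BY importance DESC LIMIT 15",
        (mem_type, min_conf, f"%{keyword}%"),
    ).fetchall()
    con.close()
    # Stage 2: rerank only the top-15 field-filtered candidates.
    return optional_rerank(rows) if USE_RERANK else rows
```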
Open source
The repo is github.com/innerca/sheetmemory. The protocol is PROTOCOL.md. Everything is MIT.
If you have a better classification model, or you want to add English Perceptor rules, or you found a case where the protocol breaks — open an issue or send a PR. The benchmark script is designed to be extensible: add your own test cases and run against your own models.
Backend engineer, OpenClaw user, tired of explaining the same preferences three times. @mingchxing