<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: innerca</title>
    <description>The latest articles on DEV Community by innerca (@innerca).</description>
    <link>https://dev.to/innerca</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3908603%2F30bb8d10-ad29-45ae-a2c6-decb73b8c485.jpeg</url>
      <title>DEV Community: innerca</title>
      <link>https://dev.to/innerca</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/innerca"/>
    <language>en</language>
    <item>
      <title>I tested 4 local models as memory classifiers for OpenClaw — and thinking models are a trap</title>
      <dc:creator>innerca</dc:creator>
      <pubDate>Sat, 02 May 2026 08:29:24 +0000</pubDate>
      <link>https://dev.to/innerca/i-tested-4-local-models-as-memory-classifiers-for-openclaw-and-thinking-models-are-a-trap-1eh5</link>
      <guid>https://dev.to/innerca/i-tested-4-local-models-as-memory-classifiers-for-openclaw-and-thinking-models-are-a-trap-1eh5</guid>
      <description>&lt;p&gt;I'm a backend engineer. When I started using OpenClaw, I kept hitting the same reliability problems every backend engineer recognizes: the agent would forget who I was after a session reset. Safety rules would vanish after context compaction. I'd told it my preferences three times, and it still asked "Python or Go?"&lt;/p&gt;

&lt;p&gt;So I did what made sense: I wrote a protocol specification, built a reference implementation, and ran a benchmark. Here's what I learned.&lt;/p&gt;

&lt;h2&gt;The problem isn't retrieval. It's input.&lt;/h2&gt;

&lt;p&gt;OpenClaw's memory pipeline looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent decides what to remember → calls memory tools → stores unstructured text → retrieves via vector similarity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every step depends on the agent making the right decision at the right time. When context compacts, the agent may forget to call memory tools entirely. The memory is &lt;em&gt;potentially retrievable&lt;/em&gt;, but the agent never initiates the search.&lt;/p&gt;

&lt;p&gt;The community has built excellent solutions on the retrieval side — LanceDB with hybrid vector + BM25 search, cross-encoder reranking, knowledge graphs. But the input side was still wide open: &lt;strong&gt;what should be remembered in the first place?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;SheetMemory: rule-first memory extraction&lt;/h2&gt;

&lt;p&gt;The architecture is straightforward: every user message passes through a Perceptor — a pure-regex signal detector — before anything reaches a model. High-confidence signals (rules, explicit preferences, corrections, memory requests) are classified and written directly. No LLM involved. For lower-confidence or ambiguous input, an optional local LLM classifier can map text to the schema.&lt;/p&gt;
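
&lt;p&gt;As a minimal sketch, a Perceptor pass can be little more than a list of compiled regexes. The patterns and confidence values below are my illustration, not the plugin's actual tables:&lt;/p&gt;

```python
import re

# Illustrative rule-first signal detector. Patterns and confidences are
# assumptions for demonstration; SheetMemory's real Perceptor covers six
# detectors (correction, rule, preference, memory request, time
# commitment, identity).
DETECTORS = [
    ("rule",           re.compile(r"\b(always|never|must)\b", re.I),             0.95),
    ("memory_request", re.compile(r"\bremember (that|this)\b", re.I),            0.95),
    ("correction",     re.compile(r"\b(no wait|actually|that's wrong)\b", re.I), 0.90),
    ("preference",     re.compile(r"\bi (prefer|like|always use)\b", re.I),      0.85),
]

def perceive(message: str):
    """Return (type, confidence) signals found in a message, no LLM involved."""
    return [(mtype, conf) for mtype, rx, conf in DETECTORS if rx.search(message)]
```

&lt;p&gt;Because every pattern is precompiled, a pass like this stays comfortably inside a sub-millisecond budget per message.&lt;/p&gt;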

&lt;p&gt;I wrote a protocol specification, &lt;a href="https://github.com/innerca/sheetmemory/blob/main/PROTOCOL.md" rel="noopener noreferrer"&gt;SheetMemory&lt;/a&gt;, that defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;6-detector Perceptor&lt;/strong&gt; — correction, rule, preference, memory request, time commitment, identity. Runs in &amp;lt;1ms per message on the &lt;code&gt;message_received&lt;/code&gt; hook.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;7-type schema&lt;/strong&gt; (entity, event, fact, rule, impression, plan, reflex) — every memory gets a type, confidence score, importance rating, and optional expiration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QUERY / UPSERT / FORGET&lt;/strong&gt; primitives with deterministic behavior rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard constraints&lt;/strong&gt;: critical memories are immune to decay, expire_at records are forcibly archived, user corrections override everything&lt;/li&gt;
&lt;/ul&gt;
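
&lt;p&gt;To make the schema concrete, here is what a single typed record could look like. Field names follow the bullet points above; the authoritative wire format is defined in PROTOCOL.md:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

# The 7 types from the SheetMemory schema.
VALID_TYPES = {"entity", "event", "fact", "rule", "impression", "plan", "reflex"}

@dataclass
class MemoryRecord:
    type: str                        # one of the 7 schema types
    content: str                     # normalized summary to store
    confidence: float                # 0.0 to 1.0
    importance: int                  # 1 to 10; critical records are immune to decay
    expire_at: Optional[str] = None  # ISO timestamp; forcibly archived when past

    def __post_init__(self):
        # Deterministic behavior rule: a record with an invented type is
        # rejected outright rather than stored as garbage.
        if self.type not in VALID_TYPES:
            raise ValueError(f"unknown memory type: {self.type!r}")
```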

&lt;p&gt;The key architectural decision: &lt;strong&gt;retrieval uses deterministic field filters&lt;/strong&gt; (&lt;code&gt;type=rule AND confidence&amp;gt;=0.7 AND keywords LIKE '%contract%'&lt;/code&gt;), not vector similarity. Semantic search is an optional post-processing step, not the primary engine.&lt;/p&gt;

&lt;p&gt;The implementation is an OpenClaw plugin — SQLite storage, Weibull time-based decay, and local-model classification via &lt;code&gt;subagent.run&lt;/code&gt; with minimal context.&lt;/p&gt;
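
&lt;p&gt;Weibull decay is the standard survival curve S(t) = exp(-(t/scale)^shape). A sketch, with placeholder parameters rather than the plugin's actual defaults:&lt;/p&gt;

```python
import math

def weibull_retention(age_days: float, scale: float = 30.0, shape: float = 1.5) -> float:
    """Fraction of a memory's weight retained after age_days.

    scale sets the decay horizon; shape > 1 means decay accelerates with
    age: fresh memories are stable, stale ones drop off quickly. Both
    defaults here are illustrative.
    """
    return math.exp(-((age_days / scale) ** shape))
```

&lt;p&gt;Per the protocol's hard constraints, a curve like this would never apply to critical memories, which are immune to decay.&lt;/p&gt;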

&lt;h2&gt;The benchmark&lt;/h2&gt;

&lt;p&gt;I built a 25-case dirty-input test set covering 11 challenge dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;noise-wrapped&lt;/strong&gt;: key information buried in casual chatter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;implicit&lt;/strong&gt;: information conveyed without direct statement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fragmented&lt;/strong&gt;: broken grammar, telegraphic sentences&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;negation/correction&lt;/strong&gt;: "No wait, that's wrong, actually it's..."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;multi-intent&lt;/strong&gt;: event + plan, fact + plan intertwined in one message&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;boundary-ambiguous&lt;/strong&gt;: impression vs preference, rule vs fact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;code-switching&lt;/strong&gt;: Chinese with embedded English&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hypothetical&lt;/strong&gt;: "If the review passes next Monday, we'll..."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sarcasm&lt;/strong&gt;: "Oh absolutely love it when requirements change 2 days before deadline"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;third-party&lt;/strong&gt;: "I heard from Dave that ops had a massive incident..."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;very short&lt;/strong&gt;: "btw my email is &lt;a href="mailto:james.chen@gmail.com"&gt;james.chen@gmail.com&lt;/a&gt;"&lt;/li&gt;
&lt;/ul&gt;
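
&lt;p&gt;The harness logic reduces to a small loop. The cases below are stand-ins I wrote to show the shape; the real 25-case set lives in the repo's benchmark script:&lt;/p&gt;

```python
# Hypothetical cases illustrating two of the dimensions above.
CASES = [
    {"dim": "noise-wrapped", "text": "lol anyway, random thought, I always deploy on Fridays",
     "expect_type": "rule"},
    {"dim": "very short", "text": "btw my email is james.chen@gmail.com",
     "expect_type": "fact"},
]

def score(classify, cases):
    """classify(text) returns a dict with a 'type' key, or None on parse failure."""
    parsed = correct = 0
    for case in cases:
        out = classify(case["text"])
        if out is None:
            continue  # a parse failure loses the record entirely
        parsed += 1
        correct += out["type"] == case["expect_type"]
    n = len(cases)
    return {"parse_rate": parsed / n, "accuracy": correct / n}
```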

&lt;p&gt;Chinese and English versions, identical structure. Tested against 4 models on a MacBook Pro (32GB RAM, macOS Tahoe 26.2, Ollama):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;ZH accuracy (25 cases)&lt;/th&gt;
&lt;th&gt;EN accuracy (25 cases)&lt;/th&gt;
&lt;th&gt;Parse Rate&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:3b&lt;/td&gt;
&lt;td&gt;64%&lt;/td&gt;
&lt;td&gt;64%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;1.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen:7b&lt;/td&gt;
&lt;td&gt;68%&lt;/td&gt;
&lt;td&gt;56%&lt;/td&gt;
&lt;td&gt;92%&lt;/td&gt;
&lt;td&gt;3.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama3.2:3b&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;33%&lt;/td&gt;
&lt;td&gt;47% (EN)&lt;/td&gt;
&lt;td&gt;1.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma-26b (GGUF)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;71%&lt;/td&gt;
&lt;td&gt;30s+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The full benchmark script is at &lt;code&gt;scripts/bench-structured-memory.py&lt;/code&gt; in the repo. You can run it against your own models.&lt;/p&gt;

&lt;h2&gt;What I learned&lt;/h2&gt;

&lt;h3&gt;1. Thinking models overthink classification&lt;/h3&gt;

&lt;p&gt;Gemma-26b struggled with this specific task. Its chain-of-thought mechanism consumed 200–500 tokens of internal monologue before writing a single character of JSON output. Sample from the raw API response:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;*   Text: "我们公司的核心产品是一个AI编码助手..."
*   *Is it a fact?* It is a fact, but "event" is more specific...
*   *Wait, let's check the importance scale again.*
*   10 = Critical (identity, core goals, safety rules)
*   *Decision:* 8.
*   *Self-Correction on importance:* Let's re-evaluate...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After 1024 tokens of output budget, it was still debating. The content field was truncated mid-JSON. Average latency: 30+ seconds per message.&lt;/p&gt;

&lt;p&gt;This is not a judgment on the model itself — Gemma is excellent at reasoning-heavy tasks. But memory classification is a structured-output task with a fixed schema. It doesn't need chain-of-thought. The thinking mechanism, which is the model's strength in other contexts, becomes overhead here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; match the model's strengths to the task. A 7B non-thinking model with 1.5s latency can outperform a 26B thinking model for classification.&lt;/p&gt;

&lt;h3&gt;2. Parse reliability matters more than accuracy&lt;/h3&gt;

&lt;p&gt;qwen:7b scored 68% on Chinese — the highest in the matrix. But its 92% parse rate means it dropped 2 out of 25 messages entirely. Both failures were the same pattern: the model invented a type that doesn't exist in the schema, and the parser rejected it.&lt;/p&gt;

&lt;p&gt;qwen2.5:3b scored 64% — 4 points lower — but with 100% parse rate across both languages. Every message produced a valid, typed record. A record classified as the wrong type is still retrievable. A record that was never written at all is gone forever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A 100% parse rate beats a 4-point accuracy gain.&lt;/strong&gt; This is the most important finding in the benchmark.&lt;/p&gt;
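
&lt;p&gt;The two failure modes differ in what survives. Here is a sketch of the strict rejection described above, plus one possible mitigation; the coercion fallback is my suggestion, not part of the protocol:&lt;/p&gt;

```python
VALID_TYPES = {"entity", "event", "fact", "rule", "impression", "plan", "reflex"}

def parse_strict(raw: dict):
    """The failure mode hit by qwen:7b: an invented type drops the record."""
    if raw.get("type") not in VALID_TYPES:
        return None  # never written, so gone forever
    return raw

def parse_salvage(raw: dict):
    """Possible mitigation (not in the protocol): coerce unknown types to
    'fact' at capped confidence, so a mistyped record stays retrievable."""
    if raw.get("type") not in VALID_TYPES:
        return {**raw, "type": "fact", "confidence": min(raw.get("confidence", 0.5), 0.5)}
    return raw
```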

&lt;h3&gt;3. Language-native models stay in their lane&lt;/h3&gt;

&lt;p&gt;qwen:7b is 68% on Chinese and 56% on English — a 12-point gap. llama3.2:3b, an English-native model, couldn't handle Chinese at all (it parsed only 21% of messages) and dropped to a 47% parse rate on English due to JSON truncation (English keywords are longer, eating more token budget).&lt;/p&gt;

&lt;p&gt;qwen2.5:3b is the only model with parity across languages. For a plugin that targets the global OpenClaw community, this stability matters.&lt;/p&gt;

&lt;h3&gt;4. Small models can't discriminate importance&lt;/h3&gt;

&lt;p&gt;The 3B model's importance scores collapsed to binary: either 4 ("moderate") or 7 ("important"). It never assigned 1–3 or 8–10. The 7B model showed more variance but drifted unpredictably — it rated burnout signals lower than meeting reminders.&lt;/p&gt;

&lt;p&gt;This is why the protocol assigns importance to the Perceptor, not the LLM. Rules can reliably detect high-importance signals. The classifier polishes the summary; the Perceptor owns the importance.&lt;/p&gt;
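
&lt;p&gt;A deterministic importance table is easy to sketch. Patterns and scores below are illustrative, not the plugin's actual rules:&lt;/p&gt;

```python
import re

# Rule-based importance: deterministic patterns, not the LLM, decide how
# much a memory matters. Scores follow the article's scale (10 = critical).
IMPORTANCE_RULES = [
    (re.compile(r"\b(never|always|must)\b", re.I), 10),         # safety rules
    (re.compile(r"\bmy (name|email|timezone) is\b", re.I), 9),  # identity
    (re.compile(r"\b(deadline|meeting)\b", re.I), 7),           # time commitments
]

def importance(text: str, default: int = 4) -> int:
    """Highest matching rule wins; unmatched text gets a moderate default."""
    return max((score for rx, score in IMPORTANCE_RULES if rx.search(text)),
               default=default)
```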

&lt;h2&gt;Why deterministic retrieval?&lt;/h2&gt;

&lt;p&gt;Vector search dominates the memory-plugin ecosystem. It's powerful for semantic similarity. But it has fundamental limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-deterministic results.&lt;/strong&gt; Re-index or update an embedding model, and the same query returns different rankings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poor debuggability.&lt;/strong&gt; You can't inspect why a memory ranked #3 instead of #1 without analyzing embedding vectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform-locked binaries.&lt;/strong&gt; Embedding models are large, architecture-specific, and fragile across updates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SheetMemory's primary retrieval engine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;memory_records&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'rule'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;keywords&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%contract%'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deterministic, debuggable, and runs on the same SQLite file everywhere. Vector search is demoted to an optional rerank pass on the top-15 field-filtered candidates.&lt;/p&gt;
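
&lt;p&gt;In Python terms, the primary engine is an ordinary parameterized query over that SQLite file. Column names follow the query above; the ORDER BY choice is mine for illustration:&lt;/p&gt;

```python
import sqlite3

def query_rules(db: sqlite3.Connection, keyword: str,
                min_conf: float = 0.7, limit: int = 15):
    """Deterministic primary retrieval: plain field filters, no embeddings.
    Returns at most `limit` candidates for an optional semantic rerank."""
    return db.execute(
        "SELECT * FROM memory_records "
        "WHERE type = 'rule' AND confidence >= ? AND keywords LIKE ? "
        "ORDER BY importance DESC LIMIT ?",
        (min_conf, f"%{keyword}%", limit),
    ).fetchall()
```

&lt;p&gt;A rerank pass may reorder those candidates, but it can never change which records appear in the result set, which is what keeps the output debuggable.&lt;/p&gt;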

&lt;h2&gt;Open source&lt;/h2&gt;

&lt;p&gt;The repo is &lt;a href="https://github.com/innerca/sheetmemory" rel="noopener noreferrer"&gt;github.com/innerca/sheetmemory&lt;/a&gt;. The protocol is &lt;a href="https://github.com/innerca/sheetmemory/blob/main/PROTOCOL.md" rel="noopener noreferrer"&gt;PROTOCOL.md&lt;/a&gt;. Everything is MIT.&lt;/p&gt;

&lt;p&gt;If you have a better classification model, or you want to add English Perceptor rules, or you found a case where the protocol breaks — open an issue or send a PR. The benchmark script is designed to be extensible: add your own test cases and run against your own models.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Backend engineer, OpenClaw user, tired of explaining the same preferences three times. &lt;a href="https://github.com/innerca" rel="noopener noreferrer"&gt;@mingchxing&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>openclaw</category>
    </item>
  </channel>
</rss>
