I built a local AI memory gate on a CPU, and my 7B model scored worse than my 1.5B model because it was too smart

#ai #python #opensource #machinelearning

Hi everyone,

I've been hacking on a personal, local project called Hillock. It's very much a work in progress and not some flawless breakthrough, but I wanted to see if we could build a lightweight, completely offline memory layer for local LLMs without running a heavy vector database or using up precious VRAM.

It is named after the axon hillock, the region of a neuron that sums incoming electrical signals and decides whether to fire (open the gate) or stay silent (block).

How the architecture works:

The Ground Truth (SQLite): Stores hard facts as simple database triples (Subject-Predicate-Object) so the system has a solid symbolic foundation.
The Synapses (Hebbian Plasticity): Tracks which concepts co-occur during a conversation to dynamically build gradient-free associative weights.
The Context (Hyperdimensional Computing): Maintains a 10,000-dimensional leaky context vector that rolls, binds, and accumulates history. This helps the system resolve pronouns (like "he/she") and decide when to block a query to prevent hallucinations.
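
To make the context idea concrete, here is a minimal sketch of a leaky context vector in plain Python. It is not Hillock's actual code: the bipolar vectors, the 0.7 decay, the SUBJECT role vector, the 0.3 gate threshold, and every function name are assumptions picked for illustration. Each update rolls the old context by one position, decays it, and adds a new role-filler binding, so recent mentions score higher than older ones and unseen entities score near zero.

```python
import math
import random

DIM = 10_000  # dimensionality mentioned above


def random_hv(rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def roll(v, shift=1):
    """Cyclic shift; here it encodes how many updates ago something was added."""
    shift %= len(v)
    return v[-shift:] + v[:-shift]


def bind(a, b):
    """Element-wise product, the usual binding for bipolar vectors."""
    return [x * y for x, y in zip(a, b)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class LeakyContext:
    def __init__(self, decay=0.7):  # decay value is an assumption
        self.decay = decay
        self.vec = [0.0] * DIM

    def update(self, role, filler):
        """Roll the old context, let it leak, then add the new binding."""
        bound = bind(role, filler)
        aged = roll(self.vec)
        self.vec = [self.decay * c + b for c, b in zip(aged, bound)]

    def match(self, role, filler, age=0):
        """Similarity to a binding that was added `age` updates ago."""
        return cosine(self.vec, roll(bind(role, filler), age))


def best_candidate(ctx, role, candidates, max_age=3):
    """Score each candidate at several ages and return the strongest one."""
    scores = {
        name: max(ctx.match(role, hv, age) for age in range(max_age))
        for name, hv in candidates.items()
    }
    name = max(scores, key=scores.get)
    return name, scores[name]


rng = random.Random(0)
SUBJECT = random_hv(rng)  # hypothetical role vector
entities = {n: random_hv(rng) for n in ("Einstein", "Curie", "Hopper")}

ctx = LeakyContext()
ctx.update(SUBJECT, entities["Curie"])     # mentioned first
ctx.update(SUBJECT, entities["Einstein"])  # mentioned most recently

GATE_THRESHOLD = 0.3  # assumed cut-off below which the gate stays closed
name, score = best_candidate(ctx, SUBJECT, entities)
decision = name if score >= GATE_THRESHOLD else None
print(decision, round(score, 2))
```

In this toy setup, a pronoun would resolve to the most recently mentioned subject, while an entity that never entered the context (Hopper here) stays far below the threshold, which is the condition under which a gate like this would block.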

The "Smarter Model, Lower Score" Paradox

I wrote a tough, 32-sentence scientific benchmark with complex sentence structures and hard negatives (like asking what Einstein discovered when the text only mentions Curie discovering radioactivity and Einstein working with her).
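
As an illustration of what a hard negative looks like against a symbolic store, here is a small sketch using Python's built-in sqlite3 module. The table layout and the answer() helper are my assumptions for this post, not necessarily Hillock's real schema. The two stored facts mirror the example above, so asking what Einstein discovered should return nothing and the gate should stay closed.

```python
import sqlite3

# In-memory database with a simple Subject-Predicate-Object table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triples ("
    "subject TEXT, predicate TEXT, object TEXT, "
    "UNIQUE(subject, predicate, object))"
)
conn.executemany(
    "INSERT OR IGNORE INTO triples VALUES (?, ?, ?)",
    [
        ("Marie_Curie", "discovered", "radioactivity"),
        ("Albert_Einstein", "worked_with", "Marie_Curie"),
    ],
)


def answer(conn, subject, predicate):
    """Return the stored objects for a query, or None to signal 'block'."""
    rows = conn.execute(
        "SELECT object FROM triples WHERE subject = ? AND predicate = ?",
        (subject, predicate),
    ).fetchall()
    return [row[0] for row in rows] or None


print(answer(conn, "Marie_Curie", "discovered"))      # a stored fact
print(answer(conn, "Albert_Einstein", "discovered"))  # hard negative: no such fact
```

The point of the hard negative is that Einstein does appear in the data, just not with that predicate, so a system that matches on the entity alone would answer when it should block.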

When I ran Qwen 2 (1.5B), it got around 50% retrieval accuracy. But when I upgraded to the much smarter Qwen 3 (5.2 GB), ingestion time jumped to 20 minutes on my local machine, and the score actually dropped to 25%!

Why? Because Qwen 3 is too expressive for my rigid evaluation script:

The test expected Marie_Curie born_in Poland. Qwen 3 extracted [Marie_Curie] -[spent_childhood_in]-> [Poland].
The test expected Albert_Einstein. Qwen 3 extracted [albert_einstein] (lowercase), which broke the exact-string checks.
The test expected compiler. Qwen 3 extracted [first_compiler].
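
One way to make the scoring less brittle is to normalize both sides before comparing. The sketch below is a hypothetical matcher, not the evaluation script in the repo: the alias table, the function names, and the token-run rule for entities are all assumptions. Whether spent_childhood_in should count as born_in is a judgement call, and an explicit alias table at least makes that choice visible.

```python
import re

# Hypothetical alias table: predicates a model might emit -> canonical predicate.
PREDICATE_ALIASES = {"spent_childhood_in": "born_in"}


def norm(term):
    """Lowercase and collapse anything non-alphanumeric into single underscores."""
    return re.sub(r"[^a-z0-9]+", "_", term.lower()).strip("_")


def entity_match(expected, got):
    """True if the expected tokens appear as a contiguous run inside the extracted entity."""
    e, g = norm(expected).split("_"), norm(got).split("_")
    n = len(e)
    return any(g[i:i + n] == e for i in range(len(g) - n + 1))


def predicate_match(expected, got):
    """Compare predicates after normalization and alias lookup."""
    def canon(p):
        return PREDICATE_ALIASES.get(norm(p), norm(p))
    return canon(expected) == canon(got)


def triple_match(expected, got):
    """Lenient comparison of (subject, predicate, object) triples."""
    return (
        entity_match(expected[0], got[0])
        and predicate_match(expected[1], got[1])
        and entity_match(expected[2], got[2])
    )


print(triple_match(("Marie_Curie", "born_in", "Poland"),
                   ("Marie_Curie", "spent_childhood_in", "Poland")))
print(entity_match("Albert_Einstein", "albert_einstein"))
print(entity_match("compiler", "first_compiler"))
print(entity_match("Albert_Einstein", "Marie_Curie"))
```

The first three checks cover the three mismatches listed above, and the last one shows that a genuinely different entity still fails, so the hard negatives keep their teeth.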

So Qwen 3 populated the database with clean, highly accurate triples (up to 6 relations per block in a single pass), but the exact-match evaluation harness penalized it for wording them differently from what the test expected.

The codebase is written in pure Python, is fully open-source (under the AGPL-3.0 copyleft license), and is designed to run entirely offline on consumer hardware.

If you're interested in vector symbolic architectures (VSAs) or alternative cognitive architectures, or have feedback on the HDC context-binding math, I'd love for you to check it out!

GitHub Repository: https://github.com/roandejager/Hillock