From the author:
From my endless pursuit of getting LLMs to stop hallucinating, I had a late-night chat with Gemini discussing what-ifs.
What if we could use a vector DB during training? What if we could allow for agentic reasoning at the point of training?
What if the LLM could apply this reasoning between layers?
I had my AI agent code up a prototype and started training it, only to be disappointed with word salad.
I think I'm on to something here, though.
I recently came across a paper by DeepSeek, and it aligns with the ideas I had. I even called my repo Engram, and that's an idea discussed in their paper.
Coincidence? I think not, haha.
But I need help figuring out training: what's the best way to train this thing?
Consider this less a how-to article and more a request for help to keep pushing this thing forward. I think I'm on to something.
*Word salad:*
The code: https://github.com/kent-ai-dev/engram
How Engram Actually Thinks: A Deep Dive Into Its Reasoning Engine
The Attention Brain: A Panel of Pattern-Matching Judges
Think of Engram's reasoning engine like a panel of 4 judges at a talent show. Each judge watches the same performance (your input text), but they're looking for different things. Judge 1 notices basic patterns. Judge 2 spots relationships. Judge 3 identifies abstract themes. Judge 4 makes the final call.
That's what the 4 stacked attention layers do.
Here's the thing though: these aren't independent judges. They're more like a relay race. Judge 1 passes their observations to Judge 2, who adds their own analysis and passes it forward. By the time it reaches Judge 4, the understanding is way more sophisticated than what Judge 1 saw.
But what does "understanding" even mean for a neural network? It's just numbers. Lots and lots of numbers.
Words Are Coordinates in Space (Yes, Really)
Before we can talk about reasoning, we need to understand how Engram sees words. It doesn't see letters or syllables. It sees coordinates.
Every word in Engram's vocabulary is a point in 96-dimensional space. I know that sounds like sci-fi nonsense, but it's simpler than it seems. You can't visualize 96 dimensions, so let's pretend it's 3D space for a second.
Imagine "cat" is at position (5, 2, 8). "Dog" might be at (5.3, 2.1, 7.8) — really close to "cat" because they're both small furry pets. "Car" might be way over at (45, 12, 3) because it's got nothing to do with animals.
Words with similar meanings cluster together in this space. Words that appear in similar contexts are neighbors. That's how the model "understands" that "cat" and "dog" are related — they're literally close together in math-space.
So when you type "the cat sat on the", Engram converts that into 5 points in 96D space. Those are the input coordinates.
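The toy 3-D example can be made concrete in a few lines. These coordinates are the made-up ones from above; Engram's real vectors are 96-dimensional and learned during training:

```python
import numpy as np

# Made-up 3-D coordinates from the example above; real embeddings
# are 96-dimensional and learned, not hand-picked.
embeddings = {
    "cat": np.array([5.0, 2.0, 8.0]),
    "dog": np.array([5.3, 2.1, 7.8]),
    "car": np.array([45.0, 12.0, 3.0]),
}

def distance(a, b):
    """Euclidean (L2) distance between two word vectors."""
    return float(np.linalg.norm(embeddings[a] - embeddings[b]))

print(distance("cat", "dog"))  # small: near neighbors in meaning-space
print(distance("cat", "car"))  # large: unrelated words sit far apart
```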
Attention: The Smart Way to Look Backwards
Now here's where it gets interesting. When the brain is trying to predict what comes after "the cat sat on the", it can't just look at the last word. That's useless. "The" tells you almost nothing.
It needs to look back at the whole phrase and ask: "Which earlier words matter most right now?"
That's attention. It's a weighted search through previous context.
Let's break down how one attention layer works. There are three mini-operations happening inside:
Query, Key, Value (The Spotlight Mechanism)
Every word gets transformed into three different representations:
- Query: "What am I looking for?"
- Key: "What do I have to offer?"
- Value: "What information should I give you?"
When predicting after "on the", the query is asking: "I need a location noun." Then it scans backward through all the previous words' keys. "Cat" offers "animal subject." "Sat" offers "past tense verb." "On the" offers "preposition indicating spatial relationship."
The attention mechanism computes scores: how relevant is each previous word to what I'm looking for right now? It's like shining a spotlight on certain words while dimming others.
Then it takes a weighted average of the values. If "cat" got a high attention score, its value (the actual information extracted from it) gets weighted heavily in the final output.
Here's the math, but don't panic:
```
attention_scores = (query · key) / sqrt(96)
attention_weights = softmax(attention_scores)
output = weighted_sum(values, attention_weights)
```
Translation: multiply how much you're looking for by what each word offers, convert to percentages, take weighted average.
That's it. That's attention. It's just a smart way to focus on relevant context.
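Those three formulas translate almost directly into code. Here's a minimal NumPy sketch (not Engram's actual implementation, which would also need causal masking and learned projection matrices):

```python
import numpy as np

def softmax(x):
    """Convert raw scores into probabilities along the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: the three formulas above, verbatim."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # query · key, scaled by sqrt(96)
    weights = softmax(scores)       # convert to percentages
    return weights @ V              # weighted average of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 96)) for _ in range(3))
print(attention(Q, K, V).shape)  # (5, 96): one blended vector per word
```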
The Feedforward Network (The Actual Thinking Part)
After attention gathers information from previous words, there's a second step: the feedforward network. This is where computation happens.
It's a simple two-layer neural network:
- Expand the representation into a bigger space (4x larger)
- Run it through a nonlinear activation function (GELU, which lets the network learn complex patterns)
- Compress it back down to the original size
Why expand then compress? Because the bigger space gives the network room to explore different interpretations. It's like brainstorming on a whiteboard before writing your final answer.
The feedforward network learns things like "if I see [preposition + determiner] pattern, there's usually a noun coming next" or "question words at the start change the entire sentence structure."
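The expand-activate-compress step is only a few lines. A sketch with random (untrained) weights, assuming the 96-dimensional width and 4x expansion described above:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class FeedForward:
    """Expand 96 -> 384, apply GELU, compress 384 -> 96."""
    def __init__(self, d=96, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.02, size=(d, 4 * d))  # expand
        self.W2 = rng.normal(scale=0.02, size=(4 * d, d))  # compress

    def __call__(self, x):
        return gelu(x @ self.W1) @ self.W2

ff = FeedForward()
print(ff(np.ones((5, 96))).shape)  # (5, 96): same shape in and out
```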
Residual Connections (The Memory Trick)
There's one more detail that matters: residual connections. After each attention block and feedforward block, the output gets added back to the input.
Why? Because neural networks are bad at remembering their starting point after many transformations. Residual connections let information skip forward unchanged, so the network doesn't forget what the original input was.
It's like taking notes during a lecture. You don't just memorize the professor's summary at the end — you keep your original notes too.
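In code, a residual connection is a single addition. A minimal sketch:

```python
import numpy as np

def residual(block, x):
    """Run a block and add its output back onto the input, so the
    original signal survives the transformation unchanged."""
    return x + block(x)

x = np.array([1.0, 2.0, 3.0])
# Even a block that throws everything away leaves the input intact:
print(residual(lambda v: np.zeros_like(v), x))  # [1. 2. 3.]
```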
The 4-Layer Journey: How Understanding Deepens
Let's trace what happens when you input: "what is the capital of"
Layer 0 processes the raw word vectors. Attention identifies that "what" and "capital" are the important words. "Is" and "the" get low attention scores. Feedforward recognizes this as a question pattern. Output: "This is asking for a capital city of something."
But wait. Before Layer 1 starts, something new happens.
N-gram Memory Injection (The Shortcut)
This is where Engram diverges from standard transformers. After Layer 0 finishes, the brain looks up bigrams and trigrams in a hash table:
- Bigram: "capital" + "of" → hash(47821) → table entry #3821 → embedding vector
- Trigram: "the" + "capital" + "of" → hash(91204) → table entry #1204 → embedding vector
These vectors represent "how often this phrase appears in the training data and what usually follows it." It's pre-computed pattern recognition. No thinking required.
But here's the clever part: a learned gate decides how much to trust this memory. The gate looks at Layer 0's output and the N-gram memory and computes a blend weight.
If the model is confident the N-gram memory is relevant, it injects it heavily. If it's uncertain, it uses less. This happens between Layer 0 and Layer 1, so the remaining layers get a head start.
Why does this matter? Because reconstructing common phrases wastes neural computation. Layers 1-4 can focus on actual reasoning instead of re-deriving "capital of → country name" from scratch every time.
Layer 1 receives the memory-enhanced representation. It refines the understanding: "capital of → probably expecting a country name, but could be a state."
Layer 2 goes deeper: "question structure → expects a proper noun answer, geographic domain."
Layer 3 adds context: "no other geographic context in the sentence → this is a standalone geography quiz question."
Layer 4 makes the final prediction: a 96D vector representing "the concept of a country name, probably European or well-known."
Each layer makes the representation more abstract and more task-specific.
Pondering: Thinking Harder When Confused
Most neural networks run through their layers once and call it done. Engram doesn't.
After the initial pass through Layers 1-4, a halt gate checks: "Am I confident enough, or should I think more?"
It's a tiny neural network (literally one weight matrix) that looks at the final output and predicts a confidence score. If the score is high (above 0.95), it stops. If it's medium or low, it loops back through the layers again.
This can happen up to 3 times. Each loop refines the prediction more.
Easy inputs (common phrases, simple patterns) usually halt after 1 pass. Hard inputs (rare words, complex grammar) ponder for 2-3 passes.
It's adaptive compute. The model literally allocates more processing power to difficult problems.
Most GPT-style models don't do this. They run through every layer for every token, whether you're typing "hello" or asking them to prove the Riemann hypothesis. Same compute budget for everything.
Engram uses 1x compute for "hello" and 3x compute for weird philosophical questions. That's the pondering mechanism in action.
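The halt-gate loop can be sketched like this. The single weight vector `halt_w`, the 0.95 threshold, and the 3-pass cap follow the description above; everything else is illustrative:

```python
import numpy as np

def ponder(x, layers, halt_w, max_passes=3, threshold=0.95):
    """Re-run the layer stack until the halt gate reports confidence
    above the threshold, up to max_passes times."""
    for n_passes in range(1, max_passes + 1):
        for layer in layers:
            x = x + layer(x)                            # residual layer pass
        confidence = 1.0 / (1.0 + np.exp(-x @ halt_w))  # tiny one-matrix gate
        if confidence > threshold:
            break
    return x, n_passes

# A gate that is never confident ponders the full 3 passes:
_, n = ponder(np.ones(4), [lambda v: np.zeros_like(v)], halt_w=np.zeros(4))
print(n)  # 3
```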
The Final Prediction: From Vector to Word
After pondering finishes, the brain outputs a 96-dimensional vector. Let's call it prediction_vector.
This vector represents "the abstract concept of what should come next." It's not a word yet. It's a coordinate in concept space.
Now the brain searches ChromaDB (the vocabulary database) for the word whose vector is closest to prediction_vector. It measures distance using L2 norm (Euclidean distance).
Candidate words:
- "france" → distance 0.23
- "paris" → distance 0.31
- "germany" → distance 0.38
- "banana" → distance 2.45
"France" is closest, so it's the top candidate. But the model doesn't just pick the closest one. That'd be boring and repetitive.
Instead, it samples from the top-k candidates using temperature. Temperature controls randomness:
- Low temp (0.1): almost always picks "france" (safe, predictable)
- High temp (1.5): might pick "paris" or "germany" sometimes (creative, risky)
Engram defaults to 0.9 — a balance between coherence and variety.
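The search-then-sample step can be sketched with a plain dictionary standing in for the ChromaDB query (the vectors and distances here are illustrative):

```python
import numpy as np

def sample_next_word(pred, vocab, k=3, temperature=0.9, seed=0):
    """Find the k nearest word vectors by L2 distance, then sample one,
    with closer words getting higher probability."""
    words = list(vocab)
    dists = np.array([np.linalg.norm(pred - vocab[w]) for w in words])
    top = np.argsort(dists)[:k]
    logits = -dists[top] / temperature  # closer word -> higher logit
    p = np.exp(logits - logits.max())
    p /= p.sum()
    choice = np.random.default_rng(seed).choice(top, p=p)
    return words[choice]

vocab = {
    "france": np.array([1.0, 0.0]),
    "paris": np.array([0.8, 0.1]),
    "banana": np.array([-5.0, 3.0]),
}
pred = np.array([0.95, 0.02])
print(sample_next_word(pred, vocab, k=1))  # greedy: always the nearest word
```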
Episodic Memory: The Experience Boost
Before finalizing the answer, there's one more step. If episodic memory exists (it does in test_brain.py), the model queries the memory database.
It searches for past interactions where the brain's internal state was similar to right now. Not similar words — similar brain states. The search uses the prediction vector as the query.
Let's say it finds: "Three conversations ago, user asked about European capitals, and the answer was France."
It retrieves those memory vectors and blends them into the current prediction using the same learned gate that was used for N-gram memory.
If the gate decides the memory is relevant, it shifts the prediction slightly toward the remembered answer. If the memory seems irrelevant, the gate suppresses it.
This is how Engram "remembers" past conversations without explicitly storing facts in its weights.
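The retrieval step amounts to a nearest-neighbor search over stored brain states. A toy stand-in for the memory database (the blending gate would then work like the n-gram gate described earlier):

```python
import numpy as np

def recall(query_state, memory_states, memory_values, top_k=1):
    """Return the stored value(s) whose saved brain state is closest
    to the current one, averaged if top_k > 1."""
    dists = np.linalg.norm(memory_states - query_state, axis=1)
    nearest = np.argsort(dists)[:top_k]
    return memory_values[nearest].mean(axis=0)

# Two remembered interactions: saved brain state -> answer vector
states = np.array([[1.0, 0.0], [0.0, 1.0]])
values = np.array([[9.0, 9.0], [-9.0, -9.0]])
print(recall(np.array([0.9, 0.1]), states, values))  # [9. 9.]
```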
Why This Architecture Is Different
Standard transformers (GPT, BERT, etc.) are pure attention stacks. Information flows in one direction: input → layer 1 → layer 2 → ... → layer N → output. Fixed depth, fixed compute, no external memory.
Engram adds three things on top of the standard transformer:
- Conditional memory lookup (N-gram tables injected between layers)
- Adaptive depth (pondering mechanism, 1-3 loops based on confidence)
- External episodic memory (brain-state-indexed experience journal)
And it learns differently too. When the model makes a prediction error, it computes surprise:
surprise = |predicted - actual|²
If surprise is high (the model was really wrong), it multiplies the gradient by up to 3x. If surprise is low (the model already knew this), it barely updates.
This is dopamine-style learning. Humans learn faster from shocking events than from boring repetition. Engram does the same.
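One plausible way to implement that rule is below. The exact mapping from error to gradient scale is my assumption; the text above only specifies the squared error and the 3x cap:

```python
import numpy as np

def surprise_scale(predicted, actual, floor=0.1, cap=3.0):
    """Map squared prediction error to a gradient multiplier in [floor, cap]."""
    surprise = float(np.sum((predicted - actual) ** 2))
    return float(np.clip(surprise, floor, cap))

# The model "already knew this": tiny error, barely any update.
print(surprise_scale(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 0.1
# The model was really wrong: big error, update boosted to the 3x cap.
print(surprise_scale(np.array([1.0, 0.0]), np.array([-2.0, 2.0])))  # 3.0
```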
The Whole Pipeline, Start to Finish
You type: "what is the capital of france"
- Tokenize into words: ["what", "is", "the", "capital", "of", "france"]
- Look up word vectors in ChromaDB (96D coordinates for each)
- Add positional embeddings (so the model knows word order)
- Layer 0 attention: focus on "what" and "capital"
- Layer 0 feedforward: recognize question pattern
- N-gram memory injection: look up "capital of" in hash table, gate decides relevance, inject
- Layers 1-4: refine understanding through attention + feedforward
- Halt gate check: confident? Yes → stop. No → loop Layers 1-4 again (up to 3 times)
- Episodic memory query: search for similar past brain states, retrieve, gate, blend
- Nearest-word search: find closest word vector in vocab ("paris" or "france" depending on context)
- Temperature sampling: randomly pick from top-k candidates
- Output: "paris"
That's the full reasoning pipeline. Four layers of attention, memory lookups at two different points, adaptive pondering, and context-aware gating throughout.
What Makes This Research-Worthy
Big companies don't build models like this. They scale transformers to trillions of parameters and call it done. Engram explores orthogonal ideas:
- Can you separate reasoning from vocabulary? Yes — ChromaDB proves it.
- Can you inject memory between layers instead of compressing it into weights? Yes — N-gram tables prove it.
- Can you adaptively allocate compute based on input difficulty? Yes — pondering proves it.
- Can you learn from surprise instead of treating all data equally? Yes — surprise-weighted gradients prove it.
None of these ideas require massive scale. They work at 900K parameters. That's the point.
Engram isn't trying to beat GPT-4. It's trying to answer: "What if we treated memory and computation as separate, complementary systems instead of mashing everything into one giant neural blob?"
The DeepSeek paper (Jan 2026) validated this approach at scale. Their Engram module (same name, independent discovery) showed that memory injection gives reasoning gains that exceed knowledge gains. Layer 5 with memory matches Layer 12 without it.
That's not incremental improvement. That's architectural rethinking.
So yeah. That's how Engram thinks. Attention to focus on context. Feedforward to compute transformations. Memory to shortcut common patterns. Pondering to think harder when needed. Gating to decide what to trust.
It's a reasoning engine built from interpretable parts. You can trace every step. You can modify every piece. And you can actually understand what's happening under the hood.
That's rare in modern AI. Most models are black boxes. Engram is a glass box.
