Beyond Brute Force: How MIRROR and Engram Teach AI to Truly Think and Remember
The Frustrating Flaw: Why LLMs Forget and Fail to Reflect
Experiencing a chat with an AI like ChatGPT or Claude can be vexing. You might share a crucial piece of information early on, only for the model to seemingly lose track of it a few messages later, amidst unrelated queries. Even more troubling, it might adapt its responses to your emotional tone, prioritizing agreement over factual accuracy.
This isn't a mere glitch; it stems from a fundamental constraint inherent in today's large language model designs.
While modern LLMs demonstrate remarkable statistical prowess in text generation, they grapple with three significant shortcomings:
- Absence of cohesive working memory: Each interaction is processed in isolation, lacking a continuous internal state.
- Lack of self-reflection: Responses are produced in a singular, unexamined pass, missing an internal dialogue to ensure consistency.
- Inefficient handling of static knowledge: Instead of storing and recalling established facts, they repeatedly recompute them.
Consider a software engineer who consistently forgets variable declarations after just a few lines of code, or one who must consult the entire React documentation every time they implement useState(). This mirrors the behavior of contemporary LLMs.
However, a new era is dawning with two groundbreaking architectural approaches: MIRROR and Engram.
These aren't merely performance enhancements; they fundamentally reshape our understanding of AI's capacity to "think" and "retain information."
MIRROR: Cultivating AI's Inner Voice
The Challenge: AI's Lack of Internal State
Unlike an LLM's single forward pass, human cognition isn't a linear, one-shot process. When faced with a complex query, we typically:
- Ponder (exploring various mental pathways)
- Consolidate (shaping thoughts into a unified internal model)
- Articulate (crafting a precise answer)
Traditional LLMs, however, bypass these crucial preliminary stages, jumping directly to step three. This absence of internal reflection often leads to:
- Agreement bias: They tend to concur with user input, potentially overlooking accuracy or safety protocols.
- Contextual amnesia: Key details introduced earlier in a dialogue are frequently overlooked.
- Conflicting priorities: Difficulty in reconciling opposing requirements, such as user safety versus explicit instructions.
The MIRROR (Modular Internal Reasoning, Reflection, Orchestration, and Response) architecture directly addresses these limitations.
Architecture: Decoupling Cognition from Communication
MIRROR operates through a dual-layered framework:
1. The Thinker: AI's Internal State
The Thinker module sustains an evolving internal narrative — functioning as an adaptive mental model across an entire conversation. It comprises two distinct components:
a) The Inner Monologue Manager
This module coordinates three concurrent lines of reasoning:
- User Intent: What are the user's underlying objectives and ultimate aims?
- Logical Progression: What inferences can be drawn, and what intellectual frameworks are becoming apparent?
- Retained Information: Which essential facts have been presented, and what preferences remain consistent?
b) The Cognitive Controller
This component integrates the three aforementioned threads into a cohesive narrative, which functions as the system's working memory. This narrative is dynamically updated with every conversational turn, forming the foundation for subsequent responses.
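As a rough illustration of this two-part structure, here is a minimal Python sketch (not the paper's code): `llm` stands for any prompt-to-text callable, and the prompts and thread names are placeholders of mine.

```python
from dataclasses import dataclass, field

@dataclass
class InnerMonologue:
    # Three parallel reasoning threads (names follow the list above).
    user_intent: list = field(default_factory=list)
    logical_progression: list = field(default_factory=list)
    retained_facts: list = field(default_factory=list)

@dataclass
class Thinker:
    monologue: InnerMonologue = field(default_factory=InnerMonologue)
    narrative: str = ""   # the consolidated working memory

    def reflect(self, user_message: str, llm) -> None:
        # a) Inner Monologue Manager: advance each thread with its own prompt.
        threads = {"user intent": self.monologue.user_intent,
                   "logical progression": self.monologue.logical_progression,
                   "retained facts": self.monologue.retained_facts}
        for name, thread in threads.items():
            thread.append(llm(
                f"Narrative so far:\n{self.narrative}\n"
                f"New message:\n{user_message}\n"
                f"Update the '{name}' thread."))
        # b) Cognitive Controller: merge the latest thread updates into one narrative.
        self.narrative = llm(
            "Merge these reasoning threads into a single coherent narrative:\n"
            + "\n".join(t[-1] for t in threads.values()))
```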
2. The Talker: AI's External Expression
Leveraging this internal narrative, the Talker module formulates articulate and contextually relevant responses, effectively mirroring the system's prevailing "state of awareness."
A key feature is temporal decoupling: during live operation, the Thinker can continue its reflective processes in the background, independently of the Talker, which provides immediate replies. This design allows for extensive, deep reflection without compromising response speed.
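To make the decoupling concrete, here is a hedged asyncio sketch (again, not the authors' implementation): it assumes a `thinker` object with a `narrative` attribute and a synchronous `reflect(message, llm)` method, like the sketch above, plus any `talker_llm` prompt-to-text callable.

```python
import asyncio

async def handle_turn(user_message: str, thinker, talker_llm):
    """The Talker replies immediately from the *current* narrative, while the
    Thinker's slower reflection runs in a background thread. The caller awaits
    the returned task before the next turn so the narrative stays up to date."""
    reply = talker_llm(
        f"Internal narrative:\n{thinker.narrative}\n\nUser: {user_message}\nAssistant:"
    )
    reflection_task = asyncio.create_task(
        asyncio.to_thread(thinker.reflect, user_message, talker_llm)
    )
    return reply, reflection_task
```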
Remarkable Performance: Up to 156% Improvement in Critical Scenarios
Testing MIRROR involved the CuRaTe benchmark, specifically crafted for multi-turn conversations featuring stringent safety protocols and conflicting user preferences.
| Metric | Baseline | With MIRROR | Relative improvement |
|---|---|---|---|
| Average success rate | 69% | 84% | +21% |
| Maximum performance (Llama 4 Scout) | - | 91% | - |
| Critical scenario (3 people) | - | - | +156% |
The advantages of MIRROR are not confined to a single model; it demonstrably enhances performance across a range of leading LLMs, including GPT-4o, Claude 3.7 Sonnet, Gemini 1.5 Pro, Llama 4, and Mistral 3.
What drives such dramatic gains? MIRROR achieves this by converting vast conversational histories into practical understanding through a distinct three-phase pipeline:
- Exploration across dimensions (through multiple thought threads)
- Synthesis into a cohesive mental framework (the internal narrative)
- Application within context (to formulate a response)
This process mirrors how an experienced software developer tackles a complex bug: they don't rush to a solution but rather engage in deep thought and analysis beforehand.
Engram: Elevating Memory Over Raw Computation
The Issue: Redundant Calculation of Known Information
Picture a programmer needing to review Python's entire documentation simply to use the print() function. This scenario sounds preposterous, doesn't it?
However, this parallels the behavior of contemporary Transformer models. For instance, to recognize an entity such as "Diana, Princess of Wales," an LLM typically has to:
- Process tokens through numerous attention layers.
- Incrementally gather contextual attributes.
- Effectively "recompute" information that ideally should be a straightforward memory retrieval.
It’s comparable to your brain having to derive 2+2=4 anew each time, instead of instantly recalling the answer.
The Engram architecture addresses this inefficiency by integrating a conditional memory system—an O(1) constant-time lookup mechanism for static data.
Architecture: Achieving O(1) Knowledge Retrieval with Hashed N-grams
Engram innovates upon the traditional N-gram embedding method, yielding a highly scalable memory module.
1. Efficient Sparse Retrieval
a) Tokenizer Compression
Raw token identifiers are mapped to canonical IDs through textual normalization (NFKC, lowercase). This process slashes the effective vocabulary size by roughly 23% for a 128k tokenizer, thereby boosting semantic density.
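A toy sketch of the normalization step (my illustration, not Engram's actual code): case and fullwidth variants of the same string collapse onto one canonical ID.

```python
import unicodedata

def canonical_id(token_text: str, canon_table: dict) -> int:
    """Map a token's surface form to a canonical ID via NFKC + lowercasing."""
    key = unicodedata.normalize("NFKC", token_text).lower()
    return canon_table.setdefault(key, len(canon_table))

canon_table = {}
ids = [canonical_id(t, canon_table) for t in ["Hello", "hello", "ＨＥＬＬＯ"]]
print(ids)  # [0, 0, 0] -- three surface forms, one canonical ID
```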
b) Multi-Head Hashing
For every N-gram (a sequence of N tokens), the system employs K unique hash functions. Each hashing "head" then links the local context to an index within an embedding table. This strategy minimizes potential collisions and facilitates the rapid retrieval of a memory vector.
The outcome is an AI capable of performing knowledge lookups in constant time, rather than laboriously recomputing information across numerous Transformer layers.
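Here is a hedged sketch of the lookup itself: table sizes are tiny for illustration, Python's built-in `hash` stands in for the stable hash functions a real system would use, and averaging the K heads is an assumption of mine (concatenation is equally plausible).

```python
import numpy as np

K_HEADS, TABLE_SIZE, DIM = 4, 2**14, 64          # toy sizes for illustration
rng = np.random.default_rng(0)
tables = rng.standard_normal((K_HEADS, TABLE_SIZE, DIM)).astype(np.float32)

def ngram_memory(canonical_ids: tuple) -> np.ndarray:
    """O(1) retrieval: hash the n-gram once per head, gather one embedding row
    per head, and combine the rows into a single memory vector e_t."""
    rows = []
    for head in range(K_HEADS):
        index = hash((head, canonical_ids)) % TABLE_SIZE   # one hash per head
        rows.append(tables[head, index])
    return np.mean(rows, axis=0)

e_t = ngram_memory((101, 7, 4242))   # memory vector for a 3-gram of canonical IDs
```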
2. Intelligent Context-Aware Gating
The retrieved memory vector (e_t) represents static information that might inherently include noise. To integrate this data intelligently, Engram utilizes an attention-based gating mechanism:
- The Transformer's current hidden state (h_t) functions as the Query.
- The external memory (e_t) provides the Key and Value components.
- A scalar gate (α_t) is calculated to adjust the memory's influence.
Should the retrieved memory conflict with the dynamic contextual information, the gate scales down (α_t → 0), effectively filtering out potential noise.
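A minimal NumPy sketch of such a gate (my illustration under assumed shapes; in Engram the projections and gate are learned inside the Transformer layer):

```python
import numpy as np

def gated_memory_injection(h_t, e_t, W_q, W_k, W_v):
    """Scalar-gated fusion: the hidden state h_t queries the retrieved memory e_t,
    and a sigmoid gate alpha_t in (0, 1) decides how much of it to inject.
    If memory and context disagree, a small alpha_t filters the memory out."""
    d = W_q.shape[1]
    q, k, v = h_t @ W_q, e_t @ W_k, e_t @ W_v
    alpha_t = 1.0 / (1.0 + np.exp(-(q @ k) / np.sqrt(d)))   # scalar gate
    return h_t + alpha_t * v                                 # residual-style update

rng = np.random.default_rng(0)
h_t, e_t = rng.standard_normal(64), rng.standard_normal(64)    # toy dimensions
W_q, W_k = rng.standard_normal((64, 32)) * 0.1, rng.standard_normal((64, 32)) * 0.1
W_v = rng.standard_normal((64, 64)) * 0.1
h_updated = gated_memory_injection(h_t, e_t, W_q, W_k, W_v)
```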
The U-Shaped Scaling Law: Forging the Compute-Memory Partnership
Engram transcends being a mere component; it introduces a novel dimension of sparsity, complementing the existing Mixture-of-Experts (MoE) paradigm.
Research has uncovered a U-shaped correlation when distributing sparsity parameters between computational resources (MoE experts) and memory (Engram):
- Excessive computation, insufficient memory → Leads to inefficiency (due to perpetual re-calculation).
- Abundant memory, inadequate computation → Results in performance stagnation.
- The Sweet Spot (20-25% memory allocation) → Consistently surpasses the capabilities of purely MoE models.
This represents a pivotal discovery: the trajectory of AI advancement lies not simply in larger models, but in more intelligently designed hybrid systems.
Performance: Excelling in Reasoning, Not Just Recall
The Engram-27B and Engram-40B models were built by reallocating parameters from a standard MoE baseline to the memory module, then evaluated against that baseline.
| Benchmark | Category | Gain (points, Engram vs MoE) |
|---|---|---|
| BBH | Complex Reasoning | +5.0 |
| CMMLU | Cultural Knowledge | +4.0 |
| ARC-Challenge | Scientific Reasoning | +3.7 |
| MMLU | General Knowledge | +3.4 |
| HumanEval | Code Generation | +3.0 |
| MATH | Mathematical Reasoning | +2.4 |
Intriguingly, the most significant performance boosts aren't observed in rote memorization tasks, but rather in areas like complex reasoning, code generation, and mathematics.
The reason? Engram liberates the initial layers of the model from the burden of reconstructing static information patterns. This effectively amplifies the network's "depth," dedicating more capacity to abstract reasoning.
Consider it akin to an optimized operating system handling memory management, thereby freeing your CPU to focus on more intricate calculations.
System Efficiency: Offloading Memory to RAM or NVMe
Engram's retrieval index operates deterministically; its functionality relies exclusively on the input token sequence, rather than dynamic runtime hidden states (a contrast to MoE routing).
This distinct characteristic enables the asynchronous prefetching of required embeddings from:
- System RAM
- NVMe drives through the PCIe bus
Such an approach effectively conceals communication delays and permits the expansion of the model's memory to encompass hundreds of billions of parameters with minimal performance impact (under 3%), circumventing the common limitations of GPU VRAM.
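The sketch below illustrates the idea under assumptions of mine (the file name, sizes, chunking, and helper functions are placeholders): because the indices for chunk i+1 are known from its tokens alone, a host thread can fetch those rows while the forward pass for chunk i is still running.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

DIM, ROWS = 64, 100_000                     # toy sizes
# Embedding table backed by a file (stand-in for host RAM / NVMe offload).
table = np.memmap("engram_table.dat", dtype=np.float32, mode="w+", shape=(ROWS, DIM))
executor = ThreadPoolExecutor(max_workers=1)

def fetch_rows(indices):
    return np.array(table[np.asarray(indices)])   # gather from the offloaded table

def run_with_prefetch(token_chunks, indices_fn, forward_fn):
    """Overlap memory fetches with compute: prefetch chunk i+1's rows while the
    forward pass for chunk i runs, hiding the RAM/NVMe latency."""
    pending = executor.submit(fetch_rows, indices_fn(token_chunks[0]))
    for i, chunk in enumerate(token_chunks):
        memory_vectors = pending.result()          # usually already finished
        if i + 1 < len(token_chunks):
            pending = executor.submit(fetch_rows, indices_fn(token_chunks[i + 1]))
        forward_fn(chunk, memory_vectors)          # compute overlaps the next fetch

run_with_prefetch(
    token_chunks=[list(range(8)), list(range(8, 16))],
    indices_fn=lambda toks: [hash(tuple(toks[i:i + 2])) % ROWS for i in range(len(toks) - 1)],
    forward_fn=lambda chunk, mem: None,            # stand-in for the Transformer pass
)
```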
Envision the ability to upgrade your LLM's memory akin to installing more RAM in your personal computer, all without requiring extra GPUs. This is precisely the capability Engram brings to the table.
ENGRAM-R: Streamlining Reasoning with "Fact Cards"
Beyond architectural integration, modular memory principles are applied at the system level to manage long conversations and optimize large reasoning models (LRMs).
The ENGRAM System: Cognition-Inspired Typed Memory
Drawing inspiration from cognitive psychology, this system categorizes conversational memory into three separate stores:
- Episodic Memory: Stores unique events and interactions, complete with their temporal context (e.g., "The user relocated to Seattle last year").
- Semantic Memory: Holds general facts, observations, and consistent preferences (e.g., "The user's preferred color is green").
- Procedural Memory: Contains instructions and operational knowledge (e.g., "The tax submission deadline is April 15th").
With every turn in a dialogue, information is directed to its appropriate memory store(s). When a query arises, a dense similarity search pinpoints and retrieves the most pertinent context.
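As a toy illustration of the typed stores (my sketch, with word-overlap scoring standing in for the dense similarity search the system actually uses):

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    mem_id: str      # e.g. "E1", "S2", "P3"
    text: str
    turn: int

class TypedMemory:
    """Three stores inspired by cognitive psychology, matching the list above."""
    def __init__(self):
        self.stores = {"episodic": [], "semantic": [], "procedural": []}
        self.counter = 0

    def write(self, kind: str, text: str, turn: int) -> MemoryItem:
        self.counter += 1
        item = MemoryItem(f"{kind[0].upper()}{self.counter}", text, turn)
        self.stores[kind].append(item)
        return item

    def retrieve(self, query: str, top_k: int = 3):
        # Placeholder scoring by word overlap; a real system would rank
        # candidates by dense embedding similarity instead.
        q = set(query.lower().split())
        items = [it for store in self.stores.values() for it in store]
        return sorted(items,
                      key=lambda it: len(q & set(it.text.lower().split())),
                      reverse=True)[:top_k]

mem = TypedMemory()
mem.write("episodic", "A moved to Seattle", turn=1)
mem.write("semantic", "Favorite color: green", turn=5)
mem.write("procedural", "Tax deadline: April 15", turn=12)
print(mem.retrieve("Where does A live?"))
```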
ENGRAM-R: Leveraging "Fact Cards" for Efficient Thought
ENGRAM-R integrates two primary mechanisms designed to substantially lower the computational overhead associated with reasoning:
1. Fact Card Generation
Instead of embedding lengthy conversational snippets directly into the context, retrieved data is condensed into concise, verifiable "fact cards":
[E1, A moved to Seattle, Turn 1]
[S2, Favorite color: green, Turn 5]
[P3, Tax deadline: April 15, Turn 12]
2. Direct Citation
The Large Reasoning Model (LRM) receives explicit instructions to treat these cards as authoritative sources and to reference them directly within its reasoning process:
“To answer Q1, E1 shows that A lives in Seattle. Answer: Seattle. Cite [E1].”
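Putting the two mechanisms together, a hedged sketch of the card-building and prompting step (the card format follows the examples above; the prompt wording is mine):

```python
# Retrieved memory items as (id, text, turn) tuples; IDs follow the card format above.
retrieved = [("E1", "A moved to Seattle", 1),
             ("S2", "Favorite color: green", 5),
             ("P3", "Tax deadline: April 15", 12)]

def build_fact_cards(items):
    """Condense retrieved memory entries into compact, citable fact cards."""
    return [f"[{mem_id}, {text}, Turn {turn}]" for mem_id, text, turn in items]

def build_reasoning_prompt(question, cards):
    """Instruct the reasoning model to treat cards as authoritative and cite them."""
    return ("Treat the fact cards below as authoritative evidence. "
            "Cite card IDs (e.g. [E1]) for every claim you rely on.\n\n"
            + "\n".join(cards)
            + f"\n\nQuestion: {question}\nAnswer with citations:")

prompt = build_reasoning_prompt("Where does A live?", build_fact_cards(retrieved))
# This short prompt replaces the full multi-thousand-token conversation history.
```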
Efficiency Boosts: 89% Token Reduction, 2.5-Point Accuracy Gain
Assessments on extensive conversational benchmarks (LoCoMo, with 16k tokens, and LongMemEval, with 115k tokens) demonstrated:
| Metric | Full-Context | ENGRAM-R | Change |
|---|---|---|---|
| Input Tokens (LoCoMo) | 28,371,703 | 3,293,478 | ≈ -89% |
| Reasoning Tokens | 1,335,988 | 378,424 | ≈ -72% |
| Accuracy (Multi-hop) | 72.0% | 74.5% | +2.5 pts |
| Accuracy (Temporal) | 67.3% | 69.2% | +1.9 pts |
This approach, which converts conversational history into a concise, citable evidence repository, facilitates:
- Substantial reductions in computational expenditure.
- Preservation, and often enhancement, of accuracy.
- The establishment of verifiable and auditable reasoning pathways.
This mirrors the practice of an experienced developer: rather than re-reading an entire codebase constantly, they maintain a distilled mental model of key components.
The Cognitive Leap: AI That Thinks and Remembers
An Architectural Metamorphosis
MIRROR and Engram represent more than minor enhancements; they herald a fundamental paradigm shift in AI architecture:
Transitioning from: monolithic models that re-evaluate every piece of information in each pass.
To: hybrid compute-memory systems capable of genuine thought, recall, and reasoning.
This profound evolution draws direct inspiration from the field of cognitive science, incorporating elements such as:
- Working memory (embodied by MIRROR’s Cognitive Controller).
- Categorized long-term memory (comprising episodic, semantic, and procedural forms).
- Data compression (via Fact Cards).
- Internal dialogue (enabled by parallel reasoning threads).
Furthermore, systems such as XMem and Memoria are already demonstrating the ability to replicate human psychological phenomena, including primacy, recency, and temporal contiguity effects.
RAG vs. Full-Context: A Nuanced Discussion
The Convomem benchmark highlighted a critical insight: for initial conversations, up to about 150 exchanges, a full-context method—where the entire conversation history is provided—consistently surpasses even advanced RAG systems in accuracy (achieving 70-82% compared to 30-45%).
This implies that conversational memory thrives on a "small corpus advantage," where a comprehensive search of the entire history is not only feasible but also yields superior results. Consequently, simply applying generic RAG solutions may not always be the most effective strategy.
The path forward will likely involve a hybrid approach (sketched in code after this list):
- Full context for brief interactions.
- Typed memory paired with Fact Cards for extended dialogues.
- O(1) retrieval for fixed, static knowledge.
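In code, such a router could be as simple as the sketch below (the ~150-turn threshold comes from the Convomem observation above; the strategy names and the exact cutoff are assumptions that would need tuning per application):

```python
def choose_memory_strategy(num_turns: int, is_static_knowledge_query: bool) -> str:
    """Illustrative routing between the three memory strategies listed above."""
    if is_static_knowledge_query:
        return "engram_o1_lookup"          # fixed facts -> hashed-memory retrieval
    if num_turns <= 150:
        return "full_context"              # small-corpus advantage: send the whole history
    return "typed_memory_with_fact_cards"  # long dialogues -> ENGRAM-R-style retrieval
```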
Reshaping Expectations for Developers and Innovators
For professionals in development and creative fields, these architectural breakthroughs fundamentally alter our expectations of what an LLM can achieve:
Today, we might say: "ChatGPT acts as an assistant that occasionally loses track or provides inconsistent information."
Tomorrow's reality: "My AI agent will uphold a consistent mental model of my project across weeks or even months."
Consider the possibilities:
- A coding assistant that consistently adheres to your specific conventions and architectural choices over extended periods.
- An e-commerce helper that retains a detailed grasp of your unique business limitations and client needs.
- A customer support solution that never redundantly requests previously provided information.
These aren't merely performance improvements; they unlock AI's true potential for tackling intricate, long-duration assignments.
Conclusion: Ushering in the Era of Truly Cognitive AI
For a long time, the primary strategy for enhancing LLMs involved scaling them up: increasing parameters, expanding datasets, and boosting computational power.
However, MIRROR and Engram reveal an alternative direction: fostering smarter AI, rather than simply larger systems.
By endowing these models with internal reflective capabilities, effective working memory, and rapid knowledge retrieval, we're doing more than just boosting performance. We're forging systems capable of genuine thought and memory.
The pertinent question is shifting from "What model size is sufficient?" to "Which cognitive architecture offers the optimal solution?".
What about your own ventures? How do you foresee integrating these advanced architectures? Perhaps an assistant that maintains a consistent memory of your entire codebase? Or a support system that deeply comprehends user needs over time? An agent that meticulously reasons before taking action?
The future of artificial intelligence will increasingly be defined not by the sheer number of parameters, but by the sophistication of its internal reflection and cognitive depth.
If you found this exploration into advanced AI architectures insightful, I encourage you to dive deeper into the world of AI with Nicolas Dabène!
- Subscribe to his YouTube channel for more cutting-edge content: https://www.youtube.com/@ndabene06?utm_source=devTo&utm_medium=social&utm_campaign=MIRROR%20and%20Engram:%20How%20AI%20Learns%20to%20Think%20and%20Remember
- Connect with Nicolas on LinkedIn to stay updated on his latest insights and projects: https://fr.linkedin.com/in/nicolas-dab%C3%A8ne-473a43b8?utm_source=devTo&utm_medium=social&utm_campaign=MIRROR%20and%20Engram:%20How%20AI%20Learns%20to%20Think%20and%20Remember
Let's build the future of truly cognitive AI together!