Gregory Dickson

What Deep Learning Theory Teaches Us About AI Memory

How rate reduction and lossy compression principles from Berkeley's new textbook could reshape how we build persistent memory for LLMs


The Memory Problem No One Talks About

Every AI coding assistant you use today has the same dirty secret: it forgets everything the moment your session ends. That brilliant debugging session where Claude figured out your codebase architecture? Gone. The context about your team's coding conventions that took 20 messages to establish? Evaporated.

We're building MemoryGraph to solve this problem—a graph-based memory system that gives LLMs persistent, queryable memory across sessions. But as we dove deeper into the architecture, we kept hitting the same fundamental question:

What does it actually mean to "remember" something well?

It's not enough to just store text and retrieve it. Human memory doesn't work that way. We compress experiences into schemas, organize knowledge hierarchically, and somehow retrieve exactly what's relevant from decades of accumulated experience in milliseconds.

Then we discovered a new textbook that changed how we think about this problem entirely.


"Learning Deep Representations of Data Distributions"

In August 2025, Yi Ma's lab at Berkeley released Learning Deep Representations of Data Distributions—an open-source textbook that presents a unified mathematical framework for understanding deep learning through the lens of compression.

Their central thesis is deceptively simple:

"We compress to learn, and we learn to compress."

The book argues that intelligence—whether biological or artificial—fundamentally involves discovering low-dimensional structure in high-dimensional data and transforming that data into compact, structured representations.

This isn't just philosophy. They provide rigorous mathematics showing that popular neural network architectures (ResNets, Transformers, CNNs) can be derived as iterative optimization steps that maximize something called rate reduction—a measure of how well representations compress data while preserving important distinctions.

Reading this, we realized: this framework maps directly onto the memory storage problem.


Rate Reduction: A New Way to Think About Memory Quality

The book introduces a principle called Maximal Coding Rate Reduction (MCR²). Here's the intuition:

Imagine you have a collection of memories from different categories—bug fixes, architectural decisions, API documentation, team preferences. A good memory representation should do two things simultaneously:

  1. Maximize expansion between categories: Memories about bug fixes should live in a completely different "region" of your representation space than memories about team preferences. You want these categories to be as distinguishable as possible.

  2. Maximize compression within categories: All your bug fix memories should cluster tightly together. They share common structure—problem, cause, solution—and your representation should capture that.

Mathematically, this is expressed as:

ΔR = R(all memories) - Σ R(memories in each category)

Where R is the "coding rate"—essentially, how many bits you'd need to encode the data. You want to maximize ΔR: the total coding rate should be high (diverse memories), but the sum of within-category rates should be low (similar memories cluster).

This gives us a concrete metric for memory quality that goes beyond simple retrieval accuracy.
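
To make ΔR concrete, here's a minimal NumPy sketch of the coding-rate computation from the MCR² papers. The size-weighting of each category follows the original formulation (the simplified formula above omits it), and the epsilon distortion parameter is a tuning knob, not something MemoryGraph currently exposes.

import numpy as np

def coding_rate(Z: np.ndarray, eps: float = 0.5) -> float:
    """Coding rate (in nats) of the columns of Z (d x m) at distortion eps."""
    d, m = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (m * eps ** 2)) * (Z @ Z.T))
    return 0.5 * logdet

def rate_reduction(Z: np.ndarray, labels: np.ndarray, eps: float = 0.5) -> float:
    """Delta R: total coding rate minus the size-weighted within-category rates."""
    m = Z.shape[1]
    within = sum(
        (np.sum(labels == c) / m) * coding_rate(Z[:, labels == c], eps)
        for c in np.unique(labels)
    )
    return coding_rate(Z, eps) - within

Here Z would be the d × m matrix of memory embeddings and labels the memory categories; a higher ΔR means categories are better separated while each category clusters more tightly.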


How This Applies to LLM Memory Systems

The Problem with Flat Embeddings

Most vector databases treat all memories the same way: convert the text to a 384- or 768-dimensional embedding, store it, and retrieve by cosine similarity.

But this ignores the structure we know exists in the data. A memory about a "person" is fundamentally different from a memory about a "code pattern." Treating them identically wastes representational capacity and makes retrieval harder.

The Berkeley framework suggests a different approach: type-specific subspaces.

Structured Embedding Spaces

Instead of one flat embedding space, imagine memories organized into learned subspaces:

┌─────────────────────────────────────────────────────────────┐
│                    Embedding Space (384-dim)                │
│                                                             │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│   │   Person     │  │   Project    │  │   Solution   │     │
│   │  Subspace    │  │  Subspace    │  │  Subspace    │     │
│   │   (64-dim)   │  │  (128-dim)   │  │   (96-dim)   │     │
│   │              │  │              │  │              │     │
│   │  • Alice     │  │ • ProjectX   │  │ • Fix#123    │     │
│   │  • Bob       │  │ • MemGraph   │  │ • Fix#456    │     │
│   │  • Carol     │  │ • API-v2     │  │ • Pattern#7  │     │
│   └──────────────┘  └──────────────┘  └──────────────┘     │
│                                                             │
│        ↑ Orthogonal subspaces (maximally separated)         │
└─────────────────────────────────────────────────────────────┘

Each memory type gets projected into its own subspace. These subspaces are learned to be orthogonal—maximizing separation between types. Within each subspace, similar memories cluster together—maximizing compression.

This is rate reduction in action: expand between categories, compress within them.
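
As a toy illustration (not MemoryGraph code), carving the embedding space into disjoint coordinate blocks per type gives exactly orthogonal subspaces. In practice the projections would be learned, as sketched in the Type-Aware Embeddings section below:

import numpy as np

rng = np.random.default_rng(0)
dim = 384

# Disjoint coordinate blocks per type: a crude but exactly orthogonal decomposition
blocks = {"person": slice(0, 64), "project": slice(64, 192), "solution": slice(192, 288)}

def project(embedding: np.ndarray, entity_type: str) -> np.ndarray:
    """Zero out every coordinate outside the type's block."""
    out = np.zeros_like(embedding)
    out[blocks[entity_type]] = embedding[blocks[entity_type]]
    return out

alice = project(rng.normal(size=dim), "person")
project_x = project(rng.normal(size=dim), "project")
print(alice @ project_x)  # 0.0: different types live in orthogonal subspaces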


The Graph as a Compression Mechanism

Here's where things get interesting for MemoryGraph specifically.

The Berkeley book shows that neural network layers can be understood as iterative compression steps. Each layer transforms representations to be more compact and more structured.

We realized: a knowledge graph already does this.

Consider how MemoryGraph stores information:

Raw Input: "Alice fixed the authentication bug in the login 
            service yesterday by adding proper token validation"

Graph Representation:
  (Alice:Person) --[AUTHORED]--> (Fix#892:Solution)
  (Fix#892:Solution) --[RESOLVES]--> (AuthBug:Error)
  (AuthBug:Error) --[AFFECTS]--> (LoginService:Project)
  (Fix#892:Solution) --[INVOLVES]--> (TokenValidation:CodePattern)

This graph representation is a lossy compression of the original text. We've extracted the essential structure—who, what, where, how—and discarded the rest. The entities are compressed representations (cluster centers), and the relationships define how to navigate between them.

In the language of the Berkeley book:

  • Entities = compressed representations of many observations (low-dimensional subspace centers)
  • Relations = transformation operators between subspaces
  • Observations = high-dimensional raw data that gets compressed into entity updates

The graph structure itself encodes the low-dimensional manifold that rate reduction seeks to discover.
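
In code, the compressed form is just typed nodes and edges. Here's an illustrative sketch (simplified types, not MemoryGraph's actual schema) of the example above:

from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    entity_type: str            # "person", "solution", "error", "project", ...
    observations: list[str] = field(default_factory=list)   # raw text backing this entity

@dataclass
class Relation:
    source: str
    relation_type: str          # "AUTHORED", "RESOLVES", "AFFECTS", "INVOLVES"
    target: str

entities = [
    Entity("Alice", "person"),
    Entity("Fix#892", "solution"),
    Entity("AuthBug", "error"),
    Entity("LoginService", "project"),
]
relations = [
    Relation("Alice", "AUTHORED", "Fix#892"),
    Relation("Fix#892", "RESOLVES", "AuthBug"),
    Relation("AuthBug", "AFFECTS", "LoginService"),
    Relation("Fix#892", "INVOLVES", "TokenValidation"),
]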


Compression in Action: MemoryGraph's Inference Engine

This isn't just theory—we've already implemented automatic compression in MemoryGraph's inference engine. When you save a memory, the system automatically discovers and creates new relationships you didn't explicitly define.

Transitive Compression

Consider this scenario. You create two explicit relationships:

"Auth Service" --[DEPENDS_ON]--> "JWT Library"
"JWT Library" --[DEPENDS_ON]--> "Crypto Utils"

The inference engine automatically adds:

"Auth Service" --[DEPENDS_ON]--> "Crypto Utils" (inferred, transitive)

This is path compression—reducing a multi-hop traversal to a single edge. In information-theoretic terms, we're eliminating redundancy in the graph structure. The transitive relationship was always implicitly there; the inference engine makes it explicit and queryable.
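
Conceptually, the rule is one pass of transitive closure over edges of a single relationship type. A self-contained sketch (not the engine's actual implementation):

def infer_transitive(edges: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Add (a, c) whenever (a, b) and (b, c) exist and (a, c) does not."""
    inferred = set()
    for a, b in edges:
        for b2, c in edges:
            if b == b2 and a != c and (a, c) not in edges:
                inferred.add((a, c))
    return inferred

depends_on = {("Auth Service", "JWT Library"), ("JWT Library", "Crypto Utils")}
print(infer_transitive(depends_on))
# {('Auth Service', 'Crypto Utils')}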

Type Inference as Semantic Compression

The engine also performs type inference:

Rule                 What It Does
type_from_solves     If memory SOLVES a problem → type becomes "solution"
type_from_fixes      If memory FIXES an error → type becomes "fix"
type_from_causes     If memory CAUSES a problem → type becomes "problem"

This is semantic compression: instead of storing "this memory has a SOLVES relationship to a problem-type entity," we compress that pattern into a single type label. The type is the compressed representation of the memory's structural role in the graph.
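
In sketch form (the rule names are real, the implementation here is purely illustrative), type inference is a lookup from a memory's outgoing relationships to a type label:

# Illustrative mapping from an outgoing relationship to an inferred type
TYPE_RULES = {
    "SOLVES": "solution",
    "FIXES": "fix",
    "CAUSES": "problem",
}

def infer_type(current_type: str, outgoing_relationships: list[str]) -> str:
    """Compress the memory's structural role in the graph into a single type label."""
    for rel in outgoing_relationships:
        if rel in TYPE_RULES:
            return TYPE_RULES[rel]
    return current_type

print(infer_type("memory", ["SOLVES"]))   # "solution"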

Co-occurrence Affinity

Our cloud tier includes an even more interesting rule: co-occurrence affinity. When two memories share multiple common connections (say, 3+ shared neighbors), the engine infers they're related—even if no one explicitly connected them.

Memory A --[USES]--> React
Memory A --[USES]--> TypeScript  
Memory A --[PART_OF]--> Frontend Module

Memory B --[USES]--> React
Memory B --[USES]--> TypeScript
Memory B --[PART_OF]--> Frontend Module

Inferred: Memory A --[RELATED_TO]--> Memory B (confidence: 0.45)

This is the graph equivalent of finding low-dimensional structure: memories that occupy similar "positions" in the relationship space (share many neighbors) are likely semantically related, even without explicit links.
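
Here's a sketch of the rule. The confidence formula is an assumption for illustration (chosen so that three shared neighbors yields 0.45, matching the example above); the cloud tier's actual scoring may differ:

from itertools import combinations

def co_occurrence_affinity(neighbors: dict[str, set[str]],
                           min_shared: int = 3,
                           per_neighbor_confidence: float = 0.15):
    """Infer RELATED_TO edges between memories that share enough graph neighbors."""
    inferred = []
    for a, b in combinations(neighbors, 2):
        shared = len(neighbors[a] & neighbors[b])
        if shared >= min_shared:
            confidence = round(min(per_neighbor_confidence * shared, 1.0), 2)
            inferred.append((a, "RELATED_TO", b, confidence))
    return inferred

neighbors = {
    "Memory A": {"React", "TypeScript", "Frontend Module"},
    "Memory B": {"React", "TypeScript", "Frontend Module"},
}
print(co_occurrence_affinity(neighbors))
# [('Memory A', 'RELATED_TO', 'Memory B', 0.45)]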

Confidence Scores and Lossy Compression

Every inferred relationship carries a confidence score (0-1). Lower confidence means the inference is more speculative—it's a "lossier" compression of the underlying evidence.

{
  "relationship_type": "DEPENDS_ON",
  "inferred": true,
  "confidence": 0.5,
  "rule": "transitive_depends_on",
  "depth": 2
}

Users can tune this tradeoff: accept more inferred relationships (richer graph, more noise) or fewer (sparser graph, higher precision). This is exactly the rate-distortion tradeoff that information theory describes—you choose how much fidelity to sacrifice for how much compression.

The cleanup API makes this explicit:

POST /inference/cleanup?min_confidence=0.3&max_age_days=30

This removes inferred edges with confidence below 0.3 that are older than 30 days, literally pruning the graph to maintain a target compression quality.


Practical Implementation Ideas

The inference engine demonstrates that compression principles already work in MemoryGraph. Based on the Berkeley framework, we're exploring several enhancements that go deeper:

1. Type-Aware Embeddings

Our planned semantic search (coming in the SDK) would treat all content identically. The Berkeley framework suggests instead projecting embeddings through type-specific learned transformations:

import numpy as np
from sentence_transformers import SentenceTransformer


class LinearProjection:
    """Placeholder for a learned projection; in practice the matrix would be trained."""

    def __init__(self, d_in: int, d_out: int):
        self.W = np.random.randn(d_out, d_in) / np.sqrt(d_in)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return self.W @ x


class TypeAwareEmbedding:
    def __init__(self):
        self.base_encoder = SentenceTransformer("all-MiniLM-L6-v2")

        # Learned projections for each entity type;
        # dimensions chosen based on type complexity
        self.projections = {
            "person": LinearProjection(384, 64),
            "project": LinearProjection(384, 128),
            "solution": LinearProjection(384, 96),
            "error": LinearProjection(384, 48),
            "code_pattern": LinearProjection(384, 64),
        }

    def embed(self, content: str, entity_type: str) -> np.ndarray:
        # Fall back to the raw 384-dim embedding for unknown types
        base = self.base_encoder.encode(content)
        projection = self.projections.get(entity_type)
        return projection(base) if projection else base

The dimensionality of each subspace reflects the inherent complexity of that type. People are relatively simple to characterize; projects have more nuance. This extends our existing type inference—instead of just labeling types, we'd represent them in mathematically distinct subspaces.

2. Progressive Memory Consolidation

The book describes how deep networks progressively compress representations layer by layer. Our inference engine already does single-pass compression. We can extend this to multi-layer consolidation over time:

Layer 0: Raw Observations (high-dimensional, ephemeral)
         "User mentioned they prefer tabs over spaces"
         "User asked about Python formatting"
         "User corrected a spacing issue in code"
              │
              ▼ Compression (after session) [EXISTING: type inference]

Layer 1: Working Memory (mid-dimensional, session-persistent)  
         "User has strong code formatting preferences"
              │
              ▼ Compression (after multiple sessions) [NEW: consolidation]

Layer 2: Consolidated Knowledge (low-dimensional, long-term)
         Entity property: coding_style = "strict_formatting"
              │
              ▼ Compression (over time) [NEW: schema evolution]

Layer 3: Core Identity (minimal, permanent)
         Entity: User with trait "detail_oriented"

This mirrors how human memory consolidation works—episodic memories compress into semantic knowledge over time. Our existing similar_tags_affinity rule hints at this: memories that share structure get linked. The next step is actually merging them.
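
Here's a sketch of what one consolidation pass might look like. The types and the summarize() helper are hypothetical; in practice the summary would come from an LLM call or a template:

from dataclasses import dataclass

@dataclass
class Observation:
    entity: str
    text: str
    session_id: str

def summarize(texts: list[str]) -> str:
    # Placeholder: in a real system this would be an LLM call or template-based merge
    return f"Consolidated from {len(texts)} observations: " + "; ".join(texts)

def consolidate(observations: list[Observation], min_count: int = 3) -> dict[str, str]:
    """Compress repeated observations about the same entity into one summary property."""
    by_entity: dict[str, list[str]] = {}
    for obs in observations:
        by_entity.setdefault(obs.entity, []).append(obs.text)

    return {
        entity: summarize(texts)
        for entity, texts in by_entity.items()
        if len(texts) >= min_count
    }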

3. Expansion-Compression Retrieval

Our planned hybrid search (ADR-005) optimizes for similarity. The rate reduction framework suggests we should also optimize for distinctiveness:

import numpy as np
from typing import List

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieval_score(query: str, memory: Memory,
                    other_retrieved: List[Memory],
                    alpha: float = 0.7) -> float:
    # Compression term: how relevant is this memory to the query?
    # embed() is the system's embedding function (e.g. a sentence encoder)
    similarity = cosine_similarity(embed(query), embed(memory.content))

    # Expansion term: how distinct is this memory from the other results?
    # This prevents returning five near-duplicate memories
    if other_retrieved:
        distinctiveness = 1.0 - float(np.mean([
            cosine_similarity(embed(memory.content), embed(other.content))
            for other in other_retrieved
        ]))
    else:
        distinctiveness = 1.0

    # Balance both objectives (like rate reduction's R - R_c)
    return alpha * similarity + (1 - alpha) * distinctiveness

This directly implements the MCR² intuition: the compression term keeps each result tightly coupled to the query, while the expansion term keeps the result set as a whole diverse (high total coding rate). It prevents retrieval from returning redundant results, a common failure mode of pure similarity search.
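
To actually use this score, selection has to be greedy, because each pick changes the distinctiveness of the remaining candidates. A small usage sketch, assuming the same Memory type as above:

def select_memories(query: str, candidates: list, k: int = 5) -> list:
    """Greedily pick k results, rescoring candidates against what's already selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda m: retrieval_score(query, m, selected))
        selected.append(best)
        remaining.remove(best)
    return selected

This is the same shape as maximal marginal relevance, but framed in rate-reduction terms.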

4. Graph-Guided Manifold Navigation

Our inference engine already exploits graph structure for discovery. We can extend this to retrieval:

async def graph_aware_search(query: str, depth: int = 2):
    # 1. Find entry points via embedding similarity
    seeds = await semantic_search(query, limit=5)

    # 2. Expand along graph edges (follow the manifold)
    #    using our existing relationship structure
    expanded = set(seeds)
    frontier = set(seeds)

    for _ in range(depth):
        next_frontier = set()
        for entity in frontier:
            # Relations define valid transitions on the manifold;
            # inferred edges from our engine help here!
            neighbors = await get_related_entities(entity)
            next_frontier.update(n for n in neighbors if n not in expanded)
        expanded.update(next_frontier)
        frontier = next_frontier

    # 3. Re-rank by combined graph + semantic relevance
    return rank_by_rate_reduction_score(expanded, query)

The graph provides a strong inductive bias about which memories are likely relevant together. Transitive inferred edges (from our inference engine) act as "shortcuts" on the manifold—they represent compressed paths through the relationship space.

5. Inference Rules as Compression Operators

Looking at our inference rules through the Berkeley lens reveals their true nature:

Rule                     Compression Type         Information-Theoretic Interpretation
transitive_depends_on    Path compression         Removes redundant traversals
type_from_solves         Semantic compression     Encodes role in a single label
co_occurrence_affinity   Structural compression   Finds shared subspace membership
reverse_solves           Bidirectional encoding   Enables queries from either direction

This suggests new rules we could add:

# Cluster compression: memories with 5+ shared tags → merge into summary entity
"high_overlap_merge": {
    "condition": "shared_tags >= 5 AND same_project",
    "action": "create_summary_entity",
    "compression_type": "lossy"
}

# Temporal compression: daily memories → weekly summary
"temporal_consolidation": {
    "condition": "age > 7 days AND same_topic",
    "action": "compress_to_summary",
    "preserve": ["key_decisions", "blockers", "outcomes"]
}

What's Next

The inference engine proves that compression principles work for AI memory. We're now exploring how to go deeper:

Already Built:

  • Transitive inference (path compression)
  • Type inference (semantic compression)
  • Co-occurrence affinity (structural compression)
  • Confidence-based cleanup (rate-distortion tradeoff)

Actively Researching:

Type-aware embeddings: How do we train type-specific projections without massive labeled datasets? Self-supervised approaches using the graph structure itself look promising—the relationships encode supervision signal.

Consolidation triggers: When should observations compress into entity updates? Too aggressive and we lose detail; too conservative and memory bloats. We're exploring information-theoretic triggers: consolidate when adding a new observation wouldn't increase the coding rate significantly.
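
As a sketch of what such a trigger could look like (reusing the coding_rate helper sketched earlier; the threshold is a hypothetical tuning knob):

import numpy as np

def should_consolidate(entity_embeddings: np.ndarray,
                       new_observation: np.ndarray,
                       threshold: float = 0.05) -> bool:
    """Consolidate when a new observation barely increases the entity's coding rate,
    i.e. it is largely redundant with what is already stored."""
    # entity_embeddings: d x m matrix of embeddings already attached to the entity
    # new_observation:   d-dimensional embedding of the incoming observation
    before = coding_rate(entity_embeddings)
    after = coding_rate(np.column_stack([entity_embeddings, new_observation]))
    return (after - before) < threshold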

Cross-type retrieval: How do we handle queries spanning multiple types? "Find solutions that Alice worked on for projects using Python" crosses Person, Solution, and Project subspaces. The graph edges provide a natural answer—they're the transformation operators between subspaces.

Rate reduction metrics: Can we use MCR² as an actual quality metric during development? This would let us evaluate memory architectures in a principled way, not just via retrieval benchmarks. Early experiments suggest the metric correlates well with subjective "memory usefulness."


The Bigger Picture

The Berkeley book's framework suggests something profound: compression isn't just a storage optimization—it's the fundamental operation of learning itself.

Every time you explain a complex system by its key components, you're doing rate reduction. Every time you recognize a pattern across multiple experiences, you're finding low-dimensional structure. Every time you organize knowledge hierarchically, you're building a manifold.

For AI memory systems, this means we shouldn't think of memory as a retrieval problem with storage attached. Memory is a compression problem with retrieval as a side effect.

Get the compression right—find the true low-dimensional structure in the experiences—and retrieval becomes almost trivial. The memories that matter will naturally cluster together, and the right memory for a given context will be the one that reduces uncertainty the most.

MemoryGraph's inference engine is our first step down this path. Transitive compression, type inference, and co-occurrence affinity are all compression operators—they find structure and make it explicit. The next steps—type-aware embeddings, progressive consolidation, and rate-reduction-guided retrieval—push further toward a system that truly learns from experiences rather than just storing them.

The theoretical grounding matters because it tells us why these approaches work and what to try next. When an inference rule creates a transitive edge, that's not just a database optimization: it's the system discovering that a multi-hop path can be compressed into a single hop without losing essential information. When type inference labels a memory as "solution," it's compressing the memory's structural role into a single semantic label.

That's what we're building toward with MemoryGraph. Not just a database that stores what AI assistants experience, but a system that truly learns from those experiences—compressing them into structured knowledge that makes every future interaction more intelligent.


MemoryGraph is an open-source graph-based memory system for AI assistants. Check out the project at github.com/gregorydickson/memory-graph or try the cloud platform at memorygraph.dev.

The Berkeley textbook "Learning Deep Representations of Data Distributions" is freely available at ma-lab-berkeley.github.io/deep-representation-learning-book.


References

  1. Buchanan, S., Pai, D., Wang, P., & Ma, Y. (2025). Learning Deep Representations of Data Distributions. Online textbook.

  2. Chan, K.H.R., Yu, Y., You, C., Qi, H., Wright, J., & Ma, Y. (2022). ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction. Journal of Machine Learning Research.

  3. Yu, Y., Buchanan, S., Pai, D., et al. (2024). White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is? Journal of Machine Learning Research.


Tags: #AI #Memory #LLM #MachineLearning #KnowledgeGraphs #Compression #Research
