How rate reduction and lossy compression principles from Berkeley's new textbook could reshape how we build persistent memory for LLMs
The Memory Problem No One Talks About
Every AI coding assistant you use today has the same dirty secret: it forgets everything the moment your session ends. That brilliant debugging session where Claude figured out your codebase architecture? Gone. The context about your team's coding conventions that took 20 messages to establish? Evaporated.
We're building MemoryGraph to solve this problem—a graph-based memory system that gives LLMs persistent, queryable memory across sessions. But as we dove deeper into the architecture, we kept hitting the same fundamental question:
What does it actually mean to "remember" something well?
It's not enough to just store text and retrieve it. Human memory doesn't work that way. We compress experiences into schemas, organize knowledge hierarchically, and somehow retrieve exactly what's relevant from decades of accumulated experience in milliseconds.
Then we discovered a new textbook that changed how we think about this problem entirely.
"Learning Deep Representations of Data Distributions"
In August 2025, Yi Ma's lab at Berkeley released Learning Deep Representations of Data Distributions—an open-source textbook that presents a unified mathematical framework for understanding deep learning through the lens of compression.
Their central thesis is deceptively simple:
"We compress to learn, and we learn to compress."
The book argues that intelligence—whether biological or artificial—fundamentally involves discovering low-dimensional structure in high-dimensional data and transforming that data into compact, structured representations.
This isn't just philosophy. They provide rigorous mathematics showing that popular neural network architectures (ResNets, Transformers, CNNs) can be derived as iterative optimization steps that maximize something called rate reduction—a measure of how well representations compress data while preserving important distinctions.
Reading this, we realized: this framework maps directly onto the memory storage problem.
Rate Reduction: A New Way to Think About Memory Quality
The book introduces a principle called Maximal Coding Rate Reduction (MCR²). Here's the intuition:
Imagine you have a collection of memories from different categories—bug fixes, architectural decisions, API documentation, team preferences. A good memory representation should do two things simultaneously:
Maximize expansion between categories: Memories about bug fixes should live in a completely different "region" of your representation space than memories about team preferences. You want these categories to be as distinguishable as possible.
Maximize compression within categories: All your bug fix memories should cluster tightly together. They share common structure—problem, cause, solution—and your representation should capture that.
Mathematically, this is expressed as:
ΔR = R(all memories) - Σ R(memories in each category)
Where R is the "coding rate"—essentially, how many bits you'd need to encode the data. You want to maximize ΔR: the total coding rate should be high (diverse memories), but the sum of within-category rates should be low (similar memories cluster).
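To make this concrete, here is a minimal numpy sketch of the rate reduction objective as defined in the MCR² papers (note that the papers weight each category's rate by its share of the data, which the intuition-level formula above omits). The embeddings, labels, and the precision parameter eps are placeholder values, not anything measured from MemoryGraph.

    import numpy as np

    def coding_rate(Z: np.ndarray, eps: float = 0.5) -> float:
        """Approximate bits needed to encode embeddings Z (n x d) up to precision eps."""
        n, d = Z.shape
        # R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z^T Z)
        return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z.T @ Z)[1]

    def rate_reduction(Z: np.ndarray, labels: np.ndarray, eps: float = 0.5) -> float:
        """Delta-R: total coding rate minus the size-weighted within-category rates."""
        n = Z.shape[0]
        total = coding_rate(Z, eps)
        within = sum(
            (np.sum(labels == c) / n) * coding_rate(Z[labels == c], eps)
            for c in np.unique(labels)
        )
        return total - within

    # Toy example: six memory embeddings in two categories (e.g. bug fixes vs. preferences)
    Z = np.random.randn(6, 384)
    labels = np.array([0, 0, 0, 1, 1, 1])
    print(rate_reduction(Z, labels))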
This gives us a concrete metric for memory quality that goes beyond simple retrieval accuracy.
How This Applies to LLM Memory Systems
The Problem with Flat Embeddings
Most vector databases treat all memories the same way: convert the text to a 384- or 768-dimensional embedding, store it, and retrieve by cosine similarity.
But this ignores the structure we know exists in the data. A memory about a "person" is fundamentally different from a memory about a "code pattern." Treating them identically wastes representational capacity and makes retrieval harder.
The Berkeley framework suggests a different approach: type-specific subspaces.
Structured Embedding Spaces
Instead of one flat embedding space, imagine memories organized into learned subspaces:
┌────────────────────────────────────────────────────────┐
│               Embedding Space (384-dim)                │
│                                                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │    Person    │  │   Project    │  │   Solution   │  │
│  │   Subspace   │  │   Subspace   │  │   Subspace   │  │
│  │   (64-dim)   │  │  (128-dim)   │  │   (96-dim)   │  │
│  │              │  │              │  │              │  │
│  │  • Alice     │  │  • ProjectX  │  │  • Fix#123   │  │
│  │  • Bob       │  │  • MemGraph  │  │  • Fix#456   │  │
│  │  • Carol     │  │  • API-v2    │  │  • Pattern#7 │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
│                                                        │
│      ↑ Orthogonal subspaces (maximally separated)      │
└────────────────────────────────────────────────────────┘
Each memory type gets projected into its own subspace. These subspaces are learned to be orthogonal—maximizing separation between types. Within each subspace, similar memories cluster together—maximizing compression.
This is rate reduction in action: expand between categories, compress within them.
The Graph as a Compression Mechanism
Here's where things get interesting for MemoryGraph specifically.
The Berkeley book shows that neural network layers can be understood as iterative compression steps. Each layer transforms representations to be more compact and more structured.
We realized: a knowledge graph already does this.
Consider how MemoryGraph stores information:
Raw Input: "Alice fixed the authentication bug in the login
service yesterday by adding proper token validation"
Graph Representation:
(Alice:Person) --[AUTHORED]--> (Fix#892:Solution)
(Fix#892:Solution) --[RESOLVES]--> (AuthBug:Error)
(AuthBug:Error) --[AFFECTS]--> (LoginService:Project)
(Fix#892:Solution) --[INVOLVES]--> (TokenValidation:CodePattern)
This graph representation is a lossy compression of the original text. We've extracted the essential structure—who, what, where, how—and discarded the rest. The entities are compressed representations (cluster centers), and the relationships define how to navigate between them.
In the language of the Berkeley book:
- Entities = compressed representations of many observations (low-dimensional subspace centers)
- Relations = transformation operators between subspaces
- Observations = high-dimensional raw data that gets compressed into entity updates
The graph structure itself encodes the low-dimensional manifold that rate reduction seeks to discover.
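One way to read "entities as compressed representations" in code: each entity keeps only a compact summary of the observations attached to it, here just a running mean embedding that acts as a cluster center. This is an illustrative sketch, not MemoryGraph's actual storage layer; the 384-dimensional placeholder embeddings stand in for real encoder output.

    import numpy as np

    class Entity:
        """An entity is a lossy summary of many raw observations."""

        def __init__(self, name: str, entity_type: str, dim: int = 384):
            self.name = name
            self.entity_type = entity_type
            self.center = np.zeros(dim)  # compressed representation (cluster center)
            self.count = 0

        def add_observation(self, observation_embedding: np.ndarray) -> None:
            # Running mean: the raw observation is discarded, only the center is updated
            self.count += 1
            self.center += (observation_embedding - self.center) / self.count

    # Many high-dimensional observations compress into one entity update
    alice = Entity("Alice", "person")
    for text_embedding in (np.random.randn(384) for _ in range(10)):  # placeholder embeddings
        alice.add_observation(text_embedding)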
Compression in Action: MemoryGraph's Inference Engine
This isn't just theory—we've already implemented automatic compression in MemoryGraph's inference engine. When you save a memory, the system automatically discovers and creates new relationships you didn't explicitly define.
Transitive Compression
Consider this scenario. You create two explicit relationships:
"Auth Service" --[DEPENDS_ON]--> "JWT Library"
"JWT Library" --[DEPENDS_ON]--> "Crypto Utils"
The inference engine automatically adds:
"Auth Service" --[DEPENDS_ON]--> "Crypto Utils" (inferred, transitive)
This is path compression—reducing a multi-hop traversal to a single edge. In information-theoretic terms, we're eliminating redundancy in the graph structure. The transitive relationship was always implicitly there; the inference engine makes it explicit and queryable.
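A stripped-down version of the idea (not the production rule engine) might look like the sketch below. The edge representation and the default confidence are assumptions chosen to mirror the inferred-edge JSON shown later in this post.

    def infer_transitive(edges: list[tuple[str, str, str]],
                         rel_type: str = "DEPENDS_ON",
                         base_confidence: float = 0.5) -> list[dict]:
        """If A -> B and B -> C for the same relationship type, add A -> C as an inferred edge."""
        direct = {(src, dst) for src, rel, dst in edges if rel == rel_type}
        inferred = []
        for a, b in direct:
            for b2, c in direct:
                if b == b2 and a != c and (a, c) not in direct:
                    inferred.append({
                        "source": a,
                        "target": c,
                        "relationship_type": rel_type,
                        "inferred": True,
                        "confidence": base_confidence,  # assumed default for depth-2 paths
                        "depth": 2,
                    })
        return inferred

    edges = [
        ("Auth Service", "DEPENDS_ON", "JWT Library"),
        ("JWT Library", "DEPENDS_ON", "Crypto Utils"),
    ]
    print(infer_transitive(edges))
    # -> Auth Service --DEPENDS_ON--> Crypto Utils (inferred, confidence 0.5, depth 2)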
Type Inference as Semantic Compression
The engine also performs type inference:
| Rule | What It Does |
|---|---|
| type_from_solves | If memory SOLVES a problem → type becomes "solution" |
| type_from_fixes | If memory FIXES an error → type becomes "fix" |
| type_from_causes | If memory CAUSES a problem → type becomes "problem" |
This is semantic compression: instead of storing "this memory has a SOLVES relationship to a problem-type entity," we compress that pattern into a single type label. The type is the compressed representation of the memory's structural role in the graph.
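In sketch form, the rules above reduce to a small lookup from an outgoing relationship to a type label. The real rules also check the target entity's type (e.g. that SOLVES points at a problem), which this simplified version skips; the rule table and function name are illustrative.

    from typing import Optional

    # Hypothetical rule table: outgoing relationship -> compressed type label
    TYPE_RULES = {
        "SOLVES": "solution",   # type_from_solves
        "FIXES": "fix",         # type_from_fixes
        "CAUSES": "problem",    # type_from_causes
    }

    def infer_type(outgoing_relationships: list[str]) -> Optional[str]:
        """Compress a memory's structural role in the graph into a single type label."""
        for rel in outgoing_relationships:
            if rel in TYPE_RULES:
                return TYPE_RULES[rel]
        return None

    print(infer_type(["FIXES", "PART_OF"]))  # -> "fix"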
Co-occurrence Affinity
Our cloud tier includes an even more interesting rule: co-occurrence affinity. When two memories share multiple common connections (say, 3+ shared neighbors), the engine infers they're related—even if no one explicitly connected them.
Memory A --[USES]--> React
Memory A --[USES]--> TypeScript
Memory A --[PART_OF]--> Frontend Module
Memory B --[USES]--> React
Memory B --[USES]--> TypeScript
Memory B --[PART_OF]--> Frontend Module
Inferred: Memory A --[RELATED_TO]--> Memory B (confidence: 0.45)
This is the graph equivalent of finding low-dimensional structure: memories that occupy similar "positions" in the relationship space (share many neighbors) are likely semantically related, even without explicit links.
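A minimal sketch of the neighbor-counting logic follows. The confidence formula here (0.15 per shared neighbor, capped) is an illustrative assumption chosen so that three shared neighbors yield 0.45 as in the example above; it is not the cloud tier's actual scoring.

    def co_occurrence_affinity(neighbors: dict[str, set[str]],
                               min_shared: int = 3) -> list[dict]:
        """Infer RELATED_TO edges between memories that share many graph neighbors."""
        inferred = []
        memories = list(neighbors)
        for i, a in enumerate(memories):
            for b in memories[i + 1:]:
                shared = neighbors[a] & neighbors[b]
                if len(shared) >= min_shared:
                    # Illustrative confidence: more shared neighbors, higher confidence
                    confidence = min(0.9, 0.15 * len(shared))
                    inferred.append({"source": a, "target": b,
                                     "relationship_type": "RELATED_TO",
                                     "inferred": True,
                                     "confidence": round(confidence, 2)})
        return inferred

    neighbors = {
        "Memory A": {"React", "TypeScript", "Frontend Module"},
        "Memory B": {"React", "TypeScript", "Frontend Module"},
    }
    print(co_occurrence_affinity(neighbors))
    # -> Memory A --RELATED_TO--> Memory B (confidence 0.45)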
Confidence Scores and Lossy Compression
Every inferred relationship carries a confidence score (0-1). Lower confidence means the inference is more speculative—it's a "lossier" compression of the underlying evidence.
{
  "relationship_type": "DEPENDS_ON",
  "inferred": true,
  "confidence": 0.5,
  "rule": "transitive_depends_on",
  "depth": 2
}
Users can tune this tradeoff: accept more inferred relationships (richer graph, more noise) or fewer (sparser graph, higher precision). This is exactly the rate-distortion tradeoff that information theory describes—you choose how much fidelity to sacrifice for how much compression.
The cleanup API makes this explicit:
POST /inference/cleanup?min_confidence=0.3&max_age_days=30
This removes low-confidence edges older than 30 days—literally pruning the graph to maintain a target compression quality.
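Under the hood, that pruning step amounts to a filter over inferred edges. The sketch below is not the actual server code; the edge fields mirror the JSON shown above, and created_at is an assumed timestamp field.

    from datetime import datetime, timedelta, timezone

    def cleanup_inferred_edges(edges: list[dict],
                               min_confidence: float = 0.3,
                               max_age_days: int = 30) -> list[dict]:
        """Drop inferred edges that are both low-confidence and stale."""
        cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
        kept = []
        for edge in edges:
            too_weak = edge.get("inferred") and edge.get("confidence", 1.0) < min_confidence
            too_old = edge.get("created_at", datetime.now(timezone.utc)) < cutoff
            if too_weak and too_old:
                continue  # prune: sacrifices fidelity to keep the graph compressed
            kept.append(edge)
        return kept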
Practical Implementation Ideas
The inference engine demonstrates that compression principles already work in MemoryGraph. Based on the Berkeley framework, we're exploring several enhancements that go deeper:
1. Type-Aware Embeddings
Semantic search, as currently planned for the SDK, would treat all content identically. The Berkeley framework suggests instead projecting embeddings through type-specific learned transformations:
import numpy as np
from sentence_transformers import SentenceTransformer

class LinearProjection:
    """Placeholder for a learned linear map into a type-specific subspace.

    In practice this would be a trained projection (e.g. torch.nn.Linear);
    random initialization here is just for illustration.
    """
    def __init__(self, in_dim: int, out_dim: int):
        self.weight = np.random.randn(out_dim, in_dim) / np.sqrt(in_dim)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return self.weight @ x

class TypeAwareEmbedding:
    def __init__(self):
        self.base_encoder = SentenceTransformer("all-MiniLM-L6-v2")
        # Learned projections for each entity type
        # Dimensions chosen based on type complexity
        self.projections = {
            "person": LinearProjection(384, 64),
            "project": LinearProjection(384, 128),
            "solution": LinearProjection(384, 96),
            "error": LinearProjection(384, 48),
            "code_pattern": LinearProjection(384, 64),
        }

    def embed(self, content: str, entity_type: str) -> np.ndarray:
        base = self.base_encoder.encode(content)
        projection = self.projections.get(entity_type)
        # Fall back to the raw 384-dim embedding for types without a learned subspace
        return projection(base) if projection else base
The dimensionality of each subspace reflects the inherent complexity of that type. People are relatively simple to characterize; projects have more nuance. This extends our existing type inference—instead of just labeling types, we'd represent them in mathematically distinct subspaces.
2. Progressive Memory Consolidation
The book describes how deep networks progressively compress representations layer by layer. Our inference engine already does single-pass compression. We can extend this to multi-layer consolidation over time:
Layer 0: Raw Observations (high-dimensional, ephemeral)
"User mentioned they prefer tabs over spaces"
"User asked about Python formatting"
"User corrected a spacing issue in code"
│
▼ Compression (after session) [EXISTING: type inference]
Layer 1: Working Memory (mid-dimensional, session-persistent)
"User has strong code formatting preferences"
│
▼ Compression (after multiple sessions) [NEW: consolidation]
Layer 2: Consolidated Knowledge (low-dimensional, long-term)
Entity property: coding_style = "strict_formatting"
│
▼ Compression (over time) [NEW: schema evolution]
Layer 3: Core Identity (minimal, permanent)
Entity: User with trait "detail_oriented"
This mirrors how human memory consolidation works—episodic memories compress into semantic knowledge over time. Our existing similar_tags_affinity rule hints at this: memories that share structure get linked. The next step is actually merging them.
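Here is a rough sketch of what a single consolidation pass might look like, assuming each observation carries an embedding. The similarity threshold, the join-based summarization, and the single-pass clustering are all simplifying assumptions for illustration.

    import numpy as np

    def consolidate(observations: list[dict], threshold: float = 0.8) -> list[dict]:
        """Merge near-duplicate observations into consolidated summary memories.

        Each observation either joins an existing cluster (cosine similarity to its
        center above the threshold) or starts a new one; raw observations can then
        be dropped in favor of the cluster summaries.
        """
        clusters: list[dict] = []
        for obs in observations:
            v = obs["embedding"] / np.linalg.norm(obs["embedding"])
            for cluster in clusters:
                c = cluster["center"] / np.linalg.norm(cluster["center"])
                if float(v @ c) >= threshold:
                    cluster["members"].append(obs["content"])
                    cluster["center"] += obs["embedding"]
                    break
            else:
                clusters.append({"center": obs["embedding"].copy(),
                                 "members": [obs["content"]]})
        return [{"summary": " / ".join(c["members"]), "count": len(c["members"])}
                for c in clusters]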
3. Expansion-Compression Retrieval
Our planned hybrid search (ADR-005) optimizes for similarity. The rate reduction framework suggests we should also optimize for distinctiveness:
from statistics import mean
from typing import List

# Memory, embed() and cosine_similarity() are assumed helpers from the surrounding system.
def retrieval_score(query: str, memory: Memory,
                    other_retrieved: List[Memory],
                    alpha: float = 0.7) -> float:
    # Compression term: how relevant is this memory to the query?
    similarity = cosine_similarity(
        embed(query),
        embed(memory.content)
    )

    # Expansion term: how distinct is this memory from the other results?
    # This prevents returning 5 near-duplicate memories
    if other_retrieved:
        distinctiveness = 1.0 - mean(
            cosine_similarity(embed(memory.content), embed(other.content))
            for other in other_retrieved
        )
    else:
        distinctiveness = 1.0  # first result: nothing to be redundant with

    # Balance both objectives (like rate reduction's R - Rc)
    return alpha * similarity + (1 - alpha) * distinctiveness
This directly implements the MCR² principle: maximize total coding rate (diverse results) while minimizing within-group coding rate (each result is relevant). It prevents retrieval from returning redundant results—a common failure mode of pure similarity search.
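In practice this score would be applied greedily, re-balancing after each pick, much like maximal marginal relevance. A small usage sketch built on the retrieval_score function above (the candidate list and k are placeholders):

    def select_memories(query: str, candidates: list, k: int = 5,
                        alpha: float = 0.7) -> list:
        """Greedily pick k memories, re-scoring the rest against what's already selected."""
        selected: list = []
        remaining = list(candidates)
        while remaining and len(selected) < k:
            best = max(remaining,
                       key=lambda m: retrieval_score(query, m, selected, alpha=alpha))
            selected.append(best)
            remaining.remove(best)
        return selected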
4. Graph-Guided Manifold Navigation
Our inference engine already exploits graph structure for discovery. We can extend this to retrieval:
# semantic_search(), get_related_entities() and rank_by_rate_reduction_score()
# are assumed helpers from the surrounding system.
async def graph_aware_search(query: str, depth: int = 2):
    # 1. Find entry points via embedding similarity
    seeds = await semantic_search(query, limit=5)

    # 2. Expand along graph edges (follow the manifold)
    #    This uses our existing relationship structure
    visited = set(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        next_frontier = []
        for entity in frontier:
            # Relations define valid transitions on the manifold
            # Inferred edges from our engine help here!
            neighbors = await get_related_entities(entity)
            for neighbor in neighbors:
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier

    # 3. Re-rank by combined graph + semantic relevance
    return rank_by_rate_reduction_score(visited, query)
The graph provides a strong inductive bias about which memories are likely relevant together. Transitive inferred edges (from our inference engine) act as "shortcuts" on the manifold—they represent compressed paths through the relationship space.
5. Inference Rules as Compression Operators
Looking at our inference rules through the Berkeley lens reveals their true nature:
| Rule | Compression Type | Information-Theoretic Interpretation |
|---|---|---|
| transitive_depends_on | Path compression | Removes redundant traversals |
| type_from_solves | Semantic compression | Encodes role in single label |
| co_occurrence_affinity | Structural compression | Finds shared subspace membership |
| reverse_solves | Bidirectional encoding | Enables queries from either direction |
This suggests new rules we could add:
# Cluster compression: memories with 5+ shared tags → merge into summary entity
"high_overlap_merge": {
    "condition": "shared_tags >= 5 AND same_project",
    "action": "create_summary_entity",
    "compression_type": "lossy"
}

# Temporal compression: daily memories → weekly summary
"temporal_consolidation": {
    "condition": "age > 7 days AND same_topic",
    "action": "compress_to_summary",
    "preserve": ["key_decisions", "blockers", "outcomes"]
}
What's Next
The inference engine proves that compression principles work for AI memory. We're now exploring how to go deeper:
Already Built:
- Transitive inference (path compression)
- Type inference (semantic compression)
- Co-occurrence affinity (structural compression)
- Confidence-based cleanup (rate-distortion tradeoff)
Actively Researching:
Type-aware embeddings: How do we train type-specific projections without massive labeled datasets? Self-supervised approaches using the graph structure itself look promising—the relationships encode supervision signal.
Consolidation triggers: When should observations compress into entity updates? Too aggressive and we lose detail; too conservative and memory bloats. We're exploring information-theoretic triggers: consolidate when adding a new observation wouldn't increase the coding rate significantly.
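One candidate trigger, reusing the coding_rate sketch from earlier in this post: consolidate when a new observation barely raises the coding rate of the entity's existing observations, i.e. when it adds little new information. The threshold value here is a made-up placeholder.

    import numpy as np

    def should_consolidate(existing: np.ndarray, new_obs: np.ndarray,
                           threshold: float = 0.05) -> bool:
        """Consolidate when the new observation adds almost no coding rate.

        existing: (n, d) embeddings already attached to the entity
        new_obs:  (d,) embedding of the incoming observation
        coding_rate() is the sketch defined earlier in this post.
        """
        before = coding_rate(existing)
        after = coding_rate(np.vstack([existing, new_obs]))
        return (after - before) < threshold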
Cross-type retrieval: How do we handle queries spanning multiple types? "Find solutions that Alice worked on for projects using Python" crosses Person, Solution, and Project subspaces. The graph edges provide a natural answer—they're the transformation operators between subspaces.
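As a toy illustration of such a cross-type query, the sketch below walks an in-memory edge list rather than the MemoryGraph query API, and the exact relationship chain (AUTHORED, PART_OF, USES) is an assumption about how such a query would be modeled.

    def solutions_by_person_using_language(edges: list[tuple[str, str, str]],
                                           person: str, language: str) -> set[str]:
        """'Find solutions that <person> worked on for projects using <language>'.

        Each hop crosses a type boundary: Person -> Solution -> Project -> Language.
        """
        authored = {dst for src, rel, dst in edges if src == person and rel == "AUTHORED"}
        results = set()
        for solution in authored:
            projects = {dst for src, rel, dst in edges
                        if src == solution and rel == "PART_OF"}
            for project in projects:
                uses = {dst for src, rel, dst in edges
                        if src == project and rel == "USES"}
                if language in uses:
                    results.add(solution)
        return results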
Rate reduction metrics: Can we use MCR² as an actual quality metric during development? This would let us evaluate memory architectures in a principled way, not just via retrieval benchmarks. Early experiments suggest the metric correlates well with subjective "memory usefulness."
The Bigger Picture
The Berkeley book's framework suggests something profound: compression isn't just a storage optimization—it's the fundamental operation of learning itself.
Every time you explain a complex system by its key components, you're doing rate reduction. Every time you recognize a pattern across multiple experiences, you're finding low-dimensional structure. Every time you organize knowledge hierarchically, you're building a manifold.
For AI memory systems, this means we shouldn't think of memory as a retrieval problem with storage attached. Memory is a compression problem with retrieval as a side effect.
Get the compression right—find the true low-dimensional structure in the experiences—and retrieval becomes almost trivial. The memories that matter will naturally cluster together, and the right memory for a given context will be the one that reduces uncertainty the most.
MemoryGraph's inference engine is our first step down this path. Transitive compression, type inference, and co-occurrence affinity are all compression operators—they find structure and make it explicit. The next steps—type-aware embeddings, progressive consolidation, and rate-reduction-guided retrieval—push further toward a system that truly learns from experiences rather than just storing them.
The theoretical grounding matters because it tells us why these approaches work and what to try next. When your inference rule creates a transitive edge, that's not just a database optimization—it's the system discovering that a three-hop path can be compressed to one hop without losing essential information. When type inference labels a memory as "solution," it's compressing the memory's structural role into a single bit of semantic information.
That's what we're building toward with MemoryGraph. Not just a database that stores what AI assistants experience, but a system that truly learns from those experiences—compressing them into structured knowledge that makes every future interaction more intelligent.
MemoryGraph is an open-source graph-based memory system for AI assistants. Check out the project at github.com/gregorydickson/memory-graph or try the cloud platform at memorygraph.dev.
The Berkeley textbook "Learning Deep Representations of Data Distributions" is freely available at ma-lab-berkeley.github.io/deep-representation-learning-book.
References
Buchanan, S., Pai, D., Wang, P., & Ma, Y. (2025). Learning Deep Representations of Data Distributions. Online textbook.
Chan, K.H.R., Yu, Y., You, C., Qi, H., Wright, J., & Ma, Y. (2022). ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction. Journal of Machine Learning Research.
Yu, Y., Buchanan, S., Pai, D., et al. (2024). White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is? Journal of Machine Learning Research.
Tags: #AI #Memory #LLM #MachineLearning #KnowledgeGraphs #Compression #Research