DEV Community

UAML

Posted on • Originally published at uaml-memory.com

Context Quality is the New Model Quality: An Open Memory Provider Standard with Zero-Downtime Compaction for LLM Agents

Short title: How We Eliminated 77% Entity Loss and Agent Freeze with an Open Memory Standard

Author: L. Zamazal, GLG, a.s.

Date: March 2026

Keywords: LLM memory, context compaction, agent memory, information loss, on-demand recall, UAML, structured memory, MCP, zero-downtime, open standard


Abstract

Large language models have no persistent memory. Every inference call receives the entire conversation context as input, and the model's response quality depends directly on the quality of that input. When context grows beyond the window limit, platforms perform compaction — summarizing the conversation to fit. We measured that standard compaction loses 77% of named entities (people, decisions, tools, dates) in production multi-agent deployments, directly degrading agent decision quality.

We present a three-layer recall architecture that achieves 100% entity recovery while keeping per-turn token costs minimal through on-demand retrieval. Deployed in production across two agent instances processing 16,000+ messages, our approach demonstrates that improving memory quality is more cost-effective than upgrading models. We survey 27 recent papers from Western, Chinese, Korean, and Japanese research communities and show that no existing system combines selective on-demand recall with post-quantum encryption, audit trails, and temporal validity — capabilities required for deployment in regulated industries.

Our central claim: you don't pay for a better model; you pay for better memory.


1. Introduction

The dominant assumption in the LLM industry is that performance scales with model capability — larger models, longer context windows, better reasoning. This assumption drives a hardware arms race and an API pricing war. But it obscures a fundamental architectural problem: the context window is not memory. It is cache.

Every API call to a large language model sends the complete conversation context — system instructions, tool schemas, prior messages, injected documents — as input. There is no persistent state between calls. The model sees only what is explicitly present in that single request. When the accumulated context exceeds the model's window, frameworks apply compaction: lossy summarization that discards what doesn't fit.

An AI agent is only as good as what it remembers. Yet the dominant paradigm for handling context overflow is the computational equivalent of asking a colleague to shred their notes and work from a one-paragraph summary.

The consequences are measurable. In our production deployment, we observed that after standard compaction, agents could recall only 23% of named entities from prior conversations — losing decisions, attributions, tool references, and temporal context. When asked about a payment processor integration discussed 90 minutes earlier, the agent responded "I don't know," despite having processed that exact information before compaction occurred.

This paper presents evidence from academic research and production measurements that:

  1. Context quality determines output quality — not model size, not context window length
  2. Standard compaction is fundamentally lossy — and the loss is quantifiable
  3. On-demand recall from structured memory solves this without inflating per-turn costs
  4. No existing system combines selective recall with the security and auditability required for regulated deployment
  5. The approach works today — no model modifications required, no cloud dependencies

2. The Context Quality Problem

2.1 The "Lost in the Middle" Effect

Liu et al. (2024) demonstrated that transformer-based models exhibit significant performance degradation when relevant information is positioned in the middle of the input context. In multi-document QA experiments, accuracy dropped by more than 20 percentage points when the answer-bearing document moved from the beginning or end to the center of the context [1]. This finding has been replicated across model families including GPT-4, Claude, and Llama.

Crucially, this is not a generational bug being patched away. Esmi et al. (2026) benchmarked GPT-5 and found that long-context performance still degrades compared to short-context baselines [2]. Salvatore et al. (2025) argue this is an emergent property of the attention mechanism itself [3]. The problem is architectural.

2.2 Context Rot

Chroma Research (2026) coined the term "Context Rot" to describe the non-uniform performance degradation that occurs as input length increases [4]. Testing 18 LLMs on controlled tasks where only input length varied (not task complexity), they found that standard benchmarks like Needle-in-a-Haystack give a false sense of security — they test simple lexical retrieval, not the semantic reasoning that real applications demand.

The implication: a 1M-token context window filled with marginally relevant information will produce worse results than a 50K-token window with precisely the right information.

2.3 The Compaction Trap

When context exceeds the window limit, agent platforms perform compaction — typically by sending the entire context to an LLM with instructions to summarize. Wang et al. (2026) describe this as a "fundamentally lossy" operation [5]: truncation and summarization compress or discard evidence that may be critical for future decisions.

The problem compounds over time. Each compaction cycle loses detail from the previous cycle's summary. After several cycles, the agent retains only the broadest themes — a phenomenon we term progressive context amnesia.

2.4 Context Conflicts

Research from the Harbin Institute of Technology (Tan et al., ACL 2024) demonstrated that when retrieved context conflicts with the model's parametric knowledge, LLMs frequently ignore the provided context entirely [6]. Poorly curated context injection doesn't just waste tokens — it can actively mislead the model by triggering parametric override of correct but poorly positioned external evidence. This makes provenance and temporal validity of injected context a first-order concern.


3. Measuring Information Loss: Production Evidence

3.1 Experimental Setup

We deployed UAML (Universal Agent Memory Layer) on two production agent instances processing real-world multi-agent team conversations over a period of several weeks. Over 16,000 chat messages were indexed, producing 6,400+ structured knowledge entries. We then measured entity recovery — the ability to recall named entities (people, tools, decisions, dates, configurations) after compaction events.

3.2 Three-Layer Architecture

Our architecture provides three recovery layers:

| Layer | Source | Function |
|-------|--------|----------|
| L1 | Platform native compaction | Summarized conversation context |
| L2 | UAML knowledge base | Structured entities extracted from conversations |
| L3 | SQL archive | Complete, unmodified message history |
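The layered lookup can be sketched in a few lines: try the compacted summary first, then the structured knowledge base, then the full archive. This is an illustrative sketch only; the names (`recall`, `Layers`) are assumptions, not UAML's actual API.

```typescript
// Illustrative three-layer fallback: L1 (compacted summary),
// L2 (structured knowledge), L3 (verbatim archive).
type RecallHit = { content: string; layer: "L1" | "L2" | "L3" };

interface Layers {
  summary: string;                        // L1: what survives compaction
  knowledge: Map<string, string>;         // L2: extracted entities and facts
  archive: (q: string) => string | null;  // L3: full-history search
}

function recall(q: string, layers: Layers): RecallHit | null {
  // Cheapest first: the answer may already be in the active summary.
  if (layers.summary.includes(q)) {
    return { content: layers.summary, layer: "L1" };
  }
  // Next: targeted structured entries (named entities, decisions).
  const k = layers.knowledge.get(q);
  if (k !== undefined) return { content: k, layer: "L2" };
  // Last resort: the unmodified message history, which never loses data.
  const a = layers.archive(q);
  if (a !== null) return { content: a, layer: "L3" };
  return null;
}
```

Each layer is slower but more complete than the one above it, which is why the combination reaches 100% recovery while the summary alone does not.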

3.3 Results

| Configuration | Entity Recovery | What Survives |
|---------------|-----------------|---------------|
| L1 only (standard compaction) | 23% | Broad themes, recent topics |
| L1 + L2 (+ structured knowledge) | 50% | Named entities, key decisions, facts |
| L1 + L2 + L3 (full architecture) | 100% | Everything — zero data loss |

Standard compaction loses 77% of named entities. After compaction, the agent cannot reliably recall:

  • Who made a specific decision
  • What tool or configuration was discussed
  • When a change was agreed upon
  • Why a particular approach was chosen

These are precisely the facts that determine whether an agent's next response is helpful or hallucinated.
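The entity-recovery metric itself is simple to state in code: the fraction of entities present before compaction that remain recallable afterwards. The function name here is illustrative, not part of UAML's API.

```typescript
// Entity recovery rate: |entities recallable after| / |entities before|.
// A value of 0.23 corresponds to the 23% measured for standard compaction.
function entityRecovery(before: Set<string>, after: Set<string>): number {
  if (before.size === 0) return 1; // nothing existed, so nothing was lost
  let recovered = 0;
  for (const entity of before) {
    if (after.has(entity)) recovered++;
  }
  return recovered / before.size;
}
```

Because the inputs are plain sets of named entities (people, tools, decisions, dates), the metric is easy to reproduce against any memory system.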

3.4 Real-World Impact

During production operation, an agent was asked about a prior decision regarding a payment processor integration — a topic discussed 90 minutes earlier but lost to compaction. Without structured memory, the agent could not answer. With on-demand recall (a single query taking <100ms), the full decision context was recovered and the agent responded correctly with complete attribution.

This is not an edge case. In multi-session, multi-day agent deployments, every compaction cycle creates potential blind spots that accumulate over time.

3.5 Structural Waste

Mason (2026) independently confirmed the scale of the problem by analyzing 857 production LLM sessions comprising 4.45 million effective input tokens [7]. The finding: 21.8% of all context was structural waste — tool definitions that were never invoked, system prompt repetitions, stale tool results from completed subtasks. This waste is not merely an efficiency problem; it actively competes with relevant information for the model's limited attention capacity.


4. On-Demand Recall vs. Context Stuffing

A naive solution would be to inject all available memory into every context. This fails for three reasons:

  1. Token cost scales linearly — every additional token in context is paid for on every turn, whether or not it's needed
  2. Context Rot degrades quality — more tokens ≠ better results [4]
  3. Distractors harm accuracy — Iratni et al. (2025) showed that irrelevant retrieved passages actively degrade output quality [8]

On-demand recall avoids all three problems:

| Approach | Tokens/turn | Cost Pattern | Accuracy |
|----------|-------------|--------------|----------|
| No compaction (full history) | 200K+ | Very high, every turn | Degrades with length |
| Standard compaction | 20–50K | Low | 23% entity recovery |
| Auto-inject all memory | 50–80K | High, every turn | High but with noise |
| On-demand recall | 20–50K base + 2–5K when needed | Low (recall only when needed) | 100% recoverable |

The mechanism works in two phases within a single turn:

  1. Phase 1: The agent receives a query, recognizes it needs historical context, and calls a memory retrieval tool
  2. Phase 2: Relevant entries are returned into context, and the agent formulates a response with complete information

This mirrors how professionals work: you don't carry every document to every meeting, but you know where to find them when needed.
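The two-phase turn can be sketched as a small control loop. Everything here is an assumption for illustration: a hypothetical `llm` callback, and a `RECALL:` marker standing in for a real tool-call protocol.

```typescript
// Two-phase turn: the model either answers directly, or signals that it
// needs historical context, in which case targeted entries are injected
// and the model is called a second time within the same turn.
function answerWithRecall(
  query: string,
  llm: (prompt: string) => string,          // hypothetical model callback
  recallTool: (q: string) => string[],      // hypothetical memory tool
): string {
  // Phase 1: first pass; the model may request memory via a RECALL: line.
  const first = llm(query);
  const m = first.match(/^RECALL:\s*(.+)$/m);
  if (m === null) return first; // no historical context needed

  // Phase 2: inject only the matching entries and answer with full context.
  const entries = recallTool(m[1]);
  return llm(query + "\n\nRelevant memory:\n" + entries.join("\n"));
}
```

The key property is that the 2–5K tokens of recalled memory are paid for only on turns that actually need them.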


5. The Research Landscape

5.1 Context Compression

The compression approach tackles the problem at the infrastructure level — making context smaller without (ideally) losing information.

Activation Beacon (Zhang, Liu, Xiao et al., BAAI/FlagOpen, 2024) compresses KV cache activations rather than text, achieving 8× KV cache reduction and 2× inference speedup on 128K+ contexts [9]. Recurrent Context Compression (Huang, Zhu, Wang et al., 2024) achieves 32× compression with BLEU4 near 0.95 [10]. Semantic Compression (Fei, Niu, Zhou et al., Huawei, ACL 2024) applies information-theoretic source coding to extend context 6–8× without fine-tuning [11].

These approaches address hardware efficiency but do not solve the semantic quality problem: compressed content still carries all information indiscriminately.

5.2 Memory Agent Architectures

A more promising direction treats memory as a managed system rather than a compression target.

MemAgent (Yu, Chen, Feng et al., 2025) uses RL-trained agents that read text in segments and update memory via an overwrite strategy, extrapolating from 8K training context to 3.5M token QA with <5% loss [12]. With 81 citations, it represents the current state of the art in Chinese memory agent research.

SimpleMem (Liu, Su, Xia et al., 2026) implements a three-stage pipeline — semantic compression, online synthesis, intent-aware retrieval — achieving +26.4% F1 on LoCoMo while reducing token consumption by 30× [13]. Its architecture is conceptually closest to our on-demand recall approach.

MemOS (MemTensor, Shanghai Jiao Tong University, Renmin University, China Telecom, 2025) proposes a Memory Operating System with OS-inspired lifecycle management, achieving state-of-the-art across benchmarks [14]. It operates primarily in latent space — complementary to text-level structured approaches.

M+ (ICML 2025) extends MemoryLLM with scalable long-term memory, confirming that "retaining information from the distant past remains a challenge" [15] — exactly what SQL-backed archival solves without model modification.

5.3 Agent-Centric Context Management

ACON (Kang et al., Microsoft Research, 2025) optimizes context compression for long-horizon agents, reducing memory usage by 26–54% [16]. Its key finding validates our approach: "generic summarization easily loses critical details" — task-aware, selective retrieval is essential.

Pichay (Mason, 2026) takes the most radical approach, treating the context window as L1 cache and implementing demand paging with eviction and fault detection [7]. In production, it reduces context consumption by up to 93%. Mason's framing captures the field's core insight: "the problems — context limits, attention degradation, cost scaling, lost state across sessions — are virtual memory problems wearing different clothes."

Focus (Verma, 2026) implements autonomous context compression where agents decide when to consolidate learnings and prune history, achieving 22.7% token reduction without accuracy loss [17].

MemArt (2025) demonstrates that structured retrieval improves accuracy by 11.8–39.4% over plaintext memory methods with 91–135× reduction in prefill tokens [18] — direct validation of the principle that targeted recall outperforms brute-force injection.

5.4 Korean and Japanese Contributions

InfiniGen (Lee et al., Seoul National University, OSDI 2024) addresses KV cache management for long-text generation, achieving up to 3× speedup over existing methods [23]. Funded by Samsung Advanced Institute of Technology and cited 253 times, it represents the hardware-level approach to context scaling that complements software-level memory management.

THEANINE (Ong, Kim, Gwak et al., NAACL 2025) introduces timeline-based memory management for lifelong dialogue agents [24]. Its key insight — don't delete old memories, but connect them temporally and causally — directly validates UAML's temporal validity mechanism. Memories form evolving timelines rather than static snapshots.

LRAgent (Jeon et al., Korea, 2026) tackles KV cache sharing for multi-LoRA agents [25], addressing the exact overhead problem that emerges when multiple agents share a backbone but maintain separate caches.

Most striking is "Store then On-Demand Extract" (Yamanaka et al., Japan, 2026), which argues against the dominant "extract then store" paradigm and advocates storing raw data with on-demand extraction at query time [26]. This is philosophically identical to UAML's three-layer approach: preserve everything (L3), extract structured knowledge (L2), and retrieve on demand. Yamanaka's framing — "uplifting the world with memory" — captures the same conviction that memory infrastructure is foundational, not auxiliary.

AIM-RM (Yoshizato, Shimizu et al., Japan, AAMAS 2026) demonstrates practical deployment of memory retrieval in industrial supply chain agents [27], confirming that memory-augmented agents are moving beyond chatbot applications into production enterprise systems.

5.5 Foundational References

MemGPT/Letta (Packer et al., 2023) pioneered virtual context management inspired by OS memory hierarchies [19]. Mem0 (2025) offers scalable memory for multi-session dialogues [20]. Memex(RL) (Wang et al., 2026) is the closest academic parallel — indexed memory with compact summaries plus full-fidelity external storage [5].

5.6 Surveys

Two comprehensive surveys map the field: Cognitive Memory in LLMs (Shan, Luo, Zhu et al., 2025) provides the most complete taxonomy of memory mechanisms with 34 citations [21], and A Comprehensive Survey on Long Context Language Modeling (Liu, Zhu, Bai et al., 2025) covers the full spectrum with contributions from 35+ researchers and 88 citations [22].

5.7 The Gap

Across all surveyed approaches, several critical capabilities are consistently absent:

| Capability | Surveyed systems (MemAgent, SimpleMem, MemOS, Pichay, THEANINE, Mem0, MemGPT) | UAML |
|------------|--------------------------------------------------------------|------|
| On-demand selective recall | Partial (three systems) | ✓ |
| Cross-session memory | | ✓ |
| End-to-end encryption | | ✓ (PQC) |
| Audit trail / provenance | | ✓ |
| Temporal validity | | ✓ |
| Multi-agent isolation | | ✓ |
| Local-first / self-hosted | Partial (one system) | ✓ |
| Certifiable for regulated use | | ✓ |
| Zero-downtime compaction | Partial (one system) | ✓ |
| MCP integration (drop-in) | | ✓ |
| Production-deployed | Research (four systems), partial (one) | ✓ |

No existing system combines selective on-demand recall with the security, auditability, and temporal reasoning required for deployment in regulated environments — healthcare, legal, financial services, government.


6. Security and Compliance: The Enterprise Gap

Academic memory systems universally neglect security. For enterprises in regulated industries, this is a deployment blocker. UAML addresses this gap with:

  • Post-quantum cryptography (ML-KEM-768, NIST FIPS 203) for all stored memories — future-proof against quantum computing threats
  • Complete audit trail: every write, read, and recall is logged with actor, timestamp, and context
  • Provenance chain: each knowledge entry tracks its source, extraction confidence, and modification history
  • Temporal validity: memories carry valid-from and valid-until metadata, preventing stale information from entering active context
  • Client isolation: strict per-client, per-project memory boundaries prevent cross-contamination
  • Local-first deployment: all data stays on the user's infrastructure — no cloud dependency, no data exfiltration risk

These are not features for a roadmap. They are implemented, tested, and deployed in production.


7. Toward an Open Standard: External Memory Provider API

We believe memory infrastructure should be standardized, not proprietary. We have proposed an External Memory Provider API (RFC #49233) [28] that addresses three interconnected problems: agent downtime during compaction, information loss after compaction, and the lack of a standard integration path for external memory systems.

7.1 The Problem: Compaction Blackouts

When an agent platform's context window fills up, the platform performs synchronous in-band compaction: the agent stops responding, the entire context is sent to an LLM for summarization, and the summary replaces the original context. In our measurements, this creates a 30–60 second blackout — the agent is completely unresponsive. For production use cases (customer support, financial services, healthcare), this is a deployment blocker.

7.2 Proposed Architecture: Hot-Swap Context

RFC #49233 proposes a hot-swap architecture with continuous background synchronization:

┌──────────────────────────────────────────┐
│           Agent Platform                 │
│  Context Slot A (active) ←→ Slot B       │
│              ↕               ↕           │
│  ┌────────────────────────────────┐      │
│  │   Memory Provider Interface    │      │
│  └──────────┬─────────────────────┘      │
└─────────────┼────────────────────────────┘
              ▼
┌─────────────────────────────┐
│  External Memory Provider   │
│  • Continuous async sync    │
│  • Background compression   │
│  • Pre-built context ready  │
│  • Full audit trail         │
└─────────────────────────────┘

The mechanism has four stages:

  1. Continuous sync — Every message is written to the memory provider asynchronously (fire-and-forget, local socket, ~1ms). The provider never blocks the agent.
  2. Background compression — The provider maintains a compressed context summary, updated asynchronously using a lightweight local model. The compressed context is always ready before the agent needs it.
  3. Atomic swap — When capacity threshold is reached, the platform reads the pre-built context from the provider (~50ms) and swaps it in during the inter-message gap. The agent never pauses.
  4. On-demand recall — When the agent needs details that were compressed away, it queries the provider via a tool call (~10–50ms), receiving targeted knowledge entries.

This is analogous to double-buffering in graphics: while the agent uses buffer A, the provider prepares buffer B in the background. When it's time to compact, the platform atomically swaps A for B.
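The double-buffering analogy maps directly onto a small data structure. This is a sketch of the idea only: the class name and methods are illustrative, and a real provider would run `prepare` asynchronously in the background rather than synchronously as here.

```typescript
// Double-buffered context: slot A (active) serves the agent while slot B
// (standby) holds a pre-built compressed context, swapped in atomically.
class HotSwapContext {
  private active: string[] = [];         // slot A: what the agent sees
  private standby: string | null = null; // slot B: pre-built summary

  // Stage 1: continuous sync; appending never blocks on compression.
  append(msg: string): void {
    this.active.push(msg);
  }

  // Stage 2: background compression keeps slot B ready ahead of need.
  prepare(compress: (msgs: string[]) => string): void {
    this.standby = compress(this.active);
  }

  // Stage 3: atomic swap during the inter-message gap. Returns false if
  // no standby context is ready, in which case slot A keeps serving.
  swap(): boolean {
    if (this.standby === null) return false;
    this.active = [this.standby];
    this.standby = null;
    return true;
  }

  size(): number {
    return this.active.reduce((n, m) => n + m.length, 0);
  }
}
```

Stage 4 (on-demand recall) is a separate query path against the provider, so details compressed out of slot B remain reachable.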

7.3 The API

The proposed interface is deliberately minimal — three core operations plus lifecycle hooks:

interface MemoryProvider {
  // Fire-and-forget after each message (async, not in critical path)
  onMessage(sessionId: string, message: Message): Promise<void>;

  // Returns pre-built compressed context
  getCompressedContext(sessionId: string, maxChars: number): Promise<CompressedContext>;

  // On-demand recall tool
  recall(sessionId: string, query: string, limit?: number): Promise<MemoryEntry[]>;

  ping(): Promise<{ ok: boolean; latencyMs: number }>;
}

interface SessionHooks {
  'pre-compaction': (session: Session) => Promise<void>;
  'post-compaction': (session: Session, newContext: CompressedContext) => Promise<void>;
  'session-start': (session: Session) => Promise<CompressedContext | null>;  // Cold-start recovery
  'session-end': (session: Session) => Promise<void>;
}
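For illustration, a toy provider satisfying the three core operations might look like the following. This is a sketch only: `InMemoryProvider` and its minimal record types are assumptions, and a real provider would persist to SQL, encrypt at rest, and rank recall results rather than substring-match.

```typescript
// Minimal in-memory sketch of the proposed MemoryProvider operations.
type Message = { role: string; text: string };
type CompressedContext = { summary: string };
type MemoryEntry = { content: string; sessionId: string };

class InMemoryProvider {
  private log = new Map<string, Message[]>();

  // Fire-and-forget write path; never in the agent's critical path.
  async onMessage(sessionId: string, message: Message): Promise<void> {
    const msgs = this.log.get(sessionId) ?? [];
    msgs.push(message);
    this.log.set(sessionId, msgs);
  }

  // Pre-built compressed context; here, naively, the most recent tail.
  async getCompressedContext(sessionId: string, maxChars: number): Promise<CompressedContext> {
    const text = (this.log.get(sessionId) ?? []).map((m) => m.text).join(" | ");
    return { summary: text.slice(-maxChars) };
  }

  // On-demand recall; a real provider would use semantic retrieval.
  async recall(sessionId: string, query: string, limit = 5): Promise<MemoryEntry[]> {
    return (this.log.get(sessionId) ?? [])
      .filter((m) => m.text.toLowerCase().includes(query.toLowerCase()))
      .slice(0, limit)
      .map((m) => ({ content: m.text, sessionId }));
  }

  async ping(): Promise<{ ok: boolean; latencyMs: number }> {
    return { ok: true, latencyMs: 0 };
  }
}
```

Even this naive implementation demonstrates the contract: writes are async, the compressed context is always available, and recall is a targeted query rather than a context dump.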

Configuration is opt-in with full backward compatibility:

{
  "memory": {
    "provider": "uaml",
    "endpoint": "http://localhost:8770",
    "compaction": {
      "strategy": "hot-swap",
      "threshold": 0.85,
      "targetSize": 0.40
    }
  }
}

7.4 Performance Impact

| Metric | Current (builtin) | With Memory Provider |
|--------|-------------------|----------------------|
| Compaction duration | 30–60s | <100ms |
| Agent downtime | 30–60s | 0 (between messages) |
| Information loss | Significant (77%) | None (full DB) |
| Audit trail | None | Complete |
| Cost per compaction | ~$0.10–0.50 | ~$0.001 (async local) |

7.5 Seamless Integration via MCP

A critical design decision is the choice of integration protocol for the recall tool. UAML implements the Model Context Protocol (MCP) connector, which means any existing LLM agent that supports MCP tool calls can connect to UAML's full memory capabilities — recall, indexing, knowledge extraction — without any code changes to the agent itself. The agent simply gains a new tool in its toolbox. No forking, no framework lock-in, no migration.

In our production deployment, two agents from different platforms were connected to UAML via MCP within minutes, immediately gaining access to structured memory recall while retaining all their existing functionality. The MCP approach turns memory from a platform feature into a universal service layer that any agent can consume.

7.6 Dual-Track Architecture: Never a Single Point of Failure

A critical design principle: the platform's builtin compaction pipeline always runs in parallel — it is never disabled, even when a memory provider is active.

Builtin Compaction  ─────────────────────► ALWAYS running (shadow/backup)
UAML Memory Provider ────────────────────► Enriches when available (overlay)

This guarantees 100% functionality at all times:

  • If the memory provider fails → agent continues with builtin context (no interruption)
  • If the memory provider succeeds → agent gets enriched context (better recall)
  • Switchover is automatic — no operator intervention required
  • The memory provider enhances the pipeline; it does not replace it
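The dual-track guarantee reduces to a small wrapper: attempt the provider, and on any failure fall through to the builtin path that is always running. Names here are illustrative, not the platform's actual API.

```typescript
// Dual-track compaction: the provider is an overlay, never a dependency.
function compactWithFallback(
  builtin: () => string,        // always-available builtin compaction
  provider?: () => string,      // optional external memory provider
): { context: string; source: "provider" | "builtin" } {
  if (provider !== undefined) {
    try {
      // Happy path: enriched context from the memory provider.
      return { context: provider(), source: "provider" };
    } catch {
      // Provider failed: fall through silently, no operator intervention.
    }
  }
  // Shadow/backup path: builtin compaction keeps the agent running.
  return { context: builtin(), source: "builtin" };
}
```

Because the fallback is structural rather than configured, a crashed or slow provider can never take the agent down with it.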

7.7 Input Quality Classification

Not all information is equally important. The memory provider classifies every entry:

| Level | Criteria | % of data | Examples |
|-------|----------|-----------|----------|
| HIGH | Decisions, rules, architecture choices | 0.5% | "Chose ML-KEM-768 for encryption" |
| MEDIUM | Entity mentions, config changes, results | 7% | "VPS IP: 5.189.139.221" |
| LOW | Debug output, heartbeats, transient status | 93% | Tool output, NO_REPLY messages |

This filtering ensures the recall API returns signal, not noise. The getCompressedContext endpoint prioritizes HIGH and MEDIUM entries, keeping injected context compact and relevant.
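A first-pass triage like the table describes can be as simple as keyword heuristics. The keyword lists below are illustrative assumptions, not UAML's actual classification rules, which would reasonably combine heuristics with model-based scoring.

```typescript
// Heuristic sketch of the HIGH/MEDIUM/LOW triage from the table above.
type Level = "HIGH" | "MEDIUM" | "LOW";

function classify(text: string): Level {
  const t = text.toLowerCase();
  // Decisions, rules, and architecture choices carry the most signal.
  if (/\b(decided|decision|rule|chose|architecture)\b/.test(t)) return "HIGH";
  // Entity mentions, configuration changes, and results are worth keeping.
  if (/\b(config|ip|version|result|changed)\b/.test(t)) return "MEDIUM";
  // Everything else: debug output, heartbeats, transient status.
  return "LOW";
}
```

With roughly 93% of traffic landing in LOW, even a crude classifier removes most of the noise before it competes for context space.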

7.8 Provenance Chain

Every knowledge entry maintains a verifiable provenance chain from recalled fact to original message:

UAML Entry #4521
├── content: "Decided on hot-swap compaction strategy"
├── importance: HIGH (score: 7)
├── source: session:a7e0260c:msg_hash_abc123
├── chat_history_id: 28451  ← links to SQL archive
└── created_at: 2026-03-18T14:23:00Z
        │
        ▼
SQL chat_messages #28451
├── text: [full original message, verbatim]
├── source_file: a7e0260c-...jsonl
└── source_line: 14832

This enables audit (trace any fact to its source), verification (compare summary against verbatim record), and compliance (demonstrate data provenance for regulated environments).
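Following the chain is a single indexed lookup from the structured entry to its verbatim source. The types and names below are assumptions for illustration, mirroring the fields shown in the diagram.

```typescript
// Provenance lookup: a recalled fact resolves to its verbatim SQL record
// via the chat_history_id link shown above.
type UamlEntry = { id: number; content: string; chatHistoryId: number };
type ChatMessage = { id: number; text: string; sourceFile: string; sourceLine: number };

function traceToSource(
  entry: UamlEntry,
  archive: Map<number, ChatMessage>, // stands in for the SQL chat_messages table
): ChatMessage | null {
  return archive.get(entry.chatHistoryId) ?? null;
}
```

An auditor can therefore compare any summarized claim against the unmodified original message, line by line.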

7.9 Cross-Session Memory

In production, agents operate across multiple sessions (Discord channels, messaging platforms, scheduled tasks). The proposed API extension enables:

  • Global recall — find knowledge from any session, not just the current one
  • Session grouping — the broker identifies related sessions automatically
  • Recall chain linking: our implementation already indexes 807+ session fragments with recall links across session boundaries

7.10 Error Handling and Backward Compatibility

The design prioritizes graceful degradation:

  • Provider failure → automatic fallback to builtin compaction (never block the agent)
  • Cold start → session-start hook enables context reconstruction from historical data
  • Message during swap → swap occurs in inter-message gap; incoming messages queued during ~100ms swap
  • Default behavior unchanged → builtin compaction remains the default; memory provider is opt-in

7.11 Phased Roadmap

| Phase | Capability | Estimated Timeline |
|-------|------------|--------------------|
| 1 | Pre/post-compaction hooks | 1–2 weeks |
| 2 | Memory Provider Interface | 2–3 weeks |
| 3 | Hot-swap compaction strategy | 3–4 weeks |
| 4 | Background sync + config + docs | 2–3 weeks |

Phase 1 alone would already enable external memory integration and demonstrate value. The full RFC proposal is publicly available at github.com/openclaw/openclaw/issues/49233 and has received community attention and independent analysis, confirming demand for standardized memory infrastructure.


8. Practical Implications

For Agent Developers

  • Standard compaction is a silent quality killer — measure your entity recovery rate before and after compaction
  • On-demand recall is more cost-effective than larger context windows or more expensive models
  • Memory quality compounds: each correct recall enables better future decisions

For Platform Builders

  • Expose compaction hooks to external providers — one-size-fits-all compaction is insufficient
  • Compliance requirements will drive enterprise adoption of structured, auditable memory
  • The open standard approach (RFC #49233) reduces integration risk

For Researchers

  • Context quality vs. context quantity deserves more attention as an evaluation axis
  • Entity recovery rate is a practical, reproducible metric for comparing memory approaches
  • The gap between academic prototypes and production requirements (encryption, audit, compliance) represents an underexplored research opportunity

For the Industry

Research groups across East Asia are producing world-class work on context compression and memory architectures. Chinese institutions (BAAI, Shanghai Jiao Tong, Harbin Institute, Huawei, China Telecom, MemTensor), Korean groups (Seoul National University, Samsung, KAIST), and Japanese researchers are collectively advancing the field at a pace that exceeds Western output in volume and increasingly matches it in impact. Any serious memory infrastructure product must engage with this global research base.


9. Conclusion

The evidence converges from multiple independent sources: context quality determines output quality. A smaller context with the right information produces better results than a larger context with noise. Standard compaction loses 77% of named entities — a measurable degradation that directly affects agent decision quality.

On-demand recall from structured memory resolves this without the token cost of context stuffing. Our three-layer architecture (compaction + knowledge base + archive) achieves 100% entity recovery while maintaining minimal per-turn costs. The approach requires no model modifications, no cloud dependencies, and works as an overlay on existing agent platforms.

We are not arguing for replacing compaction — it serves a useful purpose in maintaining manageable context sizes. We are arguing that compaction alone is insufficient, and that a structured external memory layer transforms it from a lossy compression into a lossless one.

The organizations that understand this will build agents that actually work. The rest will keep buying bigger models and wondering why they still forget.

Memory quality is the new model quality.


References

[1] Liu, N.F. et al. (2024). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the ACL, 12, 157–173. arXiv:2307.03172

[2] Esmi, N. et al. (2026). "GPT-5 vs Other LLMs in Long Short-Context Performance." arXiv, February 2026.

[3] Salvatore, N. et al. (2025). "Lost in the Middle: An Emergent Property from Information Retrieval Demands in LLMs." arXiv, October 2025.

[4] Chroma Research (2026). "Context Rot: How Increasing Input Tokens Impacts LLM Performance." research.trychroma.com/context-rot

[5] Wang, Z. et al. (2026). "Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory." arXiv:2603.04257

[6] Tan, Y. et al. (2024). "When Retrieved Context Conflicts with Parametric Knowledge." ACL 2024, Harbin Institute of Technology.

[7] Mason, T. (2026). "The Missing Memory Hierarchy: Demand Paging for LLM Context Windows." arXiv:2603.09023

[8] Iratni, M. et al. (2025). "Dynamic Context Selection for Retrieval-Augmented Generation: Mitigating Distractors and Positional Bias." arXiv, December 2025.

[9] Zhang, P. et al. (2024). "Long Context Compression with Activation Beacon." BAAI/FlagOpen. arXiv:2401.03462

[10] Huang, C. et al. (2024). "Recurrent Context Compression: Efficiently Expanding the Context Window of LLM." arXiv:2406.06110

[11] Fei, W. et al. (2024). "Extending Context Window of Large Language Models via Semantic Compression." Huawei. Findings of ACL 2024, 5169–5181.

[12] Yu, H. et al. (2025). "MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent." arXiv:2507.02259

[13] Liu, J. et al. (2026). "SimpleMem: Efficient Lifelong Memory for LLM Agents." arXiv:2601.02553

[14] MemTensor et al. (2025). "MemOS: A Memory OS for AI System." Shanghai Jiao Tong University, Renmin University, China Telecom. arXiv:2507.03724

[15] M+ (2025). "Extending MemoryLLM with Scalable Long-Term Memory." ICML 2025. arXiv:2502.00592

[16] Kang, M. et al. (2025). "ACON: Optimizing Context Compression for Long-Horizon LLM Agents." arXiv:2510.00615

[17] Verma, N. (2026). "Active Context Compression: Autonomous Memory Management in LLM Agents." arXiv:2601.07190

[18] MemArt (2025). "KVCache-Centric Memory for LLM Agents." OpenReview.

[19] Packer, C. et al. (2023). "MemGPT: Towards LLMs as Operating Systems." arXiv:2310.08560

[20] Mem0 (2025). "Building Production-Ready AI Agents with Scalable Long-Term Memory." arXiv:2504.19413

[21] Shan, L. et al. (2025). "Cognitive Memory in Large Language Models." arXiv:2504.02441

[22] Liu, J. et al. (2025). "A Comprehensive Survey on Long Context Language Modeling." arXiv:2503.17407

[23] Lee, S. et al. (2024). "InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management." OSDI 2024, Seoul National University.

[24] Ong, D., Kim, H., Gwak, S. et al. (2025). "THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation." NAACL 2025.

[25] Jeon, H. et al. (2026). "LRAgent: Multi-LoRA Agents with KV Cache Sharing." arXiv:2602.01053

[26] Yamanaka, Y. et al. (2026). "Store then On-Demand Extract: A Memory Architecture for LLM Agents." arXiv:2602.16192

[27] Yoshizato, T., Shimizu, H. et al. (2026). "AIM-RM: Agent-based Inventory Management with Retrieval Memory." AAMAS 2026. arXiv:2602.05524

[28] Zamazal, L. / GLG, a.s. (2026). "RFC: External Memory Provider API for OpenClaw." GitHub Issue #49233. github.com/openclaw/openclaw/issues/49233


GLG, a.s. — UAML (Universal Agent Memory Layer) is available at uaml-memory.com. Technical documentation and API reference at smart-memory.ai.
