AI Agent Memory Management System Architecture Design: Evolution from Stateless to Intelligent

Abstract

With the widespread adoption of Large Language Models (LLMs), AI Agents are evolving from simple Q&A tools into intelligent assistants with long-term memory and context understanding. However, the inherently stateless nature of LLMs limits an Agent's continuity and personalization across multi-turn conversations. This article provides an in-depth analysis of Cortex Memory, an AI Agent memory management system built with Rust, detailing its architectural design philosophy, its core technical implementation, and how it addresses the central challenges of AI Agent memory management.


1. Problem Background: The Memory Dilemma of AI Agents

1.1 Stateless LLMs

While modern large language models excel in single conversations, they are fundamentally stateless. Each conversation is independent, and the model cannot automatically remember previous interactions. This leads to:

  • Repetitive queries: Users need to provide the same information repeatedly
  • Lack of personalization: Agents cannot learn user preferences
  • Context fragmentation: Information cannot persist across sessions
  • Limited decision-making: Lack of historical data to support complex decisions

1.2 Limitations of Traditional Solutions

Existing solutions have the following issues:

| Solution | Advantages | Disadvantages |
| --- | --- | --- |
| Simple database storage | Simple implementation | Lacks semantic understanding, difficult retrieval |
| Keyword search | Fast and direct | Cannot understand natural language queries |
| Manual rule matching | High controllability | High maintenance cost, poor scalability |
| Context window passing | Simple implementation | High token consumption, low information density |

1.3 Characteristics of an Ideal Memory System

An ideal AI Agent memory system should possess:

  1. Semantic understanding: Understand natural language queries rather than relying on exact matching
  2. Automatic organization: Automatically classify, deduplicate, and optimize memory content
  3. Intelligent retrieval: Rank comprehensively based on relevance, importance, and timeliness
  4. Scalability: Support massive memory storage and efficient queries
  5. Multiple access methods: Support API, CLI, MCP, and other interfaces

2. Cortex Memory Architecture Design

2.1 Overall Architecture Overview

Cortex-Mem adopts a layered, modular architecture with clear separation of concerns.

2.2 Core Design Principles

2.2.1 Layered Architecture

The system is divided into five layers, each with clear responsibilities:

  1. Access Layer: Provides multiple interaction methods (CLI, API, MCP, Dashboard)
  2. Business Logic Layer: MemoryManager as the core coordinator
  3. Processing Layer: Professional components for fact extraction, classification, update, and optimization
  4. Storage Layer: Vector database persistence
  5. External Services Layer: LLM and Embedding services

2.2.2 Dependency Injection and Composition Pattern

MemoryManager composes various professional components through dependency injection:

pub struct MemoryManager {
    vector_store: Box<dyn VectorStore>,
    llm_client: Box<dyn LLMClient>,
    fact_extractor: Box<dyn FactExtractor>,
    memory_updater: Box<dyn MemoryUpdater>,
    importance_evaluator: Box<dyn ImportanceEvaluator>,
    duplicate_detector: Box<dyn DuplicateDetector>,
    memory_classifier: Box<dyn MemoryClassifier>,
}
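
To make the composition concrete, one plausible shape for a constructor is shown below; the new signature is an illustrative sketch rather than the crate's actual API:

impl MemoryManager {
    // Illustrative constructor: every collaborator is injected as a trait object,
    // so production code can pass Qdrant- and LLM-backed implementations while
    // tests pass lightweight stubs.
    pub fn new(
        vector_store: Box<dyn VectorStore>,
        llm_client: Box<dyn LLMClient>,
        fact_extractor: Box<dyn FactExtractor>,
        memory_updater: Box<dyn MemoryUpdater>,
        importance_evaluator: Box<dyn ImportanceEvaluator>,
        duplicate_detector: Box<dyn DuplicateDetector>,
        memory_classifier: Box<dyn MemoryClassifier>,
    ) -> Self {
        Self {
            vector_store,
            llm_client,
            fact_extractor,
            memory_updater,
            importance_evaluator,
            duplicate_detector,
            memory_classifier,
        }
    }
}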

This design brings the following advantages:

  • Testability: Easy to mock components for unit testing
  • Extensibility: Can replace any component implementation
  • Flexibility: Configuration-driven component selection

2.2.3 Trait Abstraction

Key components are abstracted through Traits, supporting multiple implementations:

#[async_trait]
pub trait VectorStore: Send + Sync {
    async fn insert(&self, memory: &Memory) -> Result<()>;
    async fn search(&self, query: &[f32], filters: &Filters, limit: usize) -> Result<Vec<ScoredMemory>>;
    async fn get(&self, id: &str) -> Result<Option<Memory>>;
    async fn update(&self, memory: &Memory) -> Result<()>;
    async fn delete(&self, id: &str) -> Result<()>;
    async fn list(&self, filters: &Filters, limit: Option<usize>) -> Result<Vec<Memory>>;
}

Currently supports Qdrant, with future expansion to Chroma, Pinecone, and other vector databases.
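
Because MemoryManager only depends on the trait, adding a new backend is mostly a matter of implementing VectorStore and wiring it into whatever factory builds the manager. A hedged sketch of such a factory follows; VectorStoreConfig, QdrantStore, and InMemoryVectorStore are hypothetical names, not the project's actual types:

// Illustrative factory: the trait object makes backend selection a configuration
// concern. All type names here are assumptions for this sketch.
fn build_vector_store(cfg: &VectorStoreConfig) -> anyhow::Result<Box<dyn VectorStore>> {
    match cfg.provider.as_str() {
        "qdrant" => Ok(Box::new(QdrantStore::connect(&cfg.url)?)),
        "memory" => Ok(Box::new(InMemoryVectorStore::default())),
        other => Err(anyhow::anyhow!("unknown vector store provider: {other}")),
    }
}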

2.3 Data Model Design

2.3.1 Memory Entity

pub struct Memory {
    pub id: String,
    pub content: String,
    pub embedding: Vec<f32>,
    pub metadata: MemoryMetadata,
    pub created_at: DateTime<Utc>,
    pub updated_at: DateTime<Utc>,
}

2.3.2 Metadata Structure

pub struct MemoryMetadata {
    pub user_id: Option<String>,
    pub agent_id: Option<String>,
    pub run_id: Option<String>,
    pub actor_id: Option<String>,
    pub role: Option<String>,
    pub memory_type: MemoryType,
    pub hash: String,
    pub importance_score: f32,
    pub entities: Vec<String>,
    pub topics: Vec<String>,
    pub custom: HashMap<String, serde_json::Value>,
}

2.3.3 Memory Types

The system supports six memory types:

pub enum MemoryType {
    Conversational,  // Conversation records
    Procedural,      // Procedural knowledge (how to do something)
    Factual,         // Factual information
    Semantic,        // Semantic concepts
    Episodic,        // Episodic memory (specific events)
    Personal,        // Personalized information (preferences, characteristics)
}

3. Core Workflows

3.1 Memory Creation Workflow

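In code form, the creation flow roughly chains the components described in section 4. The add entry point, the extract, evaluate, and decide method names, and the Clone/Default assumptions on the data types are all illustrative for this sketch:

// Illustrative creation flow: extract facts, embed and score them, check for
// duplicates, then let the updater decide what to persist.
pub async fn add(&self, conversation: &str, metadata: MemoryMetadata) -> Result<Vec<Memory>> {
    let facts = self.fact_extractor.extract(conversation).await?;

    let mut stored = Vec::new();
    for fact in facts {
        let embedding = self.llm_client.embed(&fact.content).await?;
        let importance = self.importance_evaluator.evaluate(&fact.content).await?;

        // Retrieve the closest existing memories and ask the updater for an action.
        let similar = self.vector_store.search(&embedding, &Filters::default(), 5).await?;
        match self.memory_updater.decide(&fact, &similar).await? {
            MemoryAction::Ignore { .. } => continue,
            // Merge and Update would modify existing records; elided in this sketch.
            _ => {
                let memory = Memory {
                    id: uuid::Uuid::new_v4().to_string(),
                    content: fact.content.clone(),
                    embedding,
                    metadata: MemoryMetadata {
                        importance_score: importance,
                        ..metadata.clone()
                    },
                    created_at: Utc::now(),
                    updated_at: Utc::now(),
                };
                self.vector_store.insert(&memory).await?;
                stored.push(memory);
            }
        }
    }
    Ok(stored)
}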

3.2 Semantic Search Workflow


The scoring algorithm combines semantic similarity and memory importance:

let final_score = similarity_score * 0.7 + importance_score * 0.3;
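
A small sketch of how that weighting might be applied when re-ranking search hits; the score and memory field names on ScoredMemory are assumptions:

// Blend vector similarity with stored importance using the 0.7 / 0.3 weights above,
// then sort best-first.
fn rank(mut candidates: Vec<ScoredMemory>) -> Vec<ScoredMemory> {
    for c in &mut candidates {
        c.score = c.score * 0.7 + c.memory.metadata.importance_score * 0.3;
    }
    candidates.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap_or(std::cmp::Ordering::Equal));
    candidates
}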

3.3 Memory Optimization Workflow

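In code form, an optimization pass might walk stored memories and apply the thresholds from the [optimization] configuration shown in section 5.3. The optimize entry point, the find_duplicates method, and the config field are assumptions for this sketch:

// Illustrative optimization pass: prune low-quality memories, collapse
// near-duplicates, and (not shown here) decay stale ones.
pub async fn optimize(&self, filters: &Filters) -> Result<usize> {
    let memories = self.vector_store.list(filters, None).await?;
    let mut removed = 0;

    for memory in &memories {
        // quality_threshold (0.4 in the sample config): drop memories that never proved useful.
        if memory.metadata.importance_score < self.config.quality_threshold {
            self.vector_store.delete(&memory.id).await?;
            removed += 1;
            continue;
        }
        // deduplication_threshold (0.85 in the sample config): remove near-duplicates.
        for dup in self.duplicate_detector.find_duplicates(memory, &memories).await? {
            self.vector_store.delete(&dup.id).await?;
            removed += 1;
        }
        // time_decay_days (180 in the sample config): importance decay is elided here.
    }
    Ok(removed)
}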


4. Key Technical Implementation

4.1 Fact Extraction System

Fact extraction is the core of memory creation, responsible for extracting valuable information from conversations.

4.1.1 Dual-Channel Extraction

Supports multiple extraction strategies:

pub enum ExtractionStrategy {
    DualChannel,   // Extract facts from both user and assistant
    UserOnly,      // Extract only user facts
    AssistantOnly, // Extract only assistant facts
    ProceduralMemory, // Dedicated for procedural memory
}

4.1.2 Fact Structure

pub struct ExtractedFact {
    pub content: String,
    pub importance: f32,
    pub category: FactCategory,
    pub entities: Vec<String>,
    pub language: Option<LanguageInfo>,
    pub source_role: String,
}

4.1.3 Prompt Engineering

Carefully designed prompts help prevent hallucinations:

[IMPORTANT]: GENERATE FACTS SOLELY BASED ON THE USER'S MESSAGES.
DO NOT INCLUDE INFORMATION FROM ASSISTANT OR SYSTEM MESSAGES.

Analyze the following conversation and extract facts about the user:

{conversation}

Extract facts in the following format:
- Fact 1: [fact content]
- Fact 2: [fact content]
...
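
A hedged sketch of how this template could be filled in and its reply parsed back into fact strings; the FACT_EXTRACTION_PROMPT constant is an assumption, while complete mirrors the LLM client call used elsewhere in this article:

// Fill the template, call the LLM, and parse "- Fact N: ..." lines into raw fact strings.
async fn extract_user_fact_lines(&self, conversation: &str) -> Result<Vec<String>> {
    let prompt = FACT_EXTRACTION_PROMPT.replace("{conversation}", conversation);
    let response = self.llm_client.complete(&prompt).await?;

    Ok(response
        .lines()
        .filter_map(|line| line.trim().strip_prefix("- Fact"))
        .filter_map(|rest| rest.split_once(':'))
        .map(|(_, content)| content.trim().to_string())
        .filter(|content| !content.is_empty())
        .collect())
}

Each resulting string would then flow through classification and importance scoring to become an ExtractedFact.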

4.2 Intelligent Update Strategy

The system uses MemoryUpdater to decide how to handle new information:

4.2.1 Action Priority

pub enum MemoryAction {
    Ignore { reason: String },      // Highest priority: skip redundant or low-value information
    Merge { target_id: String },    // Fold the new fact into an existing memory
    Update { id: String },          // Revise an existing memory in place
    Create {},                      // Lowest priority: store as a brand-new memory
}

4.2.2 Decision Logic

PREFERENCE HIERARCHY:
- Prefer IGNORE over UPDATE/MERGE to prevent duplication
- Use MERGE for related but redundant facts
- Only CREATE when information is truly unique
- Consider information density: consolidate small related facts

The LLM makes decisions based on the following criteria (a sketch of the decision call follows this list):

  1. Duplication: Whether the content already exists
  2. Complementarity: Whether it provides new information
  3. Density: Whether the information is sufficient for independent storage
  4. Importance: Whether it's worth remembering separately
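
A minimal sketch of how those criteria could be turned into a decision call; the prompt wording, the parsing of the model's answer, and the ScoredMemory field names are assumptions:

// Ask the LLM to choose an action for a new fact given its closest existing memories.
async fn decide_action(&self, fact: &ExtractedFact, similar: &[ScoredMemory]) -> Result<MemoryAction> {
    let existing = similar.iter()
        .map(|s| format!("- [{}] {}", s.memory.id, s.memory.content))
        .collect::<Vec<_>>()
        .join("\n");

    let prompt = format!(
        "New fact:\n{}\n\nExisting memories:\n{}\n\n\
         Choose one action: IGNORE, MERGE <id>, UPDATE <id>, or CREATE.\n\
         Prefer IGNORE over UPDATE/MERGE; only CREATE when the fact is truly unique.",
        fact.content, existing
    );
    let answer = self.llm_client.complete(&prompt).await?;
    let answer = answer.trim();

    // Very rough parsing of the model's reply; real code would be more defensive.
    Ok(if answer.starts_with("IGNORE") {
        MemoryAction::Ignore { reason: "duplicate or low-value".to_string() }
    } else if let Some(id) = answer.strip_prefix("MERGE ") {
        MemoryAction::Merge { target_id: id.trim().to_string() }
    } else if let Some(id) = answer.strip_prefix("UPDATE ") {
        MemoryAction::Update { id: id.trim().to_string() }
    } else {
        MemoryAction::Create {}
    })
}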

4.3 Deduplication Mechanism

The system implements multi-level deduplication:

4.3.1 Hash Deduplication

fn generate_hash(&self, content: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(content.as_bytes());
    format!("{:x}", hasher.finalize())
}
Enter fullscreen mode Exit fullscreen mode

4.3.2 Semantic Deduplication

Uses cosine similarity to detect semantic duplicates:

fn cosine_similarity(&self, vec1: &[f32], vec2: &[f32]) -> f32 {
    let dot_product: f32 = vec1.iter().zip(vec2.iter()).map(|(a, b)| a * b).sum();
    let norm1: f32 = vec1.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm2: f32 = vec2.iter().map(|x| x * x).sum::<f32>().sqrt();

    if norm1 == 0.0 || norm2 == 0.0 {
        return 0.0;
    }

    dot_product / (norm1 * norm2)
}

4.3.3 LLM Verification

For suspected duplicates, use LLM for final confirmation:

Compare the following two memories and determine if they are duplicates:

Memory A: {content_a}
Memory B: {content_b}

Are they duplicates? (yes/no)
If yes, which one should be kept?
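
Putting the three levels together, the check might be chained roughly as follows. The confirm_duplicate_with_llm helper is an assumption wrapping the prompt above, and the 0.85 figure mirrors the deduplication_threshold in the sample configuration of section 5.3:

// Multi-level duplicate check: exact hash match, then embedding similarity,
// then LLM confirmation for borderline cases.
async fn is_duplicate(&self, candidate: &Memory, existing: &[Memory]) -> Result<bool> {
    for mem in existing {
        // Level 1: identical content hashes are always duplicates.
        if mem.metadata.hash == candidate.metadata.hash {
            return Ok(true);
        }
        // Level 2: high cosine similarity suggests a semantic duplicate.
        let similarity = self.cosine_similarity(&mem.embedding, &candidate.embedding);
        if similarity >= 0.85 {
            // Level 3: let the LLM confirm suspected duplicates.
            if self.confirm_duplicate_with_llm(&mem.content, &candidate.content).await? {
                return Ok(true);
            }
        }
    }
    Ok(false)
}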

4.4 Importance Scoring

The system uses a hybrid approach to evaluate memory importance:

4.4.1 Rule-Based Scoring

fn evaluate_by_content_length(&self, content: &str) -> f32 {
    let len = content.len();
    if len < 20 { 0.2 }
    else if len < 50 { 0.4 }
    else if len < 200 { 0.6 }
    else { 0.8 }
}

4.4.2 Keyword Scoring

fn evaluate_by_keywords(&self, content: &str) -> f32 {
    let important_keywords = vec![
        "like", "love", "prefer", "name", "live",
        // Chinese equivalents: like, be good at, name, live in, work
        "喜欢", "擅长", "名字", "住在", "工作"
    ];

    let matches = important_keywords.iter()
        .filter(|kw| content.contains(**kw))
        .count();

    (matches as f32 / important_keywords.len() as f32) * 0.3
}

4.4.3 LLM Scoring

For important memories, use LLM for precise scoring:

Rate this memory's importance from 0.0 to 1.0:

Content: {content}

Consider:
- Is it actionable?
- Does it contain personal facts?
- Will it be useful later?

Score:
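
A minimal sketch of how the rule-based, keyword, and LLM signals might be blended into one score; the weights, the escalation threshold, and the evaluate_with_llm helper are illustrative assumptions rather than the project's actual policy:

// Combine heuristic signals; only escalate promising memories to the LLM to keep costs down.
async fn score_importance(&self, content: &str) -> Result<f32> {
    let heuristic = (self.evaluate_by_content_length(content)
        + self.evaluate_by_keywords(content))
        .min(1.0);

    // Illustrative threshold: spend an LLM call only when heuristics look meaningful.
    if heuristic >= 0.5 {
        let llm_score = self.evaluate_with_llm(content).await?;
        Ok((heuristic * 0.4 + llm_score * 0.6).clamp(0.0, 1.0))
    } else {
        Ok(heuristic)
    }
}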

5. System Integration and Extension

5.1 Multiple Access Methods

The system provides multiple access methods for different scenarios:

| Access Method | Use Case | Features |
| --- | --- | --- |
| CLI | Development debugging, script management | Simple and direct, automatable |
| REST API | Web applications, microservices | Language-agnostic, easy integration |
| MCP | AI Agent frameworks | Standard protocol, tool calling |
| Dashboard | Operations monitoring, data analysis | Visualized, real-time monitoring |

5.2 MCP Protocol Adapter

The MCP (Model Context Protocol) adapter enables seamless integration into various AI Agent frameworks:

// MCP tool definition, expressed as the JSON schema advertised to clients
// (built here with serde_json::json! for illustration)
let query_memory_tool = serde_json::json!({
    "name": "query_memory",
    "description": "Search memories by natural language query",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query":        { "type": "string" },
            "k":            { "type": "integer", "default": 10 },
            "min_salience": { "type": "number", "minimum": 0.0, "maximum": 1.0 },
            "memory_type":  { "type": "string" },
            "topics":       { "type": "array", "items": { "type": "string" } }
        },
        "required": ["query"]
    }
});

5.3 Configuration-Driven

The system flexibly controls behavior through TOML configuration files:

[memory]
max_memories = 10000
similarity_threshold = 0.65
max_search_results = 50
auto_summary_threshold = 32768
auto_enhance = true
deduplicate = true
merge_threshold = 0.75
search_similarity_threshold = 0.50

[optimization]
deduplication_threshold = 0.85
quality_threshold = 0.4
time_decay_days = 180
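
A sketch of how such a file could be deserialized with serde and the toml crate; the struct names and field grouping below are assumptions modeled on the keys above:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct MemoryConfig {
    max_memories: usize,
    similarity_threshold: f32,
    max_search_results: usize,
    auto_summary_threshold: usize,
    auto_enhance: bool,
    deduplicate: bool,
    merge_threshold: f32,
    search_similarity_threshold: f32,
}

#[derive(Debug, Deserialize)]
struct OptimizationConfig {
    deduplication_threshold: f32,
    quality_threshold: f32,
    time_decay_days: u32,
}

#[derive(Debug, Deserialize)]
struct Config {
    memory: MemoryConfig,
    optimization: OptimizationConfig,
}

// Read and parse the TOML file into typed configuration.
fn load_config(path: &str) -> anyhow::Result<Config> {
    let raw = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&raw)?)
}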

6. Performance and Scalability

6.1 Performance Optimization Strategies

6.1.1 Asynchronous I/O

All external calls use the Tokio async runtime:

pub async fn create_memory(&self, content: String, metadata: MemoryMetadata) -> Result<Memory> {
    let embedding = self.llm_client.embed(&content).await?;
    // ...
}

6.1.2 Batch Operations

Supports batch insert, update, and delete:

pub async fn batch_insert(&self, memories: Vec<Memory>) -> Result<Vec<String>> {
    let mut results = Vec::new();
    for chunk in memories.chunks(100) {
        let ids = self.vector_store.batch_upsert(chunk).await?;
        results.extend(ids);
    }
    Ok(results)
}

6.1.3 Connection Pooling

HTTP clients use connection pooling to reduce overhead:

let client = reqwest::Client::builder()
    .pool_max_idle_per_host(10)
    .pool_idle_timeout(Duration::from_secs(30))
    .build()?;

6.2 Scalability Design

6.2.1 Horizontal Scaling

  • Stateless service design
  • Supports multi-instance deployment
  • Load balancer distributes requests

6.2.2 Data Sharding

Future support for sharding by user_id or agent_id:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Route requests to a shard based on a hash of the user id
fn get_shard(&self, user_id: &str) -> &dyn VectorStore {
    let mut hasher = DefaultHasher::new();
    user_id.hash(&mut hasher);
    &*self.shards[hasher.finish() as usize % self.num_shards]
}

6.2.3 Caching Layer

Plans to add Redis caching layer:

pub async fn get_with_cache(&self, id: &str) -> Result<Option<Memory>> {
    // Try cache first
    if let Some(cached) = self.cache.get(id).await? {
        return Ok(Some(cached));
    }

    // Fallback to database
    let memory = self.vector_store.get(id).await?;

    // Populate cache
    if let Some(ref mem) = memory {
        self.cache.set(id, mem, Duration::from_secs(3600)).await?; // 1-hour TTL
    }

    Ok(memory)
}

7. Practical Application Scenarios

7.1 Intelligent Customer Service

Value:

  • Remember user's historical questions
  • Personalized recommendations
  • Reduce repetitive inquiries

7.2 Personal Assistant

Value:

  • Learn user habits
  • Proactive reminders
  • Personalized services

7.3 Knowledge Management

Value:

  • Structure unstructured knowledge
  • Intelligent knowledge retrieval
  • Discover knowledge associations

8. Comparison with Existing Solutions

8.1 Technical Comparison

| Feature | Cortex Memory | LangChain Memory | MemGPT |
| --- | --- | --- | --- |
| Language | Rust | Python | Python |
| Vector Database | Qdrant | Multiple | Chroma |
| Semantic Search | | | |
| Automatic Deduplication | | | |
| Intelligent Update | | | |
| Optimization Engine | | | |
| Multiple Access Methods | | | |
| Evaluation Framework | | | |

8.2 Performance Comparison

Based on actual test data:

| Metric | Cortex Memory | Python Solution |
| --- | --- | --- |
| Single insert latency | ~150ms | ~300ms |
| Semantic search latency | ~50ms | ~100ms |
| Memory usage | ~50MB | ~200MB |
| Throughput | 1000 req/s | 300 req/s |

9. Challenges and Future Directions

9.1 Current Challenges

  1. LLM Dependency: The system is highly dependent on LLM service availability
  2. Cost Control: Frequent LLM calls drive up operating costs
  3. Data Consistency: Keeping memories consistent across distributed deployments
  4. Privacy Protection: Secure storage of sensitive memories

9.2 Future Directions

9.2.1 Multi-modal Memory

Support for images, audio, and video memories:

pub enum MemoryContent {
    Text(String),
    Image(Vec<u8>),
    Audio(Vec<u8>),
    Video(Vec<u8>),
    MultiModal {
        text: String,
        images: Vec<Vec<u8>>,
    },
}

9.2.2 Cross-Memory Inference

Support deriving new knowledge from multiple memories:

pub async fn infer_new_knowledge(&self, memories: Vec<Memory>, metadata: MemoryMetadata) -> Result<Memory> {
    let prompt = format!(
        "Based on these memories, what new knowledge can be inferred?\n\n{}",
        memories.iter()
            .map(|m| format!("- {}", m.content))
            .collect::<Vec<_>>()
            .join("\n")
    );

    let inferred = self.llm_client.complete(&prompt).await?;
    self.create_memory(inferred, metadata).await
}

9.2.3 Federated Learning

Support knowledge sharing across users without exposing private data:

pub async fn federated_learn(&self, patterns: Vec<Pattern>) -> Result<GlobalModel> {
    // Aggregate patterns without exposing raw data
    let aggregated = self.aggregate_patterns(patterns)?;
    self.train_global_model(aggregated).await
}

10. Summary

Cortex-Mem successfully addresses the core challenges of AI Agent memory management through careful architectural design:

  1. Modular Architecture: Clear layering and component abstraction
  2. Intelligent Processing: LLM-driven automatic extraction, classification, and optimization
  3. High Performance: Rust implementation, async I/O, batch operations
  4. Scalability: Trait abstraction, configuration-driven, multiple access methods
  5. Production Ready: Complete evaluation framework, monitoring tools, documentation

This system provides powerful memory infrastructure for AI Agents, enabling them to evolve from stateless tools to intelligent assistants with long-term memory and personalization capabilities.

