Introduction
Large Language Models (LLMs) have revolutionized how we interact with AI, but they face a critical limitation: they are stateless. Without the ability to retain information from previous interactions, LLMs struggle to maintain context over extended conversations or complex multi-step tasks. They tend to hallucinate or fall back on generic responses drawn solely from their training data, rather than leveraging specific contextual information from past interactions.
The challenge becomes even more pronounced when dealing with limited context windows. While you can provide extensive context with each request, this approach leads to:
- High token consumption and costs
- Increased processing delays
- Reduced coherence in responses
- Potential security and privacy risks, since the full conversation history is resent with every request
The Solution: Vector-Based Long-Term Memory
One of the most effective approaches to overcoming these limitations is Retrieval-Augmented Generation (RAG) backed by a vector database, rather than relying on basic working memory alone. This approach treats conversation history, knowledge, and task information as searchable vectors, enabling semantic retrieval of relevant context without overwhelming the LLM's context window.
Our Practical Approach
In this tutorial, we'll build a working long-term memory system using:
- n8n: A powerful low/no-code workflow automation platform
- OpenAI: For the LLM and embedding models (you can substitute with other providers)
- Qdrant: A high-performance vector database
- Cohere: For reranking results (optional but recommended)
Architecture Overview
The solution consists of two main components:
- Memory Retrieval System: Before responding to any query, the AI agent searches its vector database for relevant historical context
- Memory Storage System: After each interaction, the conversation and its outcomes are vectorized and stored for future reference
Let's dive into the implementation!
Implementation Guide
Step 1: Setting Up Qdrant
First, deploy your Qdrant instance using Docker Compose:
services:
  qdrant:
    image: "qdrant/qdrant:latest"
    environment:
      - SERVICE_FQDN_QDRANT_6333
      - "QDRANT__SERVICE__API_KEY=${SERVICE_PASSWORD_QDRANTAPIKEY}"
    volumes:
      - "qdrant-storage:/qdrant/storage"
    ports:
      - "6333:6333"
      - "6334:6334"
    expose:
      - "6333"
      - "6334"
    healthcheck:
      test:
        - CMD-SHELL
        - "bash -c ':> /dev/tcp/127.0.0.1/6333' || exit 1"
      interval: 5s
      timeout: 5s
      retries: 3

volumes:
  qdrant-storage:
Alternatively, use Qdrant Cloud for a managed solution.
Create a collection named ltm with the following settings (a short Python sketch follows this list):
- Vector size: 1024 dimensions
- Distance metric: Cosine similarity
- Embedding model: OpenAI's text-embedding-3-small (its default output is 1,536 dimensions, so pass dimensions: 1024 when embedding, or adjust the collection size to match)
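If you prefer to script this step rather than click through the Qdrant dashboard, a minimal sketch with the qdrant-client Python package might look like the following (the localhost URL and the QDRANT_API_KEY environment variable are assumptions; adjust them to your deployment):

# Minimal sketch: create the "ltm" collection used by the workflow.
import os

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333", api_key=os.environ["QDRANT_API_KEY"])

# 1024-dimensional vectors compared with cosine similarity, matching the settings above.
client.create_collection(
    collection_name="ltm",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)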
Step 2: Building the n8n Workflow
The workflow consists of several key nodes:
1. Chat Trigger Node
This initiates the conversation flow when a message is received.
2. AI Agent with RAG_MEMORY Tool
The core of our system - an AI agent configured to use vector retrieval instead of traditional working memory.
// RAG_MEMORY tool configuration
{
"mode": "retrieve-as-tool",
"toolName": "RAG_MEMORY",
"toolDescription": "Agent's long term memory as RAG",
"qdrantCollection": "ltm",
"topK": 20,
"useReranker": true
}
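Under the hood, retrieve-as-tool boils down to: embed the agent's search phrase, run a similarity search against the ltm collection, and hand the top hits back to the agent. A hedged Python sketch of that flow is shown below; the payload key, connection details, and the 1024-dimension embedding call are assumptions, since n8n wires all of this up internally:

# Rough equivalent of one RAG_MEMORY call: embed the query, fetch the top-K memories.
import os

from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
qdrant = QdrantClient(url="http://localhost:6333", api_key=os.environ["QDRANT_API_KEY"])

def retrieve_memories(query: str, top_k: int = 20) -> list[str]:
    # Embed with the same model and dimension used at storage time.
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
        dimensions=1024,
    ).data[0].embedding
    hits = qdrant.search(collection_name="ltm", query_vector=embedding, limit=top_k)
    # "text" is the assumed payload field holding the stored chunk.
    return [(hit.payload or {}).get("text", "") for hit in hits]

print(retrieve_memories("What did the user say about their project deadline?"))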
3. Structured Output Parser
Formats the agent's response into a structured JSON format:
{
"sessionId": "unique-session-id",
"chatInput": "user's message",
"output": "agent's response"
}
4. Vector Storage Node
After each interaction, this node stores the conversation in Qdrant using:
- Text splitter: Recursive character splitter (chunk size: 200, overlap: 40)
- Embedding: the same model as retrieval, for consistency (see the sketch just below)
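For reference, the same chunk-embed-store path outside n8n could look roughly like this; the LangChain splitter and the payload layout are assumptions chosen to mirror the node settings above:

# Sketch of the storage path: split the exchange into overlapping chunks,
# embed each chunk, and upsert it into "ltm" with session metadata.
import os
import uuid

from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333", api_key=os.environ["QDRANT_API_KEY"])
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=40)

def store_interaction(session_id: str, chat_input: str, output: str) -> None:
    text = f"User: {chat_input}\nAssistant: {output}"
    chunks = splitter.split_text(text)
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-small", input=chunks, dimensions=1024
    ).data
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=item.embedding,
            payload={"text": chunk, "sessionId": session_id},
        )
        for chunk, item in zip(chunks, embeddings)
    ]
    qdrant.upsert(collection_name="ltm", points=points)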
5. Response Formatting
Final node to format the output for the chat interface.
Step 3: AI Agent System Prompt (Optional but Recommended)
To maximize the effectiveness of your long-term memory system, configure your AI agent with this specialized prompt:
# AI Agent with RAG_MEMORY System
You are an AI assistant that uses RAG_MEMORY retrieval instead of working memory to maintain context between interactions.
## Core Protocol
**Before every response:**
1. Query RAG_MEMORY for relevant context
2. Analyze retrieved information
3. Base your response on this context
## Key Principles
- **Never** store information in session memory
- **Always** retrieve context via RAG_MEMORY
- Be transparent about context retrieval
- Maintain consistency with retrieved information
## Query Strategy
- Use specific keywords related to the current topic
- Combine multiple searches when needed
- Prioritize relevance over quantity
## Special Cases
- **First interaction**: Search for any relevant user-provided terms
- **Topic changes**: Run new searches for the new topic
- **No results found**: Proceed normally and store new information
## Goal
Simulate persistent memory through intelligent retrieval, providing continuity across all interactions.
Step 4: Testing Your Implementation
- Initial Conversation: Start with a simple introduction and some facts
- Context Test: In a new session, ask about previously mentioned information
- Complex Queries: Test multi-step reasoning that requires historical context
- Performance Check: Monitor token usage and response times
How It Works
Memory Storage Process
- User Input → AI processes the query
- AI Response → Generated based on retrieved context
- Vectorization → Conversation is chunked and embedded
- Storage → Vectors stored in Qdrant with metadata
Memory Retrieval Process
- New Query → Triggers RAG_MEMORY tool
- Semantic Search → Finds relevant historical context
- Reranking → Cohere re-scores the candidates so the most relevant results come first (see the sketch after this list)
- Context Integration → AI uses retrieved information in response
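The reranking step can be approximated with the Cohere Python SDK as in the sketch below; the model name is an assumption, so use whichever rerank model your account provides:

# Sketch of the reranking step: Cohere re-scores the Qdrant hits against the
# query so only the most relevant memories reach the LLM.
import os

import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def rerank_memories(query: str, memories: list[str], top_n: int = 5) -> list[str]:
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=memories,
        top_n=top_n,
    )
    # Each result carries the index of the original document, ordered by relevance.
    return [memories[result.index] for result in response.results]

Reranking adds a little latency, but it keeps the context passed to the LLM focused, which is why it is optional but recommended.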
Benefits
1. True Long-Term Memory
Unlike traditional approaches limited by context windows, this system provides virtually unlimited memory capacity. Your AI can remember conversations from weeks or months ago.
2. Cost Efficiency
By retrieving only relevant context instead of including entire conversation histories, you dramatically reduce token consumption - typically by 60-80% in extended interactions.
3. Improved Accuracy
The AI makes decisions based on actual historical data rather than relying solely on training data, leading to more accurate and personalized responses.
4. Scalability
Vector databases like Qdrant are designed for high-performance similarity search, making this solution viable even with millions of stored interactions.
5. Flexibility
The modular n8n approach allows easy customization - swap LLM providers, adjust chunk sizes, or add additional processing steps without rebuilding the entire system.
6. Context Coherence
Maintains conversational continuity across sessions, making the AI feel more like a persistent assistant rather than a stateless chatbot.
Limitations and Considerations
1. Multi-User Scalability
The current implementation doesn't distinguish between different users' memories. For production multi-user systems, you'll need to implement user-specific collections or metadata filtering.
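One way to add that isolation without separate collections is a payload filter on every search, sketched below; the userId field is an assumption and would also need to be written into each point's payload at storage time:

# Sketch of per-user isolation via metadata filtering: constrain every search
# to the memories belonging to one user.
import os

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

qdrant = QdrantClient(url="http://localhost:6333", api_key=os.environ["QDRANT_API_KEY"])

def retrieve_for_user(user_id: str, query_vector: list[float], top_k: int = 20):
    return qdrant.search(
        collection_name="ltm",
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="userId", match=MatchValue(value=user_id))]
        ),
        limit=top_k,
    )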
2. Memory Management
Without an active forgetting mechanism, the vector database will grow indefinitely. Consider implementing:
- Time-based expiration for old memories (a pruning sketch follows this list)
- Relevance scoring to prune less important information
- Storage quotas per user or session
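As a starting point for the first of these, a time-based pruning job could look roughly like the sketch below (the timestamp payload field and the 90-day retention window are assumptions):

# Sketch of time-based forgetting: store a Unix "timestamp" in each payload,
# then periodically delete points older than a retention window.
import os
import time

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, FilterSelector, Range

qdrant = QdrantClient(url="http://localhost:6333", api_key=os.environ["QDRANT_API_KEY"])

RETENTION_SECONDS = 90 * 24 * 60 * 60  # keep roughly three months of memories

def prune_old_memories() -> None:
    cutoff = time.time() - RETENTION_SECONDS
    qdrant.delete(
        collection_name="ltm",
        points_selector=FilterSelector(
            filter=Filter(must=[FieldCondition(key="timestamp", range=Range(lt=cutoff))])
        ),
    )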
3. Retrieval Quality
The system's effectiveness depends heavily on:
- Embedding model quality
- Chunk size optimization
- Reranking accuracy
Poor choices here can lead to retrieving irrelevant or misleading context.
4. Initial Setup Complexity
While n8n simplifies the implementation, initial setup still requires:
- Docker knowledge for Qdrant deployment
- API key management for multiple services
- Understanding of vector database concepts
5. Latency Considerations
Each query now involves:
- Vector database search
- Reranking process
- Additional API calls
This can add 200-500ms to response times depending on your infrastructure.
Conclusion
Implementing long-term memory for LLMs using vector stores represents a significant leap forward in creating more intelligent and context-aware AI assistants. This n8n-based solution provides a practical, production-ready approach that can be deployed quickly while remaining flexible enough for customization.
The combination of semantic search, structured storage, and intelligent retrieval creates an AI system that truly "remembers" - making it ideal for customer support, personal assistants, educational tools, and any application requiring persistent context.
As vector database technology and embedding models continue to improve, these systems will become even more powerful. Start experimenting today, and give your AI the gift of memory!
Have you implemented long-term memory for your LLMs? What challenges did you face? Share your experiences in the comments below!