Pash10g for MongoDB

Building Persistent Memory for Voice AI Agents with MongoDB

Voice agents have a dirty secret: they’re goldfish. End a conversation, come back tomorrow, and they have no idea who you are. For chat apps, you can scroll up and see history. For voice? It’s gone.

We spent the last few weeks fixing this problem for our voice agent platform, and I want to share what we learned by building it with MongoDB Atlas and Gemini Live.

Demo code: Check out the complete working demo at GitHub: voice-memory-demo.


The problem with voice agent memory

When you’re building a voice agent, there are three hard problems:

  1. Session isolation: Each browser/device needs its own memory space.
  2. Multi-user deployments: An agent deployed on a company website has thousands of users.
  3. Semantic retrieval: Users don’t say “What’s my name?” the same way twice.

The naïve approach is stuffing everything into conversation history. But voice sessions are expensive (OpenAI charges per audio minute), and you can’t keep 50 previous conversations in context.

You need actual persistent memory.


Memory as a tool

Here’s the thing that took us a while to figure out: don’t inject memory into the prompt. Expose it as a callable tool and let the AI decide when to use it.

We defined an agentMemory tool with four operations:

  • get
  • set
  • delete
  • query

const agentMemorySchema = {
  name: 'agentMemory',
  description: 'Store and retrieve memories about the user and conversation',
  parameters: {
    type: 'object',
    properties: {
      operation: {
        type: 'string',
        enum: ['get', 'set', 'delete', 'query'],
        description: 'The operation to perform',
      },
      key: {
        type: 'string',
        description: 'Memory key (for get/set/delete)',
      },
      value: {
        type: 'string',
        description: 'Value to store (for set)',
      },
      query: {
        type: 'string',
        description: 'Natural language query (for query operation)',
      },
    },
    required: ['operation'],
  },
};

Now, when a user says:

“I’m Pavel, I live in Tel Aviv”

the AI decides to call:

agentMemory.set({ key: 'user_name', value: 'Pavel' });
agentMemory.set({ key: 'user_location', value: 'Tel Aviv' });

Next session, the AI can call:

agentMemory.query({ query: 'user location' });

to retrieve relevant context.

The AI is making the decisions about what’s worth remembering — rather than us hardcoding rules.


User isolation: the cookie problem

When you deploy an agent on a company’s website, hundreds of users interact with it. User A shouldn’t see User B’s memories. We needed isolation.

Our approach: generate a UUID per browser/device, and store it in localStorage scoped by deployment:

import { useState } from 'react';

export const useUserCookie = (deploymentId: string) => {
  const [userCookie] = useState<string>(() => {
    if (typeof window !== 'undefined') {
      const existing = localStorage.getItem(`userCookie_${deploymentId}`);
      if (existing) return existing;

      const newCookie = crypto.randomUUID();
      localStorage.setItem(`userCookie_${deploymentId}`, newCookie);
      return newCookie;
    }

    return crypto.randomUUID();
  });

  return { userCookie };
};

Every API call includes this cookie. Memory operations filter by it.

Simple, but there’s a catch — we’ll get to that.


Global vs private memory

Not all memories should be user-scoped.

  • If someone asks “What are your business hours?” and the agent looks it up, that answer should be available to everyone.
  • If they share their email, that’s private.

We use Gemini to classify on write operations:

async function classifyMemory(
  key: string,
  value: string,
): Promise<{
  isGlobal: boolean;
  processedValue: string;
  reasoning: string;
}> {
  const prompt = `Analyze this memory and classify it:
Key: ${key}
Value: ${value}

Categories that should be GLOBAL (shared across all users):
- Product information, pricing, features
- Company info, business hours, policies
- Factual data about services

Categories that should be PRIVATE (user-specific):
- User names, contact info, preferences
- Personal details, account info
- Session-specific data

If GLOBAL and contains emails/phones, obfuscate them.

Return: { isGlobal: boolean, processedValue: string, reasoning: string }`;

  // ... Gemini call

  // return { isGlobal, processedValue, reasoning };
}
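
The Gemini call itself is elided above. A minimal way to fill it in, assuming the same @google/genai SDK used later for the voice session, a JSON response mode, and an example model name (swap in whatever text model you actually run):

// Inside classifyMemory, with the client created once at module level:
// import { GoogleGenAI } from '@google/genai';
// const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: 'gemini-2.5-flash', // example model name
  contents: prompt,
  config: { responseMimeType: 'application/json' },
});

try {
  const parsed = JSON.parse(response.text ?? '{}');
  return {
    isGlobal: Boolean(parsed.isGlobal),
    processedValue: parsed.processedValue ?? value,
    reasoning: parsed.reasoning ?? '',
  };
} catch {
  // If parsing fails, default to private so nothing leaks across users.
  return { isGlobal: false, processedValue: value, reasoning: 'classification failed' };
}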

The obfuscation part matters. If a user mentions a support email and we store it globally, we replace it with [EMAIL]. Phone numbers become [PHONE].

Global memories are useful context, but we don’t want to leak PII across users.
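
The demo relies on the model to do this obfuscation, but a deterministic pass as a backstop is cheap. A rough sketch (these regexes are illustrative, not exhaustive PII detection):

// Rough patterns; real PII detection needs more than two regexes.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE_RE = /\+?\d[\d\s().-]{7,}\d/g;

export function obfuscatePII(text: string): string {
  return text.replace(EMAIL_RE, '[EMAIL]').replace(PHONE_RE, '[PHONE]');
}

// Applied to anything classified as global before it is written:
// const safeValue = isGlobal ? obfuscatePII(processedValue) : processedValue;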


Voyage AI embeddings

For semantic search, we went with Voyage AI’s voyage-3.5-lite model.

Why not OpenAI embeddings? Cost. We’re generating embeddings on every write operation, and Voyage AI is significantly cheaper at scale.

const VOYAGE_API_URL = 'https://api.voyageai.com/v1/embeddings';
const VOYAGE_MODEL = 'voyage-3.5-lite'; // 1024 dimensions

export async function generateEmbedding(text: string): Promise<number[]> {
  const response = await fetch(VOYAGE_API_URL, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.VOYAGE_AI_API_KEY}`,
    },
    body: JSON.stringify({
      input: text,
      model: VOYAGE_MODEL,
    }),
  });

  const data = await response.json();
  return data.data[0].embedding; // 1024-dim vector
}

Embeddings get stored alongside the memory document in MongoDB.


MongoDB schema

Here’s what a memory document looks like:

{
  deploymentId: 'deployment_abc123',
  key: 'user_preference_communication',
  value: 'Prefers email over phone calls',
  userCookie: 'uuid-xxx-yyy-zzz', // or "global" for shared memories
  isGlobal: false,
  embedding: [0.123, -0.456, /* ... 1024 floats ... */],
  createdAt: ISODate('2024-12-01T10:00:00Z'),
  updatedAt: ISODate('2024-12-01T10:00:00Z'),
}

You need two indexes:

  1. Vector search index on the embedding field (for semantic queries).
  2. Text search index on key and value fields (for keyword matching).
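
For reference, here is roughly what those two index definitions look like, created with the Node.js driver's createSearchIndexes helper (the Atlas UI works just as well; the collection name "memories" is a placeholder):

// db is your connected Db instance from the MongoDB Node.js driver
await db.collection('memories').createSearchIndexes([
  {
    name: 'memory_vector_index',
    type: 'vectorSearch',
    definition: {
      fields: [
        { type: 'vector', path: 'embedding', numDimensions: 1024, similarity: 'cosine' },
        // Fields used in the $vectorSearch pre-filter must be indexed as filter fields.
        { type: 'filter', path: 'deploymentId' },
        { type: 'filter', path: 'userCookie' },
        { type: 'filter', path: 'isGlobal' },
      ],
    },
  },
  {
    name: 'memory_text_index',
    type: 'search',
    definition: {
      mappings: {
        dynamic: false,
        fields: {
          key: { type: 'string' },
          value: { type: 'string' },
        },
      },
    },
  },
]);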

Hybrid search with $rankFusion

Pure vector search misses obvious keyword matches. Pure text search misses semantic similarity.

MongoDB 8.0 introduced $rankFusion, which combines both.

const searchPipeline = [
  {
    $rankFusion: {
      input: {
        pipelines: {
          vectorSearch: [
            {
              $vectorSearch: {
                index: 'memory_vector_index',
                path: 'embedding',
                queryVector: queryEmbedding,
                numCandidates: 100,
                limit: 30,
                filter: {
                  deploymentId,
                  $or: [
                    { userCookie: effectiveUserCookie },
                    { isGlobal: true },
                  ],
                },
              },
            },
          ],
          textSearch: [
            {
              $search: {
                index: 'memory_text_index',
                compound: {
                  should: [
                    {
                      text: {
                        query: searchQuery,
                        path: 'key',
                        fuzzy: {},
                      },
                    },
                    {
                      text: {
                        query: searchQuery,
                        path: 'value',
                        fuzzy: {},
                      },
                    },
                  ],
                },
              },
            },
            {
              $match: {
                $or: [
                  { userCookie: effectiveUserCookie },
                  { isGlobal: true },
                ],
              },
            },
            { $limit: 30 },
          ],
        },
      },
      combination: {
        weights: { vectorSearch: 0.7, textSearch: 0.3 },
      },
    },
  },
];

We settled on 70/30 weighting after testing:

  • Vector search (0.7) handles “tell me about my communication preferences” matching user_preference_communication.
  • Text search (0.3) handles “What’s my name?” matching a key literally called user_name.

Voice implementation: 2 different beasts

We support both OpenAI and Gemini for voice. They’re completely different architectures. The demo uses Gemini Live, but here’s how both work.

OpenAI Realtime (WebRTC)

OpenAI’s approach uses WebRTC. Audio goes directly to their servers, and they handle speech-to-text and text-to-speech internally.

import { OpenAIRealtimeWebRTC } from '@openai/agents/realtime';

const transport = new OpenAIRealtimeWebRTC({
  mediaStream,  // from getUserMedia()
  audioElement, // for playback
});

await transport.connect({
  apiKey: clientSecret, // from /api/session
  initialSessionConfig: {
    modalities: ['audio'],
    audio: {
      input: {
        transcription: { model: 'gpt-4o-mini-transcribe' },
        turnDetection: { type: 'server_vad' },
      },
    },
  },
});

// Tool calls arrive as events
transport.on('response.function_call_arguments.done', async (msg) => {
  const { name, arguments: args, call_id } = msg;

  // Execute tool, then:
  const result = await handleToolCall(name, args);
  transport.sendFunctionCallOutput({
    call_id,
    output: JSON.stringify(result),
  });
});

The nice part: audio codec handling is invisible.

The annoying part: tool calls come as events you need to catch and route yourself.

Gemini Live (WebSocket)

Gemini uses raw WebSockets. You’re responsible for audio format conversion. Here’s how we set it up in the demo:

import { GoogleGenAI, Modality } from '@google/genai';

const ai = new GoogleGenAI({ apiKey });

const session = await ai.live.connect({
  model: 'models/gemini-2.5-flash-native-audio-preview-12-2025',
  config: {
    responseModalities: [Modality.AUDIO],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: 'Puck' },
      },
    },
    systemInstruction: { parts: [{ text: systemPrompt }] },
    tools: [{ functionDeclarations: tools }],
  },
  callbacks: {
    onmessage: async (message) => {
      // Handle audio data
      if (message.data) {
        const buffer =
          typeof message.data === 'string'
            ? Buffer.from(message.data, 'base64')
            : message.data;

        const int16Array = new Int16Array(
          buffer.buffer,
          buffer.byteOffset,
          buffer.byteLength / 2,
        );

        // Queue for playback at 24kHz
        audioQueue.push(int16Array);
        processAudioQueue();
      }

      // Handle tool calls
      if (message.toolCall?.functionCalls) {
        for (const call of message.toolCall.functionCalls) {
          const result = await onToolCall(call.name, call.args || {});
          await session.sendToolResponse({
            functionResponses: [
              {
                id: call.id,
                name: call.name,
                response: { result: JSON.stringify(result) },
              },
            ],
          });
        }
      }
    },
  },
});
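
The audioQueue and processAudioQueue helpers referenced above are left out of the snippet. A minimal sketch of one way to play the queued 24 kHz PCM chunks with the Web Audio API (sequential scheduling, no drift correction):

const playbackCtx = new AudioContext({ sampleRate: 24000 });
const audioQueue: Int16Array[] = [];
let nextStartTime = 0;

function processAudioQueue() {
  while (audioQueue.length > 0) {
    const chunk = audioQueue.shift()!;

    // Convert Int16 PCM back to Float32 for the Web Audio API.
    const floatData = new Float32Array(chunk.length);
    for (let i = 0; i < chunk.length; i++) {
      floatData[i] = chunk[i] / 32768;
    }

    const buffer = playbackCtx.createBuffer(1, floatData.length, 24000);
    buffer.copyToChannel(floatData, 0);

    const source = playbackCtx.createBufferSource();
    source.buffer = buffer;
    source.connect(playbackCtx.destination);

    // Schedule chunks back to back so playback stays gapless.
    nextStartTime = Math.max(nextStartTime, playbackCtx.currentTime);
    source.start(nextStartTime);
    nextStartTime += buffer.duration;
  }
}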

Gemini expects 16 kHz PCM input and outputs 24 kHz PCM. You need an AudioWorklet for capture:

// audio-processor.js (AudioWorklet)
class AudioCaptureProcessor extends AudioWorkletProcessor {
  process(inputs, outputs, parameters) {
    const input = inputs[0];

    if (input && input[0]) {
      // Convert Float32 to Int16 for sending
      const int16Array = new Int16Array(input[0].length);

      for (let i = 0; i < input[0].length; i++) {
        int16Array[i] = Math.max(
          -32768,
          Math.min(32767, input[0][i] * 32768),
        );
      }

      this.port.postMessage({ type: 'audio', data: int16Array.buffer });
    }

    return true;
  }
}

registerProcessor('audio-capture-processor', AudioCaptureProcessor);
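
And the glue that loads the worklet, pipes the microphone into it, and forwards each chunk to the live session. The sendRealtimeInput shape is an assumption about the @google/genai live API; check it against the SDK version you're on:

const captureCtx = new AudioContext({ sampleRate: 16000 });
await captureCtx.audioWorklet.addModule('/audio-processor.js');

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const source = captureCtx.createMediaStreamSource(stream);
const captureNode = new AudioWorkletNode(captureCtx, 'audio-capture-processor');
source.connect(captureNode);

captureNode.port.onmessage = (event) => {
  if (event.data.type !== 'audio') return;

  // Base64-encode the Int16 PCM chunk for the WebSocket.
  const bytes = new Uint8Array(event.data.data);
  let binary = '';
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);

  // Assumed SDK call: send 16 kHz PCM as base64 media.
  session.sendRealtimeInput({
    media: { data: btoa(binary), mimeType: 'audio/pcm;rate=16000' },
  });
};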

More work than WebRTC, but more control over the audio pipeline.


Routing tool calls

When the AI calls a tool during voice, you need to route it to your backend:

import { useCallback } from 'react';

const handleToolCall = useCallback(
  async (name: string, args: Record<string, unknown>) => {
    if (name === 'agentMemory') {
      const response = await fetch('/api/memory', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          ...args,
          deploymentId: DEPLOYMENT_ID,
          userCookie,
        }),
      });

      const data = await response.json();
      return data.result;
    }

    return { error: 'Unknown tool' };
  },
  [userCookie],
);

The memory API handles all four operations (get, set, delete, query) and routes them to the appropriate service functions.
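
The route itself isn't shown here, but the dispatch is straightforward. A sketch as a Next.js route handler, where getMemory, setMemory, deleteMemory, and queryMemories stand in for your own service functions:

// app/api/memory/route.ts (hypothetical layout)
import { NextResponse } from 'next/server';
import { getMemory, setMemory, deleteMemory, queryMemories } from '@/lib/memory';

export async function POST(req: Request) {
  const { operation, key, value, query, deploymentId, userCookie } = await req.json();

  switch (operation) {
    case 'get':
      return NextResponse.json({ result: await getMemory(deploymentId, userCookie, key) });
    case 'set':
      return NextResponse.json({ result: await setMemory(deploymentId, userCookie, key, value) });
    case 'delete':
      return NextResponse.json({ result: await deleteMemory(deploymentId, userCookie, key) });
    case 'query':
      return NextResponse.json({ result: await queryMemories(deploymentId, userCookie, query) });
    default:
      return NextResponse.json({ error: `Unknown operation: ${operation}` }, { status: 400 });
  }
}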


Hybrid search with $rankFusion: the technical deep dive

Pure vector search, while excellent for capturing semantic similarity (finding memories conceptually related to the query), often fails at lexical matching (finding exact keywords). Conversely, a pure text index search excels at keyword matching but struggles when a user’s query uses different phrasing to refer to the same concept (for example, “my address” vs. “where I live”).

MongoDB 8.0’s $rankFusion aggregation stage is the technical solution that combines the best of both worlds. It works by running two distinct search pipelines — vectorSearch and textSearch — and then applying the Reciprocal Rank Fusion (RRF) algorithm to merge the results.

How $rankFusion works

  1. Independent search pipelines
  • The vectorSearch pipeline uses the memory_vector_index to find the top numCandidates (100) results based on the cosine similarity between the user’s query embedding (queryVector) and the stored memory embeddings. It then limits this to 30 and applies a filter for deploymentId and user isolation (userCookie or isGlobal: true).

  • The textSearch pipeline uses the memory_text_index to find the top 30 lexical matches across the key and value fields, leveraging fuzzy matching for robustness. It applies the same user isolation $match filter.

  2. Reciprocal Rank Fusion (RRF)

This is the core of the hybrid search. For each document returned by either pipeline, RRF assigns a score based on its rank in each result list where it appears:

\[
\text{Score}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}
\]

where:

  • R is the set of search results (vector and text),
  • k is a constant (typically 60),
  • rank_r(d) is the document’s rank in the result set r.

This process effectively rewards documents that appear high in either list, or moderately high in both.
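
As a concrete example (ignoring the weights for a moment), a memory ranked 1st by vector search and 4th by text search, with k = 60, scores

\[
\frac{1}{60 + 1} + \frac{1}{60 + 4} \approx 0.0164 + 0.0156 \approx 0.032,
\]

while a memory that appears in only one list at rank 1 scores about 0.0164, so showing up in both lists beats topping either one alone.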

  3. Weighted combination

The combination object allows for fine‑tuning the influence of each search type before the final score is calculated. The 70/30 weighting:

  • vectorSearch: 0.7 — prioritizes semantic relevance, which is crucial for a natural language voice agent. A higher weight here ensures the agent’s intent (for example, “tell me about my communication preferences”) strongly biases the results toward conceptually similar memories.
  • textSearch: 0.3 — keeps a smaller but significant weight for keyword accuracy. This handles unambiguous, direct queries like “What’s my name?” where a literal match to a key like user_name is paramount, preventing a purely semantic search from hallucinating or prioritizing a less relevant but semantically rich document.

This hybrid approach ensures the voice agent is robust: it handles the nuances of natural language via vector search while retaining the precision required for direct, fact‑based queries via text search.


Performance notes

A few things we learned:

  • Embedding on write is fine. Voyage AI calls take ~100 ms. Users expect a slight delay when the AI is “remembering” something.
  • Query embedding latency matters more. We cache query embeddings for repeated searches in the same session (a minimal sketch follows this list).
  • MongoDB Atlas handles the vector math. Don’t try to do similarity calculations client‑side. The $vectorSearch aggregation stage is optimized for this.
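
A per-session cache can be as simple as a Map in front of generateEmbedding. A minimal sketch (no eviction, which is fine for short-lived voice sessions):

const embeddingCache = new Map<string, number[]>();

export async function getQueryEmbedding(text: string): Promise<number[]> {
  const key = text.trim().toLowerCase();

  const cached = embeddingCache.get(key);
  if (cached) return cached;

  // Cache miss: fall through to the Voyage AI call from earlier.
  const embedding = await generateEmbedding(key);
  embeddingCache.set(key, embedding);
  return embedding;
}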

What’s next?

We’re still tuning the 70/30 hybrid search weights based on production data. We’re also looking at cross-deployment memory sharing for agents that need to access a global knowledge base while maintaining user isolation within that deployment.

The memory-as-tool pattern has been the biggest win. Instead of complex prompt engineering to inject context, we let the AI decide what’s relevant. It forgets less and hallucinates less because it’s pulling from actual stored data rather than trying to infer from truncated conversation history.
