DEV Community: Oracle Developers

How I Taught an AI to Sound Like Me: Agent Memory with Oracle AI Database

Anya Summers — Thu, 23 Jul 2026 16:21:34 +0000

A step-by-step tutorial for building three layers of agent memory in Oracle AI Database to help an AI agent learn to write social media posts in your voice.

Companion notebook: https://github.com/oracle-devrel/oracle-ai-developer-hub/tree/main/apps/oracle-agent-memory

Key takeaways

The problem is not that AI writes badly. It’s that AI writes from zero. A stateless model has no memory of your older posts, your cadence, your weird little phrases, or what you never say. So it defaults to the internet-average voice, which is why so much AI-written social content feels the same.
Good writing help needs three kinds of memory. Episodic memory gives the agent examples of what you’ve written before. Semantic memory gives it a structured style profile. Reflective memory lets that profile evolve as your writing changes.
Oracle AI Database keeps the memory stack simple. Posts, vectors, JSON style profiles, and reflection logs all live in the same database. That means the agent can retrieve similar posts, load your voice profile, and update its understanding without stitching together a pile of separate services.
The reflection loop is what makes it feel like learning. Every few new posts, the agent compares your current style profile against your latest writing, creates a conservative diff, and updates the profile without overreacting to one weird week of posts.
The actual agent is deceptively small. The final generatePost function only needs a style profile, a few similar examples, and one LLM call. The hard part is not the prompt. The hard part is giving the prompt the right memory.

The last time you went on social media, did it feel... stale? Every post you scroll past reads the same with slightly different words. A generic opener starting with "Most X think Y", three points and a call to action, the same six emojis. Yeah, those were created by AI.

I'm not here to say that all AI-generated content is bad, but we're definitely seeing a lack of originality these days. Which is a shame, because using generative AI as a tool in the creative process is incredible. But copy/pasting the output of a "write me a LinkedIn post about security issues with AI agents" is the wrong way to go about it.

AI models are stateless. Every time you ask one to write a post for you, it starts from zero. It has no idea what you've written before, what worked, what fell flat, or how you sound when you're not trying. So it falls back on the average... which is exactly what you're seeing in your feed these days.

But it doesn't have to be this way. You can still use an AI agent to help you with social posts AND to sound like your natural voice. You just have to give it some memory.

This post walks through how to build an AI agent with three layers of memory backed by Oracle AI Database 26ai, with a reflection loop that updates the agent's understanding of your voice over time. The stack is TypeScript end-to-end: Node.js backend, React + Vite frontend, the official oracledb driver. If you prefer Python, langchain-oracledb is the direct equivalent.

To learn a little bit more about agent memory, check out this blog by Casius Lee.

What we're building

Our agent has one job: given a topic and a platform (LinkedIn, X, whatever), draft a post that sounds like me. Easy enough as a one-shot LLM call. But the cool part is how we make it better over time without changing the prompt.

To do that, the agent needs three different kinds of memory:

Layer	What it stores	How it's used
Episodic memory	Every post I've written, embedded as a vector	Retrieve the K most similar past posts as examples
Semantic memory	A structured JSON object describing my voice traits	Inject into the system prompt as explicit guidance
Reflective memory	Observations about how my writing style is evolving over time	Periodically refine the semantic memory

All three live in Oracle AI Database, which makes this easier than it looks. Vector search, JSON, and relational rows all live in the same database with the same query engine. So in a single database, we can store everything we need to make this work.

Here's the loop:

Published posts build episodic memory, while periodic reflection updates the style profile used for future content generation.

Setup

If you don't already have a database, the repo includes a Terraform stack that provisions an Always Free Autonomous AI Database 26ai and writes a populated .env. Just clone the repository and run these three commands:

cd terraform
terraform init && terraform apply
terraform output -raw env_file > ../.env

Always Free covers the cost of the database forever. OCI Generative AI isn't on the always-free tier, but new accounts get $300 in trial credits, and the per-call cost for what we're about to build is pennies. If you already have an Oracle 26ai instance, skip the Terraform and fill in .env by hand.

Then install the dependencies:

npm install

Once dependencies are installed, we're ready to start building! But before we do that, we should talk about the three tables in our schema, each representing one of the layers of our agentic memory.

-- Episodic memory: every post you've ever written
CREATE TABLE posts (
    id            VARCHAR2(36) PRIMARY KEY,
    user_id       VARCHAR2(64) NOT NULL,
    platform      VARCHAR2(32) NOT NULL,
    topic         VARCHAR2(256),
    content       CLOB NOT NULL,
    embedding     VECTOR(1024, FLOAT32),
    created_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    is_deleted    NUMBER(1) DEFAULT 0
);

CREATE VECTOR INDEX posts_hnsw_idx ON posts (embedding)
    ORGANIZATION INMEMORY NEIGHBOR GRAPH
    DISTANCE COSINE
    PARAMETERS (TYPE HNSW, NEIGHBORS 32, EFCONSTRUCTION 200);


-- Semantic memory: the style profile per user
CREATE TABLE style_profile (
    user_id       VARCHAR2(64) PRIMARY KEY,
    profile       JSON NOT NULL,
    updated_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    version       NUMBER(10) DEFAULT 1
);

-- Reflective memory: what changed and when
CREATE TABLE reflections (
    id            VARCHAR2(36) PRIMARY KEY,
    user_id       VARCHAR2(64) NOT NULL,
    triggered_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    posts_window  JSON NOT NULL,
    diff          JSON NOT NULL,
    profile_after JSON NOT NULL
);

It's important to note that VECTOR(1024, FLOAT32) matches OCI's cohere.embed-english-v3.0 model. If you swap embedding functions, make sure you update the dimension to match. And as a bonus, JSON is a first-class type in Oracle 26ai with indexable paths, so the style profile doesn't need to be re-parsed on every read.
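To make the dimension rule concrete, here is a minimal sketch of a guard you could run before binding an embedding. The toVector helper and the EMBEDDING_DIM constant are hypothetical names of mine, not something from the repo; the only fact taken from the schema is the 1024 in VECTOR(1024, FLOAT32).

```typescript
// Hypothetical helper (not part of the article's repo): fail fast when an
// embedding's length does not match the VECTOR column's declared dimension.
const EMBEDDING_DIM = 1024; // mirrors VECTOR(1024, FLOAT32) on posts.embedding

function toVector(embedding: number[], dim: number = EMBEDDING_DIM): Float32Array {
  if (embedding.length !== dim) {
    throw new Error(`Expected ${dim} dimensions, got ${embedding.length}`);
  }
  // FLOAT32 column, so convert to a Float32Array before binding.
  return new Float32Array(embedding);
}
```

If you swap the embedding model, changing the one constant (and the column definition) keeps a mismatch from surfacing later as a database error.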

Before we get to the memory layers themselves, let's set up a thin wrapper around the OCI SDK for consistency and simplicity. Two functions: embed(), responsible for creating text embeddings from our social posts, and chat(), for communicating with the model.

// src/server/llm.ts
import * as common from 'oci-common';
import { GenerativeAiInferenceClient } from 'oci-generativeaiinference';
const provider = new common.ConfigFileAuthenticationDetailsProvider();
const client = new GenerativeAiInferenceClient({
  authenticationDetailsProvider: provider,
});

const compartmentId = process.env.OCI_COMPARTMENT_ID!;
export async function embed(texts: string[]): Promise<number[][]> {
  const res = await client.embedText({
    embedTextDetails: {
      inputs: texts,
      servingMode: { servingType: 'ON_DEMAND', modelId: 'cohere.embed-english-v3.0' },
      compartmentId
    }
  });

  return res.embedTextResult.embeddings;
}

export async function chat(args: { system: string; user: string }): Promise<string> {
  const res = await client.chat({
    chatDetails: {
      servingMode: { servingType: 'ON_DEMAND', modelId: 'cohere.command-r-plus-08-2024' },
      compartmentId,
      chatRequest: {
        apiFormat: 'COHERE',
        preambleOverride: args.system,
        message: args.user,
        temperature: 0.2,
        maxTokens: 1500
      }
    }
  });

  return res.chatResult.chatResponse.text;
}

ConfigFileAuthenticationDetailsProvider reads ~/.oci/config (the DEFAULT profile) automatically. servingType: 'ON_DEMAND' is the pay-as-you-go mode that uses your trial credits without provisioning a cluster. Everything from here on calls these embed() and chat() functions.

Episodic memory

The first layer is the simplest. Every time I publish a post, I save it. Every time I want to draft a new one, the agent retrieves the K most similar past posts to use as few-shot examples.

// src/server/memory.ts
import { randomUUID } from 'node:crypto';
import { withConn, oracledb } from './db';
import { embed } from './llm';
export async function savePost(args: {
  userId: string; platform: string; topic: string; content: string;
}): Promise<string> {
  const id = randomUUID();
  const [embedding] = await embed([args.content]);
  await withConn(async (conn) => {
    await conn.execute(
      `INSERT INTO posts (id, user_id, platform, topic, content, embedding)
       VALUES (:id, :userId, :platform, :topic, :content, :embedding)`,
      {
        id, userId: args.userId, platform: args.platform,
        topic: args.topic, content: args.content,
        embedding: { type: oracledb.DB_TYPE_VECTOR, val: new Float32Array(embedding) }
      },
      { autoCommit: true }
    );
  });

  return id;
}

Retrieval is a little more nuanced, and we take full advantage of the hybrid filter here. We want to find the most similar posts on the same platform, by this user, that aren't deleted. To do this, we perform a vector search plus a WHERE clause and are able to get the results we want with a single query.

export async function retrieveSimilarPosts(args: {
  userId: string; platform: string; topic: string; k?: number;
}) {
  const k = args.k ?? 5;
  const [queryEmbedding] = await embed([args.topic]);
  return withConn(async (conn) => {
    const r = await conn.execute<[string, string, string, number]>(
      `SELECT id, content, topic, VECTOR_DISTANCE(embedding, :q, COSINE) AS distance
       FROM posts
       WHERE user_id = :userId AND platform = :platform AND is_deleted = 0
       ORDER BY distance
       FETCH APPROX FIRST :k ROWS ONLY`,
      {
        q: { type: oracledb.DB_TYPE_VECTOR, val: new Float32Array(queryEmbedding) },
        userId: args.userId, platform: args.platform, k,
      }
    );

    return (r.rows ?? []).map(([id, content, topic, distance]) =>
      ({ id, content, topic, distance }));
  });
}

FETCH APPROX FIRST :k ROWS ONLY is what lets Oracle use the HNSW index for approximate nearest neighbor search. Without APPROX, the query would fall back to an exact scan, which is fine for thousands of vectors, but virtually unusable for millions.
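To see why an exact scan stops scaling, here is a rough in-memory sketch of what "exact" means: score every row against the query, sort, keep the first k. This is purely illustrative and not how Oracle implements it internally; cosineDistance and exactTopK are names I made up for the example.

```typescript
// Cosine distance = 1 - cosine similarity; 0 means "same direction".
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exact nearest-neighbor search: every row is scored, so the work grows
// linearly with the number of stored vectors.
function exactTopK<T extends { embedding: number[] }>(rows: T[], query: number[], k: number) {
  return rows
    .map((row) => ({ row, distance: cosineDistance(row.embedding, query) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}
```

An HNSW index avoids touching every row by walking a neighbor graph instead, trading a little recall for a lot of speed.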

Semantic memory

Episodic retrieval gets you "what have I said about this topic before." But now we need "how do I sound when I write." This is the perfect use case for semantic memory.

We keep our "style profile" as a structured JSON object. It stores attributes that are about the writing rather than in the writing, like voice traits. To capture a good approximation of what you sound like, our profile looks for the following behaviors:

{
  "tone": ["direct", "self-deprecating", "slightly skeptical of hype"],
  "sentenceLength": {
    "averageWords": 16,
    "habit": "short punchy sentences mixed with longer explanatory ones"
  },
  "structuralHabits": [
    "opens with a personal anchor or a small story",
    "uses italics for the one line that should stick",
    "closes posts with a question or a single-word punchline"
  ],
  "signaturePhrases": ["Happy coding!", "Let me explain", "Here's the thing"],
  "thingsINeverDo": [
    "use 'unlock', 'leverage', 'game-changer'",
    "more than two emoji per post"
  ],
  "topicsICareAbout": ["serverless", "AI agents", "developer experience"],
  "platformQuirks": {
    "linkedin": "longer hooks, line breaks every 1-2 sentences",
    "x": "thread-friendly, one idea per tweet"
  }
}
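On the TypeScript side, that shape can be written down as a type. The interface below is my own sketch derived from the JSON above (the repo may define StyleProfile differently), and isStyleProfile is a hypothetical runtime check for values that come back from a model or the database.

```typescript
// Assumed shape, derived from the example profile above.
interface StyleProfile {
  tone: string[];
  sentenceLength: { averageWords: number; habit: string };
  structuralHabits: string[];
  signaturePhrases: string[];
  thingsINeverDo: string[];
  topicsICareAbout: string[];
  platformQuirks: Record<string, string>;
}

const isStringArray = (v: unknown): v is string[] =>
  Array.isArray(v) && v.every((x) => typeof x === 'string');

// Hypothetical type guard: true only when every expected field is present
// and has the expected primitive type.
function isStyleProfile(value: unknown): value is StyleProfile {
  if (typeof value !== 'object' || value === null) return false;
  const p = value as Record<string, unknown>;
  const sl = p.sentenceLength as Record<string, unknown> | undefined;
  const quirks = p.platformQuirks;
  return (
    isStringArray(p.tone) &&
    typeof sl === 'object' && sl !== null &&
    typeof sl.averageWords === 'number' &&
    typeof sl.habit === 'string' &&
    isStringArray(p.structuralHabits) &&
    isStringArray(p.signaturePhrases) &&
    isStringArray(p.thingsINeverDo) &&
    isStringArray(p.topicsICareAbout) &&
    typeof quirks === 'object' && quirks !== null && !Array.isArray(quirks) &&
    Object.keys(quirks).every((k) => typeof (quirks as Record<string, unknown>)[k] === 'string')
  );
}
```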

If you've ever asked someone to write something like this about themselves, you know people are absolutely terrible at this type of self-reflection. So naturally, we bypass the human element and ask the model to generate it from the first N posts a user adds to the system.

import { chat } from './llm';
const SEED_SYSTEM = `You are a voice analyst. You will read several social media posts by one author and produce a JSON style profile describing how they write. Be specific and concrete. "Tone is friendly" is useless. "Tone is direct, occasionally self-deprecating, slightly skeptical of hype" is useful.

Output ONLY valid JSON matching this schema:
{ 
  "tone": [string], 
  "sentenceLength": {
    "averageWords": int, "habit": string
  },
  "structuralHabits": [string], 
  "signaturePhrases": [string],
  "thingsINeverDo": [string], 
  "topicsICareAbout": [string],
  "platformQuirks": {string: string} 
}`;

export async function seedStyleProfile(userId: string, sampleSize = 20) {
  const rows = await withConn(async (conn) => {
    const r = await conn.execute<[string, string]>(
      `SELECT platform, content FROM posts
       WHERE user_id = :userId AND is_deleted = 0
       ORDER BY created_at DESC FETCH FIRST :n ROWS ONLY`,
      { userId, n: sampleSize }
    );

    return r.rows ?? [];
  });

  const postsText = rows
    .map(([p, c]) => `[${p}] ${c}`).join('\n\n---\n\n');
  const response = await chat({
    system: SEED_SYSTEM,
    user: `Posts:\n\n${postsText}`
  });

  const profile = JSON.parse(response);
  await withConn(async (conn) => {
    await conn.execute(
      `MERGE INTO style_profile sp
       USING (SELECT :userId AS user_id FROM dual) src ON (sp.user_id = src.user_id)
       WHEN MATCHED THEN UPDATE SET
         profile = :profile, updated_at = CURRENT_TIMESTAMP, version = version + 1
       WHEN NOT MATCHED THEN INSERT (user_id, profile) VALUES (:userId, :profile)`,
      { userId, profile: JSON.stringify(profile) },
      { autoCommit: true }
    );
  });

  return profile;
}

The MERGE statement handles both inserts and updates in one round trip (and deletes, it's pretty impressive). Malformed model output gets caught twice: JSON.parse throws before we ever reach the database, and the Oracle JSON type validates the value again on insert. Either way the bad profile never lands, which is exactly what we want.
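One practical wrinkle: chat models sometimes wrap their JSON in a Markdown code fence even when told to output only JSON. A small, hypothetical helper like parseModelJson below (my name, not the repo's) could sit in front of JSON.parse to tolerate that while still throwing on genuinely malformed output.

```typescript
// Hypothetical helper: accept raw JSON or JSON wrapped in a Markdown fence.
// The pattern is built with `{3} so this snippet never contains a literal fence.
const FENCED_JSON = new RegExp('^`{3}(?:json)?\\s*([\\s\\S]*?)\\s*`{3}$', 'i');

function parseModelJson(raw: string): unknown {
  const trimmed = raw.trim();
  const fenced = trimmed.match(FENCED_JSON);
  const body = fenced ? fenced[1] : trimmed;
  return JSON.parse(body); // still throws SyntaxError on malformed JSON
}
```

It returns unknown on purpose, so the caller has to validate the shape before treating the result as a style profile.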

Reading it back is just a few easy lines:

export async function loadStyleProfile(userId: string) {
  return withConn(async (conn) => {
    const r = await conn.execute<[unknown]>(
      `SELECT profile FROM style_profile WHERE user_id = :userId`,
      { userId },
    );

    if (!r.rows?.length) return null;
    return r.rows[0][0] as StyleProfile;
  });
}

Reflective memory

I've built many agents where I'm done after implementing episodic and semantic memory. In many instances, that's good enough. But with this type of workload, aka people writing about what they care about, preferences change over time. Personally, I used to write nothing but dry, cold facts on serverless architectures. Today, I'm a pretty funny guy (right?!) and ponder on things that take software from good to great. Very different styles, but both me.

Building in reflective memory allows the agent to adjust over time. It's what gives us the impression that it's actually "learning."

Every K new posts (I use K=5), we trigger a reflection. The reflection is an LLM call that reads the current style profile and the K newest posts, then creates a structured diff: what's changed, what should be added, and what should be removed.

We go with the structured diff instead of a straight up overwrite to avoid profile thrashing. You aren't just your last 5 posts. You're a summary of everything you've ever posted with a recency bias. Asking for a diff lets the model commit to small, intentional updates, so you stay you and don't appear like you have violent mood swings every week.

const REFLECT_SYSTEM = `You are reviewing how an author's voice may have evolved. You have:

1. Their CURRENT style profile (built from older posts)
2. Their MOST RECENT posts (not yet incorporated)

Read the recent posts and compare to the profile. Decide whether the profile needs updating. Be conservative: most of the time, voice is stable and you should change little or nothing. Only return updates that you can point to specific evidence for in the recent posts.

Output ONLY valid JSON:
{ "additions": [{"field": string, "value": any, "evidence": string}],
  "removals":  [{"field": string, "value": any, "reason": string}],
  "rationale": string 
}

If nothing should change, return empty arrays.`;

export async function reflect(userId: string, windowSize = 5) {
  const profile = await loadStyleProfile(userId);
  if (!profile) return seedStyleProfile(userId);
  const rows = await withConn(async (conn) => {
    const r = await conn.execute<[string, string]>(
      `SELECT id, content FROM posts
       WHERE user_id = :userId AND is_deleted = 0
       ORDER BY created_at DESC FETCH FIRST :n ROWS ONLY`,
      { userId, n: windowSize }
    );

    return r.rows ?? [];
  });

  const postIds = rows.map(([id]) => id);
  const postsText = rows.map(([, c]) => c).join('\n\n---\n\n');
  const response = await chat({
    system: REFLECT_SYSTEM,
    user: `CURRENT PROFILE:\n${JSON.stringify(profile, null, 2)}\n\nRECENT POSTS:\n${postsText}`,
  });

  const diff = JSON.parse(response);
  const updated = applyDiff(profile, diff);
  await withConn(async (conn) => {
    await conn.execute(
      `UPDATE style_profile SET profile = :profile,
         updated_at = CURRENT_TIMESTAMP, version = version + 1
       WHERE user_id = :userId`,
      { profile: JSON.stringify(updated), userId }
    );

    await conn.execute(
      `INSERT INTO reflections (id, user_id, posts_window, diff, profile_after)
       VALUES (:id, :userId, :window, :diff, :after)`,
      {
        id: randomUUID(), userId,
        window: JSON.stringify(postIds),
        diff: JSON.stringify(diff),
        after: JSON.stringify(updated)
      },
      { autoCommit: true }
    );
  });

  return updated;
}
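The "every K new posts" trigger isn't shown above, so here is one way it could look. This is a sketch under my own assumptions: shouldReflect is a hypothetical function that takes the created_at values of a user's posts and the triggered_at of their latest reflection (or null if there has never been one), however you choose to fetch them.

```typescript
// Hypothetical trigger: reflect once at least `windowSize` posts have been
// saved since the last reflection ran.
function shouldReflect(
  postDates: Date[],
  lastReflectionAt: Date | null,
  windowSize = 5
): boolean {
  const unseen = postDates.filter(
    (d) => lastReflectionAt === null || d.getTime() > lastReflectionAt.getTime()
  );
  return unseen.length >= windowSize;
}
```

You could call it right after savePost and run reflect(userId) whenever it returns true.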

The applyDiff function is simple. It iterates through the additions and removals, and edits the profile object in place. It's in the repo if you're interested. I need to point out again that the only reason applyDiff works is because we tell the model to be conservative with its reflections. Remember, we don't want wild swings in the profile. Your posts won't make it past the "AI sniff test" if you're calm and collected one day, and corporate and metric-driven the next.
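For a feel of what that looks like, here is a minimal sketch of an applyDiff. It is not the repo's implementation: the types are my own, it assumes top-level field names only, and it returns an updated copy rather than editing in place so the original profile stays available for comparison.

```typescript
type Profile = Record<string, unknown>;
interface Addition { field: string; value: unknown; evidence: string }
interface Removal { field: string; value: unknown; reason: string }
interface Diff { additions: Addition[]; removals: Removal[]; rationale: string }

function applyDiff(profile: Profile, diff: Diff): Profile {
  // Deep-copy via JSON so the caller's object is left untouched.
  const next: Profile = JSON.parse(JSON.stringify(profile));

  for (const { field, value } of diff.removals) {
    const current = next[field];
    if (Array.isArray(current)) {
      next[field] = current.filter((item) => item !== value);
    } else if (current === value) {
      delete next[field];
    }
  }

  for (const { field, value } of diff.additions) {
    const current = next[field];
    if (Array.isArray(current)) {
      // Skip values that are already present so repeated diffs stay idempotent.
      if (current.indexOf(value) === -1) current.push(value);
    } else if (
      typeof current === 'object' && current !== null &&
      typeof value === 'object' && value !== null
    ) {
      next[field] = { ...current, ...value }; // merge maps such as platformQuirks
    } else {
      next[field] = value;
    }
  }
  return next;
}
```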

If that does happen though, we can use the reflection log as a set of point-in-time snapshots we can roll back to. Just rebuild from a previous profile_after snapshot and the agent effectively "unlearns" that unwanted style.
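Picking the snapshot to restore is the only real logic in a rollback. The sketch below assumes you have already loaded a user's rows from the reflections table into memory; ReflectionRow and snapshotBefore are hypothetical names, and writing the chosen profile back to style_profile is left out.

```typescript
// In-memory view of a reflections row (assumed mapping of the table columns).
interface ReflectionRow {
  id: string;
  triggeredAt: Date;
  profileAfter: Record<string, unknown>;
}

// Returns the snapshot written by the reflection immediately before the bad
// one, or null when the bad reflection is the first entry in the log.
function snapshotBefore(
  log: ReflectionRow[],
  badReflectionId: string
): Record<string, unknown> | null {
  const ordered = [...log].sort(
    (a, b) => a.triggeredAt.getTime() - b.triggeredAt.getTime()
  );
  const index = ordered.map((r) => r.id).indexOf(badReflectionId);
  if (index === -1) throw new Error(`Unknown reflection: ${badReflectionId}`);
  return index === 0 ? null : ordered[index - 1].profileAfter;
}
```

A null result means there is no earlier snapshot, in which case re-seeding the profile from posts is the fallback.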

Agent memory in action

Now that we've gone through all three types of memory, it's time to build the generatePost function and see them all in action as we compose the prompt.

// src/server/agent.ts
import { chat } from './llm';
import { loadStyleProfile, retrieveSimilarPosts } from './memory';
export async function generatePost(args: {
  userId: string; platform: string; topic: string;
}) {
  const profile = (await loadStyleProfile(args.userId)) ?? {};
  const examples = await retrieveSimilarPosts({ ...args, k: 5 });
  const examplesText = examples.map((e) => e.content).join('\n\n---\n\n');
  const system = `You are drafting a social media post in the user's voice.
    STYLE PROFILE (how this user writes):
    ${JSON.stringify(profile, null, 2)}
    
    EXAMPLES (recent posts by this user on similar topics):
    ${examplesText}


    Write ONE draft post. Match the style profile and the cadence of the examples. Do not copy phrases from the examples. Do not mention that you are an AI or that you are following a profile.`;

  const draft = await chat({
    system,
    user: `Platform: ${args.platform}\nTopic: ${args.topic}`,
  });

  return { draft, basedOn: examples };
}

That's it. That's the whole thing. It's deceptively simple: one embedding call for the topic, two database reads (load the profile, run the vector search), and one call to an LLM.

The agent will get better over time. The first time you use it, it will sound like everything else you see on social media these days. But as you edit the drafts and build up the data with examples in your true voice, it gets better and eerily starts sounding like you.

FAQs

Q: Why not just prompt the model to “write in my voice”?
Because the model does not actually know your voice unless you give it evidence. The article uses past posts plus a style profile so the agent has something concrete to imitate instead of guessing.

Q: What does episodic memory do here?
It stores every past post with an embedding. When you ask for a new draft, the agent finds the most similar posts on the same platform and uses them as examples.

Q: What is the style profile?
It is the semantic memory layer: a JSON object that describes how you write, including tone, sentence habits, structural patterns, signature phrases, topics you care about, and things you avoid.

Q: Why use reflective memory instead of just overwriting the profile?
Because you are not just your last five posts. Reflection creates small, evidence-backed updates so the profile can evolve without thrashing every time your writing mood changes.

Q: What happens if the agent learns the wrong style?
The reflection table keeps a history of changes. Since each reflection stores the diff and the resulting profile, you can roll back to an earlier snapshot and effectively make the agent unlearn that bad update.

To summarize

Agent memory has a lot more to it than simply "remembering things." There are different types of memory that represent similar artifacts, summaries, and long-term audits. To build a production-ready system, you need all three. Episodic and semantic memory to satisfy the business problem, and reflective to improve over time.

Oracle AI Database is the perfect database for these three types of memory. It supports vectors, JSON objects, and similarity search with filtering all in the same database. The schema for this is twenty lines. Reads are three-line functions. The complexity stays out of the database layer and in the prompt engineering where it belongs.

To walk through this project yourself, you can find the code on GitHub. It's built on TypeScript end-to-end, and is easily portable to whatever your preferred programming language is.

If you try this out, send me what you generate. I want to see how well "sounds like you" holds up across different writers.

Happy coding!

Build an Intelligent Document Processor in One Data Store

Anya Summers — Thu, 23 Jul 2026 16:19:23 +0000

Companion notebook: https://github.com/oracle-devrel/oracle-ai-developer-hub/tree/main/apps/idp-oracle-ai-database

Key takeaways

The use case is boring in the best possible way. Intelligent Document Processing is exactly where AI makes sense: incoming business PDFs, repetitive manual work, and structured fields that need to move into a process. In this article, the example is procure-to-pay: purchase orders, delivery notes, and invoices.
The twist is one data store. Instead of splitting blobs, JSON, vectors, relational data, and AI calls across S3, DynamoDB, Pinecone, SQL, and external APIs, the whole pipeline runs through Oracle AI Database.
Classification does not need an LLM. The app embeds labeled sample documents, embeds each new document, then uses k-nearest-neighbor vector search to decide whether it looks most like an invoice, purchase order, or delivery note.
The LLM only shows up when it is actually needed. Vectors can tell you what kind of document you have. They cannot reliably extract invoice numbers, totals, due dates, vendors, and line items. That structured extraction step uses UTL_TO_GENERATE_TEXT and validates the result against a Zod schema.
Oracle AI Database becomes the IDP engine, not just storage. It stores the original PDF as a BLOB, extracts text, summarizes it, creates embeddings, runs vector search, stores structured JSON fields, and calls OCI Generative AI from inside the database.

Everybody wants to build AI applications. But nobody knows a good use case.

One use case I have seen over and over again is processing incoming business documents.
In this article, I will show you how to build an Intelligent Document Processing (IDP) platform around the procure-to-pay cycle: it ingests purchase orders, delivery notes, and invoices, classifies them, and pulls out their structured fields.

The twist: we do all of it inside one data store.

The Issue with Data Stores

IDP platforms are nothing new. They are built and used in many companies already. It is the typical internal tool where a bit of AI saves a lot of manual data entry.

Building one usually means stitching together several data stores:

S3 for the document blobs
DynamoDB for key/values
Pinecone for the vectors
A SQL database for aggregations

…sometimes even more.

In this article, I want to demonstrate how you can build all of that with just one data store: Oracle AI Database.

What Is Oracle AI Database?

Oracle AI Database is Oracle's AI-native database, its flagship database with AI built into the engine rather than bolted on through external services.

It is a converged database, which means it doesn't just support typical SQL workloads. It is a multi-model database, which lets you also:

store JSON documents
store relational rows
store vectors
store BLOB files
… and more!

Instead of wiring up separate OpenAI, Cohere, and vector-store APIs, the database can do the work for you.

Oracle doesn't just give you a VECTOR data type (you could get that from a Postgres extension too). You also get text extraction, chunking, embeddings, vector search, and even calls out to generative-AI models. All driven from SQL and PL/SQL via the DBMS_VECTOR_CHAIN package.

Let's build with it.

The Architecture

We have a typical REST + SPA architecture. The frontend and backend run on AWS. The database lives in the free tier of Oracle Cloud (OCI).

A React SPA on S3 and CloudFront calls a Hono API on AWS Lambda, which connects to Oracle AI Database on OCI

Frontend: React SPA (Vite) + TanStack Router
Backend: Hono API on AWS Lambda (Function URL)
Hosting: S3 + CloudFront
Database: Oracle AI Database (OCI Autonomous, Always Free tier)

AWS provides only compute and hosting. Everything about a document (the original file, the extracted text, the structured JSON, and the vector) lives in Oracle AI Database.

The Documents We Process

The application detects and structures incoming documents. We picked the three documents of the procure-to-pay cycle: purchase orders, delivery notes, and invoices.

For each type we generated a handful of sample PDFs and embedded these labeled examples into the database. When a new document arrives, the app compares its embedding against those labeled examples to decide which type it most resembles. There is no rules engine and no fine-tuning — just vectors and distance.

We want to get specific fields from each document. For example, for invoices we look for the following data (in a Zod schema):

// packages/schemas/src/invoice.ts
export const invoiceFields = z.object({
  envelope: commonEnvelope,
  vendor: z.string(),
  invoiceNumber: z.string(),
  invoiceDate: z.string(),
  dueDate: z.string().nullable(),
  currency: z.string().length(3),
  subtotal: z.number(),
  tax: z.number(),
  total: z.number(),
  lineItems: z.array(invoiceLineItem),
});

Our AI is extracting exactly this data. In an IDP application this data is used for further processing like sending out the order or validating invoices.

Viewing Documents and Content

The app's document detail view showing the original PDF, its extracted fields, and similar documents found via vector search

In the application you can open any document to see the original PDF, the extracted fields, and similar documents found via vector search.

Uploading Documents

The app's upload screen for adding a new document to be classified and processed

If you want to upload a new document you can do so as well! The uploader stores the document in the database, embeds it, and finds similar documents again. More details about this process follow in the rest of the article.

A Two-Minute Primer on Vectors

A vector embedding is just a list of numbers that represents the *meaning* of a piece of text. You can picture each document as a point in space.

Documents plotted as points in vector space, where documents of the same type cluster together and different types sit farther apart

Documents of the same type land near each other; different types land further apart.

To classify a new document, we embed it and measure the distance to the labeled examples we already stored. The closest examples win.

For example, when a new document comes in we check whether it sits closer to a purchase order, a delivery note, or an invoice:

distance(new, purchase-order-sample) = 0.1 ✅
distance(new, delivery-note-sample) = 0.7 ❌
distance(new, invoice-sample) = 0.4 ❌

The smallest distance is the most similar, so we classify the new document as a purchase order.

A new document compared by distance to a purchase order, delivery note, and invoice sample; the nearest sample wins the classification
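Here is a rough sketch of that decision in code. It assumes the database has already returned the nearest labeled samples with their distances; classifyByNeighbors, the Neighbor type, the default k of 3, and the tie-break rule are all my own illustration rather than the app's actual implementation.

```typescript
type DocType = 'invoice' | 'purchase_order' | 'delivery_note';
interface Neighbor { docType: DocType; distance: number }

// k-nearest-neighbor vote over labeled samples returned by a vector search.
function classifyByNeighbors(neighbors: Neighbor[], k = 3): DocType | 'unknown' {
  const nearest = [...neighbors].sort((a, b) => a.distance - b.distance).slice(0, k);
  if (nearest.length === 0) return 'unknown';

  const votes: Record<string, number> = {};
  for (const n of nearest) votes[n.docType] = (votes[n.docType] ?? 0) + 1;

  // Highest vote count wins. Ties go to the type with the closest neighbor,
  // because `nearest` is sorted and we only switch on a strictly higher count.
  let best = nearest[0].docType;
  for (const n of nearest) if (votes[n.docType] > votes[best]) best = n.docType;
  return best;
}
```

With the three distances from the example (0.1, 0.7, 0.4) every type gets one vote, so the tie-break picks the closest sample and the result is purchase_order.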

Prerequisites

To follow along you need:

Node 20+ and pnpm 10+
An OCI Free account with an Oracle AI Database (Autonomous, Always Free tier), with the wallet downloaded locally
An OCI API key for OCI Generative AI (used by the extraction step)
An AWS account (only if you want to deploy; you can run everything locally without it)

Two short provisioning guides in the repo walk you through the slow parts:

docs/01-provision-oracle.md — create the database, download the wallet, run the migrations, load the embedding model.
docs/02-provision-oci-genai.md — create the API key and register the in-database OCI Generative AI credential.

When the database is provisioned and .env is filled in (see .env.example), bootstrap everything with:

pnpm install
pnpm db:setup # creates the idp user, schema, and indexes
pnpm db:setup-onnx # loads the ONNX embedding model as "doc_embedder"
pnpm db:setup-oci-credential # registers the OCI Generative AI credential + smoke-tests it

Setting Up the Database

The Schema

The whole application lives in two tables. documents holds the file and everything we derive from it, including the 384-dimension embedding as a native VECTOR column:

CREATE TABLE documents (
  id                RAW(16)         DEFAULT SYS_GUID() PRIMARY KEY,
  doc_type          VARCHAR2(16)    DEFAULT 'unknown' NOT NULL,
  status            VARCHAR2(32)    DEFAULT 'pending' NOT NULL,
  original_filename VARCHAR2(512)   NOT NULL,
  mime_type         VARCHAR2(128)   NOT NULL,
  byte_size         NUMBER          NOT NULL,
  page_count        NUMBER,
  language          VARCHAR2(8),
  failed_reason     VARCHAR2(512),
  created_at        TIMESTAMP       DEFAULT SYSTIMESTAMP NOT NULL,
  updated_at        TIMESTAMP       DEFAULT SYSTIMESTAMP NOT NULL,
  file_blob         BLOB            NOT NULL,
  extracted_text    CLOB,
  embedding         VECTOR(384, FLOAT32),
  CONSTRAINT documents_doc_type_chk
    CHECK (doc_type IN ('invoice', 'purchase_order', 'delivery_note', 'unknown')),
  CONSTRAINT documents_status_chk
    CHECK (status IN ('pending','text_extracted','classified','fields_extracted','embedded','done','failed'))
);

CREATE TABLE document_fields (
  document_id RAW(16)   PRIMARY KEY,
  payload     JSON      NOT NULL,
  created_at  TIMESTAMP DEFAULT SYSTIMESTAMP NOT NULL,
  updated_at  TIMESTAMP DEFAULT SYSTIMESTAMP NOT NULL,
  CONSTRAINT document_fields_document_fk
    FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
);

A vector index makes nearest-neighbor search hit a graph instead of a brute-force scan:

CREATE VECTOR INDEX documents_embedding_idx
  ON documents (embedding)
  ORGANIZATION INMEMORY NEIGHBOR GRAPH
  DISTANCE COSINE
  WITH TARGET ACCURACY 95;

The per-type structured fields live in the document_fields table as a native JSON column, written with a MERGE upsert and read back with JSON_SERIALIZE. Full DDL is in packages/db/migrations.

Loading the Embedding Model

Oracle AI Database generates embeddings inside the database from an ONNX model you upload once. We use Oracle's pre-built all_MiniLM_L12_v2.onnx (384-dim output), loaded with DBMS_VECTOR.LOAD_ONNX_MODEL and registered under the name doc_embedder. pnpm db:setup-onnx does the download and load and prints:

Phase 1: ADMIN pulls all_MiniLM_L12_v2.onnx into DATA_PUMP_DIR
✓ all_MiniLM_L12_v2.onnx = 133322334 bytes

Phase 2: idp loads "doc_embedder" from DATA_PUMP_DIR
✓ model doc_embedder loaded
✓ embedding dimension = 384

Registering the OCI Generative AI Credential

The extraction step (the only LLM call in the pipeline) runs from the database via DBMS_VECTOR_CHAIN.UTL_TO_GENERATE_TEXT, which calls OCI Generative AI. For that, the database needs a credential built from your OCI API key:

BEGIN
  DBMS_VECTOR_CHAIN.CREATE_CREDENTIAL(
    credential_name => 'OCI_CRED',
    params => JSON('{
      "user_ocid":        "ocid1.user.oc1..xxxx",
      "tenancy_ocid":     "ocid1.tenancy.oc1..xxxx",
      "compartment_ocid": "ocid1.compartment.oc1..xxxx",
      "private_key":      "<PEM body, without the BEGIN/END lines>",
      "fingerprint":      "aa:bb:cc:..."
    }')
  );
END;
/

You also need the IAM policy allow group <your-group> to manage generative-ai-family in tenancy. pnpm db:setup-oci-credential grants the database privileges, opens the outbound network ACL to the OCI Generative AI host, registers OCI_CRED, and runs a smoke test:

Phase 3: smoke test UTL_TO_GENERATE_TEXT against meta.llama-3.3-70b-instruct in eu-frankfurt-1
  ✓ response: PONG.

If you see PONG, the whole chain (API key → fingerprint → network → credential → model) works.

Ingesting Documents

The core of the application is the ingest pipeline.

The six-step ingest pipeline: store the file, extract text, summarize, embed, classify with k-NN, and extract the fields

When a document arrives, the pipeline:

Stores the uploaded file as a BLOB row with status pending.
Calls DBMS_VECTOR_CHAIN.UTL_TO_TEXT to extract the text and saves it.
Calls DBMS_VECTOR_CHAIN.UTL_TO_SUMMARY to generate a short extractive summary.
Generates a 384-dim embedding with VECTOR_EMBEDDING using the loaded ONNX model.
Classifies the document by running a k-NN vector search against the labeled examples — no LLM call.
Extracts the typed fields with DBMS_VECTOR_CHAIN.UTL_TO_GENERATE_TEXT.

It sounds straightforward, and it is — but it's worth pausing on the fact that one database stores the file, extracts its text, summarizes it, computes the embedding, runs the vector search, and makes the LLM call. Let's look at each step.

Step 1 — Text Extraction with `UTL_TO_TEXT`

The first step extracts all text from the PDF with DBMS_VECTOR_CHAIN.UTL_TO_TEXT, which reads a file stored as a BLOB and returns the text it contains. We save the result in the extracted_text column.

UPDATE documents
SET extracted_text = DBMS_VECTOR_CHAIN.UTL_TO_TEXT(file_blob)
WHERE id = HEXTORAW(:id);

Important caveat: UTL_TO_TEXT can only read embedded text. It cannot interpret images, scans, or handwritten annotations on your documents. For those, you typically need OCR or a vision-capable LLM.

Step 2 — Summaries with `UTL_TO_SUMMARY`

In the next step, we generate a summary of the document with DBMS_VECTOR_CHAIN.UTL_TO_SUMMARY. With the database provider we use here, the summary is produced by Oracle Text inside the database: an extractive summary of the most representative sentences, not an LLM call. If you want a generative summary instead, UTL_TO_SUMMARY can also be pointed at an external provider such as Claude or OpenAI.

SELECT DBMS_VECTOR_CHAIN.UTL_TO_SUMMARY(
         extracted_text,
         JSON('{"provider":"database","glevel":"sentence","numParagraphs":3}')
       )
FROM documents
WHERE id = HEXTORAW(:id);

Step 3 — Embeddings with `VECTOR_EMBEDDING`

Before we can classify by vectors, we need a vector. Oracle generates one inside the database from the ONNX model we loaded earlier:

UPDATE documents
SET embedding = VECTOR_EMBEDDING(doc_embedder USING extracted_text AS data)
WHERE id = HEXTORAW(:id);

No external embedding service, no second store. The 384-dim vector ends up in the same row as the BLOB and the extracted text.

One important note on size: VECTOR_EMBEDDING here embeds the entire extracted text in a single call. That is fine because our documents are small. Larger documents should be chunked first, and Oracle has a built-in mechanism for that as well: UTL_TO_CHUNKS.

Chunking flow: UTL_TO_TEXT extracts plain text, UTL_TO_CHUNKS splits it into chunks, UTL_TO_EMBEDDINGS turns each chunk into a 384-dim vector

In one SQL statement the whole chain looks like this:

-- TEXT -> CHUNKS -> EMBEDDINGS, in one statement
SELECT et.*
FROM documents d,
     DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDINGS(
       DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS(
         DBMS_VECTOR_CHAIN.UTL_TO_TEXT(d.file_blob),
         JSON('{ "by":"words", "max":"200", "overlap":"20", "split":"recursively" }')
       ),
       JSON('{ "provider":"database", "model":"doc_embedder" }')
     ) et
WHERE d.id = HEXTORAW(:id);

For us, embedding the documents directly suffices.
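To make the "by", "max", and "overlap" parameters in the chunking statement more concrete, here is a minimal conceptual sketch of word-based chunking with overlap. It is not Oracle's UTL_TO_CHUNKS algorithm: the function name chunkByWords is hypothetical, it ignores the "split" setting, and it assumes the overlap is counted in words, the same unit as max.

```typescript
// Conceptual sliding-window chunker for illustration only.
// NOT Oracle's UTL_TO_CHUNKS implementation; chunkByWords is a hypothetical name.
function chunkByWords(text: string, max = 200, overlap = 20): string[] {
  if (max <= 0 || overlap < 0 || overlap >= max) {
    throw new Error("require max > 0 and 0 <= overlap < max");
  }
  // Split on whitespace and drop empty tokens.
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  // Each new chunk starts (max - overlap) words after the previous one,
  // so consecutive chunks share `overlap` words.
  const step = max - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + max).join(" "));
    if (start + max >= words.length) break; // this window already reached the end
  }
  return chunks;
}
```

With max 200 and overlap 20, a 450-word text yields three chunks that start at words 0, 180, and 360, so each chunk repeats the last 20 words of the previous one.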

Step 4 — Classifying with k-NN

There are two ways to classify a document:

1. Ask an LLM "what kind of document is this?"
2. Ask your vectors which labeled examples it resembles.

We go with option 2 because it incurs no LLM cost and reuses the embeddings we already store.

We run a k-nearest-neighbors search:

embed the new document
find its k nearest labeled examples
take the majority document type among them

// packages/db/src/repositories/documents.ts (abridged)
async classifyByVector(id: string, k = 5, unknownThreshold = 0.5) {
  return withConnection(async (conn) => {
    const result = await conn.execute(
      `SELECT b.doc_type AS DOC_TYPE,
              VECTOR_DISTANCE(a.embedding, b.embedding, COSINE) AS DISTANCE
       FROM documents a, documents b
       WHERE a.id = HEXTORAW(:id)
         AND b.id != HEXTORAW(:id)
         AND b.embedding IS NOT NULL
         AND b.doc_type IN ('invoice', 'purchase_order', 'delivery_note')
         AND b.status = 'done'
       ORDER BY DISTANCE
       FETCH FIRST :k ROWS ONLY`,
      { id, k },
      { outFormat: oracledb.OUT_FORMAT_OBJECT },
    );
    const neighbors = (result.rows ?? []).map((r) => ({
      docType: r.DOC_TYPE,
      distance: Number(r.DISTANCE),
    }));

    if (!neighbors.length || neighbors[0].distance > unknownThreshold) {
      return { docType: 'unknown', confidence: 0 };
    }

    const counts: Record<string, number> = {};
    for (const n of neighbors) counts[n.docType] = (counts[n.docType] ?? 0) + 1;
    const [winner, votes] = Object.entries(counts).sort((a, b) => b[1] - a[1])[0];
    return { docType: winner, confidence: votes / neighbors.length };
  });
}

We only compare against labeled examples that finished processing (status = 'done'), take the k nearest by cosine distance, and let them vote. If even the closest example is farther than our unknownThreshold of 0.5, we mark the document unknown instead of guessing.

For example, for a new purchase order the three nearest neighbors might come back as (nearest first):

doc_1 — distance 0.12 — purchase order
doc_2 — distance 0.19 — purchase order
doc_3 — distance 0.24 — purchase order

All three are purchase orders and well inside the threshold, so we classify the new document as a purchase order.
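The voting rule can also be tried in isolation, without a database. The sketch below is a standalone version of the threshold check and majority vote described above, applied to the three example neighbors. The names Neighbor, Classification, and voteOnNeighbors are illustrative and not part of the repository.

```typescript
// Illustrative types and names; only the voting rule mirrors the text above.
type Neighbor = { docType: string; distance: number };
type Classification = { docType: string; confidence: number };

// `neighbors` must be sorted nearest-first, as the SQL ORDER BY guarantees.
function voteOnNeighbors(neighbors: Neighbor[], unknownThreshold = 0.5): Classification {
  const nearest = neighbors[0];
  // No neighbors, or even the closest one is too far away: do not guess.
  if (!nearest || nearest.distance > unknownThreshold) {
    return { docType: "unknown", confidence: 0 };
  }
  // Count one vote per neighbor for its document type.
  const counts = new Map<string, number>();
  for (const n of neighbors) counts.set(n.docType, (counts.get(n.docType) ?? 0) + 1);

  let winner = "unknown";
  let votes = 0;
  counts.forEach((count, docType) => {
    if (count > votes) {
      winner = docType;
      votes = count;
    }
  });
  // Confidence is the share of neighbors that voted for the winner.
  return { docType: winner, confidence: votes / neighbors.length };
}

const exampleVote = voteOnNeighbors([
  { docType: "purchase_order", distance: 0.12 },
  { docType: "purchase_order", distance: 0.19 },
  { docType: "purchase_order", distance: 0.24 },
]);
// exampleVote.docType is "purchase_order" and exampleVote.confidence is 1
```

Note that the threshold only looks at the single nearest neighbor. A stricter variant could also require a minimum vote share before accepting the winner.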

Step 5 — Extracting Fields with `UTL_TO_GENERATE_TEXT`

Vectors tell us what a document is. They can't tell us what's in it.

For that we need structured output. For an invoice, for example, we extract fields matching the following schema:

// packages/schemas/src/invoice.ts
export const invoiceFields = z.object({
  envelope: commonEnvelope,
  vendor: z.string(),
  invoiceNumber: z.string(),
  invoiceDate: z.string(),
  dueDate: z.string().nullable(),
  currency: z.string().length(3),
  subtotal: z.number(),
  tax: z.number(),
  total: z.number(),
  lineItems: z.array(invoiceLineItem),
});

This data is necessary for our business processes.

This is the first point in the pipeline where we actually need an LLM, and we can call it directly from the database with DBMS_VECTOR_CHAIN.UTL_TO_GENERATE_TEXT.

Each document type has its own Zod validation schema. The schema is converted to JSON Schema and passed into the prompt for the LLM call.

// packages/schemas/src/registry.ts
export const fieldsSchemaByType = {
  invoice: invoiceFields,
  purchase_order: purchaseOrderFields,
  delivery_note: deliveryNoteFields,
} as const;

export type ExtractableDocType = keyof typeof fieldsSchemaByType;

export function getJsonSchemaForType(docType: ExtractableDocType): object {
  return zodToJsonSchema(fieldsSchemaByType[docType], {
    target: 'jsonSchema7',
    $refStrategy: 'none',
  });
}

Then we call the database function like this:

SELECT DBMS_VECTOR_CHAIN.UTL_TO_GENERATE_TEXT(
  :prompt,
  JSON('{
    "provider":        "ocigenai",
    "credential_name": "OCI_CRED",
    "url":             "https://inference.generativeai.eu-frankfurt-1.oci.oraclecloud.com/20231130/actions/chat",
    "model":           "meta.llama-3.3-70b-instruct",
    "chatRequest":     { "maxTokens": 4096, "temperature": 0 }
  }')
) AS out FROM dual;

In our :prompt we say:

You extract structured fields from a document. Respond with a single JSON object….
JSON Schema:
${JSON.stringify(jsonSchema)}

After the call returns, we validate the response against our Zod schema to make sure every required field is present and correctly typed.
If validation fails, processing fails.
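To illustrate the "parse, validate, or fail" step without pulling in a dependency, here is a minimal sketch that swaps the Zod schema for hand-written checks over a subset of the invoice fields. The names InvoiceTotals and parseInvoiceTotals are hypothetical, and the sketch assumes the model returns a bare JSON object as text. The real pipeline validates against the full Zod schema.

```typescript
// Subset of the invoice schema above. The hand-written checks are a stand-in
// for the Zod validation the pipeline uses; all names here are hypothetical.
type InvoiceTotals = {
  vendor: string;
  currency: string;
  subtotal: number;
  tax: number;
  total: number;
};

type ParseResult =
  | { ok: true; value: InvoiceTotals }
  | { ok: false; reason: string };

function parseInvoiceTotals(llmOutput: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(llmOutput);
  } catch {
    return { ok: false, reason: "response is not valid JSON" };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { ok: false, reason: "response is not a JSON object" };
  }
  const { vendor, currency, subtotal, tax, total } = data as Record<string, unknown>;
  if (typeof vendor !== "string") {
    return { ok: false, reason: "vendor must be a string" };
  }
  // Mirrors z.string().length(3) for the currency code.
  if (typeof currency !== "string" || currency.length !== 3) {
    return { ok: false, reason: "currency must be a 3-character string" };
  }
  if (typeof subtotal !== "number" || typeof tax !== "number" || typeof total !== "number") {
    return { ok: false, reason: "subtotal, tax, and total must be numbers" };
  }
  return { ok: true, value: { vendor, currency, subtotal, tax, total } };
}
```

A failed result can be mapped to status = 'failed', with the reason written to failed_reason, so a malformed model response never reaches document_fields.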

That is all for ingesting! We didn't need to leave our one data store at all.

Validate It End to End

With the database set up, run the API and seed it with the committed sample PDFs:

pnpm dev:api # Hono on :8787
pnpm seed # uploads every sample PDF and waits for ingest

pnpm seed prints a per-file result and a summary you can sanity-check:

✓ invoice-01.pdf type=invoice status=done 3.1s
✓ purchase-order-01.pdf type=purchase_order status=done 2.8s
✓ delivery-note-01.pdf type=delivery_note status=done 2.6s
...
Summary
by type: {"invoice":10,"purchase_order":10,"delivery_note":10}
by status: {"done":30}

If a document lands on status=failed, the reason is stored in documents.failed_reason. A common one is no_text_extracted for a scanned or image-only PDF; see the UTL_TO_TEXT caveat in Step 1. The two provisioning guides each end with a troubleshooting table covering the usual wallet, credential, and region errors.

Deployment

You can deploy the whole thing into your own AWS account:

pnpm cdk:deploy

The app is hosted on S3 + CloudFront, with the API running in a single Lambda function behind a Function URL. At low scale it costs essentially nothing.

Have fun trying it out!

FAQs

Q: What kind of documents does the app process?
Purchase orders, delivery notes, and invoices. Those three documents cover the basic procure-to-pay flow: ordering goods, receiving them, and getting billed.

Q: Why use vectors for document classification?
Because documents of the same type tend to land near each other in vector space. A new document can be classified by comparing its embedding to labeled examples, without fine-tuning a model or paying for an LLM call.

Q: When does the pipeline use an LLM?
Only during structured field extraction. After the document type is known, the app asks OCI Generative AI to return fields that match the right schema, such as invoice totals, dates, currency, and line items.

Q: What happens if the uploaded PDF is scanned or image-only?
The article calls this out as a caveat: UTL_TO_TEXT can read embedded text, but it does not understand scanned images or handwriting. Those cases usually need OCR or a vision-capable model.

Q: Why is one database such a big deal here?
Because the traditional version of this stack would need several systems: one for files, one for relational data, one for JSON or key-value data, one for vectors, plus external AI calls. Here, the document, extracted text, embedding, classification, structured fields, and AI workflow stay in one place.

Summary

In this article, we went through a whole IDP pipeline: uploading PDFs, generating embeddings, classifying documents with k-NN, and calling OCI Generative AI for field extraction, all within one data store.

This is one of the biggest benefits of using a converged database such as Oracle AI Database.

In a traditional stack this would have been at least 4 systems:

S3 (BLOB)
Postgres (relational)
Pinecone (Vectors)
DynamoDB/MongoDB (JSON)

…plus API calls to external LLM providers. With Oracle AI Database, all of that stays within one system.

Production RAG Evaluation: Keyword, Vector, SQL, or Hybrid Search?

Anya Summers — Thu, 23 Jul 2026 16:17:57 +0000

Short answer: Production RAG evaluation should measure whether the system retrieves the right evidence, answers from that evidence, respects permissions, handles fresh data, and refuses unsupported questions. Compare keyword, vector, SQL, and hybrid retrieval against the same question set. Use retrieval metrics, answer-quality checks, and production failure tests before deciding which path belongs in the application.

RAG, or retrieval-augmented generation, retrieves evidence from an authoritative source and gives that evidence to a language model before it answers. Production RAG evaluation tests both halves of that process: whether retrieval found the right evidence and whether the generated answer used it correctly.

A RAG demo can pass with a few clean documents and one friendly question. Production is where the system starts meeting real users.

They ask for exact IDs. They ask vague questions. They ask about data that changed five minutes ago. They ask across tenants, versions, tables, PDFs, status fields, and long conversations. Sometimes the right answer is not in the corpus at all.

That is why the useful question is not "Should I use vector search or hybrid search?" The useful question is "Which retrieval path gives the application the right evidence, under the constraints this system actually has?"

This article turns that question into a practical evaluation plan for developers building RAG with Oracle AI Database. The companion notebook benchmarks three retrieval methods:

keyword retrieval for exact terms, identifiers, and lexical matches
vector retrieval for semantic similarity and vocabulary mismatch
RRF hybrid retrieval when keyword and vector candidates both add value

SQL or natural-language-to-SQL is treated as a separate route for current structured data. Evaluate it with query-correctness, permission, freshness, and result-limit tests rather than forcing it into document-retrieval metrics.

The goal is not to declare a universal winner. The goal is to build the evidence needed to choose the right retrieval strategy for a production RAG system.

Key takeaways

Evaluate retrieval and generated answers separately. A model can retrieve relevant evidence and still produce an unsupported or incorrect answer.
Use SQL or NL2SQL for current structured facts, and use keyword, vector, or hybrid retrieval for unstructured content. Route mixed questions across both.
Compare keyword, vector, and RRF hybrid retrieval against the same ground-truth question set before choosing a production default.
Test freshness, tenant isolation, metadata filters, exact identifiers, citations, and abstention alongside average retrieval metrics.
Treat hybrid search as a measured option, not an automatic winner. Keep numerical claims unpublished until they are traceable to the notebook exports.

What does production RAG look like?

Short answer: Production RAG is a measured retrieval and answer system. It needs repeatable ingestion, versioned chunks, access controls, freshness rules, citations, abstention, observability, and regression tests. Features such as BM25, vector search, RRF, reranking, HyDE, and incremental indexing only matter when you can prove they improve the answers users actually need.

A common developer question is: "I already have hybrid search, reranking, citations, and a no-hallucination policy. What else makes RAG production ready?"

The missing piece is usually the evaluation harness. Retrieval features are easy to add. Proving that they still work after a chunking change, embedding model swap, schema change, or reranker update is the harder production problem.

A production RAG system should have:

A ground-truth question set with required evidence and expected answer behaviour.
Retrieval metrics for keyword, vector, and hybrid paths, plus query-correctness tests for the SQL route.
Answer evaluation for groundedness, correctness, citation validity, and abstention.
Versioned ingestion, parsing, chunking, embeddings, prompts, and retrieval configuration.
Metadata filters for tenant, permission, source, status, freshness, and document version.
Observability for empty results, retrieval misses, latency, citation failures, and stale evidence.
A rollback path when a new retrieval change improves the average but breaks an important query class.

Chunking needs its own tests. Short support notes, long PDFs, policy documents, tables, code, and product documentation do not share one ideal chunk size. Treat chunk size, overlap, parsing, parent document links, and table handling as versioned configuration. Then test those choices against the same question set before you publish the change.

This is the practical bar: if you cannot detect a broken retrieval change, the system is not production ready yet.

How do I improve a RAG pipeline over a sparse SQL database?

Short answer: Do not embed every row from a sparse SQL database. Route structured questions to SQL, filter null and empty fields before building searchable text, and embed only fields with useful language. Use keyword retrieval for exact IDs and status values, vector retrieval for descriptive text, and metadata filters for valid rows.

A common developer question is: "My database has many empty tables and null columns. I embedded rows, but the model retrieves poor context. How do I make the RAG pipeline efficient?"

The problem is usually not the model. The problem is that the retrieval corpus contains low-information chunks. If a row has many empty columns, turning the whole row into text gives the retriever noise that still competes for context-window space.

Separate the jobs before adding more retrieval tricks:

User question	Best first route	Example
Aggregations, counts, dates, filters	SQL or NL2SQL	"How many open high risks have no mitigation?"
Exact identifiers and controlled values	Keyword plus SQL predicates	"Show risk RSK-1042 with status OPEN"
Narrative similarity	Vector retrieval over meaningful text	"Find incidents involving delayed supplier access"
Mixed exact and semantic intent	Keyword and vector candidates fused with RRF	"OPEN risks similar to the supplier outage"

Create a retrieval view rather than embedding every physical row. The view should include stable IDs, required business metadata, and a deliberate search_text field built from non-empty descriptive columns.

For example, a useful retrieval record might include:

risk_id: RSK-1042
tenant_id: acme
status: OPEN
severity: HIGH
search_text: Supplier access delay caused a missed shipment window. Mitigation owner is reviewing backup routing options.

That is different from serialising twenty columns where half the fields are empty. Preserve nulls as database state for SQL reasoning. Do not turn the word "null" into semantic content unless the absence itself is the thing being searched.
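As a minimal sketch of the idea, the function below builds search_text from a chosen list of descriptive columns and skips null or blank values. The column names description, mitigation_note, root_cause, and workaround are hypothetical, as is the function name buildSearchText. In practice this logic would more likely live in the retrieval view itself.

```typescript
type RowValue = string | number | null | undefined;

// Builds search text from descriptive columns only, skipping null and blank values.
// Column names passed in are assumptions for illustration.
function buildSearchText(row: Record<string, RowValue>, descriptiveColumns: string[]): string {
  const parts: string[] = [];
  for (const column of descriptiveColumns) {
    const value = row[column];
    if (value === null || value === undefined) continue; // never emit the word "null"
    const text = String(value).trim();
    if (text.length > 0) parts.push(text);
  }
  return parts.join(" ");
}

const riskRow = {
  risk_id: "RSK-1042",
  tenant_id: "acme",
  status: "OPEN",
  severity: "HIGH",
  description: "Supplier access delay caused a missed shipment window.",
  mitigation_note: "Mitigation owner is reviewing backup routing options.",
  root_cause: null,
  workaround: "   ",
};

// IDs and controlled values stay as metadata; only descriptive text is embedded.
const searchText = buildSearchText(riskRow, [
  "description",
  "mitigation_note",
  "root_cause",
  "workaround",
]);
```

Here searchText contains only the two descriptive sentences. The empty root_cause and workaround columns add nothing, while risk_id, tenant_id, status, and severity remain available for SQL predicates and metadata filters.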

For sparse relational data, evaluate routes separately before combining them:

SQL quality: does the generated or selected SQL return the correct rows?
keyword quality: do exact IDs, codes, statuses, and controlled terms match reliably?
vector quality: do descriptive fields retrieve semantically related incidents or risks?
hybrid quality: does RRF improve mixed queries without adding noisy rows?

RRF is useful when both keyword and vector result lists contain useful evidence. It is not a cleanup step for a bad corpus.

How should RAG handle real-time dynamic data?

Short answer: Frequently changing structured data should usually be queried live, not re-embedded on every update. Use SQL or NL2SQL for current rows, and reserve RAG for unstructured text that benefits from semantic retrieval. For changing documents, update affected embeddings, track freshness metadata, and test stale-versus-current answer behaviour.

A common developer question is: "My backend data updates every five minutes. Should I use RAG, SQL, caching, MCP, tool calling, or something else?"

Start by asking what kind of data needs to be fresh.

If the answer lives in current structured rows, query the database at request time. Re-embedding the full dataset every five minutes creates a constant race with the source of truth. The vector copy can become stale before the indexing job finishes.

If the answer lives in unstructured documents, use retrieval. But make freshness explicit. Store source timestamps, version IDs, current-version flags, ingestion times, and deletion state. Then evaluate whether the system chooses the current evidence instead of a stale chunk.
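One way to make that check concrete is to carry freshness metadata on every retrieved chunk and test it explicitly. The sketch below assumes a hypothetical chunk shape with isCurrent and deleted flags. The field and function names are illustrative, not taken from the notebook.

```typescript
// Hypothetical chunk metadata; field names are assumptions for illustration.
type Chunk = {
  chunkId: string;
  docId: string;
  version: number;
  isCurrent: boolean;
  deleted: boolean;
};

// Retrieval-side guard: keep only evidence that is current and not deleted.
function currentEvidence(chunks: Chunk[]): Chunk[] {
  return chunks.filter((c) => c.isCurrent && !c.deleted);
}

// Evaluation-side check: IDs of chunks an answer used that were stale or deleted.
// An empty result means the answer relied on current evidence only.
function staleChunkIds(usedChunks: Chunk[]): string[] {
  return usedChunks.filter((c) => !c.isCurrent || c.deleted).map((c) => c.chunkId);
}
```

The first function belongs in the retrieval path, ideally as a metadata filter in the query itself. The second belongs in the evaluation harness, where a non-empty result marks a stale-evidence failure for that test case.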

Use a simple routing model:

Need	Route
Current account, risk, ticket, inventory, or status data	SQL or NL2SQL against governed live tables
Policies, manuals, support notes, or PDFs	Keyword, vector, or hybrid retrieval over indexed documents
Current facts plus explanatory documents	SQL for current state, retrieval for explanation, answer composition with citations
Cross-session user preferences or agent context	Scoped agent memory, not a document index

Caching, MCP, and tool calling are useful, but they solve different problems.

A cache reduces repeated work, but it must be invalidated when source data changes. MCP can expose database tools to an assistant, but it does not make stale data fresh. Tool calling lets a planner choose SQL, retrieval, or another service, but every tool still needs permissions, timeouts, result limits, and traceable outputs.

The production pattern is not "embed everything." It is "route each question to the freshest authoritative source, then evaluate whether the answer used that source correctly."

When should I use SQL, vector search, keyword search, or hybrid search?

Short answer: Use SQL for structured facts, keyword search for exact terms, vector search for semantic similarity, and hybrid search when the same workload contains both lexical and semantic queries. The right choice depends on query distribution, freshness needs, permission rules, latency, and measured retrieval quality.

Each retrieval path has a job.

SQL is the right first route when the question is about rows, filters, joins, dates, counts, totals, statuses, permissions, or current business state. If the data already has structure, keep using it.

Keyword search is strong when the user supplies exact terms: error codes, product names, SKUs, ticket IDs, risk IDs, function names, policy clauses, and other tokens where spelling matters.

Vector search is useful when the query and the document use different language for the same concept. It helps with paraphrases, fuzzy intent, natural-language descriptions, and vocabulary mismatch.

Hybrid search is appropriate when the workload has both patterns and both result lists contribute. A common implementation is Reciprocal Rank Fusion, or RRF. It retrieves candidates from keyword and vector search independently, then fuses ranks:

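RRF score(document) = sum over result lists of 1 / (60 + rank of the document in that list)

Here 60 is the smoothing constant commonly used for RRF. A document that does not appear in a list contributes nothing for that list.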

RRF avoids pretending that keyword scores and vector distances live on the same numeric scale. A document found by both routes receives contributions from both rankings. A document found by only one route can still survive if it ranks well enough.
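A minimal sketch of the fusion step, assuming ranks start at 1 and each list contains a document at most once. The function name rrfFuse and the sample document IDs are illustrative. The notebook or the database's own hybrid search may implement fusion differently.

```typescript
// Reciprocal Rank Fusion over any number of ranked ID lists (best first).
// Assumptions: ranks are 1-based, no duplicate IDs within a list, k defaults to 60.
function rrfFuse(rankedLists: string[][], k = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      // Each list contributes 1 / (k + rank); absent documents contribute nothing.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  const fused: Array<{ id: string; score: number }> = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  // Highest fused score first.
  return fused.sort((a, b) => b.score - a.score);
}

const keywordHits = ["doc_a", "doc_b", "doc_c"];
const vectorHits = ["doc_b", "doc_d", "doc_a"];
const fusedHits = rrfFuse([keywordHits, vectorHits]);
// Fused order: doc_b, doc_a, doc_d, doc_c
```

doc_b wins because it ranks well in both lists (1/62 + 1/61), doc_a follows (1/61 + 1/63), and the documents found by only one route trail behind but still survive.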

The mistake is using "hybrid" as a default badge of seriousness. Hybrid retrieval adds query work, tuning, latency, and operational complexity. Add it when the evaluation shows it improves the questions that matter.

What metrics should I use for RAG evaluation?

Short answer: Use retrieval metrics to test whether the system finds the right evidence, then answer metrics to test whether the model uses that evidence correctly. Retrieval metrics include NDCG, MAP, recall, and precision. Answer metrics should cover groundedness, correctness, citation validity, and abstention quality.

RAG evaluation has two layers that should not be collapsed into one score.

First, evaluate retrieval. Retrieval metrics are model-independent and can run without a generation API. That makes them useful for frequent regression checks.

Metric	What it tells you	Useful question
NDCG@k	Whether relevant documents appear high in the ranking	Did the retriever put the best evidence near the top?
Recall@k	How much known relevant evidence was recovered	Did retrieval miss evidence the answer needed?
Precision@k	How much of the returned set was relevant	How much noise did retrieval add?
MAP@k	Ranking quality across multiple queries	Is performance consistently useful across the test set?
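For readers who want to see how these numbers are computed, here is a minimal sketch of Precision@k, Recall@k, and NDCG@k under binary relevance (a document is either relevant or not). The function names and sample IDs are illustrative, and the sketch assumes the ranked list has no duplicates. MAP is left out because its normalisation varies between implementations. The notebook's own metric code is the reference for any published figure.

```typescript
// Binary-relevance retrieval metrics. `ranked` is best-first with no duplicates.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / k;
}

function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

function ndcgAtK(ranked: string[], relevant: Set<string>, k: number): number {
  // DCG: each relevant hit at 0-based position i is discounted by log2(i + 2).
  let dcg = 0;
  ranked.slice(0, k).forEach((id, i) => {
    if (relevant.has(id)) dcg += 1 / Math.log2(i + 2);
  });
  // Ideal DCG: all relevant documents packed at the top of the ranking.
  const idealHits = Math.min(relevant.size, k);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);
  return idcg > 0 ? dcg / idcg : 0;
}

const rankedIds = ["d1", "d7", "d3", "d9", "d2"];
const relevantIds = new Set(["d1", "d2", "d4"]);
const p5 = precisionAtK(rankedIds, relevantIds, 5); // 2 hits out of 5 returned
const r5 = recallAtK(rankedIds, relevantIds, 5); // 2 of 3 relevant documents found
const n5 = ndcgAtK(rankedIds, relevantIds, 5);
```

In this toy case precision is 2/5 and recall is 2/3. NDCG is lower than it would be if d2 had ranked second instead of fifth, which is exactly the ranking-position effect the metric is meant to capture.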

The notebook records retrieval results for keyword, vector, and RRF hybrid retrieval:

Method	NDCG@10	MAP@10	Recall@10	Precision@10
Keyword	{{KEYWORD_NDCG_10}}	{{KEYWORD_MAP_10}}	{{KEYWORD_RECALL_10}}	{{KEYWORD_PRECISION_10}}
Vector	{{VECTOR_NDCG_10}}	{{VECTOR_MAP_10}}	{{VECTOR_RECALL_10}}	{{VECTOR_PRECISION_10}}
RRF hybrid	{{HYBRID_NDCG_10}}	{{HYBRID_MAP_10}}	{{HYBRID_RECALL_10}}	{{HYBRID_PRECISION_10}}

These placeholders must stay placeholders until the notebook has been run cleanly and the exported metrics have been reviewed. Do not turn them into claims manually.

Second, evaluate generated answers. A retrieved document can be relevant while the generated answer is still wrong, unsupported, overconfident, or badly cited.

Use answer-level checks for:

groundedness: does the answer stay within the retrieved context?
correctness: does it match the reference answer?
citation validity: do cited documents support the claims attached to them?
abstention quality: does the system refuse when the corpus does not contain enough evidence?

Model-based judging is a review signal, not ground truth. Keep the raw generated answers, retrieved IDs, reference answers, scores, and rationales so a human can inspect surprising results.
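Two of these checks can be partly automated without a judge model. The sketch below assumes a hypothetical per-question record that stores retrieved IDs, cited IDs, whether the system abstained, and a ground-truth answerable label. The citation check is structural only: it confirms that a cited document was actually retrieved, not that the document supports the claim.

```typescript
// Hypothetical evaluation record; field names are assumptions for illustration.
type AnswerRecord = {
  retrievedIds: string[];
  citedIds: string[];
  abstained: boolean;
  answerable: boolean; // ground-truth label from the question set
};

// Structural citation check: cited IDs that were never retrieved.
function invalidCitations(record: AnswerRecord): string[] {
  const retrieved = new Set(record.retrievedIds);
  return record.citedIds.filter((id) => !retrieved.has(id));
}

type AbstentionOutcome = "answered" | "correct_refusal" | "false_refusal" | "false_answer";

// Compares abstention behaviour with the ground-truth answerable label.
function abstentionOutcome(record: AnswerRecord): AbstentionOutcome {
  if (record.answerable) return record.abstained ? "false_refusal" : "answered";
  return record.abstained ? "correct_refusal" : "false_answer";
}
```

Counting false_answer and false_refusal outcomes separately matters because they fail in opposite directions. A stricter prompt usually trades one for the other.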

How do I test production failure modes?

Short answer: Add a production challenge set next to the clean benchmark. Include exact identifiers, paraphrases, stale and current versions, tenant isolation, metadata filtering, messy questions, multi-hop questions, and unsupported questions. These cases catch the failures that average retrieval scores often hide.

A clean benchmark is useful, but production traffic is not clean. Developers need a small challenge set that reflects the application’s actual risk.

The notebook includes ten production-style cases:

An exact identifier query.
A paraphrase query.
A stale version query that should not use old evidence.
A current version query that should prefer the latest evidence.
A tenant isolation query.
A messy user question with irrelevant wording.
A question requiring two documents.
An unsupported question where the system should abstain.
A lexical entity query.
An embedding-migration sequence question.

Those cases should be adapted to the domain before publication. For a risk-register system, include risk IDs, mitigation statuses, sparse rows, tenant boundaries, and null-heavy records. For a support assistant, include error codes, product versions, stale documentation, and unsupported product claims. For a real-time operational chatbot, include recently changed records and cache-invalidation cases.

The key is to keep the challenge set stable. Run it before changing chunking, embedding models, metadata filters, SQL generation, rerankers, or prompts. If a change improves the average but breaks tenant isolation or stale-data handling, it is not an improvement.
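That rule can be encoded as a simple release gate. The sketch below assumes per-category pass rates between 0 and 1 from the challenge set. The category names, scores, and function names are made up for illustration.

```typescript
// Pass rate per challenge category, between 0 and 1. Names are illustrative.
type CategoryScores = Record<string, number>;

function meanScore(scores: CategoryScores): number {
  const keys = Object.keys(scores);
  if (keys.length === 0) return 0;
  return keys.reduce((sum, key) => sum + scores[key], 0) / keys.length;
}

// Critical categories where the candidate scores worse than the baseline.
// A category missing from either run is treated as a score of 0.
function criticalRegressions(
  baseline: CategoryScores,
  candidate: CategoryScores,
  critical: string[],
  tolerance = 0,
): string[] {
  return critical.filter((cat) => (candidate[cat] ?? 0) < (baseline[cat] ?? 0) - tolerance);
}

const baselineRun: CategoryScores = { exact_id: 0.9, paraphrase: 0.6, tenant_isolation: 1.0 };
const candidateRun: CategoryScores = { exact_id: 0.9, paraphrase: 0.9, tenant_isolation: 0.8 };

const averageImproved = meanScore(candidateRun) > meanScore(baselineRun);
const blocked = criticalRegressions(baselineRun, candidateRun, ["tenant_isolation", "exact_id"]);
// averageImproved is true, yet blocked contains "tenant_isolation", so the change should not ship.
```

The gate deliberately ignores the average. A change is accepted only when the list of critical regressions is empty.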

Decision guide: keyword, vector, SQL, or hybrid

Use the query type and failure cost to pick the starting route.

Your workload contains	Start with	Then test
Current structured facts	SQL or NL2SQL	Query correctness, permissions, freshness, and safe result limits
Error codes, IDs, SKUs, names, exact clauses	Keyword retrieval	Whether vector search improves paraphrases without losing exact matches
Natural-language questions and vocabulary mismatch	Vector retrieval	Exact-identifier failures and metadata constraints
Both lexical and semantic questions	RRF hybrid retrieval	Candidate depth, latency, and ranking gains over both baselines
Sparse relational tables	SQL plus selective semantic fields	Null handling, exact IDs, and route-level quality
Data that changes every few minutes	Live SQL for structured data, incremental indexing for documents	Stale-versus-current answer behaviour
Regulated or multi-tenant data	Any method with mandatory metadata filters	Isolation, auditability, deletion, and freshness
Unsupported questions	Retrieval plus abstention policy	False answers and false refusals

The important move is not choosing one retrieval method forever. It is making the retrieval route explicit, measuring it, and changing it only when the evidence says the system gets better.

Frequently asked questions about production RAG evaluation

When should I use hybrid search instead of vector search for RAG?

Short answer: Use hybrid search when a workload contains both exact lexical queries and semantic queries, and when both candidate lists contribute relevant evidence. It is not an automatic upgrade: it can add latency and tuning work. Compare hybrid retrieval against keyword and vector baselines on the same question set before adopting it.

Can I evaluate RAG without an LLM API?

Short answer: Yes. Retrieval evaluation can run without a generation model. Use relevance judgements to calculate NDCG, MAP, recall, and precision for each retrieval method. Add answer-level evaluation later to test groundedness, correctness, citation validity, and abstention.

Should I use RAG or NL2SQL for a SQL database?

Short answer: Use SQL or NL2SQL when the answer depends on structured rows, joins, filters, dates, counts, or current state. Use RAG for unstructured documents and descriptive text. Many production applications need routing: SQL for live facts and retrieval for supporting explanations.

What is Reciprocal Rank Fusion in hybrid search?

Short answer: Reciprocal Rank Fusion, or RRF, combines independently ranked keyword and vector results by rank position. It avoids comparing raw keyword scores with vector distances directly. Documents that rank highly in either list can survive, while documents found by both methods receive contributions from both rankings.

How often should I evaluate a production RAG system?

Short answer: Run retrieval regression tests whenever chunking, parsing, embeddings, indexes, filters, rerankers, prompts, or source schemas change. Also run a scheduled production challenge set to detect corpus drift, stale evidence, permission failures, and changing user-query patterns.

Next steps

For Oracle AI Database RAG, start with the implementation and documentation that match the route you need:

Run the production RAG evaluation notebook to compare keyword, vector, and RRF hybrid retrieval and export the evidence used by this article.
Read the Oracle AI Vector Search User's Guide for vector data, indexes, similarity search, and retrieval implementation details.
Understand hybrid search in Oracle AI Database for keyword and semantic search modes, RRF, weighted RRF, and score fusion.
Use Select AI for natural-language interaction with a database when current structured data should be queried through SQL or NL2SQL.
Get started with Oracle AI Agent Memory when the application also needs persistent, scoped memory across agent sessions.

How to Troubleshoot Vector Search in an AI Application

Anya Summers — Thu, 23 Jul 2026 16:16:20 +0000

Short answer: Troubleshoot vector search by isolating the retrieval pipeline one stage at a time. Check that document and query embeddings use the same model, dimensions match the vector column and index, chunks preserve useful context, filters are not hiding relevant rows, top-k results contain expected evidence, and hybrid search is tested when exact terms matter.

When an AI application gives a bad answer, the model may not be the problem. Often the answer was not present in the retrieved context. Vector search troubleshooting should start before generation: inspect embeddings, chunk text, metadata filters, similarity scores, index state, and retrieval results directly.

For Oracle AI Database, the useful posture is simple: keep vectors, source metadata, SQL filters, and evaluation checks close enough that you can debug retrieval like an application data path, not like a black box.

Key takeaways:

Debug retrieval before changing prompts.
Check embedding consistency and vector dimensions first.
Use known-query tests and the RAG evaluation notebook to measure changes.

How do I know whether vector search is the problem?

Short answer: Run the user query, inspect the top-k chunks or rows, and ask whether the answer is actually present. If the expected evidence is missing, too low, stale, filtered out, or duplicated, the failure is retrieval-side. If the evidence is correct but the answer is wrong, investigate generation, prompting, citations, or answer evaluation.

Use a two-step diagnostic:

Run retrieval only.
Inspect the returned chunks without the LLM.

Ask:

Are the expected source documents in top-k?
Is the correct chunk ranked high enough?
Are similarity scores clustered or clearly separated?
Are irrelevant chunks crowding out useful evidence?
Did a metadata, tenant, version, or permission filter remove the right answer?
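The first questions above can be answered mechanically once retrieval runs on its own. This is a sketch that assumes results arrive as `(chunk_id, similarity)` pairs sorted best-first, with higher similarity meaning a closer match; `diagnose_retrieval` is a hypothetical helper name.

```python
def diagnose_retrieval(results, expected_ids, k=5):
    """Report where expected chunks rank and how spread out the scores are.

    results: list of (chunk_id, similarity), best first.
    expected_ids: chunk IDs a human marked as the right evidence.
    """
    top = results[:k]
    rank_by_id = {chunk_id: i for i, (chunk_id, _) in enumerate(top, start=1)}
    ranks = {cid: rank_by_id.get(cid) for cid in expected_ids}
    # A tiny spread suggests the scores are clustered, not clearly separated.
    spread = top[0][1] - top[-1][1] if top else 0.0
    return {
        "ranks": ranks,
        "missing": [cid for cid, rank in ranks.items() if rank is None],
        "score_spread": spread,
    }

# Hypothetical retrieval output for one user query.
results = [("c7", 0.82), ("c2", 0.81), ("c9", 0.80), ("c4", 0.79)]
report = diagnose_retrieval(results, ["c2", "c5"], k=3)
print(report)
```

A chunk listed under `missing` points to a retrieval-side failure; the filter and permission question still needs a separate check against the unfiltered data.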

Key takeaways:

Retrieval inspection is faster than prompt tuning.
Top-k evidence should be readable and attributable.
If the answer is not in retrieved context, generation cannot reliably fix it.

Are my embeddings consistent?

Short answer: Poor vector search often starts with inconsistent embeddings. Use the same embedding model and preprocessing path for documents and queries. Record the embedding model ID, vector dimension, chunking version, parser version, and embedding timestamp so mixed-model or stale-vector problems are visible.

Check for:

document embeddings created by one model and query embeddings created by another
vectors with dimensions that do not match the table or index configuration
a partial embedding-model migration
preprocessing differences between ingestion and query time
HTML, boilerplate, or table formatting embedded inconsistently

For production systems, store the embedding model name and version with every vector row. That makes troubleshooting possible when quality drops after a model or preprocessing change.
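A sketch of that audit, assuming each vector row has been fetched as a dict with a `model_id` and a `vector` field. The field names and model IDs are hypothetical.

```python
from collections import Counter

def audit_embeddings(rows, expected_model, expected_dim):
    """Flag rows whose embedding model or vector dimension is unexpected.

    rows: iterable of dicts with 'chunk_id', 'model_id', and 'vector'.
    """
    model_counts = Counter()
    wrong_model, wrong_dim = [], []
    for row in rows:
        model_counts[row["model_id"]] += 1
        if row["model_id"] != expected_model:
            wrong_model.append(row["chunk_id"])
        if len(row["vector"]) != expected_dim:
            wrong_dim.append(row["chunk_id"])
    return {
        "model_counts": dict(model_counts),
        "wrong_model": wrong_model,
        "wrong_dim": wrong_dim,
    }

# Hypothetical rows: one left over from an older model, one truncated vector.
rows = [
    {"chunk_id": "c1", "model_id": "embed-v2", "vector": [0.1, 0.2, 0.3]},
    {"chunk_id": "c2", "model_id": "embed-v1", "vector": [0.1, 0.2, 0.3]},
    {"chunk_id": "c3", "model_id": "embed-v2", "vector": [0.1, 0.2]},
]
audit = audit_embeddings(rows, expected_model="embed-v2", expected_dim=3)
print(audit)
```

More than one key in `model_counts` is the signature of a partial migration.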

Key takeaways:

Embedding model drift can look like poor search quality.
Vector dimensions should be validated before retrieval tests.
Store embedding metadata with the vector row.

Are chunking and parsing hurting retrieval?

Short answer: Chunking can make vector search fail even when embeddings and indexes are correct. Chunks that are too large mix unrelated topics. Chunks that are too small lose context. Tables, PDFs, code, transcripts, and nested sections need structure-aware parsing before they become useful retrieval evidence.

Manually inspect the chunks retrieved for failed queries. Look for:

headings missing from the chunk
tables split away from headers
sentences cut in the middle
repeated boilerplate dominating embeddings
source IDs or page references missing
transcript chunks without speakers or timestamps

DBMS_VECTOR_CHAIN can help build text-processing and chunking workflows, but the production decision is still empirical: run the same known-query set before and after a chunking change.

Key takeaways:

Good chunk text should make sense outside the original document.
Bad parsing can make every vector index look weak.
Chunking changes need retrieval regression tests.

Are filters, ACLs, or metadata hiding good results?

Short answer: Vector search can return poor results when the right evidence exists but mandatory filters exclude it. Check tenant IDs, ACLs, document status, source version, language, date range, delete flags, and current-version markers before assuming similarity search failed.

This is common in enterprise RAG. A query can fail for any of these reasons:

Filter issue	Symptom	Troubleshooting check
Tenant mismatch	Correct document does not appear	Run the same query with expected tenant scope
Stale current flag	Old content ranks or new content is hidden	Compare source timestamp and chunk state
Delete flag missing	Removed content still appears	Verify deleted chunks are excluded
Over-strict metadata filter	Top-k is empty or weak	Test filters one at a time
ACL propagation gap	User sees too much or too little	Compare source ACLs with chunk metadata
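Testing filters one at a time can be sketched as follows, assuming the metadata of the chunk that should have been retrieved is available as a dict and each mandatory filter can be written as a predicate. The filter names and values are hypothetical.

```python
def find_blocking_filters(row, filters):
    """Return the names of filters that reject the expected row.

    row: metadata dict for the chunk that should have been retrieved.
    filters: mapping of filter name -> predicate taking the row.
    """
    return [name for name, predicate in filters.items() if not predicate(row)]

# Hypothetical chunk metadata and the filters the application applies.
expected_chunk = {
    "tenant_id": "acme",
    "is_current": True,
    "is_deleted": False,
    "language": "en",
}
filters = {
    "tenant": lambda r: r["tenant_id"] == "globex",  # wrong tenant scope
    "current": lambda r: r["is_current"],
    "not_deleted": lambda r: not r["is_deleted"],
    "language": lambda r: r["language"] == "en",
}
print(find_blocking_filters(expected_chunk, filters))  # ['tenant']
```

An empty list means the filters are not the reason the chunk is missing, so the next check is ranking or indexing.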

Key takeaways:

Metadata filters are part of retrieval, not a post-processing detail.
Empty or weak results can be a filter problem.
Troubleshooting should log both the vector query and the applied filters.

Is the vector index configured and queried correctly?

Short answer: Validate that vectors are stored in the expected column, the query uses the same vector field, the index is built and usable, the distance metric matches the embedding model, and the top-k or threshold setting is appropriate for the application.

At minimum, check:

the vector column is populated
the query embedding has the expected dimension
the index exists and is available
the query uses the intended vector column
the similarity metric is appropriate
top-k is large enough to expose relevant candidates
thresholds are not cutting off acceptable results

If approximate nearest neighbor settings are used, test recall against a smaller exact or high-recall baseline when possible. ANN tuning is a tradeoff between latency and recall; do not tune it without a known-query set.
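One way to check ANN recall is to compare its output with a brute-force search over a small sample. This sketch assumes non-zero vectors and cosine distance; `ann_result` stands in for whatever the approximate index returned and is hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def exact_top_k(query, vectors, k):
    """Brute-force nearest neighbours over a small sample: the baseline."""
    ranked = sorted(vectors, key=lambda cid: cosine_distance(query, vectors[cid]))
    return ranked[:k]

def ann_recall(ann_ids, exact_ids):
    """Share of the exact top-k that the approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)

# Hypothetical sample of stored vectors keyed by chunk ID.
vectors = {
    "c1": [1.0, 0.0],
    "c2": [0.9, 0.1],
    "c3": [0.0, 1.0],
    "c4": [0.7, 0.7],
}
query = [1.0, 0.05]
baseline = exact_top_k(query, vectors, k=2)
ann_result = ["c1", "c4"]  # hypothetical output of an approximate index
print(baseline, ann_recall(ann_result, baseline))
```

Averaging this recall over the known-query set before and after a tuning change shows what the latency gain cost in missed neighbours.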

Key takeaways:

Index state and query field mistakes are common and easy to miss.
Top-k and threshold settings should be measured, not guessed.
ANN tuning should be validated against retrieval quality.

When should I compare vector-only and hybrid search?

Short answer: Compare vector-only and hybrid search when users ask for exact identifiers, error codes, product names, file names, commands, versions, or clauses. Vector search handles semantic similarity. Keyword search handles exact lexical evidence. Hybrid search should prove it improves the workload before becoming the default.

Use this decision table:

Query type	Likely failure in vector-only search	What to test
Error code or product ID	Exact token gets buried	Keyword and hybrid retrieval
Natural-language paraphrase	Keyword may miss synonyms	Vector retrieval
Clause, command, or file name	Similar text outranks exact text	Hybrid retrieval with RRF
Mixed exact and semantic query	One signal dominates	Keyword, vector, and hybrid baselines

Oracle AI Database supports hybrid search patterns that combine full-text and vector similarity search. The key is to compare routes against the same questions, not separate anecdotes.

Key takeaways:

Hybrid search is a measured option, not a badge.
Exact identifiers need lexical signals.
The RAG evaluation notebook is the right proof path for comparisons.

How do I measure whether troubleshooting improved retrieval?

Short answer: Create a small evaluation set with known questions, expected documents, forbidden documents, expected rank, and notes about failure mode. Track recall@k, precision@k, MRR, NDCG, and examples of failed queries before and after each change.

Start with a table like this:

Question	Expected evidence	Failure mode	Pass condition
"How do I reset my API key?"	API key rotation guide	Chunking or filter miss	Expected chunk appears in top 5
"ORA-12345 error"	Error reference	Exact identifier miss	Exact match appears in top 3
"Current refund policy"	Latest policy version	Stale embedding	Current version appears; old version excluded

The RAG evaluation notebook already gives a useful pattern: separate retrieval methods, run a challenge set, export metrics, and keep a manifest so results are reproducible.
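The evaluation table can be turned into a small regression runner. This is a sketch, not the notebook's code: `run_eval`, the case fields, and the `fake_index` retriever are hypothetical stand-ins for a real retrieval call.

```python
def rank_of(ranked_ids, doc_id):
    """1-based rank of doc_id in the results, or None if absent."""
    return ranked_ids.index(doc_id) + 1 if doc_id in ranked_ids else None

def run_eval(cases, retrieve):
    """Run each case through `retrieve` and return per-case rows plus MRR.

    cases: dicts with 'question', 'expected', 'max_rank', optional 'forbidden'.
    retrieve: callable mapping a question to a ranked list of document IDs.
    """
    rows = []
    for case in cases:
        ranked = retrieve(case["question"])
        rank = rank_of(ranked, case["expected"])
        leaked = [d for d in case.get("forbidden", []) if d in ranked]
        passed = rank is not None and rank <= case["max_rank"] and not leaked
        rows.append({"question": case["question"], "rank": rank,
                     "leaked": leaked, "passed": passed})
    mrr = sum(1.0 / r["rank"] for r in rows if r["rank"]) / len(rows) if rows else 0.0
    return rows, mrr

# Hypothetical retriever backed by canned results.
fake_index = {
    "How do I reset my API key?": ["kb-12", "kb-api-rotation", "kb-40"],
    "Current refund policy": ["policy-v1", "policy-v2"],
}
cases = [
    {"question": "How do I reset my API key?", "expected": "kb-api-rotation", "max_rank": 5},
    {"question": "Current refund policy", "expected": "policy-v2",
     "forbidden": ["policy-v1"], "max_rank": 3},
]
rows, mrr = run_eval(cases, lambda q: fake_index.get(q, []))
for row in rows:
    print(row)
print("MRR:", mrr)
```

The second case fails even though the current policy is retrieved, because the old version is still in the results.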

Key takeaways:

Troubleshooting needs before-and-after metrics.
Keep failed queries as regression tests.
Do not publish a fix until the known-query set improves.

How do I troubleshoot vector search with Oracle AI Database?

Short answer: Use Oracle AI Database to inspect vectors, chunks, metadata, filters, source state, and retrieval routes together. Start with Oracle AI Vector Search for semantic retrieval, add hybrid search when exact terms matter, and use DBMS_VECTOR_CHAIN for repeatable chunking and text-processing workflows.

A practical Oracle troubleshooting workflow:

Confirm vectors exist and dimensions match.
Inspect the exact chunk text returned by top-k.
Run the query with and without metadata filters.
Compare vector-only, keyword, and hybrid retrieval.
Check source timestamps, delete flags, and embedding model versions.
Run the known-query evaluation set and export results.

Key takeaways:

Keep retrieval artifacts and metadata queryable.
Use hybrid search when lexical and semantic evidence both matter.
Treat evaluation output as the proof that troubleshooting worked.

Vector search troubleshooting checklist

Short answer: A healthy vector search path has consistent embeddings, valid dimensions, useful chunks, correct index configuration, appropriate filters, inspected top-k results, known-query tests, and retrieval logs that show what changed when quality improved or regressed.

Use this checklist:

Same embedding model for indexing and querying.
Matching vector dimensions.
Correct vector column queried.
Index exists and is usable.
Chunk text is readable and context-rich.
Metadata filters are visible and testable.
Top-k results include expected evidence.
Similarity thresholds do not remove good candidates.
Hybrid search is tested for exact identifiers.
Stale and duplicate embeddings are monitored.
Retrieval metrics are logged over time.

Key takeaways:

Troubleshooting vector search is pipeline debugging.
Most fixes should be measurable with retrieval metrics.
Oracle AI Database helps most when vectors, SQL filters, source metadata, and evaluation evidence stay close together.

How to Detect RAG Index Drift: Deleted Docs, Stale Chunks, and Duplicate Embeddings

Anya Summers — Thu, 23 Jul 2026 16:14:54 +0000

Short answer: Detect RAG index drift by reconciling source records, chunk hashes, deletion markers, embedding versions, and retrieval results against the source of truth. A production RAG system should prove that deleted content no longer retrieves, updated content replaces stale chunks, duplicate embeddings are suppressed, and freshness filters are tested before users find stale citations.

The boring production RAG failures are usually the expensive ones. A document is updated, but the old chunks still rank. A source row is deleted, but its embedding remains active. A re-ingestion job runs twice, creates near-duplicates, and retrieval starts returning conflicting evidence.

That is RAG index drift. It is not a model problem first. It is a lifecycle problem across source data, chunks, embeddings, metadata, indexes, and evaluation.

Key takeaways:

Treat the source of truth as authoritative, not the vector index.
Track chunk lifecycle state explicitly: current, superseded, deleted, failed, or quarantined.
Test deletion and duplicate handling as production behaviours, not cleanup chores.

What is RAG index drift?

Short answer: RAG index drift happens when the retrieval index no longer matches the authoritative source. The source changed, but chunks, embeddings, metadata, or retrieval filters did not change with it. The result is stale evidence, orphan embeddings, duplicate chunks, wrong citations, and answers grounded in content that should no longer be active.

A RAG system has at least two views of the world:

Layer	What it believes
Source system	The current document, row, policy, ticket, or file state
Retrieval system	The chunks, embeddings, metadata, and indexes available to search

Drift appears when those views diverge. It can happen after a failed ingestion job, a partial batch update, a source-system delete, an embedding-model migration, a parser change, or a re-ingestion run that does not enforce idempotency.

Key takeaways:

Drift is a source-to-index consistency problem.
Vector search can faithfully retrieve content that should no longer exist.
Freshness metadata is only useful if retrieval filters enforce it.

Why do deleted documents still show up in retrieval?

Short answer: Deleted documents still show up when deletion is handled in the source system but not propagated to chunk rows, embedding rows, metadata filters, and search indexes. A delete event must either remove derived retrieval records or mark them inactive before they can appear in top-k results.

"We deleted it from the database" and "it no longer shows up in retrieval" are different claims. RAG creates derived artifacts: chunks, embeddings, summaries, cached answers, and sometimes reranker inputs. If only the source row is deleted, derived records can continue to rank.

Use explicit deletion semantics:

Delete pattern	Risk	Safer production behaviour
Hard delete source only	Orphan chunks remain searchable	Cascade or reconcile derived records
Soft delete source	Retriever ignores source state	Add mandatory `is_current` and `is_deleted` filters
Re-ingest after delete	Deleted content returns as a new chunk	Use source IDs and tombstones
Cache survives delete	Bot cites removed content	Invalidate answer and retrieval caches
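The tombstone and mandatory-filter rows can be sketched with an in-memory stand-in for the chunk table. `ChunkStore` and its methods are hypothetical; in a database the same rules would live in the ingestion job and in the retrieval query's filter clause.

```python
class ChunkStore:
    """In-memory stand-in for a chunk table with tombstones."""

    def __init__(self):
        self.chunks = {}         # chunk_id -> row dict
        self.tombstones = set()  # source IDs that were deleted

    def ingest(self, chunk_id, source_id, text):
        # Refuse to resurrect content whose source has a tombstone.
        if source_id in self.tombstones:
            return False
        self.chunks[chunk_id] = {"source_id": source_id, "text": text,
                                 "is_current": True, "is_deleted": False}
        return True

    def delete_source(self, source_id):
        # Record the tombstone and mark every derived chunk inactive.
        self.tombstones.add(source_id)
        for row in self.chunks.values():
            if row["source_id"] == source_id:
                row["is_deleted"] = True
                row["is_current"] = False

    def searchable(self):
        # Mandatory filter applied before any similarity ranking.
        return [cid for cid, row in self.chunks.items()
                if row["is_current"] and not row["is_deleted"]]

store = ChunkStore()
store.ingest("c1", "policy-A", "Old refund policy text")
store.ingest("c2", "guide-B", "API key rotation guide")
store.delete_source("policy-A")
reingested = store.ingest("c3", "policy-A", "Old refund policy text")
print(store.searchable(), reingested)
```

Cache invalidation is not modelled here; it would hang off the same `delete_source` event.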

Key takeaways:

Deletion must propagate to every derived retrieval artifact.
Tombstones help prevent deleted content from returning during re-ingestion.
The retrieval query should exclude deleted and superseded content by default.

How do I detect stale chunks and orphan embeddings, and know when to refresh RAG embeddings?

Short answer: Compare the retrieval tables with the source of truth on a schedule. Check source IDs, source modified timestamps, chunk hashes, current-version flags, deletion markers, embedding model IDs, and index participation. Then run challenge queries that verify stale and deleted evidence cannot appear in top-k results.

A reconciliation job should answer simple questions:

Does every active chunk point to an active source?
Does every active source have the expected current chunks?
Did the source text change without the chunk hash changing?
Did the chunk text change without a new embedding?
Are deleted or superseded chunks still searchable?
Are chunks embedded with an old model mixed into the current retrieval path?
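Several of these questions reduce to set and hash comparisons. The sketch below assumes each chunk row stores the hash of the source text it was derived from (`source_hash`, a hypothetical column); the other field names are also illustrative.

```python
import hashlib

def text_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def reconcile(sources, chunks, active_model):
    """Compare active chunks with the source of truth.

    sources: source_id -> {'text', 'is_deleted'}
    chunks: dicts with 'chunk_id', 'source_id', 'source_hash',
            'is_current', 'model_id'
    """
    findings = {"orphan": [], "stale": [], "old_model": [], "missing_chunks": []}
    covered = set()
    for chunk in chunks:
        if not chunk["is_current"]:
            continue
        source = sources.get(chunk["source_id"])
        if source is None or source["is_deleted"]:
            findings["orphan"].append(chunk["chunk_id"])
            continue
        covered.add(chunk["source_id"])
        if chunk["source_hash"] != text_hash(source["text"]):
            findings["stale"].append(chunk["chunk_id"])
        if chunk["model_id"] != active_model:
            findings["old_model"].append(chunk["chunk_id"])
    findings["missing_chunks"] = sorted(
        sid for sid, s in sources.items()
        if not s["is_deleted"] and sid not in covered)
    return findings

# Hypothetical source and chunk state after a partial update.
sources = {
    "s1": {"text": "Refunds within 30 days.", "is_deleted": False},
    "s2": {"text": "Old FAQ", "is_deleted": True},
    "s3": {"text": "New guide", "is_deleted": False},
}
chunks = [
    {"chunk_id": "c1", "source_id": "s1", "is_current": True, "model_id": "embed-v1",
     "source_hash": text_hash("Refunds within 14 days.")},
    {"chunk_id": "c2", "source_id": "s2", "is_current": True, "model_id": "embed-v2",
     "source_hash": text_hash("Old FAQ")},
    {"chunk_id": "c3", "source_id": "s9", "is_current": True, "model_id": "embed-v2",
     "source_hash": ""},
]
findings = reconcile(sources, chunks, active_model="embed-v2")
print(findings)
```

These are data checks only; whether the flagged chunks actually surface in top-k still needs challenge queries.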

The existing RAG evaluation notebook pattern is useful here. Reuse the separation between retrieval methods, challenge questions, exported metrics, and a run manifest. Extend the challenge set with deletion, stale-version, duplicate, and orphan-vector cases.

Key takeaways:

Reconciliation detects corpus integrity problems.
Retrieval evaluation detects whether those problems affect user-visible answers.
Keep a run manifest so drift checks are repeatable and debuggable.

How should I handle re-ingestion duplicates?

Short answer: Make ingestion idempotent. Use stable source IDs, chunk ordinals, canonical text hashes, parser-version metadata, embedding-model metadata, and uniqueness rules. Near-duplicate chunks should be merged, superseded, or quarantined before they compete in retrieval.

Duplicates are not harmless. Two near-identical chunks can both rank, crowd out better evidence, or disagree because one is stale. The user sees this as confused citations or answers that mix versions.

Use duplicate controls:

Duplicate type	Detection signal	Action
Same source, same chunk hash	Exact duplicate	Skip insert or update existing row
Same source, new chunk hash	Source changed	Supersede old chunk and embed new chunk
Different source, near-identical text	Possible copied content	Keep both only if provenance differs meaningfully
Same text, different embedding model	Model migration artifact	Route by active embedding model version
Same source, different parser version	Parser migration artifact	Compare retrieval quality before promoting
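The first two rows of the table can be sketched as an idempotent upsert keyed by source ID and chunk ordinal. The canonicalisation rule (collapse whitespace, lower-case) and the `needs_embedding` flag are assumptions for illustration; a real pipeline would pick its own normalisation.

```python
import hashlib

def canonical_hash(text):
    """Hash of whitespace-normalised, lower-cased text."""
    canonical = " ".join(text.split()).lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def upsert_chunk(table, source_id, ordinal, text):
    """table: dict keyed by (source_id, ordinal). Returns the action taken."""
    key = (source_id, ordinal)
    new_hash = canonical_hash(text)
    existing = table.get(key)
    if existing and existing["hash"] == new_hash:
        return "skipped"  # exact duplicate: no insert, no re-embedding
    action = "superseded" if existing else "inserted"
    version = existing["version"] + 1 if existing else 1
    table[key] = {"hash": new_hash, "text": text,
                  "version": version, "needs_embedding": True}
    return action

table = {}
first = upsert_chunk(table, "policy-A", 0, "Refunds within 30 days.")
rerun = upsert_chunk(table, "policy-A", 0, "Refunds   within 30 days. ")
changed = upsert_chunk(table, "policy-A", 0, "Refunds within 14 days.")
print(first, rerun, changed, len(table))
```

Running the ingestion job twice leaves one row per chunk, and only changed text is queued for embedding.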

Key takeaways:

Idempotent ingestion is a production requirement.
Chunk hashes prevent unnecessary re-embedding.
Near-duplicates need policy because they can degrade retrieval quality.

What should a RAG reconciliation job check?

Short answer: A RAG reconciliation job should compare source state, chunk state, embedding state, and retrieval behaviour. It should produce counts, examples, and failed challenge queries rather than only saying the job succeeded.

Use a reconciliation report like this:

Check	What it catches	Example failure
Active chunk has missing source	Orphan embedding	Source document deleted but chunk still active
Source updated after chunk hash	Stale chunk	Policy changed but old text remains searchable
Deleted source returned in top-k	Delete propagation failure	Bot can cite removed document
Duplicate active chunks by source and hash	Non-idempotent ingestion	Batch job inserted the same chunk twice
Near-duplicate active chunks	Re-ingestion or parser drift	Old and new versions compete in retrieval
Embedding model mismatch	Partial migration	Old vectors mixed with current vectors
Current-version flag mismatch	Bad lifecycle state	Superseded chunk marked current

The report should be actionable. Include source IDs, chunk IDs, timestamps, hashes, model versions, retrieval route, and sample queries that expose the issue.

Key takeaways:

Reconciliation needs both data checks and retrieval checks.
Counts are not enough; include examples engineers can inspect.
Failed reconciliation should block promotion of a new ingestion run.

How do I measure whether drift is affecting answers?

Short answer: Add drift cases to the same evaluation harness used for retrieval quality. Test whether deleted documents are absent, current versions beat stale versions, duplicate chunks do not crowd out better evidence, and generated answers cite only active evidence. Track retrieval metrics and answer-quality checks separately.

Start with a small drift challenge set:

Ask for a document that was deleted and should not be cited.
Ask a question where old and new versions both exist; only the current one should rank.
Ask an exact-ID query after source deletion; retrieval should return no active evidence.
Ask a question affected by duplicate chunks; top-k should not be crowded by copies.
Ask after an embedding-model migration; results should come from the active model path.

For retrieval, measure whether forbidden evidence appears in top-k. For answers, measure groundedness, citation validity, and abstention quality. If retrieval returns deleted content, the answer generator is already in a bad position.
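On the answer side, citation validity can be checked against chunk lifecycle state. This sketch assumes the generator returns the chunk IDs it cited and that lifecycle state is available per chunk; both structures are hypothetical.

```python
def check_answer_citations(cited_ids, chunk_state):
    """Flag citations that point at anything other than a current chunk.

    chunk_state: chunk_id -> 'current' | 'superseded' | 'deleted'
    """
    invalid = {cid: chunk_state.get(cid, "unknown")
               for cid in cited_ids
               if chunk_state.get(cid) != "current"}
    return {"valid": not invalid, "invalid_citations": invalid}

# Hypothetical lifecycle state and the citations from one generated answer.
chunk_state = {"c1": "current", "c2": "superseded", "c3": "deleted"}
result = check_answer_citations(["c1", "c3", "c9"], chunk_state)
print(result)
```

An `unknown` citation is worth treating as a failure too, since the answer cites something the retrieval tables cannot account for.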

Key takeaways:

Drift tests belong in the production evaluation set.
Forbidden evidence is as important as required evidence.
Do not publish freshness claims without traceable evaluation results.

How do I keep generated code from using stale database patterns?

Short answer: Treat AI-generated implementation code as another drift surface. Coding assistants can produce outdated connection-pooling or driver patterns when they are not grounded in current documentation. For database-backed RAG, verify generated code against the current driver docs, pin package versions, and add tests for pool creation, acquisition, release, health, and shutdown.

This matters because production RAG is not only retrieval logic. It is also connection handling, pooling, timeouts, retries, lifecycle management, and observability. An assistant can generate code that looks plausible but reflects an older driver style or misses the current recommended pooling API.

Use a documentation-grounded review loop:

Code area	Drift risk	Review check
Connection pool creation	Deprecated or outdated API shape	Match current driver docs
Pool sizing	Too many sessions or no backpressure	Set min, max, increment, and queue behaviour deliberately
Connection acquisition	Leaks under errors	Use context managers or explicit release paths
Health checks	Dead connections reused	Test pool health and recovery behaviour
Async code	Sync pool used in async path	Verify async pool APIs separately
Shutdown	Pool left open	Close the pool during app teardown

The practical rule is simple: do not accept generated infrastructure code just because it runs once. Point the assistant at the current docs, then test the behaviour you expect in production.
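A release-path test can be prototyped without a database. `StubPool` below is a hypothetical stand-in, not the python-oracledb API; it only mimics a pool whose `acquire()` is used in a `with` block. Check the current driver documentation for the real pool creation and acquisition calls before adapting it.

```python
from contextlib import contextmanager

class StubPool:
    """Hypothetical stand-in for a driver pool; counts checked-out connections."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.busy = 0
        self.closed = False

    @contextmanager
    def acquire(self):
        if self.closed:
            raise RuntimeError("pool is closed")
        if self.busy >= self.max_size:
            raise RuntimeError("pool exhausted")
        self.busy += 1
        try:
            yield object()  # placeholder for a real connection
        finally:
            self.busy -= 1  # release even when the caller raises

    def close(self):
        self.closed = True

def run_query(pool, fail=False):
    # Application code under test: the context manager owns the release path.
    with pool.acquire() as connection:
        if fail:
            raise ValueError("simulated query failure")
        return connection is not None

pool = StubPool(max_size=2)
for _ in range(5):
    try:
        run_query(pool, fail=True)
    except ValueError:
        pass
print("connections still checked out:", pool.busy)
```

If `run_query` leaked connections on error, the loop would exhaust a pool of two well before the fifth failure.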

Key takeaways:

AI-generated code can drift from current database-driver practices.
Connection pooling should be reviewed against current python-oracledb documentation.
Add runtime tests for connection acquisition, release, health, and shutdown.

How to implement this with Oracle AI Database

Short answer: Use Oracle AI Database to keep source metadata, chunks, embeddings, SQL filters, provenance, and deletion state close together. Store lifecycle fields next to retrieval fields, enforce freshness and delete filters in SQL, and run keyword, vector, and hybrid retrieval against the same governed metadata.

A practical Oracle AI Database design should store:

source ID and source system
source modified timestamp and ingestion timestamp
chunk ID, chunk ordinal, and chunk hash
parser version and embedding model version
current, superseded, deleted, and quarantined flags
tenant, ACL, classification, and provenance fields
retrieval-route metrics and challenge-set results

The important architecture point is proximity. If chunks, vectors, metadata, permissions, and deletion state live together, reconciliation can be expressed as database checks plus retrieval tests. If every layer lives in a different system, deletion and freshness become distributed-systems problems.

Key takeaways:

Keep lifecycle metadata in the retrieval path, not in a separate spreadsheet or job log.
Apply is_current, is_deleted, tenant, and permission filters before generation.
Use evaluation artifacts and manifests when promoting parser, chunking, or embedding changes.

Production checklist for RAG freshness

Short answer: A production RAG system is fresh only if source updates, deletes, parser changes, embedding changes, cache invalidation, and retrieval filters are observable and tested. If the first signal of drift is a user complaint, the system is missing reconciliation.

Use this checklist before calling a RAG deployment production-ready:

Every active chunk has an active source.
Every active source has expected current chunks.
Deleted sources cannot appear in retrieval.
Superseded chunks are excluded by default.
Chunk hashes prevent duplicate inserts.
Embedding model versions are recorded and filterable.
Parser and chunking versions are recorded.
Retrieval challenge sets include stale, deleted, and duplicate cases.
Caches invalidate on source update and delete.
Reconciliation failures create tickets or block promotion.

Key takeaways:

Freshness is not an ingestion schedule; it is a tested invariant.
Reconciliation should run before users notice drift.
The vector index is not the source of truth. The source system is.

Agent Memory Is Not RAG: Conversation IDs, Durable State, and Scoped Recall

Anya Summers — Thu, 23 Jul 2026 16:13:14 +0000

Short answer: RAG retrieves external evidence. Agent memory stores durable context about users, agents, threads, preferences, decisions, and prior work. Production agents need explicit conversation IDs, tenant scopes, provenance, deletion rules, and permission-aware recall. Without those boundaries, memory becomes prompt stuffing and state leaks across sessions.

Developers often ask memory questions while talking about RAG. That makes sense: both involve retrieval. But they solve different problems.

RAG answers, "What source evidence should the model use right now?" Memory answers, "What should this agent remember across turns, sessions, and workflows?"

Key takeaways:

RAG is not a memory system by itself.
A bigger context window is not durable memory.
Production memory needs scope, lifecycle, provenance, deletion, and governance.

RAG vs agent memory: what is the difference?

Short answer: RAG retrieves documents or data to ground an answer. Agent memory persists useful state across interactions. RAG should cite source evidence. Memory should recall scoped facts, preferences, decisions, summaries, tool results, and workflow state when they remain valid and allowed.

Use this distinction:

Capability	RAG	Agent memory
Primary job	Retrieve external evidence	Persist and recall useful state
Typical unit	Document chunk, row, source result	User preference, thread summary, decision, durable fact
Main risk	Wrong or inaccessible evidence	Stale, overbroad, or cross-session memory
Required scope	Source, tenant, ACL, version	User, agent, tenant, thread, conversation

Key takeaways:

RAG can be stateless.
Memory is stateful by definition.
Memory must be correctable and deletable.

Why does an MCP tool need a conversation ID?

Short answer: A user ID identifies the person, not the conversation. A conversation ID or correlation ID lets tools isolate state, connect logs, avoid cross-tab leakage, and prove which retrieved data belonged to which interaction.

If an LLM application calls your MCP tool without a conversation ID, your tool cannot tell whether two requests belong to the same thread or two parallel sessions. In regulated systems, that is not just inconvenient. It is a traceability problem.

Ask the calling application to pass:

user ID
tenant ID
agent ID
conversation or thread ID
request ID
permission context
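A sketch of how a tool might require these identifiers and derive a memory scope from them. `RequestScope` and `memory_key` are hypothetical names, not part of MCP or any Oracle API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RequestScope:
    user_id: str
    tenant_id: str
    agent_id: str
    conversation_id: str
    request_id: str
    permissions: frozenset = frozenset()

    def __post_init__(self):
        # Reject calls that omit any identifier instead of guessing a scope.
        for name in ("user_id", "tenant_id", "agent_id",
                     "conversation_id", "request_id"):
            if not getattr(self, name):
                raise ValueError(f"missing required scope field: {name}")

    def memory_key(self):
        # Thread-level state is keyed without request_id so turns in the
        # same conversation share it; request_id is for log correlation.
        return (self.tenant_id, self.user_id, self.agent_id, self.conversation_id)

# The same user in two browser tabs gets two isolated scopes.
tab_a = RequestScope("u1", "t1", "agent-x", "conv-1", "req-1")
tab_b = RequestScope("u1", "t1", "agent-x", "conv-2", "req-2")
print(tab_a.memory_key() != tab_b.memory_key())
```

Keying on user ID alone would merge the two tabs into one state.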

Key takeaways:

Conversation IDs are state isolation, not decoration.
Correlation IDs make debugging and audit possible.
Memory scope should come from explicit user, tenant, agent, and conversation identifiers rather than user ID alone.

What should an agent remember?

Short answer: An agent should remember durable, useful, scoped information: user preferences, task state, decisions, summaries, validated facts, and tool results that remain relevant. It should not remember secrets, transient noise, unsupported claims, or data the user is not allowed to retain.

Memory needs policy. Without policy, every interaction becomes a candidate memory, and the system becomes noisy.

Use memory categories:

Memory type	Example	Rule
Preference	User prefers Python examples	Keep if useful and non-sensitive
Task state	Draft article awaiting benchmark data	Keep with thread scope
Durable fact	Project uses Oracle AI Database	Keep with provenance
Tool result	Retrieval run completed	Keep with timestamp and source
Sensitive value	Password, private key, token	Do not store
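The rules in the table can be expressed as a small promotion gate. This is a sketch: the type labels are hypothetical, and the regular expression is a crude illustration, not a real secret detector.

```python
import re

# Illustrative pattern only; production systems need proper secret scanning.
SECRET_PATTERN = re.compile(
    r"(password|private[ _-]?key|api[ _-]?key|token)\s*[:=]", re.IGNORECASE)

def promote_memory(candidate):
    """Decide whether a candidate memory should be stored.

    candidate: dict with 'type', 'text', optional 'provenance', 'thread_id'.
    """
    if candidate["type"] == "sensitive" or SECRET_PATTERN.search(candidate["text"]):
        return ("reject", "looks like a secret")
    if candidate["type"] == "task_state" and not candidate.get("thread_id"):
        return ("reject", "task state needs thread scope")
    if candidate["type"] in ("durable_fact", "tool_result") and not candidate.get("provenance"):
        return ("reject", "missing provenance")
    return ("store", "ok")

print(promote_memory({"type": "preference", "text": "User prefers Python examples"}))
print(promote_memory({"type": "preference", "text": "my password: hunter2"}))
print(promote_memory({"type": "durable_fact", "text": "Project uses Oracle AI Database"}))
```

Rejecting by default when provenance or scope is missing keeps promotion deliberate.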

Key takeaways:

Memory promotion should be deliberate.
Memories need provenance and timestamps.
Deletion and correction are product requirements.

How do I implement this with Oracle AI Database?

Short answer: Use Oracle AI Agent Memory for persistent, scoped memory, and keep RAG, SQL, and tool outputs governed by the same access rules. Store user, agent, tenant, thread, memory, provenance, and retrieval metadata in a durable database-backed path.

The important point is that memory is a data problem. Once memory needs persistence, scoping, retrieval, lifecycle rules, audit, and deletion, it belongs in a governed data layer rather than an ad hoc prompt or local cache.

Key takeaways:

Use Oracle AI Agent Memory as the canonical product path for scoped persistent memory.
Keep live facts in SQL and source-grounded evidence in RAG.
Apply the same security rules to memory retrieval that apply to document retrieval.

How does memory affect latency?

Short answer: Memory can reduce repeated work when it recalls the right scoped context, but it can add latency if every turn performs broad retrieval. Keep memory retrieval narrow, scoped, and measurable. Use summaries for thread continuity and durable memories for facts that matter beyond the current conversation.

A practical memory stack separates:

current prompt context
thread summary
durable user or task memories
retrieved source evidence
live tool or SQL results

Key takeaways:

Memory retrieval should be selective.
Thread summaries and long-term memories serve different purposes.
Measure whether memory improves answer quality enough to justify the cost.

What should I do next?

Start by defining memory scope. Decide what is stored by user, tenant, agent, thread, and conversation. Then define promotion, update, delete, and retrieval rules. Only after that should you wire memory into an agent loop.

RAG Chunking and Parsing for Tables, PDFs, Transcripts, and Media

Anya Summers — Thu, 23 Jul 2026 16:12:21 +0000

Short answer: RAG chunking fails when it cuts content away from the context that makes it meaningful. A chunk should preserve headings, table headers, source IDs, timestamps, speaker labels, media traits, and security metadata. Good parsing and chunking make retrieved evidence understandable before the model ever sees it.

Most RAG tutorials spend a few minutes on chunking and then move on to embeddings. In production, chunking is often where retrieval quality is won or lost.

The model cannot use evidence that was damaged before retrieval. If a table row loses its headers, a transcript loses the speaker, or a bullet loses its parent heading, the retriever may find text that looks related but cannot support an answer.

Key takeaways:

Chunking is retrieval design, not housekeeping.
The goal is not equal-sized text. The goal is evidence that can stand alone.
Parsing quality matters as much as embedding quality.

Why does chunking break a RAG application with vector search?

Short answer: Chunking breaks RAG when boundaries ignore meaning. Fixed-size chunks can split sentences, detach bullets from headings, cut tables mid-row, and separate numbers from labels. The retriever then returns fragments that are semantically close but incomplete.

The common failure is a chunk that is technically relevant but useless. For example, a chunk that says "up to 30 days" is not answerable unless it also carries what the 30 days refers to.

Structure-aware chunking should preserve:

document title
section and parent headings
list context
table headers
page or slide number
source path
version and timestamp
tenant and access metadata
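A minimal sketch of attaching parent context to small chunks, assuming the parsed document is a list of lines in which headings start with `#`. That input format and the function name are assumptions; a real parser would supply headings in its own structure.

```python
def chunk_with_context(doc_title, lines, source_path):
    """Emit one chunk per text line, prefixed with title and heading path."""
    heading_path, chunks = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            level = len(stripped) - len(stripped.lstrip("#"))
            # Drop headings at the same or deeper level, then push this one.
            heading_path = heading_path[:level - 1] + [stripped.lstrip("#").strip()]
            continue
        context = " > ".join([doc_title] + heading_path)
        chunks.append({
            "text": f"{context}\n{stripped}",
            "headings": list(heading_path),
            "source_path": source_path,
        })
    return chunks

lines = [
    "# Refunds",
    "## Eligibility",
    "Returns are accepted up to 30 days after delivery.",
    "## Exceptions",
    "Digital goods are excluded.",
]
chunks = chunk_with_context("Refund Policy", lines, "policies/refunds.md")
print(chunks[0]["text"])
```

The "up to 30 days" sentence now carries what the 30 days refers to, so the chunk can stand alone.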

Key takeaways:

Inspect retrieved chunks from failed queries before changing models.
Attach parent context to small chunks.
Keep metadata with the chunk, not in a separate system that retrieval cannot filter.

How should I chunk tables and spreadsheets?

Short answer: Keep table headers, row labels, sheet names, and units with every retrieved table fragment. Do not let a row become a list of disconnected numbers. If a table is central to the answer, store a text representation and structured fields that can be queried directly.

Tables are hard because their meaning is relational. The value 12 means nothing without the column name, row label, unit, and sometimes the preceding section.

Use a table strategy:

Table problem	Safer approach
Row loses column headers	Repeat headers in each table chunk
Sheet context disappears	Include workbook, sheet, and section names
Numeric values need filtering	Store structured fields for SQL
Table spans pages	Preserve continuation markers and page references
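Repeating headers in every table chunk can be sketched as below. The layout of the rendered text and the `fields` key for structured values are assumptions; the sheet name and figures are made up.

```python
def table_to_chunks(headers, rows, sheet, rows_per_chunk=2):
    """Render groups of rows as text with the header names repeated."""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        lines = [f"Sheet: {sheet}"]
        for row in group:
            # Pair every value with its column name so no number is bare.
            lines.append("; ".join(f"{h}: {v}" for h, v in zip(headers, row)))
        chunks.append({
            "text": "\n".join(lines),
            "first_row": start,
            # Structured copy of the same rows for direct querying.
            "fields": [dict(zip(headers, row)) for row in group],
        })
    return chunks

headers = ["Region", "Quarter", "Revenue (USD m)"]
rows = [["EMEA", "Q1", 12], ["APAC", "Q1", 9], ["EMEA", "Q2", 14]]
table_chunks = table_to_chunks(headers, rows, sheet="Sales")
print(table_chunks[0]["text"])
```

The value 12 is now retrievable as "Revenue (USD m): 12" for EMEA in Q1, and the `fields` copy can be loaded into columns for SQL filtering.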

Key takeaways:

Table RAG often needs both text retrieval and structured querying.
Repeat headers deliberately; do not rely on proximity.
Preserve provenance so citations can point to the real source.

How should I parse PDFs?

Short answer: Treat PDFs as layout artifacts, not clean documents. Parse text, headings, reading order, tables, page numbers, and source coordinates where possible. Test parsing output before embedding it, because a bad PDF extraction can make every downstream retrieval method look worse than it is.

PDFs can have multi-column layouts, footnotes, tables, captions, scanned pages, and broken reading order. If parsing turns a policy into scrambled text, embeddings will faithfully index the mess.

A production parser should produce:

clean text
reading order
table boundaries
headings and hierarchy
page references
image captions or extracted descriptions when needed
source IDs for citation

Key takeaways:

Do not benchmark retrieval until extraction quality is visible.
Keep page and source references for citations.
Use a challenge set with tables, scanned pages, and multi-column layouts.

How should I chunk transcripts and media summaries?

Short answer: Keep timestamps, speaker labels, cleaned text, summary fields, and raw references together. For video or creator search, evaluate extracted traits as structured metadata before using vectors to rank results.

Raw transcripts are long, repetitive, and full of filler. Summary-only storage loses exact quotes. The practical path is both: cleaned chunks for retrieval, source timestamps for audit, and summaries or topics for navigation.

For media RAG, extracted traits should be tested. If the app needs to find "curly haired creator" or "risk discussion at minute 42," those labels need an evaluation set. Vector similarity is not a yes/no trait detector.
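
A rough sketch of transcript chunking that keeps speaker labels, timestamps, and raw references together follows. The segment dictionary shape, the `chunk_transcript` name, and the 60-second window are assumptions, not a prescribed format.

```python
def chunk_transcript(segments, max_seconds=60.0):
    """Group consecutive segments into windows of at most max_seconds."""
    groups, current = [], []
    for seg in segments:
        # Start a new window when adding this segment would exceed the limit.
        if current and seg["end"] - current[0]["start"] > max_seconds:
            groups.append(current)
            current = []
        current.append(seg)
    if current:
        groups.append(current)
    return [
        {
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "speakers": sorted({s["speaker"] for s in group}),
            "text": " ".join(f'{s["speaker"]}: {s["text"]}' for s in group),
            "segment_ids": [s["id"] for s in group],  # raw references for audit
        }
        for group in groups
    ]


# Illustrative segments only.
segments = [
    {"id": 1, "speaker": "Host", "start": 0.0, "end": 20.0, "text": "Welcome back."},
    {"id": 2, "speaker": "Guest", "start": 20.0, "end": 55.0, "text": "Let's discuss risk."},
    {"id": 3, "speaker": "Host", "start": 55.0, "end": 90.0, "text": "Go ahead."},
]
result = chunk_transcript(segments)
```

Each chunk can be embedded from its `text` field while `start`, `end`, and `segment_ids` remain available for citation and audit.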

Key takeaways:

Speaker and timestamp metadata are part of the evidence.
Store raw references even when retrieval uses cleaned text.
Evaluate extracted media traits before ranking on them.

How do I implement this with Oracle AI Database?

Short answer: Use Oracle AI Database to keep chunks, embeddings, metadata, and retrieval filters close together. Use DBMS_VECTOR_CHAIN for text processing and chunking workflows, Oracle AI Vector Search for embeddings and similarity search, and hybrid search when lexical and semantic retrieval both matter.

Key takeaways:

Keep chunk text and metadata in the same governed retrieval path.
Version chunking configuration so retrieval changes can be compared.
Test tables, PDFs, transcripts, and media separately because each fails differently.

What should I do next?

Build a chunking evaluation set before tuning chunk size. Take failed user questions, inspect the retrieved chunks, and identify whether the missing evidence was a parsing problem, a boundary problem, a metadata problem, or a retrieval-ranking problem. That diagnosis is faster than changing the model and hoping the corpus improves.

Real-Time RAG: Live SQL, Incremental Indexing, and Freshness Tests

Anya Summers — Thu, 23 Jul 2026 16:10:59 +0000

Short answer: Real-time RAG should not re-embed every changing record. Query current structured data live with SQL or NL2SQL, and use retrieval for documents that benefit from semantic search. For changing documents, update affected chunks incrementally, track versions and timestamps, and test whether answers prefer current evidence over stale evidence.

The common real-time RAG question is simple: "My data changes every few minutes. Should I embed it, cache it, query it, or build an agent?"

The answer depends on the data shape. Current rows are not documents. Operational data should usually stay operational. Documents, transcripts, support notes, and policies can be indexed, but they need freshness metadata and incremental updates.

Key takeaways:

Query live structured data instead of re-embedding it on every update.
Use incremental indexing for changing documents.
Freshness must be evaluated, not assumed.

When you move RAG to production, when should live SQL beat retrieval?

Short answer: Use live SQL when the answer depends on current rows, counts, filters, dates, statuses, permissions, or joins. RAG is the wrong first route for questions that a database can answer directly and deterministically.

If a risk register changes every five minutes, embedding every row creates a stale copy. If the user asks for open risks, overdue mitigations, current owners, or counts, the system should query live tables.

Use this routing table:

Question type	Best route
Current count, status, owner, balance, risk, ticket	SQL or NL2SQL
Policy explanation, support note, manual text	RAG over indexed documents
Current fact plus explanation	SQL for fact, RAG for explanation
Exact ID plus description	SQL or keyword first, vector second
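
One way to keep that routing table testable is to encode it as data. This sketch assumes the question type has already been classified upstream. The `Route` enum, the type labels, and `plan_routes` are hypothetical names.

```python
from enum import Enum


class Route(Enum):
    SQL = "sql"
    RAG = "rag"
    KEYWORD = "keyword"
    VECTOR = "vector"


# Routes are listed in the order they should be tried.
ROUTING_TABLE = {
    "current_fact": [Route.SQL],
    "explanation": [Route.RAG],
    "fact_plus_explanation": [Route.SQL, Route.RAG],
    "exact_id_plus_description": [Route.SQL, Route.KEYWORD, Route.VECTOR],
}


def plan_routes(question_type):
    """Return the ordered routes for a question type, or fail loudly."""
    try:
        return ROUTING_TABLE[question_type]
    except KeyError:
        raise ValueError(f"No route defined for question type: {question_type!r}")


print(plan_routes("fact_plus_explanation"))
```

Raising on an unknown type, rather than silently defaulting to vector search, is a deliberate choice in this sketch: an unrouted question type is something to triage, not hide.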

Key takeaways:

Do not force structured facts into a vector-only path.
SQL has built-in advantages for current state, permissions, and joins.
NL2SQL still needs guardrails, result limits, and inspection.

How should documents stay fresh?

Short answer: Track source timestamps, chunk hashes, embedding model versions, current-version flags, and delete state. Re-embed only changed chunks. Test stale and current versions in the same evaluation set so freshness regressions are visible.

Incremental indexing is not just a performance optimization. It is a correctness requirement. A stale chunk can produce a confident wrong answer.

A production document pipeline should record:

source ID and source system
source modified timestamp
ingestion timestamp
chunk hash
embedding model
current version flag
deletion or superseded state
tenant and ACL metadata
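
The chunk hash is what makes "re-embed only changed chunks" cheap to decide. A minimal sketch, assuming chunk IDs are stable across ingestions and that stored hashes are available as a dictionary; `chunk_hash` and `plan_reindex` are illustrative names.

```python
import hashlib


def chunk_hash(text, embedding_model):
    """Hash the text together with the model name, so a model change forces re-embedding."""
    payload = f"{embedding_model}\x00{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def plan_reindex(stored_hashes, new_chunks, embedding_model):
    """Compare stored {chunk_id: hash} with fresh {chunk_id: text}."""
    to_embed = [
        chunk_id for chunk_id, text in new_chunks.items()
        if stored_hashes.get(chunk_id) != chunk_hash(text, embedding_model)
    ]
    to_delete = [chunk_id for chunk_id in stored_hashes if chunk_id not in new_chunks]
    return to_embed, to_delete


model = "example-embedding-model"  # placeholder name
stored = {
    "a": chunk_hash("alpha", model),
    "b": chunk_hash("beta", model),
    "c": chunk_hash("gamma", model),
}
fresh = {"a": "alpha", "b": "beta v2", "d": "delta"}
to_embed, to_delete = plan_reindex(stored, fresh, model)
print(to_embed, to_delete)
```

Here the unchanged chunk is skipped, the edited and new chunks are queued for embedding, and the chunk that disappeared from the source is queued for deletion so it cannot keep answering questions.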

Key takeaways:

Freshness metadata belongs in retrieval filters.
Re-embedding unchanged chunks wastes compute and can create noise.
Deletion and supersession must propagate to retrieval.

What role do caching, MCP, and tool calling play?

Short answer: Caching reduces repeated work, MCP exposes tools, and tool calling chooses actions. None of them makes stale data fresh. Each still needs permissions, timeouts, result limits, audit, and a freshness strategy.

Use caching for stable, permission-safe results. Invalidate it when source data changes. Use MCP or tool calling when the assistant needs to query a database, call an API, or run a retrieval tool. Keep each tool narrow and auditable.
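
A small sketch of invalidation tied to source changes: each cached answer remembers the source version it was computed from, and a lookup under a newer version evicts it. The `VersionedCache` class and the tenant-scoped key are assumptions for illustration. A real deployment would more likely react to change events from the source system.

```python
class VersionedCache:
    """Cache whose entries are valid only for the source version they were built from."""

    def __init__(self):
        self._entries = {}  # key -> (source_version, value)

    def put(self, key, source_version, value):
        self._entries[key] = (source_version, value)

    def get(self, key, source_version):
        entry = self._entries.get(key)
        if entry is None:
            return None
        cached_version, value = entry
        if cached_version != source_version:
            # Source changed since this answer was cached: evict and miss.
            del self._entries[key]
            return None
        return value


cache = VersionedCache()
key = ("acme", "how many open risks?")  # tenant is part of the key
cache.put(key, "v1", "3 open risks")
hit = cache.get(key, "v1")
miss_after_change = cache.get(key, "v2")
```

Including the tenant in the key keeps one tenant's cached answer from being served to another, which is the "permission-safe" half of the rule above.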

Key takeaways:

Tool calling is routing, not governance.
MCP needs scoped tools and traceable outputs.
Cache invalidation should be tied to source change events.

How do I implement this with Oracle AI Database?

Short answer: Use Select AI or application-controlled SQL for live structured data, Oracle AI Vector Search for document retrieval, and hybrid search where keyword and semantic search both matter. Keep source timestamps, version flags, tenant filters, and ACLs in the database path.

Key takeaways:

Use live SQL for current structured facts.
Use vector and hybrid retrieval for document evidence.
Keep freshness filters near the data, not in the prompt.

How should I test freshness?

Short answer: Create challenge questions where old and new evidence both exist. The system should choose the current record or current document version, cite it, and ignore stale evidence unless the user explicitly asks for history.

Freshness tests should include:

recently updated row
stale document version
deleted document
superseded policy
cached answer invalidation
user asking for current state
user asking for historical state
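
Several of these cases can be expressed as assertions against the evidence-selection step. The sketch below assumes each chunk carries `version`, `is_current`, and `deleted` fields; those names and the `select_evidence` function are hypothetical.

```python
def select_evidence(chunks, include_history=False):
    """Drop deleted chunks; return only current ones unless history is requested."""
    live = [c for c in chunks if not c["deleted"]]
    if include_history:
        return sorted(live, key=lambda c: c["version"])
    return [c for c in live if c["is_current"]]


# Challenge set: stale and current versions exist side by side on purpose.
chunks = [
    {"id": "policy-v1", "version": 1, "is_current": False, "deleted": False},
    {"id": "policy-v2", "version": 2, "is_current": True, "deleted": False},
    {"id": "memo", "version": 1, "is_current": True, "deleted": True},
]

current_ids = [c["id"] for c in select_evidence(chunks)]
history_ids = [c["id"] for c in select_evidence(chunks, include_history=True)]
print(current_ids, history_ids)
```

The same fixtures cover the "current state" and "historical state" questions: the superseded policy appears only when the user asks for history, and the deleted memo never appears.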

Key takeaways:

Freshness is a user-visible quality metric.
The evaluation set should include stale evidence on purpose.
A system that cannot distinguish current from old evidence is not real-time.

What should I do next?

Map each question type to live SQL, document retrieval, hybrid retrieval, or agent tool use. Then build a freshness challenge set before scaling ingestion. If the system cannot pass stale-versus-current tests, do not publish it as real-time RAG.

Secure Enterprise RAG: ACLs, Tenant Filters, Provenance, and Oracle Deep Data Security

Anya Summers — Thu, 23 Jul 2026 16:09:37 +0000

Short answer: Secure enterprise RAG means access policy travels with the evidence. Source ACLs, tenant IDs, labels, provenance, masking, audit, and deletion state must be enforced before retrieved chunks reach the model. "Private" or "self-hosted" is not enough. The system is secure only when retrieval follows the same rules as the source data.

Enterprise RAG usually starts with a reasonable goal: let employees ask questions over documents, tickets, policies, emails, and operational data without sending sensitive information to the wrong place.

The risk is that teams build retrieval first and security later. That is backwards. Once chunks, embeddings, summaries, and generated answers exist, the data has already moved through several surfaces.

Oracle Deep Data Security is the right frame for this article: enforce security close to governed data, retrieval, SQL, metadata, audit, masking, labels, roles, and access policy before sensitive evidence reaches the model. Treat it as a security architecture message, not as a standalone product claim.

Key takeaways:

Secure RAG starts at ingestion, not at prompt time.
ACLs and tenant filters must be retrieval controls.
Self-hosted RAG can still leak if permissions, deletion, and audit are weak.

What does production RAG governance have to secure?

Short answer: Secure the source data, chunks, embeddings, metadata, retrieval filters, prompts, tool calls, generated answers, citations, logs, and memory. If any layer can bypass source permissions, the system can expose data that the user should not see.

A secure RAG system needs controls across the path:

Layer	Security requirement
Source	Capture owner, tenant, role, classification, and delete state
Chunk	Preserve source ACLs and provenance metadata
Embedding	Treat vectors as derived sensitive data
Retrieval	Filter by permission before generation
Generation	Cite only accessible evidence
Logs	Record evidence and tool use without leaking secrets
Memory	Scope by user, tenant, agent, and conversation

Key takeaways:

Embeddings and summaries can still reveal sensitive information.
Permission filters must run before the model receives evidence.
Logs need enough detail for audit without becoming a second data leak.

How should ACLs and tenant filters work?

Short answer: ACLs should be stamped onto chunks during ingestion and enforced during retrieval. Tenant filters should be mandatory query predicates, not optional prompt instructions. A model cannot be trusted to ignore evidence that retrieval already exposed.

The retrieval query should only consider evidence the user can access. That means every chunk needs enough metadata to answer:

Which tenant owns this evidence?
Which user, group, role, or department can see it?
What source system did it come from?
Is it current, deleted, superseded, or embargoed?
What classification or label applies?
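
The ordering matters: filter first, rank second. A minimal in-memory sketch, assuming each chunk carries `tenant_id`, `allowed_groups`, `status`, and a precomputed `similarity` score. In a database the same predicates would be part of the retrieval query itself.

```python
def allowed(chunk, user):
    """Mandatory predicates: same tenant, a shared group, and current status."""
    return (
        chunk["tenant_id"] == user["tenant_id"]
        and bool(set(chunk["allowed_groups"]) & set(user["groups"]))
        and chunk["status"] == "current"
    )


def retrieve(chunks, user, k=3):
    """Apply permission filters before ranking, never after."""
    candidates = [c for c in chunks if allowed(c, user)]
    return sorted(candidates, key=lambda c: c["similarity"], reverse=True)[:k]


# Illustrative data only.
chunks = [
    {"id": "a", "tenant_id": "acme", "allowed_groups": ["finance"], "status": "current", "similarity": 0.71},
    {"id": "b", "tenant_id": "globex", "allowed_groups": ["finance"], "status": "current", "similarity": 0.95},
    {"id": "c", "tenant_id": "acme", "allowed_groups": ["legal"], "status": "current", "similarity": 0.90},
    {"id": "d", "tenant_id": "acme", "allowed_groups": ["finance", "ops"], "status": "superseded", "similarity": 0.88},
    {"id": "e", "tenant_id": "acme", "allowed_groups": ["ops"], "status": "current", "similarity": 0.64},
]
user = {"tenant_id": "acme", "groups": ["finance", "ops"]}
result_ids = [c["id"] for c in retrieve(chunks, user)]
print(result_ids)
```

The three highest-scoring chunks belong to another tenant, another group, or a superseded version, so none of them is ever a candidate, regardless of how similar it is to the query.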

Key takeaways:

ACL propagation is part of ingestion.
Tenant filters are non-negotiable retrieval predicates.
Prompt instructions are not a permission boundary.

How does provenance reduce risk?

Short answer: Provenance tells you which source, version, row, chunk, document, or tool result contributed to an answer. Without provenance, teams cannot audit a response, fix a bad retrieval path, or prove that the model used accessible evidence.

Every generated answer should be traceable back to:

document or table source
source version
chunk or row ID
retrieval route
user and tenant scope
tool call or SQL query
timestamp
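
A provenance record can be as simple as one immutable row per answer. The field names below mirror the list above but are otherwise assumptions, as are the sample values.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    source: str
    source_version: str
    evidence_ids: tuple      # chunk or row IDs that reached the model
    retrieval_route: str
    user_id: str
    tenant_id: str
    query: str               # tool call or SQL text, if any
    timestamp: str


record = ProvenanceRecord(
    source="policies/returns.pdf",
    source_version="2026-07",
    evidence_ids=("chunk-17", "chunk-18"),
    retrieval_route="hybrid",
    user_id="u-123",
    tenant_id="acme",
    query="",
    timestamp=datetime(2026, 1, 1, tzinfo=timezone.utc).isoformat(),
)

# One JSON line per answer is enough to reconstruct what happened later.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```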

This is not just compliance paperwork. It is how engineers debug production RAG. When users report a bad answer, provenance tells you whether the source was stale, the chunk was damaged, the filter failed, or the generation step overreached.

Key takeaways:

Provenance is both a security control and a debugging tool.
Citations should point to accessible evidence.
Store enough retrieval metadata to reconstruct what happened.

How do I implement this with Oracle AI Database?

Short answer: Keep sensitive data, metadata, vectors, SQL, retrieval filters, audit, and memory close to the governed database layer. Use Oracle AI Database security capabilities for authentication, roles, application context, network encryption, auditing, sensitive data protection, and security products where required.

Useful docs:

The security guide describes Oracle AI Database security areas including users, authentication, privileges, application security, application contexts, sensitive data protection, network encryption, auditing, and additional security products such as Oracle Advanced Security, Oracle Label Security, Oracle Database Vault, Oracle Data Safe, Audit Vault and Database Firewall, and Oracle Key Vault.

Key takeaways:

Use database security controls where the data and retrieval metadata live.
Do not copy enterprise data into a separate AI layer and then rebuild governance from scratch.
Treat Oracle Deep Data Security as the data-layer story for governed AI applications.

What should I do next?

Build a security checklist before building the chatbot UI. For every source, define the ACL metadata, tenant field, provenance fields, deletion behaviour, masking rule, audit trail, and memory scope. Then test with users who should and should not see the same evidence.

How to Evaluate Production RAG: Keyword, Vector, SQL, and Hybrid Retrieval

Anya Summers — Thu, 23 Jul 2026 16:04:12 +0000

Short answer: Evaluate production RAG by testing each retrieval route against the same questions. Keyword search should handle exact terms. Vector search should handle semantic matches. SQL should handle current structured facts. Hybrid retrieval should prove it improves mixed workloads. Then evaluate whether the generated answer is grounded, cited, current, permission-safe, and willing to abstain.

A RAG demo can look good with one document, one question, and one happy path. Production is different. Users ask for IDs, stale policies, tenant-specific records, live data, tables, and questions that are not actually answerable from the corpus.

The wrong move is to debate retrieval methods in the abstract. The useful move is to build an evaluation harness that makes each method earn its place.

Key takeaways:

Evaluate retrieval before changing prompts or models.
Treat SQL as a first-class route for current structured data.
Do not call hybrid search a win until it beats keyword and vector baselines on your actual query mix.

What should production RAG evaluation measure?

Short answer: Production RAG evaluation should measure retrieval quality, answer quality, freshness, permission safety, and operational reliability. A high average retrieval score is not enough if the system fails exact identifiers, stale documents, tenant isolation, or unsupported questions.

Use two layers.

Layer	What it measures	Failure it catches
Retrieval evaluation	Whether the right evidence appears in top-k	Wrong chunk, missing row, stale document, noisy table
Answer evaluation	Whether the model uses evidence correctly	Unsupported answer, bad citation, false confidence

For retrieval, start with recall@k, precision@k, NDCG@k, and MAP@k. For answers, score groundedness, correctness, citation validity, and abstention quality.
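
The retrieval metrics are short enough to write by hand, which helps when checking what a library reports. This sketch assumes binary relevance and unique IDs in the ranked list. MAP@k is the mean of `average_precision_at_k` over all questions, and the normalization by min(relevant, k) is one common convention among several.

```python
import math


def recall_at_k(ranked, relevant, k):
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & relevant) / k


def average_precision_at_k(ranked, relevant, k):
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at each relevant position
    denominator = min(len(relevant), k)
    return total / denominator if denominator else 0.0


def ndcg_at_k(ranked, relevant, k):
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


ranked = ["d3", "d1", "d7", "d2"]   # retrieval output, best first
relevant = {"d1", "d2"}             # labeled evidence for this question
k = 3
print(recall_at_k(ranked, relevant, k))             # 0.5
print(average_precision_at_k(ranked, relevant, k))  # 0.25
```

In this example only one of the two relevant documents appears in the top three, and it appears at rank 2, so recall is 0.5 and average precision is 0.25.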

Key takeaways:

Retrieval metrics are useful because they can run without an LLM call.
Answer metrics are necessary because relevant evidence can still produce a bad answer.
Production cases need to sit next to clean benchmark cases.

How do I compare vector-only vs hybrid RAG with keyword and SQL retrieval?

Short answer: Build one question set and run every route against it. Keyword should win exact lexical queries. Vector should win paraphrases. SQL should win current structured facts. Hybrid should improve mixed intent without making exact or governed queries worse.

Use this comparison model:

Query type	First route to test	What to measure
Error codes, product names, risk IDs	Keyword retrieval	Exact match, top-k position, false positives
Paraphrases and fuzzy intent	Vector retrieval	Semantic recall, noise, ranking quality
Counts, filters, current state	SQL or NL2SQL	Query correctness, permissions, freshness
Mixed exact and semantic intent	Hybrid retrieval	Gain over both baselines, latency cost

Oracle AI Database supports vector search and hybrid search patterns. The Oracle hybrid-search documentation describes combining full-text and vector similarity search, including fusion approaches such as RRF.
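
For intuition, here is a generic reciprocal rank fusion (RRF) sketch in plain Python. It is not Oracle's implementation. The constant k=60 is a commonly used default, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_ranking = ["ERR-42", "doc-a", "doc-b"]
vector_ranking = ["doc-b", "ERR-42", "doc-c"]
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Documents that appear in both lists rise to the top, which is the behaviour to verify against keyword-only and vector-only baselines before calling hybrid retrieval a win.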

Key takeaways:

Compare retrieval routes against the same questions, not separate anecdotes.
SQL is not a fallback for RAG. It is the correct path for structured current facts.
Hybrid search should be evaluated as a tradeoff: better recall versus added complexity and latency.

What should the evaluation set include?

Short answer: Include the cases that break real systems: exact identifiers, paraphrases, stale and current versions, tenant boundaries, metadata filters, table lookups, multi-hop questions, and unsupported questions. A clean evaluation set gives clean scores and hides production risk.

A practical first set can be 50 to 100 questions. Each question should include expected evidence, expected answer behaviour, and known failure modes.

Include:

exact IDs, error codes, SKUs, names, and model numbers
paraphrases where the corpus uses different words than the user
stale and current versions of the same document
tenant and permission boundaries
questions that require SQL, not semantic retrieval
table and spreadsheet questions
unsupported questions where the system should abstain
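
One possible shape for an evaluation case, labeling both required and forbidden evidence. The `EvalCase` fields and `check_retrieval` helper are assumptions for illustration, not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    question: str
    required_evidence: set
    forbidden_evidence: set = field(default_factory=set)
    expected_behaviour: str = "answer"   # or "abstain" for unsupported questions


def check_retrieval(case, retrieved_ids):
    """Report missing required evidence and leaked forbidden evidence."""
    retrieved = set(retrieved_ids)
    missing = sorted(case.required_evidence - retrieved)
    leaked = sorted(case.forbidden_evidence & retrieved)
    return {"missing": missing, "leaked": leaked, "passed": not missing and not leaked}


case = EvalCase(
    question="What is the current refund window?",
    required_evidence={"policy-v2"},
    forbidden_evidence={"policy-v1", "globex-doc"},  # stale version, other tenant
)
bad = check_retrieval(case, ["policy-v2", "policy-v1"])
good = check_retrieval(case, ["policy-v2", "faq"])
print(bad, good)
```

A retrieval run that returns the right chunk alongside a stale or cross-tenant chunk fails here, even though a recall-only metric would score it as a success.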

Key takeaways:

Keep the evaluation set stable so changes are comparable.
Add new failed production queries after triage.
Label required evidence and forbidden evidence, not only final answers.

How do I implement this with Oracle AI Database?

Short answer: Store documents, chunks, embeddings, metadata, and structured records in the database path. Run keyword, vector, SQL, and hybrid retrieval as separate routes. Use metadata filters before generation, and export metrics after every retrieval change.

An Oracle implementation should separate the routes clearly:

Oracle Text or keyword search for exact terms.
Oracle AI Vector Search for semantic retrieval.
Select AI or application-controlled SQL for current structured data.
Hybrid search for workloads where keyword and vector evidence both matter.
Mandatory filters for tenant, source, version, status, and permissions.

Key takeaways:

Keep retrieval logic visible and testable.
Apply permission and freshness filters before evidence reaches the model.
Publish numerical claims only after the notebook or benchmark run has exported traceable results.

What should I do next?

Run the evaluation before choosing the architecture. If the system fails exact terms, improve lexical retrieval. If it fails paraphrases, improve vector retrieval or parsing. If it fails current facts, route to SQL. If it fails mixed intent, test hybrid retrieval. If it fails memory continuity, move to scoped agent memory rather than stuffing more context into the prompt.

One Database for the Whole LangChain Ecosystem: Memory, Persistence, and Deep Agents on Oracle AI Database

Anya Summers — Thu, 23 Jul 2026 16:01:57 +0000

Retrieval, memory, persistence, and a bring-your-own-model deep-agents harness for LangChain and LangGraph, all on one Oracle AI Database instance behind a single connection pool.

Companion notebook: https://github.com/oracle-devrel/oracle-ai-developer-hub/blob/main/notebooks/langchain_ecosystem/research_agent_with_deepagents_oracle.ipynb

Key Takeaways

Oracle AI Database is the unified backend for LangChain and LangGraph agent infrastructure. The blog argues that vectors, chat history, semantic cache, checkpoints, documents, and relational data can live behind one Oracle connection pool instead of several separate services.
langchain-oracledb now covers both retrieval and memory. It adds OracleSemanticCache for paraphrase-aware LLM caching and OracleChatMessageHistory for durable, session-scoped chat history.
langgraph-oracledb adds production persistence for LangGraph agents. The new package provides checkpointing and long-term memory so agent workflows can resume across restarts, deployments, and long-running tasks.
The deep-agents setup is model-flexible. Developers can use Claude, OCI Generative AI, OpenAI, vLLM, or another LangChain chat model while keeping the agent’s state and memory in Oracle.
The main production benefit is reduced fragmentation. Instead of managing consistency, backup, governance, latency, and audit across multiple stores, the blog argues teams can run retrieval, memory, and persistence in one database system.

Oracle-native LangChain and LangGraph integrations share one Oracle AI Database for retrieval, memory, persistence, and agent state.

We’ve observed that most LangChain examples keep vectors in one service, chat history in another, the cache in a third, and documents in a fourth (a vector DB, a Redis, a Postgres, an object store). That works in a notebook, and perhaps even in a PoC, but the fragmentation becomes a serious disadvantage in production, where the consistency, backup, latency, TTFT (time-to-first-token), and audit stories of four systems all have to line up with each other.

Agent memory is everything an agent stores and retrieves as it works. It's the mix of components, tools, and libraries that let an agent recall information, reuse key details in later interactions, hold onto context for long horizon tasks, and refine what it knows to adapt over time. In practice, that means the conversation so far, the durable facts it has learned across sessions, the working state of a multi-step task, and a cache of answers it has already computed.

An agent that holds those stores together stays continuous and grounded. Built on fragmented infrastructure, the same agent carries a higher cognitive and operational load, which leads to failure modes and data synchronization issues.

That’s why the team at Oracle has invested in the LangChain ecosystem, bringing the benefits of the converged AI database to AI developers through three major updates to the ecosystem's open source libraries:

The langchain-oracledb package adds semantic LLM caching via the OracleSemanticCache class and durable chat message history via the OracleChatMessageHistory class, completing its retrieval-and-memory story.
langgraph-oracledb launches as a new package, bringing graph checkpointing and long-term agent memory to LangGraph.
And langchain-oci puts a provider-agnostic deep-agents factory on top: the same agent harness running on Claude, OCI Generative AI, or your own model. All of it backed by a single Oracle AI Database instance and a single oracledb connection pool.

Memory is becoming the defining layer of the agent stack. The work Oracle is doing to integrate Oracle AI Database on OCI with LangChain gives developers a real path to building memory-first agents that persist, retrieve, and reason over context at scale.

Harrison Chase, Co-Founder and CEO, LangChain

In this companion notebook, we walk through an end-to-end example of using LangChain ecosystem packages to build a deep research agent, a common use case we see in enterprise.

The update to the langchain-oracledb package makes Oracle AI Database a first-class backing store for LangChain applications, so AI developers can build enterprise-ready AI applications. The deep research use case includes full working code for the features across the LangChain ecosystem packages.

The update adds two new primitives: OracleSemanticCache for LLM response caching and OracleChatMessageHistory for durable session memory. They join the package's existing retrieval primitives: OracleVS for vector search, OracleEmbeddings for in-database embedding generation, OracleHybridSearchRetriever and OracleTextSearchRetriever for retrieval, and OracleDocLoader, OracleTextSplitter, and OracleSummary for document processing.

Alongside it, Oracle released langgraph-oracledb, a brand-new package that gives LangGraph agents Oracle-native checkpointing and long-term memory. Both packages are available now on PyPI and documented in the official LangChain documentation.

This release collapses that stack into one.

langchain-oracledb already solved retrieval: vectors, hybrid search, and the document pipeline in one engine. What stayed scattered was the memory. The chat history lived in one service, the LLM cache in another. With this update, a single Oracle AI Database instance holds the vector index, the chat history, the semantic cache, and the staged documents, behind one Oracle AI Database connection pool and one set of credentials.

Agent memory isn't one thing. A production agent needs short-term memory across turns, long-term memory across sessions, semantic caching for repeat queries, and retrieval over both structured and unstructured data. Putting that substrate in one database — vector search natively built in — reduces cognitive load for both the developers building the system and the agents operating inside it.

Richmond Alake, Director of AI Developer Experience, Oracle Database

What's New in langchain-oracledb

The update extends the package from a retrieval integration into a full retrieval-and-memory layer, with two new primitives:

OracleSemanticCache: semantic LLM response caching. A drop-in BaseCache that matches prompts by vector distance rather than exact string, so paraphrased questions hit the cache too. A tunable score_threshold decides how loose a paraphrase still counts, and entries are isolated by LangChain's llm_string, so a model upgrade invalidates its own entries without touching the rest. Wiring it in globally is one line, set_llm_cache(...), or attach it to a single chat model and leave an agent's intermediate calls uncached.
OracleChatMessageHistory: durable, session-scoped chat history. Implements BaseChatMessageHistory as rows in an Oracle table. One table holds thousands of concurrent sessions, survives application restarts, and supports bounded reads (history_size=N) so token costs stay capped without deleting older rows. It plugs straight into RunnableWithMessageHistory.
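
To make the caching idea concrete, here is a toy in-memory sketch of semantic lookup with a threshold and per-model isolation. It is not the OracleSemanticCache implementation: the class name, the word-count "embedding", and the use of cosine similarity are all assumptions, and the real class's threshold semantics may differ.

```python
import math

VOCAB = ["reset", "password", "invoice"]  # toy vocabulary for the fake embedder


def embed(text):
    """Stand-in for a real embedding model: counts of vocabulary words."""
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToySemanticCache:
    def __init__(self, embed_fn, score_threshold=0.9):
        self._embed = embed_fn
        self._threshold = score_threshold
        self._entries = {}  # llm_string -> list of (vector, response)

    def update(self, prompt, llm_string, response):
        self._entries.setdefault(llm_string, []).append((self._embed(prompt), response))

    def lookup(self, prompt, llm_string):
        query = self._embed(prompt)
        best_score, best_response = 0.0, None
        # Only entries stored under the same model string are considered.
        for vector, response in self._entries.get(llm_string, []):
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self._threshold else None


cache = ToySemanticCache(embed)
cache.update("how do i reset my password", "model-a", "Use the reset link.")
print(cache.lookup("reset password please", "model-a"))
```

A paraphrase hits the cache, an unrelated prompt misses, and the same paraphrase under a different model string also misses, which is the isolation behaviour described above.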

Figure 1: A RAG pipeline where every stage maps to a langchain-oracledb primitive.

For teams still on the Oracle classes in langchain-community, langchain-oracledb replaces them with identical class names and constructor signatures, so migration is an import-path change. The new primitives sit beyond that entirely: the semantic cache and chat message history do not exist in the community package. (The vector store, embeddings, and document pipeline are also available for JavaScript as @oracle/langchain-oracledb on npm. The new memory primitives are Python-first.)

The whole picture fits in a screenful:

import oracledb
from langchain_core.globals import set_llm_cache
from langchain_oracledb import OracleChatMessageHistory, OracleSemanticCache
from langchain_oracledb.vectorstores.oraclevs import OracleVS
 
connection = oracledb.connect(user="agent", password="...", dsn="localhost:1521/FREEPDB1")
 
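# Retrieval, session memory, and LLM caching: one backend, one connection.
# `embeddings` is assumed to be a LangChain Embeddings instance created earlier (for example, OracleEmbeddings).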
vector_store = OracleVS(client=connection, embedding_function=embeddings, table_name="DOCS")
history = OracleChatMessageHistory(session_id="customer-42", client=connection, table_name="CHAT_HISTORY")
set_llm_cache(OracleSemanticCache(client=connection, embedding=embeddings, table_name="LLM_CACHE"))

Three primitives that would normally mean three services, three SDKs, and three failure modes. Here, three tables.

In the customer sessions and developer workshops we’ve run, the pattern is remarkably consistent: teams don't struggle to prototype agent memory — they struggle to ship it. The prototype dies somewhere between the laptop and the security review. Development runs against the Oracle AI Database container on a laptop; production runs against Autonomous AI Database with wallet-based authentication. AI teams find moving between them is a connect-string change, not a re-architecture.

langgraph-oracledb: A New Package for LangGraph Agents

The second piece of news is a launch, not an update. langgraph-oracledb is a new package that implements LangGraph's persistence interfaces on Oracle AI Database:

OracleSaver and AsyncOracleSaver: graph checkpointing. Every step of a LangGraph agent's state is checkpointed per thread_id, so a conversation or long-running task resumes exactly where it left off, across invocations, restarts, and deploys.
OracleStore and AsyncOracleStore: long-term agent memory. A namespaced key-value store with put, get, search, and batch operations, plus optional HNSW or IVF vector indexes so agents can search their own memories semantically, not just by key.

Both accept the same oracledb connections and pools as langchain-oracledb, which is the point. An application that mixes LangChain chains and LangGraph agents shares one system of record (same backend, same pool, same vector index semantics) rather than fragmenting across several.

The packages ship with hands-on worked examples:

The AI on-call triage assistant in the Oracle AI Developer Hub builds a LangGraph supervisor that delegates to an issue analyst and a policy specialist: vector search over a real past-issue corpus, each on-caller's saved preferences, and per-thread checkpoints all sharing one Oracle AI Database.
A companion semantic-caching and durable chat-history notebook covers the primitives underneath: OracleSemanticCache skips the model when a new question means the same as one already answered — so you never pay Claude twice for the same answer — and OracleChatMessageHistory keeps each session's transcript durable and isolated across restarts.

Deep Agents on Oracle, Any Model

LangGraph is what developers reach for when an agent outgrows a single prompt-response loop and becomes a stateful workflow: one that branches, pauses for human approval, and runs long enough that state has to survive a restart. The framework's mechanism for all of this is persistence. Graph state is checkpointed at every step, so a run can be resumed, replayed, or continued in a later session. Every one of those checkpoints needs somewhere durable to live.

langgraph-oracledb gives them one. On top of it, the deep-agents factory in langchain-oci builds an agent around whichever chat model you supply:

from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
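from langchain_oci import create_deepagents_agent

# Placeholder quote table so the example runs; fill it from your own data source.
QUOTES: dict[str, str] = {}

@tool
def get_stock_quote(ticker: str) -> str:
    """Return the latest stock quote for a ticker symbol."""
    return QUOTES.get(ticker.upper(), f"No quote for {ticker}")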
 
agent = create_deepagents_agent(
    tools=[get_stock_quote],
    model=ChatAnthropic(model="claude-sonnet-4-6"),  # bring your own model: Claude, OCI GenAI, vLLM
    system_prompt="You are a concise financial assistant. Use your tools for live data.",
)

The model is the one thing you swap; the system of record stays Oracle. Claude does the reasoning, and the agent's plan, files, and per-thread state checkpoint to the same Oracle AI Database backing the chains and the LangGraph persistence: one pool, one transaction surface.

Figure 2: An agent loop that checkpoints every step to Oracle, with any model.

Oracle AI Database as the Unified Memory Core for AI Agents

Oracle AI Database is the unified retrieval and memory core across all three packages. Instead of treating the database as a passive persistence layer, the integrations treat it as the active retrieval engine that makes each LangChain and LangGraph pattern work in production.

Figure 3: Agent memory is not a single thing; Oracle holds every kind in one place.

Oracle AI Vector Search brings the retrieval strategies LangChain developers actually need into a single engine: vector similarity for semantic recall and unstructured knowledge retrieval, full-text and hybrid search for precision over keywords, and relational queries for structured, transactional memory that demands consistency. Combined with Oracle's operational story (backups, replication, high availability, governance), teams get a path from prototype to production without swapping storage layers along the way.

Who This Release Is For

The updated langchain-oracledb, the new langgraph-oracledb, and the deep-agents factory in langchain-oci are designed for:

AI developers and engineers building LangChain RAG pipelines, applications, retrievers, or conversational agents who need durable memory and retrieval in one place
Teams building LangGraph agents who need checkpointing and long-term memory on a production-grade backend
Teams building deep agents who want a provider-agnostic harness (Claude, OCI Generative AI, or a self-hosted model) with the agent's plan and state persisted to Oracle
ML engineers moving LangChain prototypes from ephemeral in-memory stores to production-grade persistence
Teams already running Oracle AI Database who want LangChain and LangGraph applications to write to the system of record directly
Technical leaders evaluating Oracle AI Database for unified agent infrastructure at scale

Documentation and quickstarts live in the Oracle AI Database LangChain and LangGraph integration guides; source is in the oracle/langchain-oracle repo, with runnable example notebooks in the Oracle AI Developer Hub. Install both packages in the same environment to run the full Oracle-native integration across LangChain and LangGraph.

Frequently Asked Questions

Do I need Oracle Cloud to use these packages?

No. Everything runs against Oracle AI Database Free in a local Docker container, with your own embedding model and your own LLM, and no OCI account. langchain-oci adds OCI Generative AI and other managed options when you want them, but they are opt-in. The code that runs on the free container runs unchanged against Oracle Autonomous AI Database in the cloud, so moving to production is a connect-string change, not a rewrite.

Which Oracle Database version do I need?

Oracle AI Database 23ai or later, which is where AI Vector Search and the VECTOR data type live. That covers OracleVS, OracleEmbeddings, and the memory primitives. Hybrid keyword-and-vector search through DBMS_HYBRID_VECTOR.SEARCH needs 26ai. The free 23ai container runs the chains, the LangGraph persistence, and the deep-agents examples end to end.

Can I bring my own model, or am I tied to OCI Generative AI?

Bring your own. The deep-agents factory takes any LangChain chat model through model=, so Claude, OpenAI, a self-hosted vLLM endpoint, or OCI Generative AI all drop in with nothing else changed. When you pass your own model, the OCI model id and auth are ignored.

How do I migrate from the Oracle classes in langchain-community?

It is an import-path change. langchain-oracledb ships the same class names and constructor signatures as the Oracle classes in langchain-community, so existing retrieval code keeps working once you swap the import. The new memory primitives, OracleSemanticCache and OracleChatMessageHistory, are not in the community package, so there is nothing to migrate there, only to adopt.

Do langchain-oracledb and langgraph-oracledb share one database and pool?

Yes, and that is the point. Both accept the same oracledb connections and pools, so an application that mixes LangChain chains and LangGraph agents writes to one system of record, with one set of credentials and one transaction surface, rather than fragmenting across separate backends.

How is this different from adding a dedicated vector database?

A dedicated vector database gives you similarity search and sits beside your operational data rather than with it. Here the vectors live in the same database as your chat history, your checkpoints, and your relational data, so vector, full-text, hybrid, and SQL retrieval all run in one engine, inside one transaction, under one backup and governance story. Cross-service consistency stops being something you engineer around.

Is there JavaScript or TypeScript support?

The vector store, embeddings, and document pipeline are available for JavaScript as @oracle/langchain-oracledb on npm. The new memory primitives, the LangGraph persistence, and the deep-agents factory are Python-first.

All three packages are available now on PyPI:

pip install langchain-oracledb            # updated: semantic cache + chat history
pip install langgraph-oracledb            # new: checkpointing + agent memory
pip install "langchain-oci[deepagents]"   # new: bring-your-own-model deep-agents factory

From Prompt to Persistence (Part 2): Putting the Multi-Tenant Agent Memory Schema to Work

Anya Summers — Thu, 23 Jul 2026 16:00:29 +0000

Retrieval, sharing, and the agent loop on top of a multi-tenant memory schema

Companion notebook: https://github.com/oracle-devrel/oracle-ai-developer-hub/blob/8c9ed028b1bea545f2cbefe057470a0294bf29b7/notebooks/multitenant_schema_walkthrough.ipynb

Key takeaways

Part 2 turns the schema into an operating system for agent memory. It explains how short-term memory, durable long-term memory, shared memory, and retrieval work together in the agent loop.
The Memory Manager is the control point. Agents should not write directly to memory tables; the manager enforces tenant context, provenance, deduplication, versioning, deletion, and typed reads and writes.
Shared memory is scope-based, not a separate table. A memory becomes shared when it is written broadly enough inside a tenant, such as with agent_id IS NULL, so multiple agents can read and update it safely.
One database engine simplifies tenant-safe retrieval. The article argues that keeping policies, personas, entities, summaries, workflows, toolbox data, knowledge base chunks, and conversation events in one engine reduces cross-system security and deletion risks.

Part 1 designed the durable layer: eight typed memory tables, each carrying the same four scope columns (tenant_id, user_id, agent_id, thread_id) and the same lifecycle columns (version, valid_from, valid_until, deleted_at, source_event_id). Tenant isolation lives in the database through row-level security (RLS). The other three scope dimensions are application-supplied filters. Provenance on every durable row is what makes versioned supersession and a provable right-to-forget cascade possible. If you haven't read it, start there, because the system we create in this post assumes all of it.

A schema doesn't do anything on its own. It says what can be stored and how it's isolated, but it doesn't retrieve or rank rows, and it doesn't decide what's worth keeping. This post is about the code that does. We'll start with the short-term layer that sits in front of the durable tables, then work up through shared memory, the Memory Manager that fronts the whole schema, the single retrieval query that one engine makes possible, and the agent loop that ties reads and writes to specific tables at specific steps.

Short-term memory in a multi-tenant context

STM is structurally different from LTM. It's ephemeral by design and scoped to the current run, and most of it lives outside the database. But it interacts with LTM at well-defined seams, and in a multi-tenant system those seams need the same discipline as the durable layer.

Working memory: LLM Context Window + Session Memory

The LLM context window is the most visible piece of working memory: the tokens passed to the model on this turn. The Memory Manager assembles it from the durable layer (active guidelines, active personas, retrieved entities, retrieved summaries) plus the volatile tail (the last N conversation events for this thread). Nothing about the assembly is database-level. The database supplies the inputs, and the manager composes them into the prompt the model actually sees.
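As a rough illustration of that assembly step, here is a minimal sketch. The function name assemble_context, the section titles, and the tail_size parameter are assumptions made for this example; the real manager decides its own prompt layout.

```python
def assemble_context(guidelines, personas, entities, summaries, events, tail_size=6):
    """Compose prompt sections: durable inputs first, volatile tail last."""
    # The volatile tail is the last N conversation events for this thread.
    tail = events[-tail_size:] if tail_size > 0 else []
    sections = [
        ("guidelines", guidelines),
        ("personas", personas),
        ("entities", entities),
        ("summaries", summaries),
        ("recent_conversation", tail),
    ]
    lines = []
    for title, items in sections:
        if items:                      # skip empty sections entirely
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)
```

The database only supplies the lists; everything about ordering and formatting stays in application code, which is the point the paragraph above makes.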

Session memory is the in-process scratchpad: tool call results, intermediate reasoning state, retrieval candidates the agent decided not to surface yet, and partial work the agent might want to reference later in the same run. In a single-tenant prototype, this could easily live in a Python dict on the AgentSession object. In multi-tenant SaaS, the question is whether to persist it at all and, if so, where.

class AgentSession:
    def __init__(self, tenant_id, user_id, agent_id, thread_id, run_id, memory):
        # Identity
        self.tenant_id = tenant_id
        self.user_id   = user_id
        self.agent_id  = agent_id
        self.thread_id = thread_id
        self.run_id    = run_id

        # Ephemeral STM (lost at end of run; reconstructable from conversation_memory)
        self.scratch     = {}     # tool outputs, intermediate state
        self.turn_buffer = []     # current turn's events before flush

        # Durable, via the memory manager
        self.memory = memory

The case for persisting session memory in multi-tenant SaaS is crash recovery. If the agent process dies mid-run, the user shouldn't have to start over. Since conversation memory already records everything durably, we can pull session state from it after a crash. The pattern that works: write session memory as conversation_memory events with event_type = 'session_state', keyed by run_id. Reads after a crash restore the scratchpad from the most recent state event for that run. No special retention class needed; these events follow the same lifecycle as every other conversation record.

That keeps session memory architecturally consistent with conversation memory (the same table, scope columns, RLS policy, and retention sweep). Anything recorded to conversation memory is durable anyway, so this isn't adding a new storage class; it's using the existing retention lifecycle. The retention_class column controls how long records stick around (short, standard, or audit windows), and a nightly sweep drops the partitions that have aged out. Session state tagged as 'short' gets cleaned up automatically after the retention window closes.
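A minimal sketch of that crash-recovery pattern, using a plain Python list as a stand-in for the conversation_memory table. The helper names (save_session_state, restore_session_state) and the payload field are assumptions for illustration; only event_type, run_id, and retention_class come from the description above.

```python
import json

# Stand-in for the conversation_memory table; a real system would INSERT rows.
conversation_memory = []

def save_session_state(run_id, scratch):
    """Append the scratchpad as a 'session_state' event keyed by run_id."""
    conversation_memory.append({
        "event_type": "session_state",
        "run_id": run_id,
        "retention_class": "short",      # aged out by the normal retention sweep
        "payload": json.dumps(scratch),  # hypothetical payload column
    })

def restore_session_state(run_id):
    """Return the most recent scratchpad for a run, or {} if none was saved."""
    events = [e for e in conversation_memory
              if e["event_type"] == "session_state" and e["run_id"] == run_id]
    if not events:
        return {}
    # Append order stands in for ORDER BY created_at DESC in the real table.
    return json.loads(events[-1]["payload"])
```

After a crash, a fresh AgentSession for the same run_id could set its scratch from restore_session_state(run_id) and carry on.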

Semantic Cache

The semantic cache is a vector index over recent conversation history, sitting between Working Memory and Long-Term Memory. The pattern handles a specific failure mode: a user references something from earlier in the conversation that's already fallen out of the volatile-tail window, but the context they're referring to isn't important enough to have been promoted to entity or summarization memory. Pure recent-turn retrieval misses it; semantic search over the full conversation history finds it.

In implementation, the semantic cache is a derived projection over conversation_memory. Two valid shapes:

Per-thread (narrow). A vector index over conversation_memory rows filtered by (tenant_id, thread_id). Useful for long-running threads where context drift within the same conversation is the failure mode.

Per-user-recent (broad). A vector index over conversation_memory rows filtered by (tenant_id, user_id) and created_at > SYSTIMESTAMP - INTERVAL '30' DAY. Useful for cross-thread continuity ("you mentioned this last week when we were working on something else").

The per-user-recent shape is more useful in SaaS because users come back across sessions and threads more often than they hit the long-thread case. Both shapes can coexist; the manager picks one based on the query intent.

The semantic cache doesn't need its own table. It's a vector index over an existing table, plus a retrieval function that knows how to fuse its hits with the volatile tail of the prompt. Retrieval fusion deserves its own deep dive; for this post, the schema is just CREATE VECTOR INDEX idx_conv_semantic ON conversation_memory (embedding) … on a conversation_memory table that has a populated embedding column for events older than the volatile-tail window.
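To make the two shapes concrete, here is a small in-memory sketch. It uses a brute-force cosine similarity in place of a vector index, and the function name semantic_cache_search and its parameters are assumptions for this example. In the database, the same filters would sit in the WHERE clause next to the vector index scan.

```python
import math
from datetime import datetime, timedelta, timezone

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_cache_search(rows, query_vec, *, tenant_id, thread_id=None,
                          user_id=None, now=None, window_days=30, top_k=3):
    """Per-thread shape when thread_id is given, otherwise per-user-recent."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=window_days)

    def in_scope(row):
        if row["tenant_id"] != tenant_id or row["embedding"] is None:
            return False
        if thread_id is not None:                      # narrow shape
            return row["thread_id"] == thread_id
        return row["user_id"] == user_id and row["created_at"] > cutoff  # broad shape

    hits = [r for r in rows if in_scope(r)]
    hits.sort(key=lambda r: cosine(r["embedding"], query_vec), reverse=True)
    return hits[:top_k]
```

Passing thread_id selects the per-thread shape; leaving it out and passing user_id selects the per-user-recent shape with its recency window.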

Shared Memory and Coordination

Both of these are top-level categories in the Oracle blog taxonomy because they're concerns that cut across the type hierarchy. In single-agent systems they collapse into normal LTM scoping. In multi-agent and multi-tenant systems they need separate treatment.

Shared Memory

Shared memory is any LTM row that multiple agents read and write under the same access boundary. The structural definition falls out of the scope columns directly: a row is shared when it sits at a scope broader than a single agent, which means agent_id IS NULL (shared across every agent in the tenant) or agent_id = '<group_id>' (shared across a defined coordination group).

A few examples to make this concrete inside a single SaaS tenant:

A support agent learns that "Acme's production database moved from us-east-1 to eu-west-1." Writing this as an entity_memory row at (tenant_id = 'acme', user_id = NULL, agent_id = NULL) makes it visible to every agent serving Acme. Whatever the support agent learns, every agent on the tenant can use. Compounding advantage across the whole tenant.
A billing agent and a support agent need to coordinate on whether a refund request has been approved. The decision lives in summarization_memory at (tenant_id = 'acme', user_id = :user_id, agent_id = NULL), where both agents can read it and neither owns it exclusively.
A research-assistant agent maintains a workflow ("how we evaluate competitor papers") that should be available to every research assistant Acme spawns. The workflow row sits at (tenant_id = 'acme', agent_id = NULL) so any agent instance can find it.

Shared memory doesn't need a new table type. It's a write-side discipline: when promoting a candidate to LTM, the manager picks the narrowest scope that's actually correct, and "actually correct" sometimes means broader than the agent that wrote it. The promotion gate from "From RAG to Memory Systems: Building Stateful AI Architecture" is where that decision belongs. Default to user scope, and promote to tenant scope only when the subject of the fact is the tenant entity itself and the fact has been observed from two independent sources.
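The scope rule at the end of that paragraph can be sketched as a small function. The candidate fields (subject_id, user_id, sources) are hypothetical, and counting distinct source labels is only a rough stand-in for "two independent sources"; a real promotion gate would need a stronger notion of independence.

```python
def choose_scope(candidate, tenant_id):
    """Pick the narrowest correct scope for a fact being promoted to LTM.

    Returns the user_id / agent_id values to write the row under.
    """
    about_tenant = candidate["subject_id"] == tenant_id
    # Assumption: distinct source labels approximate independent observations.
    independent_sources = len(set(candidate["sources"])) >= 2
    if about_tenant and independent_sources:
        return {"user_id": None, "agent_id": None}           # tenant scope: shared
    return {"user_id": candidate["user_id"], "agent_id": None}  # default: user scope
```

A fact about the tenant that has only been seen once stays at user scope until a second source confirms it.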

What shared memory does need is read-side coordination. Two agents writing facts about the same subject at the same time can produce contradictory rows if the writes overlap. The supersession pattern handles the resolution (later write wins by version), but the agents themselves need to know they might be writing on top of each other. The simplest pattern is optimistic concurrency: the manager's supersede_fact method checks that the row being superseded is still at the version the caller saw at read time, and raises a retry-able conflict if it isn't.
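Here is a minimal sketch of that optimistic-concurrency check, with a Python list standing in for the table. The VersionConflict exception, the row fields, and the now argument are assumptions for this example; in the database, the check, the stamp, and the insert would all run inside one transaction.

```python
class VersionConflict(Exception):
    """Retry-able: the row changed after the caller read it."""

def supersede_fact(table, fact_id, *, new_content, expected_version, now):
    """Close the current version and append the next one, unless someone got there first."""
    current = next(r for r in table
                   if r["fact_id"] == fact_id and r["valid_until"] is None)
    if current["version"] != expected_version:
        raise VersionConflict(
            f"expected v{expected_version}, found v{current['version']}")
    current["valid_until"] = now            # stamp the old version as superseded
    new_row = {"fact_id": fact_id, "version": current["version"] + 1,
               "content": new_content, "valid_from": now, "valid_until": None}
    table.append(new_row)
    return new_row
```

A caller that catches VersionConflict re-reads the fact, decides whether its update still makes sense, and retries with the fresh version number.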

Coordination

Coordination is the cross-agent messaging layer. Agent A finishes a step and hands off to Agent B. Agent C broadcasts an event that other agents subscribe to. Agent D queries the system for "who's working on this customer right now."

This post doesn't define a schema for coordination memory because the design space is still wide open. Many different patterns have emerged and new ones are still being experimented with. The right shape depends on the orchestration runtime as much as on the memory layer. The thing worth naming explicitly is that coordination is a separate concern from shared memory. Shared memory is "two agents read the same row." Coordination is "Agent A tells Agent B that something happened."

A working baseline for coordination in a multi-tenant SaaS context: events go into a partitioned table scoped by (tenant_id, coordination_group_id, created_at), with the same partitioning strategy as conversation_memory. Subscribers poll or stream from the table; producers append. The table participates in the same RLS policy as everything else, so coordination events never cross tenants by accident.
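That baseline is simple enough to sketch in memory. The CoordinationLog class, its method names, and the seq cursor are assumptions for illustration; the sequence number stands in for created_at ordering, and the tenant filter stands in for what RLS would enforce in the database.

```python
import itertools

class CoordinationLog:
    """Append-only event log; a stand-in for the partitioned coordination table."""

    def __init__(self):
        self._events = []
        self._seq = itertools.count(1)   # stand-in for created_at ordering

    def append(self, tenant_id, group_id, event_type, payload):
        """Producer side: add one event and return its sequence number."""
        event = {"seq": next(self._seq), "tenant_id": tenant_id,
                 "coordination_group_id": group_id,
                 "event_type": event_type, "payload": payload}
        self._events.append(event)
        return event["seq"]

    def poll(self, tenant_id, group_id, after_seq=0):
        """Subscriber side: events for one tenant and group newer than the cursor."""
        return [e for e in self._events
                if e["tenant_id"] == tenant_id
                and e["coordination_group_id"] == group_id
                and e["seq"] > after_seq]
```

Each subscriber keeps its own cursor (the last seq it processed) and passes it back as after_seq on the next poll.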

Shared memory and coordination, side by side.

The Memory Manager: one door into the schema

The schema is the contract. The Memory Manager is the only code allowed to touch it. Every read and write goes through the manager, and it enforces what the schema can't: every operation carries tenant context, every durable write carries provenance, data lands in the table that matches its type, and supersession runs in one transaction instead of an error-prone multi-step process.

The interface is small and shaped exactly like the typed tables it wraps. Eight write methods, eight read methods, one supersession method per supersedable type, one delete method. The one rule worth holding the line on is no escape hatch that bypasses the schema, because the moment one exists, every team that finds it will use it.

class MemoryManager:
    """One door into the schema. Tenant context comes from scope_ctx and is
    enforced by RLS; the manager adds provenance, dedup, embedding, and
    transactional supersession on top."""

    def __init__(self, db, scope_ctx): ...   # scope_ctx sets the tenant context for RLS

    # WRITES: one typed method per memory type; provenance required on durable types
    def write_entity(self, *, subject, predicate, content, confidence,
                     written_by, source_event_id,
                     user_id=None, agent_id=None, thread_id=None): ...
    # write_guideline, write_persona, write_summarization, write_workflow,
    # write_toolbox, ingest_document, write_conversation_event follow the same shape

    # READS: scoped retrieval; the tenant predicate is applied automatically
    def search_entities(self, query, *, user_id=None, agent_id=None, top_k=10): ...
    # read_active_guidelines, read_active_personas, read_active_toolbox,
    # search_summarizations, search_workflows, search_knowledge_base, read_thread

    # SUPERSESSION: versioned, with optimistic concurrency
    def supersede_entity(self, entity_id, *, new_content, written_by,
                         source_event_id, confidence, expected_version): ...

    # DELETION: right-to-forget, one transaction each
    def forget_user(self, user_id): ...
    def forget_tenant(self): ...

Four invariants show up across every method. Tenant context is read from scope_ctx, never passed by the caller, because the application set it once per request and the database is already enforcing it via RLS. Provenance is a required parameter on durable types; the manager refuses any write to a durable type (entity, summarization, conversation_event) that's missing its source_event_id. Supersession is one method per supersedable type, and it takes expected_version so concurrent writes get a retry-able conflict instead of silently clobbering each other. And deletion comes in two flavors (user, tenant), both wrapped in a single transaction internally.

The read side has its own invariants, and they're as important as the write ones. Every read filters the same way: deleted_at IS NULL, valid_until IS NULL or still in the future. A superseded fact carries a stamped valid_until, so the moment a newer version lands the old one drops out of every read without anyone asking for it, and an expired record drops out the same way when its clock runs out. Because that predicate lives in the manager rather than in each caller, there's no read path that can accidentally surface a stale or superseded row.
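The read predicate is small enough to show directly. This is a sketch over dict rows, and the helper names is_visible and visible_rows are assumptions for this example; the real manager would express the same condition in SQL.

```python
from datetime import datetime, timezone

def is_visible(row, now=None):
    """True when a row is live: not deleted, and neither expired nor superseded."""
    now = now or datetime.now(timezone.utc)
    if row["deleted_at"] is not None:
        return False
    # A superseded or expired row has valid_until stamped in the past.
    return row["valid_until"] is None or row["valid_until"] > now

def visible_rows(rows, now=None):
    """Apply the same filter to every read, so no caller can skip it."""
    return [r for r in rows if is_visible(r, now)]
```

Keeping the filter in one function mirrors the design choice in the paragraph above: callers never get the chance to forget it.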

Precedence across types is really two questions, and conflating them is where retrieval designs usually go wrong. The first is governance: an authored guideline outranks an inferred preference, every time. Policies load in full into the static prefix because a rule that applies has to apply, and an inferred value carries a confidence the manager can gate against a policy-supplied floor, so a weak guess never overrides an explicit instruction. The second is relevance: when an entity, a summary, and a vector hit all speak to the same query, the manager doesn't crown a winner. It tags each result with its type and a relevance tier and hands them back for the prompt to slot into the right region. Fusing those evidence types into a single ranking, or reranking across them, is a real pipeline and deserves its own post. The manager's job is to keep the candidates honest and labeled. Collapsing them comes later, in a stage built for it.

The manager is also the seam where filesystem-style ergonomics meet database substrate. From the agent's perspective, calling manager.write_entity(...) feels like writing a row to a notebook. The manager handles storage and indexing, the scope check, provenance, dedup, embeddings, supersession bookkeeping, and the tenant boundary. The agent code never writes raw SQL and never sees the schema directly. When the schema evolves, the manager changes in one place and every caller upgrades for free.

Memory Manager surface.

Why one engine wins for multi-tenant SaaS

By this point the schema is eight tables, four scope dimensions, row-level security on every table, transactional lifecycles, and a manager that wraps it all. The remaining question is where to put it. The polyglot persistence trap described in "From RAG to Memory Systems: Building Stateful AI Architecture" applies here with extra force, because every cross-system pain point in a single-org agent compounds in a SaaS provider running thousands of tenants.

The accidental architecture goes the same way it always does. Postgres for users, accounts, and policies, and while Postgres handles vectors and full-text search these days, at multi-tenant SaaS scale the workloads tend to split out anyway. A dedicated vector database for the entity, summarization, and knowledge-base embeddings. Elasticsearch or OpenSearch for the lexical side of hybrid retrieval. Object storage for raw conversation transcripts and ingested documents. Maybe a graph database for entity relationships.

Each component is a reasonable choice for the job it was built for. The pain begins the moment a cross-system operation has to honor tenant boundaries.

Backups split four ways and each one needs a tenant-partitioned strategy. Security models split four ways and each one needs the same tenant predicate enforced. Deletion splits four ways and a partial deletion in one of them is a regulatory finding. Per-tenant encryption splits four ways and key rotation has to coordinate across all of them. Per-tenant data residency (Acme's data must stay in EU; Globex's in US) splits four ways and each system has to support the same residency rules independently.

Oracle AI Database is a great choice for this architecture because one engine can host policy data, persona data, entity data with embedding columns, summarization data with embedding columns, knowledge-base documents and chunks, workflow definitions, toolbox definitions, and conversation events, all under one row-level security model that ties tenant isolation to a session context the application sets once per request.

Because it's all one engine, the agent's Retrieve step pulls every memory type it needs in a single round trip. Each branch of a UNION ALL returns the same shape (type, vec_score, lex_score, relevance, payload), so guidelines, personas, toolbox definitions, entities, summarizations, workflows, knowledge-base chunks, and the recent conversation tail all come back from a single query plan, with the tenant predicate enforced by RLS on every branch (condensed for brevity):

SELECT type, vec_score, lex_score,
         CASE WHEN vec_score IS NULL  THEN NULL
              WHEN vec_score >= 0.7   THEN 'high'
              WHEN vec_score >= 0.5   THEN 'standard'
              ELSE                         'low'
         END AS relevance,
         payload, sort_bucket
  FROM (
    WITH
    guidelines AS (              -- enumerated: loads in full, no score
      SELECT 'guideline' AS type, CAST(NULL AS NUMBER) AS vec_score,
             CAST(NULL AS NUMBER) AS lex_score, 0 AS sort_bucket,
             JSON_OBJECT('guideline_key' VALUE guideline_key,
                         'guideline_value' VALUE guideline_value) AS payload
        FROM guideline_memory
       WHERE deleted_at IS NULL AND valid_until IS NULL
         AND (user_id   IS NULL OR user_id   = :user_id)
         AND (agent_id  IS NULL OR agent_id  = :agent_id)
         AND (thread_id IS NULL OR thread_id = :thread_id)
    ),
    -- Entity: hybrid. A vector candidate pool and an Oracle Text pool, FULL OUTER JOINed.
    entity_vec AS (
      SELECT id, subject, predicate, confidence,
             VECTOR_DISTANCE(embedding,
               VECTOR_EMBEDDING(ALL_MINILM_L12_V2 USING :query AS DATA), COSINE) AS vec_dist
        FROM entity_memory
       WHERE deleted_at IS NULL AND (valid_until IS NULL OR valid_until > SYSTIMESTAMP)
         AND (user_id IS NULL OR user_id = :user_id)
       ORDER BY vec_dist FETCH FIRST 10 ROWS ONLY
    ),
    entity_lex AS (
      SELECT id, subject, predicate, confidence, SCORE(11) AS lex_raw
        FROM entity_memory
       WHERE deleted_at IS NULL AND (valid_until IS NULL OR valid_until > SYSTIMESTAMP)
         AND (user_id IS NULL OR user_id = :user_id)
         AND CONTAINS(content, :lex_query, 11) > 0
       ORDER BY lex_raw DESC FETCH FIRST 10 ROWS ONLY
    ),
    entities AS (
      SELECT * FROM (
        SELECT 'entity' AS type,
               CASE WHEN v.vec_dist IS NOT NULL THEN 1.0 / (1.0 + v.vec_dist) END AS vec_score,
               l.lex_raw AS lex_score,   -- raw lexical SCORE, shown alongside, not fused
               3 AS sort_bucket,
               JSON_OBJECT('subject'   VALUE COALESCE(v.subject, l.subject),
                           'predicate' VALUE COALESCE(v.predicate, l.predicate),
                           'confidence' VALUE COALESCE(v.confidence, l.confidence)) AS payload
          FROM entity_vec v
          FULL OUTER JOIN entity_lex l ON v.id = l.id
         ORDER BY COALESCE(1.0 / (1.0 + v.vec_dist), 0) DESC
      ) WHERE ROWNUM <= 5
    )
    -- persona, toolbox (enumerated); summarization (hybrid); workflow,
    -- knowledge_base (vector-only); recent_conversation (the volatile tail)
    -- all follow these same shapes and the same scope predicates.
    SELECT type, vec_score, lex_score, payload, sort_bucket FROM guidelines
    UNION ALL
    SELECT type, vec_score, lex_score, payload, sort_bucket FROM entities
    -- UNION ALL ... the remaining six branches
  )
  ORDER BY sort_bucket, vec_score DESC NULLS LAST
  -- tenant_id predicate appended automatically by RLS on every branch

Three retrieval styles live in that one statement. Enumerated types (guideline, persona, toolbox) load in full with no score, because a policy or tool that applies must apply. Hybrid types (entity, summarization) pair a vector candidate pool with an Oracle Text pool through a FULL OUTER JOIN, so each row carries both a vector score and a raw lexical score, shown side by side rather than fused. Fusing the two into a single ranking, or adding a reranking pass over the merged candidates, is its own deeper topic. Workflow and knowledge base rank by vector similarity alone. The outer query maps the vector score to a relevance tier (high, standard, low), and the type column tags every row so the application can route each one to the right slice of the prompt. A single branch is worth seeing on its own, because it shows how a policy shapes what comes back.
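For readers who want to sanity-check the thresholds, the score and tier logic from the query can be mirrored in a few lines of Python. The function names are assumptions for this example; the formula and cutoffs are copied from the SQL above.

```python
def vec_score(distance):
    """Map a vector distance to a score, as the entities branch does: 1 / (1 + distance)."""
    return None if distance is None else 1.0 / (1.0 + distance)

def relevance_tier(score):
    """Mirror the outer CASE expression: NULL stays NULL, then the 0.7 and 0.5 cutoffs."""
    if score is None:
        return None            # enumerated types carry no score
    if score >= 0.7:
        return "high"
    if score >= 0.5:
        return "standard"
    return "low"
```

Since cosine distance runs from 0 to 2, the score runs from 1 down to 1/3. A distance of about 0.43 or less lands in the high tier, and anything past a distance of 1.0 is low.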

The example below shows a single branch governed by a policy that supplies the confidence floor:

SELECT  e.id, e.subject, e.content, e.confidence, e.source_event_id,
        VECTOR_DISTANCE(
          e.embedding,
          VECTOR_EMBEDDING(ALL_MINILM_L12_V2 USING :query AS DATA),
          COSINE
        ) AS vec_dist
  FROM  entity_memory e
  JOIN  guideline_memory g
    ON  g.guideline_key = 'entity_retrieval'
   AND  g.valid_until IS NULL
   AND  g.user_id IS NULL                    -- tenant-scoped guideline
 WHERE  e.deleted_at IS NULL
   AND  (e.valid_until IS NULL OR e.valid_until > SYSTIMESTAMP)
   AND  e.confidence >= JSON_VALUE(g.guideline_value, '$.min_confidence' RETURNING NUMBER)
   AND  (e.user_id IS NULL OR e.user_id = :user_id)   -- app-supplied filter, like the canonical predicate
 ORDER BY vec_dist
 FETCH FIRST 10 ROWS ONLY;
-- tenant_id predicate appended automatically by RLS on both tables

The whole lookup, policy join included, collapses to one query plan over one transactional snapshot, governed by a single security model, backup strategy, and tenant boundary. The query optimizer decides whether to apply the JSON predicate before or after the vector ranking. The application doesn't coordinate across systems because there are no other systems to coordinate with.

A few things still don't collapse cleanly. High-volume conversation memory at SaaS scale (millions of events per day across thousands of tenants) sometimes belongs in a store tuned for high-cardinality append-only workloads, with the OLTP engine handling everything else. Large blob assets (raw PDFs of ingested documents, generated images, video tool outputs) belong in object storage, with only their metadata living in the database. The collapse that matters is across the seven LTM tables that hold the agent's actual knowledge (guidelines, personas, entities, summarizations, workflows, toolbox, knowledge base), which is where every meaningful retrieval join lives. Conversation memory and blobs can sit alongside; the join across the seven core types is the one the converged engine wins.

In a multi-tenant SaaS context, "one less system to enforce tenant isolation in" is the load-bearing benefit. Every external system is a place tenant boundaries can be forgotten. Collapsing the join surface into one engine collapses the security surface into one system, and that's the architectural property that compounds across every feature added afterward.

Polyglot vs converged in multi-tenant SaaS.

From schema to running agent

The schema doesn't exist for its own sake. It exists to be read from and written to by the agent loop, in a specific pattern that gives the architecture its predictability. Tying the two together is what turns the design from a data dictionary into an architecture.

The loop has five steps: Ingest, Retrieve, Infer & Act, Evaluate, Promote. Each step reads from and writes to specific tables, and the write rules are deliberately narrow.

Ingest writes to conversation_memory. Every user message, every tool call, every tool result, every model response lands as a conversation row first, before anything else happens. Conversation is the source from which everything else gets derived; if a piece of state didn't make it into a conversation row, it can't be reconstructed later.

Retrieve reads from the durable tables. Active guidelines via exact match by scope. Active personas via exact match by scope (the full set, every turn). Active toolbox via exact match by scope. Entities via hybrid retrieval. Summarizations via hybrid retrieval. Workflows via similarity match if the current task looks like one we've seen before. Knowledge base via vector search if the query is grounded in reference content. The semantic cache via vector search over recent conversation_memory if the user's reference points outside the volatile tail. No writes in this phase. Retrieve is read-only by design.

Freshness and conflict deserve a callout here. The manager filters deleted, expired, and superseded rows at read time, so nothing stale reaches assembly. When current rows still disagree, Retrieve doesn't pick a winner. The canonical typed tables are the source of truth, so a summary or semantic-cache hit ranks below the structured fact it came from. Each candidate comes back tagged by type, with its confidence and provenance, leaving fusion and reranking to a later stage. Guidelines are the exception. They apply in full, above the scored evidence, for the governance reasons the Memory Manager section covers.

Infer & Act reads the assembled context and writes back to conversation_memory (model response, tool calls, tool results, optional session_state events). It does not write to any other table directly. The agent's reasoning doesn't get promoted to entity memory by virtue of being uttered; it has to earn its way in through the promotion gate.

Evaluate runs the extraction passes for candidate entities, candidate personas, candidate summarizations, and candidate workflow updates. These get scored and queued. They don't get written to durable tables yet.

Promote is the only step allowed to call write_entity, write_persona, write_summarization, write_workflow, supersede_*, or ingest_document. The promotion gate lives here: each candidate gets type-specific verification, dedup, scope assignment, and a transactional write. No other step modifies the durable LTM layer.

This narrow write rule is why provenance works. Every durable write happens inside the Promote step, which means every durable write has a clear source_event_id (the conversation event the candidate was extracted from) and a clear written_by (the agent that ran the extraction, or 'promotion_job' if it ran in a background worker). Writing to durable memory from inside Infer & Act is a code-review smell worth catching every time.
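To make the narrow write rule concrete, here is a small in-memory sketch of the Evaluate and Promote split. `PromotionGate`, `Candidate`, `DurableRow`, the `min_score` threshold, and the exact-match dedup are hypothetical simplifications. The real gate described above does type-specific verification, scope assignment, and a transactional write. The `source_event_id` and `written_by` fields follow the names used in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    memory_type: str
    tenant_id: str
    content: str
    score: float
    source_event_id: int  # conversation event the candidate was extracted from


@dataclass(frozen=True)
class DurableRow:
    memory_type: str
    tenant_id: str
    content: str
    source_event_id: int
    written_by: str


class PromotionGate:
    def __init__(self, min_score: float = 0.7) -> None:
        self.min_score = min_score          # assumed threshold, for illustration
        self.queue: List[Candidate] = []
        self.durable: List[DurableRow] = []

    def evaluate(self, candidate: Candidate) -> None:
        # Evaluate only queues; it never touches durable rows.
        self.queue.append(candidate)

    def promote(self, written_by: str = "promotion_job") -> List[DurableRow]:
        # The only method that appends to self.durable.
        seen = {(r.tenant_id, r.memory_type, r.content.strip().lower()) for r in self.durable}
        written: List[DurableRow] = []
        for cand in self.queue:
            key = (cand.tenant_id, cand.memory_type, cand.content.strip().lower())
            # Naive verification and dedup: score threshold plus exact match.
            if cand.score < self.min_score or key in seen:
                continue
            row = DurableRow(cand.memory_type, cand.tenant_id, cand.content,
                             cand.source_event_id, written_by)
            self.durable.append(row)
            written.append(row)
            seen.add(key)
        self.queue.clear()
        return written
```

Since `promote` is the single place that appends to `durable`, every durable row ends up with both provenance fields filled in, and a write from anywhere else stands out in review.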

Loop steps and table access in a multi-tenant context.

Where do we go from here

This schema is an advanced starting point for multi-tenant agent memory.

The eight core tables above are the canonical layer: the governed, versioned, tenant-isolated source of truth. Derived context (alternate embeddings, summarization rollups, pre-joined retrieval views, materialized projections) is everything optimized for retrieval that can be rebuilt from the canonical layer. Mixing the two is how memory systems start to drift. Separating them is how the schema stays honest as it scales to thousands of tenants.

Querying the derived layer well is its own challenge. Hybrid retrieval (vector plus lexical plus metadata, fused and reranked) is a multi-stage pipeline, and there's no single library call that does it for you. The converged query above is what makes that pipeline tractable to write as one query plan instead of four cross-system round trips per tenant, but it stops where the ranking starts: it returns vector and lexical scores side by side without fusing them. Fusing and reranking those candidates, and querying across the canonical and derived layers described above, are where the next posts go.

A vector store is not a memory system, and a single-org schema isn't a SaaS memory system. A typed, multi-tenant schema with provenance, scope, tenant isolation, and lifecycle is. Get this part right and everything else is optimization. Get it wrong and you'll spend the next year explaining why one tenant's data showed up in another tenant's results.
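The fusion step is deferred to later posts, and this article does not say which method they will use. As a placeholder, here is a sketch of reciprocal rank fusion (RRF), one widely used way to merge a vector ranking and a lexical ranking without comparing their raw scores. The function name and the example IDs are hypothetical. `k = 60` is a commonly used default for this method, not a value taken from this article.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists; each list is ordered best-first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical candidate IDs from the two score columns of a hybrid query.
vector_ranked = ["a", "b", "c"]
lexical_ranked = ["b", "c", "d"]
fused = reciprocal_rank_fusion([vector_ranked, lexical_ranked])
```

RRF uses only rank positions, so it sidesteps the problem that vector distances and lexical scores live on different scales. In a multi-tenant setting the input lists would already be restricted to one tenant before fusion.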

Appendix: the seed dataset

The companion repo for this article ships a runnable version of the DDL from Part 1 plus a seed dataset that covers all eight core LTM tables (plus the document chunks and deletion_events tables) across three example tenants. The seed follows a SaaS research-assistant scenario: each tenant has its own users, agents, ingested document collection, learned workflows, and toolbox configuration. The notebook walks through tenant provisioning (creating the per-tenant guideline and toolbox defaults), one write per type, the hybrid retrieval query against entity memory, the supersession of a fact after a paper is updated, the right-to-forget cascade for one user inside one tenant, and the tenant-termination cascade for an entire tenant.

If you only have time to try one thing from this article, clone the repo and run the seed. The schema and dataset are the foundation we'll build on in future posts.
