DEV Community: Actian for Developers

Build an OpenClaw Memory Plugin with Actian VectorAI DB

Praise James — Thu, 09 Jul 2026 18:07:12 +0000

Context compaction is a common failure point for most OpenClaw memory systems. When the context window fills, OpenClaw compacts the session to make room for new messages. Any memory operations that were not flushed to durable storage can be lost during this transition.

The default memory-core backend uses a per-agent SQLite database powered by sqlite-vec. This architecture is lightweight, easy to set up, and works well for individual developers running a single agent on a local machine. It becomes fragile when you run multiple agents or when file-based state needs to be shared or managed outside the runtime.

To address these limitations, this tutorial demonstrates how to build a production-ready OpenClaw memory plugin powered by Actian VectorAI DB. Rather than storing embeddings and memory records inside local SQLite files, the plugin externalizes memory into a dedicated vector database designed for high availability, durability, and independent scaling.

By moving memory management outside the runtime, agents gain access to a centralized knowledge layer that teams can share across processes, hosts, and deployment environments.

How OpenClaw Memory Works by Default

OpenClaw uses a default memory plugin called memory-core. It indexes MEMORY.md into chunks and stores them in a per-agent SQLite database located at ~/.openclaw/memory/<agentId>.sqlite. Inside this file are two key components: an FTS5 (Full-Text Search version 5) index for keyword search and a vec0 vector index powered by the sqlite-vec extension for semantic search.

The important design point is that the Markdown files in the agent workspace are the source of truth. Files like MEMORY.md and memory/YYYY-MM-DD.md hold the actual durable content. The SQLite database does not store primary memory. It only builds a derived, regenerable index over those files. This distinction matters because this tutorial replaces the indexing layer, not the underlying memory files.
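To make the "derived index" idea concrete, here is a minimal sketch of how a Markdown memory file could be turned into records carrying the same text, source, and line fields that the plugin later stores as payload. The function name chunkMarkdown and the blank-line splitting rule are assumptions for illustration; the source does not describe how memory-core actually chunks files.

```typescript
// Hypothetical helper: split a memory file into paragraph-sized records.
// Each record remembers where it came from, so the index can always be
// rebuilt from the Markdown files, which remain the source of truth.
type MemoryRecord = { text: string; source: string; line: number };

function chunkMarkdown(source: string, content: string): MemoryRecord[] {
    const records: MemoryRecord[] = [];
    const lines = content.split("\n");
    let buffer: string[] = [];
    let startLine = 1;

    for (let i = 0; i < lines.length; i++) {
        if (lines[i].trim() === "") {
            // A blank line closes the current paragraph, if there is one.
            if (buffer.length > 0) {
                records.push({ text: buffer.join("\n"), source, line: startLine });
                buffer = [];
            }
        } else {
            // Line numbers are 1-based, like an editor would show them.
            if (buffer.length === 0) startLine = i + 1;
            buffer.push(lines[i]);
        }
    }
    if (buffer.length > 0) {
        records.push({ text: buffer.join("\n"), source, line: startLine });
    }
    return records;
}

const sample = "# Notes\n\nPrefers dark mode.\nUses pnpm.\n\n\nDeploys on Fridays.";
console.log(chunkMarkdown("MEMORY.md", sample));
```

Because every record can be regenerated from the file, losing the index costs a rebuild, not the memory itself.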

In practice, this design introduces some failure modes that push teams toward external memory backends.

Failure modes in the default memory-core backend

Context compaction events can orphan in-flight memory operations. Even though recent fixes in the May 2026 release improve how OpenClaw closes local embedding providers when active-memory searches time out, the underlying issue remains tied to lifecycle boundaries during compaction.

One of the failure modes is that multiple agent instances cannot reliably share a single <agentId>.sqlite file. This creates contention in distributed setups where agents need shared knowledge, leading to inconsistent memory state across instances.

Another failure mode is that the sqlite-vec extension must exist in the runtime environment. When it is missing, OpenClaw falls back to in-process JavaScript cosine similarity, which does not scale for production workloads or high-throughput memory search.
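To see why an in-process fallback struggles at scale, consider what a brute-force cosine search has to do. The sketch below is not OpenClaw's fallback code; the names StoredChunk, cosineSimilarity, and bruteForceSearch are illustrative. It shows only the general technique: every query compares against every stored vector, so the cost grows linearly with both the number of chunks and the embedding dimension.

```typescript
// Illustrative brute-force search; names are hypothetical, not OpenClaw APIs.
type StoredChunk = { id: number; vector: number[] };
type ScoredHit = { id: number; score: number };

function cosineSimilarity(a: number[], b: number[]): number {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    const denom = Math.sqrt(normA) * Math.sqrt(normB);
    // A zero vector has no direction, so treat its similarity as 0.
    return denom === 0 ? 0 : dot / denom;
}

function bruteForceSearch(query: number[], chunks: StoredChunk[], limit: number): ScoredHit[] {
    // One full pass over the corpus per query, then a sort of all scores.
    return chunks
        .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.vector) }))
        .sort((x, y) => y.score - x.score)
        .slice(0, limit);
}

console.log(
    bruteForceSearch([1, 0], [
        { id: 1, vector: [0, 1] },
        { id: 2, vector: [2, 0] },
        { id: 3, vector: [1, 1] },
    ], 2)
);
```

With 768-dimensional embeddings, each comparison is 768 multiply-adds, and a query touches every stored chunk. A dedicated vector index avoids that full scan.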

Context compaction lifecycle

Context compaction runs when the context window fills and OpenClaw removes older messages to free space. Before compaction begins, OpenClaw triggers an automatic memory flush, but this only captures memories the agent has explicitly written. Any in-flight operations or pending writes that have not reached the memory files are not persisted.

After compaction, the agent resumes with a compressed version of its previous conversation. At that point, memory recall depends on what has already been written to the memory files and indexed for retrieval. Any information that was still in-flight and had not been persisted before compaction cannot be recalled during that session. The issue is not that durable memory disappears, but that memory that was never successfully persisted cannot be recovered.

Because the SQLite database lives inside the agent process, the vector index is tightly coupled to the runtime. When the process restarts or compaction resets the context, the index does not exist independently. Moving the memory layer to VectorAI DB decouples storage from the agent lifecycle, so the memory system survives compaction and continues to serve queries across sessions.
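The lifecycle described above can be pictured with a toy model. This is a deliberately simplified sketch, not OpenClaw's implementation: the class ToySession and its fields are hypothetical and exist only to show which data survives a compaction under the rules stated in this section.

```typescript
// Toy model only. Class and field names are hypothetical, not OpenClaw APIs.
class ToySession {
    durable: string[] = [];  // stands in for MEMORY.md and memory/YYYY-MM-DD.md
    written: string[] = [];  // memories the agent has explicitly written, awaiting flush
    inFlight: string[] = []; // operations that have not completed yet
    context: string[] = [];  // messages currently in the context window

    compact(keepLast: number): void {
        // The pre-compaction flush persists only explicitly written memories.
        this.durable.push(...this.written);
        this.written = [];
        // In-flight operations never reach durable storage in this model.
        this.inFlight = [];
        // Older messages are dropped to free space.
        this.context = keepLast > 0 ? this.context.slice(-keepLast) : [];
    }
}

const session = new ToySession();
session.written.push("user prefers metric units");
session.inFlight.push("summary of last tool call");
session.context.push("msg 1", "msg 2", "msg 3");
session.compact(1);
console.log(session.durable, session.context);
```

After compact(1), the explicitly written memory is durable and only the last message remains in context, while the in-flight item is gone. That matches the point above: durable memory does not disappear, but anything that never reached it cannot be recalled.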

The Plugin Slot System

OpenClaw uses a named slot system to control extensibility points inside the agent runtime. The memory system is one of these slots, exposed as plugins.slots.memory. This slot determines which backend handles all memory-related tool calls.

The slot system separates memory behavior from memory storage. That separation allows you to move from a local SQLite-based index to an external vector database without modifying agent logic or tool definitions. In practice, this means one config change followed by a restart is enough to replace the entire memory backend.

OpenClaw enforces a single active memory plugin at a time. The architecture does not support multiple concurrent memory backends in the same slot. Issue #60572 on GitHub proposes splitting memory into sub-slots such as memory.recall, memory.store, and memory.compaction, but this had not been merged as of May 2026. As a result, every implementation must fully replace the existing memory provider rather than extend it.

Figure 1: Plugin slot architecture

Before writing the VectorAI DB plugin, it helps to study existing implementations of this pattern. The noncelogic/openclaw-memory-lancedb and lancedb/openclaw-lancedb-demo repositories show how external vector stores integrate into the slot system. The serenichron/openclaw-memory-mem0 plugin is the cleanest reference. It demonstrates the minimal structure required to implement a working memory backend and serves as the template for the VectorAI DB integration in this tutorial.

Setting Up VectorAI DB

Before writing the plugin, get VectorAI DB running locally and verify vector search works end-to-end. To execute the commands in this section, install the following:

- NPM
- Docker
- Ollama
- OpenClaw CLI
- Together AI API key

In your project directory, create a docker-compose.yml:

services:
 vectorai:
   image: actian/vectorai:latest
   platform: linux/amd64 # Keep this line on macOS; comment it out on other platforms.
   container_name: vectorai_db
   ports:
    - "6573:6573"
    - "6574:6574"
   volumes:
    - ./data:/var/lib/actian-vectorai
   environment:
    - VECTORAI_LOG_LEVEL=info
   restart: unless-stopped

Start the VectorAI DB server:

docker-compose up -d

Ensure Ollama is running and the model is ready:

ollama pull nomic-embed-text
ollama pull llama3.2:3b
ollama serve

Create a package.json file:

{
 "name": "vectorai-memory",
 "version": "1.0.0",
 "type": "module",
 "main": "dist/index.js",
 "openclaw": {
   "extensions": [
     "./dist/index.js"
   ]
 },
 "scripts": {
   "build": "esbuild index.ts --bundle --platform=node --format=esm --external:openclaw --external:@actian/vectorai-client --external:@grpc/grpc-js --external:typebox --outfile=dist/index.js",
   "build:check": "tsc --noEmit"
 },
 "dependencies": {
   "@actian/vectorai-client": "^1.0.2",
   "@sinclair/typebox": "^0.34.49",
   "openclaw": "^2026.5.28",
   "typebox": "^1.1.39"
 },
 "devDependencies": {
   "@types/node": "^22.0.0",
   "esbuild": "^0.28.0",
   "typescript": "^5.0.0"
 }
}

Create a tsconfig.json file:

{
   "compilerOptions": {
       "target": "ES2022",
       "module": "ESNext",
       "moduleResolution": "bundler",
       "strict": true,
       "esModuleInterop": true,
       "skipLibCheck": true
   },
   "include": [
       "index.ts"
   ]
}

Run the command:

npm install

This installs all the dependencies.

Create a collection the plugin will use. In your project directory, create a file create-collection.ts:

import { VectorAIClient } from "@actian/vectorai-client";


const client = new VectorAIClient("localhost:6574", {
   restUrl: "http://localhost:6573",
});


// dim=768 matches nomic-embed-text output dimensions.
await client.collections.create("openclaw-memory", {
   dimension: 768,
   distanceMetric: "COSINE",
});


// Smoke test: insert one vector and query it back.
await client.points.upsert("openclaw-memory", [
   {
       id: 1,
       vector: new Array(768).fill(0.01),
       payload: { text: "smoke test", source: "MEMORY.md", line: 1 },
   },
]);


const results = await client.points.search(
   "openclaw-memory",
   new Array(768).fill(0.01),
   { limit: 1 }
);


console.log("Collection ready. Top result:", results[0].payload.text);

Verify the connection by running the command:

npx tsx create-collection.ts

You should see output similar to the following:
Figure 2: Verify connection

Building the Plugin

The OpenClaw memory slot expects a plugin that implements the memory tools and lifecycle hooks used by the agent runtime. The plugin below replaces the default SQLite/sqlite-vec backend with VectorAI DB.

Create a file index.ts:

import { defineToolPlugin } from "openclaw/plugin-sdk/tool-plugin";
import { Type } from "typebox";
import { VectorAIClient, hasId } from "@actian/vectorai-client";


const DEFAULT_GRPC = "localhost:6574";
const DEFAULT_REST = "http://localhost:6573";
const DEFAULT_OLLAMA = "http://localhost:11434";
const DEFAULT_COLLECTION = "openclaw-memory";
const EMBED_MODEL = "nomic-embed-text";
const DEFAULT_LIMIT = 5;
const EMBED_DIM = 768;


function hashId(input: string): number {
   return Math.abs(
       input.split("").reduce((acc, ch) => (Math.imul(31, acc) + ch.charCodeAt(0)) | 0, 0)
   );
}


async function embed(text: string, ollamaHost: string): Promise<number[]> {
   const res = await fetch(`${ollamaHost}/api/embeddings`, {
       method: "POST",
       headers: { "Content-Type": "application/json" },
       body: JSON.stringify({ model: EMBED_MODEL, prompt: text }),
   });
   if (!res.ok) throw new Error(`Ollama embed failed: ${res.status} ${res.statusText}`);
   return (await res.json()).embedding as number[];
}


export default defineToolPlugin({
   id: "vectorai-memory",
   name: "VectorAI Memory",
   description: "Routes OpenClaw memory tools to Actian VectorAI DB via Ollama embeddings.",


   configSchema: Type.Object({
       grpcAddr: Type.Optional(
           Type.String({ description: "VectorAI gRPC address (default: localhost:6574)" })
       ),
       restUrl: Type.Optional(
           Type.String({ description: "VectorAI REST URL (default: http://localhost:6573)" })
       ),
       ollamaHost: Type.Optional(
           Type.String({ description: "Ollama host for embeddings (default: http://localhost:11434)" })
       ),
       collection: Type.Optional(
           Type.String({ description: "VectorAI collection name (default: openclaw-memory)" })
       ),
   }),


   tools: (tool) => [


       tool({
           name: "vectorai_store",
           label: "Store in VectorAI",
           description: "Embed and store a text chunk in VectorAI DB. Returns the assigned numeric ID.",
           parameters: Type.Object({
               text: Type.String({ description: "The text content to embed and store." }),
               source: Type.String({ description: "The source file or identifier this text came from." }),
               line: Type.Optional(Type.Number({ description: "Line number within the source file." })),
           }),
           async execute({ text, source, line }, config, ctx) {
               ctx?.signal?.throwIfAborted();


               const grpcAddr = config?.grpcAddr ?? DEFAULT_GRPC;
               const restUrl = config?.restUrl ?? DEFAULT_REST;
               const ollamaHost = config?.ollamaHost ?? DEFAULT_OLLAMA;
               const collection = config?.collection ?? DEFAULT_COLLECTION;


               const client = new VectorAIClient(grpcAddr, { restUrl });
               const vector = await embed(text, ollamaHost);
               const id = hashId(`${source}:${line ?? 0}:${text.slice(0, 40)}`);


               await client.points.upsert(collection, [
                   { id, vector, payload: { text, source, line: line ?? 0 } },
               ]);


               return JSON.stringify({ stored: true, id, collection });
           },
       }),


       tool({
           name: "vectorai_search",
           label: "Search VectorAI",
           description: "Semantic search over stored memories. Returns ranked results with scores.",
           parameters: Type.Object({
               query: Type.String({ description: "Natural language query to find semantically similar stored entries." }),
               limit: Type.Optional(
                   Type.Number({
                       description: `Max results to return (default: ${DEFAULT_LIMIT}, max: 20).`,
                       minimum: 1,
                       maximum: 20,
                   })
               ),
           }),
           async execute({ query, limit }, config, ctx) {
               ctx?.signal?.throwIfAborted();


               const grpcAddr = config?.grpcAddr ?? DEFAULT_GRPC;
               const restUrl = config?.restUrl ?? DEFAULT_REST;
               const ollamaHost = config?.ollamaHost ?? DEFAULT_OLLAMA;
               const collection = config?.collection ?? DEFAULT_COLLECTION;


               const client = new VectorAIClient(grpcAddr, { restUrl });
               const vector = await embed(query, ollamaHost);
               const hits = await client.points.search(collection, vector, {
                   limit: limit ?? DEFAULT_LIMIT,
               });


               return JSON.stringify(
                   hits.map((h: any) => ({
                       id: h.id,
                       score: h.score,
                       text: h.payload?.text ?? null,
                       source: h.payload?.source ?? null,
                       line: h.payload?.line ?? null,
                   }))
               );
           },
       }),


       tool({
           name: "vectorai_recall",
           label: "Recall from VectorAI",
           description: "Retrieve a specific memory by its numeric ID.",
           parameters: Type.Object({
               id: Type.Number({ description: "The numeric ID of the memory point to retrieve." }),
           }),
           async execute({ id }, config, ctx) {
               ctx?.signal?.throwIfAborted();


               const grpcAddr = config?.grpcAddr ?? DEFAULT_GRPC;
               const restUrl = config?.restUrl ?? DEFAULT_REST;
               const collection = config?.collection ?? DEFAULT_COLLECTION;


               const client = new VectorAIClient(grpcAddr, { restUrl });
               const hits = await client.points.search(
                   collection,
                   new Array(EMBED_DIM).fill(0),
                   { limit: 1, filter: hasId([id]) }
               );


               const h = hits[0];
               return JSON.stringify(
                   h?.payload
                       ? { id: h.id, text: h.payload.text, source: h.payload.source, line: h.payload.line }
                       : null
               );
           },
       }),


       tool({
           name: "vectorai_forget",
           label: "Forget from VectorAI",
           description: "Permanently delete a stored memory by its numeric ID.",
           parameters: Type.Object({
               id: Type.Number({ description: "The numeric ID of the memory point to delete." }),
           }),
           async execute({ id }, config, ctx) {
               ctx?.signal?.throwIfAborted();


               const grpcAddr = config?.grpcAddr ?? DEFAULT_GRPC;
               const restUrl = config?.restUrl ?? DEFAULT_REST;
               const collection = config?.collection ?? DEFAULT_COLLECTION;


               const client = new VectorAIClient(grpcAddr, { restUrl });
               await client.points.delete(collection, { ids: [id] });


               return JSON.stringify({ deleted: true, id, collection });
           },
       }),


   ],
});

Run the command:

npm run build

This creates the file dist/index.js.

The plugin uses OpenClaw's plugin SDK and exports a standard plugin entry.
The defineToolPlugin() function is the plugin entry point. OpenClaw calls it during startup and exposes the runtime API used to register tools and lifecycle hooks.

The plugin creates a VectorAI DB client and an embedding helper that generates vectors using Ollama's nomic-embed-text model. Every memory operation uses the same embedding pipeline, which keeps retrieval and storage consistent.
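Because the collection was created with a fixed dimension of 768, a vector of any other length would not match it. The plugin above does not check this. As an optional safeguard, a small guard like the following could wrap the result of embed() before it reaches the database. The helper name assertEmbeddingDim is an assumption for illustration, not part of the OpenClaw SDK or the VectorAI client.

```typescript
// Hypothetical guard, not part of any SDK. 768 matches the collection
// dimension used in create-collection.ts for nomic-embed-text.
const EXPECTED_DIM = 768;

function assertEmbeddingDim(vector: number[], expected: number = EXPECTED_DIM): number[] {
    if (vector.length !== expected) {
        throw new Error(`Embedding has ${vector.length} dimensions, expected ${expected}`);
    }
    if (!vector.every((v) => Number.isFinite(v))) {
        throw new Error("Embedding contains non-finite values");
    }
    return vector;
}

// Passes: the length matches and every value is finite.
assertEmbeddingDim(new Array<number>(768).fill(0.01));

// Fails fast with a readable message instead of a database-side error.
try {
    assertEmbeddingDim([0.1, 0.2, 0.3]);
} catch (err) {
    console.log((err as Error).message);
}
```

A check like this is most useful when someone swaps EMBED_MODEL for a model with a different output size without recreating the collection.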

Tool implementations

The plugin registers four tools that mirror OpenClaw's memory operations.

vectorai_store embeds a text chunk and stores it together with metadata such as the source file path and line number.

vectorai_search embeds the query, performs a vector search against the openclaw-memory collection, and returns the most relevant memories ranked by semantic similarity.

vectorai_recall retrieves a specific memory by ID.

Each stored record contains metadata. This metadata makes it possible to trace search results back to the original memory file.

vectorai_forget removes a stored memory from the VectorAI DB collection.

At this point, OpenClaw can already store, search, recall, and delete memories through VectorAI DB instead of SQLite/sqlite-vec.
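One detail of vectorai_store worth understanding is how it derives point IDs. The sketch below copies hashId from the plugin and wraps the key construction in a helper named pointKey (that name is introduced here for illustration). It shows two consequences of the design: storing the same chunk twice yields the same ID, so the upsert overwrites the earlier point, and two different chunks that share a source, a line, and their first 40 characters also map to the same ID.

```typescript
// Copied from the plugin: a 32-bit string hash folded to a non-negative number.
function hashId(input: string): number {
    return Math.abs(
        input.split("").reduce((acc, ch) => (Math.imul(31, acc) + ch.charCodeAt(0)) | 0, 0)
    );
}

// Same key layout as vectorai_store; the helper name is hypothetical.
function pointKey(source: string, line: number | undefined, text: string): string {
    return `${source}:${line ?? 0}:${text.slice(0, 40)}`;
}

const first = "Scheduled maintenance for Unit 7 is every 90 days.";
const second = "Scheduled maintenance for Unit 7 is every 30 days.";

// Re-storing identical input is idempotent: same key, same ID.
console.log(hashId(pointKey("MEMORY.md", 12, first)) === hashId(pointKey("MEMORY.md", 12, first))); // true

// The texts differ only after character 40, so the keys and IDs are identical
// and the second upsert would replace the first.
console.log(hashId(pointKey("MEMORY.md", 12, first)) === hashId(pointKey("MEMORY.md", 12, second))); // true
```

If overwriting on a shared prefix is not what you want, hashing the full text instead of the first 40 characters is one option. Any 32-bit hash can still collide on unrelated inputs, so treat these IDs as convenient rather than guaranteed unique.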

Configure OpenClaw

Create an openclaw.plugin.json file that registers the plugin:

{
   "id": "vectorai-memory",
   "name": "VectorAI memory",
   "version": "1.0.0",
   "description": "Store and retrieve data in Actian VectorAI DB using semantic vector search.",
   "activation": {
       "onStartup": true
   },
   "contracts": {
       "tools": [
           "vectorai_store",
           "vectorai_search",
           "vectorai_recall",
           "vectorai_forget"
       ]
   },
   "configSchema": {
       "type": "object",
       "additionalProperties": false,
       "properties": {
           "grpcAddr": {
               "type": "string"
           },
           "restUrl": {
               "type": "string"
           },
           "ollamaHost": {
               "type": "string"
           },
           "collection": {
               "type": "string"
           }
       }
   }
}

This single configuration change swaps the memory backend from SQLite/sqlite-vec to VectorAI DB. The agent behavior and prompts stay the same, but the memory layer changes.

Installing and activating the plugin

At this point, the plugin is complete. The final step is swapping OpenClaw from the default memory-core backend to the new VectorAI DB backend.

So far, your directory structure should look like this:

.
├── create-collection.ts
├── data
├── dist
│   └── index.js
├── docker-compose.yml
├── index.ts
├── openclaw.plugin.json
├── tsconfig.json
├── package-lock.json
└── package.json

Install the plugin from your local directory:

openclaw plugins install .

You should see output similar to the following:

Figure 3: OpenClaw plugin installation

Configure your OpenClaw agent to use the Llama-3.3-70B-Instruct-Turbo chat model from Together AI by default. The first command reads your API key from the TOGETHER_API_KEY environment variable, so set that variable before running it.

openclaw onboard --non-interactive --accept-risk --mode local --auth-choice together-api-key --together-api-key "$TOGETHER_API_KEY"
openclaw config set agents.defaults.model.primary "together/meta-llama/Llama-3.3-70B-Instruct-Turbo"
openclaw config set plugins.allow '["vectorai-memory", "together"]'

Restart the OpenClaw gateway:

openclaw gateway restart

Test connection

Check which agent OpenClaw is using:

openclaw agents list

You should see output similar to the following:

Figure 4: Getting the agent

In this case, OpenClaw uses the default agent main.

To verify the plugin works, store a test memory in VectorAI DB using the plugin:

openclaw agent --agent main --message '/tool vectorai_store {"text":"Rome is the capital of Italy","source":"test","line":1}'

You should see:

Figure 5: Storing to VectorAI DB

To retrieve it, ask:

openclaw agent --agent main --message '/tool vectorai_search {"query":"what is the capital of Italy?"}'

You should see results similar to the following:

Figure 6: Agent retrieving from VectorAI DB

Why On-Premises Memory Matters for Production Agents

For many teams, cloud-hosted memory services are a practical choice. They simplify deployment and reduce operational overhead. However, they are not always an option.

Production agents often run in environments where data cannot leave the network boundary. Financial institutions, healthcare organizations, government agencies, and industrial environments frequently restrict cloud egress or require local data residency. In these cases, the memory layer must run alongside the application infrastructure.

VectorAI DB allows teams to keep semantic search and durable memory inside their own environment. Memory remains available across agent restarts, context compaction events, and multiple agent instances without depending on an external service.

Wrapping Up

OpenClaw's default SQLite/sqlite-vec memory backend works well for a single agent running on a single machine. As soon as you need shared memory, durable storage across context compaction events, or independent management of the vector store, an external backend becomes a better fit.

In this tutorial, you built a custom OpenClaw plugin that routes memory operations to VectorAI DB. The swap required only a plugin installation, a configuration change, and a gateway restart. The agent behavior stayed the same, but the memory layer became persistent, shareable, and independent of the agent runtime.

Get started with Actian VectorAI DB Community Edition by signing up today. Check the documentation for deployment and usage instructions, and participate in the Discord community for support and discussions.

Running Gemma 2B on Edge Hardware with Actian VectorAI DB

Praise James — Thu, 09 Jul 2026 17:49:35 +0000

Today, running a powerful language model entirely on edge hardware is no longer the hard part. The problem is making it useful once it is there. A developer can deploy Gemma 2B on an NVIDIA Jetson Orin Nano or a custom hardware device, run inference locally, and generate responses without sending a single token to the cloud.

The model performs well, supports an 8K token context window, and delivers enough throughput for many production edge applications. The real challenge appears when the agent needs to answer questions from a 50,000-document maintenance corpus that is far larger than anything that fits inside its context window.

A local model without a local vector store faces two choices: overload the context window with documents that do not fit, or call a remote retrieval service that may not be reachable in an offline environment. Neither option works for industrial systems, robotics platforms, field service devices, or deployments in regulated industries where connectivity is unreliable or unavailable.

This article builds a complete local inference stack using the Gemma 2B model. Gemma 2B handles text generation through Ollama. Actian VectorAI DB handles semantic retrieval from a large document corpus. The entire stack runs on the device, requires no cloud services during operation, and provides a practical foundation for retrieval-augmented agents deployed at the edge.

What Gemma 2B Brings to Edge Hardware

Most edge AI developers no longer struggle to run a language model locally. The challenge is finding one that delivers useful reasoning and text generation while fitting within the memory and power constraints of edge devices.

Gemma 2B is a lightweight open model designed for local inference. It can run on devices such as the NVIDIA Jetson Orin Nano, Raspberry Pi 5, and industrial gateways. In this setup, the model runs on-device, so data stays local instead of flowing through cloud infrastructure.

Model: Gemma 2B
Parameters: 2 billion
Disk size: ~1.6GB
Target hardware: Raspberry Pi 5, Jetson Orin Nano, edge gateways, embedded Linux devices
RAM required: 8GB+

For many edge workloads, these hardware requirements are modest enough to leave room for additional services on the same device.

Gemma 2B also supports an 8K token context window. While that is sufficient for conversations, instructions, and small document sets, it quickly becomes a limitation when an agent must search across thousands of maintenance manuals, support articles, operational procedures, or historical records. A single industrial knowledge base can contain millions of words, far exceeding what the model can hold in context at once. The model is released under a commercially friendly license, which matters if you need to deploy it on-premises or at the edge.

The problem is that local inference only solves half of the architecture. Once the corpus grows beyond what fits in the context window, the agent needs a retrieval layer that can find the right information before generation begins.

The Retrieval Gap

Getting Gemma 2B running locally solves the generation problem. It does not solve the retrieval problem. An edge agent can only answer questions using information that exists in its prompt or context window. As soon as the knowledge base grows beyond what fits in memory, the agent needs a way to locate the right information before generating a response.

Without a local retrieval layer, three failure modes appear.

Context window overflow

An 8K token context window sounds large until the agent must work with a real-world document collection. A library of regulatory documentation can contain hundreds of thousands or even millions of words.

Without retrieval, developers often attempt to load as many documents as possible into the prompt. That approach quickly reaches the context limit. The model then truncates documents, loses important details, or generates answers from incomplete information. The result is lower accuracy and less reliable responses.
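A quick back-of-the-envelope calculation shows the size of the gap. The sketch below assumes an 8K window means 8,192 tokens and uses a rough heuristic of 0.75 English words per token. That ratio is an assumption for illustration only; real token counts depend on the tokenizer and the text.

```typescript
// Rough estimate only: WORDS_PER_TOKEN is a common heuristic, not a measured value.
const CONTEXT_TOKENS = 8192;
const WORDS_PER_TOKEN = 0.75;

function estimateTokens(words: number): number {
    return Math.ceil(words / WORDS_PER_TOKEN);
}

function windowsNeeded(words: number): number {
    return Math.ceil(estimateTokens(words) / CONTEXT_TOKENS);
}

const corpusWords = 1000000; // "millions of words" starts here
console.log(estimateTokens(corpusWords)); // 1333334
console.log(windowsNeeded(corpusWords));  // 163
```

Under these assumptions, a one-million-word corpus is roughly 163 times larger than the window, before accounting for the system prompt, the question, or the generated answer. Stuffing the prompt cannot close a gap of that size; selecting a handful of relevant chunks can.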

Cloud dependency

Many Retrieval-Augmented Generation (RAG) tutorials solve the context problem by connecting the model to a cloud-hosted vector database. That works when reliable internet connectivity can be assumed. It becomes a problem when the application runs in a highly regulated or air-gapped environment with no internet access.

When the retrieval layer depends on a cloud service, the agent depends on the network. If the connection fails, the retrieval fails. In many cases, the agent cannot answer questions because the information it needs never reaches the model.

Scale degradation with basic embedding search

Simple embedding search implementations work well with a few hundred documents. Many developers start with an in-memory vector index or a basic search capability bundled with a local inference framework. Performance is acceptable during testing because the dataset is small.

The situation changes when the corpus grows to tens of thousands of document chunks and the application begins serving multiple queries per second. Search latency increases, memory usage rises, and retrieval becomes the bottleneck in the system.

An edge-compliant vector store deployed on the device resolves all three challenges without adding cloud dependency.

The Full Stack

The goal is to keep both inference and retrieval on the same device. Gemma 2B handles generation while VectorAI DB handles retrieval. Together, they form a complete RAG stack that operates without cloud dependency.

In this architecture, the user never interacts with the language model directly. Every query first passes through the retrieval layer, which searches the local document corpus and returns the most relevant information. The application injects those retrieved documents into the prompt before sending it to Gemma 2B. The model then generates a response grounded in the retrieved context rather than relying solely on its training data.
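The injection step can be sketched as a small prompt-assembly function. The type RetrievedChunk, the function buildPrompt, the instruction wording, and the 6,000-character budget are all assumptions chosen for illustration. The field names content and source mirror the payload fields written at ingest time.

```typescript
// Hypothetical prompt builder; names, wording, and the budget are illustrative.
type RetrievedChunk = { content: string; source: string; score: number };

function buildPrompt(question: string, chunks: RetrievedChunk[], maxContextChars: number = 6000): string {
    const parts: string[] = [];
    let used = 0;
    // Highest-scoring chunks go in first; stop once the budget is exhausted.
    const ranked = [...chunks].sort((a, b) => b.score - a.score);
    for (const chunk of ranked) {
        const entry = `[${chunk.source}]\n${chunk.content}`;
        if (used + entry.length > maxContextChars) break;
        parts.push(entry);
        used += entry.length;
    }
    return [
        "Answer the question using only the context below.",
        "If the context does not contain the answer, say so.",
        "",
        "Context:",
        parts.join("\n\n"),
        "",
        `Question: ${question}`,
        "Answer:",
    ].join("\n");
}

console.log(
    buildPrompt("What hydraulic fluid does Unit 7 use?", [
        { content: "Hydraulic fluid: ISO 46 mineral oil. Tank capacity is 12 liters.", source: "corpus.txt", score: 0.91 },
        { content: "Scheduled maintenance is every 90 days.", source: "corpus.txt", score: 0.42 },
    ])
);
```

Capping the context by a character budget is a crude stand-in for counting tokens, but it keeps the prompt safely under the model's window while leaving room for the answer.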

The stack consists of two core components:

Gemma 2B

Gemma 2B serves as the generation layer. It receives the user query along with the retrieved context and produces the final response. Running through Ollama simplifies deployment on edge hardware by providing a lightweight local API for inference.

Responsibilities:

- Accept user prompts
- Process retrieved context
- Generate grounded responses
- Run entirely on-device

VectorAI DB

VectorAI DB serves as the retrieval layer. It stores document chunks and their metadata, then performs semantic search against the local corpus. When a user submits a query, the application converts that query into an embedding and sends it to VectorAI DB. The database returns the most semantically relevant document chunks, which become context for the language model.

Responsibilities:

- Store document embeddings
- Perform semantic similarity search
- Retrieve relevant context
- Operate without network connectivity

This architecture keeps every component inside the device boundary. Once the model, embeddings, and database are installed, the system can answer questions without relying on external APIs, cloud-hosted vector databases, or internet connectivity.

Figure 1: Local inference and retrieval stack architecture

Setting Up the Stack

This section builds a complete local retrieval and inference stack on an edge device using Gemma 2B for inference and VectorAI DB for retrieval.

Stage 1: Install Ollama and pull Gemma 2B

Install Ollama on your custom hardware device by following the installation instructions for your operating system.

After installation, pull the Gemma 2B model. Use the latest available tag for Gemma 2B in Ollama. Confirm the exact tag in the Ollama model registry before production use.

ollama pull gemma2:2b

You should see output similar to the following:

Figure 2: Ollama pull for Gemma 2B

Verify the installation by running a test prompt:

ollama run gemma2:2b "Explain what a vector database does in one sentence."

You should see a result similar to the following:

Figure 3: Verify Ollama Gemma 2B installation

Stage 2: Install Docker and run VectorAI DB

Install Docker by following the guide and confirm that the Docker service is running. Then, create a docker-compose.yml file with the following configuration:

services:
 vectorai:
   image: actian/vectorai:latest
   container_name: vectorai_db
   ports:
    - "6573:6573" #rest
    - "6574:6574" #grpc
   volumes:
     # vector data persists across restarts
    - ./data:/var/lib/actian-vectorai
   environment:
    - VECTORAI_LOG_LEVEL=info
    - ACTIAN_VECTORAI_ACCEPT_EULA=YES
   restart: unless-stopped

Start the VectorAI DB server:

docker-compose up -d

Stage 3: Connect with VectorAIClient and create a collection

Install uv, then initialize a project in the current directory:

uv init .

Install the dependencies by running the command:

uv add actian-vectorai-client requests

Create a file create_collection.py with the following contents:

import requests
from actian_vectorai import VectorAIClient, VectorParams, Distance

# ── Config ─────────────────────────────────────────────────────────────────────
VECTORAI_URL = "localhost:6574"
COLLECTION   = "docs"
OLLAMA_URL   = "http://localhost:11434/api/embeddings"
OLLAMA_MODEL = "gemma2:2b"
# ──────────────────────────────────────────────────────────────────────────────


def get_embedding_dim() -> int:
   """Probe Ollama once to get the embedding dimension for this model."""
   response = requests.post(
       OLLAMA_URL,
       json={"model": OLLAMA_MODEL, "prompt": "probe"},
       timeout=30,
   )
   response.raise_for_status()
   return len(response.json()["embedding"])


def main():
   print(f"Probing embedding dimension from Ollama ({OLLAMA_MODEL})...")
   dim = get_embedding_dim()
   print(f"  Embedding dimension: {dim}")
   with VectorAIClient(VECTORAI_URL) as client:
       info = client.health_check()
       print(f"\nConnected to {info['title']} v{info['version']}")


       existing = client.collections.list()
       if COLLECTION in existing:
           print(f"\nCollection '{COLLECTION}' already exists — nothing to do.")
           return
       client.collections.create(
           COLLECTION,
           vectors_config=VectorParams(size=dim, distance=Distance.Cosine),
       )
       # Confirm the collection was created using get_info()
       info = client.collections.get_info(COLLECTION)
       print(f"\nCollection created: '{COLLECTION}'")
       print(f"  Vector size : {dim}")
       print(f"  Distance    : Cosine")
       print(f"  Status      : {info.status}")
       print(f"  Points      : {info.points_count}")
       print(f"\nPayload fields written at ingest time:")
       print(f"  content     — chunk text")
       print(f"  source      — source filename")
       print(f"  chunk_index — position of this chunk within the document")
       print(f"  metadata    — dict with char_start and total_chunks")
if __name__ == "__main__":
   main()

create_collection.py does the following:

Detects embedding size automatically by sending a sample text to Ollama, avoiding hardcoded vector dimensions
Connects to VectorAI DB via gRPC and checks that the database server is running
Checks for an existing docs collection and exits if it already exists, making the script safe to rerun
Creates the docs collection with cosine distance when it does not exist, then prints its status and point count

Run the script:

uv run create_collection.py

You get the following output:

Figure 4: Run create_collection.py

Stage 4: Ingest a document corpus

With the collection created, the next step is to fill it. The ingestion script reads every .txt file in a local docs/ directory, splits each file into chunks, generates an embedding for each chunk using Ollama, and writes the result to VectorAI DB. No external API is called at any point in this process.
Create a docs/ directory and add a file named corpus.txt inside it with the following content:


Hydraulic Press Unit 7 — Maintenance Reference


Unit 7 is a 50-ton hydraulic press manufactured by Duratek in 2019.
Serial number: DT-7740-B. Located in Bay 3, Building 2.


Scheduled maintenance is every 90 days. Last service was completed on March 14, 2026 by technician R. Okafor.


Hydraulic fluid: ISO 46 mineral oil. Tank capacity is 12 liters.
Replace fluid every 180 days or if colour turns dark brown.


The main pump runs at 210 bar operating pressure. Maximum rated pressure is 250 bar.
If pressure drops below 180 bar during operation, inspect the pump seals.


Hydraulic coupling torque spec: 42 Nm ± 2 Nm. Use thread-lock grade 243 after torquing.


Known issue: the pressure relief valve on Unit 7 sticks occasionally at cold start.
Workaround is to run the press unloaded for 2 minutes before applying load.
Replacement valve part number: DT-PRV-114.


Emergency stop is the red panel on the left side of the frame.
Do not operate the press without the safety guard in place.

Create a file ingest.py with the following content:

import uuid
from pathlib import Path


import requests
from actian_vectorai import VectorAIClient, PointStruct
# ── Config ─────────────────────────────────────────────────────────────────────
VECTORAI_URL = "localhost:6574"
COLLECTION   = "docs"
OLLAMA_URL   = "http://localhost:11434/api/embeddings"
OLLAMA_MODEL = "gemma2:2b"
DOCS_DIR     = Path("./docs")
CHUNK_SIZE   = 400    # characters per chunk
# ──────────────────────────────────────────────────────────────────────────────
def embed(text: str) -> list[float]:
   response = requests.post(
       OLLAMA_URL,
       json={"model": OLLAMA_MODEL, "prompt": text},
       timeout=60,
   )
   response.raise_for_status()
   return response.json()["embedding"]
def chunk_text(text: str) -> list[str]:
   return [text[i : i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
def main():
   files = list(DOCS_DIR.glob("*.txt"))
   if not files:
       print(f"No .txt files found in {DOCS_DIR}/")
       print("Create the folder and drop some .txt files in it, then re-run.")
       return
   print(f"Found {len(files)} file(s) in {DOCS_DIR}/\n")
   with VectorAIClient(VECTORAI_URL) as client:
       # Confirm the collection is there before doing any work
       existing = client.collections.list()
       if COLLECTION not in existing:
           print(f"Collection '{COLLECTION}' not found.")
           print("Run 'uv run create_collection.py' first.")
           return
       total_chunks = 0
       for path in files:
           text = path.read_text(encoding="utf-8")
           chunks = chunk_text(text)
           points = []
           print(f"Ingesting {path.name} ({len(chunks)} chunks)...")
           for idx, chunk in enumerate(chunks):
               vector = embed(chunk)
               points.append(
                   PointStruct(
                       id=str(uuid.uuid4()),
                       vector=vector,
                       payload={
                           "content":     chunk,
                           "source":      path.name,
                           "chunk_index": idx,
                           "metadata": {
                               "char_start":   idx * CHUNK_SIZE,
                               "total_chunks": len(chunks),
                           },
                       },
                   )
               )
           client.points.upsert(COLLECTION, points)
           total_chunks += len(points)
           print(f"  ✓ {len(points)} chunks written")
       # Final count from the DB
       count = client.points.count(COLLECTION)
       print(f"\nDone. {total_chunks} chunks ingested this run.")
       print(f"Total points in '{COLLECTION}': {count}")
if __name__ == "__main__":
   main()

The ingest.py script is responsible for preparing documents for retrieval. It performs the following steps:

Loads documents from the configured source directory
Splits each document into 400-character chunks to create focused passages that can be retrieved accurately
Generates embeddings for each chunk by sending the text to Ollama's /api/embeddings endpoint
Builds a payload containing the chunk text and associated metadata: the source filename, the chunk index, the character offset where the chunk starts, and the total number of chunks in the file
Stores the vector and payload in the vector database, allowing the chunks to be searched later using semantic similarity

After ingestion is complete, every chunk is represented by both its embedding vector and its metadata. This enables the retrieval system to find the most relevant passages for a query and return the original text along with information about where it came from.
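The chunking and payload layout described above can be sketched in isolation, without Ollama or the database. This example also shows one possible variation: ingest.py uses uuid.uuid4(), so every run generates new IDs, while uuid.uuid5() derives the same ID from the same source name and chunk index. Whether an upsert with a matching ID overwrites the existing point in VectorAI DB is an assumption to verify against the client documentation. The function name chunk_records is illustrative.

```python
import uuid

CHUNK_SIZE = 400  # same value as ingest.py


def chunk_records(text: str, source: str, size: int = CHUNK_SIZE) -> list[dict]:
    """Split text into fixed-size chunks and attach ingest.py-style payload fields."""
    chunks = [text[i : i + size] for i in range(0, len(text), size)]
    records = []
    for idx, chunk in enumerate(chunks):
        records.append(
            {
                # uuid5 is deterministic: same source + index gives the same ID on every run.
                "id": str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#{idx}")),
                "content": chunk,
                "source": source,
                "chunk_index": idx,
                "metadata": {"char_start": idx * size, "total_chunks": len(chunks)},
            }
        )
    return records


if __name__ == "__main__":
    sample = "x" * 950  # stand-in for a 950-character document
    for rec in chunk_records(sample, "corpus.txt"):
        # Prints: 0 0 400, then 1 400 400, then 2 800 150
        print(rec["chunk_index"], rec["metadata"]["char_start"], len(rec["content"]))
```

Fixed-size character chunks can split a sentence in the middle, which is acceptable for a small reference file like corpus.txt but worth revisiting for longer documents.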

Run ingest.py:

uv run ingest.py

This returns the following results:

Figure 5: Run ingest.py

Building the Retrieval Loop

The retrieval loop is the core of the system. It connects three local components into one flow:

The user query input
VectorAI DB for semantic retrieval
Gemma 2B via Ollama for response generation

The full pipeline runs entirely on-device and turns a raw question into a grounded answer using retrieved context.

Create a file query.py with the following content:

import sys
import requests
from actian_vectorai import VectorAIClient


# ── Config ─────────────────────────────────────────────────────────────────────
VECTORAI_URL = "localhost:6574"
COLLECTION   = "docs"
OLLAMA_URL   = "http://localhost:11434"
OLLAMA_MODEL = "gemma2:2b"
TOP_K        = 3
# ──────────────────────────────────────────────────────────────────────────────




# Step 1: Accepts a user query
def main(query: str):
   print(f"\nQuery: {query}\n")


   # Step 2: Embed the query using the same model used at ingest time.
   # Using the same model is critical — the query vector must exist in the
   # same vector space as the stored chunk vectors for similarity to be meaningful.
   query_vector = embed(query)


   # Step 3: Search VectorAI DB for the top-k most similar chunks
   with VectorAIClient(VECTORAI_URL) as client:
       results = client.points.search(
           COLLECTION,
           vector=query_vector,
           limit=TOP_K,
       )


   if not results:
       print("No results returned. Make sure you have run ingest.py first.")
       return


   # Print retrieved chunks so retrieval is visible, not just implied
   print(f"── Retrieved {len(results)} chunk(s) ───────────────────────────")
   context_parts = []
   for i, result in enumerate(results, 1):
       source  = result.payload.get("source", "unknown")
       chunk   = result.payload.get("chunk_index", "?")
       content = result.payload.get("content", "")
       print(f"[{i}] {source} / chunk {chunk}  (score: {result.score:.3f})")
       print(f"    {content[:120]}...")
       context_parts.append(f"[{source}]\n{content}")
   print("────────────────────────────────────────────────────\n")


   # Step 4: Inject retrieved chunks into the prompt as context.
   # The model is explicitly told to use only the provided context.
   # This keeps the response grounded in the indexed documents and discourages
   # the model from falling back on its general training knowledge.
   context = "\n\n".join(context_parts)
   prompt = (
       "You are a maintenance assistant. "
       "Use only the context below to answer the question. "
       "If the answer is not in the context, say so.\n\n"
       f"Context:\n{context}\n\n"
       f"Question: {query}\n\n"
       "Answer:"
   )


   # Step 5: Call Gemma 2B via the Ollama API and return the response
   print("Generating answer with Gemma 2B...")
   response = requests.post(
       f"{OLLAMA_URL}/api/generate",
       json={"model": OLLAMA_MODEL, "prompt": prompt, "stream": False},
       timeout=120,
   )
   response.raise_for_status()
   answer = response.json()["response"].strip()
   print(f"\nAnswer:\n{answer}\n")




def embed(text: str) -> list[float]:
   response = requests.post(
       f"{OLLAMA_URL}/api/embeddings",
       json={"model": OLLAMA_MODEL, "prompt": text},
       timeout=60,
   )
   response.raise_for_status()
   return response.json()["embedding"]




if __name__ == "__main__":
   if len(sys.argv) < 2:
       print("Usage: uv run query.py \"your question here\"")
       sys.exit(1)
   main(" ".join(sys.argv[1:]))

This code does the following:

Accepts a user query: The script starts by taking a question from the command line.
Converts the query into an embedding: The embed() function sends the query to Ollama’s embeddings endpoint.
Searches VectorAI DB for relevant context: The embedding is used to query the local vector database.
Prints retrieved chunks: Before generating a response, the script prints the retrieved results.
Builds a grounded prompt for Gemma 2B: All retrieved chunks are merged into a single context block.
Sends the prompt to Gemma 2B via Ollama: The final step sends the structured prompt to the local model.
Returns the final answer: The response is extracted and printed.
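To make the search step less opaque, here is a minimal pure-Python sketch of cosine similarity, the distance metric the docs collection was created with. It ranks a few toy two-dimensional vectors by brute force. The names cosine_similarity, top_k, and TOY_VECTORS are illustrative, and a real vector database would typically use an approximate index rather than scanning every vector. How VectorAI DB scales the score it reports is not shown here.

```python
import math

# Toy stored vectors; real embeddings have hundreds or thousands of dimensions.
TOY_VECTORS = {
    "a": [2.0, 0.0],   # same direction as the query
    "b": [1.0, 1.0],   # 45 degrees away
    "c": [0.0, 3.0],   # orthogonal
    "d": [-1.0, 0.0],  # opposite direction
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 1.0 means identical direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query: list[float], stored: dict[str, list[float]], k: int = 3):
    """Brute-force search: score every stored vector and keep the k best."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in stored.items()]
    return sorted(scored, reverse=True)[:k]


if __name__ == "__main__":
    for score, name in top_k([1.0, 0.0], TOY_VECTORS, k=3):
        print(f"{name}: {score:.3f}")
```

Because cosine similarity ignores vector length, "a" scores 1.0 against the query even though it is twice as long. This is also why the query must be embedded with the same model as the stored chunks.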

Test the end-to-end flow by running the command:

uv run query.py "what is the torque spec for the hydraulic coupling?"

You get the result:

Figure 6: Building the retrieval loop

From the output, you see that the model is no longer answering from its internal training data alone. Instead, it is explicitly grounded in the retrieved chunks printed from VectorAI DB before generation. Each response is tied to specific documents, which means you can trace every answer back to the source material.

Running Offline

This is the final validation of the entire stack. At this point, both Gemma 2B and VectorAI DB are already installed, the corpus has been indexed, and the retrieval loop is working on a live connection. The next step is to prove that nothing in the runtime path depends on the internet.

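Disable your network interface by running the command for your operating system. Interface names such as en0 and eth0 vary by machine:

networksetup -setairportpower en0 off  # macOS (Wi-Fi on en0)
sudo ip link set eth0 down  # Linux (interface eth0)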

Run the command:

uv run query.py "what is the serial number"

You get the following result:

Figure 7: Verify offline execution

Wrapping Up

This system demonstrates that edge AI is no longer limited to isolated model inference. Gemma 2B handles local text generation efficiently on devices like the Jetson Orin Nano, while VectorAI DB provides a persistent semantic memory layer that runs on the same hardware. Together, they form a complete retrieval-augmented generation stack that operates without cloud services during runtime.

After initial setup, the stack continues to function without network access. The model does not require external APIs, and the vector database does not depend on cloud infrastructure. This makes the architecture suitable for industrial environments, robotics systems, and field devices where connectivity is unreliable or restricted.

Best Vector Databases for AI Agents in 2026

Praise James — Thu, 09 Jul 2026 15:51:54 +0000

Evaluating the best vector databases for AI agents in 2026 presents a different problem than it did a few years ago. Most vector databases are adequate in a Retrieval-Augmented Generation (RAG) demo. The decision comes down to fit: where the agent runs, what it needs to do, and how much write load the architecture generates.

Two distinctions matter more than most comparison articles acknowledge. First, RAG retrieval and agent memory are different workloads. RAG systems primarily read from a vector store, while agent memory systems continuously write new observations back into it. Second, deployment constraints often eliminate options before performance becomes relevant. A cloud-managed database may perform well in benchmarks but fail a compliance review or data residency requirement.

This article covers eight databases: Pinecone, Milvus, Qdrant, Weaviate, Chroma, pgvector, LanceDB, and Actian VectorAI DB. It opens with four evaluation criteria, works through the strengths and limitations of each database, maps each to either RAG retrieval or agent memory, and closes with a decision table.

What Is a Vector Database for AI Agents?

A vector database stores embeddings, high-dimensional numerical representations of text, images, audio, and other unstructured data, and retrieves them through semantic similarity rather than exact keyword matching. To perform vector similarity search efficiently, vector databases use Approximate Nearest Neighbor (ANN) algorithms such as Hierarchical Navigable Small World (HNSW) and Inverted File Index (IVF).

Traditional relational databases match records by exact value. An AI agent searching for documents related to "equipment failure in cold conditions" will not retrieve a record titled "low-temperature motor fault" unless those exact words appear. A vector database returns semantically similar results regardless of surface wording. That retrieval quality gap is why agents need a dedicated vector store.

AI agents use vector databases for two distinct purposes. The first is RAG, where the agent retrieves relevant documents before generating a response. The second is agent memory, where the agent continuously reads and writes information about previous interactions, observations, and task outcomes.

A vector database does not replace Postgres or another transactional database. Most production architectures store structured records, user metadata, and transactional data in a relational database, while using an AI-native vector database for semantic search, retrieval, and long-term memory operations.

How to Evaluate a Vector Database for AI Agents in 2026

Query speed under load

The fastest vector database benchmark is often the least useful metric for production agents. Static Queries Per Second (QPS) measurements assume a fixed index with no concurrent writes. Real agents continuously ingest documents, update memories, and create embeddings. Streaming ingestion performance at your target recall threshold is often a better production signal than peak benchmark numbers. The VectorDBBench leaderboard, maintained by Zilliz, updates the benchmark suite over time. Because Zilliz also develops Milvus, treat the leaderboard as directional guidance rather than a neutral ranking.
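The "target recall threshold" mentioned above can be made concrete with a small sketch. Recall@k compares the IDs returned by an approximate search with the IDs an exact (brute-force) search would return for the same query. The function name and the sample IDs below are illustrative and do not come from any benchmark.

```python
def recall_at_k(exact_ids: list[str], approx_ids: list[str], k: int) -> float:
    """Fraction of the true top-k neighbours that the approximate search returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = truth & set(approx_ids[:k])
    return len(found) / len(truth)


if __name__ == "__main__":
    exact = ["d1", "d2", "d3", "d4", "d5"]   # ground truth from an exact scan
    approx = ["d1", "d3", "d2", "d9", "d5"]  # what an ANN index returned
    print(recall_at_k(exact, approx, k=5))   # 0.8 (d4 was missed)
```

A QPS figure is only comparable between systems when both are measured at the same recall, since an index can usually trade recall for speed.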

Write pattern support

RAG retrieval is read-heavy. Agent memory is read-write, so every task completion can add write load. VectorDBBench data shows how large this gap can be: Zilliz Cloud drops from 7,385 QPS static to 1,860 QPS with 1,000 rows per second of ingestion, a 75% reduction. Self-hosted Milvus 16c64g drops from 2,747 to 156 QPS under the same ingestion rate. Pinecone p2.x8 drops from 1,131 to 369 QPS under 500 rows-per-second ingestion. Static QPS and streaming ingestion QPS are not the same ranking. A database can look fast in one and fall behind in the other.
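As a quick check of the arithmetic, the relative drop for each of the three systems can be computed from the figures quoted above. The QPS numbers are the ones in the paragraph. The function name qps_reduction is illustrative.

```python
# (system, static QPS, QPS under ingestion, ingestion rows/second) as quoted in the text.
BENCHMARKS = [
    ("Zilliz Cloud", 7385, 1860, 1000),
    ("Milvus 16c64g (self-hosted)", 2747, 156, 1000),
    ("Pinecone p2.x8", 1131, 369, 500),
]


def qps_reduction(static_qps: float, streaming_qps: float) -> float:
    """Percentage of query throughput lost when ingestion runs concurrently."""
    return (static_qps - streaming_qps) / static_qps * 100


if __name__ == "__main__":
    for name, static_qps, streaming_qps, rows in BENCHMARKS:
        drop = qps_reduction(static_qps, streaming_qps)
        print(f"{name}: {drop:.1f}% lower at {rows} rows/s")
    # Zilliz Cloud: 74.8% lower at 1000 rows/s
    # Milvus 16c64g (self-hosted): 94.3% lower at 1000 rows/s
    # Pinecone p2.x8: 67.4% lower at 500 rows/s
```

Note that the Pinecone figure was measured at half the ingestion rate of the other two, so the three percentages are not directly comparable.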

Deployment model

A cloud-managed vector database is not an option for every team. Regulated industries (healthcare under HIPAA, finance under PCI DSS or SOC 2), government systems, and industrial edge deployments often have strict data residency requirements that rule out cloud-managed options entirely. Evaluate deployment in three separate categories: cloud-only, self-hosted, and edge or air-gapped. Those deployment constraints determine which databases are relevant for a given team.

Operational overhead

Infrastructure complexity influences long-term costs as much as software pricing. Running pgvector on an existing Postgres instance introduces little additional operational burden. Operating Milvus at scale often requires Kubernetes expertise and dedicated infrastructure support. Managed services reduce operational work but transfer that cost into subscription pricing.

The 8 Best Vector Databases for AI Agents in 2026

The decision depends on three factors: where the agent runs, what it needs to do (retrieval, memory, or both), and how much operational complexity the team can support. Different modern vector databases optimize for different combinations of those requirements.

1. VectorAI DB

VectorAI DB targets environments where deployment constraints matter as much as retrieval performance. Rather than focusing solely on cloud-native workloads, it is designed to run across cloud, on-premises, edge, and air-gapped environments using the same deployment model.
Deployment: Cloud, self-hosted, on-premises, edge, air-gapped environments
Best for:

Regulated industries with strict data residency requirements
Industrial and manufacturing edge AI deployments
Teams that need a consistent deployment model across cloud and disconnected environments

Limitations:

Closed-source engine
Smaller ecosystem than more established vector databases
SDK support is limited to Python and JavaScript at launch

As a newer entrant, VectorAI DB is still expanding its ecosystem, integrations, and platform capabilities. The product roadmap is developing rapidly, with new features and improvements released regularly.

The primary differentiator is deployment flexibility. The same Docker image can run in cloud environments, on-premises infrastructure, NVIDIA Jetson devices, and fully air-gapped networks without requiring different operational models for each environment.

For teams building AI agents in regulated environments, industrial settings, or locations where cloud connectivity is unavailable or restricted, deployment requirements may eliminate many alternatives before benchmark comparisons begin.

2. pgvector

pgvector extends Postgres with vector search capabilities and remains the default choice for teams already invested in the Postgres ecosystem.
Deployment: Self-hosted Postgres
Best for:

Agent memory workloads that are tightly coupled with relational data
Teams operating below 10 million vectors

Limitations:

Large vector workloads increase storage overhead and memory pressure
Scaling requires Postgres expertise rather than dedicated vector infrastructure

The strongest argument for pgvector is architectural simplicity. Instead of introducing another database, teams can keep vector search alongside transactional relational data. This approach works particularly well for agents that need frequent joins between memories, user records, and application state.

An independent benchmark from Steezr found pgvector 0.8.0 HNSW performance comparable to Qdrant 1.13 at a 1M vector scale on AWS c6i.2xlarge hardware. That result suggests some teams can stay on pgvector longer before moving to a dedicated vector database.

3. Qdrant

Qdrant balances retrieval performance, deployment flexibility, and support for memory-heavy workloads.
Deployment: Cloud, self-hosted, Kubernetes
Best for:

Agent memory systems with frequent updates
Hybrid search workloads that combine semantic and keyword retrieval

Limitations:

Vector dimensions remain fixed after collection creation
Payload indexes require deliberate planning and configuration

Qdrant's strength lies in versatility. Teams can begin with self-hosted deployments and later adopt managed offerings without changing databases. Its support for metadata filtering, hybrid search capabilities, and memory-oriented workloads has made it a common choice for production AI systems.

A vendor-funded three-week production evaluation by Particula measured Qdrant at 22ms p95 latency on a 10-million-vector workload, compared with 45ms for Pinecone. Engineers should treat the result as directional rather than definitive, as the comparison was not conducted independently.

For teams that need both operational flexibility and solid performance, Qdrant is a good fit.

4. Pinecone

Pinecone targets teams that prefer managed infrastructure and are willing to pay for operational simplicity.
Deployment: Cloud-managed
Best for:

Organizations with limited DevOps capacity
Teams that prioritize managed operations over infrastructure control

Limitations:

No self-hosting option
Costs can increase significantly under large-scale ingestion and retrieval workloads

Pinecone offloads infrastructure management, scaling, and maintenance to the provider. It fits teams that want managed operations and are comfortable trading off self-hosting control.

5. Weaviate

Weaviate is positioned for agentic retrieval workflows and coding-agent integrations.
Deployment: Cloud and self-hosted
Best for:

Agentic retrieval workflows
Coding agents and multi-step reasoning systems

Limitations:

More operational complexity than managed platforms
Kubernetes expertise is often required for larger deployments

Weaviate differentiates itself through agent-native capabilities rather than raw vector search performance. Its Query Agent reached general availability in 2025, and Agent Skills launched in 2026 with integrations targeting Claude Code, Cursor, GitHub Copilot, VS Code, and Gemini CLI.

These features position Weaviate as more than a storage layer. It increasingly functions as infrastructure for agent orchestration and retrieval workflows.

Teams building sophisticated agent ecosystems may find Weaviate's agent tooling more compelling than benchmark advantages measured in milliseconds.

6. Milvus

Milvus targets teams that need to search billions of vectors across distributed infrastructure. It is a strong fit when scale matters more than operational simplicity.
Deployment: Self-hosted on Kubernetes, or managed through Zilliz Cloud
Best for:

Multi-billion vector workloads
Multimodal agents that combine text, image, audio, and video embeddings

Limitations:

Significant operational complexity
Requires immediate patching of critical vulnerabilities such as CVE-2026-26190 on affected versions

Milvus 2.6 introduced RaBitQ quantization, for which the project claims a 72% memory reduction and 4× faster queries at approximately 95% recall. These figures come from Milvus and should be treated as vendor-published benchmarks.

7. Chroma

Chroma is a local-first option for quickly getting a prototype agent running. Most developers can move from installation to retrieval in minutes.

Deployment: Local-first, single-node
Best for:

Prototyping
Personal AI assistants
Local development environments

Limitations:

Limited production scalability
Weak concurrency handling at larger vector counts

The challenge appears when prototypes become products. A commonly cited practitioner report described Chroma performing well during development but becoming unusable at roughly two million vectors with 12 concurrent users. The same team said latency dropped from about 800 ms to 28 ms after migrating to Qdrant under the same workload.

For developers looking to validate an idea before committing to infrastructure, Chroma is a practical starting point. For production systems with sustained concurrency or larger vector counts, it is usually a poor fit.

8. LanceDB

LanceDB focuses on embedded AI applications. Rather than running a separate database service, developers can package retrieval directly into local applications.
Deployment: Embedded database, local-first
Best for:

Desktop AI applications
IDE extensions
Offline assistants
Multimodal agent workloads

Limitations:

No built-in multi-tenant access controls
Shorter production history than established alternatives

One notable adoption example comes from Continue, the open-source coding assistant. Continue selected LanceDB because it offered an embedded TypeScript library with fast on-disk retrieval and SQL-style filtering.

That design makes LanceDB particularly attractive when the agent runs on the user's machine rather than inside a centralized cloud platform.

RAG Retrieval vs. Agent Memory: Which Database Fits Which Use Case?

The distinction matters because most evaluation articles treat RAG and agent memory as the same workload, and they are not.

RAG retrieval is read-heavy. Documents are ingested once and queried repeatedly. Freshness matters, but writes happen relatively infrequently.
Agent memory is read-write at runtime. Agents retrieve previous memories, generate new observations, and immediately write those observations back into storage. Freshness becomes a core requirement.

RAG-optimized databases

These databases excel when retrieval dominates writes:

Pinecone
Milvus
pgvector (especially under 10M vectors)

They perform best when agents search document collections rather than constantly updating memory.

Memory-optimized databases

These databases handle continuous writes more effectively:

Qdrant
Weaviate
LanceDB
Actian VectorAI DB

They fit agents that update memory after every interaction, workflow, or task completion.

Both (with caveats)

pgvector works well below about five million vectors when transactional consistency matters.
Chroma works well during development and prototyping, but is not a long-term production memory layer.

One additional consideration is memory freshness. GitHub's engineering team reported using a 28-day auto-expiry policy and repository-scoped memories in GitHub Copilot. That example shows why retention rules matter: stale memories create more risk than missing memories. Any production memory architecture should define retention and expiration policies before optimizing retrieval performance.
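A retention rule of this kind can be sketched in a few lines. The example below filters memories by scope and age using a 28-day window. It is not GitHub's implementation. The record shape (repo, text, created_at) and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=28)  # retention window mentioned in the text


def is_expired(created_at: datetime, now: datetime, retention: timedelta = RETENTION) -> bool:
    """A memory expires once it is older than the retention window."""
    return now - created_at > retention


def live_memories(memories: list[dict], repo: str, now: datetime) -> list[dict]:
    """Keep only memories scoped to `repo` that have not expired."""
    return [
        m for m in memories
        if m["repo"] == repo and not is_expired(m["created_at"], now)
    ]


if __name__ == "__main__":
    now = datetime(2026, 7, 9, tzinfo=timezone.utc)
    memories = [
        {"repo": "org/api", "text": "uses pnpm", "created_at": now - timedelta(days=3)},
        {"repo": "org/api", "text": "old CI note", "created_at": now - timedelta(days=40)},
        {"repo": "org/web", "text": "uses yarn", "created_at": now - timedelta(days=1)},
    ]
    for m in live_memories(memories, "org/api", now):
        print(m["text"])  # prints: uses pnpm
```

In a vector database, the same rule would usually be applied as a metadata filter at query time or as a periodic delete job, depending on what the database supports.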

Decision Table

This table summarizes the fastest way to narrow your shortlist by workload, deployment, and write pattern.

Database	Deployment	Best for	Hybrid search	Open source / Free	Skip if
pgvector	Self-hosted (Postgres)	Agent memory with relational data	Partial	Yes	You need sustained billion-scale growth, or can't tolerate Postgres operational overhead
Qdrant	Cloud, Self-hosted	Balanced retrieval and memory workloads	Yes	Yes	You need a fully air-gapped managed experience
Pinecone	Cloud	Zero-ops vector search	Yes	No	Data residency or write-heavy workloads matter
Weaviate	Cloud, Self-hosted	Agent-native retrieval pipelines	Yes	Yes	Your team lacks Kubernetes expertise
Milvus	Self-hosted, Managed	Billion-scale search	Yes	Yes	Your team cannot operate a distributed infrastructure
Chroma	Local	Rapid prototyping	Partial	Yes	You expect production concurrency at scale
LanceDB	Embedded	Local and on-device agents	Partial	Yes	You need multi-tenant enterprise controls
VectorAI DB	Cloud, on-premises, edge, air-gapped	Regulated and disconnected deployments	Yes	No	Your deployment is cloud-native, and air-gap support is irrelevant

Wrapping Up

The best vector database for AI agents in 2026 depends more on workload characteristics than features.

If your team already runs Postgres and expects fewer than 10 million vectors, start with pgvector. If you want managed infrastructure, evaluate Pinecone first and compare it with self-hosted Qdrant before committing to its pricing model.

If your agent runs in a regulated environment, on industrial hardware, or without guaranteed cloud connectivity, VectorAI DB is the only mainstream 2026 database built specifically for that constraint.

Should You Use RAG or Fine-Tune Your LLM?

Offisong Emmanuel — Tue, 19 May 2026 19:43:02 +0000

The debate over retrieval-augmented generation (RAG) vs. fine-tuning appears simple at first glance. RAG pulls in external data at inference time. Fine-tuning modifies model weights during training. In production systems, that distinction is insufficient.

According to the Menlo Ventures 2024 State of Generative AI in the Enterprise report, 51 percent of enterprise AI deployments use RAG in production. Only nine percent rely primarily on fine-tuning. Yet research such as the RAFT study from UC Berkeley shows that hybrid systems combining retrieval and fine-tuning outperform either approach alone across benchmarks.

If hybrid systems can produce better results, why does industry adoption favor only RAG? In this article, we’ll compare RAG, fine-tuning, and a hybrid architecture to understand the trade-offs and where each approach excels.

TL;DR

RAG: Best for frequently changing knowledge and moderate traffic; easy to update without retraining.
Fine-tuning: Best for stable domains and high-volume or low-latency tasks; improves task-specific accuracy and formatting.
Hybrid/RAFT: Combines up-to-date retrieval with optimized model behavior for the highest accuracy.
Key trade-off: Choice depends on query volume, how often knowledge changes, and team expertise.

Why the Standard RAG vs. Fine-Tuning Comparison Fails

RAG is a method where the model dynamically pulls in external data at inference time. Each query retrieves relevant documents or knowledge chunks, which the system appends to the prompt, allowing the model to produce answers grounded in current information.

Fine-tuning is the process of modifying a model’s weights during training using labeled data. Instead of relying on external retrieval, the model internalizes patterns directly, producing consistent outputs without querying external sources.

While these definitions are technically correct, most standard comparisons miss the factors that actually drive decisions in production. In real-world systems, the choice between RAG and fine-tuning depends on variables like scale, query volume, and how often your data changes.

Missing variable 1: Context expansion at scale

In many production RAG systems, every request appends hundreds of tokens of retrieved context. That added context changes how the model distributes attention across the prompt, even though the weights themselves stay fixed at inference time.

Large retrieved contexts compete for attention with the prompt and instructions, which can dilute signal quality. Small retrieval errors or loosely relevant chunks can introduce formatting drift or shift reasoning in subtle ways. The system’s output becomes tightly coupled to retrieval quality.

Fine-tuning works differently. Instead of injecting large volumes of text at inference time, it embeds patterns and constraints directly into the model during training. The distinction affects how the system behaves under real workloads.

Missing variable 2: Retraining frequency

The common advice says “use RAG if knowledge changes frequently” and “use fine-tuning if behavior is stable.” But how frequently is “frequently”?

If your knowledge base changes daily, retraining pipelines may introduce operational friction. Evaluation cycles, dataset versioning, and deployment validation all add delay.

Data preparation also matters. If your organization lacks structured, versioned, and clean datasets, the hidden cost of preparing training data can exceed compute costs.

The Cost Math of RAG vs. Fine-Tuning

Surface-level comparisons of RAG and fine-tuning often ignore the cost curves that determine long-term viability. In production systems, financial estimates are central to architectural decisions. To evaluate RAG vs. fine-tuning realistically, we need to examine three cost layers:

Token cost and context expansion
Retrieval infrastructure cost
Training infrastructure cost

The cost structure of RAG

RAG systems introduce a recurring operational cost because each query retrieves external information and injects it into the model’s prompt. That additional context is billed on every request.

Context expansion
Production RAG systems often append around 500 tokens of retrieved context to each query. The provider bills those tokens on every request.

Using pricing similar to GPT-5.2 at $1.75 per million input tokens, the incremental cost per query becomes:

Cost per query
500 tokens × $1.75/1,000,000 = $0.000875 per query

At a small scale, this cost appears negligible. However, because it applies to every query, the total overhead grows linearly with traffic.

At different traffic levels:

Monthly queries	Context cost
10 million	$8,750
50 million	$43,750
100 million	$87,500

This is context overhead alone. It does not include output tokens or base prompt tokens. At a sustained scale, what appears flexible and inexpensive becomes a significant recurring expense.
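The table above follows directly from the per-query arithmetic. Here is a minimal Python sketch that reproduces it, assuming the same 500-token context and $1.75 per million input tokens used in this section. The function name `monthly_context_cost` is illustrative and not part of any library.

```python
def monthly_context_cost(monthly_queries, context_tokens=500, usd_per_million_tokens=1.75):
    """Return the monthly cost in USD of the retrieved context alone.

    Output tokens and base prompt tokens are deliberately excluded.
    """
    cost_per_query = context_tokens * usd_per_million_tokens / 1_000_000
    return monthly_queries * cost_per_query


# Traffic levels taken from the table above.
for queries in (10_000_000, 50_000_000, 100_000_000):
    print(f"{queries:>11,} queries -> ${monthly_context_cost(queries):,.0f}/month")
```

Changing `context_tokens` or the price shows how sensitive the monthly bill is to chunk size and model choice.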

Vector database and retrieval cost
Token cost is only one component of RAG costs. RAG also relies on a vector database for semantic search. The system must store, index, and query embeddings efficiently.

Pinecone’s public pricing lists:

Storage at approximately $0.33 per gigabyte per month
Read units at approximately $16 per million
Write units at approximately $4 per million

For example, consider a system handling 50 million queries per month, where each query performs a single vector search (assuming a 1,024-dimension vector). That would result in 50 million read operations monthly. If the system also writes approximately six million records per month, the combined read and write activity would bring the total estimated monthly cost to around $1,532.

At 200 million queries per month, the total expense rises to $9,000 per month.

Two RAG systems serving identical traffic can therefore have materially different cost structures depending on how the vector database is designed and optimized.

Infrastructure cost
RAG systems require storage and compute infrastructure to generate embeddings, store and index vectors, execute retrieval queries, and run inference. Each of these stages consumes compute resources, typically provisioned through cloud servers that must scale with traffic.

For real-time or high-throughput applications, additional capacity is required to maintain low latency and system reliability. Replication, autoscaling, monitoring, and failover mechanisms all add operational complexity. These infrastructure layers are essential for production-grade RAG, but they expand the total cost footprint beyond token usage alone.

The cost structure of fine-tuning

Fine-tuning introduces a different economic model from RAG systems. Instead of paying incremental costs on every request for external context, you invest upfront to modify the model’s internal behavior.

That upfront investment can be broken into four primary cost categories: data, training compute, experimentation, and operational maintenance.

Data preparation costs
High-quality labeled data is the foundation of effective fine-tuning. This includes collecting domain-specific examples, cleaning inconsistencies, formatting inputs and outputs correctly, and validating annotation quality.

In many organizations, data preparation consumes 20 to 40 percent of the total fine-tuning budget. Poorly curated data directly degrades model performance, leading to additional retraining cycles and wasted compute.

Training compute costs
OpenAI lists fine-tuning at roughly $25 per million training tokens for GPT-4.1. A run using 20 million tokens would cost about $500 in direct training fees, with larger datasets or multiple runs increasing this total.

For self-hosted training, costs depend on model size and hardware. High-performance GPUs such as A100 clusters can cost thousands of dollars per training epoch. Because fine-tuning is rarely a single-pass process, multiple epochs, evaluations, and retraining cycles are common, which further increases the overall cost.

Experimentation and validation costs
Fine-tuning is an iterative process that requires experimentation with hyperparameters, evaluation against baseline models, and testing across edge cases. These workflows require engineering time, infrastructure, and structured evaluation frameworks. Unlike prompt engineering, fine-tuning introduces a full ML lifecycle, adding ongoing operational overhead.

This creates a non-linear cost curve. Fine-tuning concentrates cost at the beginning, while marginal cost per request remains relatively stable as traffic grows.

Whether that trade-off is advantageous depends on three variables: query volume, knowledge stability, and retraining frequency. Without modeling those explicitly, cost comparisons between RAG and fine-tuning remain incomplete.
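One way to model that trade-off explicitly is to compare an amortized upfront investment with the per-query context cost from the RAG section. The sketch below is a simplification under stated assumptions: it uses the $25 per million training tokens and the 500-token, $1.75 per million figures quoted earlier, it ignores retrieval infrastructure and any difference in inference pricing for fine-tuned models, and the $120,000 upfront budget in the usage example is purely hypothetical. The function names are illustrative.

```python
def training_fee(training_tokens, usd_per_million_tokens=25.0, runs=1):
    """Direct training fee in USD for a hosted fine-tuning job."""
    return training_tokens / 1_000_000 * usd_per_million_tokens * runs


def breakeven_monthly_queries(upfront_usd, amortization_months,
                              context_tokens=500, usd_per_million_tokens=1.75):
    """Monthly query volume at which RAG context spend equals the
    amortized upfront fine-tuning cost (data, training, experimentation)."""
    context_cost_per_query = context_tokens * usd_per_million_tokens / 1_000_000
    monthly_budget = upfront_usd / amortization_months
    return monthly_budget / context_cost_per_query


# The 20-million-token run from the text.
print(f"Training fee: ${training_fee(20_000_000):,.0f}")

# Hypothetical: $120,000 total upfront cost amortized over 12 months.
print(f"Break-even: {breakeven_monthly_queries(120_000, 12):,.0f} queries/month")
```

Below the break-even volume, the recurring context cost is smaller than the amortized upfront spend. Above it, the balance tips the other way, provided knowledge is stable enough that retraining does not reset the clock.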

When RAG Wins

Despite its scaling trade-offs, RAG remains the dominant production choice for a reason. In certain operating conditions, it is structurally more flexible, faster to iterate, and operationally safer than fine-tuning. RAG is suitable in the following scenarios:

1. When knowledge changes frequently

If your domain knowledge changes weekly or daily, fine-tuning becomes operationally expensive. Dataset updates, retraining, evaluation, and deployment introduce delays that can stretch from hours to weeks depending on governance requirements.

Teams frequently underestimate the operational overhead of keeping a fine-tuned model synchronized with a rapidly evolving knowledge base. In these environments, RAG shifts the problem from model retraining to data indexing.

2. When you have extensive unstructured data but limited labeled data

Many organizations possess terabytes of internal documents but lack high-quality supervised datasets. Building labeled training corpora requires annotation workflows, domain experts, and quality validation pipelines. In practice, this often becomes the most expensive part of fine-tuning projects.

RAG bypasses this constraint by allowing models to operate directly on existing document corpora without constructing large labeled datasets.

3. When governance and data residency requirements are strict

Once sensitive information is embedded in model weights, deletion and auditing become difficult. Removing a specific record from a fine-tuned model often requires retraining or maintaining complex dataset lineage.

RAG architectures avoid this issue by keeping sensitive information in external storage systems where standard governance controls already exist.

4. When query volume is moderate

As shown in the earlier cost analysis, context expansion overhead grows with query volume, reaching approximately $43,750 per month at 50 million queries. At moderate traffic, RAG’s per-request costs are typically lower than the amortized expenses of fine-tuning, including training and ongoing maintenance. This makes RAG an attractive choice for organizations that want high-quality outputs without front-loading infrastructure and compute investments.

Use cases

Large-scale examples illustrate RAG’s effectiveness at this volume. Notion’s Q&A assistant is effectively a large-scale RAG system over workspace data. The difficult engineering problem was not retrieval itself, but enforcing identity and access controls during retrieval. When a user queries the assistant, the system must ensure the model only retrieves documents that the user is permitted to see.

LinkedIn leveraged RAG and knowledge graphs to preserve the structure of their support cases. This system retrieved relevant subgraphs rather than isolated text chunks, improving retrieval accuracy by 77.6% and reducing median issue resolution time by 28.6%.

For systems at this scale, RAG combines cost efficiency with flexibility, allowing teams to update knowledge sources rapidly without retraining models, while still delivering high-quality results.

When Fine-Tuning Wins

Fine-tuning becomes structurally advantageous under different conditions. These conditions typically involve scale, stability, and behavioral precision.

1. When query volume exceeds 100 million per month

At very high traffic levels (100M+ queries per month), RAG’s per-request context overhead becomes significant. Each query adds hundreds of retrieved tokens that the model processes, causing costs to scale linearly with traffic. Large context windows can also increase latency, reduce throughput, and complicate infrastructure reliability.

If domain knowledge is relatively stable, fine-tuning can become more efficient. By embedding knowledge directly into the model, organizations avoid repeated retrieval and token costs, leading to more predictable per-query expenses, better consistency, and simpler operations at scale.

2. When output structure is critical

Fine-tuned models often excel in tasks that require strict adherence to structure or formal constraints. For example, Cosine, an AI software engineering assistant that autonomously resolves bugs and builds features, achieved a state-of-the-art score of 43.8% on the SWE-bench Verified benchmark with a fine-tuned model.

Similarly, Distyl secured the top position on the BIRD-SQL benchmark, widely regarded as the premier evaluation for text-to-SQL performance. Its fine-tuned GPT-4o model reached an execution accuracy of 71.83% on the leaderboard.

In applications where errors propagate downstream, into financial calculations, automated APIs, or compliance documents, behavioral consistency is mandatory. In these contexts, fine-tuning provides the reliability needed to minimize risk and maintain trust in automated outputs.

3. When latency requirements are strict

RAG adds multiple steps to the inference pipeline that increase response time. Each query must go through embedding generation, vector search, and context injection before reaching the model.

Fine-tuned models skip retrieval entirely. All necessary knowledge and reasoning patterns are internalized, allowing the model to generate outputs immediately. In applications where sub-100ms responses are required, such as live recommendation engines or high-frequency trading systems, removing the retrieval pipeline eliminates a major bottleneck.

4. When deep domain reasoning matters more than freshness

A domain-specific agriculture benchmark study found that fine-tuning improved model accuracy from 75% to 81%, while hybrid systems (fine-tuning + retrieval) reached 86%. Because the dataset focused on specialized agricultural knowledge and reasoning tasks, the improvement primarily reflects stronger domain reasoning, not simply better access to external information.

In domains such as legal analysis or medical decision support, reasoning patterns can be complex. Fine-tuning enables models to internalize domain expertise rather than rely solely on retrieved context.

The Hybrid Approach

While RAG and fine-tuning each have clear advantages, research shows that combining them effectively can produce superior results, but only when done correctly. The RAFT (Retrieval Augmented Fine-Tuning) approach, developed by UC Berkeley, Microsoft, and Meta Research, demonstrates how to do this in practice.

RAFT trains a model to operate in an “open-book” setting. It learns to process retrieved context, identify relevant passages, ignore distractors, and cite evidence accurately. Without this explicit training, simply layering RAG on top of a fine-tuned model often fails. For instance, a model fine-tuned on medical reasoning may be handed irrelevant journal articles by the retriever. If it hasn’t learned to filter and prioritize context, the result is hallucinations or incorrect recommendations.

RAFT addresses this with a structured 80/20 training split. 80% of training examples include oracle documents that the model should use, and 20% do not, forcing the model to learn when to trust retrieved data and when to rely on internalized knowledge. This operational detail is crucial for engineers evaluating whether their team can implement a hybrid approach successfully. It is not enough to just combine RAG and fine-tuning. The model must be trained to reason over the retrieved context.
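The 80/20 split described above can be sketched as a small dataset-building step. This is an illustration of the idea, not the RAFT authors’ reference code: the dictionary keys (`question`, `answer`, `oracle_doc`), the function name, and the number of distractors are all assumptions made for the example.

```python
import random


def build_raft_examples(qa_items, distractor_pool, oracle_fraction=0.8,
                        num_distractors=3, seed=0):
    """Assemble training examples where only a fraction include the oracle document.

    qa_items: list of dicts with "question", "answer", and "oracle_doc" keys
              (hypothetical schema).
    distractor_pool: documents that do not answer the questions.
    """
    rng = random.Random(seed)

    # Pick exactly oracle_fraction of the examples to receive the oracle document.
    order = list(range(len(qa_items)))
    rng.shuffle(order)
    with_oracle = set(order[:round(len(qa_items) * oracle_fraction)])

    examples = []
    for index, item in enumerate(qa_items):
        context = rng.sample(distractor_pool, num_distractors)
        has_oracle = index in with_oracle
        if has_oracle:
            context.append(item["oracle_doc"])
        rng.shuffle(context)  # avoid leaking the oracle's position
        examples.append({
            "question": item["question"],
            "context": context,
            "answer": item["answer"],
            "has_oracle": has_oracle,
        })
    return examples
```

The examples without an oracle document are the ones that push the model to fall back on internalized knowledge instead of trusting whatever was retrieved.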

A common and practical pattern is “fine-tune for format, RAG for knowledge.” Fine-tuning shapes the model’s internal behavior, enforcing domain-specific reasoning, output structure, and style. RAG provides dynamic access to external information that changes frequently or is too large to store in the model weights. In healthcare, for example, fine-tuning ensures the model understands medical terminology, follows proper diagnostic reasoning, and formats outputs according to clinical documentation standards. RAG supplements this by retrieving the latest research, newly published treatment guidelines, or patient-specific records, keeping recommendations current without retraining the entire model.

Similarly, Harvey AI fine-tuned on 10 billion case law tokens, but still leverages RAG to handle current cases and updates. This pattern is widely used in other domains too. Legal systems fine-tune for statutory reasoning and citation style, then layer RAG to retrieve the most current case law; finance models fine-tune for portfolio analysis rules, then layer RAG for market updates and regulatory changes. It’s a way to balance the stability of learned behavior with the adaptability of retrieval.

A Quantified Decision Framework for RAG vs. Fine-Tuning

The question is no longer “Which approach is better?” It is “Under what conditions does each approach make economic and operational sense?”

Instead of defaulting to architectural preference, evaluate three measurable variables:

Knowledge change frequency
Monthly query volume
Infrastructure capability and governance constraints

When those variables are quantified, the decision becomes far clearer.

Step 1: Measure knowledge volatility

Knowledge change frequency is often the fastest way to eliminate one option. If your domain knowledge changes weekly or daily, RAG is structurally favored. Updating an index is far simpler than retraining a fine-tuned model. The separation between model weights and external data enables real-time data retrieval without redeployment cycles.

If knowledge remains stable for months at a time, fine-tuning becomes economically viable. Retraining frequency drops, and training cost can be amortized over longer intervals. In these environments, embedding domain-specific knowledge directly into model parameters may reduce long-term inference overhead.

As a practical threshold:

Knowledge changes more often than monthly → prioritize RAG
Knowledge stable for multiple months → evaluate fine-tuning

Step 2: Calculate context expansion cost

The next variable is query volume. Large-scale RAG systems append hundreds of tokens to every query, and this context overhead scales linearly with traffic.

Quantitative triggers

Monthly queries	Guidance
<10M	RAG is cheaper
10–50M	Evaluate fine-tuning vs. RAG
50–100M	Fine-tuning or hybrid
>100M	Fine-tuning or hybrid

Step 3: Assess infrastructure maturity

Even if economics favor one approach, infrastructure capability may dictate feasibility.

RAG requires:

Strong data engineering
Reliable data pipelines
Efficient vector database architecture
Observability and monitoring

Fine-tuning requires:

High quality labeled data
Machine learning expertise
Compute resource allocation
Evaluation discipline

When teams ignore their actual capabilities, architecture decisions collapse under scale. Many production failures blamed on “model quality” are really symptoms of immature infrastructure.

Decision matrix

The following matrix translates the analysis into practical guidance.

Scenario	Monthly queries	Knowledge update frequency	Recommendation	Rationale
Domain knowledge updates weekly, moderate traffic	10–50M	Weekly/Daily	RAG	Immediate indexing and low recurring cost
High-scale traffic, knowledge stable	50–100M+	<1 update/month	Fine-tuning	Avoids recurring context injection, reduces latency
Structured output or code generation required	Any	Any	Fine-tuning	Embeds domain-specific rules and formatting internally
Specialized reasoning + frequent updates	10–50M	Weekly/Daily	Hybrid	Combines internalized reasoning with dynamic knowledge
Multi-domain systems with diverse knowledge update cycles	10–100M	Mixed	Hybrid	Fine-tuning stabilizes core domains, RAG handles rapidly changing sources

This matrix makes it easier to decide whether to use RAG, fine-tune your LLMs, or take the hybrid approach.
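The thresholds from Steps 1 and 2 can also be encoded as a quick first-pass check. The sketch below is only a starting point: the function names are illustrative, the handling of exact boundary values (such as 50 million queries) is an assumption, and treating one update per month or fewer as "stable" simplifies the guidance above. Infrastructure maturity from Step 3 still needs human judgment.

```python
def volatility_guidance(knowledge_updates_per_month):
    """Step 1: map knowledge change frequency to a first-pass recommendation."""
    if knowledge_updates_per_month > 1:
        return "prioritize RAG"
    return "evaluate fine-tuning"


def volume_guidance(monthly_queries):
    """Step 2: map monthly query volume to the guidance in the triggers table."""
    if monthly_queries < 10_000_000:
        return "RAG is cheaper"
    if monthly_queries <= 50_000_000:
        return "Evaluate fine-tuning vs. RAG"
    return "Fine-tuning or hybrid"


# Example: weekly knowledge updates at 30 million queries per month.
print(volatility_guidance(4), "|", volume_guidance(30_000_000))
```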

Final Thoughts

The debate between RAG and fine-tuning is often framed as a binary choice, but the more useful question is “If hybrid systems demonstrably outperform either approach alone, why does industry adoption still overwhelmingly favor RAG?”

Hybrid requires both ML and data engineering capabilities simultaneously, a combination few organizations have. RAG remains the practical default, offering agility and transparency with less upfront complexity.

The key takeaway is to choose the architecture that matches your knowledge volatility, query scale, and team capability. For teams exploring enterprise-scale retrieval systems, platforms like Actian VectorAI DB provide purpose-built vector database capabilities designed for performance and scalability.

Join the Discord community and learn how Actian fits into your AI strategy.

What 37signals’ Cloud Repatriation Taught Us About AI Infrastructure

Offisong Emmanuel — Tue, 19 May 2026 19:42:51 +0000

In 2023, 37signals announced that it had completely left the public cloud and followed up by publicly documenting its cloud repatriation process, providing one of the clearest real-world examples of on-premises economics at scale. By reversing its cloud migration and shifting workloads to private cloud infrastructure, the company drastically reduced its annual cloud infrastructure spend by almost $2 million.

The transparency of the numbers made the case compelling. In 2022, 37signals spent $3,201,564 on cloud services, which is about $266,797 per month. These detailed cost breakdowns, along with published hardware investment and payback timelines, provided a rare look into the financial mechanics of large-scale cloud repatriation.

For commodity SaaS workloads, the math was clear. But it raises an important question for the next generation of compute-heavy systems: does the same economic logic extend to AI infrastructure? This article examines that question.

TL;DR

37signals spent ~$3.2M/year on AWS in 2022.
After repatriating workloads to their own infrastructure, cloud spend dropped to ~$1.3M by 2024.
The company invested roughly $700K–$800K in servers and paid them off in under 18 months.
The entire infrastructure is still run by the same 10-person team. No additional operational overhead.
The key takeaway is that at a sustained scale, owning infrastructure can be dramatically cheaper than renting it.

The 37signals Playbook: What Hansson Actually Documented

In 2022, 37signals spent $3.2 million annually on AWS. After leaving the cloud in 2023, their annual costs had dropped to approximately $1.3 million by 2024, a reduction of almost $2 million per year.

The transition required a hardware investment of roughly $700,000–$800,000 in Dell servers. The company fully recouped the investment in under 18 months, achieving complete payback in the second half of 2023 as their AWS reserved instance contracts expired. From that point forward, the savings flowed directly to operating margin rather than offsetting capital expense.

For the follow-on storage migration, 37signals projected $1.5 million in hardware costs and roughly $200,000 per year in operating expenses. This shift replaces a recurring $1.3 million annual cloud storage bill with a one-time capital outlay plus a fraction of the ongoing operating cost. Over five years, 37signals revised its total savings projection upward from $7 million to more than $10 million.

37signals cloud exit financials by year

To illustrate the financial impact of 37signals’ cloud exit over time, the table below breaks down annual cloud spending, on-premises hardware investments, and operating costs, highlighting the resulting net savings and key operational notes.

Year	Cloud spend	Hardware investment	Operating costs	Notes
2022 Baseline	~$3.2M	$0	Included in cloud spend	Full cloud dependency
2023 Migration	~$2M	~$700–800K	Moderate	Hardware fully recouped in under 18 months
2024+ Post-repatriation	~$1.3M	~$1.5M (storage)	~$200K/year	~$1.9M annual savings
2025+	Minimal AWS dependency	~$1.5M (Pure Storage, 18PB)	~$200K/year	$10M+ projected 5-year savings

Notably, the migration did not require the team to expand operations. A 10-person infrastructure team handled the entire repatriation without adding new staff. Addressing a common concern about operational overhead, 37signals co-founder David Heinemeier Hansson noted:

“We've been out for just over a year now, and the team managing everything is still the same. There were no hidden dragons of additional workload associated with the exit that required us to balloon the team, as some spectators speculated when we announced it. All the answers in our Big Cloud Exit FAQ continue to hold.”

This directly challenges the common assumption that moving away from public cloud environments inevitably requires a significantly larger infrastructure team.

Execution followed a “criticality ladder” strategy where the team migrated lower-risk services first and more critical ones later. The team moved the HEY email system in stages, starting with caching, then database, and finally, job services. To minimize risk, they colocated infrastructure approximately one millisecond from the AWS region to preserve rollback capability during the cloud repatriation process. After stabilizing the system, they replaced managed services with substantial recurring costs, including RDS and managed Elasticsearch, which together exceeded $500,000 annually.

What makes 37signals’ case study consequential is the publicly documented cost efficiency. For organizations questioning long-term cloud adoption assumptions, particularly with regard to storage costs and managed services, the 37signals documentation provides a rare baseline for comparison.

Why AI Infrastructure Economics Are Even More Extreme

The lessons from 37signals’ cloud repatriation take on a sharper edge when applied to AI infrastructure. Higher GPU costs, predictable inference workloads, massive embedding storage, and stricter data regulations create financial and operational pressures that amplify the advantages of on-premises or hybrid cloud solutions that allow you to move workloads where they make the most sense. Below, we break down the key drivers.

AI infrastructure cost comparison

To evaluate the cost implications of different AI infrastructure approaches, the table below compares upfront setup costs, monthly operating expenses at varying workloads, and expected break-even timelines for cloud, on-premises, and hybrid configurations.

Setup	Setup cost	Monthly cost	Break-even
Cloud GPU rental (AWS/Azure)	$0	$2,900–3,500 (8h/day × $4–8/hour × 15 days)	N/A
Cloud inference APIs (Lambda Labs)	$0	$1,800–2,500 (8h/day × $3.67/hour × 15 days)	N/A
Self-hosted GPU (8×H100 server)	$200K–400K	$1,500–2,000 (power + maintenance)	<12 months
Hybrid (Cloud training + On-Prem)	$200K–400K	Training only, inference minimal	<12 months

Note: For cloud GPU rental, we estimate monthly cost assuming eight hours/day per GPU. The cost scales linearly with utilization; it is not directly per-query.

1. GPU cloud markups are high

AI workloads depend heavily on GPUs, and cloud providers charge far steeper premiums for GPU capacity than for typical CPU compute. On-demand AWS P5 instances with H100 GPUs cost roughly $4–8 per GPU-hour, while comparable Azure H100 instances are about $3.67 per hour. By contrast, spot markets and alternative providers such as Lambda Labs offer similar GPU capacity for $1–2 per hour, or $1.85–2.49 per hour with reserved commitments.

The result is a 4–8× markup for on-demand hyperscaler GPU capacity relative to the spot or specialized GPU cloud market. In other words, the premium cloud providers charge for high-end AI compute is significantly larger than typical CPU cloud markups. For organizations running sustained inference workloads, this pricing gap quickly becomes the dominant cost driver in AI infrastructure.

2. Predictable inference makes GPU ownership economical

High GPU pricing becomes especially significant because AI inference workloads are unusually predictable. Purchasing H100 GPUs outright can be cost-efficient. A single GPU costs roughly $25K–40K, while a complete 8×H100 server ranges from $200K–400K. Lenovo’s analysis shows that six or more hours of sustained daily usage reaches payback against AWS within the first year.

This break-even arrives quickly because of that predictability. Unlike SaaS traffic, which fluctuates throughout the day, production AI systems such as recommendation engines tend to process steady volumes of requests.

Predictability changes the economics. When infrastructure runs at consistent utilization, owned hardware can be amortized efficiently across the workload. Paying cloud premiums for burst capacity that teams rarely use becomes unnecessary.

For organizations running inference continuously, the hardware investment is often recouped in under 12 months. From that point forward, the savings follow the same pattern documented by 37signals: fixed infrastructure replacing an ongoing rental bill.
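A simple payback calculation makes this concrete. The sketch below assumes an 8-GPU server, the per-GPU-hour cloud rates and monthly power and maintenance figures quoted in this section, and a flat 30-day month. The function name and the midpoint inputs in the usage example are illustrative choices, and the result is very sensitive to the assumed hourly rate and daily utilization.

```python
def payback_months(hardware_usd, cloud_usd_per_gpu_hour, gpus=8,
                   hours_per_day=24, days_per_month=30, monthly_opex_usd=2_000):
    """Months until owned hardware pays for itself versus renting cloud GPUs."""
    cloud_monthly = cloud_usd_per_gpu_hour * gpus * hours_per_day * days_per_month
    monthly_savings = cloud_monthly - monthly_opex_usd
    if monthly_savings <= 0:
        # Owning never pays back if running costs exceed the avoided rental bill.
        return float("inf")
    return hardware_usd / monthly_savings


# Midpoint assumptions: $300K server, $6 per GPU-hour, continuous inference.
print(f"{payback_months(300_000, 6.0):.1f} months")

# The same server used only six hours a day takes far longer to pay back.
print(f"{payback_months(300_000, 6.0, hours_per_day=6):.1f} months")
```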

3. Embedding storage requirements are massive

Even if GPU compute were optimized, AI systems introduce another rapidly growing cost layer: embedding storage. Vector databases store high-dimensional embeddings used for search, retrieval, and recommendation. As datasets scale into millions or billions of records, storage requirements expand quickly.

For instance, 10 million vectors at 1,536 dimensions require at least 58GB of raw storage, often 200–300GB with indexes and metadata. Managed vector databases like Pinecone charge about $0.33/GB/month, meaning 500GB could cost $165/month before any queries. Self-hosted solutions like PostgreSQL with pgvector dramatically reduce cloud spending while keeping sensitive data under direct control. Over time, these storage requirements compound infrastructure costs alongside GPU compute, further reinforcing the economic advantages of self-hosted or hybrid architectures.
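The raw storage figure can be checked with a few lines of arithmetic. This sketch assumes each vector component is a 4-byte float32, which is a common default but not stated in the text, and it excludes index and metadata overhead. Under that assumption the raw size works out to about 61.4GB in decimal units, or about 57.2GiB, which is in the same range as the figure quoted above. The function names are illustrative.

```python
def raw_vector_storage_bytes(num_vectors, dimensions, bytes_per_component=4):
    """Raw embedding storage in bytes, excluding indexes and metadata."""
    return num_vectors * dimensions * bytes_per_component


def monthly_storage_cost(storage_gb, usd_per_gb_month=0.33):
    """Monthly storage cost in USD at a flat per-gigabyte rate."""
    return storage_gb * usd_per_gb_month


raw = raw_vector_storage_bytes(10_000_000, 1536)
print(f"Raw storage: {raw / 1e9:.2f} GB ({raw / 2**30:.2f} GiB)")
print(f"500 GB at $0.33/GB/month: ${monthly_storage_cost(500):.2f}")
```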

4. Data sovereignty and compliance favor on-premises deployment

Data residency and regulatory compliance are priorities in the AI space as the industry becomes increasingly regulated. Notably, the EU AI Act introduced strict rules for AI systems, including prohibitions on certain AI use cases that took effect in February 2025. On-premises deployment simplifies compliance.

For financial organizations navigating complex regulatory environments, solutions like Actian’s Data Intelligence platform help enforce data governance and streamline compliance workflows.

The Cloud Infrastructure Case Studies 37signals Validated

Although the financial transparency of 37signals’ cloud exit was unusual, the repatriation itself was not an isolated occurrence. It is part of a growing trend among organizations trying to regain cost control and optimize their cloud infrastructure. Several high-profile case studies illustrate the scale and economics of moving workloads back from public clouds to owned or hybrid infrastructure.

Dropbox

Dropbox pioneered enterprise cloud repatriation as early as 2015, completing the migration between 2016 and 2018. The company moved roughly 90% of customer data, reportedly over 500 petabytes, off AWS to three owned colocation facilities. The infrastructure investment totaled $53 million, yet Dropbox reported $74.6 million in operational savings over two years per its 2018 S‑1 filing. A small portion of workloads, primarily European customers and specialized services, remain in AWS. Internally, the initiative was known as “Magic Pocket,” and it exemplifies how a well-executed hybrid cloud approach can deliver substantial savings while aligning with long-term business objectives.

Ahrefs

Ahrefs, the SEO tools company, relied on a Singapore colocation setup with 850 servers. Its reported savings from avoiding public cloud were approximately $400 million over 2.5 years. The actual infrastructure cost was $39.5 million for 850 servers (~$1,500/server/month), versus an estimated $447.7 million if hosted entirely on AWS (~$17,557/server/month equivalent). As Ahrefs put it: “We wouldn’t be profitable, or even exist, if our products were 100% on AWS.” While critics argue that Ahrefs inflated AWS estimates, the directional savings were undeniable, illustrating that cloud repatriation challenges can be surmounted at scale with careful planning.

GEICO

GEICO spent a decade migrating to multiple cloud providers, only for its costs to climb and exceed projections by 2.5×, reaching $300 million by 2022 across eight providers. In response, GEICO began moving workloads to a private cloud using OpenStack and Kubernetes, targeting over 50% repatriation by 2029. Early results show a 50% reduction in compute costs and a 60% reduction in per-gigabyte storage costs compared with public cloud services, demonstrating how a hybrid cloud architecture can deliver efficiency, compliance, and alignment with long-term business objectives.

Akamai

Akamai was on the path to spending over $100 million on third-party cloud services before migrating compute workloads to its own global edge network of 350,000+ servers. The migration delivered savings of roughly $100 million per year, a testament to the economics of repatriation when existing infrastructure and scale align.

What these cases share is the same economic pattern documented by 37signals. Predictable, high-volume workloads eventually become cheaper to run on owned infrastructure than on hyperscaler clouds.

These examples reflect a broader shift occurring across enterprise infrastructure strategies. Barclays’ Chief Information Officer (CIO) surveys show cloud repatriation trending upward in recent years, with sentiment peaking in the second half of 2024, when 86% of CIOs were planning repatriation.

However, this statistic does not mean that companies are abandoning public cloud environments completely. According to IDC, only 8–9% of companies favor full repatriation with most preferring a hybrid approach that combines public and private clouds. Hybrid cloud infrastructure allows organizations to optimize workload placement by strategically allocating sensitive data and mission-critical applications on-premises while leveraging public cloud services for less critical workloads. As such, it has become increasingly important for teams exploring similar transitions to understand the nuances of hybrid deployments and their associated risks.

Cloud Repatriation Statistics

Cloud repatriation is accelerating at the same time as public cloud spending keeps climbing. IDC projects global public cloud spend will reach $1.6 trillion in 2028, doubling from their 2024 prediction. Yet as mentioned earlier, 86% of CIOs are planning some form of repatriation according to Barclays. Both trends can be true because this is not a cloud exodus so much as a rebalancing. Enterprises are leaning towards a hybrid cloud model.

AI is likely to accelerate that shift. AI workloads account for less than 10% of total cloud compute today but Gartner projects that this figure will approach 50% by 2029. Hyperscalers are responding with enormous capital investment. There is an estimated $600 billion in infrastructure spend in 2026, roughly three-quarters of it tied to AI. The assumption is clear: Enterprises will rent that GPU capacity. But the 37signals math suggests that once AI workloads move from experimentation to steady production, ownership economics begin to dominate.

Cost pressure is already driving behavior. Flexera reports that 27% of cloud resources are wasted or underutilized, and 21% of workloads have already been repatriated. The primary reason cited is cost exceeding projections, followed by performance concerns. With GPUs, the margin for inefficiency is thinner. There are fewer optimization levers, higher hourly rates, and faster budget burn.

Regulation adds another layer. The EU AI Act, DORA for financial services, China’s PIPL, and India’s DPDP are tightening data governance requirements. Mimecast reports that 87% of organizations now factor data sovereignty into vendor decisions. For AI systems, sovereignty extends beyond data location to model provenance, audit trails, and compliance documentation. On-premises deployment does not eliminate regulatory complexity, but it centralizes control, and for many enterprises that simplicity is becoming strategically attractive.

The Counter-Arguments and When Cloud Providers Win

Not all observers agree that cloud repatriation is the best path for every organization. Public cloud environments still deliver value in certain circumstances, but the arguments for them are often weaker when applied to AI workloads.

When cloud wins vs. when on-premises wins

Component	Cloud advantage	On-prem advantage
Workload predictability	Handles spiky or unpredictable workloads	Predictable workloads cheaper to self-host
Team expertise	Requires minimal in-house infrastructure skill	Strong IT teams can optimize and reduce vendor reliance
Scale and growth	Rapid scaling and global expansion	Predictable growth enables cost-efficient hardware
Regulatory requirements	Managed compliance, geo-redundancy	Direct control simplifies regulatory alignment
Cost and margins	Pay-as-you-go reduces upfront spend	Long-term savings from owned infrastructure
Service quality	Cloud SLAs ensure availability and performance	Dedicated resources guarantee predictable uptime

Cloud “wrong usage” argument

Jeremy Daly, a serverless advocate, argues that “37signals was using the cloud wrong.” By treating cloud environments as virtual colocation, running VMs and Kubernetes, they were paying cloud premiums without capturing the value of serverless, managed services, and instant scaling. As Daly notes, “In the cloud, we should be renting services, not servers.”

For SaaS workloads with highly variable or spiky traffic, this argument is compelling. Serverless infrastructure allows organizations to scale instantly and pay only for the compute they actually use.

However, AI inference workloads often behave very differently. Production inference systems, such as recommendation models, copilots, and document processing pipelines, tend to run at steady, sustained utilization rather than unpredictable bursts. In these cases, the economic advantage of elastic cloud scaling diminishes. The premium paid for burst capacity still exists, but the workload itself rarely needs that burst capacity.

Daly’s argument therefore holds for variable SaaS workloads, where elasticity is critical. For sustained AI inference workloads running at high utilization, paying a premium for burst capacity that is rarely used can make dedicated infrastructure or hybrid deployments more cost-efficient.

Full cost critique

Some critics also question the financial assumptions behind 37signals’ approach. They point out that hardware and software normally account for only about 20% of IT costs, with the remainder covering electricity, cooling, physical security, racking, Uninterruptible Power Supply (UPS), and opportunity costs. David Heinemeier Hansson’s analysis did not include all of these overheads because 37signals used colocation facilities rather than fully owned data centers. Even so, considering 37signals’ figures, it is reasonable to conclude that renting colocation space can still be far cheaper than relying on cloud services.

Competence vs. growth framework

Forrest Brazeal’s IT competence versus growth aspirations framework provides additional nuance. He places 37signals in the High Competence/Low Growth quadrant, ideal for self-hosting. “Not every company has the competence (high) or growth aspirations (low) of 37signals,” he observes. Startups with uncertain or spiky workloads benefit from cloud flexibility, but AI companies running production inference at scale often combine high operational competence with steady growth. Such profiles (steady growth & high competence) are well suited to repatriation.

Applying the Playbook to AI Infrastructure

If 37signals provided the economic blueprint, AI infrastructure makes the economics more concrete. The decision is no longer abstract. It becomes a structured assessment grounded in workload behavior, utilization, and regulatory exposure.

A practical four-question framework helps translate the 37signals logic into AI terms:

Is your inference workload predictable and sustained? Unlike SaaS traffic spikes, most production AI systems such as recommendation engines, RAG pipelines, or fraud detection models process steady volumes with gradual growth.
Are projected GPU utilization rates above 60–70%? At this threshold, owned hardware amortization typically undercuts public cloud GPU pricing within the first year.
Are you processing more than 10–50 million queries per month? At this scale, per-token and per-query pricing from cloud APIs compound rapidly.
Do you face data sovereignty or strict compliance requirements? For financial services, healthcare, or government workloads, regulatory mandates can tilt the decision toward controlled environments.

If the answer is “yes” to three or four of these, the repatriation economics tend to favor on-premises deployment for production inference.
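The four questions can be turned into a simple checklist. The sketch below is illustrative only: the class and function names are hypothetical, and the thresholds use the lower bounds of the ranges quoted above (60% utilization, 10 million queries per month), which is an assumption rather than a rule from the source.

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    """Answers to the four questions; field names are illustrative."""
    predictable_and_sustained: bool
    gpu_utilization: float      # projected fraction, e.g. 0.72 for 72%
    queries_per_month: int
    strict_compliance: bool


def repatriation_signals(profile: WorkloadProfile,
                         min_utilization: float = 0.60,
                         min_queries: int = 10_000_000) -> int:
    """Count how many of the four questions are answered 'yes'."""
    return sum([
        profile.predictable_and_sustained,
        profile.gpu_utilization >= min_utilization,
        profile.queries_per_month >= min_queries,
        profile.strict_compliance,
    ])


def favors_on_premises(profile: WorkloadProfile) -> bool:
    """Three or four 'yes' answers tilt the economics toward on-premises."""
    return repatriation_signals(profile) >= 3


profile = WorkloadProfile(True, 0.72, 25_000_000, False)
print(repatriation_signals(profile), favors_on_premises(profile))  # 3 True
```

A score is only a starting point for discussion. Real decisions also depend on staffing, contract terms, and hardware lead times that this checklist does not model.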

Decision matrix

Workload stage	Recommended environment	Rationale
Model training	Public cloud	Compute-intensive; cloud GPUs handle burst workloads cost-effectively
Experimentation and prototyping	Public cloud	Flexible, fast provisioning for early-stage iteration
Production inference	On-premises / Hybrid	Steady workloads; owned hardware cheaper at 60–70%+ GPU utilization
Vector storage (embeddings)	On-premises	Reduces recurring managed-service costs and ensures data control

The hybrid AI pattern

In practice, most AI organizations adopt a hybrid model rather than an all-or-nothing shift. Training remains in the cloud. Inference moves closer to owned infrastructure.

Lenovo documented that training Llama 3.1 at hyperscale (39.3 million GPU hours) in the cloud would exceed $483 million. That type of elastic, short-term scale is exactly where public cloud excels. Inference is different. Once a model is trained, serving it for three to five years becomes steady, predictable work. That is where amortized hardware economics has the upper hand.

This split architecture also simplifies data migration risk. Instead of relocating entire AI pipelines at once, organizations can migrate production inference workloads gradually while leaving experimentation and early-stage training in cloud environments. A controlled, phased migration process reduces operational disruption while ensuring seamless integration between cloud-based training and on-premises serving layers.

Self-hosted inference economics

The economics of self-hosted inference depend heavily on utilization and token volume. According to enterprise deployment benchmarks, a 7B-parameter model running on an H100 GPU at roughly 70% utilization costs about $10,000 per year in spot nodes or hardware amortization. Power costs about $300 annually, bringing the total costs to about $10,300.

Public LLM APIs, by contrast, typically charge per million tokens, with enterprise pricing in 2025 ranging from $0.25–$15 per million input tokens and $1.25–$75 per million output tokens depending on model tier and provider.

At low usage levels, APIs remain the more economical option because infrastructure sits idle. However, the economics change as workloads scale. Industry analyses suggest that self-hosted deployment begins to break even at roughly two million tokens per day, after which the fixed cost of owned infrastructure is amortized across a large inference volume.

At high volumes, self-hosted inference can reduce costs by up to 78%. Artefact’s analysis found break-even around 8,000 conversations per day. Below that threshold, managed cloud APIs remain more economical. Above it, ownership compounds savings. The pattern mirrors 37signals: predictable workload plus high utilization equals rapid payback.
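The break-even logic can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: it uses the roughly $10,300 annual self-hosted figure quoted above, treats API pricing as a single blended price per million tokens, and ignores staffing and capacity limits. The function names are hypothetical.

```python
# Annual self-hosted cost from the figures above: hardware or spot nodes plus power.
SELF_HOSTED_ANNUAL_USD = 10_000 + 300


def api_cost_per_year(tokens_per_day: float, usd_per_million_tokens: float) -> float:
    """Annual API bill for a constant daily token volume."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens * 365


def break_even_tokens_per_day(usd_per_million_tokens: float,
                              annual_fixed_usd: float = SELF_HOSTED_ANNUAL_USD) -> float:
    """Daily token volume at which the API bill equals the fixed self-hosted cost."""
    return annual_fixed_usd / 365 / usd_per_million_tokens * 1_000_000


# Blended prices chosen from within the 2025 ranges quoted above (assumed values).
for price in (1.25, 15.0, 75.0):
    volume = break_even_tokens_per_day(price)
    print(f"${price:.2f} per million tokens: break-even near {volume:,.0f} tokens/day")
```

At a blended $15 per million tokens, the break-even lands at about 1.9 million tokens per day, which is in line with the "roughly two million" figure cited above. Cheaper model tiers push the break-even volume much higher.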

Vector databases

Instacart documented migrating from Elasticsearch plus FAISS to PostgreSQL with pgvector, achieving 80% cost savings and a 10× reduction in write amplification. Timescale’s pgvectorscale benchmarks show approximately 75% lower costs than managed vector services like Pinecone at comparable performance.

For RAG systems handling millions of queries monthly, self-hosted vector infrastructure produces savings that resemble the 37signals S3 case: large recurring storage bills replaced by amortized hardware and open-source tooling.

Data sovereignty as a structural driver

Grand View Research projects that the sovereign cloud market will reach USD 648.87 billion by 2033. Also, according to Gartner, around 60% of financial firms outside the United States are expected to adopt sovereign or on-premises deployments by 2028.

Frameworks such as the EU AI Act, China’s PIPL, and India’s DPDP mandate data localization and traceability. For organizations processing sensitive training datasets or proprietary inference logs, on-premises deployment inherently satisfies residency requirements because data never leaves jurisdictional boundaries.

The Bottom Line

37signals showed that teams can measure, model, and defend cloud repatriation decisions with hard numbers. With AI infrastructure, the economics can be even more pronounced. If cloud repatriation saved roughly $10 million for Basecamp, an equivalent AI company running production inference at comparable scale could save multiples of that amount, given the much higher cost of GPU compute and embedding infrastructure.

For organizations choosing to run AI workloads in controlled environments, platforms like Actian VectorAI DB provide a purpose-built vector database designed for high-volume vector search and AI inference workloads. It can be deployed on-premises or in the cloud, allowing organizations to place vector infrastructure where it best fits their operational and economic requirements.

Join the community and learn more about Actian.

How to Build a HIPAA Compliant AI Ecosystem Without the Cloud

Offisong Emmanuel — Tue, 19 May 2026 19:42:41 +0000

Healthcare cannot rely on cloud RAG because patient data leaves your network and your system logs, stores, and exposes it outside your control. You sign a Business Associate Agreement (BAA), connect your pipeline to a managed vector database, and assume compliance is complete. That assumption is wrong. The BAA covers the provider’s infrastructure. It does not cover what your application sends, logs, or exposes during retrieval and generation.

You remain responsible for every path your system sends Protected Health Information (PHI) through. A clinician query can leak sensitive data through logs. A system prompt can include patient context that your system stores outside your boundary. Weak access control allows retrieval results to expose records across departments. These risks exist in your application layer, not in the cloud provider’s scope.

American regulators now target this gap. In 2026, they flagged attack patterns like membership inference, where an adversary probes an AI system to confirm whether a patient’s data exists in the index. Cloud-hosted pipelines increase this risk because queries and embeddings move across external infrastructure. Audit requirements tighten further when logs live on third-party systems.

In this tutorial, you will build a clinical knowledge assistant that runs entirely on hospital infrastructure. It performs semantic search over clinical data, enforces role-based access at query time, and generates answers with clear citations. Every query stays inside your network, every access is logged locally, and no external API calls are required.

Why BAA Is Not Enough

A BAA protects the cloud provider’s infrastructure, not how your system handles PHI during queries, retrieval, and generation. You remain responsible for every place PHI appears, moves, or gets stored inside your pipeline. There are multiple failure modes that make your system non-compliant, even when you sign a BAA.

Shared responsibility gap

The BAA stops at the infrastructure boundary. Your system controls what enters a prompt, what gets logged, and what leaves your network. If a clinician query includes PHI and your application logs it to an external service, you are responsible. If your retrieval step returns records across departments without strict filters, you have created an internal data breach. These failures happen in your code, not in the cloud provider’s scope.

For example, a physician searches “Show me similar cases to John Doe with early stage lung cancer.” Your application logs the full query to a cloud logging service for debugging. That log now contains PHI outside your network. The cloud provider did not leak it. Your application sent it.
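One way to avoid this kind of leak is to keep free-text queries out of any log that may leave the network and record only non-identifying metadata. The sketch below is an assumption, not part of the tutorial's code: the logger name, function names, and key are hypothetical, and the keyed fingerprint is only there so engineers can correlate repeated queries without seeing their text.

```python
import hashlib
import hmac
import logging

# Hypothetical logger for debug output that may be shipped off-host.
debug_log = logging.getLogger("rag.debug")


def query_fingerprint(query: str, key: bytes) -> str:
    """Keyed digest of the query; the key must stay on local infrastructure."""
    return hmac.new(key, query.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def log_query_event(user_id: str, query: str, key: bytes) -> dict:
    """Log metadata about a query, never the query text itself."""
    event = {
        "user_id": user_id,
        "query_chars": len(query),
        "query_fp": query_fingerprint(query, key),
    }
    debug_log.info("query received: %s", event)
    return event


query = "Show me similar cases to John Doe with early stage lung cancer"
event = log_query_event("clinician_42", query, key=b"keep-this-key-on-premises")
print(event)
```

The full query text still belongs in the local audit log described later in this tutorial. The point of the sketch is only that debugging output and audit records should be treated as separate streams with different destinations.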

Audit log ownership

HIPAA requires a complete audit trail for every access to PHI. When your vector database runs on third-party infrastructure, your system stores query logs and retrieval traces outside your control. You cannot guarantee completeness, retention, or isolation. Your security team cannot verify access patterns without relying on another provider’s system. That breaks your ability to enforce and prove compliance.

For example, your compliance team asks for a report of every access to oncology patient records in the past 30 days. Your vector database provider stores query logs on their platform with limited retention. Some logs are missing and others lack user-level metadata. You cannot produce a complete audit trail.

Membership inference exposure

Attackers can probe your system with targeted queries to determine whether a specific patient’s data exists in your index. This attack class is now a regulatory concern. Cloud-hosted indexes increase this risk because they expose a remote interface for repeated probing. A locally hosted index removes that external interface and limits access to your internal network.

For example, an attacker sends repeated queries like “Patients diagnosed with HIV in 2024 treated with drug X” and slightly modifies filters each time. They observe changes in response confidence and content. Over time, they infer whether a specific individual’s record exists in your dataset.

These failures show that a BAA does not ensure compliance. An on-premises deployment removes the third-party surface entirely and gives you full control over data flow, access, and auditability.

Split view showing Cloud vs On-premises RAG architecture

What You Are Building

In this section, you will build a RAG system with three layers:

Ingestion layer

You ingest clinical notes and treatment protocols into a controlled vector index with enforced data hygiene. You de-identify data before any processing. HIPAA Safe Harbor requires removing 18 categories of identifiers, while Expert Determination allows a statistical approach. You apply one of these before ingestion, not after. You then chunk documents into 512-token segments with a 50-token overlap, generate embeddings using a local model, and store them in VectorAI DB with metadata.

You define a strict schema for every record. Each chunk includes document type, department, date, and author role. This metadata is not optional. It enables access control at query time and prevents cross-department leakage. Do not store raw documents without structure.
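A strict schema is easiest to enforce if every record is validated before it reaches the database. The following is a minimal sketch using only the standard library. The class and function names are hypothetical, and the ISO date check is an added assumption about how dates are formatted.

```python
import datetime
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ChunkMetadata:
    """The four required metadata fields named above."""
    document_type: str
    department: str
    date: str          # ISO 8601 calendar date, e.g. "2025-03-15" (assumed format)
    author_role: str

    def __post_init__(self) -> None:
        for field in fields(self):
            value = getattr(self, field.name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"metadata field '{field.name}' is required")
        datetime.date.fromisoformat(self.date)  # raises ValueError if malformed


def validate_payload(payload: dict) -> ChunkMetadata:
    """Reject a record before ingestion if any required field is missing."""
    try:
        return ChunkMetadata(**{f.name: payload[f.name] for f in fields(ChunkMetadata)})
    except KeyError as missing:
        raise ValueError(f"metadata field {missing} is required") from None


ok = validate_payload({
    "document_type": "clinical_note",
    "department": "cardiology",
    "date": "2025-03-15",
    "author_role": "attending_physician",
})
print(ok.department)  # cardiology
```

Failing loudly at ingestion time matters here because a chunk stored without a department value could never be matched, or worse, could be mishandled by a department filter at query time.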

Query layer

You process clinician queries through a controlled retrieval pipeline. Every query passes through role-based access control before it reaches the index. A cardiology user can only retrieve cardiology data. A scheduling bot cannot access diagnosis notes. You enforce this with a MUST filter on department or patient cohort at the database level.

Run hybrid search. Vector similarity retrieves semantically relevant chunks. Metadata filters restrict the result set. Pass the filtered context into a local LLM. The model generates an answer from retrieved data only and includes citations. Do not allow the model to invent or pull from external knowledge.

Audit layer

Log every interaction locally with full traceability. Each query writes a record that includes timestamp, user ID, department, query text, and retrieved document references. This log lives on your infrastructure with defined retention and access policies. You do not rely on external logging systems.

You can reconstruct any access event from this log. You can answer who accessed what, when, and under which role. This satisfies audit requirements and gives your security team direct visibility into system behavior.
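As a rough illustration of what such a log could look like, the sketch below appends one JSON line per query and then answers a "who accessed what" question by scanning the file. The function names and record layout are assumptions for illustration. The fields follow the list above, with the role added so that access can be traced to it.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_audit_record(log_path: Path, user_id: str, role: str, department: str,
                       query_text: str, document_ids: list[str]) -> dict:
    """Append one JSON line describing a single access event."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "role": role,
        "department": department,
        "query_text": query_text,
        "document_ids": document_ids,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def accesses_by_user(log_path: Path, user_id: str) -> list[dict]:
    """Return every logged access event for one user."""
    with log_path.open(encoding="utf-8") as fh:
        return [rec for rec in map(json.loads, fh) if rec["user_id"] == user_id]


# Demo in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "audit_logs" / "queries.jsonl"
    write_audit_record(log_file, "clinician_42", "cardiology_clinician", "cardiology",
                       "beta-blocker dosing after PCI", ["card_note_001"])
    write_audit_record(log_file, "clinician_07", "oncology_clinician", "oncology",
                       "first-line EGFR therapy", ["onco_note_001"])
    found = accesses_by_user(log_file, "clinician_42")

print(found[0]["document_ids"])  # ['card_note_001']
```

An append-only JSON Lines file is easy to inspect and rotate, but it offers no tamper protection by itself. File permissions and retention policies still have to be set on the host.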

The entire system runs on commodity hardware inside the hospital network. The end-to-end architecture of the system is shown in the image:

Hospital RAG system architecture

Building a HIPAA Compliant RAG Workflow

In this section, you will build a fully local RAG system that ingests clinical data, enforces access control, answers queries, and logs every interaction.

Prerequisites

To follow along, install the following tools on a machine inside your local network:

Docker and Docker Compose.
Python 3.10 or higher.
pip or uv (this guide uses uv).

Step 1: Deploy a vector database

You deploy a local instance of Actian VectorAI DB with persistent storage for both vector data and audit logs.

Create a docker-compose.yaml file:

services:
  vectorai:
    image: williamimoh/actian-vectorai-db:latest
    platform: linux/amd64
    container_name: vectorai_db
    ports:
      - "50052:50051"
    volumes:
      # vector data persists across restarts
      - ./data:/app/data
      # audit log lives on host — not inside the container
      - ./audit_logs:/app/audit_logs
    environment:
      - VECTORAI_LOG_LEVEL=info
    restart: unless-stopped

Run the service:

docker-compose up -d

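The database listens on port 50051 inside the container, and Docker publishes it on host port 50052, which is the address the client scripts in this tutorial connect to. Vector data persists in ./data. Audit logs are written to ./audit_logs on the host, which keeps all access records inside your network boundary.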
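Before moving on, it can help to confirm that the published host port accepts connections. The helper below is a hypothetical convenience, not part of the VectorAI client. It only checks TCP reachability on the host port from the Compose file (50052) and says nothing about gRPC health.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 50052 is the host side of the "50052:50051" mapping in docker-compose.yaml.
    print("VectorAI DB port reachable:", port_is_open("localhost", 50052))
```

If this prints False, check `docker ps` and the container logs before debugging the Python client.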

Note:

M3/M4 Apple Silicon machines might encounter a gRPC disconnection error without any container logs. In this case, disable Rosetta in Docker Desktop.
VectorAI DB is under active development.

Step 2: Build the ingestion pipeline

Install the client library and run the ingestion pipeline to convert clinical documents into embeddings and store them in your local vector database.

Use uv for dependency management and execution. It is fast, reproducible, and avoids global Python state.

Download the Actian VectorAI client package. This creates a file actian_vectorai-0.1.0b2-py3-none-any

Initialize your project by:

uv init .
uv venv

After initialization, install the Actian VectorAI package by:

uv pip install actian_vectorai-0.1.0b2-py3-none-any

Add the embedding model dependency:

uv add sentence-transformers

Create a file ingest.py with the following contents:

import re
import hashlib
from actian_vectorai import VectorAIClient, Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

# ── Config ────────────────────────────────────────────────────────────────────
VECTORAI_HOST  = "localhost:50052"
COLLECTION     = "clinical_docs"
EMBED_MODEL    = "sentence-transformers/all-MiniLM-L6-v2"  # 384-dim
VECTOR_DIM     = 384
CHUNK_TOKENS   = 512
OVERLAP_TOKENS = 50

# ── Synthetic clinical notes (replace with real de-identified corpus) ─────────
RAW_NOTES = [
    {
        "document_id":   "card_note_001",
        "document_type": "clinical_note",
        "department":    "cardiology",
        "date":          "2025-03-15",
        "author_role":   "attending_physician",
        "text": """
            Patient: [NAME REDACTED], DOB: [DATE REDACTED], MRN: [MRN REDACTED]
            Chief Complaint: Chest pain radiating to left arm, onset 2 hours ago.
            Assessment: Acute ST-elevation myocardial infarction confirmed on ECG.
            History: Hypertension and type 2 diabetes. Started on aspirin 325 mg,
            clopidogrel 600 mg loading dose, and heparin infusion per ACS protocol.
            Plan: Emergency PCI. Beta-blocker therapy with metoprolol succinate
            25 mg daily post-procedure. ACE inhibitor ramipril 5 mg daily initiated
            24 hours post-PCI. Follow-up echocardiography in 6 weeks.
        """,
    },
    {
        "document_id":   "card_protocol_001",
        "document_type": "treatment_protocol",
        "department":    "cardiology",
        "date":          "2025-01-10",
        "author_role":   "department_head",
        "text": """
            Cardiology Protocol — Heart Failure with Reduced EF (HFrEF)
            First-line therapy:
            - ACE inhibitor: ramipril 2.5–10 mg daily (or ARB if ACE-intolerant).
            - Beta-blocker: bisoprolol 1.25–10 mg daily, carvedilol 3.125–25 mg BID,
              or metoprolol succinate 12.5–200 mg daily. Titrate every 2 weeks.
            - MRA: spironolactone 25–50 mg daily for NYHA class II–IV
              if eGFR > 30 and K+ < 5.0.
            Target: Symptomatic improvement. Reassess LVEF at 3–6 months.
            Device therapy (ICD/CRT) if LVEF ≤ 35% after 3 months optimal therapy.
        """,
    },
    {
        "document_id":   "psych_note_001",
        "document_type": "clinical_note",
        "department":    "psychiatry",
        "date":          "2025-03-18",
        "author_role":   "psychiatrist",
        "text": """
            Psychiatry intake note — [NAME REDACTED], [AGE REDACTED]-year-old.
            Presenting with major depressive episode, PHQ-9 score 18 (severe).
            No current suicidal ideation. Started sertraline 50 mg daily.
            Psychotherapy referral placed. Follow-up in 2 weeks.
            Safety plan documented. Family support confirmed present.
        """,
    },
    {
        "document_id":   "onco_note_001",
        "document_type": "clinical_note",
        "department":    "oncology",
        "date":          "2025-03-20",
        "author_role":   "oncologist",
        "text": """
            Oncology note — [NAME REDACTED].
            Diagnosis: Stage IIIA non-small cell lung cancer, adenocarcinoma.
            EGFR mutation positive (exon 19 deletion).
            Plan: Osimertinib 80 mg daily (first-line EGFR-targeted therapy).
            Baseline CT chest/abdomen/pelvis completed. Brain MRI negative.
            Next imaging review in 8 weeks. Antiemetics PRN, skin care for rash.
        """,
    },
]

# ── Step 1: De-identification ─────────────────────────────────────────────────
# For production use Presidio:
#   from presidio_analyzer import AnalyzerEngine
#   from presidio_anonymizer import AnonymizerEngine
#   analyzer, anonymizer = AnalyzerEngine(), AnonymizerEngine()
#   result = analyzer.analyze(text=raw, entities=[...], language="en")
#   clean  = anonymizer.anonymize(text=raw, analyzer_results=result).text
#
# This demo applies lightweight regex to already-synthetic notes.

_HIPAA_PATTERNS = [
    (r"\b\d{3}-\d{2}-\d{4}\b",                   "[SSN]"),          # SSN
    (r"\bMRN[-:\s]*\d{4,10}\b",                   "[MRN]"),          # medical record #
    (r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",             "[DATE]"),         # dates
    (r"\b[A-Z][a-z]+ [A-Z][a-z]+\b",             "[NAME]"),         # names (naive: matches any two capitalized words)
    (r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b",         "[PHONE]"),        # phone
    (r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.\w+\b","[EMAIL]"),      # email
    (r"\b\d{5}(?:-\d{4})?\b",                     "[ZIP]"),          # zip
    (r"\b(?:https?://)\S+",                        "[URL]"),          # URLs
    (r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",  "[IP]"),           # IP addresses
]

def deidentify(text: str) -> str:
    """Replace identifiers matched by _HIPAA_PATTERNS with placeholders (demo-grade, partial Safe Harbor coverage)."""
    for pattern, replacement in _HIPAA_PATTERNS:
        text = re.sub(pattern, replacement, text)
    return text.strip()

# ── Step 2: Chunking ──────────────────────────────────────────────────────────
def chunk(text: str, size: int = CHUNK_TOKENS, overlap: int = OVERLAP_TOKENS) -> list[str]:
    """Split text into overlapping token windows (whitespace tokenisation)."""
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + size, len(tokens))
        chunks.append(" ".join(tokens[start:end]))
        if end == len(tokens):
            break
        start += size - overlap
    return chunks

# ── Step 3: Embedding ─────────────────────────────────────────────────────────
print(f"Loading embedding model: {EMBED_MODEL}")
_model = SentenceTransformer(EMBED_MODEL)

def embed(texts: list[str]) -> list[list[float]]:
    return _model.encode(texts, normalize_embeddings=True).tolist()

# ── Step 4: Ingest into VectorAI DB ──────────────────────────────────────────
def _chunk_id(doc_id: str, idx: int) -> int:
    """Stable integer ID from (document_id, chunk_index)."""
    h = hashlib.sha256(f"{doc_id}:{idx}".encode()).hexdigest()
    return int(h[:15], 16)

def ingest(notes: list[dict]) -> None:
    with VectorAIClient(VECTORAI_HOST) as client:
        # Health check
        info = client.health_check()
        print(f"VectorAI DB connected  version={info['version']}")

        # Create collection (skip if already exists)
        try:
            client.collections.create(
                name=COLLECTION,
                vectors_config=VectorParams(size=VECTOR_DIM, distance=Distance.Cosine),
            )
            print(f"Collection '{COLLECTION}' created  dim={VECTOR_DIM}")
        except Exception as e:
            if "exists" in str(e).lower():
                print(f"Collection '{COLLECTION}' already exists — skipping create")
            else:
                raise

        total_chunks = 0
        for note in notes:
            # De-identify FIRST — before chunking or embedding
            clean_text = deidentify(note["text"])

            # Chunk second
            chunks = chunk(clean_text)

            # Embed third
            vectors = embed(chunks)

            # Build PointStruct records with strict metadata schema
            # All four metadata fields are REQUIRED — no optional fields.
            points = [
                PointStruct(
                    id=_chunk_id(note["document_id"], i),
                    vector=vectors[i],
                    payload={
                        # ── strict schema ──────────────────────────────────────
                        "document_type": note["document_type"],   # required
                        "department":    note["department"],       # required — RBAC filter key
                        "date":          note["date"],             # required
                        "author_role":   note["author_role"],      # required
                        # ── retrieval helpers ──────────────────────────────────
                        "document_id":   note["document_id"],
                        "chunk_index":   i,
                        "text":          chunks[i],                # de-identified chunk text
                    },
                )
                for i in range(len(chunks))
            ]

            client.points.upsert(COLLECTION, points)
            total_chunks += len(chunks)
            print(f"  ✓ {note['document_id']}  dept={note['department']}  chunks={len(chunks)}")

        print(f"\nIngestion complete — {len(notes)} documents, {total_chunks} chunks total")

if __name__ == "__main__":
    ingest(RAW_NOTES)

This file performs the following actions:

De-identifies data: The script replaces identifiers such as names, dates, Social Security numbers, and medical record numbers with placeholders before any other processing, so PHI does not enter the pipeline. The regex patterns are demo-grade and cover only part of the Safe Harbor identifier list, so use a dedicated tool such as Presidio in production.
Chunks the texts: The script splits the cleaned text into segments of 512 whitespace-delimited tokens with a 50-token overlap. This overlap preserves context across chunk boundaries, which improves retrieval accuracy.
Embeds the chunks: The model converts each chunk into a numerical vector using a local sentence-transformers model. This process captures semantic meaning while keeping all processing within the network.
Stores with metadata: The system writes each chunk and its vector to VectorAI DB, along with necessary fields like document_type, department, date, and author_role. These fields support strict access control during queries.
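To see how the overlapping windows advance, the sketch below copies the chunk function from ingest.py and runs it with a deliberately small window. The size of 4 and overlap of 1 are illustrative values, not the tutorial's settings.

```python
def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Split text into overlapping windows of whitespace-delimited tokens."""
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + size, len(tokens))
        chunks.append(" ".join(tokens[start:end]))
        if end == len(tokens):
            break
        start += size - overlap   # each window starts (size - overlap) tokens later
    return chunks


words = " ".join(f"w{i}" for i in range(10))
pieces = chunk(words, size=4, overlap=1)
for piece in pieces:
    print(piece)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

With the tutorial's settings (512 and 50), each window starts 462 tokens after the previous one. Note that these are whitespace-delimited words rather than model tokens. According to its model card, all-MiniLM-L6-v2 truncates input beyond 256 word pieces by default, so long chunks may be only partially embedded and a smaller chunk size may be worth testing.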

Run the script by:

uv run ingest.py

You should see the following results after running the command:

ingest.py execution

From the logs, you see that the ingestion pipeline writes chunks to VectorAI DB.

Step 3: Run your queries

Execute queries against your local RAG system and validate retrieval, access control, and audit logging.

Create a file query.py with the following contents:

import json
import datetime
import urllib.request
import urllib.error
from pathlib import Path
from actian_vectorai import VectorAIClient
from actian_vectorai import FilterBuilder, Field
from sentence_transformers import SentenceTransformer

# ── Config ─────────────────────────────────────────────────────────────────────
VECTORAI_HOST = "localhost:50052"
COLLECTION    = "clinical_docs"
EMBED_MODEL   = "sentence-transformers/all-MiniLM-L6-v2"
AUDIT_LOG     = Path("./audit_logs/queries.jsonl")   # volume-mounted path

# Ollama settings — set OLLAMA_ENABLED=True once `ollama serve` is running
OLLAMA_ENABLED = False          # flip to True when Ollama is ready
OLLAMA_URL     = "http://localhost:11434/api/generate"
OLLAMA_MODEL   = "mistral"      # or "llama3.2:3b" for lower hardware

# ── RBAC: role → allowed departments ──────────────────────────────────────────
# Access is enforced as a MUST filter at the database level.
# A scheduling_bot cannot reach clinical notes; cardiology cannot see psychiatry.
ROLE_PERMISSIONS = {
    "cardiology_clinician":  ["cardiology"],
    "oncology_clinician":    ["oncology"],
    "general_practitioner":  ["cardiology", "oncology", "general"],
    "admin":                 ["cardiology", "oncology", "psychiatry", "general"],
    "scheduling_bot":        ["scheduling"],   # no clinical note access
}

class AccessDeniedError(Exception):
    pass

def allowed_departments(role: str) -> list[str]:
    if role not in ROLE_PERMISSIONS:
        raise AccessDeniedError(f"Unknown role '{role}' — access denied by default.")
    return ROLE_PERMISSIONS[role]

# ── Embedding (reuse the same model as ingest.py) ─────────────────────────────
print(f"Loading embedding model: {EMBED_MODEL}")
_model = SentenceTransformer(EMBED_MODEL)

def embed(text: str) -> list[float]:
    return _model.encode([text], normalize_embeddings=True).tolist()[0]

# ── Step 5: Search with department MUST filter ─────────────────────────────────
def retrieve(query_vec: list[float], departments: list[str], top_k: int = 5) -> list[dict]:
    """
    Hybrid retrieval: vector similarity + metadata MUST filter.
    Results from departments outside the allowed list are impossible —
    the filter is applied at the database level, not in application code.
    """
    results = []
    with VectorAIClient(VECTORAI_HOST) as client:
        for dept in departments:
            hits = client.points.search(
                collection_name=COLLECTION,
                vector=query_vec,
                limit=top_k,
                # MUST filter — department equality enforced at DB level
                filter=FilterBuilder().must(Field("department").eq(dept)).build(),
            )
            for hit in hits:
                payload = getattr(hit, "payload", {}) or {}
                results.append({
                    "score":         round(getattr(hit, "score", 0.0), 4),
                    "document_id":   payload.get("document_id"),
                    "document_type": payload.get("document_type"),
                    "department":    payload.get("department"),
                    "date":          payload.get("date"),
                    "author_role":   payload.get("author_role"),
                    "chunk_index":   payload.get("chunk_index"),
                    "text":          payload.get("text", ""),
                })

    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:top_k]

# ── Step 6: LLM answer via Ollama ─────────────────────────────────────────────
_RAG_SYSTEM = (
    "You are a clinical decision support assistant. "
    "Answer ONLY using the context passages below. "
    "Do NOT use external knowledge or make assumptions. "
    "Cite each fact as [Doc N]. "
    "If the context is insufficient, say: 'I cannot answer from the available documents.'"
)

def build_context(chunks: list[dict]) -> str:
    return "\n\n".join(
        f"[Doc {i+1}] ({c['document_type']}, dept={c['department']}, "
        f"date={c['date']}, role={c['author_role']})\n{c['text']}"
        for i, c in enumerate(chunks)
    )

def generate(query_text: str, chunks: list[dict]) -> str:
    if not OLLAMA_ENABLED:
        # Return raw retrieved context when LLM is disabled
        return "[LLM disabled — set OLLAMA_ENABLED=True]\n\n" + build_context(chunks)

    context = build_context(chunks)
    prompt = f"{_RAG_SYSTEM}\n\nContext:\n{context}\n\nQuestion: {query_text}\n\nAnswer:"

    payload = json.dumps({
        "model":  OLLAMA_MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": 400},
    }).encode()

    try:
        req = urllib.request.Request(
            OLLAMA_URL,
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            data = json.loads(resp.read())
            return data.get("response", "").strip()
    except OSError as e:  # covers URLError and socket read timeouts
        return f"[Ollama unreachable: {e}]\n\nRetrieved context:\n{build_context(chunks)}"

def write_audit(record: dict) -> None:
    AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
    with open(AUDIT_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# ── Public query entry point ───────────────────────────────────────────────────
def query(user_id: str, role: str, query_text: str, top_k: int = 5) -> dict:
    """
    Execute a role-gated RAG query.

    Returns:
        {answer, retrieved_docs, access_denied, error}
    """
    timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()

    # RBAC check — before anything else
    try:
        departments = allowed_departments(role)
    except AccessDeniedError as e:
        write_audit({
            "timestamp": timestamp, "user_id": user_id, "role": role,
            "department": "DENIED", "query_text": query_text,
            "retrieved_docs": [], "answer_provided": False, "access_denied": True,
            "denial_reason": str(e),
        })
        return {"answer": f"Access denied: {e}", "retrieved_docs": [], "access_denied": True}

    # Embed → retrieve (with MUST filter) → generate
    q_vec  = embed(query_text)
    chunks = retrieve(q_vec, departments, top_k)
    answer = generate(query_text, chunks)

    doc_refs = [
        {"document_id": c["document_id"], "chunk_index": c["chunk_index"],
         "department": c["department"], "document_type": c["document_type"],
         "score": c["score"]}
        for c in chunks
    ]

    # Audit log — every query, regardless of outcome
    write_audit({
        "timestamp":       timestamp,
        "user_id":         user_id,
        "role":            role,
        "department":      ",".join(departments),
        "query_text":      query_text,
        "retrieved_docs":  doc_refs,
        "answer_provided": True,
        "access_denied":   False,
    })

    return {"answer": answer, "retrieved_docs": doc_refs, "access_denied": False}

# ── Demo runs ─────────────────────────────────────────────────────────────────
if __name__ == "__main__":

    separator = "─" * 60

    # ── Query 1: authorised cardiology query ────────────────────────────────
    print(f"\n{separator}")
    print("QUERY 1 — cardiology_clinician (authorised)")
    print(separator)
    r1 = query(
        user_id="dr_chen_007",
        role="cardiology_clinician",
        query_text="What beta-blocker is recommended for heart failure with reduced ejection fraction?",
    )
    print(f"\nAnswer:\n{r1['answer']}\n")
    print("Retrieved sources:")
    for d in r1["retrieved_docs"]:
        print(f"  score={d['score']}  [{d['document_type']}]  dept={d['department']}  "
              f"doc={d['document_id']}  chunk={d['chunk_index']}")

    # ── Query 2: scheduling bot tries to access clinical notes ───────────────
    print(f"\n{separator}")
    print("QUERY 2 — scheduling_bot (attempting clinical note access)")
    print(separator)
    r2 = query(
        user_id="bot_sched_01",
        role="scheduling_bot",
        query_text="What are the diagnosis notes for cardiology patients?",
    )
    if r2["access_denied"]:
        print(f"\n✗ Access denied (as expected): {r2['answer']}")
    else:
        print(f"\nAnswer:\n{r2['answer']}")
        print("Sources:", r2["retrieved_docs"])

    # ── Query 3: cardiology query that must NOT return psychiatry notes ───────
    print(f"\n{separator}")
    print("QUERY 3 — cardiology_clinician (RBAC must exclude psychiatry)")
    print(separator)
    r3 = query(
        user_id="dr_chen_007",
        role="cardiology_clinician",
        query_text="antidepressant dosing and patient management",
    )
    departments_returned = {d["department"] for d in r3["retrieved_docs"]}
    cross_leak = "psychiatry" in departments_returned
    print(f"\nDepartments in results: {departments_returned or 'none'}")
    print(f"Cross-department leak: {'✗ LEAK DETECTED' if cross_leak else '✓ none — RBAC working correctly'}")

    # ── Show audit log tail ───────────────────────────────────────────────────
    print(f"\n{separator}")
    print(f"AUDIT LOG  →  {AUDIT_LOG}")
    print(separator)
    if AUDIT_LOG.exists():
        lines = AUDIT_LOG.read_text().strip().splitlines()
        for line in lines[-3:]:          # show last 3 entries
            entry = json.loads(line)
            print(json.dumps({
                "timestamp":     entry["timestamp"],
                "user_id":       entry["user_id"],
                "role":          entry["role"],
                "query_text":    entry["query_text"][:60] + "…",
                "docs_accessed": len(entry["retrieved_docs"]),
                "access_denied": entry["access_denied"],
            }, indent=2))
    else:
        print("No audit log found — run ingest.py first.")

The script performs three core operations in a single flow.

Enforce access control: The system checks the user’s role before retrieving any data. Each role maps to specific departments, and that mapping is enforced as a mandatory filter at the database level. Roles missing from the mapping are blocked and logged immediately.
Retrieve and generate answers: The system embeds the query and retrieves relevant document chunks using vector search, with strict department filters applied. It then passes the results to a local LLM. If the LLM is disabled, the retrieved context is returned directly.
Write audit logs: The system logs every query locally, including the user ID, role, query text, accessed documents, and access status. This creates a complete audit trail for compliance and review.

Run the script with:

uv run query.py

You should see the following results after running the command:

query.py execution

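The output shows three test cases: a valid clinician query, a scheduling bot attempting to reach clinical notes, and a check for cross-department leakage. Because scheduling_bot is a known role scoped to the scheduling department, its request is not rejected outright; the department filter keeps clinical notes out of its results. These confirm that RBAC and audit logging work correctly before you move to production.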

Step 4: Configure the audit log

Store every query locally by using the volume mapping defined during deployment.

The Docker configuration mounts ./audit_logs from your host into the container. When you run queries, this creates a local folder named audit_logs with a file queries.jsonl.

The file contains the following entries:

{"timestamp": "2026-03-31T15:56:02.552088+00:00", "user_id": "dr_chen_007", "role": "cardiology_clinician", "department": "cardiology", "query_text": "What beta-blocker is recommended for heart failure with reduced ejection fraction?", "retrieved_docs": [], "answer_provided": true, "access_denied": false}

{"timestamp": "2026-03-31T15:56:03.098346+00:00", "user_id": "bot_sched_01", "role": "scheduling_bot", "department": "scheduling", "query_text": "What are the diagnosis notes for cardiology patients?", "retrieved_docs": [], "answer_provided": true, "access_denied": false}

{"timestamp": "2026-03-31T15:56:03.663767+00:00", "user_id": "dr_chen_007", "role": "cardiology_clinician", "department": "cardiology", "query_text": "antidepressant dosing and patient management", "retrieved_docs": [], "answer_provided": true, "access_denied": false}

{"timestamp": "2026-03-31T15:56:18.331657+00:00", "user_id": "dr_chen_007", "role": "cardiology_clinician", "department": "cardiology", "query_text": "What beta-blocker is recommended for heart failure with reduced ejection fraction?", "retrieved_docs": [], "answer_provided": true, "access_denied": false}

{"timestamp": "2026-03-31T15:56:18.492188+00:00", "user_id": "bot_sched_01", "role": "scheduling_bot", "department": "scheduling", "query_text": "What are the diagnosis notes for cardiology patients?", "retrieved_docs": [], "answer_provided": true, "access_denied": false}

{"timestamp": "2026-03-31T15:56:18.569824+00:00", "user_id": "dr_chen_007", "role": "cardiology_clinician", "department": "cardiology", "query_text": "antidepressant dosing and patient management", "retrieved_docs": [], "answer_provided": true, "access_denied": false}

Each line represents a single query event. The log captures who made the request, their role, the department scope, the query text, and whether access was allowed. This file lives entirely on your infrastructure and gives you a complete, verifiable audit trail for every interaction with PHI.
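As a quick sanity check, the JSONL log can be summarized with a few lines of Python. The sketch below assumes the field names shown in the entries above; the function name `summarize_audit_log` and the abbreviated inline sample are illustrative and not part of the tutorial's code.

```python
import json
from collections import Counter


def summarize_audit_log(lines):
    """Count audit events per role and tally denied requests.

    `lines` is any iterable of JSONL strings; blank lines are skipped.
    """
    per_role = Counter()
    denied = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        per_role[entry["role"]] += 1
        if entry.get("access_denied"):
            denied += 1
    return {"events_per_role": dict(per_role), "denied": denied}


# Two abbreviated sample entries using the same fields as queries.jsonl
sample = [
    '{"user_id": "dr_chen_007", "role": "cardiology_clinician", "access_denied": false}',
    "",
    '{"user_id": "bot_sched_01", "role": "scheduling_bot", "access_denied": false}',
]
print(summarize_audit_log(sample))
```

To run it against the real log, pass an open file handle instead of the sample, for example `open("audit_logs/queries.jsonl", encoding="utf-8")`.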

Wrapping Up

A BAA was never enough because it does not control how your application handles PHI. You solved that by keeping all data, queries, and logs inside your network.

You now have a RAG system that enforces role-based access, retrieves only authorized data, and logs every interaction locally, without external APIs or third-party exposure.

Apply this pattern to other regulated systems. Refer to the VectorAI DB documentation and GitHub repository for updates and implementation details.

Join the community and learn more about Actian.

When to choose on-premises vs. cloud for vector databases

Offisong Emmanuel — Tue, 19 May 2026 19:42:27 +0000

For most of the last decade, the on-premises vs. cloud debate felt settled. Cloud computing was cheaper, faster, and easier to adopt. Enterprises moved workloads from on-premises infrastructure to public cloud services, relying on major cloud providers to handle scalability, maintenance, and security.

In 2026, that assumption is breaking, and cracks are showing up in legal reviews, financial projections, and SLA negotiations. Enterprises face increasing pressure from data residency regulations, stricter enforcement, and scrutiny of cloud security models. Compliance constraints, data security requirements, cost predictability, and latency are forcing teams to reconsider on-premises solutions, private cloud computing, and hybrid cloud infrastructure.

At the same time, AI is moving closer to where data is generated. Manufacturing sites, retail stores, and healthcare environments increasingly require offline capability and sub-100ms latency. That shift helps explain why Oracle released AI Database 26ai for on-premises deployment and why Google is pushing Gemini onto Distributed Cloud for air-gapped environments. This shift signals that large-scale enterprise AI no longer fits neatly in cloud environments.

In this article, we’ll examine why on-premises infrastructure is resurging, what trade-offs you need to know, and how to make defensible deployment decisions.

What’s Driving The On-premises Resurgence

The renewed interest in on-premises infrastructure is not about going back to old systems. It is a response to clear changes in how AI systems are being built and used in 2025 and 2026. For many enterprises, cloud-only vector databases no longer fit their compliance, cost, and reliability needs.

Many factors drive the current on-premises resurgence, but this article considers three key causes.

Large vendors now support on-premises AI

On-premises AI is no longer treated as an edge case by major vendors. Oracle’s release of AI Database 26ai and Google’s decision to run Gemini on Distributed Cloud show a clear shift in how enterprise AI is being packaged and delivered.

These products are built for large enterprises, not early-stage experiments or research projects. That distinction matters. Large vendors do not invest in complex on-premises AI platforms unless there is strong and growing customer demand. These announcements confirm that many enterprises want to run AI systems inside their own environments, close to their data, and under their full operational control. Why is this?

Regulatory pressure is now a real blocker

Teams used to plan for regulatory risk as a future possibility. Now it’s a day-to-day reality. GDPR enforcement reached record levels in 2025, with insufficient legal basis for data processing driving the largest penalties. That year alone, regulators issued nearly 2,700 fines totaling billions of euros.

From a data security perspective, GDPR enforcement has fundamentally changed how enterprises evaluate cloud services. While cloud service providers offer compliance tooling, legal teams are increasingly wary of relying on third-party providers for sensitive data storage and processing.

HIPAA adds another layer of complexity. For example, in Florida, physicians must maintain medical records for five years after the last patient contact, whereas hospitals must maintain them for seven years under state record-retention requirements. This makes repeated data movement risky and expensive. Financial services and government contractors face similar data sovereignty requirements that limit where data can be stored and processed. In these situations, cloud deployments add legal review, audit work, and ongoing risk. Keeping data on-premises is often the most straightforward way to meet these obligations.

Edge AI requires local and offline operation

AI workloads are increasingly deployed close to where data is created. Manufacturing facilities may operate in air-gapped environments or remote locations with limited connectivity. Retail systems must continue working during network outages. Healthcare applications often require very low latency for real-time decision support.

In these environments, relying on a remote cloud service introduces risk. Network delays and outages directly affect system reliability. On-premises and edge deployments allow vector search and inference to run locally, without depending on constant network access. For many use cases, this local execution is not an optimization but a requirement.

Together, these shifts explain why on-premises vector databases are gaining traction again. The change is driven by the practical realities of deploying production AI systems under real regulatory, cost, and reliability constraints.

The Compliance Calculus

For many enterprises, compliance is the deciding factor in the on-premises versus cloud debate. While cloud providers offer compliance certifications, the real challenge is not whether a platform can be compliant in theory, but whether it can withstand legal review, audits, and long-term operational scrutiny in practice. Once vector databases move into production and begin storing sensitive or regulated data, these questions become unavoidable.

GDPR and the limits of cross-border transfers

The Schrems II ruling changed how European data can be processed outside the EU. Privacy Shield was invalidated, leaving Standard Contractual Clauses as the primary legal mechanism for cross-border data transfers. In highly regulated industries such as financial services and healthcare, many legal teams consider SCCs insufficient due to enforcement uncertainty and ongoing legal challenges.

For vector databases, this matters because embeddings often contain derived personal data. Even if raw records are masked or tokenized, embeddings can still be considered personal data under GDPR. If data must remain within the EEA, or within a specific country, cloud deployments that rely on global infrastructure introduce legal risk. In these cases, on-premises or in-region deployment becomes a requirement rather than a preference.

HIPAA retention and the real cost of data movement

HIPAA does not explicitly require data to stay on-premises, but it does require long retention periods and strict access controls. When vector embeddings are built on top of this data, they inherit the same retention requirements. Whichever deployment model you choose, the vector database must enforce the same HIPAA data governance as the source records.

The cost impact becomes clear when egress fees are included. Consider a system storing 100 TB of embeddings in a cloud environment. At a common egress rate of $0.09 per GB, one full export of that data costs about $9,000. If the full dataset leaves the cloud once a month over a seven-year retention period, the total is:

100 TB × $0.09 per GB × 84 months = over $750,000 in egress costs alone
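The arithmetic behind that figure can be checked with a short sketch. It assumes decimal units (1 TB = 1,000 GB) and that the full 100 TB is exported once a month for 84 months, which is what the multiplication implies. The function name is illustrative.

```python
def egress_cost_usd(dataset_tb, rate_per_gb, full_exports):
    """Cost of exporting the full dataset `full_exports` times (1 TB = 1,000 GB)."""
    dataset_gb = dataset_tb * 1000
    return dataset_gb * rate_per_gb * full_exports


per_export = egress_cost_usd(100, 0.09, 1)
seven_years = egress_cost_usd(100, 0.09, 84)

print(f"One full export:    ${per_export:,.0f}")
print(f"84 monthly exports: ${seven_years:,.0f}")
```

If your data leaves the cloud less often, lower `full_exports` accordingly; the total scales linearly with the number of exports.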

This does not include compute, storage, or indexing costs. With this in mind, will cloud data warehouses really help you cut costs?

Financial services and data sovereignty rules

Financial institutions face additional constraints beyond GDPR. Regulations such as GLBA, APRA, and regional data sovereignty mandates often require strict control over where customer data is stored and processed. Regulators may demand clear evidence of geographic boundaries, access controls, and auditability.

Cloud services can meet some of these requirements, but they often introduce complex configurations, contractual dependencies, and ongoing compliance reviews. For many banks and insurers, on-premises deployment simplifies audits by keeping data within controlled infrastructure that regulators already understand.

Government and public sector constraints

Government contracts introduce some of the strictest infrastructure requirements. Standards such as FedRAMP often mandate US-only infrastructure, restricted access, and tightly controlled environments.

In these cases, public cloud services are frequently disallowed or require extensive approvals. On-premises deployment is often the only viable option for running vector databases in support of government workloads.

When compliance makes cloud untenable

If legal teams flag cross-border data transfers as unacceptable, cloud deployments quickly become impractical. Once data residency is mandatory, on-premises deployment is no longer a trade-off decision. It is a compliance requirement.

The Cost Breakdown Analysis

Cost is often the reason teams revisit the on-premises versus cloud decision. To make a defensible decision, teams need to understand where costs diverge and when self-hosting becomes economically rational.

Where self-hosting breaks even

Research from OpenMetal shows a consistent breakeven point for Pinecone vector databases at scale. Once workloads reach roughly 80 to 100 million queries per month, self-hosted deployments tend to be cheaper than managed cloud services. Below this range, cloud pricing is usually competitive. Above it, usage-based billing begins to dominate total cost.

This threshold matters because many enterprise RAG systems cross it quickly. Customer support, document search, fraud detection, and recommendation systems often serve tens or hundreds of millions of queries each month once deployed across business units or regions.

The hidden cost in cloud pricing

Cloud pricing is rarely just a per-query fee. Vector databases introduce several cost drivers that are easy to overlook during planning.

Egress fees are a major factor. Most cloud providers charge around $0.09 per GB for data leaving their network. Moving embeddings between regions, exporting data for analytics, or migrating to another system all incur these fees. Over time, they become a meaningful portion of total spend.

Vector search also does not scale linearly. As vector counts grow and dimensionality increases, query costs rise faster than expected. What looks affordable at 10 million vectors can become expensive at 500 million, even if query volume grows steadily.

On-premises costs are fixed and predictable

On-premises deployments have real costs, but they behave differently. Hardware is typically amortized over three to five years. Staffing requirements are stable once the system is running. Facilities and power costs are known in advance.

The key difference is predictability. Costs do not spike because of usage patterns or data movement. Once the system is sized correctly, monthly spend remains largely flat, even as query volume increases.
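One way to reason about the two cost shapes is a simple break-even calculation. The sketch below assumes cloud spend grows linearly with query volume and on-premises spend is flat, which is a simplification. Both input values are hypothetical placeholders, not vendor quotes, and the function name is illustrative.

```python
def breakeven_queries_per_month(onprem_fixed_usd, cloud_usd_per_million):
    """Monthly query volume at which usage-based spend equals a flat on-premises cost."""
    if cloud_usd_per_million <= 0:
        raise ValueError("cloud_usd_per_million must be positive")
    return onprem_fixed_usd / cloud_usd_per_million * 1_000_000


# Placeholder inputs, NOT vendor quotes: substitute your own numbers.
onprem_fixed_usd = 4_000      # hypothetical flat monthly cost (hardware plus staff share)
cloud_usd_per_million = 40    # hypothetical blended cloud cost per 1M queries

volume = breakeven_queries_per_month(onprem_fixed_usd, cloud_usd_per_million)
print(f"Break-even at about {volume:,.0f} queries per month")
```

Above the break-even volume the flat cost wins; below it, usage-based pricing is cheaper under these assumptions.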

A real-world example

Consider a production e-commerce application with the following scale:

500M vectors
200M queries every month
1024 vector dimensions
6M writes monthly

At this scale, a typical managed Pinecone vector database costs around $8,500 per month once compute, storage, and rebuild overhead are included.

Estimated monthly cost

Total Estimated Cost: $8,454 / month

1. Storage

Usage: 845 GB
Cost: $279

2. Query Costs

Configuration:
24 b1 nodes
4 shards × 6 replicas
Assumption: 1% filter selectivity
Estimated Cost: $8,074
Note: Actual query cost may vary. Benchmark your workload on DRN for more accurate estimates.

3. Write Costs

Write Volume: 30 million Write Units (WU)
Assumption: Each write request consumes ≥ 5 WU
Cost: $101
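The line items above can be cross-checked in a few lines. This sketch only re-adds the figures already listed in the estimate; the variable names are illustrative.

```python
# Figures copied from the estimate above (USD per month)
components = {"storage": 279, "queries": 8074, "writes": 101}
total = sum(components.values())

# Write units: 6M monthly writes at the assumed minimum of 5 WU each
write_units = 6_000_000 * 5

print(f"Total: ${total:,} / month")
for name, cost in components.items():
    print(f"  {name:<8} ${cost:>6,}  ({cost / total:.1%} of total)")
print(f"Write units: {write_units:,}")
```

The breakdown makes the cost structure visible: query capacity accounts for nearly all of the monthly bill, while storage and writes are minor.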

An equivalent on-premises deployment might cost approximately half of that after hardware amortization, assuming an 18-month payback period and one to two engineers supporting the system. After that payback period, costs drop further while capacity remains available.

A study by Enterprise Storage Forum shows the cost projection of on-premises and cloud workloads.

Cost alone does not decide every deployment, but once vector workloads reach scale, the economics become difficult to ignore. Understanding where your system sits on this curve is essential before locking in a long-term vector database strategy.

When Latency And Connectivity Matter

Latency and connectivity are often treated as secondary concerns in architecture decisions. For many AI workloads, they are decisive. Once vector databases support real-time systems, network round-trips and internet dependency can make cloud deployments impractical or unsafe.

Real-time response requirements

Some applications have strict response time limits. In healthcare, clinical decision support and diagnostic systems often require responses in under 50 milliseconds. This budget includes data retrieval, vector search, and model inference. Similarly, banks and financial institutions often require very low latency to keep the user experience responsive.

Public cloud deployments add unavoidable network latency. Even within the same region, round-trip latency typically adds 20 to 80 milliseconds before any compute work begins. For applications with tight latency targets, this overhead alone can exceed the total allowed response time. On-premises deployments remove that network hop, allowing systems to meet real-time requirements consistently.
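A latency budget check makes this concrete. The sketch below uses the 50 ms target and the 20 to 80 ms round-trip range from the text. The 25 ms compute time is a hypothetical placeholder, and treating the on-premises network hop as 0 ms is a simplification.

```python
def fits_budget(budget_ms, network_rtt_ms, compute_ms):
    """Return True when network round trip plus compute stays within the budget."""
    return network_rtt_ms + compute_ms <= budget_ms


BUDGET_MS = 50    # response-time target cited above
COMPUTE_MS = 25   # hypothetical time for retrieval, vector search, and inference

scenarios = [
    ("on-premises (no WAN hop)", 0),
    ("cloud, low end", 20),
    ("cloud, high end", 80),
]
for label, rtt in scenarios:
    ok = fits_budget(BUDGET_MS, rtt, COMPUTE_MS)
    print(f"{label:<26} rtt={rtt:>2} ms -> {'within' if ok else 'over'} budget")
```

With these inputs the low end of the cloud range still fits, but the high end exceeds the budget before any compute work is counted.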

Systems that must work offline

Many environments cannot rely on constant connectivity. Retail point-of-sale systems must continue operating during network outages. Manufacturing facilities are often located in remote areas with unstable connections. Military and maritime deployments may operate in fully disconnected or classified environments.

In these scenarios, a cloud dependency is a single point of failure. If the network goes down, the AI system stops working. On-premises and edge deployments allow vector search and inference to run locally, ensuring the system continues to function even when external connectivity is unavailable.

The cost of downtime

It is no secret that cloud provider downtime has been increasing. On November 18, 2025, a Cloudflare outage disrupted large portions of the internet, causing downtime across major platforms including X, Amazon Web Services, Spotify, and others. The impact of connectivity failures is not theoretical. In manufacturing, average downtime costs are estimated at $260,000 per hour. When AI systems support quality control, predictive maintenance, or process automation, any outage directly affects production.

A cloud-only architecture introduces risk that is hard to justify in these environments. Even short network disruptions can lead to significant financial loss. On-premises deployments reduce this risk by removing external dependencies from critical execution paths.

For workloads with strict latency targets or limited connectivity, the choice is often clear. Cloud-based vector databases may work during development, but they fail to meet operational requirements in production.

The Operational Complexity Question

The strongest argument for cloud vector databases is operational simplicity. Managed services remove the need to provision hardware, manage clusters, apply patches, or handle failures. For small teams or early-stage projects, this advantage is real and often decisive. Cloud deployments allow engineers to focus on application logic rather than infrastructure.

It is also important to recognize that modern on-premises deployments look very different from those of a decade ago. This is not the world of manual server provisioning and fragile scripts. Kubernetes, infrastructure-as-code, and automated deployment pipelines have reduced operational overhead significantly. Rolling upgrades, automated scaling, and monitoring are now standard practices in on-premises environments as well as in the cloud.

Many enterprises adopt hybrid approaches to balance speed and control. Development and experimentation happen in the cloud, where teams can move quickly and iterate. Production systems run on-premises, where costs are predictable and compliance is easier to enforce. This pattern allows teams to get the best of both models without committing fully to either.

Decision Framework: Eight Questions

The fastest way to make a defensible deployment decision is to walk through a small set of yes or no questions with engineering, legal, finance, and operations.

1. Does your data require geographic restrictions?

Regulations such as GDPR, HIPAA, and financial services rules may limit where data can be stored or processed.

If yes, on-premises should be strongly considered because it provides full control over data location. If no, cloud deployment remains viable.

2. Do you have predictable, high-volume query patterns?

Cloud vector database costs scale with usage. A simple check is monthly queries multiplied by the unit cost.

If usage exceeds roughly 80 to 100 million queries per month, on-premises is often cheaper. Below that range, cloud pricing is usually more economical.

3. Do you need offline capability?

Some systems must continue working without network access, such as in manufacturing, retail, or edge environments.

If yes, on-premises is required. If no, cloud remains an option.

4. Can you tolerate additional latency?

Cloud deployments add network latency, often 20 to 80 milliseconds per round trip even within the same region.

If your application cannot tolerate this, on-premises deployment is necessary. If it can, cloud performance may be acceptable.

5. Do you have existing infrastructure teams?

Operational capacity matters.

If you already run on-premises systems, the added burden is limited. If not, cloud-managed services provide a clear operational advantage.

6. Is cost predictability important?

Usage-based billing introduces cost variability.

If predictable costs matter, on-premises provides stability. If flexibility matters more, cloud pricing may be a better fit.

7. Are you extending existing IT infrastructure?

Deployment context affects the decision.

If you are extending existing systems, on-premises leverages current investments. If you are building something new, cloud may be faster to deploy.

8. How large is your data footprint?

Data volume and access frequency influence long-term cost.

If you manage more than 10 TB with frequent access, on-premises becomes attractive. If your data is smaller, cloud is often sufficient.

When several answers point in the same direction, the decision becomes easy to explain and defend across engineering, legal, finance, and operations teams.
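The eight questions can be encoded as a small checklist so the answers are recorded consistently. This is a sketch under stated assumptions, not a definitive scoring model: questions 3 and 4 are treated as hard requirements because the framework words them that way, question 2 uses the lower bound of the 80 to 100 million range, question 8 ignores access frequency, and the cut-offs of four and two signals are arbitrary choices made for illustration. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_geo_restriction: bool       # Q1
    monthly_queries: int              # Q2
    needs_offline: bool               # Q3
    latency_sensitive: bool           # Q4: cannot tolerate added network latency
    has_infra_team: bool              # Q5
    needs_cost_predictability: bool   # Q6
    extends_existing_infra: bool      # Q7
    data_tb: float                    # Q8


def recommend(w: Workload) -> tuple[str, list[str]]:
    """Return a leaning plus the reasons that point toward on-premises."""
    # Q3 and Q4 are worded as hard requirements in the framework.
    hard = []
    if w.needs_offline:
        hard.append("offline capability required")
    if w.latency_sensitive:
        hard.append("cannot tolerate added latency")
    if hard:
        return "on-premises", hard

    # The remaining six questions are treated as soft signals.
    signals = {
        "geographic restrictions": w.needs_geo_restriction,
        "high query volume": w.monthly_queries >= 80_000_000,
        "existing infrastructure team": w.has_infra_team,
        "cost predictability": w.needs_cost_predictability,
        "extends existing infrastructure": w.extends_existing_infra,
        "large data footprint": w.data_tb > 10,
    }
    reasons = [name for name, hit in signals.items() if hit]
    # Assumed cut-offs: 4+ signals lean on-premises, 2 or fewer lean cloud.
    if len(reasons) >= 4:
        return "on-premises", reasons
    if len(reasons) <= 2:
        return "cloud", reasons
    return "mixed (consider hybrid)", reasons


startup = Workload(
    needs_geo_restriction=False, monthly_queries=5_000_000, needs_offline=False,
    latency_sensitive=False, has_infra_team=False, needs_cost_predictability=False,
    extends_existing_infra=False, data_tb=0.5,
)
bank = Workload(
    needs_geo_restriction=True, monthly_queries=150_000_000, needs_offline=False,
    latency_sensitive=False, has_infra_team=True, needs_cost_predictability=True,
    extends_existing_infra=True, data_tb=40,
)
print(recommend(startup))
print(recommend(bank))
```

Returning the reasons alongside the leaning keeps the output explainable, which matters when the decision has to be defended to legal and finance as well as engineering.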

When Cloud Makes Sense

On-premises deployment is not always the right answer. In many situations, cloud-based vector databases remain the better choice. Being clear about these cases helps avoid over-engineering.

Unpredictable scaling: Startups and new products often face uncertain growth. Cloud platforms allow rapid scaling without long-term infrastructure commitments, which reduces risk when demand is unclear.
Small data volumes: When total data is under 10 TB and query volume stays below about 50 million queries per month, cloud pricing usually works well and is simpler than self-hosting.
Rapid experimentation: Proof-of-concepts, research projects, and early prototypes benefit from fast setup and easy teardown. Cloud services support quick iteration with minimal operational effort.
No compliance constraints: If data residency, sovereignty, and regulatory requirements are not an issue, cloud deployment avoids legal complexity and speeds up delivery.
Limited infrastructure expertise: Teams focused on application logic rather than operations can rely on managed services instead of maintaining databases, clusters, and hardware.

In these cases, cloud is the most effective and practical option.

Hybrid Deployment Strategies

Hybrid deployments act as the middle ground for enterprises that need both speed and control. Rather than treating cloud and on-premises as mutually exclusive, teams place each part of the system where it performs best.

Cloud for iteration, on-prem for scale

A common pattern is to develop and test in the cloud, where managed services and elastic infrastructure enable rapid iteration. Once models, indexes, or pipelines are stable, they are promoted into on-premises production environments to meet compliance, latency, and operational requirements. This preserves developer velocity without compromising production guarantees.

Data segregation by risk and regulation

Hybrid architectures also allow organizations to separate workloads by risk profile. Sensitive or regulated data stays on-premises, while analytics, training, or search over derived data runs in the cloud. The same logic applies regionally: EU data may remain on-premises or in sovereign environments, while US workloads run in public cloud regions, avoiding global systems being constrained by the strictest jurisdiction.

Cost and migration flexibility

Cost optimization is another driver. Frequently accessed vectors or low-latency services can be cheaper and more predictable on-premises, while cold storage and bursty workloads benefit from cloud pricing. Many teams start cloud-first, then selectively move components on-premises as scale or compliance pressures grow. Hybrid makes this a controlled evolution rather than a disruptive rewrite.

Industry research shows this is a stable operating model. Google Distributed Cloud and similar platforms explicitly frame hybrid as a long-term strategy, recognizing that modern systems are designed to span environments, not collapse them into one.

Actian’s Approach To On-premises Vector Databases

For teams that conclude that on-premises is the right deployment model, the next question is: which platform can actually meet these requirements? Actian’s approach is built specifically for this audience, without assuming the cloud is the default or the end state.

Actian delivers an enterprise-grade vector database that runs fully in your own data center or controlled environments. You retain full control over data placement, networking, and operations. There is no forced dependency on external cloud services, which simplifies audits and long-term system design.

Compliance requirements are treated as baseline constraints. By keeping data local and eliminating egress paths, Actian aligns with GDPR, HIPAA, FedRAMP, and similar regulatory frameworks. This reduces the need for compensating controls or complex legal workarounds.

Cost behavior is also predictable. Actian avoids usage-based pricing models that scale with queries or vector counts. This makes budgeting simpler and removes surprises as workloads grow.

Actian also supports edge deployments. Its architecture supports offline operation and local inference, making it suitable for manufacturing sites, retail locations, and other environments where connectivity is limited or unreliable. The system is designed to keep working even when the network does not.

Final Thoughts

Choosing between cloud and on-premises for vector databases is about understanding your priorities. Cloud works well for small workloads, rapid experimentation, and teams without deep infrastructure expertise. On-premises makes sense when compliance, latency, cost predictability, or scale are critical.

Many enterprises find a hybrid approach is the best balance, combining cloud flexibility with on-premises control. The key is making intentional decisions based on your data, workloads, and regulatory needs rather than following trends.

Actian empowers enterprises to confidently manage and govern data at scale across on-premises, cloud, and hybrid environments. Organizations trust Actian data management and data intelligence solutions to streamline complex data environments and accelerate the delivery of AI-ready data. Actian is the data and AI division of HCLSoftware. Learn more about Actian and how it fits into your on-premises AI strategy.

How to Evaluate Vector Databases in 2026

Odewole Babatunde Samson — Tue, 19 May 2026 09:12:21 +0000

In 2026, the vector database market faces a crisis of synthetic performance claims. A GitHub search for “vector database benchmark” reveals polished repositories with dashboards and performance charts. However, vendors often build these tools to evaluate their own products and portray architecture-specific strengths as objective comparisons.

Zilliz maintains VectorDBBench. Redis and Qdrant publish benchmark suites that highlight their own systems. Even widely cited Approximate Nearest Neighbor (ANN) evaluations, such as ANN-Benchmarks, rely on older image-descriptor datasets such as SIFT (Scale-Invariant Feature Transform, 128 dimensions) and GIST (960 dimensions). Modern Large Language Model (LLM) embeddings often reach 3,072 dimensions. These benchmarks do not reflect that reality.

Leaderboards reward performance under static conditions, yet production systems must survive continuous writes, metadata filters, and concurrency spikes. As software engineer Simon Frey famously noted in a viral post: “The best vector database is the one you already have.” This captures the 2026 market shift, prompting teams to move from specialized silos toward the databases they already trust and operate.

This guide takes a production-first approach. We define the five critical tests for 2026 and explore why your optimal vector database may already exist within your current architecture, whether that is PostgreSQL with pgvector or an enterprise hybrid engine like Actian VectorAI DB.

TL;DR

The bias: Most benchmark suites originate from vendors and optimize for narrow architectural advantages.
The reality: Production workloads include continuous ingestion, metadata filtering, and concurrency spikes that synthetic tests ignore.
The risk: Tail latency (P99), index fragmentation, and write amplification degrade systems long before average QPS drops.
The cost curve: Managed vector services often introduce nonlinear pricing as the dataset size increases.
The direction: 2026 favors integrated platforms, from established relational extensions (PostgreSQL + pgvector) to enterprise hybrid systems (Actian VectorAI DB), over “vector-only” silos.

Why Every Benchmark You’ve Seen is Vendor-Optimized

Benchmarks create a perception of objectivity but often encode architectural assumptions. Tools like VectorDBBench (Zilliz) reward distributed scaling, while Redis and Qdrant suites emphasize in-memory operations. To find objective data, architects must look to peer-reviewed academic conferences such as NeurIPS and VLDB (Very Large Databases), which prioritize algorithmic rigor over marketing.

Before examining what matters in production, it helps to understand how common benchmark tools shape outcomes.

Benchmark tool	Primary creator	Optimization focus	Typical bias
VectorDBBench	Zilliz (Milvus)	High-throughput scaling	Favors massive clusters; penalizes single-node systems.
vector-db-benchmark	Redis/Qdrant	In-memory operations	Favors RAM-heavy architectures; ignores TCO of memory.
ANN-Benchmarks	Academic	Raw algorithm efficiency	Uses outdated, low-dimensional datasets (SIFT/GIST).
NeurIPS / VLDB	Academic Peers	Algorithmic robustness	Focuses on math/theory; ignores operational/SLA reality.

The Hidden Rules of Benchmarking

A significant hurdle is the “DeWitt Clause,” a legal provision in many End User License Agreements (EULAs) that prohibits users from publishing independent benchmarks without the vendor’s permission. In 2024, BenchANT found that 30% of the major vector databases carry license terms of this kind, so results showing that a product is slow cannot legally be disclosed.

Furthermore, these benchmarks often operate at “Time Zero,” the artificial window immediately following ingestion but preceding live updates. In production, systems must constantly insert and delete data, forcing the index to re-optimize in real time. Vendor benchmarks often omit the Out-of-Memory (OOM) failures that result.

The Five Production Tests That Actually Matter

Most benchmarks measure performance after loading data, before any real updates occur. But production is a nonstop, unpredictable process. To find a database that can handle real users, you should run these five stress tests.

1. Filtering under concurrent load

Pure vector similarity searches are rare in real life. In production, you’re more likely to search for something like “Product recommendations WHERE category is ‘shoes’ AND stock > 0.”

Reddit’s engineering team, managing 340M+ vectors, identified metadata filtering as the primary performance bottleneck in their 2025 deployment. They found that as concurrent users grew, the database spent more time resolving metadata filters than calculating similarity distances.

The reality: Production means 100+ concurrent clients hitting different metadata subsets.
The gap: VectorDBBench only tests with a single client. In real-world situations, moving data between the vector graph and the relational metadata store can cause P99 latency to jump by 10x, as the CPU waits for disk I/O.

2. Performance degradation over time

While archival retrieval-augmented generation (RAG) systems can technically use static knowledge bases, production-grade applications in 2026 must reflect real-time data, such as customer tickets or product inventory. As the engineering team at Milvus admitted, “Benchmarks test after data ingestion completes, but production data never stops flowing.” If the database cannot re-index as quickly as it ingests data, your AI may provide stale or incorrect answers for hours.

Benchmarks that omit a “72-hour continuous write-and-query” test provide zero value. You must determine whether query performance degrades after six months of continuous index maintenance.

3. Tail latency under load (P95/P99)

Average latency can be misleading and doesn’t show what users really experience. For example, a 10ms average response time doesn’t help if your slowest 1% of queries (P99) take 800ms. This makes your AI agent seem slow and unreliable. Only high-concurrency tests reveal these spikes, which often happen during garbage collection or index locking.
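The gap between average and tail latency is easy to reproduce. The following is a minimal sketch using made-up latency samples (the sample values and the nearest-rank `percentile` helper are illustrative assumptions, not measurements from any database): roughly 1% of queries are slow, the mean stays near 10ms, and P99 still lands at 800ms.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Hypothetical workload: 989 fast queries (1.2 ms) and 11 slow ones (800 ms).
latencies_ms = [1.2] * 989 + [800.0] * 11

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.2f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

Reporting the median and the tail side by side, rather than the mean alone, makes this kind of skew visible in a benchmark report.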

4. Total cost of ownership (TCO)

In 2025, managed vendors introduced complex “read unit” pricing. This created a “Growth penalty”: if your index grows from 10GB to 100GB, you may pay 10x as much for the same query result.

Scale metric	Managed Vector DB (usage-based)	Integrated/Hybrid platform	TCO impact
Initial (10GB)	High (Platform fee + usage)	Moderate (Fixed resource)	Integrated is ~40% lower
Growth (100GB)	High (Scales with volume)	Low (Vertical scaling)	8x cost gap
Enterprise (1TB+)	Prohibitive (Linear growth)	Optimized (Reserved capacity)	90%+ long-term savings

This economic reality primarily drives the market’s shift toward “Vector as a Feature,” in which teams prioritize on-premises capabilities and predictable scaling over usage-based silos.

5. Operational maturity

Benchmarks ignore the “Operational Support Tax,” which quantifies the cost and risk of maintaining specialized infrastructure. You can easily find a PostgreSQL expert because the community has thrived for 30 years, but hiring someone proficient in a niche, three-year-old vector database often creates a bottleneck.

Evaluate the ecosystem: Does the database work with standard backup tools? Can it integrate with Prometheus? How long does it take to rebuild an index after a crash?

Here’s how benchmark claims compare to production reality.

Metric	Benchmark focus	Production reality
Ingestion	Static QPS after completion	Sustained QPS during continuous writes
Latency	Average latency	P95/P99 Latency under concurrent load
Filtering	Single-client filtered search	100+ Concurrent metadata-filtered queries
Cost	Infrastructure cost per query	TCO at 100M+ queries/month

Spotting these hidden bottlenecks is the first step to building a strong system. In 2026, the answer is rarely to use a faster, specialized database. Instead, engineers are adding these features to the tools they already know and trust.

The Consolidation Shift: Vector as a Feature

Corey Quinn, Chief Cloud Economist, once said: “Vector is a feature, not a product.” This prediction shapes the 2026 market. Teams are moving away from specialized “Vector-Only” databases and choosing integrated “Vector-Also” platforms. Shifting data between a main database and a separate vector database often causes more problems than it fixes.

The PostgreSQL renaissance

Engineers frequently argue on platforms like Hacker News that ~80% of RAG use cases (specifically those with embeddings under 2M) do not require a specialized vector database. For these workloads, standalone silos often introduce more operational friction than they offer in performance gains. Instacart validated this at scale by migrating from Elasticsearch to PostgreSQL, achieving 80% cost savings and reducing write workload by 10x after eliminating the need to coordinate and reconcile data across fragmented architectures.

Recently, pgvectorscale achieved 471 queries per second at 99% recall on 50 million vectors, outperforming Qdrant’s 41 QPS on identical AWS hardware. Vendor benchmarks often omit this result because it shows that most RAG applications don’t require a specialized vendor.

Performance metric	PostgreSQL (pgvector + pgvectorscale)	Qdrant (Specialized)	The Delta
Throughput (QPS)	471.57	41.47	11.4x higher in Postgres
P95 Latency	60.42 ms	36.73 ms	Qdrant is 39% faster at tail
P99 Latency	74.60 ms	38.71 ms	Qdrant is 48% faster at tail
Hardware	AWS r6id.4xlarge (16 vCPU)	AWS r6id.4xlarge (16 vCPU)	Parity

The integrated enterprise gap

For workloads that exceed basic extensions, Actian VectorAI DB bridges the gap by embedding a high-performance engine with native vector support. Teams can execute metadata filtering and similarity search within a single system, reducing data movement and simplifying query execution.

Platform	Architectural strategy	Intended AI capability
Actian VectorAI DB	High-performance hybrid	Engineered for integrated analytics + native vector support.
PostgreSQL	Integrated feature	Leverages pgvector within standard SQL.
AWS S3 Vectors	Storage-centric	Designed to query multi-billion vectors in object storage.
MongoDB Atlas	Unified document/vector API	Integrates native vector search directly into the existing document store workflow.

As the market consolidates, the way we evaluate databases shifts. Teams no longer ask, “Who has the fastest graph?” They ask, “Which architecture provides the most reliable query engine?” No universal winner exists. Teams instead face a spectrum of trade-offs between specialized speed and integrated reliability.

The evaluation process now puts more weight on operational strength, real-world flexibility, and support for hybrid search. Reliable query execution is becoming the top priority, especially given the growing demand for hybrid search.

Hybrid Search Reality That Pure Vector Benchmarks Hide

Pure vector search often fails the “groundedness” test, which measures how strictly an AI’s response relies on provided source material. A high groundedness score ensures that the LLM avoids fabrication and adheres closely to your internal data.

According to an analysis by the Microsoft Azure DevBlog, pure vector search alone struggles with factual accuracy, scoring a mediocre 2.79 out of 5 for groundedness. The solution is Hybrid Search, which blends semantic vector similarity with traditional keyword matching (BM25).

The 20–40% performance penalty

Hybrid search demands significant computation. The database must rank results from two different engines, such as lexical and semantic, then merge them using a fusion algorithm. Production implementations typically see a 20–40% performance penalty when moving from pure vector search to hybrid search. Reciprocal Rank Fusion (RRF) creates most of this “merge tax”, which, according to Elastic’s research, can significantly increase query latency compared to single-index lookups.
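To make the merge step concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It assumes each engine returns an ordered list of document IDs and uses the commonly cited constant k = 60; the document IDs and the tie-breaking rule are illustrative assumptions, not any particular database's implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Higher fused scores rank first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a keyword (BM25) engine and a vector engine.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

The fusion itself is cheap. The overhead in practice comes from running two retrievals and moving both candidate lists to wherever the merge happens.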

Databases that integrate vector search with filtering, full-text search, and query execution in a single engine execute hybrid queries within a single atomic statement. The query optimizer can evaluate metadata filters, full-text conditions, and vector similarity at once. This lets the optimizer produce better execution plans and move less data.

In contrast, specialized vector silos fragment the query path. Applications route requests across multiple systems and merge results outside the database. This increases system complexity and introduces unpredictable latency under load.

Hybrid platforms such as Actian VectorAI DB address this problem by embedding vector search within the database engine. This design removes cross-system joins, simplifies operations, and reduces long-term architectural overhead.

Build Your Own Evaluation Framework

Stop asking which database won a GitHub leaderboard. Start asking which architecture survives your constraints. In 2026, these constraints center on data residency, scale, and team expertise.

The case for hybrid and on-premises

Data residency is no longer optional for global companies. With EU AI Act penalties reaching 35M Euros or 7% of global revenue, cloud-only vector databases represent a legal non-starter for regulated industries.

Sovereignty: 60% of financial firms outside the US plan to adopt sovereign/on-premises vector solutions by 2028.
Cost: As query volumes hit 100M/month, the “cloud tax” becomes visible. Self-hosting or using hybrid platforms like Actian can cut your infrastructure bill in half.
Maturity: If you already manage a relational database, your team possesses 90% of the required skills.

The 2026 architecture decision tree

Does the data require on-premises storage for compliance? → Prioritize Actian VectorAI DB or self-hosted PostgreSQL.
Does your query volume exceed 100M/month? → Avoid managed usage-based pricing; use self-hosted or reserved capacity.
Do you require complex metadata filtering? → An integrated relational/vector engine is non-negotiable.
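The three questions above can be written down as a small function so the decision is explicit and reviewable. This is only a sketch: the function name, parameter names, and returned strings are assumptions made for illustration, and the 100M/month threshold comes from the list above.

```python
def recommend_architecture(needs_on_prem, queries_per_month, complex_filtering):
    """Return the recommendations triggered by the three decision-tree questions."""
    advice = []
    if needs_on_prem:
        advice.append("Prioritize Actian VectorAI DB or self-hosted PostgreSQL")
    if queries_per_month > 100_000_000:
        advice.append("Avoid managed usage-based pricing; use self-hosted or reserved capacity")
    if complex_filtering:
        advice.append("Use an integrated relational/vector engine")
    return advice


# Hypothetical workload: regulated data, 150M queries per month, heavy metadata filtering.
for line in recommend_architecture(True, 150_000_000, True):
    print("-", line)
```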

How to Evaluate the Evaluators

To avoid letting vendor benchmarks mislead you, give the evaluation tool the same careful review you give the database. To spot a biased test, look past the headline QPS numbers and check the exact conditions that produced them.

Use the following evaluation rubric to review any benchmark report before it shapes your architectural decisions.

Evaluation metric	Red flag (Discard result)	Green flag (Trustworthy result)
Ingestion state	Queries run against a static, immutable index with zero background writes.	“Read-while-Write” testing, where queries run during continuous data ingestion.
Hardware parity	Vendor cloud “Optimized” vs. Competitor “Default” local/mismatched instances.	Verified identical CPU, RAM, and Disk I/O configurations across all tested systems.
Data selectivity	“High Selectivity” filters (99% of data removed) that hide join/scan inefficiencies.	“Low Selectivity” (10–20% filtered) tests that force the engine to handle large-scale index traversal.
Dimensionality	Testing only on legacy datasets such as 128-dimension SIFT or 960-dimension GIST.	Testing on 1,536 or 3,072-dimension vectors that match modern LLM outputs.
Latency metric	Focuses strictly on “Average Latency” or “Mean Response Time.”	Clearly publishes P95 and P99 tail latency under high concurrent load.

Pre-Commitment Checklist

Test with production-representative high-dimensional embeddings (3,072d+).
Measure P99 latency with 100+ concurrent users hitting diverse metadata filters.
Calculate 3-year TCO, including storage growth, egress, and re-indexing fees.
Confirm that your team can manage observability and backups for the new stack.
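The second checklist item can be prototyped before committing to any vendor. The harness below is a rough sketch: `fake_filtered_query` is a stand-in that only sleeps, the category names are invented, and you would replace the stub with a real client call against your candidate database. It uses a thread pool to issue requests concurrently across different metadata filters and reports P50 and P99.

```python
import math
import random
import time
from concurrent.futures import ThreadPoolExecutor


def fake_filtered_query(category):
    """Stand-in for a real filtered vector search; replace with your client call."""
    time.sleep(random.uniform(0.001, 0.003))
    return category


def timed_call(category):
    """Run one query and return its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    fake_filtered_query(category)
    return (time.perf_counter() - start) * 1000


def run_load_test(concurrency=100, requests=300, categories=("shoes", "hats", "bags")):
    """Issue `requests` queries from `concurrency` threads, cycling through filters."""
    workload = [categories[i % len(categories)] for i in range(requests)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, workload))
    p99_index = math.ceil(len(latencies) * 99 / 100) - 1  # nearest-rank P99
    return {
        "count": len(latencies),
        "p50_ms": latencies[len(latencies) // 2],
        "p99_ms": latencies[p99_index],
    }


stats = run_load_test()
print(stats)
```

A real run should last far longer than this sketch and should overlap with background writes, in line with the "Read-while-Write" green flag in the rubric above.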

Final Thoughts

Real evaluation requires testing with your data, your patterns, and your scale. Load your production-representative data, run a week-long stability test under concurrent load, and measure P99 latency and the TCO.

If your workload requires compliance, hybrid deployment, or production-grade operational maturity that managed vector databases don’t offer, then Actian VectorAI DB early access is the right next step.

Join the Actian community on Discord to discuss vector architecture with engineers solving real production problems.

The hidden cost of vector database pricing models

Oluseye Jeremiah — Mon, 18 May 2026 14:47:31 +0000

For a long time, usage-based pricing seemed like the safest way to run new infrastructure. The appeal was to start small, pay very little, and let costs rise only if the product proved itself. For teams experimenting with semantic search or early retrieval systems, that trade-off made sense, particularly when fixed infrastructure commitments felt riskier than uncertain usage patterns.

That sense of safety began to fade in 2025 as several vector database providers introduced pricing floors and minimums. Pinecone announced a $50/month minimum, Weaviate implemented a $25/month floor, and similar changes rippled across the managed vector database market.

Small, steady workloads suddenly experienced step changes in cost without any corresponding increase in activity, a pattern that reflected a broader shift across the SaaS landscape. Always-on vector database infrastructure no longer fits the economics of single-digit monthly pricing. SaaS subscription costs from several large vendors rose between 10% and 20% in 2025, outpacing IT budget growth projections of 2.8%, according to Gartner.

Today, vector databases power production systems at scale. They run semantic search, recommendations, copilots, and internal knowledge tools. Data volumes stay relatively stable, and traffic patterns follow predictable curves. Yet for many organizations, vector search infrastructure has become one of the most volatile cost centers in the stack. Not because usage swings wildly, but because vector database pricing models behave differently once systems mature.

TL;DR

Cloud-native vector database pricing advertises low minimums and usage-based flexibility, but production costs tell a different story.
Hidden fees (embeddings, reindexing, backups) can double your bill.
Query costs scale with dataset size, meaning the same query becomes 10x more expensive as you grow from 10GB to 100GB.
The October 2025 pricing shift introduced $50 minimums, forcing 400-500% cost increases for stable workloads.
At 60-100M queries/month, self-hosting becomes 50-75% cheaper than cloud.
The pricing model must be an architectural decision, not an afterthought.

What pricing pages leave out

Vector database pricing pages prioritize summarizing the offering over modeling long-term cost. Their job is to make adoption frictionless, not to walk you through how the bill is calculated after a system is live. Most pages spotlight a familiar set of numbers: storage per gigabyte, read and write units, and a low monthly minimum. Free tiers are marketed as enough to get started, which makes experimentation feel low-risk.

What these pages rarely explain is how those line items interact once usage stabilizes. They typically don't model how query costs change as datasets grow, how write activity accumulates over time, or how meaningful parts of the workflow sit entirely outside the database. Pinecone's pricing examples exclude initial data import, inference for embeddings and reranking, and assistant usage. Weaviate's pricing calculator similarly omits backup costs and data egress fees.

Qdrant's estimates don't account for reindexing overhead. The same vendors that dominate every comparison list now face questions about the sustainability of their pricing. These disclaimers are present but easy to skim past when you're focused on shipping a proof of concept.

A predictable pattern repeats itself. Someone runs the calculator and sets a monthly budget. The system goes live. A few weeks later, the bill is two to four times higher than expected. Nothing broke, and no traffic spike happened. The database is doing exactly what it was built to do. The pricing page simply didn't describe the total cost of operating it.

How usage-based pricing works (and why it gets expensive)

Usage-based pricing reduces risk during experimentation when traffic is unknown. The issue is that vector databases in production are rarely unpredictable.

Once a system is live, most engineering groups have a reasonable understanding of data size and baseline query volume. What they lack is a reliable way to predict next month's bill, because managed vector databases charge across several dimensions simultaneously: storage, writes, and queries.

Each cost grows on its own curve, and none maps cleanly to user value. The part that catches development teams off guard is query pricing. In many models, query cost rises as the dataset grows, even when the query itself stays the same.

The three cost drivers you're actually paying for

Managed vector databases bill across three primary dimensions, though the exact rates vary by provider:

Storage:

Pinecone: $0.30/GB/month
Weaviate: $0.095/GB/month
Qdrant: $0.28/GB/month
Scales linearly as your dataset grows
More vector dimensions = larger bill

Operations:

Pinecone: Write units ($4/million), Read units ($16/million)
Weaviate: Per compute unit hour (variable)
Qdrant: Credit-based system
Every upsert, update, and query consumes units
Vector search operations accumulate quickly at scale

Additional services:

Embedding generation: Pinecone Inference ($0.08/million tokens)
Weaviate/Qdrant: Require external services (OpenAI, Cohere)
Reranking, backups, data transfer billed separately
Adds another vendor relationship and cost stream

Each cost dimension scales independently, and their interaction creates compounding effects that pricing calculators rarely capture. Understanding why these costs compound requires looking at how vector search actually works, specifically HNSW indexing.

Why costs compound as you scale

The cost increases stem directly from how vector search works under the hood.

How HNSW works:

Most production vector databases use approximate nearest neighbor (ANN) algorithms like HNSW (Hierarchical Navigable Small World) to make searches tractable at scale.

HNSW constructs a multi-layer graph in which the upper layers hold a sparse subset of vectors for coarse navigation and the bottom layer holds every vector, thereby organizing millions of vectors into an efficiently searchable structure.

The cost impact:

Pinecone's documentation indicates that a query consumes 1 RU per 1 GB of namespace size, with a minimum of 0.25 RUs per query. As your dataset grows, so does the graph:

Dataset size	RU per query	Cost at $16/M RU	Same query, different cost
10 GB	10 RU	$0.00016	Baseline
100 GB	100 RU	$0.0016	10x more expensive
1 TB	1,000 RU	$0.016	100x more expensive

Result:

Ten times the cost, for the same query, delivering the same result quality.

At $16 per million read units, costs scale linearly with data growth, but the functionality delivered to users stays the same. A search query returns the same number of results with the same accuracy whether your index is 10 GB or 100 GB. Your users see no difference, but you pay 10x more. This is the moment growth starts to feel like a penalty. The search has a larger graph to cover as your index expands, and you pay for that growth on every query.
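The per-query figures in the table follow directly from the two numbers quoted above: 1 RU per GB of namespace size (minimum 0.25 RU) and $16 per million RUs. Here is a small sketch of that arithmetic. The function names are hypothetical, and real billing may round or meter differently, so treat it as a budgeting aid rather than a statement of how any provider invoices.

```python
READ_UNIT_PRICE_USD = 16 / 1_000_000  # $16 per million read units (from the text)
MIN_RU_PER_QUERY = 0.25               # minimum read units charged per query
RU_PER_GB = 1.0                       # 1 RU per GB of namespace size


def query_cost_usd(namespace_gb):
    """Estimated cost of a single query against a namespace of the given size."""
    read_units = max(MIN_RU_PER_QUERY, namespace_gb * RU_PER_GB)
    return read_units * READ_UNIT_PRICE_USD


def monthly_query_cost_usd(namespace_gb, queries_per_month):
    """Estimated monthly query bill at a constant namespace size."""
    return query_cost_usd(namespace_gb) * queries_per_month


for size_gb in (10, 100, 1000):
    print(f"{size_gb:>5} GB -> ${query_cost_usd(size_gb):.5f} per query")
```

Plugging in a fixed query volume shows the effect on the bill: under this model the same one million queries cost ten times more at 100 GB than at 10 GB.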

The free tier that isn't really free

The free tier enables early experimentation but doesn't predict production economics. By the time you hit the limits, switching costs are no longer theoretical. Migration is perceived as expensive, and people accept pricing they would have questioned earlier.

Provider	Free tier limits	Production reality	Time to exceed
Pinecone	2 GB, 1M reads, 2M writes (single region)	60+ GB, 5M+ reads typical	2-4 weeks
Weaviate	1M vectors, limited compute	10M+ vectors standard	1-3 weeks
Qdrant	1 GB storage	60+ GB storage common	1-2 weeks

The October 2025 pricing shift that changed everything

These structural issues became impossible to ignore when Pinecone made a significant pricing change. By late 2025, pricing changes across major vector database providers made it clear that the pay-as-you-go (PAYG) model did not always hold once systems reached steady production. The most visible signal came in October, when Pinecone implemented a $50 monthly minimum across paid Standard plans.

For organizations already spending well above that level, the change barely registered. For smaller but stable workloads, the situation was different. Some groups had intentionally designed their usage to stay under $10 per month.

These weren't abandoned projects, but internal tools, early production features, and low-volume customer-facing systems that had already stabilized. Usage remained flat, but in some cases the introduction of pricing minimums led to five- to tenfold increases in monthly costs.

What made the moment important was not the dollar amount. It was the introduction of a fixed floor into a model marketed as consumption-based. Low usage no longer guaranteed low cost. Once that assumption broke, minimums stopped feeling like an edge case and started looking like structural risk.

Previous monthly cost	New minimum	Increase
$8	$50	525%
$12	$50	317%
$25	$50	100%
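The percentages in the table are simple to recompute for your own bill. This sketch assumes the minimum acts as a floor on the monthly charge (the `effective_bill` helper is an illustrative simplification, not a provider's billing logic):

```python
def effective_bill(usage_usd, minimum_usd=50.0):
    """Monthly charge when a minimum acts as a floor on usage-based cost."""
    return max(usage_usd, minimum_usd)


def percent_increase(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100


for previous in (8, 12, 25, 80):
    new_bill = effective_bill(previous)
    print(f"${previous} -> ${new_bill:.0f}: +{percent_increase(previous, new_bill):.0f}%")
```

Workloads already above the floor, like the $80 case, see no change, which is why the minimum barely registered for larger customers.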

The migration it forced

For anyone below the new $50 minimum, migration was rarely planned. It was reactive. Platform owners had to evaluate alternatives, export data, rebuild indexes, and validate query behavior under time pressure. In some cases, the engineering effort required to migrate exceeded the annual savings from switching providers. Many still moved anyway, because the alternative was committing to pricing that no longer matched the workload.

The impact of the pricing change became visible across developer communities. One developer documented their migration experience publicly, noting they had managed to keep bills under $10 per month by storing only essential data in the vector database. The September 2025 announcement requiring a $50 monthly minimum regardless of actual usage prompted an immediate search for alternatives.

The migration calculus proved challenging. Moving to Chroma Cloud became the chosen path, but the process revealed deeper concerns about serverless pricing models. As the developer noted, they were seeking a truly serverless solution in which costs scale linearly with usage, starting at $0. The $50 minimum eliminated that possibility.

This pattern repeated across Reddit threads and developer forums. A discussion thread titled “Pinecone's new $50/mo minimum just nuked my hobby project” captured the broader sentiment. Teams running stable, low-volume production workloads faced a choice: accept a 400-500% cost increase or invest engineering time in migration.

The issue wasn't the absolute dollar amount. For many teams, $50 per month remained affordable. The problem was precedent. If a vendor could introduce a minimum that quintupled costs without warning, what prevented future increases? The pricing change transformed vendor selection from a technical decision into a risk management calculation.

A few patterns showed up repeatedly across these migrations. Pricing predictability started to matter more than managed convenience. Open source and self-hosted options re-entered discussions that had previously defaulted to cloud. Vendor pricing risk became a first-class architectural concern. These migrations were not driven by dissatisfaction with features or performance. They were driven by economics.

What it reveals about vendor pricing power

Once a vector database is deployed in production, vendors can adjust pricing in ways that materially affect customers, even if usage remains unchanged.

Usage-based pricing lowers the barrier to adoption, but it increases switching costs over time as APIs become embedded, data formats solidify, and migrations grow expensive.

For engineering leadership, the evaluation question shifts:

Before: "What does this cost today?"
After: "How exposed are we to pricing changes once this is in production?"

Real-world cost scenarios (what you'll actually pay)

Understanding these dynamics in the abstract is one thing. Seeing how they play out in actual production systems is another.

To see the full picture, let's examine three common production scenarios and compare costs across major providers.

Scenario 1: Customer support RAG system

Imagine a customer support assistant built on historical tickets, internal documentation, and help articles. At this stage, you might be dealing with about 10 million vectors (typically 768 or 1536 vector dimensions) and around five million queries per month.

Provider	Storage	Queries	Writes	Embeddings	Overhead	Total
Pinecone	$18	$5 (but $50 min applies)	$0.40	$40-60	$20-30	$350-500
Weaviate	$6	Compute: $40-60	Included	$40-60	$15-25	$300-400
Qdrant	$17	Credits: $30-50	Included	$40-60	$15-25	$280-380

Key finding: Even at a small scale, actual costs are 3-5x higher than base calculator estimates due to minimums and complex pricing structures.
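The storage column is the easiest part of these estimates to sanity-check. The sketch below assumes uncompressed float32 vectors at 1,536 dimensions and ignores index overhead and metadata (both are assumptions, since the scenario does not state them), and it reuses the per-GB rates listed earlier. Under those assumptions, 10 million vectors come to about 61 GB, which lands close to the storage figures in the table above.

```python
BYTES_PER_FLOAT32 = 4

# Per-GB monthly storage rates quoted earlier in this article.
STORAGE_PRICE_PER_GB = {"Pinecone": 0.30, "Weaviate": 0.095, "Qdrant": 0.28}


def raw_vector_gb(num_vectors, dims):
    """Raw size of the vectors alone, in decimal gigabytes (no index or metadata)."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1e9


def monthly_storage_usd(num_vectors, dims, provider):
    """Estimated monthly storage charge for the raw vectors."""
    return raw_vector_gb(num_vectors, dims) * STORAGE_PRICE_PER_GB[provider]


size_gb = raw_vector_gb(10_000_000, 1536)
print(f"raw size: {size_gb:.2f} GB")
for provider in STORAGE_PRICE_PER_GB:
    print(f"{provider}: ${monthly_storage_usd(10_000_000, 1536, provider):.2f}/month")
```

Storage is a small share of the totals in every scenario, so an accurate storage estimate alone says little about the final bill.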

Scenario 2: E-commerce recommendation engine

As systems grow, the cost dynamics become more pronounced. With around 100 million vectors and tens of millions of queries per month, costs climb quickly. Product catalogs, user vector embeddings, and real-time personalization introduce sustained traffic and frequent updates.

Provider	Storage	Queries	Writes	Embeddings	Overhead	Total
Pinecone	$180	$192	$8	$200-300	$50-80	$1,500-2,500
Weaviate	$57	Compute: $800-1,000	Included	$200-300	$40-60	$1,400-2,200
Qdrant	$168	Credits: $600-900	Included	$200-300	$40-60	$1,300-2,100

Key finding: At mid-scale, costs converge across providers. Embedding fees often exceed base database costs.

Scenario 3: Multi-tenant SaaS platform

The economics shift dramatically at enterprise scale. At 500 million vectors and 100 million queries per month, usage-based pricing becomes structural. These large datasets contain high-dimensional vector embeddings across many customers.

Provider	Storage	Queries	Writes	Embeddings	Support	Total
Pinecone	$921	$1,200	$100-150	$500-700	$300-500	$2,500-4,000+
Weaviate	$292	Compute: $2,000-3,000	Included	$500-800	$200-400	$3,000-4,500
Qdrant	$860	Credits: $1,500-2,200	Included	$500-800	$200-400	$2,900-4,200

Key finding: At enterprise scale, annual costs reach $30,000-$54,000. This is where self-hosting economics become compelling.

Side-by-side provider comparison

To make the economics clearer, here's how the major vector database providers stack up across the dimensions that matter most for production deployments:

Feature	Pinecone	Weaviate	Qdrant	PostgreSQL + pgvector
Pricing model	Usage-based	Usage-based	Usage-based	Self-hosted (fixed)
Monthly minimum	$50	$25	None	None
Storage cost	$0.30/GB	$0.095/GB	$0.28/GB	Hardware cost only
Query pricing	Scales with data	Compute-based	Credit-based	Free within capacity
Additional costs	Many	Moderate	Some	None
Cost predictability	Low	Low-Medium	Medium	High
Scenario 1 cost	$350-500	$300-400	$280-380	~$200-300
Scenario 2 cost	$1,500-2,500	$1,400-2,200	$1,300-2,100	~$800-1,200
Scenario 3 cost	$2,500-4,000+	$3,000-4,500	$2,900-4,200	~$1,500-2,000
Best for	Fast prototyping	Hybrid search	K8s-native teams	Stable, high-volume

The hidden fees that aren't in the calculator

These scenarios reveal a consistent pattern: the advertised pricing rarely captures the full cost. Production vector search systems incur costs that are rarely modeled comprehensively by calculators. Understanding these hidden costs is crucial for accurate budgeting.

Embedding and inference fees

Pinecone Inference charges $0.08 per million tokens for generating vector embeddings. Weaviate and Qdrant don't provide native embedding services, requiring you to use external providers like OpenAI (starting at $0.10 per million tokens) or Cohere.

Converting documents to vectors costs extra beyond database operations across all platforms. Reranking adds additional per-request fees. Cohere-rerank-v3.5 has no free requests on any tier, meaning every reranking operation is billed.

These embedding and inference costs can match or exceed the database bill itself, depending on data churn and query patterns. Every time you generate new vector embeddings or update existing ones, you're paying separately from your core vector storage costs.

Reindexing costs (the silent killer)

The cost impact becomes especially severe when you need to change your approach. When you change embedding models, you must re-vectorize all data. For a 100-million-vector dataset, this could mean:

Embedding costs: $8,000-$15,000 one-time
Increased write units during migration
Processing time and compute overhead

Experimentation with models becomes prohibitively expensive, creating lock-in to initial embedding choices. The cost of generating vector embeddings at scale makes it risky to improve your system.
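A back-of-the-envelope estimate makes the re-embedding figure easier to sanity-check. In the sketch below, the per-million-token prices ($0.08 and $0.10) come from the embedding fees quoted earlier, while the tokens-per-vector values (1,000 and 1,500) are assumptions chosen for illustration; real chunk sizes vary widely. The function name is hypothetical.

```python
def reembedding_cost(num_vectors, tokens_per_vector, price_per_million_tokens):
    """Estimate the one-time cost (USD) of re-embedding an entire dataset."""
    total_tokens = num_vectors * tokens_per_vector
    return total_tokens / 1_000_000 * price_per_million_tokens


NUM_VECTORS = 100_000_000

# Assumed chunk sizes: 1,000 tokens (low case) and 1,500 tokens (high case).
low = reembedding_cost(NUM_VECTORS, tokens_per_vector=1_000, price_per_million_tokens=0.08)
high = reembedding_cost(NUM_VECTORS, tokens_per_vector=1_500, price_per_million_tokens=0.10)

print(f"estimated re-embedding cost: ${low:,.0f}-${high:,.0f}")
```

With these assumptions the estimate falls in the $8,000-$15,000 range given above. Smaller chunks or cheaper models shrink it proportionally, which is why the assumed chunk size matters as much as the token price.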

The support tax

Support tiers add meaningful costs across all managed providers. Pinecone's support tiers run from free community forums to $499/month for 24/7 coverage. Weaviate charges $500/month for their Professional support tier. Qdrant's enterprise support starts at similar levels.

Tier	Pinecone	Weaviate	Qdrant
Free	Community only	Community only	Community only
Developer	$29/month	N/A	N/A
Pro/Enterprise	$499/month	$500/month	Custom

Geographic distribution costs

Multi-region deployment for latency optimization adds data transfer costs, regional infrastructure overhead, and can increase base costs by 30-50% depending on configuration. Running vector search across multiple cloud provider regions compounds these expenses.

When self-hosting becomes 75% cheaper

Given these hidden costs and pricing volatility, many teams eventually reach a crossroads. There is a point where vector database pricing stops being a convenience question and becomes an economic one. That point usually arrives earlier than many people expect.

Timescale benchmarks show that PostgreSQL + pgvector is 75% cheaper than Pinecone, while also delivering 28x faster P95 latency compared to Pinecone's storage-optimized tier. The tipping point at which self-hosting becomes materially cheaper typically occurs between 60 and 100 million queries per month.

The cost crossover point

Breaking down the economics by scale reveals clear patterns:

Below 10M queries/month: Cloud is usually simpler. The operational overhead of self-hosting (DevOps time, monitoring, maintenance) outweighs potential savings. Managed services make sense here.

10M-60M queries/month: Economics converge. Self-hosting costs stabilize, whereas cloud costs continue to rise with usage. This is where many teams begin to seriously evaluate alternatives. The gap narrows to the point at which the decision depends more on team capabilities than on pure economics.

60M-100M+ queries/month: Self-hosting becomes 50-75% cheaper. PostgreSQL self-hosted costs approximately $835 per month on AWS EC2, compared to Pinecone's $3,241 per month for the storage-optimized index at a comparable scale. At this volume, the math becomes hard to ignore.
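The "75% cheaper" headline can be checked against the two monthly figures in the paragraph above. The helper name below is hypothetical; the inputs are the $835 and $3,241 values quoted in the text.

```python
def percent_cheaper(baseline, alternative):
    """Return how much cheaper `alternative` is than `baseline`, in percent."""
    return (baseline - alternative) / baseline * 100


# Monthly costs quoted above (USD).
pinecone_storage_optimized = 3_241
postgres_self_hosted = 835

savings = percent_cheaper(pinecone_storage_optimized, postgres_self_hosted)
print(f"self-hosted PostgreSQL is {savings:.1f}% cheaper")
```

The result is roughly 74%, which is where the rounded 75% figure comes from. Note that these two numbers cover infrastructure only and exclude engineering time.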

What self-hosting actually costs

Breaking down the real economics reveals why this shift happens. Running your own vector search infrastructure on dedicated hardware involves several cost components:

Server: $400-$800/month (OpenMetal dedicated hardware or equivalent AWS EC2 instance optimized for vector workloads)
Setup: About 40 hours initial effort ($4,000-$8,000 one-time at typical engineering rates)
Ongoing maintenance: 10-15 hours/month (roughly $1,500-$2,250/month in engineering time)
Monitoring stack: $50-$200/month (Prometheus, Grafana, alerting)
Backup storage: $100-$300/month (S3 or equivalent)

Total: About $2,050-$3,550/month versus Pinecone $5,000-$10,000+ at enterprise scale
Net savings: $2,950-$6,450/month = $35,000-$77,000/year
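The totals above follow directly from the component ranges. The short sketch below reproduces the arithmetic so the inputs can be swapped for your own numbers. It uses only the recurring monthly items from the list and leaves out the one-time setup effort.

```python
# Monthly cost ranges (low, high) in USD, taken from the list above.
# The one-time setup effort is excluded because it is not a recurring cost.
self_hosted_monthly = {
    "server": (400, 800),
    "maintenance": (1_500, 2_250),
    "monitoring": (50, 200),
    "backup_storage": (100, 300),
}
managed_monthly = (5_000, 10_000)  # Pinecone at enterprise scale, per the text

total_low = sum(low for low, _ in self_hosted_monthly.values())
total_high = sum(high for _, high in self_hosted_monthly.values())

# Pair low with low and high with high, as the summary lines do.
savings_low = managed_monthly[0] - total_low
savings_high = managed_monthly[1] - total_high

print(f"self-hosted: ${total_low:,}-${total_high:,}/month")
print(f"net savings: ${savings_low:,}-${savings_high:,}/month")
print(f"annualized:  ${savings_low * 12:,}-${savings_high * 12:,}/year")
```

The exact annualized savings are $35,400-$77,400, which the summary rounds to $35,000-$77,000. Engineering time is the largest single component, so the result is most sensitive to the maintenance estimate.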

The math gets more compelling as you scale. With large datasets containing hundreds of millions of vectors, the gap widens substantially.

Performance advantages beyond cost

The economic case is strong, but performance matters too. Timescale benchmarks demonstrate that PostgreSQL with pgvector achieves a P95 latency 28x lower than Pinecone's storage tier: 63ms versus 1,763ms. Additionally, PostgreSQL achieves 16x higher query throughput at 99% recall.

Beyond performance, self-hosting provides:

Control: Tune for your specific workload and vector dimensions
No throttling or rate limits
Data sovereignty and compliance benefits
Predictable scaling where costs are tied to capacity, not usage
Hybrid search flexibility to combine vector search with traditional queries

The hidden cost of free and serverless

Free tiers and serverless pricing are designed to feel safe. They lower friction, reduce upfront commitment, and make it easy to start building. In practice, they often delay cost visibility rather than eliminate it.

Serverless does not mean infrastructure is free. It means infrastructure is abstracted and billed indirectly through usage. For steady workloads, that abstraction usually comes at a premium. Every query, every stored vector, every embedding refresh, and every background operation is metered. Over time, convenience replaces predictability.

Free tiers follow a similar pattern. They are useful for experimentation, but they are not representative of production economics. By the time limits are reached, integration work is already done, APIs are embedded, and migration feels expensive. At that point, teams tend to accept pricing they would have challenged earlier.

A practical way to choose

Once pricing volatility appears, the question is no longer which database is cheapest today. It becomes which pricing model still works once the system stabilizes.

Three factors matter most:

Scale: How many vectors you store, how many queries you run per month, and how quickly those numbers grow
Predictability: Whether usage is bursty and uncertain, or steady and forecastable over the next six to twelve months
Control: How much operational responsibility your team can realistically take on, and how sensitive the business is to budget variance

Early on, managed cloud services usually make sense. They optimize for speed, experimentation, and unknown demand. As workloads stabilize and query volumes climb into the tens of millions per month, usage-based pricing begins to lose its advantage. Costs rise faster than value, and forecasting becomes harder, not easier.

Beyond roughly 60–100 million queries per month, many teams reach a crossover point. At that scale, self-hosted or on-premises deployments are often materially cheaper and far more predictable, even after accounting for infrastructure and operational overhead.

When each option fits

Cloud-managed services work best when:

Traffic is unpredictable or highly bursty.
Speed of iteration matters more than long-term cost.
DevOps capacity is limited.
Workloads are still exploratory.

Self-hosted or on-premises deployments make sense when:

Query volume is high and stable.
Cost predictability is a business requirement.
Budgets must be defended in advance.
Compliance or data residency matters.
Performance targets are tight.

The right choice depends on matching your pricing model to your actual production behavior.

Decision triggers that help

Instead of debating architecture continuously, many teams define clear triggers:

If monthly vector database spend exceeds $1,500, re-evaluate deployment options.
If query volume exceeds 50 million per month, model total cost of ownership for owned infrastructure.
If pricing changes exceed 20%, reassess vendor risk.
If latency targets are consistently missed, evaluate alternatives.

These triggers turn pricing from a surprise into a planned decision point.
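The triggers above are simple enough to encode as a scheduled check. The following is a minimal sketch: the thresholds are the ones listed, while the class name, field names, and the idea of feeding it from a billing export are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VectorDbMetrics:
    """Hypothetical monthly snapshot, e.g. assembled from billing and monitoring data."""
    monthly_spend_usd: float
    monthly_queries: int
    price_change_pct: float
    latency_consistently_missed: bool


def triggered_reviews(metrics):
    """Return the review actions whose trigger conditions are met."""
    reviews = []
    if metrics.monthly_spend_usd > 1_500:
        reviews.append("re-evaluate deployment options")
    if metrics.monthly_queries > 50_000_000:
        reviews.append("model total cost of ownership for owned infrastructure")
    if metrics.price_change_pct > 20:
        reviews.append("reassess vendor risk")
    if metrics.latency_consistently_missed:
        reviews.append("evaluate alternatives")
    return reviews


snapshot = VectorDbMetrics(
    monthly_spend_usd=1_800,
    monthly_queries=35_000_000,
    price_change_pct=5,
    latency_consistently_missed=False,
)
for action in triggered_reviews(snapshot):
    print(action)
```

In this example only the spend trigger fires. What counts as "consistently missed" latency is left to the caller, since the text does not define it.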

The bottom line

Vector database pricing looks simple at the start. Free tiers, low minimums, and usage-based billing suggest you only pay for what you use. In production, the economics change. Costs compound across storage, queries, embeddings, and background operations.

The same query gets more expensive as datasets grow, even when it delivers the same value. Predictability disappears at the stage where predictability matters most. For sustained workloads, there is a clear tipping point where ownership becomes cheaper and easier to justify. Teams that avoid bill shock are not the ones who negotiated better discounts; they are the ones who treated pricing as an architectural decision early.

For organizations that value fixed budgets, predictable spend, and long-term control, this is why on-premises vector databases are re-entering serious architectural discussions. Actian’s on-premises vector database, designed around transparent licensing rather than usage-based volatility, reflects that shift.

Do the cost math before you need to migrate. It is always cheaper that way.

5 Best Python Vector Database Libraries

Tiioluwani — Mon, 18 May 2026 12:42:35 +0000

Most comparisons of Python vector database libraries focus on retrieval speed, indexing algorithms, or benchmark results. These metrics matter, but a different class of problems appears once the application leaves the notebook environment and runs inside a production service: installation inconsistencies, client packaging differences, version churn, and unexpected API changes.

A typical example occurs with embedded ChromaDB setups. A project may work perfectly during development, only to fail in production with an error such as:

RuntimeError: Chroma running in http-only client mode.

A structural conflict between the chromadb and chromadb-client packages triggers this error because the client-only package lacks the default embedding functions the application depends on. Diagnosing this can take hours.

Client packaging choices and library design decisions, not retrieval quality or indexing performance, produce this type of failure.

This article compares the leading Python vector database libraries from that perspective, examining client architecture, installation stability, API design, and long-term maintainability, rather than benchmark numbers alone.

TL;DR

ChromaDB: Fastest setup for prototyping and notebook environments with minimal configuration.
Pinecone: Fully managed cloud solution with zero infrastructure management overhead.
Qdrant: Zero code changes from local development to production; the strongest open-source option for API stability.
Weaviate: Hybrid search combining vector similarity and keyword filtering at scale.
Actian VectorAI DB: On-premises deployment with the same architecture from laptop to production; Actian designed it for edge and air-gapped environments.

The Python Landscape: Understanding the Options

The relationship between a Python vector database library and its storage backend determines how you will develop, test, and eventually scale your application. The wrong library choice often triggers the environment-specific failures described above, since each architecture handles local and production environments differently.

These differences generally fall into four distinct categories, each with its own approach to the interaction between infrastructure and code.

Four client architectures

Cloud-only (e.g., Pinecone): These clients act as a full API abstraction for serverless environments. The primary advantage is zero infrastructure management, but this requires an active internet connection and an API key for all local development and testing.
OSS with managed option (e.g., Qdrant, Weaviate, Milvus): This set of tools uses the same API for both self-hosted Docker instances and managed cloud services. This provides excellent development-production parity, though it often requires managing a local server or a Docker container during development.
Embedded libraries (e.g., ChromaDB, FAISS): These tools run in-process and embed the database logic in your Python application. While they are ideal for notebooks and rapid prototyping, their developers never designed them for distributed production environments, and they do not offer a well-defined migration path as the application scales.
Extension approach (e.g., pgvector via Timescale vector): This model adds vector search capabilities to traditional relational databases. It allows existing PostgreSQL infrastructure to support vector similarity search. However, query performance varies with index configuration, dataset size, and workload characteristics; some scenarios benefit from the relational foundation, while others favor purpose-built vector architectures.

These four models describe how a client connects to storage, but they also show a practical separation between standalone search libraries and managed database systems. Choosing the wrong model generates some of the most persistent production friction in vector search applications.

A vector database provides the infrastructure required for production readiness, going beyond what standalone libraries offer. Libraries such as FAISS or Annoy are static, in-memory tools focused on approximate nearest neighbor search across large datasets. They are highly efficient for similarity search within a fixed vector space, but they cannot manage data over time.

Specialized databases like Pinecone, Qdrant, or Milvus go further, providing full CRUD support, metadata-based filtering, and distributed persistence for large datasets.

The table below summarizes where each architecture fits across common use cases.

Category	Primary trade-off	Production migration path
Cloud-only	No infrastructure management; requires network connectivity and API authentication for all environments	Same client code across development and production
OSS + managed	Identical API for local and cloud deployments; requires Docker or server setup for local development	Zero code changes between local Docker instance and managed cloud service
Embedded	In-process execution with minimal setup; limited to single-machine architecture	Client class replacement required; distributed deployment needs architecture redesign
Extension	Integrates with existing PostgreSQL infrastructure; performance depends on index configuration and dataset characteristics	Current PostgreSQL setup and scale requirements determine the migration path

Client Comparison: Developer Experience Deep Dive

Architecture does narrow your options, but the day-to-day experience of working with a Python vector database library comes down to how each client handles connection setup, version stability, and the friction points encountered during active development.

We're comparing the four clients below based on what developers deal with in real-world use.

1. Pinecone Python client

Pinecone offers one of the more polished connection experiences among cloud-only vector database clients, with extensive type hints and a straightforward initialization pattern.

    from pinecone import Pinecone

    pc = Pinecone(api_key="your-api-key")
    index = pc.Index("your-index-name")

Strengths:

Extensive type hints and IDE autocompletion support.
Pinecone introduced asyncio support in v6 via PineconeAsyncio.
gRPC mode offers higher throughput for demanding workloads.
Well-maintained official documentation.

Pain points:

Pinecone shipped three major versions in 18 months (v5, v6, and v7), introducing breaking changes to the connection logic and renaming the package from pinecone-client to pinecone.
Historical confusion between the pinecone and pinecone-client packages.
query_namespaces async operations under load require thread pool tuning.

When to choose Pinecone:

Fully managed infrastructure is a hard requirement.
The team has no appetite for self-hosted database management.
The budget allows for cloud-only pricing at the target scale.

When to avoid Pinecone:

Local development must work offline or without a live API key.
API stability across versions is a priority.
Cost at scale above 100M vectors is a constraint.

2. Weaviate Python client

Weaviate's v4 client is a meaningful step forward from v3, adding typed classes and gRPC support that noticeably improve query performance.

    import weaviate

    client = weaviate.connect_to_local()
    collection = client.collections.get("your-collection-name")

Strengths:

gRPC mode delivers 40-70% faster query performance than v3.
Typed Property and DataType classes replace untyped v3 dictionaries.
Built-in hybrid search combining vector and keyword search.
Strong support for multi-tenancy workloads.

Pain points:

Weaviate has fully deprecated the v3 API, and teams report that the migration takes weeks of work.
gRPC requires port 50051 to be open, which creates friction in restricted network environments.
Batch API redesign caused significant confusion (Issue #433).
LangChain did not ship v4 support until several months after Weaviate's release (Issue #14531).

When to choose Weaviate:

Hybrid search combining vector similarity and keyword filtering is a core requirement.
gRPC performance gains justify the network configuration overhead.
The team has the capacity to manage the v3-to-v4 migration.

When to avoid Weaviate:

Migration resources are limited, and API stability is a priority.
Network environments restrict non-standard port access.

3. ChromaDB Python client

ChromaDB offers one of the easiest onboarding experiences among Python vector database libraries, making it a natural starting point for notebooks and early-stage prototyping.

    import chromadb

    # In-memory mode
    client = chromadb.Client()

    # Persistent mode
    client = chromadb.PersistentClient(path="/your/path")

    # HTTP client mode
    client = chromadb.HttpClient(host="localhost", port=8000)

Strengths:

Simplest API surface of any client in this comparison.
Mature LangChain integration with well-documented examples.
In-memory mode requires zero configuration for notebook environments.
Large and active open-source community.

Pain points:

Python 3.13 incompatibility (Issue #3651).
Windows instability above 99 records (Issue #3058).
Confusing chromadb with chromadb-client breaks production deployments.
hnswlib compilation errors on ARM Mac processors.
Requires SQLite 3.35 or higher, creating an environment-specific setup overhead.

When to choose ChromaDB:

Rapid prototyping in a notebook environment is the primary use case.
The dataset fits comfortably within a single process.
LangChain integration with minimal configuration is a priority.

When to avoid ChromaDB:

Production deployment on Windows or Python 3.13 is required.
The application needs to scale beyond a single machine.
A clear and well-defined migration path to a distributed infrastructure is important.

4. Qdrant Python client

Developers value Qdrant for its local-to-production parity. The same client code runs against an in-memory instance during development and a fully managed cloud deployment in production, without requiring any modifications.

    from qdrant_client import QdrantClient

    # In-memory mode for local development
    client = QdrantClient(":memory:")

    # Production mode - zero code changes required
    client = QdrantClient(url="https://your-cluster-url", api_key="your-api-key")

Strengths:

:memory: mode enables a zero-code-change local-to-production workflow.
Qdrant introduced a native AsyncQdrantClient for high-concurrency workloads.
Pydantic model type safety throughout the client interface.
Rust-backed implementation with lower memory overhead compared to JVM-based alternatives.

Pain points:

Developers must explicitly set prefer_grpc=True to enable gRPC, a step they often overlook.
Port split between REST (6333) and gRPC (6334) requires careful network configuration.
Pydantic version constraints: v1.10.x or v2.2.1 and above only.
Cloud connection issues (Issue #112).

When to choose Qdrant:

Local-to-production parity is a priority, and zero code changes between environments matter.
High-concurrency async workloads require native AsyncQdrantClient support.
You prefer a self-hosted, open-source vector database deployment over a managed cloud service.
Hybrid search that combines dense and sparse vectors is a core requirement.

When to avoid Qdrant:

The team has no Docker experience and needs a simpler local setup.
The target environment cannot support gRPC network configuration.
Pydantic version constraints conflict with existing project dependencies.

Each client brings distinct trade-offs to the connection layer. These differences also extend further into how each client handles installation and platform compatibility.

Installation and Environment Management

In ideal environments, installing a Python vector database library is straightforward. In practice, the target platform, Python version, and existing package dependencies each introduce variables that can turn a simple pip install into a multi-hour debugging session. A quick compatibility check before committing to a client is worth the effort, since most of these issues only surface after the setup is complete.

The compatibility matrix

The following matrix shows client behavior across Python 3.8–3.13 on macOS ARM, Windows, and Linux.

Client	macOS ARM (M1/M2)	Windows	Linux (Debian)	Python 3.13
Pinecone	✓ Full support	✓ Full support	✓ Full support	✓ Supported
Weaviate	✓ Full support	✓ Full support	Requires Docker for gRPC	✓ Supported
ChromaDB	hnswlib compilation errors	Instability above 99 records (#3058)	Requires Debian Bookworm+	✗ Broken (#3651)
Qdrant	✓ Full support	✓ Full support	✓ Full support	✓ Supported

ChromaDB carries the heaviest compatibility burden of any client in this comparison. On macOS with ARM processors, hnswlib produces compilation errors during installation, forcing developers to manually pin Python to 3.11 or 3.12.

On Windows, ChromaDB destabilizes once a collection exceeds 99 records, making the embedded client unsuitable for anything beyond early prototyping. On Linux, Debian-based distributions require Bookworm or later to install and run ChromaDB cleanly.
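Some of these constraints can be checked before installation with the standard library alone. The sketch below tests the two that are easy to detect programmatically: the Python 3.13 incompatibility and the SQLite 3.35 minimum mentioned in the ChromaDB pain points. The function name is hypothetical, and treating every version from 3.13 upward as affected is an assumption, since the reported issue concerns 3.13 specifically.

```python
import sqlite3
import sys


def chroma_preflight(python_version=None, sqlite_version=None):
    """Return a list of warnings for known ChromaDB environment constraints."""
    if python_version is None:
        python_version = sys.version_info[:2]
    if sqlite_version is None:
        sqlite_version = sqlite3.sqlite_version_info

    warnings = []
    # Assumption: versions newer than 3.13 are treated as affected too.
    if tuple(python_version[:2]) >= (3, 13):
        warnings.append("Python 3.13 is reported as incompatible (Issue #3651)")
    if tuple(sqlite_version[:2]) < (3, 35):
        warnings.append("SQLite 3.35 or higher is required")
    return warnings


for warning in chroma_preflight():
    print("WARNING:", warning)
```

An empty result does not guarantee a clean install. Platform-specific problems such as hnswlib compilation on ARM Macs only show up when pip actually builds the package.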

Virtual environment best practices

Setting up a virtual environment before installing any vector database client saves a lot of debugging time, especially with ChromaDB, where package conflicts are common.

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install chromadb

Pinning the client version in a requirements.txt file matters equally, since several of these clients have a history of introducing breaking changes between minor releases.

    chromadb==0.4.*
    qdrant-client==1.7.*
    pinecone==3.*
    weaviate-client==4.*

ChromaDB's two-package architecture confuses many developers. When someone installs chromadb-client instead of chromadb, the application throws this error the first time it calls the default embedding function:

ValueError: You must provide an embedding function

chromadb-client is a lightweight HTTP-only package that does not include DefaultEmbeddingFunction. Applications that rely on the default embedding behavior need the full chromadb package instead.
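A quick way to catch this before deployment is to ask Python which of the two distributions is installed. The sketch below uses only the standard library's importlib.metadata; the helper names and the message strings are hypothetical, and the check reads package metadata without importing chromadb itself.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def describe_chroma_install(full_version, client_version):
    """Classify the environment based on which Chroma packages are present."""
    if full_version and client_version:
        return "conflict: both chromadb and chromadb-client are installed"
    if client_version:
        return "http-only: chromadb-client lacks the default embedding function"
    if full_version:
        return f"ok: chromadb {full_version} is installed"
    return "missing: neither package is installed"


print(describe_chroma_install(
    installed_version("chromadb"),
    installed_version("chromadb-client"),
))
```

Running a check like this in CI or at container start-up turns a confusing runtime error into an explicit message about the environment.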

Extras and optional dependencies

Beyond the base installation, each client supports additional dependencies that can improve performance for production workloads.

    # Pinecone with gRPC support (quotes keep shells like zsh from expanding the brackets)
    pip install "pinecone[grpc]"

    # Qdrant with FastEmbed for local embedding generation
    pip install "qdrant-client[fastembed]"

    # ChromaDB with sentence-transformers for local embedding support
    pip install chromadb sentence-transformers

Of these optional installs, gRPC has the largest effect on query performance. Weaviate sees 40-70% faster queries over gRPC than over REST, while Qdrant gains roughly 15% in query speed. The trade-off is that gRPC requires additional network configuration, which may not be feasible in restricted environments.

FastEmbed and sentence-transformers both enable local embedding generation without an external API dependency, keeping latency and embedding costs down for semantic search and similarity search workloads.

Qdrant's native AsyncQdrantClient and Pinecone's PineconeAsyncio deliver 3–5x throughput improvements under high-concurrency workloads.

Local Development Workflows

Developers make most vector database decisions in the local development environment. The critical question is: Which client requires the least code change when moving to production?

The migration path

Here is how each client handles the move from local deployment to production.

    import chromadb
    from pinecone import Pinecone
    from qdrant_client import QdrantClient

    # Qdrant - zero code changes required
    client = QdrantClient(":memory:")              # Development
    client = QdrantClient(                         # Production
        url="https://your-cluster-url",
        api_key="your-api-key"
    )

    # ChromaDB - client class change required
    client = chromadb.Client()                     # Development
    client = chromadb.HttpClient(                  # Production
        host="your-host",
        port=8000
    )

    # Pinecone - same code in both environments
    pc = Pinecone(api_key="your-api-key")          # Development and production
    index = pc.Index("your-index-name")

Qdrant's :memory: mode carries identical client code from local development all the way to production. The vector store configuration, cosine similarity settings, and HNSW index parameters remain the same across environments.

ChromaDB requires a change to the client class when moving to production. The more widely the codebase uses the client, the more of the application this change touches.

Pinecone uses the same code in both development and production since everything runs in the cloud regardless of the stage.
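Whichever client is used, one common way to keep the environment switch small is to build the client in a single place and drive the choice from configuration. The sketch below is an illustration under stated assumptions, not a pattern taken from any of these libraries' documentation: the environment variable names QDRANT_URL and QDRANT_API_KEY and the helper name are hypothetical. It only assembles constructor arguments, so it runs without qdrant-client installed.

```python
import os


def qdrant_client_kwargs(env=None):
    """Build QdrantClient constructor arguments from environment settings.

    With no URL configured, fall back to the in-memory mode used for
    local development; otherwise target the configured cluster.
    """
    env = os.environ if env is None else env
    url = env.get("QDRANT_URL")  # hypothetical variable name
    if not url:
        return {"location": ":memory:"}

    kwargs = {"url": url}
    api_key = env.get("QDRANT_API_KEY")  # hypothetical variable name
    if api_key:
        kwargs["api_key"] = api_key
    return kwargs


print(qdrant_client_kwargs({}))

# Usage with the real client (requires qdrant-client to be installed):
#   from qdrant_client import QdrantClient
#   client = QdrantClient(**qdrant_client_kwargs())
```

Because the rest of the application receives an already-constructed client, the difference between development and production is confined to this one function. The same idea applies to ChromaDB, where the factory would choose between Client and HttpClient.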

These migration differences stem from three distinct local development approaches: embedded mode, Docker, and cloud-only.

Embedded mode

ChromaDB’s default embedded client stores data only in memory. When the application stops running, the data is lost. For development work that needs persistent collections, PersistentClient writes data to disk instead.

    import chromadb

    # In-memory only: data lost when process ends
    client = chromadb.Client()
    collection = client.create_collection("my_collection")
    collection.add(documents=["doc1", "doc2"], ids=["1", "2"])

    # Persistent local storage
    client = chromadb.PersistentClient(path="/local/path")

Qdrant’s :memory: mode uses the same client interface as a production deployment. Whatever code works locally also works in production without any changes.

    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams

    client = QdrantClient(":memory:")
    client.create_collection(
        collection_name="my_collection",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE)
    )

Both clients work well for early prototyping and notebook environments, and the differences surface only at the production boundary.

Docker for local development

Docker runs the vector database in an isolated local container using the same configuration as in a production deployment. Qdrant and Weaviate are both open-source vector databases that support this approach.

    # Qdrant
    docker run -p 6333:6333 qdrant/qdrant

    # Weaviate
    docker run -p 8080:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:latest

Once the container is running, the client connects to localhost the same way it would to a self-hosted vector database in production.

    import weaviate
    from qdrant_client import QdrantClient

    # Qdrant
    client = QdrantClient(url="http://localhost:6333")

    # Weaviate
    client = weaviate.connect_to_local()

The main advantage is that vector index configuration behaves the same way locally and in production, and issues that surface locally are genuine issues rather than environment-specific artifacts.

The trade-off is the overhead of Docker installation and port configuration, particularly Weaviate's requirement for both ports 8080 and 50051.

Cloud-only development

Pinecone operates entirely in the cloud. Every operation, from index creation to vector upsert to real-time search, requires an active API key and a live network connection.

The setup overhead is minimal since there is no local infrastructure to configure or maintain. The same code runs across all environments, with API key management and network connectivity as the only constant requirements.

Moving beyond local development, how each client integrates with the broader Python ecosystem, particularly LangChain and LlamaIndex, adds another layer to the comparison.

Integration Ecosystem: LangChain and LlamaIndex

LangChain and LlamaIndex sit at the center of most Python-based retrieval-augmented generation workflows. All four clients integrate with both frameworks, though the quality of these integrations varies.

LangChain maturity

    from langchain_pinecone import PineconeVectorStore      # Dedicated package
    from langchain_chroma import Chroma                      # Mature, widely used
    from langchain_qdrant import QdrantVectorStore           # Actively maintained
    from langchain_weaviate import WeaviateVectorStore       # Lagged during v3 to v4

Pinecone's LangChain integration benefits from thorough documentation and a stable release history across major version changes. Teams widely rely on ChromaDB's integration to prototype retrieval-augmented generation pipelines.

Qdrant’s dedicated langchain-qdrant package keeps pace closely with client releases. Weaviate is the exception. LangChain took several months to catch up with the v3-to-v4 migration, and the subsequent breaking changes forced many teams to pin their dependency versions until the LangChain integration caught up.

LlamaIndex support

All four clients have LlamaIndex connectors through the llama-index-vector-stores namespace.

 from llama_index.vector_stores.qdrant import QdrantVectorStore

Here, the support pattern is consistent across all four clients. When a client releases a new version, LlamaIndex integration typically follows within four to eight weeks. Any breaking changes in the client affect pipelines built on top of it in the meantime.

The integration tax

Every time a vector database client releases a major update, a predictable sequence follows.

The client releases a new version with breaking changes.
LangChain and LlamaIndex update their integrations weeks later.
Production pipelines built on top of those integrations break in the meantime.

This pattern has played out with Weaviate’s v3-to-v4 migration, Pinecone’s three major releases in 18 months, and ChromaDB’s ongoing compatibility issues.

API stability matters more than feature richness when selecting a Python vector database library for production retrieval-augmented pipelines. A client with fewer features but a stable API causes considerably less disruption than one with a rich feature set and frequent breaking changes.
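
One way to catch this kind of drift before it reaches production is to compare the installed client and integration versions against the versions a team has actually tested. The sketch below uses only the standard library. The function name find_version_drift and the pinned versions are illustrative assumptions, not part of any of the clients discussed here.

```python
from importlib import metadata


def find_version_drift(pins, installed_version=metadata.version):
    """Return {package: (pinned, installed_or_None)} for every mismatch."""
    drift = {}
    for package, pinned in pins.items():
        try:
            installed = installed_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # Package is pinned but not installed at all.
        if installed != pinned:
            drift[package] = (pinned, installed)
    return drift


# Hypothetical pins: replace with the versions your team has tested together.
PINS = {"qdrant-client": "1.9.0", "langchain-qdrant": "0.1.0"}

for name, (want, have) in find_version_drift(PINS).items():
    print(f"{name}: pinned {want}, installed {have or 'not installed'}")
```

Running a check like this in CI makes an unplanned upgrade visible as a failed step rather than as a broken retrieval pipeline.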

Performance Considerations: Beyond Raw Speed

Most client comparisons overlook three factors that significantly affect vector database performance: protocol choice, the quality of async support, and connection pooling.

Protocol choice: REST vs. gRPC

gRPC and REST are the two transport protocols available across these clients. As mentioned earlier, Weaviate sees 40 to 80% faster queries over gRPC, and Qdrant gains roughly 15% in query speed with gRPC enabled. In restricted network environments where port 50051 is not accessible, REST is the more practical option.
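
The fallback described above can be automated with a plain TCP probe. This is a minimal sketch using only the standard library: port_is_open and choose_transport are hypothetical helper names, and the default of 50051 is simply the port mentioned in this section, so adjust it to whatever gRPC port your server publishes.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_transport(host, grpc_port=50051, timeout=1.0):
    """Prefer gRPC when its port is reachable, otherwise fall back to REST."""
    return "grpc" if port_is_open(host, grpc_port, timeout) else "rest"
```

The result can then drive whichever client option selects the protocol. The probe only shows that the port accepts connections, not that a gRPC service is healthy behind it.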

Async support quality

Most teams build production LLM applications on FastAPI or similar async frameworks, which makes async client support a meaningful performance consideration. Using a synchronous client within an async application results in blocking calls, which sharply reduces throughput.

Qdrant’s native AsyncQdrantClient, available since v1.6.1, provides a well-established async implementation. Pinecone introduced PineconeAsyncio in v6, bringing proper async support to cloud-only vector search workloads. Weaviate added async support in v4.7, making it the most recent of the four to reach production-ready async capabilities. ChromaDB’s async support remains the most limited of the four.

The throughput difference is substantial. For I/O-bound workloads where network latency is the bottleneck, async clients typically deliver 3–5x higher throughput than their synchronous equivalents.
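
The effect is easy to reproduce without any database. The sketch below simulates a search with asyncio.sleep and compares awaiting requests one at a time against issuing them concurrently with asyncio.gather. fake_search and the other names are stand-ins invented for this illustration, and the measured ratio depends entirely on the simulated latency, so it says nothing about the 3–5x figure for real clients.

```python
import asyncio
import time


async def fake_search(latency_s):
    """Stand-in for one vector search; sleeps to simulate network latency."""
    await asyncio.sleep(latency_s)
    return "ok"


async def run_sequential(n, latency_s):
    # Each request waits for the previous one, as a blocking client would.
    return [await fake_search(latency_s) for _ in range(n)]


async def run_concurrent(n, latency_s):
    # All requests are in flight at once; total time is about one round trip.
    return await asyncio.gather(*(fake_search(latency_s) for _ in range(n)))


def timed(coro_fn, n, latency_s):
    """Run coro_fn(n, latency_s) and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(n, latency_s))
    return results, time.perf_counter() - start


_, t_seq = timed(run_sequential, 5, 0.05)
_, t_con = timed(run_concurrent, 5, 0.05)
print(f"sequential: {t_seq:.2f}s  concurrent: {t_con:.2f}s")
```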

Connection pooling and resource management

Connection pooling is one of the configuration areas where default settings tend to fall short in production. Qdrant and Pinecone both expose parameters that give more control over connections under sustained production traffic.

# Qdrant connection pool configuration
client = QdrantClient(
    url="https://your-cluster-url",
    api_key="your-api-key",
    timeout=30,
    pool_size=10
)

# Pinecone connection pool configuration
index = pc.Index(
    "your-index-name",
    pool_threads=30,
    connection_pool_maxsize=30
)

For Pinecone, query_namespaces requires tuning pool_threads and connection_pool_maxsize for production workloads. For Qdrant, increasing pool_size above the default reduces connection contention for applications that handle large volumes of document embeddings in parallel.

Teams that tune these settings before deployment avoid considerable debugging time when the application runs under load.

Error Handling and Debugging

Vector database libraries handle a lot of complexity internally. When something fails, how clearly the client communicates that failure determines how quickly teams can fix it.

Error message quality

The quality of error messages varies considerably across the four clients.

Pinecone produces clear, actionable error messages that typically suggest a solution alongside the failure description, reducing the time teams spend searching for the root cause.

Qdrant error messages are helpful and point directly to the source of the problem. The UnexpectedResponse exception includes a specific reason field that identifies exactly which parameter failed validation.

    qdrant_client.http.exceptions.UnexpectedResponse: Status 400, reason: "Wrong input: Vector dimension error: expected dim: 384, got 768"
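
A mismatch like the one above can also be caught on the client side before any request is sent. The helper below is a hypothetical sketch, not part of the Qdrant client: it compares the embedding length against the dimension the collection was created with and fails early with a similar message.

```python
def check_dimension(vector, expected_dim):
    """Raise ValueError if the embedding length differs from the collection's."""
    actual = len(vector)
    if actual != expected_dim:
        raise ValueError(
            f"Vector dimension error: expected dim: {expected_dim}, got {actual}"
        )
    return vector


# A 384-dimensional embedding passes through unchanged.
check_dimension([0.0] * 384, expected_dim=384)
```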

ChromaDB error messages are frequently vague and require a GitHub search to diagnose. When the two-package mix-up occurs, ChromaDB raises a ValueError about missing embedding functions instead of reporting the actual root cause. The SQLite version requirement produces a similarly unhelpful error:

 RuntimeError: Your system has an unsupported version of sqlite3. Chroma requires sqlite3 >= 3.35.0.

This error is a common roadblock for Python developers deploying on older Amazon Linux 2 or Streamlit environments.

Weaviate v3 silently failed, returning null objects or dictionaries with an errors key that developers had to check manually. The v4 rewrite addressed this with typed exceptions, such as WeaviateQueryError and WeaviateGRPCUnavailableError.

Logging and observability

Observability capabilities differ across the four clients.

Qdrant supports structured logging, distributed tracing, and metrics without additional configuration, which makes it a strong fit for production machine learning applications that require visibility into vector search engine performance.
Pinecone provides basic logging through its managed infrastructure.
ChromaDB has limited logging with no structured output, which makes diagnosing issues in production AI applications considerably harder.

Common pitfalls and solutions

Three error patterns recur across all four clients in production environments.

Client version mismatches cause frequent, unexpected failures, particularly amid Pinecone's three releases over 18 months and Weaviate's v3-to-v4 migration. Teams can control this by pinning client versions in a requirements.txt file.
Embedding dimension mismatch occurs when the query embedding dimensions do not match the collection's expectations. Verifying that the embedding model's output size matches the collection configuration before deployment prevents this.
Rate limiting affects cloud-only deployments on Pinecone and Weaviate Cloud. Implementing exponential backoff on API calls is the standard solution for production workloads that approach rate limits under sustained traffic.
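
The exponential backoff mentioned for rate limiting can be sketched in a few lines. RateLimitError here is a hypothetical stand-in for whatever exception your client raises on a rate-limit response, and call_with_backoff is an illustrative name; the jitter strategy (a random wait between zero and the current cap) is one common choice among several.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a client's rate-limit exception."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(); on RateLimitError wait and retry with a doubling delay cap.

    The wait before retry k is a random value in [0, min(max_delay,
    base_delay * 2**k)]. The last failure is re-raised to the caller.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

Wrapping a call looks like call_with_backoff(lambda: index.query(...)), with the except clause changed to the exception type your client actually raises.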

How often version mismatches surface, how broadly platform incompatibilities spread, and how clearly error messages communicate failures together determine a client's real maintenance cost in production.

The version churn across Pinecone, Weaviate, and ChromaDB has left many production teams looking for a client that prioritizes operational stability over feature velocity. Actian VectorAI DB addresses this directly.

Actian VectorAI DB

Actian designed VectorAI DB for both small- and large-scale vector search deployments in edge and on-premises environments. It addresses operational stability through the following characteristics:

Architecture and design

Same architecture and APIs across all environments.
Docker-based deployment from local laptops to production infrastructure.
HNSW-based indexing for approximate nearest neighbor search.
Real-time indexing architecture with immediate update availability.
Python and JavaScript SDKs with REST and SQL APIs.
Native LangChain and LlamaIndex integration support.

Deployment options

Docker containers for local development and production.
Data center, private cloud, and public cloud infrastructure support.
Edge, remote, and air-gapped environment capability with offline operation.

Compliance design

On-premises deployment architecture for data sovereignty.
Supports GDPR, HIPAA, and data residency requirements.
Aligns with SOC 2 and ISO 27001 compliance frameworks.

Performance targets

Sub-15ms query latency goal for local deployment.
Uses HNSW to optimize for high recall accuracy.

These characteristics represent important considerations beyond features and benchmarks alone.

Decision Framework: Choosing Your Python Client

Selecting the right Python vector database library comes down to six criteria.

Deployment model: Cloud-only deployments point towards Pinecone. On-premises or self-hosted vector database requirements point towards Qdrant, Weaviate, or Actian. Teams already running PostgreSQL should test pgvector against their workload before adding a new dependency.
Team experience: Junior teams or teams new to vector databases benefit from ChromaDB or Actian, where API stability and clear error messages reduce the learning curve. Senior teams comfortable with Docker and gRPC configuration get more out of Qdrant and Weaviate.
Scale:

Under 10 million vectors: ChromaDB or FAISS
10 million to 100 million vectors: Qdrant, Pinecone, or Actian
Over 100 million vectors: Milvus

API stability: Production environments with zero tolerance for breaking changes point towards Actian or Pinecone v7+. Teams that can absorb migrations can work with Weaviate or Qdrant. POC and experimental workloads can use ChromaDB or FAISS.
Integration: LangChain-dependent pipelines should verify version support before committing to a client. Weaviate's v3-to-v4 lag is the clearest example of what happens when this check is skipped.
Cost: Pinecone's managed pricing runs from approximately $50 to $500 or more per month, depending on scale. Cost-conscious teams running large datasets on self-hosted infrastructure should evaluate Qdrant or pgvector. ChromaDB and FAISS are both free for local, open-source vector database workloads.

Decision matrix

Criteria	Pinecone	Qdrant	Weaviate	ChromaDB	Actian VectorAI DB
API stability	Medium	Good	Improving	Low	High
Local development	✗ No local mode	✓ :memory: mode	Docker required	✓ Embedded	✓ :memory: + SQLite
Platform compatibility	✓ Cloud only	✓ All platforms	✓ All platforms	✗ Issues on ARM, Win	✓ All platforms
Async support	✓ v6+	✓ Native	✓ v4.7+	✗ Limited	✓ Native
Cost	$50-500+/mo	Free / Self-hosted	Free / Managed	Free	Enterprise pricing

Final Thoughts

The ChromaDB production failure from the opening example stems from client packaging issues that developers only encounter after deployment. This comparison helps avoid similar failures: platform incompatibilities, breaking changes from version migrations, and client class redesigns that propagate across codebases.

ChromaDB gets projects started quickly but tends to show its limitations once the application moves to production. Pinecone is polished and well-managed, but version churn and permanent cloud dependencies are real costs. Qdrant is the strongest open-source option for teams that want local-to-production parity without code changes. Weaviate's v4 client significantly improves on v3 and is well-suited for teams that need hybrid search at scale.

For teams where API stability and platform compatibility are critical, enterprise-grade clients like Actian VectorAI DB provide production-ready stability with verified cross-platform support.

Explore Actian VectorAI DB for guaranteed production stability.

Can AI in Manufacturing Work Without the Cloud? A Guide

Tiioluwani — Mon, 18 May 2026 12:40:42 +0000

Keeping external traffic out of operational networks is a best practice that most manufacturing facilities build into their architecture from the ground up.

Manufacturing networks commonly follow the Purdue Model, a layered reference architecture that has shaped industrial network design for decades. The lowest levels hold the physical process and its controls: sensors, motors, and actuators at Level 0; real-time controllers and SCADA systems at Level 1; and supervisory servers and HMI systems at Level 2. Level 3 manages operations. Levels 4 and 5 connect to the enterprise network and to the internet.

IEC 62443 enforces strict boundaries between these levels, so traffic from Level 2 does not reach the internet. For defense contractors, ITAR adds a further restriction: technical data must stay on U.S. soil and remain accessible only to U.S. persons. Cloud-hosted vector databases like Pinecone, Weaviate Cloud, and Qdrant Cloud fail both requirements. A Level 2 system has no route for sending a query to a cloud endpoint, and other industries learned this lesson the hard way.

Latency is another obstacle. Cloud round-trips average 50 to 500 milliseconds. PLC-level control loops require responses in under 10 milliseconds. Teams that need AI during outages use edge deployment patterns designed for disconnected environments.

Cost adds another layer. AWS standard egress starts at $0.09 per GB. At any serious production scale, sensor and vision data add up quickly, and the bill arrives faster than most teams expect.
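
To see how quickly this adds up, here is a back-of-the-envelope estimate at the $0.09 per GB starting rate quoted above. The 100 GB per day volume is a hypothetical figure chosen for illustration, and the calculation deliberately ignores tiered pricing and any free allowance.

```python
def monthly_egress_cost(gb_per_day, rate_per_gb=0.09, days=30):
    """Flat-rate estimate in dollars: volume x days x rate, rounded to cents."""
    return round(gb_per_day * days * rate_per_gb, 2)


# Hypothetical plant sending 100 GB of sensor and vision data per day.
print(monthly_egress_cost(100))  # 270.0
```

At ten times that volume the same formula gives $2,700 per month for data transfer alone, before any compute or storage charges.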

Architecture, latency, and cost all point in the same direction. AI on the factory floor needs to run where the data lives.

This tutorial shows you how to build a local RAG pipeline that runs entirely on factory-floor hardware, where a technician can ask a question about any piece of equipment and get a cited answer from decades of maintenance records, with no internet connection required.

What You Are Building

You’ll build a three-layer RAG pipeline that runs fully inside your factory network. The ingestion layer processes PDF maintenance documents and stores them in Actian VectorAI DB. The query layer takes a technician's question and returns a cited answer fast enough for interactive use on factory-floor hardware.

Ingestion: Reads the PDF maintenance documents, splits them into 256-token chunks with a 25-token overlap, generates embeddings using sentence-transformers on a CPU, and stores everything in VectorAI DB with metadata for equipment line, document date, and source file.
Query: Takes a technician's question, embeds it with the same model, runs a hybrid search in VectorAI DB filtered by equipment line and date range, and sends the top results to a local LLM running with Ollama, which generates a cited answer in plain English.
Audit: Logs every ingestion and query event as a structured JSON entry to ./data/audit.log, timestamped in UTC, and stored inside your security boundary to satisfy IEC 62443 traceability requirements.

VectorAI DB sits at the center of the pipeline. It stores the embeddings that the ingestion layer produces and serves the searches that the query layer runs. Running it on-premises instead of in the cloud keeps the whole pipeline inside your security boundary.

The pipeline runs on standard factory edge server hardware, with Ubuntu 22.04 LTS, 16 GB of RAM, and a 4-core CPU.

Building a Local RAG Pipeline with VectorAI DB

Set up VectorAI DB, build the ingestion pipeline, run your first query, add hybrid filters, and connect a local LLM.

Prerequisites

Docker and Docker Compose
Python 3.10 or higher
uv package manager. Install with curl -LsSf https://astral.sh/uv/install.sh | sh
Ollama. Install from Ollama.com and pull the model with ollama pull llama3.2:3b

Your machine needs at least 8 GB RAM (16 GB or more recommended) and 10 GB of disk space (100 GB or more recommended) to run VectorAI DB. If you're on Windows, the uv install command needs 'sh', which PowerShell doesn't have. Run all commands in WSL2 (Windows Subsystem for Linux). To set up WSL2, run 'wsl --install' in PowerShell, then use the Ubuntu terminal for this tutorial.

Project structure

Set up your project folder like this:

factory-rag/
├── docker-compose.yml
├── data/
│   └── audit.log
├── config/
└── src/
    ├── healthcheck.py
    ├── ingest.py
    ├── query.py
    ├── llm.py
    ├── audit.py
    └── test_e2e.py

Create the directories:

mkdir -p factory-rag/{data,config,src}
cd factory-rag

Step 1: Deploy VectorAI DB

Create docker-compose.yml in your project root:

services:
  vectorai-db:
    image: williamimoh/actian-vectorai-db:latest
    platform: linux/amd64
    container_name: vectorai-db
    ports:
      - "50051:50051"
    volumes:
      - ./data:/data
      - ./config:/config
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "nc -z localhost 50051 || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 15s

Start the container with:

docker compose up -d

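Install the SDK from its wheel file. uv add needs a project, so run uv init first if the folder has no pyproject.toml, and make sure the wheel file is in the project root: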

uv add actian_vectorai-0.1.0b2-py3-none-any.whl

Install these required libraries:

uv add sentence-transformers pypdf

Check that the server is running. Make a file called src/healthcheck.py:

from actian_vectorai import VectorAIClient

# Connect to the port published in docker-compose.yml.
with VectorAIClient("localhost:50051") as client:
    info = client.health_check()
    print("✓ VectorAI DB is running")
    print(f"  Title:   {info['title']}")
    print(f"  Version: {info['version']}")

Run the script:

uv run python src/healthcheck.py

Step 2: Build the ingestion pipeline

Put your PDF maintenance documents in the data/ folder before running this step. Add any equipment maintenance records, inspection reports, or failure logs there.
The pipeline uses sentence-transformers/all-MiniLM-L6-v2, which needs less than 200 MB of RAM on CPU. We split text into 256-token chunks with a 25-token overlap to keep enough context for good retrieval.
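
The window arithmetic is worth checking in isolation before wiring in a tokenizer. The sketch below works on token counts only and uses the same rule as the ingestion script that follows: each window advances by chunk size minus overlap, which is 231 tokens here. The names window_bounds and expected_chunks are illustrative.

```python
import math


def window_bounds(n_tokens, chunk_size=256, overlap=25):
    """Return (start, end) token offsets for each chunk, end exclusive."""
    stride = chunk_size - overlap
    bounds = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        bounds.append((start, end))
        if end >= n_tokens:
            break  # The last window reached the end of the document.
        start += stride
    return bounds


def expected_chunks(n_tokens, chunk_size=256, overlap=25):
    """Closed-form chunk count for the same windowing rule."""
    if n_tokens <= 0:
        return 0
    stride = chunk_size - overlap
    return 1 + math.ceil(max(0, n_tokens - chunk_size) / stride)


print(window_bounds(600))  # [(0, 256), (231, 487), (462, 600)]
```

A 600-token document therefore produces three chunks, and each chunk after the first repeats the last 25 tokens of its predecessor.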

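Create src/ingest.py. It imports log_ingestion from src/audit.py, which Step 6 defines, so create that file before you run the ingestion step: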

from __future__ import annotations

import argparse
import uuid
from pathlib import Path

from actian_vectorai import Distance, PointStruct, VectorAIClient, VectorParams
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

from audit import log_ingestion

COLLECTION = "maintenance_records"
HOST = "localhost:50051"
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
VECTOR_DIM = 384
CHUNK_TOKENS = 256
OVERLAP_TOKENS = 25

def chunk_text(text, tokenizer, chunk_size=CHUNK_TOKENS, overlap=OVERLAP_TOKENS):
    """Split text into overlapping windows of at most chunk_size tokens."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    start = 0
    while start < len(token_ids):
        end = min(start + chunk_size, len(token_ids))
        window = token_ids[start:end]
        decoded = tokenizer.decode(window, skip_special_tokens=True).strip()
        if decoded:
            chunks.append(decoded)
        if end >= len(token_ids):
            break  # The last window reached the end of the document.
        # Advance by the stride so consecutive windows share `overlap` tokens.
        start += chunk_size - overlap
    return chunks

def ingest_pdf(pdf_path, equipment_line, doc_date, model, client):
    reader = PdfReader(str(pdf_path))
    full_text = "\n".join(page.extract_text() or "" for page in reader.pages)
    if not full_text.strip():
        print(f"  [warn] No extractable text in {pdf_path.name}, skipping.")
        return 0
    tokenizer = model.tokenizer
    chunks = chunk_text(full_text, tokenizer)
    points = []
    for idx, chunk in enumerate(chunks):
        embedding = model.encode(chunk, show_progress_bar=False).tolist()
        points.append(
            PointStruct(
                id=str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{pdf_path.name}:{idx}")),
                vector=embedding,
                payload={
                    "equipment_line": equipment_line,
                    "doc_date": doc_date,
                    "source_file": pdf_path.name,
                    "text": chunk,
                    "chunk_index": idx,
                },
            )
        )
    if points:
        client.points.upsert(COLLECTION, points)
    return len(points)

def main(data_dir, equipment_line, doc_date):
    data_path = Path(data_dir)
    pdfs = sorted(data_path.glob("*.pdf"))
    if not pdfs:
        print(f"No PDF files found in '{data_dir}'. Add PDFs to ./data/ and retry.")
        return
    print(f"Loading embedding model '{MODEL_NAME}'...")
    model = SentenceTransformer(MODEL_NAME)
    with VectorAIClient(HOST) as client:
        if not client.collections.exists(COLLECTION):
            client.collections.create(
                COLLECTION,
                vectors_config=VectorParams(size=VECTOR_DIM, distance=Distance.Cosine),
            )
            print(f"Created collection '{COLLECTION}' ({VECTOR_DIM}-dim, Cosine)")
        else:
            print(f"Collection '{COLLECTION}' already exists, appending chunks.")
        total = 0
        for pdf_path in pdfs:
            print(f"Ingesting {pdf_path.name} ...")
            count = ingest_pdf(pdf_path, equipment_line, doc_date, model, client)
            print(f"  → {count} chunks stored")
            log_ingestion(pdf_path.name, equipment_line, count)
            total += count
        print(f"\nDone. {total} total chunks stored in '{COLLECTION}'.")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Ingest PDFs into VectorAI DB")
    parser.add_argument("--data-dir", default="./data")
    parser.add_argument("--equipment-line", required=True)
    parser.add_argument("--doc-date", required=True)
    args = parser.parse_args()
    main(args.data_dir, args.equipment_line, args.doc_date)

Run the ingestion step:

uv run python src/ingest.py --equipment-line turbine-A --doc-date 2024-03-15

The metadata schema saves the equipment line, document date, and source file with each chunk. This lets you filter searches by equipment line or date range without searching the whole collection.
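
One more detail of the ingestion script deserves a note: point IDs are derived from the file name and chunk index with uuid5, so they are deterministic. The sketch below isolates that scheme in a helper named chunk_id, which is an illustrative name. Assuming upsert replaces an existing point that has the same ID, re-ingesting a file overwrites its chunks instead of duplicating them. The flip side is that two different documents with the same file name would collide.

```python
import uuid


def chunk_id(source_file, chunk_index):
    """Same file name and chunk index always map to the same UUID string."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{source_file}:{chunk_index}"))


# Re-running ingestion on the same file yields identical IDs.
print(chunk_id("pump-report.pdf", 0) == chunk_id("pump-report.pdf", 0))  # True
print(chunk_id("pump-report.pdf", 0) == chunk_id("pump-report.pdf", 1))  # False
```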

Step 3: Run your first query

Your ingestion pipeline has stored the maintenance records in VectorAI DB, so the pipeline can now answer questions. When a technician asks something in plain English, the pipeline embeds the question, searches the maintenance_records collection, and returns the top five most relevant chunks with similarity scores.

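Create src/query.py. It imports log_query from src/audit.py, which Step 6 defines, so create that file before you run a query: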

from __future__ import annotations

import argparse
import time

from actian_vectorai import Field, FilterBuilder, VectorAIClient
from sentence_transformers import SentenceTransformer

from audit import log_query

COLLECTION = "maintenance_records"
HOST = "localhost:50051"
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
TOP_K = 5

def build_filter(equipment_line=None, doc_date=None, doc_date_to=None):
    fb = FilterBuilder()
    if equipment_line:
        fb.must(Field("equipment_line").eq(equipment_line))
    if doc_date and doc_date_to:
        fb.must(Field("doc_date").range(gte=doc_date, lte=doc_date_to))
    elif doc_date:
        fb.must(Field("doc_date").eq(doc_date))
    return fb.build() if (equipment_line or doc_date) else None

def search(question, equipment_line=None, doc_date=None, doc_date_to=None):
    model = SentenceTransformer(MODEL_NAME)
    embedding = model.encode(question, show_progress_bar=False).tolist()
    query_filter = build_filter(equipment_line, doc_date, doc_date_to)
    with VectorAIClient(HOST) as client:
        hits = client.points.search(
            COLLECTION,
            vector=embedding,
            limit=TOP_K,
            filter=query_filter,
        )
    return [
        {
            "score": round(r.score, 4),
            "source_file": r.payload.get("source_file", ""),
            "equipment_line": r.payload.get("equipment_line", ""),
            "doc_date": r.payload.get("doc_date", ""),
            "chunk_index": r.payload.get("chunk_index", -1),
            "text": r.payload.get("text", ""),
        }
        for r in hits
    ]

def main():
    parser = argparse.ArgumentParser(description="Search maintenance records")
    parser.add_argument("question", help="Natural language question")
    parser.add_argument("--equipment-line", default=None)
    parser.add_argument("--doc-date", default=None)
    parser.add_argument("--doc-date-to", default=None)
    args = parser.parse_args()

    start = time.monotonic()
    results = search(args.question, equipment_line=args.equipment_line,
        doc_date=args.doc_date, doc_date_to=args.doc_date_to)
    latency_ms = (time.monotonic() - start) * 1000
    log_query(args.question, args.equipment_line or "", results, latency_ms)

    if not results:
        print("No results found.")
        return

    print(f"Top {len(results)} results for: \"{args.question}\"\n")
    for i, r in enumerate(results, 1):
        print(f"[{i}] score={r['score']:.4f}  {r['source_file']} "
              f"(chunk {r['chunk_index']})  {r['doc_date']}  {r['equipment_line']}")
        print(f"     {r['text'][:200].strip()}...")
        print()

if __name__ == "__main__":
    main()

Try your first query:

uv run python src/query.py "What caused the bearing failure?"

The search embeds the query with the same model used for ingestion, so the query and the stored vectors share one embedding space. For maintenance records with this model, similarity scores between 0.4 and 0.6 typically indicate relevant matches.
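
For intuition about what those scores measure, the collection was created with Distance.Cosine, so it is reasonable to assume the reported score is the cosine similarity between the query vector and each stored vector. The pure-Python sketch below computes that quantity; it is an illustration, not the database's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Identical directions score 1.0 and unrelated directions score near 0.0, which is why a mid-range score can still mark a useful match for short, noisy maintenance text.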

Step 4: Add hybrid filters

Filtering by equipment line and date keeps search results relevant to the technician's current work. Run the same query from Step 3, but add an equipment-line filter:

uv run python src/query.py "What caused the bearing failure?" --equipment-line turbine-A

Add a date filter to narrow the results even more:

uv run python src/query.py "What caused the bearing failure?" --equipment-line turbine-A --doc-date 2024-03-15

The build_filter function constructs a FilterBuilder query that combines vector similarity with exact metadata matching. A technician working on turbine-A only sees results from that equipment line, not from the entire maintenance history.
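
The filter logic can be reasoned about without a running database. The sketch below mirrors the three branches of build_filter against plain payload dictionaries; matches is a hypothetical helper, and how VectorAI DB itself evaluates a range on a string field is not confirmed here. The sketch relies on the fact that ISO 8601 dates in YYYY-MM-DD form sort lexicographically in chronological order.

```python
def matches(payload, equipment_line=None, date_from=None, date_to=None):
    """Mirror build_filter: exact equipment match, then date equality or range."""
    if equipment_line and payload.get("equipment_line") != equipment_line:
        return False
    doc_date = payload.get("doc_date", "")
    if date_from and date_to:
        # ISO 8601 dates (YYYY-MM-DD) compare correctly as plain strings.
        return date_from <= doc_date <= date_to
    if date_from:
        return doc_date == date_from
    return True


docs = [
    {"equipment_line": "turbine-A", "doc_date": "2024-03-15"},
    {"equipment_line": "turbine-A", "doc_date": "2023-11-02"},
    {"equipment_line": "press-B", "doc_date": "2024-03-15"},
]
print([matches(d, "turbine-A", "2024-03-15") for d in docs])  # [True, False, False]
```

Note that, as in build_filter, passing only an end date without a start date applies no date filter at all.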

Step 5: Connect the local LLM

The search results feed into a local LLM running via Ollama, which generates a cited answer in plain English. The entire round trip runs on factory-floor hardware.

Create src/llm.py:

from __future__ import annotations

import json
import os
import sys
import urllib.request
from typing import Any

OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
OLLAMA_MODEL = "llama3.2:3b"
MAX_NEW_TOKENS = 256
TEMPERATURE = 0.1
TIMEOUT_SECONDS = 300

def build_prompt(question: str, results: list[dict[str, Any]]) -> str:
    if not results:
        return f"Question: {question}\n\nAnswer: I have no relevant context to answer this question."
    context_blocks = []
    for i, r in enumerate(results, 1):
        source = r.get("source_file", "unknown")
        date = r.get("doc_date", "unknown")
        equip = r.get("equipment_line", "unknown")
        text = r.get("text", "").strip()
        context_blocks.append(
            f"[{i}] Source: {source} | Equipment: {equip} | Date: {date}\n{text}"
        )
    context = "\n\n".join(context_blocks)
    return (
        "You are a maintenance records assistant. "
        "Answer the question using ONLY the provided context. "
        "Cite sources inline using [1], [2], etc. "
        "If the context does not contain enough information, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

def generate(question: str, results: list[dict[str, Any]]) -> str:
    prompt = build_prompt(question, results)
    payload = json.dumps({
        "model": OLLAMA_MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_predict": MAX_NEW_TOKENS,
            "temperature": TEMPERATURE,
        },
    }).encode()
    req = urllib.request.Request(
        f"{OLLAMA_HOST}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=TIMEOUT_SECONDS) as resp:
        body = json.loads(resp.read().decode())
    return body["response"].strip()

def answer(question: str, results: list[dict[str, Any]]) -> str:
    reply = generate(question, results)
    print(reply)
    print()
    print("Sources")
    for i, r in enumerate(results, 1):
        print(
            f"  [{i}] {r.get('source_file', '?')} "
            f"(chunk {r.get('chunk_index', '?')}, score {r.get('score', 0):.4f}) "
            f"{r.get('doc_date', '?')} / {r.get('equipment_line', '?')}"
        )
    return reply

if __name__ == "__main__":
    question = sys.argv[1] if len(sys.argv) > 1 else "What maintenance was performed?"
    dummy_results = [
        {
            "source_file": "example.pdf",
            "doc_date": "2024-03-15",
            "equipment_line": "turbine-A",
            "chunk_index": 0,
            "score": 0.95,
            "text": (
                "Performed scheduled bearing inspection on turbine-A. "
                "Replaced worn bearing race on shaft 2. "
                "Torque settings verified per spec TRB-004."
            ),
        }
    ]
    answer(question, dummy_results)

Wire everything together by creating src/test_e2e.py:

from query import search
from llm import answer

question = "What maintenance was performed on the gearbox?"
results = search(question, equipment_line="turbine-A")
answer(question, results)

Run the full pipeline:

uv run python src/test_e2e.py

llama3.2:3b fits in the memory of a standard factory edge server. The LLM receives only the retrieved chunks as context, not the full document collection, which keeps responses fast and grounded in cited sources.

The pipeline is fully up and running. A technician can ask a question, get a cited answer from local maintenance records, and never need to use the internet.
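
Because the prompt asks the model to cite sources as [1], [2], and so on, a small optional check can flag answers that cite a source number that was never supplied. This is a sketch of one possible safeguard and is not part of the pipeline above; invalid_citations is a hypothetical helper name.

```python
import re


def invalid_citations(answer_text, n_sources):
    """Return cited numbers in answer_text that fall outside 1..n_sources."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer_text)}
    return sorted(n for n in cited if not 1 <= n <= n_sources)


# Two sources were supplied, so a reference to [3] is flagged.
print(invalid_citations("Bearing race replaced [1]. See also [3].", 2))  # [3]
```

An empty list means every citation points at a retrieved chunk. It does not verify that the cited chunk actually supports the sentence.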

Step 6: Add audit logging

IEC 62443 requires full traceability for every operation within the OT network. Without a local audit trail, your pipeline has no record of what was queried, when, or what it returned.

Create src/audit.py:

from __future__ import annotations

import json
import logging
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("./data/audit.log")
LOG_PATH.parent.mkdir(parents=True, exist_ok=True)

handler = logging.FileHandler(str(LOG_PATH))
handler.setLevel(logging.INFO)

logger = logging.getLogger("actian_vectorai.audit")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

def log_query(question: str, equipment_line: str, results: list, latency_ms: float) -> None:
    entry = {
        "event": "query",
        "timestamp": datetime.now(tz=timezone.utc).isoformat(),
        "question": question,
        "equipment_line": equipment_line,
        "results_returned": len(results),
        "latency_ms": round(latency_ms, 2),
    }
    logger.info(json.dumps(entry))

def log_ingestion(source_file: str, equipment_line: str, chunks_stored: int) -> None:
    entry = {
        "event": "ingestion",
        "timestamp": datetime.now(tz=timezone.utc).isoformat(),
        "source_file": source_file,
        "equipment_line": equipment_line,
        "chunks_stored": chunks_stored,
    }
    logger.info(json.dumps(entry))

View the audit log with this command:

cat data/audit.log

The pipeline now keeps a structured record of every ingestion and query event in ./data/audit.log, timestamped in UTC and stored inside your security boundary.
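
Since each log line is a standalone JSON object, the file is easy to analyze with the standard library. The sketch below counts events by type and averages query latency. summarize_audit is an illustrative name, and the field names are the ones written by log_query and log_ingestion above.

```python
import json
from collections import Counter


def summarize_audit(lines):
    """Count events by type and average query latency from JSON-lines entries."""
    counts = Counter()
    latencies = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        entry = json.loads(line)
        counts[entry["event"]] += 1
        if entry["event"] == "query":
            latencies.append(entry["latency_ms"])
    avg = round(sum(latencies) / len(latencies), 2) if latencies else None
    return {"events": dict(counts), "avg_query_latency_ms": avg}
```

To run it against the real file, pass an open file handle, for example with open("data/audit.log") as f: print(summarize_audit(f)).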

Wrapping Up

You just built a local RAG pipeline that runs entirely on factory-floor hardware, serves queries during network outages, and returns cited answers from decades of maintenance records.

AI in manufacturing can operate without a cloud connection. VectorAI DB enables this by running entirely within the IEC 62443 security boundary, without relying on the cloud. Cut the internet connection entirely, and the pipeline keeps working.

Your pipeline ingests PDF maintenance documents, stores embeddings in VectorAI DB at Level 2 of your OT network, and answers natural-language questions using a local LLM with no cloud dependency at any step. From here, you can extend the pipeline by adding more document types, tuning the embedding model for your specific equipment vocabulary, adding role-based query filtering by technician, or scaling ingestion across multiple equipment lines.

Find the full VectorAI DB documentation and the GitHub repository to explore further.

Join the community and learn more about Actian.

5 Edge AI Architecture Patterns for Disconnected Environments

Praise James — Mon, 18 May 2026 11:05:16 +0000

A haul truck operating 200 miles from the nearest cellular tower does not pause when connectivity drops. An offshore wind turbine does not suspend fault detection because a satellite link fails in a storm. In these environments, inference, control loops, and safety systems must continue operating regardless of network status. Yet the dominant edge AI architecture still revolves around connectivity and cloud AI.

Disconnected environments demand edge-native, offline-first architectures designed for operational autonomy. Market signals reinforce this reality.

ABI Research projects edge server spending to reach $19B by 2027, with on-premises deployments accounting for nearly $10.5B. In 2025, organizations deployed approximately 815 million edge-enabled IoT devices globally.

Most operational environments are inherently distributed, generating data far from centralized cloud systems. Edge deployment strategies that depend on sending that data back and forth for processing cause IoT systems to miss critical insights, increase latency, and introduce data loss. Yet proposed edge architectures still treat offline readiness as an add-on rather than the default.

We present five edge AI deployment patterns that operate without assumed connectivity, covering their implementation tactics, real-world scenarios, trade-offs, and a decision framework for selecting the right pattern for your operational priorities.

TL;DR

Suitable use cases for each documented deployment pattern at a glance.

Pattern	Best for
The drone (self-contained single-node edge AI)	Autonomous mobile systems with strict energy budgets and zero cloud connection
The factory (multi-node edge AI with optional cloud)	Facilities with local infrastructure in intermittent environments
Hierarchical federated learning (client-edge-cloud)	Privacy-sensitive distributed operations where data leakage risks are unacceptable
Store-and-forward disconnected inference	Operations with scheduled connectivity windows
The network (distributed edge-to-edge fabric)	Distributed coordination without cloud dependency

Why Disconnected Environments are an Edge AI Problem

There is a structural blind spot for disconnected environments, driven by the assumption that industries using edge AI models are cloud-centric and operate under persistent connectivity. Where edge AI applications matter most, constant network access does not exist.

What disconnected actually means

Disconnected environments are settings with unreliable or nonexistent connectivity, ranging from airgapped scenarios with complete network isolation to intermittent setups with frequent connectivity degradation.

In these operational settings, edge AI capabilities truly shine because they support the real-time data processing, low latency, bandwidth optimization, and data governance that disconnected environments require.

Precedence Research estimates the global edge AI market will reach $143B by 2034, a potential 472% increase from $25B in 2025. For a significant portion of this market, constant cloud connectivity is not feasible. Yet inference, local data storage, and real-time decision-making must continue regardless of network status or location.

Disconnection is where edge AI earns its value

Disconnected environments such as mining sites, manufacturing plants, military operations, offshore wind farms, and smart cities expose the limitations of current edge AI deployment solutions.

Rio Tinto operates on mining sites up to 930 miles from cellular coverage, where operators cannot rely on a centralized infrastructure. They need autonomous inspection robots that use edge AI to track personnel and vehicles, interpreting data from 3D LiDAR, thermal imaging, and gas sensors in real-time.

At least 300 autonomous haul trucks operate in Rio Tinto’s Pilbara region. Each truck processes roughly 5TB of data daily through subterranean tunnels with limited connectivity, requiring private LTE networks for on-device IoT processing.

Offshore wind farms face a similar constraint. Turbines and inspection vessels go offline when satellite connections fail due to harsh weather or line-of-sight blockage, and each turbine averages approximately 8.3 failures per year. These farms need edge AI systems that detect issues early, monitor real-time maritime traffic, analyze local SCADA data, and trigger inspections based on immediate wind conditions.

In remote manufacturing environments, plant managers also need edge AI to automate quality inspections, predict machine failures, and protect workforce health.

A similar demand for local, secure processing drives military operations, where systems operate within airgapped networks in denied, disrupted, intermittent, and limited (DDIL) environments to maintain data confidentiality and integrity. Soldiers must communicate with command units and analyze real-time warfare data without relying on cloud data centers or large computing resources.

These are the environments where edge AI deployment delivers the most impact. According to Dell, enterprise data processing will shift to distributed data centers in 2026, but most documented architectures still emphasize transmitting data back to cloud data centers.

Constrained hardware shapes model deployment

The demands of AI compute and workload scaling at the edge also fuel the cloud-edge deployment recommendations.

A deep learning model with 3B parameters can require up to 4GB of RAM, but edge devices like microcontrollers and IoT sensors typically have less than 1GB for OS, workloads, and storage combined. Connected environment architectures assume large compute availability that doesn’t exist at the edge.

Edge AI architectures must start with offline-first assumptions and hardware ceilings from day one. Retrofitting offline capability into cloud systems will not compensate for connectivity gaps and limited hardware resources. Below, we detail five architectural patterns tailored for disconnected environments.

Pattern 1: The Drone (Self-Contained Single-Node Edge AI)

In environments where connectivity is unavailable, and operational latency cannot tolerate network round-trips, the deployment boundary collapses to a single device. Inference cannot be delegated, synchronized, or deferred. Edge devices like drones, underwater vehicles, and remote inspection robots must make decisions using only locally available compute, memory, and sensor input.

This constraint defines the drone architecture. All AI logic runs on a single device, without external orchestration or cloud offloading.

When the device is the entire stack

Mobile systems that must function autonomously in disconnected environments benefit most from this pattern.

With no external orchestration layer, data capturing, preprocessing, inference, storage, and control logic operate within a self-contained package. This package runs on a single node without networking with other nodes or distributing model training.

Onboard decision logic means edge devices can execute predefined operations even when disconnected. Once a device captures data, it filters out redundant information, retaining only relevant data for eventual manual retrieval.

Autonomous drones that perform object detection and terrain classification in mining zones cannot pause execution while awaiting external inference. The drone architecture removes network dependency by focusing on on-device inference.

This makes it the most viable pattern for DDIL environments where connectivity is actively denied or degraded. Defense drones cannot assume that the network will recover or that a command signal will arrive at all. Every battlefield coordination must be executable from the device alone.

GE Aerospace, which runs 45,000+ commercial aircraft engines and captures over 480,000 data snapshots daily per aircraft, implements this architecture at scale. Onboard AI models handle predictive maintenance in strict accordance with DO-178C, which requires GE Aerospace to verify every airborne system against all possible failure conditions before it ever leaves the ground. This quality assurance aligns with the drone’s architectural requirement of no external support after model deployment.

Single-node local processing requires machine learning models with small footprints.

Optimizing intelligence for the edge

Edge devices operate within strict memory and power ceilings measured in megabytes and milliwatts. When full-precision networks exceed available RAM or energy budgets, model capacity must be optimized before inference becomes feasible.

Not every edge workload needs a neural network. In constrained environments like offshore wind farms, classical statistical methods, such as Welford’s algorithm and linear regression, often outperform neural networks on streaming data processing.

A microcontroller computing sensor data with Welford’s algorithm updates statistics sequentially, without retaining past data points, which keeps memory and power consumption low. Before pushing a neural network to its hardware limit, consider whether the model class itself is suitable for the use case.
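As a minimal sketch of the sequential update described above, the Python class below keeps a running mean and variance with Welford’s algorithm. The class name `RunningStats` and the sensor readings are hypothetical, and a real microcontroller would more likely run the same arithmetic in C.

```python
import math


class RunningStats:
    """Streaming mean and variance via Welford's algorithm (constant memory)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Fold one new reading into the statistics; the reading is not stored
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; reported as 0.0 until two readings have arrived
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

    @property
    def stddev(self) -> float:
        return math.sqrt(self.variance)


stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical sensor values
    stats.update(reading)

print(stats.mean, stats.variance)
```

The state is three numbers regardless of how many readings arrive, which is what keeps the memory footprint flat on a constrained device.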

When neural networks are the right fit for the workload, quantization addresses their hardware limitations by reducing the numerical precision of their weights, biases, and activations. Downsizing from 32-bit to 8-bit shrinks model size by approximately 75% with less than 1% accuracy loss.
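The size reduction follows from storage arithmetic: a 32-bit weight occupies four bytes and an 8-bit weight occupies one, so the weights shrink by 1 - 1/4 = 75%. The sketch below illustrates one common scheme, affine (asymmetric) quantization, in plain Python. The function names and weight values are hypothetical, and production converters add per-channel scales and calibration that this sketch omits.

```python
def quantize_int8(values):
    """Affine quantization of floats to int8. Returns (ints, scale, zero_point)."""
    lo = min(min(values), 0.0)  # the representable range must include 0.0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if every value is zero
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate floats."""
    return [(qi - zero_point) * scale for qi in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # hypothetical float32 weights
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

size_reduction = 1 - 1 / 4  # four bytes per weight down to one
print(q, restored, size_reduction)
```

Each restored weight lands within one quantization step (`scale`) of the original, which is why accuracy loss tends to stay small when the value range is well calibrated.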

Another model compression technique, pruning, eliminates redundant parameters that contribute minimally to output accuracy. Pruning an object detection model like YOLOv5 can reduce its parameter count and computational cost by 40% before deployment.
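One simple variant is unstructured magnitude pruning, which zeroes the weights with the smallest absolute values. The sketch below assumes a flat list of hypothetical weights and a hypothetical `magnitude_prune` helper. It is not necessarily the method behind the YOLOv5 figure above, and real deployments usually prune whole channels or filters and fine-tune afterward.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        # Remove at most k weights, even when several share the threshold value
        if abs(w) <= threshold and removed < k:
            pruned.append(0.0)
            removed += 1
        else:
            pruned.append(w)
    return pruned


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.03, 0.9, 0.1]  # hypothetical
pruned = magnitude_prune(weights, sparsity=0.4)
print(pruned)
```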

TinyML frameworks such as TensorFlow Lite for Microcontrollers, ONNX Runtime, and PyTorch Mobile support compact model deployment. The following code shows an example quantization scenario with TensorFlow Lite.

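import numpy as np
import tensorflow as tf

# Post-training full-integer quantization using the TFLite converter.
# Converts 32-bit float weights and activations to 8-bit integers.
# Assumes `saved_model_dir` (path to a SavedModel) and `X_train`
# (a NumPy array of training inputs) are already defined.

def representative_dataset():
    # Calibration samples let the converter estimate activation ranges
    for i in range(100):
        yield [X_train[i:i+1].astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset

# Restrict the model to int8 ops and use int8 for inputs and outputs
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_quant_model = converter.convert()

# Save the quantized model for deployment (hypothetical filename)
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_quant_model)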

Start with quantization for higher speedup rates without significant accuracy loss, followed by pruning to compress the model’s size further. For the drone architecture, the target size on a single microcontroller is <1MB. Plumerai’s person detection model demonstrates how compression techniques can achieve this goal. The model achieved 737KB on an ARM Cortex-M7 microcontroller with less than 256KB of on-chip RAM using binarized neural networks.

At the hardware level, energy-efficient processors such as the NVIDIA Jetson Nano, Google Edge TPU, and ARM Cortex-M execute AI models directly on edge devices and are built for computer vision and sensor fusion workloads. ARM Cortex-M variants deliver up to 600 giga-operations per second (GOPS) with an energy efficiency averaging 3 tera-operations per second per watt (TOPS/W), depending on configuration.

Drone deployment introduces architectural rigidity. With limited runtime intervention, the architecture must anticipate every failure state during design. DO-178C reinforces this constraint by requiring full system validation before deployment. Teams must engineer every model update and behavioral correction with no orchestration window.

Pattern 2: The Factory (Multi-Node Edge AI With Optional Cloud)

During network outages in manufacturing and large retail facilities, inference must continue in-house across multiple machines. The factory architecture meets this requirement by distributing AI workloads across on-premises edge clusters, keeping operational control within the facility boundary.

Cloud synchronization remains optional, used only for model retraining or batch analytics rather than as a runtime dependency. The priority is maintaining resilience and operational independence across all nodes, regardless of network availability.

Inference stays on the factory floor

The factory architecture centers on three components: edge gateways, compute nodes, and local storage.

An edge gateway routes sensor requests to edge nodes, which pull context from local edge databases like Actian Zen, act on model inference, and write the results back to the database. Decision-making and local computing stay on-premises. Cloud systems handle model updates only periodically or on trigger.

Industrial environments generate continuous, high-volume telemetry data from sensors, controllers, and inspection systems. Distributing inference across multiple edge nodes maintains high inference throughput. But without a local orchestration layer managing workload distribution and the model lifecycle, edge nodes operate as isolated processors rather than a coordinated system.

K3s, AWS IoT Greengrass, Azure IoT Edge, and Siemens Industrial Edge are popular orchestration tools for managing edge clusters. They differ in how they handle model deployment and node management.

K3s deploys containerized models onto clusters of worker nodes, with a control plane for health visibility. Configuring its datastore endpoint parameter lets teams replace the default embedded SQLite datastore, which holds cluster state, with an external database such as PostgreSQL. Application data can stay in an on-premises database like Actian Zen. Chick-fil-A uses K3s at the edge to process point-of-sale transactions across 3,000+ restaurants.

AWS IoT Greengrass deploys cloud-compiled AI models as components with predefined inference functions to NVIDIA Jetson TX2, Intel Atom boards, and Raspberry Pi-powered devices. Inference remains on-premises, with data exported optionally to AWS IoT Core for model optimization. Pfizer manufacturing sites use AWS IoT Greengrass for near-real-time bioreactor monitoring to minimize contamination risk.

Siemens Industrial Edge deploys Docker-containerized models directly on the shop floor, delivering real-time machine status. Siemens Electronics Factory Erlangen reduced model deployment time by 80% and false anomaly detection on printed circuit boards (PCBs) by 50% using this orchestrator. By running inference on PCB images locally and outsourcing only model retraining to the cloud, the factory has cut data storage costs by 90%.

Azure IoT Edge uses a JSON deployment manifest to specify which containerized models to download to edge devices. Data processing happens at the edge with Azure IoT Hub providing centralized oversight while the devices maintain autonomy. Thomas Concrete Group uses Azure IoT Edge to collect data from sensors embedded in wet concrete, estimate the concrete’s hardening timeline, and send predictions to Azure IoT Hub.

The table below highlights the differences between each orchestrator.

Criteria	K3s	Azure IoT Edge	AWS IoT Greengrass	Siemens Industrial Edge
Node management	Manages nodes via a lightweight control plane	Manages nodes remotely through Azure IoT Hub	Manages nodes via AWS IoT Core	Manages nodes via the Siemens Industrial Edge Management platform
Model deployment	Deploys models as Kubernetes pods using standard container images	Configures deployments via a JSON manifest that defines which modules, containing the trained models, run on which nodes	Deploys models as components with predefined inference functions	Deploys models directly on shop floors as Docker containers
Cloud integration	Can be integrated with a central infrastructure	Supported via Azure IoT Hub	Integrates with AWS IoT Core	Supports integration with AWS services

When the OT network is the security boundary

Industrial companies converge their IT and operational technology (OT) networks to support on-premises AI and IoT integrations. But this convergence expands their attack surface: 75% of OT attacks originate in IT environments, and 80% of manufacturers report increasing security threats across their IT/OT networks.

For teams considering factory deployment for industrial systems, network segmentation must become a top priority. Edge AI solutions should operate solely within the OT network in compliance with the Purdue model. Sensitive data and inference stay close to the machines, sensors, and Programmable Logic Controllers (PLCs) that need them. This security boundary minimizes lateral movement of threats from the IT network.

Pattern 3: Hierarchical Federated Learning (Client-Edge-Cloud)

Hierarchical federated learning (HFL) builds on a three-layer infrastructure for teams navigating data mobility restrictions at the edge.

At the lowest layer, client devices perform local training, optimizing model parameters through local gradient descent. Edge servers at the intermediate layer aggregate updated model weights from all client devices for statistical coherence. A final aggregation round by a cloud server marks the top layer, producing a global model that the edge servers distribute back to the client devices. Since only parameter updates traverse this hierarchy, intermittent connectivity does not halt training progress.
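The three-layer loop can be sketched in plain Python. This is a toy illustration under stated assumptions: each client fits a one-parameter model y = w·x with a single gradient-descent step per round, and both the edge servers and the cloud use sample-weighted averaging in the style of FedAvg. The helper names and datasets are hypothetical.

```python
def weighted_average(updates):
    """Average weight vectors, weighting each by its sample count.

    updates: list of (weights, num_samples) tuples.
    Returns (averaged_weights, total_samples).
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    avg = [sum(w[i] * n for w, n in updates) / total for i in range(dim)]
    return avg, total


def local_train(weights, data, lr=0.1):
    """Client layer: one gradient-descent step on mean squared error for y = w*x."""
    w = weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return [w - lr * grad], len(data)


# Two edge servers with two clients each; every private dataset follows y = 3x
edge_groups = [
    [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5)]],
    [[(1.5, 4.5)], [(1.0, 3.0), (3.0, 9.0), (2.0, 6.0)]],
]

global_weights = [0.0]
for _ in range(50):
    edge_models = []
    for clients in edge_groups:
        client_updates = [local_train(global_weights, data) for data in clients]
        edge_models.append(weighted_average(client_updates))  # edge-layer aggregation
    global_weights, _ = weighted_average(edge_models)  # cloud-layer aggregation

print(global_weights)
```

Only weight vectors and sample counts move between layers. The `(x, y)` pairs never leave the client lists.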

The image below captures this iteration, which continues until the global model reaches the desired accuracy or converges.

Domains such as healthcare and financial services, where raw data is bound to its origin by privacy constraints, regulatory requirements, and bandwidth limitations, are ideal HFL use cases. Data sovereignty mandates and geopolitical tensions add another layer to this constraint, restricting where and how data flows at the infrastructure level.

A study by BARC found that 19% of companies plan to increase their on-premises investments, driven by this need for data sovereignty. HFL allows a shared model to improve across distributed nodes without the underlying data ever crossing a jurisdictional boundary.

A recent HFL experiment in healthcare achieved 94.23% accuracy on the Modified National Institute of Standards and Technology (MNIST) dataset while keeping data on client devices. Only relevant aggregated information ever reaches the cloud to preserve privacy and curtail data leakage risks.

In a healthcare deployment, wearable devices (lowest layer) train on their own readings and transmit only model updates to a hospital’s local edge server (intermediate layer). The edge server aggregates updates from multiple wearables and sends the result to a regional research institution (top layer) for final aggregation without exposing patient data.

HFL is the most complex pattern to implement. Tooling support remains fragmented, and unlike other patterns discussed, it currently lacks native support within the Actian ecosystem. Teams should weigh this implementation overhead before committing to this architecture.

The HFL architecture has three variants depending on which layer orchestrates data decisions.

1. Cloud-orchestrated hierarchical federated learning

The central cloud server coordinates the training process, client-edge communications, synchronization schedules, and the overall topology, with no additional aggregation rounds from the edge servers.

Cloud-orchestrated HFL fits financial institutions, where occasional reliable connectivity can sustain the coordination loop. In a fraud detection deployment, multiple banking institutions might train models using transaction data, sending updates to the cloud, which aggregates, validates, and redistributes the improved model back to the banks.

2. Edge-orchestrated hierarchical federated learning

Edge servers autonomously manage local client assignments, aggregating client updates to produce a locally improved model without cloud round-trips. Cloud systems step in only at intervals for bulk model retraining. Environments like offshore wind farms, where unstable connectivity is the baseline, benefit most from this variant. Turbines send model updates to a local edge server, which handles aggregation and independent model improvement.

3. Peer-to-peer aggregation

This variant focuses on a gossip-like model with no central orchestrator. Clients exchange their model weights with other nodes, reducing gradient conflicts under heterogeneous data.

Where the core HFL pattern reduces cloud ingress fees through aggregated updates, peer-to-peer aggregation keeps both training and aggregation within participating nodes. In distributed environments like smart cities, traffic sensors exchange anomaly-detection updates directly with neighboring devices until they converge on an improved model across the network organically.
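A toy simulation shows how neighbor-to-neighbor exchange converges without an orchestrator. The sketch below assumes pairwise gossip averaging on a hypothetical ring of five sensors that each hold a single scalar weight. Real deployments exchange full weight vectors and must handle message loss.

```python
import random


def gossip_average(values, neighbors, rounds, seed=0):
    """Each step, a random node and one of its neighbors replace their values with the pair's mean."""
    rng = random.Random(seed)
    vals = list(values)
    for _ in range(rounds):
        i = rng.randrange(len(vals))
        j = rng.choice(neighbors[i])
        vals[i] = vals[j] = (vals[i] + vals[j]) / 2
    return vals


weights = [1.0, 2.0, 3.0, 4.0, 10.0]  # hypothetical per-sensor model weight
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}  # each node talks to two neighbors
result = gossip_average(weights, ring, rounds=500)
print(result)
```

Each pairwise exchange preserves the sum of the two values, so the network settles on the global mean (4.0 here) even though no node ever sees all five weights.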

All three variants differ in their functional requirements, highlighted in the table below.

Feature	Cloud-orchestrated	Edge-orchestrated	Peer-to-peer aggregation
Orchestration model	Cloud coordinates all aggregation and model distribution	Edge server aggregates locally, syncs with cloud periodically	No orchestrator; updates propagate between clients until convergence
Privacy level	Medium; the cloud controls model updates	High; raw data remains on local edge servers	High; no central point oversees aggregated updates
Bandwidth requirements	High; all updates are sent to the cloud	Medium; only aggregated updates reach cloud	Low; updates only travel between neighboring peers
Disconnection tolerance	Low; cloud disconnection breaks coordination	High; edge server operates independently during outages	Medium; network partitions slow convergence

HFL’s layered infrastructure supports large-scale model training by distributing computation and communication across multiple nodes in the hierarchy. The challenge with this multi-tier design lies in navigating communication overhead, stale global models, and node reconfigurations.

In HFL, communication cost is directly proportional to the model update size. Gradient compression techniques such as random sparsification and stochastic rounding shrink update payloads by up to 98% before transmission.
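As a rough sketch of random sparsification, the code below keeps each coordinate of an update with probability `keep_prob` and rescales the survivors so the sparse update stays unbiased in expectation. The function names, the 2% keep rate, and the gradient values are assumptions for illustration. Sending indices alongside values also means the real payload saving is somewhat smaller than the fraction of coordinates dropped.

```python
import random


def random_sparsify(update, keep_prob, rng):
    """Return (index, value) pairs for the coordinates that survive sparsification."""
    return [
        (i, v / keep_prob)  # rescale so the expected value matches the dense update
        for i, v in enumerate(update)
        if rng.random() < keep_prob
    ]


def densify(sparse, dim):
    """Receiver side: rebuild a dense vector, filling dropped coordinates with zeros."""
    dense = [0.0] * dim
    for i, v in sparse:
        dense[i] = v
    return dense


rng = random.Random(42)
update = [0.01 * (i % 7 - 3) for i in range(1000)]  # hypothetical gradient values
sparse = random_sparsify(update, keep_prob=0.02, rng=rng)
dense = densify(sparse, len(update))

payload_fraction = len(sparse) / len(update)
print(payload_fraction)
```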

The asynchronous update cycle of HFL, where the global model incorporates client updates as they arrive, also amplifies the likelihood of stale model parameters. Weighted aggregation limits the influence of stale updates, preventing slower devices from degrading the global model.
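One way to express this weighting is a decay function over an update’s age in rounds. The sketch below assumes a polynomial decay with a hypothetical tuning parameter `alpha`. Other decay shapes, or a hard cutoff, are equally plausible choices.

```python
def staleness_weight(staleness, alpha=0.5):
    """Polynomial decay: a fresh update gets weight 1.0, older updates get less."""
    return (1 + staleness) ** -alpha


def aggregate(updates, current_round, alpha=0.5):
    """Weighted average of client updates.

    updates: list of (weights, round_the_update_was_based_on) tuples.
    """
    factors = [staleness_weight(current_round - base, alpha) for _, base in updates]
    total = sum(factors)
    dim = len(updates[0][0])
    return [
        sum(f * w[i] for f, (w, _) in zip(factors, updates)) / total
        for i in range(dim)
    ]


# A fresh update and one that is nine rounds stale (hypothetical values)
updates = [([1.0, 1.0], 10), ([5.0, 5.0], 1)]
merged = aggregate(updates, current_round=10)
print(merged)
```

A plain average of the two updates would give 3.0 per coordinate. The decay pulls the result toward the fresh update instead.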

Topology shifts add another challenge. Clients get reassigned to different edge servers, roles shift between client and aggregator nodes, and new devices join mid-training. Each reconfiguration stalls convergence and degrades accuracy if new edge servers lack prior training history.

Pattern 4: Store-and-Forward Disconnected Inference

In disconnected environments, outages can stretch for hours or days. Store-and-forward architecture accounts for this reality, sustaining large-scale data processing and storage during downtime, and forwarding summaries to the cloud once the system reconnects.

For industrial automation environments, such as remote oil and gas operations and maritime vessels operating miles from cellular towers, this architecture solves the core problem of maintaining data continuity despite network disruption.

Inference doesn’t wait for the cloud

Store-and-forward deployment follows a hybrid approach. Training begins in the cloud, but execution shifts to the edge after model deployment. When connectivity drops, decision-making, control loops, and alarm triggers continue locally without interruption, and the system buffers timestamped results to a local edge database until synchronization resumes.

Upon network restoration, the edge gateway offloads all buffered events to a central cloud infrastructure, providing the data required to push updated models and optimize AI pipelines.
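The buffer-then-offload cycle can be sketched with Python’s built-in `sqlite3` module standing in for the local edge database. The `EdgeBuffer` class, the table layout, and the sensor payloads are hypothetical, and the `send` callback is a placeholder for whatever uplink the gateway actually uses.

```python
import json
import sqlite3
import time


class EdgeBuffer:
    """Local queue: inference results wait here until the uplink returns."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" for storage that survives restarts
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, ts REAL, payload TEXT)"
        )

    def store(self, result: dict) -> None:
        """Buffer one timestamped result."""
        self.db.execute(
            "INSERT INTO outbox (ts, payload) VALUES (?, ?)",
            (time.time(), json.dumps(result)),
        )
        self.db.commit()

    def forward(self, send) -> int:
        """Send buffered rows oldest-first, deleting each only after send() succeeds."""
        sent = 0
        rows = self.db.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
        for row_id, payload in rows:
            if not send(json.loads(payload)):
                break  # uplink dropped again; keep the remaining rows
            self.db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            self.db.commit()
            sent += 1
        return sent

    def pending(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]


buffer = EdgeBuffer()
for temp in (71.2, 93.8, 70.9):  # hypothetical readings taken while offline
    buffer.store({"sensor": "pump-1", "temp_c": temp, "alarm": temp > 90})

received = []
sent = buffer.forward(lambda msg: received.append(msg) or True)  # stand-in uplink
print(sent, buffer.pending())
```

Deleting a row only after a successful send gives at-least-once delivery, so the receiving side should be prepared to deduplicate.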

Store-and-forward architecture creates a feedback loop that prevents data loss during disconnection. In manufacturing plants, SCADA systems continue collecting data from PLCs, Remote Terminal Units (RTUs), and edge gateways until connection resumes.

When the data finally moves

The “forward” part of this architecture relies on lightweight communication protocols like Message Queuing Telemetry Transport (MQTT), designed for unstable networks and bandwidth-limited environments.

MQTT’s publish-subscribe model routes queued updates from edge gateways to the cloud through brokers like Mosquitto. Publishers (sensors) send messages to a topic (temperature), and subscribers (cloud servers) receive messages from their registered topics. Messages replay in the exact chronological order they were received.

The Python code snippet below illustrates a starting-point implementation using the Paho MQTT library. It publishes with Quality of Service (QoS) 1, which guarantees at-least-once delivery. Combined with a subscriber that holds a persistent session, QoS 1 enables Mosquitto to queue messages while the subscriber is offline.

# pip install paho-mqtt

import paho.mqtt.publish as publish
import sys

if len(sys.argv) < 3:
    print("Usage: publisher.py <topic> <message>")
    sys.exit(1)

# Production code will add retry logic, local queue persistence, and message deduplication

topic = sys.argv[1]
message = sys.argv[2]

publish.single(topic, message, hostname="localhost", qos=1)

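To receive the queued data after reconnection, the script below creates a persistent session using clean_session=False and a fixed client ID. The loop_forever() call keeps the network loop running and reconnects automatically when the link returns.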

import paho.mqtt.client as mqtt
import sys

if len(sys.argv) < 2:
    print("Usage: subscriber.py <topic>")
    sys.exit(1)

topic = sys.argv[1]
client_id = "test-client"

def on_connect(client, userdata, flags, rc):
    print(f"Connected with result code {rc}")
    client.subscribe(topic, qos=1)

def on_message(client, userdata, msg):
    print(f"{msg.topic}: {msg.payload.decode()}")

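# Written against the paho-mqtt 1.x callback API. On paho-mqtt 2.x,
# pass mqtt.CallbackAPIVersion.VERSION1 as the first argument.
client = mqtt.Client(client_id=client_id, clean_session=False)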
client.on_connect = on_connect
client.on_message = on_message

client.connect("localhost", 1883, 60)
client.loop_forever()

Store-and-forward architecture can introduce data replication inconsistencies during gateway synchronization. The system requires an arbitration policy, such as last-write-wins, which applies changes based on each update’s timestamp. When timestamps are identical, data structures like Conflict-free Replicated Data Types (CRDTs) merge copies to achieve a consistent final state across all edge gateways.

Delta sync further improves CRDTs’ results. Where full dataset replication triggers on every record change, delta sync resolves conflicts at the property level, addressing only the modified fields.
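The sketch below combines both ideas under simplifying assumptions: every field carries its own `(value, timestamp, node_id)` triple, the newer timestamp wins, and ties fall back to the higher node ID so that all gateways pick the same winner. This behaves like a per-field last-write-wins register, one of the simplest CRDTs. The record layout and function names are hypothetical.

```python
def merge_records(local, remote):
    """Field-level last-write-wins merge of two replicas of one record."""
    merged = dict(local)
    for field, candidate in remote.items():
        current = merged.get(field)
        # Compare (timestamp, node_id) so equal timestamps resolve deterministically
        if current is None or (candidate[1], candidate[2]) > (current[1], current[2]):
            merged[field] = candidate
    return merged


def delta_since(record, last_sync_ts):
    """Delta sync: only fields modified after the last sync need to be transmitted."""
    return {f: v for f, v in record.items() if v[1] > last_sync_ts}


# Each field maps to (value, timestamp, node_id)
gateway_a = {"setpoint": (72, 100, "a"), "mode": ("auto", 105, "a")}
gateway_b = {"setpoint": (75, 110, "b"), "mode": ("manual", 105, "b")}

delta = delta_since(gateway_b, last_sync_ts=105)
merged_ab = merge_records(gateway_a, gateway_b)
merged_ba = merge_records(gateway_b, gateway_a)
print(delta, merged_ab)
```

Merging in either direction produces the same record, which is the property that lets gateways synchronize in any order and still converge.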

Pattern 5: The Network (Distributed Edge-to-Edge Fabric)

The network deployment pattern addresses the lack of fault tolerance and distributed processing prevalent in disconnected multi-site operations such as logistics networks and smart grids.

Coordinating edge devices across multiple locations through a cloud system quickly breaks outside network coverage. This is why the network architecture follows an east-west communication pattern, enabling edge nodes to exchange data directly with peers without central coordination.

Mesh communication handles distributed intelligence

The network deployment pattern adopts a non-hierarchical design, connecting multiple IoT devices through a mesh network to improve system uptime during outages. Each node dynamically communicates with its neighbors, forming a bidirectional network that relays data to remote environments via multi-hop paths.

The cloud only joins as a peer for optional sync, but core computing remains on the network, working without centralized control.

Smart grids, where teleprotection demands 10–20ms latency, are well suited to this architecture. A network of transmission substations continuously tracks electricity flow and consumption patterns in real-time to detect imbalances before they escalate. That real-time visibility supports dynamic load redistribution and autonomous microgrid management.

Military uncrewed aerial vehicles (UAVs) are another use case. When GPS fails in DDIL environments, UAVs relay intelligence, surveillance, and reconnaissance (ISR) data between each other through mesh networks. Adaptive interference routing ensures reliable data flow, while line-of-sight transmission reduces latency.

This deployment pattern optimizes for network redundancy. Gossip protocol and distributed consensus algorithms like Raft eliminate single points of failure. When a node loses connection, the network remains operational, rerouting its data through other nodes.

Gossip protocol enables live peer discovery through continuous, lightweight information exchanges. Each node always has a current view of its local network. Raft follows a leader-based approach where an elected leader node handles all writes, and log replication ensures follower nodes maintain a shared state. Edge databases replicate data across multiple nodes to improve consistency.

Treating Gossip and Raft as competing options overlooks what actually matters. The focus should be on understanding where each sits in the CAP theorem and the trade-offs they introduce to a distributed network.

The consistency vs. availability trade-off

When network partitions split the mesh, Raft ensures strong data consistency, while Gossip provides availability fallback and eventual consistency when paired with approaches like CRDTs.

In edge computing, where connection is limited and nodes are numerous, partition tolerance is non-negotiable. Edge AI systems must choose whether to prioritize consistency or availability when implementing the network architecture.

Availability is often optimal, as edge nodes continue to function independently after disconnection. Consistency-focused designs like Raft risk write suspensions and stale reads during network partitions.

Feature	Raft	Gossip
Architecture	Leader election and log replication	Peer-to-peer
Latency	Moderate; every write waits for acknowledgment from a quorum of nodes	Low per message, though full propagation takes multiple rounds
Consistency guarantees	Strong consistency	Eventual consistency
Partition tolerance	Moderate; only the side holding a quorum keeps accepting writes	High; nodes keep operating and reconcile after the partition heals
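The write-suspension risk comes down to quorum arithmetic: a Raft leader commits an entry only after a majority of the full cluster has acknowledged it. The short sketch below uses hypothetical helper names to show which side of a partition can keep accepting writes.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a Raft quorum."""
    return cluster_size // 2 + 1


def can_commit(partition_size: int, cluster_size: int) -> bool:
    """A partition can commit writes only if it contains a majority of the cluster."""
    return partition_size >= majority(cluster_size)


cluster_size = 5
for side in (3, 2):  # a five-node mesh split into groups of three and two
    print(side, can_commit(side, cluster_size))
```

In an even split, such as two and two in a four-node cluster, neither side reaches the quorum of three, so writes stop everywhere until the partition heals.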

Speed and data delivery trade-offs are another critical constraint of the network architecture. Mesh networking adds latency with each hop, and hop counts grow as the node count increases. Whether your system needs data back in <50ms or can tolerate >100ms should determine how many hops your design allows.

Choosing the Right Edge AI Deployment Pattern

There’s no specific “right” edge AI deployment pattern for disconnected environments. A solid architecture implementation begins with a clear grasp of the specific constraints, goals, and characteristics of your target application. This means envisioning the full workload lifecycle, including connectivity profile, available compute resources, and latency requirements.

1. Evaluate network stability

Network stability is the primary driver of any edge AI deployment strategy. Determine how much resilience must be engineered into the edge nodes based on the expected duration of disconnection.

If the system is always disconnected: Use drone or network architectures as they are designed to operate completely offline regardless of connectivity status.
If the interruption persists for only minutes or hours: Use factory or HFL architecture to continue data aggregation and inference without interruption. The system remains functional during the outage because all required dependencies already exist within the operational perimeter.
If disconnection lasts for days or weeks: Use the store-and-forward architecture to buffer inference results and operational data locally until the scheduled connectivity window becomes available again.

2. Assess latency requirements

Define the maximum acceptable latency for your specific application by considering network hops, node availability, and geographical proximity of the edge nodes. The thresholds below reflect typical deployment patterns. Validate them against your specific hardware and network conditions.

If the system requires <50ms latency: Use the drone deployment pattern. Its single-node architecture keeps inference directly on sensors, cameras, or gateways, enabling near-real-time responses. Factory architecture also minimizes latency by running on edge servers within the same facility or on the factory floor.
If the system requires <100ms latency: Use the network or HFL architecture to distribute model improvement workloads across multiple nodes.
If <500ms latency is acceptable: Use store-and-forward architecture for non-critical IoT data that requires batch processing or long-term analytics. It batch-offloads data-intensive tasks to the cloud.
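As a rough illustration, the latency tiers above can be encoded as a function that takes a latency budget in milliseconds. The function name is hypothetical. Two choices are assumptions of this sketch rather than guidance from the article: each tier is half-open (exactly 50 ms falls into the under-100 ms tier), and budgets of 500 ms or more return an empty result because this section does not cover them.

```python
from typing import Tuple


def patterns_for_latency(budget_ms: float) -> Tuple[str, ...]:
    """Map a latency budget (ms) to the candidate patterns from this section."""
    if budget_ms <= 0:
        raise ValueError("budget_ms must be positive")
    if budget_ms < 50:
        return ("drone", "factory")
    if budget_ms < 100:
        return ("network", "HFL")
    if budget_ms < 500:
        return ("store-and-forward",)
    return ()  # 500 ms and above is not covered by the thresholds above


for budget in (20, 80, 300):
    print(budget, patterns_for_latency(budget))
```

As the section notes, these thresholds reflect typical deployments, so adjust the cut-offs after measuring your own hardware and network.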

3. Evaluate resource constraints

Edge AI applications differ in processing power, storage, and bandwidth consumption, which impacts inference speed, data aggregation, and real-time analytics. Evaluate each resource limit independently:

Compute power constraint: For compute power <1 GFLOPS, common in microcontrollers used for sensor inference, the drone architecture is most suitable. It runs on constrained IoT devices using lightweight, inference-only models. At 10–100 GFLOPS, common in edge gateways, HFL and network architectures become more effective because they handle data aggregation well at this level. For edge GPU clusters that scale to >10 TFLOPS, factory and store-and-forward architectures support clustered inference pipelines, since they run on-premises.
Bandwidth constraint: Use store-and-forward architecture or HFL to store and process raw, high-volume data at the edge, forwarding only summarized updates to the cloud if required.
Data storage constraint: Use factory or store-and-forward architectures paired with embedded databases to store time-series data locally and scale vertically within the facility. Databases like Actian Zen are optimized for edge AI use cases and can also sync with the cloud once connectivity is restored.
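The compute tiers above can be sketched the same way. The function name is hypothetical, and the input is expressed in GFLOPS (so 10 TFLOPS is 10,000 GFLOPS). The section leaves the ranges between 1 and 10 GFLOPS and between 100 GFLOPS and 10 TFLOPS unspecified, so this sketch returns an empty result there instead of guessing.

```python
from typing import Tuple


def patterns_for_compute(gflops: float) -> Tuple[str, ...]:
    """Map available compute (GFLOPS) to the candidate patterns from this section."""
    if gflops < 0:
        raise ValueError("gflops must be non-negative")
    if gflops < 1:
        return ("drone",)                        # microcontroller class
    if 10 <= gflops <= 100:
        return ("HFL", "network")                # edge gateway class
    if gflops > 10_000:
        return ("factory", "store-and-forward")  # edge GPU cluster class
    return ()  # ranges the section does not map to a pattern


print(patterns_for_compute(0.5))
print(patterns_for_compute(50))
print(patterns_for_compute(20_000))
```

An empty result means the decision should rest on the other constraints, such as bandwidth, storage, latency, and network stability.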

4. Consider a hybrid approach

Industrial systems often combine the strengths of multiple architectures into a coordinated system that delivers resilience and flexibility. Rio Tinto’s mining operations illustrate what hybrid deployment looks like at scale.

At the Greater Nammuldi iron ore mine, more than 50 autonomous trucks operate on predefined routes, using onboard sensors to detect obstacles, an example of the drone architecture. Across 17 sites in Western Australia, these trucks transmit operational data to Rio Tinto’s Operations Centre in Perth, reflecting the network architecture. Finally, an autonomous rail system transports mined ore, synchronizing with the Operations Centre upon reaching port facilities. This fits the store-and-forward architecture.

Rio Tinto demonstrates that deployment patterns are not mutually exclusive. If your use case requires multiple architectures, consider running them on the layer of the system where they’re best suited, rather than forcing a single architecture across the entire operation.

The following table maps specific deployment scenarios to their optimal disconnected edge AI deployment pattern to inform your decision.

Deployment scenarios	Recommended pattern	Rationale
Autonomous inspection drones over oil fields or offshore wind farms	Drone (single-node self-contained)	A self-contained inference runtime with embedded local storage avoids distributed computation and fits within the device's hardware limits
Automotive assembly lines running defect detection models	Factory (multi-node edge AI)	Cloud dependency is too risky for uptime requirements, so edge clusters run within the facility
Hospital networks where patient data cannot leave individual facilities under HIPAA	Hierarchical federated learning	Models train locally and share only weight updates with the cloud, so raw data remains on the local site in line with data sovereignty and privacy requirements
Cargo vessels at sea syncing operational data at port	Store-and-forward	A local buffer ensures no inference result or operational event is lost across connectivity gaps that can last days
Smart city traffic management across distributed intersections with no central server dependency	Network (distributed edge-to-edge fabric)	Nodes communicate peer-to-peer via consensus, so node loss reduces capacity without disrupting overall network operation
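The store-and-forward row relies on a durable local buffer. A minimal sketch of that idea follows, using Python's built-in `sqlite3` module as a stand-in for an embedded edge database. The `StoreAndForwardBuffer` class, the `outbox` table, and the `send` callback are hypothetical names for illustration, not part of any Actian product API.

```python
import json
import sqlite3
from typing import Callable


class StoreAndForwardBuffer:
    """Durable outbox: events stay on disk until an upload is confirmed."""

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path for real durability; ":memory:" is for demos only.
        self.db = sqlite3.connect(path)
        with self.db:
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS outbox ("
                "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
            )

    def store(self, event: dict) -> None:
        """Persist an event locally, whether or not the link is up."""
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),)
            )

    def pending(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

    def forward(self, send: Callable[[dict], bool]) -> int:
        """Send buffered events oldest first and stop at the first failure.

        An event is deleted only after `send` returns True, so a dropped
        link leaves the remaining events in the buffer for the next window.
        """
        sent = 0
        rows = self.db.execute(
            "SELECT id, payload FROM outbox ORDER BY id"
        ).fetchall()
        for row_id, payload in rows:
            if not send(json.loads(payload)):
                break
            with self.db:
                self.db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            sent += 1
        return sent


buf = StoreAndForwardBuffer()
buf.store({"sensor": "engine-temp", "value": 81.5})
buf.store({"sensor": "engine-temp", "value": 82.1})
print(buf.forward(lambda event: False), buf.pending())  # link down
print(buf.forward(lambda event: True), buf.pending())   # link restored
```

Deleting only after a confirmed send gives at-least-once delivery, so the receiving side should be prepared to handle an occasional duplicate.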

The Bottom Line

Industries operating across remote, underground, maritime, and geographically dispersed terrain need edge-native architectures that capture real-time insights and keep critical assets running without cloud dependency.

The deployment patterns discussed prioritize what matters most for disconnected environments: local inference, no centralization latency, lower communication costs, and system autonomy.

Before committing to a pattern, validate three things in your own environment: how long your system can tolerate a network outage before data loss becomes operationally significant, whether your edge hardware can sustain the compute demands of your chosen architecture without degrading inference quality, and whether your team has the tooling maturity to manage the model lifecycle at the edge without cloud dependency. Then map your constraints against the decision framework above.

The right answer might not be a single pattern. Layer in hybrid approaches only when the resilience gains justify the operational complexity.

Each pattern depends on a data infrastructure that can operate, store, and sync entirely at the edge. For teams that need to go beyond structured storage and perform semantic search on their local data without exporting vector embeddings to a cloud server, Actian VectorAI DB is optimized for this use case. Start for free today.

Join the Actian community on Discord to discuss edge AI architecture patterns with engineers deploying in disconnected environments.