<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: NeuronDB Support</title>
    <description>The latest articles on DEV Community by NeuronDB Support (@neurondb_support_d73fa7ba).</description>
    <link>https://dev.to/neurondb_support_d73fa7ba</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3688837%2Fb6baa658-3a8c-47fa-8fcc-30315baa5bc7.png</url>
      <title>DEV Community: NeuronDB Support</title>
      <link>https://dev.to/neurondb_support_d73fa7ba</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/neurondb_support_d73fa7ba"/>
    <language>en</language>
    <item>
      <title>NeuronDB Vector vs pgvector: Technical Comparison</title>
      <dc:creator>NeuronDB Support</dc:creator>
      <pubDate>Sun, 01 Feb 2026 15:35:44 +0000</pubDate>
      <link>https://dev.to/neurondb_support_d73fa7ba/neurondb-vector-vs-pgvector-technical-comparison-4mmh</link>
      <guid>https://dev.to/neurondb_support_d73fa7ba/neurondb-vector-vs-pgvector-technical-comparison-4mmh</guid>
      <description>&lt;p&gt;You store embeddings as vectors. You run a similarity search inside PostgreSQL. Two extensions matter in this space: pgvector and NeuronDB. This post compares both extensions using real behavior from the source trees in this repo. Every limit and object name matches code and SQL.&lt;/p&gt;

&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Vector databases store embeddings. Similarity search ranks rows by distance. PostgreSQL extensions bring vector types, distance operators, and index access methods into the SQL layer.&lt;/p&gt;

&lt;p&gt;You choose pgvector when you want a focused extension with broad adoption. You choose NeuronDB when you want pgvector-style SQL plus additional types, GPU paths, and operational surface area inside the extension.&lt;/p&gt;

&lt;p&gt;This post uses pgvector v0.8.1 semantics and NeuronDB v3.0.0-devel semantics from the local source. Feature parity varies by object. Some pieces match one-to-one. Some pieces use different names with aliases.&lt;/p&gt;

&lt;p&gt;Project references&lt;br&gt;
NeuronDB site: &lt;a href="https://www.neurondb.ai" rel="noopener noreferrer"&gt;https://www.neurondb.ai&lt;/a&gt;&lt;br&gt;
Source code: &lt;a href="https://github.com/neurondb/neurondb" rel="noopener noreferrer"&gt;https://github.com/neurondb/neurondb&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Architecture&lt;/h2&gt;

&lt;p&gt;Architectural choices define performance limits and feature capabilities.&lt;/p&gt;
&lt;h3&gt;pgvector Architecture&lt;/h3&gt;

&lt;p&gt;pgvector implements types, operators, and index access methods in C and SQL. The extension exposes a small surface area and relies on PostgreSQL storage, WAL, and query planning.&lt;/p&gt;
&lt;h4&gt;Type system and layouts&lt;/h4&gt;

&lt;p&gt;pgvector defines these public types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vector&lt;/code&gt;: dense float32 vector, up to 16000 dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;halfvec&lt;/code&gt;: half precision vector, up to 16000 dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sparsevec&lt;/code&gt;: sparse vector with int32 indices and float32 values, limits depend on operation.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bit&lt;/code&gt;: PostgreSQL &lt;code&gt;bit&lt;/code&gt; type, used as a binary vector for Hamming and Jaccard distance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;vector&lt;/code&gt; type uses a varlena header plus two int16 fields: &lt;code&gt;dim&lt;/code&gt; and &lt;code&gt;unused&lt;/code&gt;. The payload stores &lt;code&gt;dim&lt;/code&gt; float32 values. This layout matches &lt;code&gt;typedef struct Vector&lt;/code&gt; in pgvector and yields storage of 4 bytes per dimension plus 8 bytes of header per vector.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;sparsevec&lt;/code&gt; on disk format stores &lt;code&gt;dim&lt;/code&gt; (int32), &lt;code&gt;nnz&lt;/code&gt; (int32), and &lt;code&gt;unused&lt;/code&gt; (int32) in the header, followed by &lt;code&gt;nnz&lt;/code&gt; int32 indices. Values follow indices as a contiguous float32 array.&lt;/p&gt;
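&lt;p&gt;These layouts can be checked from psql with &lt;code&gt;pg_column_size&lt;/code&gt;. A minimal sketch, assuming the extension is installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- vector: 4 bytes per dimension plus an 8 byte header, so 3 * 4 + 8 = 20
SELECT pg_column_size('[1,2,3]'::vector)\g

-- sparsevec: header fields plus one int32 index and one float32 value per nonzero
SELECT pg_column_size('{1:1,3:2}/5'::sparsevec)\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;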
&lt;h4&gt;CPU dispatch&lt;/h4&gt;

&lt;p&gt;pgvector uses a mix of scalar code and CPU dispatch. On Linux x86_64, some functions compile with &lt;code&gt;target_clones&lt;/code&gt; to generate multiple code paths. The code selects a path based on CPU capabilities. This approach appears in &lt;code&gt;vector.c&lt;/code&gt; and &lt;code&gt;bitutils.c&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;NeuronDB Architecture&lt;/h3&gt;

&lt;p&gt;NeuronDB implements vector types, operators, access methods, and additional systems within a single extension. The extension defines types beyond pgvector, adds IVF under the access method name &lt;code&gt;ivf&lt;/code&gt;, and includes GPU backends.&lt;/p&gt;
&lt;h4&gt;Type system and layouts&lt;/h4&gt;

&lt;p&gt;NeuronDB exposes pgvector-style types, plus additional NeuronDB-specific types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vector&lt;/code&gt;: dense float32 vector with &lt;code&gt;dim&lt;/code&gt; int16 and an &lt;code&gt;unused&lt;/code&gt; int16 field, up to 16000 dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vectorp&lt;/code&gt;: packed vector with metadata. The layout includes a CRC32 fingerprint, a version, a dimension, and an endian guard, followed by float32 data.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vecmap&lt;/code&gt;: sparse high-dimensional map. The layout stores &lt;code&gt;total_dim&lt;/code&gt; and &lt;code&gt;nnz&lt;/code&gt;, followed by parallel int32 indices and float32 values.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;halfvec&lt;/code&gt;: half precision vector with a 4000 dimension limit in NeuronDB.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sparsevec&lt;/code&gt;: sparse vector type with 1000 nonzero entries and 1M dimensions in NeuronDB.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;binaryvec&lt;/code&gt;: binary vector type with a Hamming distance operator.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NeuronDB also defines internal structs for quantized vectors, such as int8, int4, and binary-packed representations. The SQL surface exposes conversion and distance functions.&lt;/p&gt;
&lt;h4&gt;Index access methods&lt;/h4&gt;

&lt;p&gt;NeuronDB defines two ANN index access methods as PostgreSQL access methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;hnsw&lt;/code&gt;: HNSW index access method.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ivf&lt;/code&gt;: IVF index access method.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NeuronDB also defines operator classes for &lt;code&gt;vector&lt;/code&gt;, &lt;code&gt;halfvec&lt;/code&gt;, &lt;code&gt;sparsevec&lt;/code&gt;, &lt;code&gt;bit&lt;/code&gt;, and &lt;code&gt;binaryvec&lt;/code&gt; in both &lt;code&gt;hnsw&lt;/code&gt; and &lt;code&gt;ivf&lt;/code&gt;.&lt;/p&gt;
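&lt;p&gt;Because the operator classes mirror pgvector naming, a half precision HNSW index looks the same in both extensions. A sketch; the &lt;code&gt;docs&lt;/code&gt; table is hypothetical, and the &lt;code&gt;halfvec_l2_ops&lt;/code&gt; name follows the pgvector convention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- half precision column indexed with HNSW under L2 distance
CREATE INDEX docs_embedding_hnsw_halfvec
ON docs
USING hnsw (embedding halfvec_l2_ops)\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;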
&lt;h4&gt;CPU SIMD and GPU backends&lt;/h4&gt;

&lt;p&gt;NeuronDB includes explicit AVX2 and AVX-512 implementations of common distance functions in &lt;code&gt;vector_distance_simd.c&lt;/code&gt;. The build selects the compiled path based on compiler flags.&lt;/p&gt;

&lt;p&gt;NeuronDB includes three GPU backend families in the tree:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CUDA&lt;/li&gt;
&lt;li&gt;ROCm&lt;/li&gt;
&lt;li&gt;Metal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The runtime backend selection logic maps each backend type to the names &lt;code&gt;cuda&lt;/code&gt;, &lt;code&gt;rocm&lt;/code&gt;, and &lt;code&gt;metal&lt;/code&gt;. GPU entry points for HNSW and IVF kNN search are exposed as SQL functions.&lt;/p&gt;
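&lt;p&gt;A hedged sketch of the GPU entry points; the signatures match the helper functions listed later in Table 3, and the result shape depends on the build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- HNSW kNN on the GPU: query vector, k, ef_search
SELECT * FROM hnsw_knn_search_gpu('[1,2,3]'::vector, 10, 100)\g

-- IVF kNN on the GPU: query vector, k, nprobe
SELECT * FROM ivf_knn_search_gpu('[1,2,3]'::vector, 10, 8)\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;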
&lt;h2&gt;Feature Comparison&lt;/h2&gt;

&lt;p&gt;Both extensions integrate with PostgreSQL. NeuronDB adds operational features.&lt;/p&gt;
&lt;h3&gt;Table 1: Types, distances, indexes, and hard limits&lt;/h3&gt;

&lt;p&gt;This table focuses on public SQL objects and hard limits enforced by each project.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;pgvector&lt;/th&gt;
&lt;th&gt;NeuronDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Extension name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;vector&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;neurondb&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dense type&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;vector&lt;/code&gt; (float32), max 16000 dims&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;vector&lt;/code&gt; (float32), max 16000 dims&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Half type&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;halfvec&lt;/code&gt; (half), max 16000 dims&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;halfvec&lt;/code&gt; (FP16), max 4000 dims&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sparse type&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sparsevec&lt;/code&gt; (dim int32, nnz int32, indices int32, values float32), max 1e9 dims, max 16000 nnz&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sparsevec&lt;/code&gt;, max 1M dims, max 1000 nonzero entries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary vector&lt;/td&gt;
&lt;td&gt;PostgreSQL &lt;code&gt;bit&lt;/code&gt; plus pgvector operators&lt;/td&gt;
&lt;td&gt;PostgreSQL &lt;code&gt;bit&lt;/code&gt; operator classes plus &lt;code&gt;binaryvec&lt;/code&gt; type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distance operators&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;-&amp;gt;&lt;/code&gt; L2, &lt;code&gt;&amp;lt;#&amp;gt;&lt;/code&gt; negative inner product, &lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt; cosine, &lt;code&gt;&amp;lt;+&amp;gt;&lt;/code&gt; L1, &lt;code&gt;&amp;lt;~&amp;gt;&lt;/code&gt; Hamming, &lt;code&gt;&amp;lt;%&amp;gt;&lt;/code&gt; Jaccard&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;-&amp;gt;&lt;/code&gt; L2, &lt;code&gt;&amp;lt;#&amp;gt;&lt;/code&gt; negative inner product, &lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt; cosine, &lt;code&gt;&amp;lt;+&amp;gt;&lt;/code&gt; L1, &lt;code&gt;&amp;lt;~&amp;gt;&lt;/code&gt; Hamming, Jaccard via &lt;code&gt;vector_jaccard_distance(vector, vector)&lt;/code&gt; and &lt;code&gt;&amp;lt;%&amp;gt;&lt;/code&gt; for &lt;code&gt;bit&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ANN access methods&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hnsw&lt;/code&gt;, &lt;code&gt;ivfflat&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hnsw&lt;/code&gt;, &lt;code&gt;ivf&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dense index max dims&lt;/td&gt;
&lt;td&gt;2000 for HNSW and IVFFlat&lt;/td&gt;
&lt;td&gt;limited by page layout; index builds with large dimensions fail with a page size error&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
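&lt;p&gt;The shared operators in the table above can be exercised directly. This sketch runs unchanged against either extension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- l2 is sqrt(27), neg_inner_product is -32, l1 is 9
SELECT '[1,2,3]'::vector &amp;lt;-&amp;gt; '[4,5,6]'::vector AS l2,
       '[1,2,3]'::vector &amp;lt;#&amp;gt; '[4,5,6]'::vector AS neg_inner_product,
       '[1,2,3]'::vector &amp;lt;=&amp;gt; '[4,5,6]'::vector AS cosine,
       '[1,2,3]'::vector &amp;lt;+&amp;gt; '[4,5,6]'::vector AS l1\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;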
&lt;h3&gt;Table 2: Tuning knobs, defaults, and where each knob lives&lt;/h3&gt;

&lt;p&gt;This table lists knobs. Each knob changes recall, latency, or build time. The table also lists the location of each knob.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Knob&lt;/th&gt;
&lt;th&gt;pgvector&lt;/th&gt;
&lt;th&gt;NeuronDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HNSW &lt;code&gt;m&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;index option &lt;code&gt;WITH (m = N)&lt;/code&gt;, default 16&lt;/td&gt;
&lt;td&gt;index option &lt;code&gt;WITH (m = N)&lt;/code&gt;, default 16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HNSW &lt;code&gt;ef_construction&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;index option &lt;code&gt;WITH (ef_construction = N)&lt;/code&gt;, default 64&lt;/td&gt;
&lt;td&gt;index option &lt;code&gt;WITH (ef_construction = N)&lt;/code&gt;, default 200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HNSW &lt;code&gt;ef_search&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;hnsw.ef_search&lt;/code&gt;, default 40&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;neurondb.hnsw_ef_search&lt;/code&gt;, default 64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HNSW iterative scans&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;hnsw.iterative_scan&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;neurondb.hnsw_iterative_scan&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HNSW scan stop&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;hnsw.max_scan_tuples&lt;/code&gt; and &lt;code&gt;hnsw.scan_mem_multiplier&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;neurondb.hnsw_max_scan_tuples&lt;/code&gt; and &lt;code&gt;neurondb.hnsw_scan_mem_multiplier&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IVF lists&lt;/td&gt;
&lt;td&gt;index option &lt;code&gt;WITH (lists = N)&lt;/code&gt; on &lt;code&gt;ivfflat&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;index option &lt;code&gt;WITH (lists = N)&lt;/code&gt; on &lt;code&gt;ivf&lt;/code&gt;, and NeuronDB maps &lt;code&gt;ivfflat&lt;/code&gt; to &lt;code&gt;ivf&lt;/code&gt; in helper functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IVF probes&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;ivfflat.probes&lt;/code&gt;, default 1&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;neurondb.ivf_probes&lt;/code&gt;, default 10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IVF iterative scans&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;ivfflat.iterative_scan&lt;/code&gt; and &lt;code&gt;ivfflat.max_probes&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;neurondb.ivf_iterative_scan&lt;/code&gt; and &lt;code&gt;neurondb.ivf_max_probes&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
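&lt;p&gt;Every query time knob above is an ordinary GUC, so &lt;code&gt;SET LOCAL&lt;/code&gt; scopes it to one transaction. A sketch with the pgvector names against a hypothetical &lt;code&gt;items&lt;/code&gt; table; swap in the &lt;code&gt;neurondb.*&lt;/code&gt; equivalents for NeuronDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN\g
SET LOCAL hnsw.ef_search = 200\g
SELECT id FROM items ORDER BY embedding &amp;lt;-&amp;gt; '[3,1,2]'::vector LIMIT 10\g
COMMIT\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;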
&lt;h3&gt;Table 3: Acceleration and storage formats&lt;/h3&gt;

&lt;p&gt;This table covers CPU SIMD, GPU backends, and compressed formats.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;pgvector&lt;/th&gt;
&lt;th&gt;NeuronDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU vector dispatch&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;target_clones&lt;/code&gt; dispatch on supported builds&lt;/td&gt;
&lt;td&gt;explicit AVX2 and AVX-512 distance functions, selected by build flags&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU backends&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;CUDA, ROCm, Metal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU kNN helpers&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hnsw_knn_search_gpu(query vector, k int, ef_search int)&lt;/code&gt; and &lt;code&gt;ivf_knn_search_gpu(query vector, k int, nprobe int)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Packed dense format&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;vectorp&lt;/code&gt; with CRC32 fingerprint, version, endian guard, and float32 data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sparse high dim format&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sparsevec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;vecmap&lt;/code&gt; and NeuronDB &lt;code&gt;sparsevec&lt;/code&gt; type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quantized internal types&lt;/td&gt;
&lt;td&gt;binary quantization via &lt;code&gt;binary_quantize&lt;/code&gt; to &lt;code&gt;bit&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;int8, int4, binary, and FP16 quantization in type and function layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
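&lt;p&gt;pgvector's &lt;code&gt;binary_quantize&lt;/code&gt; maps each positive component to 1 and every other component to 0, producing a &lt;code&gt;bit&lt;/code&gt; value that works with the Hamming operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- positive components become 1, so this yields 101
SELECT binary_quantize('[1,-2,3]'::vector)\g

-- Hamming distance between two quantized vectors, here 1
SELECT binary_quantize('[1,-2,3]'::vector) &amp;lt;~&amp;gt; binary_quantize('[1,2,3]'::vector)\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;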
&lt;h2&gt;Production Readiness&lt;/h2&gt;

&lt;p&gt;Production systems need repeatable behavior, clear configuration, and a monitoring path. NeuronDB ships extra primitives for tenant controls, queue-based workflows, and metrics export.&lt;/p&gt;

&lt;p&gt;NeuronDB includes these operational surfaces in SQL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tenant usage tables and quota tracking&lt;/li&gt;
&lt;li&gt;background worker tables and manual triggers&lt;/li&gt;
&lt;li&gt;Prometheus compatible metrics via SQL, plus an HTTP exporter endpoint&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Performance&lt;/h2&gt;

&lt;p&gt;Performance depends on dataset shape, index parameters, storage layout, and query patterns. Use the benchmark scripts in this repo to measure your hardware and build.&lt;/p&gt;
&lt;h2&gt;Benchmarks&lt;/h2&gt;

&lt;p&gt;The repository includes benchmark scripts and SQL stress tests. Use these tools to compare pgvector and NeuronDB on your own system.&lt;/p&gt;
&lt;h3&gt;Vector benchmark suite&lt;/h3&gt;

&lt;p&gt;The vector benchmark suite downloads public ANN datasets, loads them into PostgreSQL, builds indexes, runs queries, and writes JSON results.&lt;/p&gt;

&lt;p&gt;Run the full pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 NeuronDB/benchmark/vector/run_bm.py &lt;span class="nt"&gt;--prepare&lt;/span&gt; &lt;span class="nt"&gt;--load&lt;/span&gt; &lt;span class="nt"&gt;--run&lt;/span&gt; &lt;span class="nt"&gt;--datasets&lt;/span&gt; sift-128-euclidean &lt;span class="nt"&gt;--configs&lt;/span&gt; hnsw &lt;span class="nt"&gt;--k-values&lt;/span&gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run a quick pipeline with defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 NeuronDB/benchmark/vector/run_bm.py &lt;span class="nt"&gt;--prepare&lt;/span&gt; &lt;span class="nt"&gt;--load&lt;/span&gt; &lt;span class="nt"&gt;--run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Stress tests&lt;/h3&gt;

&lt;p&gt;The repo includes SQL stress suites for pgvector and NeuronDB.&lt;/p&gt;

&lt;p&gt;pgvector stress suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="n"&gt;NeuronDB&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;pgvector_stress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NeuronDB stress suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="n"&gt;NeuronDB&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;neurondb_vector_stress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Example result from committed artifact&lt;/h3&gt;

&lt;p&gt;This example comes from &lt;code&gt;NeuronDB/benchmark/vector/results/benchmark_sift-128-euclidean_hnsw_20260104_211033.json&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Dataset: sift-128-euclidean
Train vectors: 1000000
Test queries: 10000
Dimension: 128
Index: hnsw, m 16, ef_construction 200
Query: k 10, ef_search 100
Average latency ms: 512.9799604415894
P95 latency ms: 521.4026927947998
QPS: 1.9493938888746618
Recall: 1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Practical usage&lt;/h2&gt;

&lt;p&gt;This section focuses on repeatable workflows. Each workflow uses real object names from both projects.&lt;/p&gt;

&lt;h3&gt;Basic table and query pattern&lt;/h3&gt;

&lt;p&gt;Use a fixed-dimension column (a typmod such as &lt;code&gt;vector(3)&lt;/code&gt;) when a single embedding model feeds it. Use a column without a typmod when multiple embedding models share one column.&lt;/p&gt;

&lt;p&gt;Example with a fixed dimension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[4,5,6]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'[3,1,2]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;l2_distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;l2_distance&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
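&lt;p&gt;A sketch of the typmod-less variant. Without a declared dimension the column accepts mixed sizes, and an ANN index then needs a fixed dimension, for example through a cast in a partial expression index. The table, model, and dimension here are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE multi_model_items (
    id bigint PRIMARY KEY,
    model text NOT NULL,
    embedding vector  -- no typmod, dimension may vary per row
)\g

-- one partial index per model keeps each index at a fixed dimension
CREATE INDEX multi_model_items_minilm_hnsw
ON multi_model_items
USING hnsw ((embedding::vector(384)) vector_l2_ops)
WHERE model = 'minilm'\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;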



&lt;h3&gt;Indexing with HNSW&lt;/h3&gt;

&lt;p&gt;HNSW uses a graph. You trade recall for speed by changing the candidate list size during search.&lt;/p&gt;

&lt;p&gt;pgvector uses these knobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;index reloptions: &lt;code&gt;m&lt;/code&gt;, &lt;code&gt;ef_construction&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;query time GUC: &lt;code&gt;hnsw.ef_search&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NeuronDB uses these knobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;index reloptions: &lt;code&gt;m&lt;/code&gt;, &lt;code&gt;ef_construction&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;query time GUC: &lt;code&gt;neurondb.hnsw_ef_search&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;HNSW index creation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;items_embedding_hnsw_l2&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_l2_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ef_construction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query time tuning with pgvector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ef_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query time tuning with NeuronDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hnsw_ef_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Indexing with IVF&lt;/h3&gt;

&lt;p&gt;IVF uses lists. Search probes determine how many lists participate in a query.&lt;/p&gt;

&lt;p&gt;pgvector uses the access method name &lt;code&gt;ivfflat&lt;/code&gt; and uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;index reloption: &lt;code&gt;lists&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;query time GUC: &lt;code&gt;ivfflat.probes&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NeuronDB uses the access method name &lt;code&gt;ivf&lt;/code&gt; and uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;index reloption: &lt;code&gt;lists&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;query time GUC: &lt;code&gt;neurondb.ivf_probes&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;IVF index creation in pgvector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;items_embedding_ivfflat_l2&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;ivfflat&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_l2_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;IVF index creation in NeuronDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;items_embedding_ivf_l2&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;ivf&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_l2_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query time tuning with pgvector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;ivfflat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;probes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query time tuning with NeuronDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ivf_probes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Filtered search&lt;/h3&gt;

&lt;p&gt;Filtered kNN queries need two things: a filter predicate and an ordered distance sort with a limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'[3,1,2]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;l2_distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;l2_distance&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For approximate indexes, filtering happens after index traversal. Both extensions provide iterative index scans that extend the scan when the filter removes rows from the first pass.&lt;/p&gt;

&lt;p&gt;Iterative scans for pgvector HNSW:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iterative_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;strict_order&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_scan_tuples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20000&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scan_mem_multiplier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Iterative scans for NeuronDB HNSW:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hnsw_iterative_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;strict_order&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hnsw_max_scan_tuples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20000&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hnsw_scan_mem_multiplier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  NeuronDB packed and sparse formats
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;vectorp&lt;/code&gt; when you want a packed dense format with metadata. Use &lt;code&gt;vecmap&lt;/code&gt; for sparse high-dimensional inputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;packed_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vectorp&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;packed_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vectorp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;packed_items&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sparse_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vecmap&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quantization workflows
&lt;/h2&gt;

&lt;p&gt;Quantization trades precision for smaller storage and faster scans. NeuronDB exposes multiple quantization functions. Each function returns a &lt;code&gt;bytea&lt;/code&gt; representation.&lt;/p&gt;

&lt;p&gt;CPU quantization examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;vector_to_int8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;q_int8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vector_to_fp16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;q_fp16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vector_to_binary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;q_binary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vector_to_int4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;q_int4&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Accuracy analysis examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;quantize_analyze_int8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;int8_stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantize_analyze_fp16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;fp16_stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantize_analyze_binary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;binary_stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantize_analyze_int4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;int4_stats&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  GPU workflows in NeuronDB
&lt;/h2&gt;

&lt;p&gt;NeuronDB exposes GPU status, GPU distance functions, and GPU kNN helpers in SQL.&lt;/p&gt;

&lt;p&gt;GPU initialization and status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;neurondb_gpu_enable&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;gpu_enabled&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_memory_mb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;free_memory_mb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_available&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;neurondb_gpu_info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPU distance functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;vector_l2_distance_gpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[4,5,6]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;l2_gpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vector_cosine_distance_gpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[4,5,6]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cosine_gpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vector_inner_product_gpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[4,5,6]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ip_gpu&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPU kNN helpers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;hnsw_knn_search_gpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ivf_knn_search_gpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPU usage stats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;queries_executed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fallback_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_gpu_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_cpu_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;avg_latency_ms&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;neurondb_gpu_stats&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Multi-tenant controls in NeuronDB
&lt;/h2&gt;

&lt;p&gt;NeuronDB includes tenant quota tracking and tenant-specific helper functions. These objects live in the &lt;code&gt;neurondb&lt;/code&gt; schema.&lt;/p&gt;

&lt;p&gt;Tenant quota tables and views support workflows such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;enforce per tenant vector count limits&lt;/li&gt;
&lt;li&gt;track per tenant storage usage&lt;/li&gt;
&lt;li&gt;query tenant usage and quota percent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NeuronDB includes tenant-aware HNSW helper functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;hnsw_tenant_create&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hnsw_tenant_search&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hnsw_tenant_quota&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
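&lt;p&gt;As a sketch, the helpers above combine into a tenant-scoped workflow. The function names come from the extension, but the argument shapes below (tenant id, query vector, result count) are assumptions; verify the real signatures with &lt;code&gt;\df hnsw_tenant_*&lt;/code&gt; before relying on them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical signatures: check \df hnsw_tenant_* in your installation.
SELECT hnsw_tenant_create('items', 'embedding', 42) AS created\g

SELECT id, distance
FROM hnsw_tenant_search(42, '[1,2,3]'::vector, 10)\g

SELECT hnsw_tenant_quota(42) AS quota_state\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;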

&lt;h2&gt;
  
  
  Monitoring in NeuronDB
&lt;/h2&gt;

&lt;p&gt;NeuronDB exposes Prometheus-compatible metrics via SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;queries_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;queries_success&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;queries_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query_duration_sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vectors_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cache_hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cache_misses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;workers_active&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;neurondb_prometheus_metrics&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Background workers and queues in NeuronDB
&lt;/h2&gt;

&lt;p&gt;NeuronDB stores the queue and metrics state in SQL tables under the &lt;code&gt;neurondb&lt;/code&gt; schema. Background workers process those queues and sample those metrics when the workers are enabled in the PostgreSQL configuration.&lt;/p&gt;

&lt;p&gt;Queue workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_queue&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'{"text":"hello"}'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;job_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retry_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_queue&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Manual trigger helpers exist for testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;neuranq_run_once&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;queued_work&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;neuranmon_sample&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;tuner_sample&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;neurandefrag_run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;defrag_ran&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  LLM configuration and jobs in NeuronDB
&lt;/h2&gt;

&lt;p&gt;NeuronDB stores LLM provider configuration in &lt;code&gt;neurondb.llm_config&lt;/code&gt; and stores jobs in &lt;code&gt;neurondb.llm_jobs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Configuration workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_llm_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'https://api-inference.huggingface.co'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'REPLACE_WITH_KEY'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'REPLACE_WITH_MODEL'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;api_base&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_llm_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Job enqueue workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ndb_llm_enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'embed'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'REPLACE_WITH_MODEL'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'hello world'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'tenant0'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Index tuning helpers in NeuronDB
&lt;/h2&gt;

&lt;p&gt;NeuronDB exposes index tuning and diagnostics helpers in SQL. These helpers return JSONB.&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;index_tune_hnsw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'items'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hnsw_recommendation&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;index_tune_ivf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'items'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ivf_recommendation&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;index_recommend_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'items'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_choice&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;index_tune_query_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'items_embedding_hnsw_l2'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_knobs&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Migration
&lt;/h2&gt;

&lt;p&gt;Migration replaces one extension with the other. Be careful with the &lt;code&gt;CASCADE&lt;/code&gt; step: &lt;code&gt;DROP EXTENSION vector CASCADE&lt;/code&gt; drops every dependent object, which includes ANN indexes built on pgvector opclasses and, in stock PostgreSQL, table columns whose type belongs to the extension. Dump or convert embedding columns first unless you have verified that your column types survive the drop.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Drop pgvector
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="k"&gt;CASCADE&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Install NeuronDB
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Verify data
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;row_count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rows in surviving columns remain, but any ANN indexes were dropped along with pgvector, so you still need to rebuild them.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Recreate indexes
For large tables (&amp;gt;100GB), increase &lt;code&gt;maintenance_work_mem&lt;/code&gt; before building.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;maintenance_work_mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'4GB'&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_l2_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ef_construction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;GPU
Use the NeuronDB GPU functions and settings shown earlier. The SQL surface exposes GPU kNN functions for both HNSW and IVF.&lt;/li&gt;
&lt;/ol&gt;
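&lt;p&gt;As a quick post-migration check, you can exercise the GPU path with the functions shown earlier in this post; a sketch, assuming a GPU is present (the extension reports fallbacks via &lt;code&gt;neurondb_gpu_stats()&lt;/code&gt; otherwise):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Enable the GPU path, then run a GPU kNN search.
-- The last two arguments follow the earlier example in this post.
SELECT neurondb_gpu_enable() AS gpu_enabled\g

SELECT id, distance
FROM hnsw_knn_search_gpu('[1,2,3]'::vector, 10, 100)\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;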

&lt;h2&gt;
  
  
  Use Case Recommendations
&lt;/h2&gt;

&lt;p&gt;Match the tool to your infrastructure and requirements: start from your workload, then map your constraints onto a short decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use pgvector when your goal is simple vector search
&lt;/h3&gt;

&lt;p&gt;Pick pgvector when you want fewer moving parts and fewer extension-specific features.&lt;/p&gt;

&lt;p&gt;Use pgvector when you meet most of these conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You run CPU only workloads.&lt;/li&gt;
&lt;li&gt;You want &lt;code&gt;ivfflat&lt;/code&gt; and &lt;code&gt;hnsw&lt;/code&gt; naming across docs, examples, and client libraries.&lt;/li&gt;
&lt;li&gt;You want distance operators and two ANN access methods, with minimal extra SQL objects.&lt;/li&gt;
&lt;li&gt;You want a smaller operational surface area inside the extension.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use pgvector when your query pattern looks like this most of the time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use pgvector when you tune with pgvector GUCs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ef_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;ivfflat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;probes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use NeuronDB when your goal is a larger in-database surface area
&lt;/h3&gt;

&lt;p&gt;Pick NeuronDB when you want the same distance operators and index patterns plus additional SQL objects for GPU workflows, quantization workflows, tuning helpers, and operational queues.&lt;/p&gt;

&lt;p&gt;Use NeuronDB when you meet most of these conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want &lt;code&gt;ivf&lt;/code&gt; as an access method name, plus &lt;code&gt;hnsw&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;You want &lt;code&gt;vectorp&lt;/code&gt; and &lt;code&gt;vecmap&lt;/code&gt; as additional storage formats.&lt;/li&gt;
&lt;li&gt;You want SQL functions for quantization, with both CPU and GPU entry points.&lt;/li&gt;
&lt;li&gt;You want SQL functions for GPU status, GPU distance, and GPU kNN helpers.&lt;/li&gt;
&lt;li&gt;You want SQL tables and views for tenant quotas, job queues, metrics, and Prometheus export.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use NeuronDB when you want NeuronDB-specific query-time tuning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hnsw_ef_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ivf_probes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use NeuronDB when you want index tuning helpers and diagnostics in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;index_tune_hnsw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'items'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hnsw_recommendation&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;index_tune_ivf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'items'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ivf_recommendation&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;index_recommend_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'items'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_choice&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Short decision flow
&lt;/h3&gt;

&lt;p&gt;Start here when you need a quick answer.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If you want &lt;code&gt;ivfflat&lt;/code&gt;, pick pgvector.&lt;/li&gt;
&lt;li&gt;If you want &lt;code&gt;ivf&lt;/code&gt;, pick NeuronDB.&lt;/li&gt;
&lt;li&gt;If you want GPU SQL entry points, pick NeuronDB.&lt;/li&gt;
&lt;li&gt;If you want fewer extension-owned tables and views, pick pgvector.&lt;/li&gt;
&lt;li&gt;If you want quantization helpers and analysis functions in SQL, pick NeuronDB.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Practical scenarios
&lt;/h3&gt;

&lt;p&gt;Use this section as a checklist.&lt;/p&gt;

&lt;p&gt;Scenario 1: Single app, single embedding model, CPU only&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use pgvector&lt;/li&gt;
&lt;li&gt;Create one HNSW index on &lt;code&gt;vector_l2_ops&lt;/code&gt; or &lt;code&gt;vector_cosine_ops&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tune &lt;code&gt;hnsw.ef_search&lt;/code&gt; per endpoint or per query&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scenario 2: Multi-tenant SaaS with per-tenant limits&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use NeuronDB&lt;/li&gt;
&lt;li&gt;Use tenant quota tables and views under the &lt;code&gt;neurondb&lt;/code&gt; schema&lt;/li&gt;
&lt;li&gt;Use tenant-aware HNSW helper functions when you want tenant-scoped index management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scenario 3: Storage pressure from large embeddings&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use NeuronDB&lt;/li&gt;
&lt;li&gt;Use quantization functions to produce compact &lt;code&gt;bytea&lt;/code&gt; outputs&lt;/li&gt;
&lt;li&gt;Compare distance preservation with &lt;code&gt;quantize_compare_distances&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scenario 4: GPU present, batch heavy workloads&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use NeuronDB&lt;/li&gt;
&lt;li&gt;Enable GPU runtime and query &lt;code&gt;neurondb_gpu_info&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Use GPU kNN helpers where your workflow matches those function signatures&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;pgvector focuses on vector search primitives. NeuronDB adds additional types, an &lt;code&gt;ivf&lt;/code&gt; access method, GPU entry points, quantization helpers, worker tables, and metrics export.&lt;/p&gt;

&lt;p&gt;Pick the extension based on your operational goal. Keep your schema and query patterns simple. Measure with the benchmark scripts in this repo, then tune one knob at a time.&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>pgvector</category>
      <category>postgres</category>
      <category>vector</category>
    </item>
    <item>
      <title>8 RAG Patterns You Should Stop Ignoring</title>
      <dc:creator>NeuronDB Support</dc:creator>
      <pubDate>Sun, 01 Feb 2026 13:00:19 +0000</pubDate>
      <link>https://dev.to/neurondb_support_d73fa7ba/retrieval-augmented-generation-architectures-patterns-and-production-reality-49g1</link>
      <guid>https://dev.to/neurondb_support_d73fa7ba/retrieval-augmented-generation-architectures-patterns-and-production-reality-49g1</guid>
      <description>&lt;p&gt;Large language models generate fluent text. They fail to meet grounding, traceability, freshness, and access control requirements. Retrieval-Augmented Generation addresses this by forcing models to answer using external evidence.&lt;/p&gt;

&lt;p&gt;Early RAG used one simple pipeline. Production systems now use multiple architecture patterns. Each pattern targets a different failure mode. This post explains eight major RAG architectures used in production today.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75oygsxfybutofphj3kp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75oygsxfybutofphj3kp.png" alt="Retrieval Augmented Generation: Architectures, Patterns, and Production Reality" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Project references&lt;br&gt;
NeuronDB site: &lt;a href="https://www.neurondb.ai" rel="noopener noreferrer"&gt;https://www.neurondb.ai&lt;/a&gt;&lt;br&gt;
Source code: &lt;a href="https://github.com/neurondb/neurondb" rel="noopener noreferrer"&gt;https://github.com/neurondb/neurondb&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is RAG
&lt;/h2&gt;

&lt;p&gt;RAG links three systems: storage, retrieval, and generation. The storage layer holds your documents, chunks, and embeddings. The retrieval layer finds relevant evidence for each query. The generation layer produces answers conditioned on the retrieved context. The pipeline flows from query through evidence retrieval, context building, answer generation, and citation return. You get factual grounding, fresh data usage, private data isolation, and audit trace support. RAG shifted AI engineering from prompt tuning toward data pipeline engineering.&lt;/p&gt;

&lt;p&gt;The storage layer supports multiple backends, including vector databases (Pinecone, Weaviate, Milvus), document stores (Elasticsearch, OpenSearch), and hybrid systems. The retrieval layer runs embedding models, keyword search, or graph traversal, depending on the architecture. The generation layer typically uses a large language model with a prompt template. The three layers communicate through a well-defined interface. You swap components without rewriting the full pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Naive RAG
&lt;/h2&gt;

&lt;p&gt;Naive RAG uses direct vector similarity retrieval with no feedback loop. The name comes from the original RAG paper. The architecture remains the baseline for comparison.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9782k79mfsevxwx5qyu6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9782k79mfsevxwx5qyu6.png" alt="Naive RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;Document ingestion loads raw text from files, databases, or APIs. Preprocessing normalizes whitespace, strips markup, and segments by logical boundaries. Text chunking splits documents into fixed-size or variable-size segments. Common choices: 256 tokens, 512 tokens, or sentence-based chunks. Embedding generation converts each chunk into a vector using a pretrained model. Vector storage writes embeddings to a vector database with metadata (source doc, chunk index, timestamp). At query time, the user submits a question. Query embedding converts the question into a vector. Vector search returns the top-k nearest chunks by cosine similarity or Euclidean distance. Context injection concatenates retrieved chunks into a prompt. Response generation passes the prompt to an LLM. Citation return attaches source references to the output.&lt;/p&gt;
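&lt;p&gt;The query-time half of this pipeline can be sketched in a few lines of Python. The bag-of-letters embedding below is a toy stand-in for a real embedding model, and every function name is illustrative rather than a specific library's API:&lt;/p&gt;

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def embed(text):
    # Toy bag-of-letters embedding, normalized to unit length.
    # A real system would call a pretrained model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch in ALPHABET:
            vec[ALPHABET.index(ch)] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Dot product of unit vectors equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def top_k(query, corpus, k=2):
    # Vector search: rank chunks by cosine similarity to the query.
    q = embed(query)
    scored = [(cosine(q, embed(chunk)), chunk) for chunk in corpus]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

def answer(query, corpus, k=2):
    # Context injection: concatenate retrieved chunks into a prompt.
    # A real system would pass this prompt to an LLM endpoint.
    context = "\n".join(top_k(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

&lt;p&gt;A production version would swap &lt;code&gt;embed&lt;/code&gt; for a model call and &lt;code&gt;top_k&lt;/code&gt; for a vector index query, but the shape of the pipeline stays the same.&lt;/p&gt;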

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Implementation takes one to two weeks for an experienced engineer. Infrastructure cost stays low: one embedding model, one vector store, one LLM endpoint. The approach works well for static knowledge domains. FAQ corpora, product documentation, and internal wikis fit this pattern. Latency stays under 2 seconds for most deployments. No feedback loops mean deterministic behavior. The same query returns the same retrieval set. Debugging is straightforward.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;No verification loop validates retrieved evidence. Irrelevant chunks slip through when embedding similarity is misleading. Ranking quality depends entirely on embedding similarity. Ambiguous queries return weak results. A query like "how do I fix the error" returns generic troubleshooting content rather than error-specific documentation. Multi-faceted queries suffer. A question about "pricing and integration" retrieves only chunks for one facet. The model hallucinates to fill gaps when retrieval fails.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chunk Size
&lt;/h3&gt;

&lt;p&gt;Chunk size selection impacts recall quality. Small chunks (128 tokens) give precise matches but miss context. A small chunk matching "connection timeout" often omits the cause and the solution. Large chunks (512 tokens) capture more context but dilute relevance. Top-k retrieval then spans fewer distinct documents. Overlap between chunks (50 tokens) helps preserve context across boundaries. Test multiple chunk sizes (128, 256, 512) against your query set. Measure recall at k=5 and k=10. Choose the size where recall plateaus.&lt;/p&gt;
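&lt;p&gt;A minimal sketch of fixed-size chunking with overlap, assuming whitespace tokens as a stand-in for a real tokenizer:&lt;/p&gt;

```python
def chunk_tokens(tokens, size=256, overlap=50):
    # Fixed-size chunking: consecutive chunks share `overlap` tokens
    # so context survives across chunk boundaries.
    step = max(1, size - overlap)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end
    return chunks
```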

&lt;h3&gt;
  
  
  Embedding Models
&lt;/h3&gt;

&lt;p&gt;Embedding model choice impacts semantic coverage. Models trained on general text (OpenAI text-embedding-ada-002, sentence-transformers/all-MiniLM) underperform on domain-specific corpora. Medical, legal, and financial texts use terminology absent from training data. Use domain-tuned embeddings when available. Fine-tune on your corpus with contrastive loss. Or use domain-specific models (e.g., BioBERT for medical applications). Embedding dimension matters. 384-dim models are faster and cheaper. 1536-dim models capture finer distinctions. Benchmark both on your data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Production Use Cases
&lt;/h3&gt;

&lt;p&gt;FAQ bots with fewer than 10,000 questions. Documentation search for product manuals and API references. Internal knowledge bases where content changes infrequently. POCs and demos where speed of implementation outweighs accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Agentic RAG
&lt;/h2&gt;

&lt;p&gt;Agentic RAG adds planning, tool selection, and iterative reasoning. The agent breaks complex questions into steps, chooses tools for each step, executes them, and synthesizes a final answer. This architecture handles workflows that a single retrieval call cannot support.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jhcjewaow224zmnppep.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jhcjewaow224zmnppep.png" alt="Agentic RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;Task planning analyzes the user query and produces a step-by-step plan. The planner uses an LLM with few-shot examples or a structured prompt. Plan steps include "retrieve documents about X," "call API Y," and "summarize results." Tool selection maps each step to a tool. Tools include vector search, keyword search, calculator, API calls, and code execution. The agent selects tools based on step descriptions and tool schemas. Multi-step retrieval executes tools in sequence. Outputs from earlier steps feed into later steps. A retrieval about "company revenue" informs a follow-up retrieval about "competitor revenue." Tool execution runs each tool and captures results. Memory update stores tool outputs, intermediate conclusions, and user feedback. Response synthesis generates the final answer from the accumulated context. The agent loops back to planning when synthesis indicates missing information.&lt;/p&gt;
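&lt;p&gt;The plan-execute-update loop above can be sketched as a bounded loop. The planner and tools here are placeholder callables, not a specific framework's API:&lt;/p&gt;

```python
def run_agent(query, tools, plan_fn, max_steps=5):
    # Bounded plan-execute loop. plan_fn inspects the query and the
    # accumulated memory and returns (tool_name, tool_input), or None
    # when it judges the context sufficient. A real planner would be
    # an LLM call with the tool schemas in its prompt.
    memory = []
    for _ in range(max_steps):  # hard cap prevents runaway loops
        step = plan_fn(query, memory)
        if step is None:
            break
        tool_name, tool_input = step
        result = tools[tool_name](tool_input)
        memory.append((tool_name, tool_input, result))  # memory update
    return memory
```

&lt;p&gt;Keeping the full &lt;code&gt;memory&lt;/code&gt; list doubles as the execution trace you need for debugging.&lt;/p&gt;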

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Handles complex workflows. A query like "compare our Q3 results to our top three competitors and summarize the gap" requires multiple retrievals, API calls, and summarization. You run multiple tools in sequence. Long-running reasoning tasks become feasible. Research assistants draw from papers, patents, and news sources. Competitive intelligence agents aggregate data from multiple sources. Autonomous analytics agents run queries, join data, and produce reports.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;Latency increases with each planning and execution step. A single query often triggers 3 to 10 model calls. End-to-end latency reaches 10 to 30 seconds. Debugging is hard. The agent chooses different tools or paths for similar queries. Reproducing a failure requires logging every decision. Infrastructure cost rises. Each step consumes tokens. State tracking, retry logic, and execution budget control add engineering overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guidance
&lt;/h3&gt;

&lt;p&gt;Set a maximum step count. Without limits, agents loop or drift. A typical cap is 5 to 10 steps. Log every tool call and plan step. Store the full execution trace. Reproducibility matters when users report errors. Use deterministic seeds where possible for plan generation. Define tool schemas with clear descriptions. The agent relies on schemas to select tools. Vague descriptions cause wrong tool selection. Implement timeouts per step. A stuck tool blocks the whole pipeline. Add fallback behavior when tools fail. The agent should degrade gracefully.&lt;/p&gt;

&lt;h3&gt;
  
  
  Production Use Cases
&lt;/h3&gt;

&lt;p&gt;Research automation: literature review, patent analysis, trend summarization. Competitive intelligence: market monitoring, competitor tracking, strategic briefs. Autonomous analytics: ad-hoc reporting, data exploration, dashboard generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. HyDE RAG
&lt;/h2&gt;

&lt;p&gt;HyDE (Hypothetical Document Embeddings) generates synthetic documents to improve retrieval matching. The idea: hypothetical answers are closer in embedding space to real answers than raw queries. This bridges the vocabulary gap between how users ask and how documents are written.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjaen5e83remqelwz4k8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjaen5e83remqelwz4k8l.png" alt="HyDE RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;The user submits a query. Hypothetical answer generation produces one or more plausible answers using an LLM. A query like "how do I configure SSL" might generate "To configure SSL, you need to generate a certificate, add the certificate path to the config file, and restart the server." Embedding generation converts the hypothetical answer into a vector. Retrieval uses this vector instead of the query vector to search the corpus. The retrieved chunks are real documents, not hypothetical. Context assembly concatenates retrieved chunks. The final generation produces the actual answer from the retrieved evidence. The model cites real sources rather than hypothetical answers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Variations
&lt;/h3&gt;

&lt;p&gt;Single HyDE generates one hypothetical answer per query. Multi-HyDE generates 3 to 5 hypothetical answers, embeds each, retrieves results for each embedding, and merges the lists. Multi-HyDE improves recall but multiplies cost. HyDE with reranking adds a reranker after retrieval. The reranker scores chunks based on their relevance to the original query. This filters false positives from the expanded retrieval set.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Recall quality improves. Benchmarks report 10-30% recall gains over naive retrieval. Vocabulary mismatch between queries and corpus documents drops. Users ask, "Why is my app slow?" while docs say "performance degradation" and "latency issues." Hypothetical answers use doc-like language. Technical search benefits most. Developer questions, error messages, and API usage patterns align better after HyDE.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;Extra inference: the model must generate a hypothetical answer before retrieval. Expect 1.5x to 2x token usage per query. Synthetic bias is a risk. Generated documents sometimes skew retrieval toward certain document types. A model trained on tutorials often generates tutorial-style hypotheticals and over-retrieves tutorials. Production use cases include developer search, technical troubleshooting, and scientific literature retrieval. HyDE works best when combined with reranking models. The reranker filters false positives from the expanded retrieval set.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guidance
&lt;/h3&gt;

&lt;p&gt;Use a fast, cheap model for hypothetical generation. You do not need the best model. A 7B parameter model often suffices. Keep hypothetical answers concise. Long hypotheticals add noise. 50 to 100 tokens per hypothetical works well. Consider caching. Repeated queries (e.g., popular FAQs) reuse cached hypotheticals. Cache key: query embedding.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Graph RAG
&lt;/h2&gt;

&lt;p&gt;Graph RAG retrieves context from entity relationships in knowledge graphs. Documents become nodes and edges. Queries traverse the graph to assemble context. This architecture excels when relationships matter as much as raw text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gjvc2cv0c9d5huw8pxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gjvc2cv0c9d5huw8pxf.png" alt="Graph RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;Entity extraction identifies named entities in a document, such as people, organizations, products, and concepts. Extraction uses NER models, rule-based patterns, or LLM-based parsing. Entity linking resolves extracted entities to canonical IDs. "Apple Inc" and "Apple" map to the same node. Linking uses knowledge bases (Wikidata, DBpedia) or custom ontologies. Graph construction creates nodes for entities and edges for relationships. Relationships come from co-occurrence, dependency parsing, or relation extraction models. Graph storage writes to a graph database (e.g., Neo4j or Amazon Neptune) or to an in-memory graph. At query time, query understanding identifies entities mentioned in the query. Graph traversal starts from those entities and follows edges. Traversal strategies include k-hop neighborhood, path finding, and community detection. Context assembly pulls text from documents associated with traversed nodes. Generation produces an answer from the assembled context.&lt;/p&gt;
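&lt;p&gt;The k-hop neighborhood traversal mentioned above can be sketched with a plain adjacency list standing in for a graph database:&lt;/p&gt;

```python
from collections import deque

def k_hop(graph, seeds, k=2):
    # Breadth-first k-hop neighborhood: collect every entity reachable
    # from the query entities within k edges. `graph` maps a node to
    # its neighbor list, an adjacency-list stand-in for Neo4j etc.
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand past the hop budget
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen
```

&lt;p&gt;Context assembly would then pull the documents attached to each node in the returned set.&lt;/p&gt;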

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Multi-hop reasoning becomes tractable. "What drugs interact with the patient's current medication?" requires chaining drug-to-drug relationships across multiple hops. Explainability is strong. The reasoning path follows explicit graph edges. You show users the path from query entities to answer entities. Relationship-aware retrieval surfaces related concepts that naive vector search misses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;Graph construction is expensive. Entity extraction and linking require trained models or rules. Expect weeks of tuning for a new domain. Schema design is complex. You must decide which relationship types matter for retrieval. Too many relationship types create noise. Too few miss connections. Graph refresh pipelines must align with source data refresh cycles. Stale graphs return stale answers. Production use cases include healthcare decision support, fraud detection, and scientific research, where relationship structure matters as much as raw text.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guidance
&lt;/h3&gt;

&lt;p&gt;Start with a minimal schema. Two or three relationship types (e.g., "treats," "interacts with") often suffice. Add more as you validate the need. Use hybrid retrieval. Combine graph traversal with vector search. Graph finds structure. Vectors find semantic similarity. Run incremental updates. Rebuild the full graph only when schema changes. For daily doc updates, add or update affected nodes and edges.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Corrective RAG
&lt;/h2&gt;

&lt;p&gt;Corrective RAG adds self-validation and iterative refinement. The system generates an answer, critiques it, and re-retrieves or regenerates when the critique identifies issues. The loop continues until the answer meets a quality threshold. This architecture is well-suited to high-stakes domains where errors are costly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei74ed5gulovvn7e4ms9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei74ed5gulovvn7e4ms9.png" alt="Corrective RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;The initial retrieval fetches the top-k chunks for the query. Initial generation produces a draft answer. Critique evaluates the draft. The critic checks: does the answer cite retrieved evidence? Are claims supported? Are there contradictions? The critic uses an LLM with a structured prompt or a trained classifier. Scoring produces a numeric score (0 to 1) or a pass/fail. Re-query triggers when the critique finds missing evidence or unsupported claims. The re-query reformulates the search or expands k. Re-generation produces a new draft from the expanded context. The loop repeats until the score exceeds a threshold or the maximum iterations (e.g., 3) are reached. Final output returns the best-scoring answer with citations.&lt;/p&gt;
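&lt;p&gt;A minimal sketch of the critique loop, assuming placeholder callables for retrieval, generation, and the critic. The score threshold and iteration cap follow the numbers above:&lt;/p&gt;

```python
def corrective_answer(query, retrieve_fn, generate_fn, critique_fn,
                      threshold=0.8, max_iters=3):
    # Generate-critique-retry loop. critique_fn returns a score in
    # [0, 1]; retrieve_fn takes (query, k) so the re-query step can
    # expand k. All callables are stand-ins for LLM and search calls.
    k = 5
    best_answer, best_score = None, -1.0
    for _ in range(max_iters):
        context = retrieve_fn(query, k)
        draft = generate_fn(query, context)
        score = critique_fn(draft, context)
        if score > best_score:
            best_answer, best_score = draft, score
        if score >= threshold:
            break  # draft passed the critique
        k = k * 2  # expand retrieval and try again
    return best_answer, best_score
```

&lt;p&gt;Returning the best-scoring draft, not the last one, protects against a final pass that regresses.&lt;/p&gt;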

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Factual accuracy improves: benchmarks report a 15 to 25 percent reduction in hallucination rate. The critic catches unsupported claims before they reach the user. The architecture suits applications that require robust audit trails. Financial analytics, legal research, and compliance systems need traceable reasoning. Each answer comes with a critique log.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;Higher latency. Most implementations run 2 to 4 generation passes per query. Latency doubles or triples. Token usage rises proportionally. Engineering teams must design scoring functions for the critique stage. The critic must reliably detect factual errors or missing evidence. A weak critic adds cost without benefit. False negatives let errors through. A harsh critic triggers unnecessary re-retrieval. False positives waste tokens and time. Tuning the critic is non-trivial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guidance
&lt;/h3&gt;

&lt;p&gt;Start with a simple critique: "Does each claim have a citation?" Then add checks for contradiction and hallucination. Use chain-of-thought prompting for the critic. Ask the critic to explain its reasoning before scoring. This improves reliability. Set a conservative max iteration count. Three passes usually suffice. More passes yield diminishing returns. Log critique scores over time. Track the distribution. Drift indicates a need to retune.&lt;/p&gt;

&lt;h3&gt;
  
  
  Production Use Cases
&lt;/h3&gt;

&lt;p&gt;Financial analytics: earnings summaries, risk reports, compliance checks. Legal research: case law retrieval, contract analysis, and regulatory lookup. Compliance systems: policy verification, audit support, regulatory reporting.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Contextual RAG
&lt;/h2&gt;

&lt;p&gt;Contextual RAG uses conversation state and session memory. Retrieval considers prior turns. Generation maintains continuity. This architecture supports multi-turn dialogues where each question depends on context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yl8go8dz6oqz1tqpab1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yl8go8dz6oqz1tqpab1.png" alt="Contextual RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;Session storage keeps a log of user messages and assistant responses. Each turn appends to the log. Context summarization runs when the log exceeds a token limit. Summarization compresses old turns into a shorter summary. The summary plus recent turns form the active context. Context-aware retrieval uses the full conversation, not only the latest message. A query like "what about the second one?" resolves only when "second one" is combined with the prior discussion of the list. Some systems embed the full conversation. Others extract key entities and concepts for retrieval. Response generation receives the retrieved context plus conversation history. The model produces answers referencing prior turns. The memory update stores new information from the current turn for future retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Multi-turn consistency improves. Follow-up questions receive correct answers. "What is the price?" after "Tell me about Product X" returns Product X's price. Personalization based on user history becomes possible. Preferences, prior queries, and corrections influence retrieval and generation. Session continuity supports long interactions. Meeting assistants, customer success tools, and personal knowledge systems rely on this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;Memory drift is a risk. Stale or irrelevant context accumulates over long sessions. A conversation about "Product A" often drifts to "Product B," but retrieval remains biased toward A. Context contamination occurs when prior turns bias retrieval in unwanted ways. A user correction ("I meant Product B, not A") must override prior context. Implementation is tricky. Memory compaction must run periodically. Without compaction, context windows overflow, and relevance degrades. Summarization loses detail. Aggressive summarization drops information needed for later turns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guidance
&lt;/h3&gt;

&lt;p&gt;Define a context window budget. Reserve tokens for conversation history, retrieval context, and generation. When history exceeds the budget, summarize the oldest turns. Use a sliding window with a summary: keep the last N turns verbatim and summarize the rest. Store user corrections explicitly. "User clarified X" should override prior assumptions. Test with long sessions. Simulate 20-turn conversations. Measure consistency and relevance at turn 5, 10, 15, 20.&lt;/p&gt;
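&lt;p&gt;The sliding-window-with-summary strategy can be sketched directly. The summarizer is a placeholder; a real system would call an LLM:&lt;/p&gt;

```python
def compact_history(turns, keep_last=4, summarize_fn=None):
    # Sliding window with summary: keep the last N turns verbatim and
    # compress everything older into one summary entry. The default
    # summarizer is a placeholder that only counts the folded turns.
    if keep_last >= len(turns):
        return list(turns)  # history fits the budget, nothing to fold
    old, recent = turns[:-keep_last], turns[-keep_last:]
    if summarize_fn is None:
        summary = f"[summary of {len(old)} earlier turns]"
    else:
        summary = summarize_fn(old)
    return [summary] + recent
```

&lt;p&gt;User corrections should be written into the summary explicitly so they keep overriding prior assumptions after compaction.&lt;/p&gt;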

&lt;h3&gt;
  
  
  Production Use Cases
&lt;/h3&gt;

&lt;p&gt;Meeting assistants: summarization, action items, follow-up questions. Customer success tools: support dialogues, onboarding flows, and feature discovery. Personal knowledge systems: note-taking, research assistants, learning companions.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Modular RAG
&lt;/h2&gt;

&lt;p&gt;Modular RAG splits retrieval into independent components. Each component has a single responsibility. You swap, upgrade, or bypass components without rewriting the pipeline. This architecture supports complex enterprise needs where one-size-fits-all fails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuv6kjjn6gg1iyo88l3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuv6kjjn6gg1iyo88l3j.png" alt="Modular RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;Query rewriting normalizes and expands the user query. Spelling correction, query expansion, and multi-query generation (HyDE-style) run here. Hybrid retrieval runs multiple search strategies in parallel. Vector search, keyword search, and graph traversal execute concurrently. Results feed into a fusion step. Filtering removes irrelevant results. Filters apply metadata constraints (date range, source, access control). Deduplication merges near-duplicate chunks. Reranking scores the filtered set with a cross-encoder or learned ranker. Reranking is expensive, so you run reranking on the top 20 to 50 candidates. Tool routing sends queries to specialized tools. A legal query goes to the legal corpus. A support query goes to the support corpus. Routing uses classifiers or keyword rules. Response synthesis assembles the final answer. Synthesis calls the LLM once or multiple times. Some architectures add a citation verification step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Each module upgrades or replaces independently. Swap the embedding model without touching the retrieval logic. Add a new data source by adding a retrieval module. The architecture supports flexible enterprise workflows. Different departments need different corpora and rules. Modular RAG accommodates this. Adding new data sources or retrieval strategies is straightforward. Implement a new module. Add the module to the pipeline. Configure routing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;System complexity rises. A full modular pipeline has 6 to 10 components. Each component has its own config, dependencies, and failure modes. Maintenance cost rises. Observability across modules becomes critical. Failures occur at any stage. A bug in query rewriting silently corrupts downstream retrieval. You need per-module metrics and tracing. Latency adds up. Each module adds milliseconds. End-to-end latency requires careful optimization. Production use cases include enterprise AI platforms, large data pipeline systems, and research automation systems where modularity is a core requirement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guidance
&lt;/h3&gt;

&lt;p&gt;Define clear interfaces between modules. Each module accepts a standard input format and produces a standard output format. Use a pipeline framework (e.g., LangChain, LlamaIndex, or a custom DAG) to enforce this. Instrument every module. Log inputs, outputs, and latency. Add tracing IDs to follow a query across modules. Version your pipeline. When you change a module, record the version. A/B test module changes before full rollout. Start minimal. Add modules only when you have a concrete problem. A 3-module pipeline (retrieve, rerank, generate) often suffices for early deployments.&lt;/p&gt;
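&lt;p&gt;The standard-interface idea can be sketched as modules that each take and return a state dict. The module names and fields here are illustrative, not a specific framework's schema:&lt;/p&gt;

```python
def run_pipeline(query, modules):
    # Minimal modular pipeline: each module is a callable taking and
    # returning the shared state dict, so modules can be swapped or
    # reordered without touching their neighbors.
    state = {"query": query, "candidates": [], "trace": []}
    for name, module in modules:
        state = module(state)
        state["trace"].append(name)  # per-module tracing for debugging
    return state

def rewrite(state):
    # Query rewriting module: normalize the query in place.
    state["query"] = state["query"].strip().lower()
    return state

def retrieve(state):
    # Stand-in retriever: echoes the query as a single candidate.
    state["candidates"] = ["doc about " + state["query"]]
    return state
```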

&lt;h3&gt;
  
  
  Production Use Cases
&lt;/h3&gt;

&lt;p&gt;Enterprise AI platforms: multi-tenant, multi-corpus, role-based access. Large data pipeline systems: billions of documents, multiple retrieval backends. Research automation: federated search, specialized tools, reproducibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Hybrid RAG
&lt;/h2&gt;

&lt;p&gt;Hybrid RAG combines keyword retrieval and semantic retrieval. Keyword search finds exact and lexical matches. Semantic search finds conceptual matches. Together, they cover cases where either alone fails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcabyj7e6rtsb4f6bb31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcabyj7e6rtsb4f6bb31.png" alt="Hybrid RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;Query parsing extracts keywords and optionally generates a semantic query. Keyword search runs on an inverted index (e.g., BM25 or Elasticsearch). Semantic search runs against a vector index. Both return ranked lists. Rank fusion merges the two lists. Reciprocal Rank Fusion (RRF) is the common baseline: score = sum(1/(k + rank)) across lists. k is typically 60. Other methods include weighted linear combination and learned fusion. Optionally, reranking scores the fused list. Reranking uses a cross-encoder or a learned model. Generation receives the top chunks and produces the answer.&lt;/p&gt;
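The RRF baseline described above fits in a few lines. A minimal sketch; the document ids and the two input rankings are placeholders:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum of 1/(k + rank) over every
    ranked list that contains d. Ranks are 1-based; k=60 is the common default."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["d3", "d1", "d7"]    # e.g. a BM25 ranking
semantic_hits = ["d1", "d5", "d3"]   # e.g. a vector ranking
fused = rrf_fuse([keyword_hits, semantic_hits])
```

Note how d1 wins: it ranks high in both lists, so its reciprocal-rank contributions add up even though it tops neither list by a wide margin.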

&lt;h3&gt;
  
  
  Keyword vs Semantic
&lt;/h3&gt;

&lt;p&gt;Keyword search excels at exact matches: product IDs, error codes, and proper nouns. "ERR_SSL_PROTOCOL_ERROR" retrieves the right doc. Semantic search fails here if the embedding does not capture the token. Semantic search excels at paraphrasing and conceptual queries. "How do I fix connection problems" matches "troubleshooting network connectivity." Keyword search misses this. Hybrid covers both. A query about "Q3 revenue" gets keyword hits on "Q3" and "revenue" plus semantic hits on earnings reports and financial summaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Precision comes from keyword matching. Recall comes from semantic search. Structured and unstructured data both work. Keyword search handles tables, metadata, and structured fields. Semantic search handles free text. Production use cases include legal search, compliance audits, and enterprise search platforms, where both precision and recall matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;Ranking tuning is complex. Rank fusion models require continuous optimization. You must balance keyword and semantic signals. RRF assumes equal contribution. Your data often needs different weights. Learned fusion models often outperform RRF but need training data. You need labeled query-document pairs. Tuning is iterative. Increase keyword weight when users complain about missed exact matches. Increase semantic weight when users report missed conceptual matches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guidance
&lt;/h3&gt;

&lt;p&gt;Start with RRF. No training required. Tune k (typically 40-80) on a small validation set. Add metadata filters. Both keyword and semantic results benefit from source, date, and access filters. Consider query-type routing. Short queries (1 to 3 words) often need more keyword weight. Long, conceptual queries need more semantic weight. Implement both paths in parallel. Parallel execution keeps latency low. Fusion adds minimal overhead.&lt;/p&gt;
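The query-type routing heuristic above can be made concrete as a weight function for a weighted fusion step. The thresholds and weight values below are illustrative starting points to tune, not fixed rules:

```python
def fusion_weights(query: str):
    """Heuristic query-type routing: short queries lean on keyword search,
    long conceptual queries lean on semantic search.
    Returns (keyword_weight, semantic_weight)."""
    n_words = len(query.split())
    if n_words > 8:
        return (0.3, 0.7)   # long natural-language question
    if n_words > 3:
        return (0.5, 0.5)
    return (0.7, 0.3)       # likely an ID, error code, or proper noun
```

Feed these weights into a weighted linear combination of the keyword and semantic scores, and revisit the thresholds once you have user feedback on missed matches.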

&lt;h3&gt;
  
  
  Production Use Cases
&lt;/h3&gt;

&lt;p&gt;Legal search: case law, contracts, regulations. Compliance audit: policy lookup, regulatory check. Enterprise search: intranet, document management, knowledge base.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Architecture Comparison
&lt;/h2&gt;

&lt;p&gt;Naive RAG: low complexity, medium accuracy, low cost, low latency. Implementation in days. Best for static, narrow corpora.&lt;/p&gt;

&lt;p&gt;Agentic RAG and Modular RAG: high complexity, high accuracy, high cost, higher latency. Implementation in weeks or months. Best for complex workflows and enterprise needs.&lt;/p&gt;

&lt;p&gt;Corrective RAG: high accuracy, high latency, high token usage. Best for high-stakes domains where verification matters.&lt;/p&gt;

&lt;p&gt;HyDE, Contextual, and Hybrid RAG: medium complexity, cost, and latency with accuracy gains over Naive RAG. Implementation in one to two weeks. Best for technical search, multi-turn dialogue, or mixed precision-recall needs.&lt;/p&gt;

&lt;p&gt;Choose an architecture by failure mode. Naive RAG solves simplicity. Agentic RAG solves autonomy. HyDE solves vocabulary mismatch. Graph RAG solves relationship reasoning. Corrective RAG solves verification. Contextual RAG solves memory. Modular RAG solves enterprise workflow composition. Hybrid RAG solves the balance between precision and semantic coverage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision Framework
&lt;/h3&gt;

&lt;p&gt;Ask: Does your corpus change frequently? Yes favors Naive or Modular. Does your domain have rich entity relationships? Yes favors Graph. Do users ask multi-turn questions? Yes favors Contextual. Do you need high factual accuracy and audit trails? Yes favors Corrective. Do users and docs use different terminology? Yes favors HyDE or Hybrid. Do you need multiple tools and complex workflows? Yes favors Agentic or Modular.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;RAG is no longer a single architecture. Each pattern solves a specific problem. Production success depends on pipeline design, data quality, and evaluation discipline. The strongest systems combine multiple RAG patterns into a single, orchestrated platform. A single system might use Hybrid retrieval, Corrective verification, and Contextual memory. Future RAG systems will look less like search pipelines and more like distributed data operating systems.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Running AI on premises with Postgres</title>
      <dc:creator>NeuronDB Support</dc:creator>
      <pubDate>Thu, 01 Jan 2026 19:30:42 +0000</pubDate>
      <link>https://dev.to/neurondb_support_d73fa7ba/running-ai-on-premises-with-postgres-50g6</link>
      <guid>https://dev.to/neurondb_support_d73fa7ba/running-ai-on-premises-with-postgres-50g6</guid>
      <description>&lt;p&gt;Many AI systems struggle with unpredictable latency and excessive data movement. Documents, embeddings, and vector search often live in different systems, adding hops, cost, and failure modes. This post explains when running vector search and RAG directly in PostgreSQL on-premises makes sense, and how to design it for stable production behavior.&lt;/p&gt;

&lt;p&gt;Project references&lt;br&gt;
NeuronDB site: &lt;a href="https://www.neurondb.ai" rel="noopener noreferrer"&gt;https://www.neurondb.ai&lt;/a&gt;&lt;br&gt;
Source code: &lt;a href="https://github.com/neurondb/neurondb" rel="noopener noreferrer"&gt;https://github.com/neurondb/neurondb&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Decide if you should run on premises
&lt;/h2&gt;

&lt;p&gt;Pick on premises when you must control where data lives. Use it when you need to keep traffic private. Pick it when you must hit a strict latency target. Pick it when costs grow with API calls and egress. If you need a fast setup for a small pilot, start in the cloud, then move the data plane later.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compliance: HIPAA, GDPR, PCI, residency rules, audit rules&lt;/li&gt;
&lt;li&gt;Security: private networks, strict access, limited outbound traffic&lt;/li&gt;
&lt;li&gt;Latency: stable p95 and p99, fewer hops&lt;/li&gt;
&lt;li&gt;Cost: high volume usage, where per-call fees add up&lt;/li&gt;
&lt;li&gt;Control: standard Postgres and a clear ops surface&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Cloud vs on-premises, quick view
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdp9xiv16yov2qd51fx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdp9xiv16yov2qd51fx2.png" alt="On-premises vs cloud AI comparison" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure: Comparison of data flow, latency paths, and operational boundaries between cloud and on-premises AI systems.&lt;/p&gt;

&lt;p&gt;Watch your data movement. In many systems, you fetch documents in one place, run embeddings in another, and run vector search in a third place. Each hop adds latency and failure modes. If you keep these steps within a single network, you reduce variance and debug faster.&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qfx2reli2d0sm5836ll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qfx2reli2d0sm5836ll.png" alt="On-premises AI architecture overview" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
Figure: On-premises AI architecture with documents, embeddings, vector indexes, and retrieval inside PostgreSQL.&lt;/p&gt;

&lt;p&gt;Keep the data plane local. Store documents and metadata in Postgres. Store embeddings next to the rows they describe. Build vector indexes in the same database. Run retrieval queries over private links. Expose results through your app services.&lt;/p&gt;

&lt;p&gt;Keep three paths clear. Ingest is write-heavy. Retrieval is read-heavy. Admin work is rare but sensitive. Split these paths by network rules and by roles.&lt;/p&gt;

&lt;p&gt;Put ingestion on a schedule. Batch it. Keep queries stable. Do not let ad hoc scripts write to the central database. Use a queue or a worker process. Record each run.&lt;/p&gt;
&lt;h2&gt;
  
  
  What you run
&lt;/h2&gt;

&lt;p&gt;Keep the component list short. Assign an owner to each part. If you cannot name the host and the pager, you are not done.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Postgres with NeuronDB for storage, embeddings, indexes, and retrieval&lt;/li&gt;
&lt;li&gt;Ingestion workers for cleaning, chunking, and loads&lt;/li&gt;
&lt;li&gt;Embedding execution on CPU or GPU, batch jobs, steady throughput&lt;/li&gt;
&lt;li&gt;App services that call Postgres and return citations&lt;/li&gt;
&lt;li&gt;Monitoring for latency, load, pool use, lag, and backups&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Deployment patterns
&lt;/h2&gt;

&lt;p&gt;Start simple. Prove retrieval quality. Prove latency. Add resilience only when you need it. Keep changes small so you can reverse them.&lt;/p&gt;
&lt;h2&gt;
  
  
  Single server
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmjdmr6jevaunyzf4ydy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmjdmr6jevaunyzf4ydy.png" alt="Single server deployment diagram" width="800" height="351"&gt;&lt;/a&gt;&lt;br&gt;
Figure: Single-host deployment for early-stage or low-scale workloads.&lt;/p&gt;

&lt;p&gt;Use this for your first release. You get one host to secure. You get one Postgres instance to tune. You get a single failure domain to reason about. Add backups and dashboards before you add more servers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Document content'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embed_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Document content'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'sentence-transformers/all-MiniLM-L6-v2'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;embed_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'query'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'sentence-transformers/all-MiniLM-L6-v2'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add filters early. It keeps results stable. It keeps cost stable. It keeps latency stable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data model and chunking
&lt;/h2&gt;

&lt;p&gt;Store chunks, not whole files. Keep the original document id. Store offsets. Store a version. Keep chunk size stable. Start with 300 to 800 tokens per chunk and a 50 to 150 token overlap. Measure answer quality. Then change one variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc_chunks&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;chunk_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;doc_chunks_tenant_doc_idx&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;doc_chunks&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Track a content hash. It lets you skip re-embedding on retries. It enables you to detect duplicates. Use a text hash or a stable id from your upstream system.&lt;/p&gt;
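The chunking and hashing rules above can be sketched in a few lines. This assumes tokenization has already happened; the dict layout and the 500/100 defaults are illustrative choices inside the suggested ranges:

```python
import hashlib

def chunk_tokens(tokens, size=500, overlap=100):
    """Split a token list into overlapping chunks, keeping offsets.
    size and overlap sit inside the suggested starting ranges
    (300 to 800 tokens per chunk, 50 to 150 token overlap)."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + size]
        text = " ".join(piece)
        # Content hash: skip re-embedding on retries, detect duplicates.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        chunks.append({"offset": start, "text": text, "hash": digest})
        if start + size >= len(tokens):
            break
    return chunks

chunks = chunk_tokens([f"tok{i}" for i in range(1200)])
```

A 1,200-token document with these settings yields three chunks at offsets 0, 400, and 800, each sharing 100 tokens with its neighbor.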

&lt;h2&gt;
  
  
  Hybrid search with metadata and vectors
&lt;/h2&gt;

&lt;p&gt;Filter with metadata, then rank by vector distance. Use this per tenant. Use it per source. Use it per time window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'default'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'acme'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'spam'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;embed_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'query'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'sentence-transformers/all-MiniLM-L6-v2'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Ingestion workflow
&lt;/h2&gt;

&lt;p&gt;Use one workflow. Keep it the same across development, testing, and production. Run it in batches. Track each run. Start with these steps.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch raw documents&lt;/li&gt;
&lt;li&gt;Normalize text, strip boilerplate&lt;/li&gt;
&lt;li&gt;Split into chunks, keep offsets&lt;/li&gt;
&lt;li&gt;Insert rows without embeddings&lt;/li&gt;
&lt;li&gt;Compute embeddings in batches of 32 to 256&lt;/li&gt;
&lt;li&gt;Update embeddings&lt;/li&gt;
&lt;li&gt;Build or refresh indexes&lt;/li&gt;
&lt;li&gt;Run a sample query set, record p95&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Set one target. Ingest 100k chunks in under 30 minutes. Then tune. If you cannot meet that target, reduce the batch size, increase the number of workers, or move the embedding computation to a GPU.&lt;/p&gt;
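As a sanity check on that target, the arithmetic and the batching loop look like this; the batch size of 128 is an illustrative value inside the 32 to 256 range above:

```python
def required_throughput(total_chunks, window_minutes):
    """Chunks per second needed to hit the ingest target."""
    return total_chunks / (window_minutes * 60)

def batches(items, batch_size=128):
    """Yield fixed-size embedding batches (32 to 256 per the text)."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 100k chunks in 30 minutes needs roughly 56 chunks per second.
target = required_throughput(100_000, 30)
```

If your measured embedding rate falls short of the target, that number tells you how many parallel workers you need before you reach for a GPU.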

&lt;h2&gt;
  
  
  Primary and replicas
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3whkxs9oih4dyu4t3x8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3whkxs9oih4dyu4t3x8.png" alt="Multi server cluster diagram" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use this when you need uptime and read scale. Keep writing on the primary. Send retrieval reads to replicas. Use a pooler. Track replication lag. Set a rule for stale reads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;hnsw_create_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'documents'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'documents_embedding_hnsw'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Connection pooling
&lt;/h2&gt;

&lt;p&gt;Use a pooler for app traffic. Set a hard limit on connections. Keep idle connections low. Track pool saturation. Start with 20 to 50 connections per app node. Raise it only after you measure.&lt;/p&gt;
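PgBouncer is one common choice of pooler. A minimal configuration sketch: the option names are real PgBouncer settings, but the database name and values are illustrative starting points to measure against, not prescriptions.

```ini
; Illustrative PgBouncer settings; measure before raising any of these.
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_port = 6432
pool_mode = transaction    ; reuse server connections across clients
default_pool_size = 25     ; within the 20 to 50 per app node guidance
max_client_conn = 500      ; hard ceiling on client connections
```

Transaction pooling keeps the number of real Postgres backends small even when many app pods connect, which is exactly the failure mode the rule above warns about.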

&lt;p&gt;Keep one rule. Do not let each app pod open hundreds of direct connections to Postgres. It will fail under load.&lt;/p&gt;

&lt;h2&gt;
  
  
  Indexing and maintenance
&lt;/h2&gt;

&lt;p&gt;Indexes drift. Stats drift. Tables bloat. Plan for it. Batch ingestion. Refresh stats. Watch index size. Watch vacuum behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check query plans. Do it before and after each major ingest. You want an index scan for retrieval queries. You do not want a full table scan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;embed_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'query'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'sentence-transformers/all-MiniLM-L6-v2'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Replication checks
&lt;/h2&gt;

&lt;p&gt;Track lag. Track replay delay. Set an alert. Use a number. Start with 5 seconds for p95 lag. Use reads from the primary if lag exceeds your limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;application_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;write_lag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;flush_lag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;replay_lag&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_replication&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
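The stale-read rule can then be enforced in the application. A minimal sketch, assuming the lag value is polled from the query above; the 5-second default matches the alert threshold suggested here:

```python
def choose_endpoint(replay_lag_seconds, limit_seconds=5.0):
    """Route a read: use a replica unless replication lag exceeds the
    limit, or the lag is unknown (e.g. the replica stopped reporting)."""
    if replay_lag_seconds is None or replay_lag_seconds > limit_seconds:
        return "primary"
    return "replica"
```

Treating unknown lag the same as high lag is deliberate: a replica that stops reporting should not keep serving reads by default.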



&lt;h2&gt;
  
  
  Sizing
&lt;/h2&gt;

&lt;p&gt;Start with three numbers. Vector count. Embedding dimension. Peak reads per second. Then add headroom. For raw float storage use vectors times dims times 4 bytes. Ten million vectors at 384 dims is about 15.4 GB for floats. Plan for more once you add row overhead and indexes.&lt;/p&gt;

&lt;p&gt;Use a simple table. It keeps planning honest.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 million vectors at 384 dims, about 1.5 GB floats&lt;/li&gt;
&lt;li&gt;10 million vectors at 384 dims, about 15.4 GB floats&lt;/li&gt;
&lt;li&gt;10 million vectors at 768 dims, about 30.7 GB floats&lt;/li&gt;
&lt;/ul&gt;
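The numbers in that table come straight from the raw-float formula, which is worth keeping as a helper so capacity planning stays honest:

```python
def raw_vector_gb(n_vectors, dims, bytes_per_float=4):
    """Raw float storage only (vectors * dims * 4 bytes);
    row overhead and index structures add more on top."""
    return n_vectors * dims * bytes_per_float / 1e9

ten_million_384 = raw_vector_gb(10_000_000, 384)   # about 15.4 GB
```

Remember this is a floor, not an estimate of total disk: tuple headers, TOAST, and HNSW index structures sit on top of it.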

&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpaur5erh7mnhvwxqfb1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpaur5erh7mnhvwxqfb1u.png" alt="Security architecture diagram" width="800" height="408"&gt;&lt;/a&gt;&lt;br&gt;
Figure: Network isolation, role separation, and access control for on-premises AI systems.&lt;/p&gt;

&lt;p&gt;Keep the database private. Restrict inbound. Restrict outbound. Limit roles. Log access. Keep backups protected.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Put the database in private subnets&lt;/li&gt;
&lt;li&gt;Use a bastion or VPN for admin access&lt;/li&gt;
&lt;li&gt;Use TLS on internal links&lt;/li&gt;
&lt;li&gt;Use disk encryption at rest&lt;/li&gt;
&lt;li&gt;Use least privilege roles for apps&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Roles
&lt;/h2&gt;

&lt;p&gt;Create one app role per service. Grant only what it needs. Avoid superuser. Avoid owner roles in apps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;app_reader&lt;/span&gt; &lt;span class="n"&gt;NOINHERIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;CONNECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;postgres&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_reader&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_reader&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_reader&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_reader&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;Start with measurement, not assumptions. Measure query latency, index usage, and embedding throughput under realistic load. Verify that the planner uses vector indexes and that queries avoid full-table scans. Run embedding generation in controlled batches to smooth CPU or GPU usage. Apply relational filters as early as possible to reduce the candidate set before vector ranking. Keep result sets small and predictable. Monitor connection pool saturation continuously, since pool exhaustion often becomes the first bottleneck long before CPU or storage limits are reached.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mean_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;LEFT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_preview&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;mean_exec_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
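&lt;p&gt;To verify that the planner actually uses the vector index rather than a sequential scan, run the hot query under EXPLAIN. The sketch below assumes a &lt;code&gt;documents&lt;/code&gt; table with an &lt;code&gt;embedding&lt;/code&gt; vector column; &lt;code&gt;:'query_vec'&lt;/code&gt; is a psql variable holding the query embedding and is only an illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content
FROM documents
ORDER BY embedding &amp;lt;-&amp;gt; :'query_vec'::vector
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An index scan node over the vector index should appear in the plan. A sequential scan means the index is missing, or not usable for the chosen distance operator.&lt;/p&gt;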



&lt;p&gt;Pick two numbers: retrieval p95 and ingest throughput. Track them daily. Change one thing at a time.&lt;/p&gt;
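&lt;p&gt;Connection pool saturation shows up in &lt;code&gt;pg_stat_activity&lt;/code&gt; long before it shows up in CPU graphs. A quick per-state session count for the current database:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT state, COUNT(*) AS sessions
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state
ORDER BY sessions DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Compare the total against the pooler's pool size. A growing pile of idle-in-transaction sessions usually points at application code holding transactions open.&lt;/p&gt;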

&lt;h2&gt;
  
  
  Backups and recovery
&lt;/h2&gt;

&lt;p&gt;Set RPO and RTO targets. Run restore drills. Write a step-by-step runbook. Test failover in a staging environment. Keep the process repeatable.&lt;/p&gt;

&lt;p&gt;Run a restore drill each month. Time it. Record it. Fix the slow steps. Keep one target: restore your core dataset in under 60 minutes.&lt;/p&gt;
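&lt;p&gt;If WAL archiving backs the recovery plan, check that archiving keeps up. The standard &lt;code&gt;pg_stat_archiver&lt;/code&gt; view reports success and failure counts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT archived_count,
       failed_count,
       last_archived_time,
       last_failed_time
FROM pg_stat_archiver;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A rising &lt;code&gt;failed_count&lt;/code&gt; or a stale &lt;code&gt;last_archived_time&lt;/code&gt; means the next restore drill will miss its RPO.&lt;/p&gt;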

&lt;h2&gt;
  
  
  Migration from the cloud
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrqhmhbf5ovmdcgt15jg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrqhmhbf5ovmdcgt15jg.png" alt="Cloud-to-on-premises migration diagram" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Move the data plane first. Export docs and embeddings. Import into Postgres. Rebuild indexes. Mirror traffic. Compare answers and latency. Cut over with a rollback plan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;hnsw_create_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'documents'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'documents_embedding_hnsw'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
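&lt;p&gt;One way to move the data plane is plain CSV through psql's &lt;code&gt;\copy&lt;/code&gt;, sketched below. The file name is illustrative; embeddings exported in vector text form round-trip through the type's input function.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- On the source side: export ids, text, and embeddings
\copy (SELECT id, content, embedding FROM documents) TO 'documents.csv' WITH (FORMAT csv)

-- On the target side: bulk load into the new table
\copy documents FROM 'documents.csv' WITH (FORMAT csv)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Rebuild the HNSW index only after the bulk load finishes. Loading into an unindexed table and indexing afterwards is usually much faster than loading through a live index.&lt;/p&gt;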



&lt;h2&gt;
  
  
  Cost model
&lt;/h2&gt;

&lt;p&gt;Use break-even months: CapEx divided by the monthly savings (cloud monthly minus on-premises monthly). Include staff time, power, support, and depreciation on the on-premises side. Include egress and API fees on the cloud side.&lt;/p&gt;

&lt;p&gt;Use one example with numbers. Keep it simple.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capex 120000&lt;/li&gt;
&lt;li&gt;Cloud monthly 18000&lt;/li&gt;
&lt;li&gt;On premises monthly 9000&lt;/li&gt;
&lt;li&gt;Break-even months are 120000 divided by the monthly savings of 9000 (18000 minus 9000), about 13.3&lt;/li&gt;
&lt;/ul&gt;
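&lt;p&gt;The same arithmetic as a one-line query, using the numbers above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- CapEx / (cloud monthly - on-premises monthly)
SELECT ROUND(120000.0 / (18000 - 9000), 1) AS break_even_months;
-- returns 13.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;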

&lt;h2&gt;
  
  
  Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Pick a pattern: single server, cluster, hybrid, or edge&lt;/li&gt;
&lt;li&gt;Set targets for p95 latency, QPS, RPO, RTO&lt;/li&gt;
&lt;li&gt;Lock down networks, subnets, firewall, bastion&lt;/li&gt;
&lt;li&gt;Add TLS and disk encryption&lt;/li&gt;
&lt;li&gt;Add a pooler&lt;/li&gt;
&lt;li&gt;Build indexes and check query plans&lt;/li&gt;
&lt;li&gt;Add monitoring and alerts&lt;/li&gt;
&lt;li&gt;Set backups and run a restore drill&lt;/li&gt;
&lt;/ul&gt;
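&lt;p&gt;Two of the checklist items are easy to verify from a psql session using the standard statistics views:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- TLS: is this connection encrypted?
SELECT ssl, version
FROM pg_stat_ssl
WHERE pid = pg_backend_pid();

-- Index usage: are the vector indexes being scanned at all?
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An &lt;code&gt;idx_scan&lt;/code&gt; of zero on an index that should serve production queries is a planner problem, not a data problem.&lt;/p&gt;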

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;On-premises AI works best when the architecture remains simple and close to the data. Keeping embeddings, vector search, and retrieval inside PostgreSQL reduces moving parts and failure modes.&lt;br&gt;
Hybrid SQL plus vector queries deliver control, stable latency, and clear operational boundaries. For teams prioritizing data ownership, predictability, and long-term maintainability, this model fits real production needs.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>postgres</category>
      <category>vectordatabase</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
