Many AI systems struggle with unpredictable latency and excessive data movement. Documents, embeddings, and vector search often live in different systems, adding hops, cost, and failure modes. This post explains when running vector search and RAG directly in PostgreSQL on-premises makes sense, and how to design it for stable production behavior.
Project references
NeuronDB site: https://www.neurondb.ai
Source code: https://github.com/neurondb/neurondb
Decide if you should run on premises
Pick on premises when you must control where data lives, keep traffic private, hit a strict latency target, or contain costs that grow with API calls and egress. If you need a fast setup for a small pilot, start in the cloud, then move the data plane later.
- Compliance: HIPAA, GDPR, PCI, residency rules, audit rules
- Security: private networks, strict access, limited outbound traffic
- Latency: stable p95 and p99, fewer hops
- Cost: high volume usage, where per-call fees add up
- Control: standard Postgres and a clear ops surface
Cloud vs on-premises, quick view
Figure: Comparison of data flow, latency paths, and operational boundaries between cloud and on-premises AI systems.
Watch your data movement. In many systems, you fetch documents in one place, run embeddings in another, and run vector search in a third place. Each hop adds latency and failure modes. If you keep these steps within a single network, you reduce variance and debug faster.
Architecture overview

Figure: On-premises AI architecture with documents, embeddings, vector indexes, and retrieval inside PostgreSQL.
Keep the data plane local. Store documents and metadata in Postgres. Store embeddings next to the rows they describe. Build vector indexes in the same database. Run retrieval queries over private links. Expose results through your app services.
Keep three paths clear. Ingest is write-heavy. Retrieval is read-heavy. Admin work is rare but sensitive. Split these paths by network rules and by roles.
Put ingestion on a schedule. Batch it. Keep queries stable. Do not let ad hoc scripts write to the central database. Use a queue or a worker process. Record each run.
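One way to record each run is a small bookkeeping table that only the ingestion worker writes to; a minimal sketch, with illustrative names:
-- Hypothetical run log for the ingestion worker.
CREATE TABLE ingest_runs (
  run_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  finished_at TIMESTAMPTZ,
  docs_loaded INTEGER,
  status TEXT
);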
What you run
Keep the component list short. Assign an owner to each part. If you cannot name the host and the pager, you are not done.
- Postgres with NeuronDB for storage, embeddings, indexes, and retrieval
- Ingestion workers for cleaning, chunking, and loads
- Embedding execution on CPU or GPU, batch jobs, steady throughput
- App services that call Postgres and return citations
- Monitoring for latency, load, pool use, lag, and backups
Deployment patterns
Start simple. Prove retrieval quality. Prove latency. Add resilience only when you need it. Keep changes small so you can reverse them.
Single server

Figure: Single-host deployment for early-stage or low-scale workloads.
Use this for your first release. You get one host to secure, one Postgres instance to tune, and a single failure domain to reason about. Add backups and dashboards before you add more servers.
CREATE EXTENSION neurondb;
CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT NOT NULL,
  embedding vector(384)
);
INSERT INTO documents (content, embedding)
VALUES ('Document content', embed_text('Document content', 'sentence-transformers/all-MiniLM-L6-v2'));
SELECT content
FROM documents
ORDER BY embedding <=> embed_text('query', 'sentence-transformers/all-MiniLM-L6-v2')
LIMIT 10;
Add filters early. They keep results, cost, and latency stable.
Data model and chunking
Store chunks, not whole files. Keep the original document id. Store offsets. Store a version. Keep chunk size stable. Start with 300 to 800 tokens per chunk. Start with a 50 to 150 token overlap. Measure answer quality. Then change one variable.
CREATE TABLE doc_chunks (
  doc_id BIGINT NOT NULL,
  chunk_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  tenant_id TEXT NOT NULL,
  source TEXT,
  content TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  embedding vector(384)
);
CREATE INDEX doc_chunks_tenant_doc_idx
ON doc_chunks (tenant_id, doc_id);
Track a content hash. It lets you skip re-embedding on retries. It enables you to detect duplicates. Use a text hash or a stable id from your upstream system.
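A minimal sketch using Postgres's built-in md5() (swap in a stronger hash or an upstream id if you have one); the generated column and conflict target are schema choices, not NeuronDB features:
ALTER TABLE doc_chunks
  ADD COLUMN content_hash TEXT GENERATED ALWAYS AS (md5(content)) STORED;
CREATE UNIQUE INDEX doc_chunks_hash_idx
  ON doc_chunks (tenant_id, content_hash);
-- Retries and duplicate feeds become no-ops.
INSERT INTO doc_chunks (doc_id, tenant_id, content)
VALUES (42, 'acme', 'chunk text')
ON CONFLICT (tenant_id, content_hash) DO NOTHING;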
Hybrid search with metadata and vectors
Filter with metadata, then rank by vector distance. Use this per tenant. Use it per source. Use it per time window.
ALTER TABLE documents
ADD COLUMN tenant_id TEXT NOT NULL DEFAULT 'default',
ADD COLUMN source TEXT,
ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT now();
SELECT id, content
FROM documents
WHERE tenant_id = 'acme'
AND (source IS NULL OR source <> 'spam')
ORDER BY embedding <=> embed_text('query', 'sentence-transformers/all-MiniLM-L6-v2')
LIMIT 10;
Ingestion workflow
Use one workflow. Keep it the same across development, testing, and production. Run it in batches. Track each run. Start with these steps.
- Fetch raw documents
- Normalize text, strip boilerplate
- Split into chunks, keep offsets
- Insert rows without embeddings
- Compute embeddings in batches of 32 to 256
- Update embeddings
- Build or refresh indexes
- Run a sample query set, record p95
Set one target. Ingest 100k chunks in under 30 minutes. Then tune. If you cannot meet that target, reduce the batch size, increase the number of workers, or move the embedding computation to a GPU.
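A minimal sketch of one embedding pass, assuming embed_text can be called from an UPDATE; the batch size and ordering are illustrative:
-- Embed one batch of rows that are still missing vectors.
WITH batch AS (
  SELECT chunk_id
  FROM doc_chunks
  WHERE embedding IS NULL
  ORDER BY chunk_id
  LIMIT 256
)
UPDATE doc_chunks d
SET embedding = embed_text(d.content, 'sentence-transformers/all-MiniLM-L6-v2')
FROM batch
WHERE d.chunk_id = batch.chunk_id;
Loop the pass until no rows remain, and record per-pass counts in your run log.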
Primary and replicas
Use this when you need uptime and read scale. Keep writes on the primary. Send retrieval reads to replicas. Use a pooler. Track replication lag. Set a rule for stale reads.
CREATE EXTENSION neurondb;
CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT NOT NULL,
  embedding vector(384)
);
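-- The last two arguments are likely the HNSW build parameters m and ef_construction.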
SELECT hnsw_create_index('documents', 'embedding', 'documents_embedding_hnsw', 16, 200);
Connection pooling
Use a pooler for app traffic. Set a hard limit on connections. Keep idle connections low. Track pool saturation. Start with 20 to 50 connections per app node. Raise it only after you measure.
Keep one rule. Do not let each app pod open hundreds of direct connections to Postgres. It will fail under load.
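To see saturation from the database side, count sessions by state:
SELECT state, count(*) AS sessions
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state
ORDER BY sessions DESC;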
Indexing and maintenance
Indexes drift. Stats drift. Tables bloat. Plan for it. Batch ingestion. Refresh stats. Watch index size. Watch vacuum behavior.
ANALYZE documents;
Check query plans. Do it before and after each major ingest. You want an index scan for retrieval queries. You do not want a full table scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content
FROM documents
ORDER BY embedding <=> embed_text('query', 'sentence-transformers/all-MiniLM-L6-v2')
LIMIT 10;
Replication checks
Track lag. Track replay delay. Set an alert. Use a number. Start with 5 seconds for p95 lag. Use reads from the primary if lag exceeds your limit.
SELECT
  application_name,
  state,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;
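To turn the 5 second rule into an alert query, filter on replay_lag directly:
SELECT application_name, replay_lag
FROM pg_stat_replication
WHERE replay_lag > interval '5 seconds';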
Sizing
Start with three numbers. Vector count. Embedding dimension. Peak reads per second. Then add headroom. For raw float storage, estimate vectors × dims × 4 bytes. Ten million vectors at 384 dims is about 15.4 GB for floats. Plan for more once you add row overhead and indexes.
Use a simple table. It keeps planning honest.
- 1 million vectors at 384 dims, about 1.5 GB floats
- 10 million vectors at 384 dims, about 15.4 GB floats
- 10 million vectors at 768 dims, about 30.7 GB floats
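The float math is easy to reproduce, and Postgres can report the real footprint once row overhead and indexes are included:
-- Raw float estimate for 10 million vectors at 384 dims.
SELECT 10000000::bigint * 384 * 4 / 1e9 AS estimated_gb;
-- Actual on-disk size of the table plus its indexes.
SELECT pg_size_pretty(pg_total_relation_size('documents')) AS actual_size;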
Security

Figure: Network isolation, role separation, and access control for on-premises AI systems.
Keep the database private. Restrict inbound. Restrict outbound. Limit roles. Log access. Keep backups protected.
- Put the database in private subnets
- Use a bastion or VPN for admin access
- Use TLS on internal links
- Use disk encryption at rest
- Use least privilege roles for apps
Roles
Create one app role per service. Grant only what it needs. Avoid superuser. Avoid owner roles in apps.
CREATE ROLE app_reader LOGIN NOINHERIT;
GRANT CONNECT ON DATABASE postgres TO app_reader;
GRANT USAGE ON SCHEMA public TO app_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO app_reader;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO app_reader;
Performance
Start with measurement, not assumptions. Measure query latency, index usage, and embedding throughput under realistic load. Verify that the planner uses vector indexes and that queries avoid full-table scans. Run embedding generation in controlled batches to smooth CPU or GPU usage. Apply relational filters as early as possible to reduce the candidate set before vector ranking. Keep result sets small and predictable. Monitor connection pool saturation continuously, since pool exhaustion often becomes the first bottleneck long before CPU or storage limits are reached.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SELECT
  calls,
  ROUND(mean_exec_time::numeric, 2) AS mean_ms,
  ROUND(max_exec_time::numeric, 2) AS max_ms,
  LEFT(query, 120) AS query_preview
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
Pick two numbers. Retrieval p95. Ingest throughput. Track them daily. Change one thing at a time.
Backups and recovery
Set RPO and RTO. Run restore drills. Write a runbook. Test failover in a staging environment. Keep the process repeatable.
Run a restore drill each month. Time it. Record it. Fix the slow steps. Keep one target. Restore your core dataset in under 60 minutes.
Migration from the cloud
Move the data plane first. Export docs and embeddings. Import into Postgres. Rebuild indexes. Mirror traffic. Compare answers and latency. Cut over with a rollback plan.
CREATE TABLE documents (
  id BIGINT PRIMARY KEY,
  content TEXT,
  embedding vector(384)
);
SELECT hnsw_create_index('documents', 'embedding', 'documents_embedding_hnsw', 16, 200);
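For the import step itself, COPY is usually the fastest path. A sketch, assuming a CSV export where each vector is serialized as a bracketed literal the vector type can parse:
-- Run the load before building the HNSW index; bulk loads into an
-- indexed table are slower. Use psql's \copy if the file lives on the client.
COPY documents (id, content, embedding)
FROM '/import/documents.csv'
WITH (FORMAT csv, HEADER true);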
Cost model
Use break-even months: CapEx divided by the monthly savings, that is, cloud monthly minus on-premises monthly. Include staff time, power, support, and depreciation. Include egress and API fees on the cloud side.
Use one example with numbers. Keep it simple.
- CapEx: 120,000
- Cloud monthly: 18,000
- On-premises monthly: 9,000
- Break-even: 120,000 / (18,000 - 9,000), about 13.3 months
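The same arithmetic as a one-line sanity check:
SELECT round(120000.0 / (18000 - 9000), 1) AS break_even_months;  -- 13.3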
Checklist
- Pick a pattern: single server, cluster, hybrid, or edge
- Set targets for p95 latency, QPS, RPO, RTO
- Lock down networks, subnets, firewall, bastion
- Add TLS and disk encryption
- Add a pooler
- Build indexes and check query plans
- Add monitoring and alerts
- Set backups and run a restore drill
Conclusion
On-premises AI works best when the architecture remains simple and close to the data. Keeping embeddings, vector search, and retrieval inside PostgreSQL reduces moving parts and failure modes.
Hybrid SQL plus vector queries deliver control, stable latency, and clear operational boundaries. For teams prioritizing data ownership, predictability, and long-term maintainability, this model fits real production needs.