<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: NeuronDB Support</title>
    <description>The latest articles on DEV Community by NeuronDB Support (@neurondb_support_d73fa7ba).</description>
    <link>https://dev.to/neurondb_support_d73fa7ba</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3688837%2Fb6baa658-3a8c-47fa-8fcc-30315baa5bc7.png</url>
      <title>DEV Community: NeuronDB Support</title>
      <link>https://dev.to/neurondb_support_d73fa7ba</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/neurondb_support_d73fa7ba"/>
    <language>en</language>
    <item>
      <title>NeuronDB Vector vs pgvector: Technical Comparison</title>
      <dc:creator>NeuronDB Support</dc:creator>
      <pubDate>Sun, 01 Feb 2026 15:35:44 +0000</pubDate>
      <link>https://dev.to/neurondb_support_d73fa7ba/neurondb-vector-vs-pgvector-technical-comparison-4mmh</link>
      <guid>https://dev.to/neurondb_support_d73fa7ba/neurondb-vector-vs-pgvector-technical-comparison-4mmh</guid>
      <description>&lt;p&gt;You store embeddings as vectors. You run a similarity search inside PostgreSQL. Two extensions matter in this space: pgvector and NeuronDB. This post compares both extensions using real behavior from the source trees in this repo. Every limit and object name matches code and SQL.&lt;/p&gt;

&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Vector databases store embeddings. Similarity search ranks rows by distance. PostgreSQL extensions bring vector types, distance operators, and index access methods into the SQL layer.&lt;/p&gt;

&lt;p&gt;You choose pgvector when you want a focused extension with broad adoption. You choose NeuronDB when you want pgvector-style SQL plus additional types, GPU paths, and operational surface area inside the extension.&lt;/p&gt;

&lt;p&gt;This post uses pgvector v0.8.1 semantics and NeuronDB v3.0.0-devel semantics from the local source. Feature parity varies by object. Some pieces match one-to-one. Some pieces use different names with aliases.&lt;/p&gt;

&lt;p&gt;Project references&lt;br&gt;
NeuronDB site: &lt;a href="https://www.neurondb.ai" rel="noopener noreferrer"&gt;https://www.neurondb.ai&lt;/a&gt;&lt;br&gt;
Source code: &lt;a href="https://github.com/neurondb/neurondb" rel="noopener noreferrer"&gt;https://github.com/neurondb/neurondb&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Architecture&lt;/h2&gt;

&lt;p&gt;Architectural choices define performance limits and feature capabilities.&lt;/p&gt;
&lt;h3&gt;pgvector Architecture&lt;/h3&gt;

&lt;p&gt;pgvector implements types, operators, and index access methods in C and SQL. The extension exposes a small surface area and relies on PostgreSQL storage, WAL, and query planning.&lt;/p&gt;
&lt;h4&gt;Type system and layouts&lt;/h4&gt;

&lt;p&gt;pgvector defines these public types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vector&lt;/code&gt;: dense float32 vector, up to 16000 dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;halfvec&lt;/code&gt;: half precision vector, up to 16000 dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sparsevec&lt;/code&gt;: sparse vector with int32 indices and float32 values, limits depend on operation.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bit&lt;/code&gt;: PostgreSQL &lt;code&gt;bit&lt;/code&gt; type, used as a binary vector for Hamming and Jaccard distance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;vector&lt;/code&gt; type uses a varlena header plus two int16 fields: &lt;code&gt;dim&lt;/code&gt; and &lt;code&gt;unused&lt;/code&gt;. The payload stores &lt;code&gt;dim&lt;/code&gt; float32 values. This layout matches &lt;code&gt;typedef struct Vector&lt;/code&gt; in pgvector and yields storage of 4 bytes per dimension plus 8 bytes of header per vector.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;sparsevec&lt;/code&gt; on disk format stores &lt;code&gt;dim&lt;/code&gt; (int32), &lt;code&gt;nnz&lt;/code&gt; (int32), and &lt;code&gt;unused&lt;/code&gt; (int32) in the header, followed by &lt;code&gt;nnz&lt;/code&gt; int32 indices. Values follow indices as a contiguous float32 array.&lt;/p&gt;
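&lt;p&gt;These layouts can be checked from psql with &lt;code&gt;pg_column_size&lt;/code&gt;. A minimal sketch, assuming the extension is installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- vector: 4 bytes per dimension plus an 8 byte header, so 3 * 4 + 8 = 20
SELECT pg_column_size('[1,2,3]'::vector)\g

-- sparsevec: header fields plus one int32 index and one float32 value per nonzero
SELECT pg_column_size('{1:1,3:2}/5'::sparsevec)\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;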
&lt;h4&gt;CPU dispatch&lt;/h4&gt;

&lt;p&gt;pgvector uses a mix of scalar code and CPU dispatch. On Linux x86_64, some functions compile with &lt;code&gt;target_clones&lt;/code&gt; to generate multiple code paths. The code selects a path based on CPU capabilities. This approach appears in &lt;code&gt;vector.c&lt;/code&gt; and &lt;code&gt;bitutils.c&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;NeuronDB Architecture&lt;/h3&gt;

&lt;p&gt;NeuronDB implements vector types, operators, access methods, and additional systems within a single extension. The extension defines types beyond pgvector, adds IVF under the access method name &lt;code&gt;ivf&lt;/code&gt;, and includes GPU backends.&lt;/p&gt;
&lt;h4&gt;Type system and layouts&lt;/h4&gt;

&lt;p&gt;NeuronDB exposes pgvector-style types, plus additional NeuronDB-specific types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vector&lt;/code&gt;: dense float32 vector with &lt;code&gt;dim&lt;/code&gt; int16 and an &lt;code&gt;unused&lt;/code&gt; int16 field, up to 16000 dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vectorp&lt;/code&gt;: packed vector with metadata. The layout includes a CRC32 fingerprint, a version, a dimension, and an endian guard, followed by float32 data.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vecmap&lt;/code&gt;: sparse high-dimensional map. The layout stores &lt;code&gt;total_dim&lt;/code&gt; and &lt;code&gt;nnz&lt;/code&gt;, followed by parallel int32 indices and float32 values.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;halfvec&lt;/code&gt;: half precision vector with a 4000 dimension limit in NeuronDB.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sparsevec&lt;/code&gt;: sparse vector type with 1000 nonzero entries and 1M dimensions in NeuronDB.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;binaryvec&lt;/code&gt;: binary vector type with a Hamming distance operator.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NeuronDB also defines internal structs for quantized vectors, such as int8, int4, and binary-packed representations. The SQL surface exposes conversion and distance functions.&lt;/p&gt;
&lt;h4&gt;Index access methods&lt;/h4&gt;

&lt;p&gt;NeuronDB defines two ANN index access methods as PostgreSQL access methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;hnsw&lt;/code&gt;: HNSW index access method.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ivf&lt;/code&gt;: IVF index access method.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NeuronDB also defines operator classes for &lt;code&gt;vector&lt;/code&gt;, &lt;code&gt;halfvec&lt;/code&gt;, &lt;code&gt;sparsevec&lt;/code&gt;, &lt;code&gt;bit&lt;/code&gt;, and &lt;code&gt;binaryvec&lt;/code&gt; in both &lt;code&gt;hnsw&lt;/code&gt; and &lt;code&gt;ivf&lt;/code&gt;.&lt;/p&gt;
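&lt;p&gt;Because the operator classes mirror pgvector naming, a half precision HNSW index looks the same in both extensions. A sketch; the &lt;code&gt;docs&lt;/code&gt; table is hypothetical, and the &lt;code&gt;halfvec_l2_ops&lt;/code&gt; name follows the pgvector convention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- half precision column indexed with HNSW under L2 distance
CREATE INDEX docs_embedding_hnsw_halfvec
ON docs
USING hnsw (embedding halfvec_l2_ops)\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;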
&lt;h4&gt;CPU SIMD and GPU backends&lt;/h4&gt;

&lt;p&gt;NeuronDB includes explicit AVX2 and AVX-512 implementations of common distance functions in &lt;code&gt;vector_distance_simd.c&lt;/code&gt;. The build selects the compiled path based on compiler flags.&lt;/p&gt;

&lt;p&gt;NeuronDB includes three GPU backend families in the tree:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CUDA&lt;/li&gt;
&lt;li&gt;ROCm&lt;/li&gt;
&lt;li&gt;Metal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The runtime backend selection logic maps each backend type to the names &lt;code&gt;cuda&lt;/code&gt;, &lt;code&gt;rocm&lt;/code&gt;, and &lt;code&gt;metal&lt;/code&gt;. GPU entry points for HNSW and IVF kNN search are exposed as SQL functions.&lt;/p&gt;
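&lt;p&gt;A hedged sketch of the GPU entry points; the signatures match the helper functions listed later in Table 3, and the result shape depends on the build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- HNSW kNN on the GPU: query vector, k, ef_search
SELECT * FROM hnsw_knn_search_gpu('[1,2,3]'::vector, 10, 100)\g

-- IVF kNN on the GPU: query vector, k, nprobe
SELECT * FROM ivf_knn_search_gpu('[1,2,3]'::vector, 10, 8)\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;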
&lt;h2&gt;Feature Comparison&lt;/h2&gt;

&lt;p&gt;Both extensions integrate with PostgreSQL. NeuronDB adds operational features.&lt;/p&gt;
&lt;h3&gt;Table 1: Types, distances, indexes, and hard limits&lt;/h3&gt;

&lt;p&gt;This table focuses on public SQL objects and hard limits enforced by each project.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;pgvector&lt;/th&gt;
&lt;th&gt;NeuronDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Extension name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;vector&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;neurondb&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dense type&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;vector&lt;/code&gt; (float32), max 16000 dims&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;vector&lt;/code&gt; (float32), max 16000 dims&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Half type&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;halfvec&lt;/code&gt; (half), max 16000 dims&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;halfvec&lt;/code&gt; (FP16), max 4000 dims&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sparse type&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sparsevec&lt;/code&gt; (dim int32, nnz int32, indices int32, values float32), max 1e9 dims, max 16000 nnz&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sparsevec&lt;/code&gt;, max 1M dims, max 1000 nonzero entries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary vector&lt;/td&gt;
&lt;td&gt;PostgreSQL &lt;code&gt;bit&lt;/code&gt; plus pgvector operators&lt;/td&gt;
&lt;td&gt;PostgreSQL &lt;code&gt;bit&lt;/code&gt; operator classes plus &lt;code&gt;binaryvec&lt;/code&gt; type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distance operators&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;-&amp;gt;&lt;/code&gt; L2, &lt;code&gt;&amp;lt;#&amp;gt;&lt;/code&gt; negative inner product, &lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt; cosine, &lt;code&gt;&amp;lt;+&amp;gt;&lt;/code&gt; L1, &lt;code&gt;&amp;lt;~&amp;gt;&lt;/code&gt; Hamming, &lt;code&gt;&amp;lt;%&amp;gt;&lt;/code&gt; Jaccard&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;-&amp;gt;&lt;/code&gt; L2, &lt;code&gt;&amp;lt;#&amp;gt;&lt;/code&gt; negative inner product, &lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt; cosine, &lt;code&gt;&amp;lt;+&amp;gt;&lt;/code&gt; L1, &lt;code&gt;&amp;lt;~&amp;gt;&lt;/code&gt; Hamming, Jaccard via &lt;code&gt;vector_jaccard_distance(vector, vector)&lt;/code&gt; and &lt;code&gt;&amp;lt;%&amp;gt;&lt;/code&gt; for &lt;code&gt;bit&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ANN access methods&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hnsw&lt;/code&gt;, &lt;code&gt;ivfflat&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hnsw&lt;/code&gt;, &lt;code&gt;ivf&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dense index max dims&lt;/td&gt;
&lt;td&gt;2000 for HNSW and IVFFlat&lt;/td&gt;
&lt;td&gt;limited by page layout; index builds with large dimensions fail with a page size error&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
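&lt;p&gt;The shared operators in the table above can be exercised directly. This sketch runs unchanged against either extension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- l2 is sqrt(27), neg_inner_product is -32, l1 is 9
SELECT '[1,2,3]'::vector &amp;lt;-&amp;gt; '[4,5,6]'::vector AS l2,
       '[1,2,3]'::vector &amp;lt;#&amp;gt; '[4,5,6]'::vector AS neg_inner_product,
       '[1,2,3]'::vector &amp;lt;=&amp;gt; '[4,5,6]'::vector AS cosine,
       '[1,2,3]'::vector &amp;lt;+&amp;gt; '[4,5,6]'::vector AS l1\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;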
&lt;h3&gt;Table 2: Tuning knobs, defaults, and where each knob lives&lt;/h3&gt;

&lt;p&gt;This table lists knobs. Each knob changes recall, latency, or build time. The table also lists the location of each knob.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Knob&lt;/th&gt;
&lt;th&gt;pgvector&lt;/th&gt;
&lt;th&gt;NeuronDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HNSW &lt;code&gt;m&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;index option &lt;code&gt;WITH (m = N)&lt;/code&gt;, default 16&lt;/td&gt;
&lt;td&gt;index option &lt;code&gt;WITH (m = N)&lt;/code&gt;, default 16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HNSW &lt;code&gt;ef_construction&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;index option &lt;code&gt;WITH (ef_construction = N)&lt;/code&gt;, default 64&lt;/td&gt;
&lt;td&gt;index option &lt;code&gt;WITH (ef_construction = N)&lt;/code&gt;, default 200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HNSW &lt;code&gt;ef_search&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;hnsw.ef_search&lt;/code&gt;, default 40&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;neurondb.hnsw_ef_search&lt;/code&gt;, default 64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HNSW iterative scans&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;hnsw.iterative_scan&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;neurondb.hnsw_iterative_scan&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HNSW scan stop&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;hnsw.max_scan_tuples&lt;/code&gt; and &lt;code&gt;hnsw.scan_mem_multiplier&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;neurondb.hnsw_max_scan_tuples&lt;/code&gt; and &lt;code&gt;neurondb.hnsw_scan_mem_multiplier&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IVF lists&lt;/td&gt;
&lt;td&gt;index option &lt;code&gt;WITH (lists = N)&lt;/code&gt; on &lt;code&gt;ivfflat&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;index option &lt;code&gt;WITH (lists = N)&lt;/code&gt; on &lt;code&gt;ivf&lt;/code&gt;, and NeuronDB maps &lt;code&gt;ivfflat&lt;/code&gt; to &lt;code&gt;ivf&lt;/code&gt; in helper functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IVF probes&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;ivfflat.probes&lt;/code&gt;, default 1&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;neurondb.ivf_probes&lt;/code&gt;, default 10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IVF iterative scans&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;ivfflat.iterative_scan&lt;/code&gt; and &lt;code&gt;ivfflat.max_probes&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;GUC &lt;code&gt;neurondb.ivf_iterative_scan&lt;/code&gt; and &lt;code&gt;neurondb.ivf_max_probes&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
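&lt;p&gt;Every query time knob above is an ordinary GUC, so &lt;code&gt;SET LOCAL&lt;/code&gt; scopes it to one transaction. A sketch with the pgvector names against a hypothetical &lt;code&gt;items&lt;/code&gt; table; swap in the &lt;code&gt;neurondb.*&lt;/code&gt; equivalents for NeuronDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN\g
SET LOCAL hnsw.ef_search = 200\g
SELECT id FROM items ORDER BY embedding &amp;lt;-&amp;gt; '[3,1,2]'::vector LIMIT 10\g
COMMIT\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;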
&lt;h3&gt;Table 3: Acceleration and storage formats&lt;/h3&gt;

&lt;p&gt;This table covers CPU SIMD, GPU backends, and compressed formats.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;pgvector&lt;/th&gt;
&lt;th&gt;NeuronDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU vector dispatch&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;target_clones&lt;/code&gt; dispatch on supported builds&lt;/td&gt;
&lt;td&gt;explicit AVX2 and AVX-512 distance functions, selected by build flags&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU backends&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;CUDA, ROCm, Metal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU kNN helpers&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hnsw_knn_search_gpu(query vector, k int, ef_search int)&lt;/code&gt; and &lt;code&gt;ivf_knn_search_gpu(query vector, k int, nprobe int)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Packed dense format&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;vectorp&lt;/code&gt; with CRC32 fingerprint, version, endian guard, and float32 data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sparse high dim format&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sparsevec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;vecmap&lt;/code&gt; and NeuronDB &lt;code&gt;sparsevec&lt;/code&gt; type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quantized internal types&lt;/td&gt;
&lt;td&gt;binary quantization via &lt;code&gt;binary_quantize&lt;/code&gt; to &lt;code&gt;bit&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;int8, int4, binary, and FP16 quantization in type and function layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
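&lt;p&gt;pgvector's &lt;code&gt;binary_quantize&lt;/code&gt; maps each positive component to 1 and every other component to 0, producing a &lt;code&gt;bit&lt;/code&gt; value that works with the Hamming operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- positive components become 1, so this yields 101
SELECT binary_quantize('[1,-2,3]'::vector)\g

-- Hamming distance between two quantized vectors, here 1
SELECT binary_quantize('[1,-2,3]'::vector) &amp;lt;~&amp;gt; binary_quantize('[1,2,3]'::vector)\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;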
&lt;h2&gt;Production Readiness&lt;/h2&gt;

&lt;p&gt;Production systems need repeatable behavior, clear configuration, and a monitoring path. NeuronDB ships extra primitives for tenant controls, queue-based workflows, and metrics export.&lt;/p&gt;

&lt;p&gt;NeuronDB includes these operational surfaces in SQL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tenant usage tables and quota tracking&lt;/li&gt;
&lt;li&gt;background worker tables and manual triggers&lt;/li&gt;
&lt;li&gt;Prometheus compatible metrics via SQL, plus an HTTP exporter endpoint&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Performance&lt;/h2&gt;

&lt;p&gt;Performance depends on dataset shape, index parameters, storage layout, and query patterns. Use the benchmark scripts in this repo to measure your hardware and build.&lt;/p&gt;
&lt;h2&gt;Benchmarks&lt;/h2&gt;

&lt;p&gt;The repository includes benchmark scripts and SQL stress tests. Use these tools to compare pgvector and NeuronDB on your own system.&lt;/p&gt;
&lt;h3&gt;Vector benchmark suite&lt;/h3&gt;

&lt;p&gt;The vector benchmark suite downloads public ANN datasets, loads them into PostgreSQL, builds indexes, runs queries, and writes JSON results.&lt;/p&gt;

&lt;p&gt;Run the full pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 NeuronDB/benchmark/vector/run_bm.py &lt;span class="nt"&gt;--prepare&lt;/span&gt; &lt;span class="nt"&gt;--load&lt;/span&gt; &lt;span class="nt"&gt;--run&lt;/span&gt; &lt;span class="nt"&gt;--datasets&lt;/span&gt; sift-128-euclidean &lt;span class="nt"&gt;--configs&lt;/span&gt; hnsw &lt;span class="nt"&gt;--k-values&lt;/span&gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run a quick pipeline with defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 NeuronDB/benchmark/vector/run_bm.py &lt;span class="nt"&gt;--prepare&lt;/span&gt; &lt;span class="nt"&gt;--load&lt;/span&gt; &lt;span class="nt"&gt;--run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Stress tests&lt;/h3&gt;

&lt;p&gt;The repo includes SQL stress suites for pgvector and NeuronDB.&lt;/p&gt;

&lt;p&gt;pgvector stress suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="n"&gt;NeuronDB&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;pgvector_stress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NeuronDB stress suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="n"&gt;NeuronDB&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;neurondb_vector_stress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Example result from committed artifact&lt;/h3&gt;

&lt;p&gt;This example comes from &lt;code&gt;NeuronDB/benchmark/vector/results/benchmark_sift-128-euclidean_hnsw_20260104_211033.json&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Dataset: sift-128-euclidean
Train vectors: 1000000
Test queries: 10000
Dimension: 128
Index: hnsw, m 16, ef_construction 200
Query: k 10, ef_search 100
Average latency ms: 512.9799604415894
P95 latency ms: 521.4026927947998
QPS: 1.9493938888746618
Recall: 1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Practical usage&lt;/h2&gt;

&lt;p&gt;This section focuses on repeatable workflows. Each workflow uses real object names from both projects.&lt;/p&gt;

&lt;h3&gt;Basic table and query pattern&lt;/h3&gt;

&lt;p&gt;Use a fixed-dimension column (a typmod such as &lt;code&gt;vector(3)&lt;/code&gt;) when a single embedding model feeds it. Use a column without a typmod when multiple embedding models share one column.&lt;/p&gt;

&lt;p&gt;Example with a fixed dimension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[4,5,6]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'[3,1,2]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;l2_distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;l2_distance&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
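&lt;p&gt;A sketch of the typmod-less variant. Without a declared dimension the column accepts mixed sizes, and an ANN index then needs a fixed dimension, for example through a cast in a partial expression index. The table, model, and dimension here are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE multi_model_items (
    id bigint PRIMARY KEY,
    model text NOT NULL,
    embedding vector  -- no typmod, dimension may vary per row
)\g

-- one partial index per model keeps each index at a fixed dimension
CREATE INDEX multi_model_items_minilm_hnsw
ON multi_model_items
USING hnsw ((embedding::vector(384)) vector_l2_ops)
WHERE model = 'minilm'\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;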



&lt;h3&gt;Indexing with HNSW&lt;/h3&gt;

&lt;p&gt;HNSW uses a graph. You trade recall for speed by changing the candidate list size during search.&lt;/p&gt;

&lt;p&gt;pgvector uses these knobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;index reloptions: &lt;code&gt;m&lt;/code&gt;, &lt;code&gt;ef_construction&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;query time GUC: &lt;code&gt;hnsw.ef_search&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NeuronDB uses these knobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;index reloptions: &lt;code&gt;m&lt;/code&gt;, &lt;code&gt;ef_construction&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;query time GUC: &lt;code&gt;neurondb.hnsw_ef_search&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;HNSW index creation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;items_embedding_hnsw_l2&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_l2_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ef_construction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query time tuning with pgvector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ef_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query time tuning with NeuronDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hnsw_ef_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Indexing with IVF&lt;/h3&gt;

&lt;p&gt;IVF uses lists. Search probes determine how many lists participate in a query.&lt;/p&gt;

&lt;p&gt;pgvector uses the access method name &lt;code&gt;ivfflat&lt;/code&gt; and uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;index reloption: &lt;code&gt;lists&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;query time GUC: &lt;code&gt;ivfflat.probes&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NeuronDB uses the access method name &lt;code&gt;ivf&lt;/code&gt; and uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;index reloption: &lt;code&gt;lists&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;query time GUC: &lt;code&gt;neurondb.ivf_probes&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;IVF index creation in pgvector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;items_embedding_ivfflat_l2&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;ivfflat&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_l2_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;IVF index creation in NeuronDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;items_embedding_ivf_l2&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;ivf&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_l2_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query time tuning with pgvector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;ivfflat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;probes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query time tuning with NeuronDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ivf_probes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Filtered search&lt;/h3&gt;

&lt;p&gt;Filtered kNN queries need two things: a filter predicate and an ordered distance sort with a limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'[3,1,2]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;l2_distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;l2_distance&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For approximate indexes, filtering happens after index traversal. Both extensions provide iterative index scans that extend the scan when the filter removes rows from the first pass.&lt;/p&gt;

&lt;p&gt;Iterative scans for pgvector HNSW:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iterative_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;strict_order&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_scan_tuples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20000&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scan_mem_multiplier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Iterative scans for NeuronDB HNSW:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hnsw_iterative_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;strict_order&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hnsw_max_scan_tuples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20000&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hnsw_scan_mem_multiplier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  NeuronDB packed and sparse formats
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;vectorp&lt;/code&gt; when you want a packed dense format with metadata. Use &lt;code&gt;vecmap&lt;/code&gt; for sparse high-dimensional inputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;packed_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vectorp&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;packed_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vectorp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;packed_items&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sparse_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vecmap&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quantization workflows
&lt;/h2&gt;

&lt;p&gt;Quantization trades precision for smaller storage and faster scans. NeuronDB exposes multiple quantization functions. Each function returns a &lt;code&gt;bytea&lt;/code&gt; representation.&lt;/p&gt;

&lt;p&gt;CPU quantization examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;vector_to_int8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;q_int8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vector_to_fp16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;q_fp16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vector_to_binary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;q_binary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vector_to_int4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;q_int4&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Accuracy analysis examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;quantize_analyze_int8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;int8_stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantize_analyze_fp16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;fp16_stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantize_analyze_binary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;binary_stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantize_analyze_int4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;int4_stats&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  GPU workflows in NeuronDB
&lt;/h2&gt;

&lt;p&gt;NeuronDB exposes GPU status, GPU distance functions, and GPU kNN helpers in SQL.&lt;/p&gt;

&lt;p&gt;GPU initialization and status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;neurondb_gpu_enable&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;gpu_enabled&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_memory_mb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;free_memory_mb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_available&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;neurondb_gpu_info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPU distance functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;vector_l2_distance_gpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[4,5,6]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;l2_gpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vector_cosine_distance_gpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[4,5,6]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cosine_gpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vector_inner_product_gpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[4,5,6]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ip_gpu&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPU kNN helpers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;hnsw_knn_search_gpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ivf_knn_search_gpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'[1,2,3]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPU usage stats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;queries_executed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fallback_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_gpu_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_cpu_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;avg_latency_ms&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;neurondb_gpu_stats&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Multi-tenant controls in NeuronDB
&lt;/h2&gt;

&lt;p&gt;NeuronDB includes tenant quota tracking and tenant-specific helper functions. These objects live in the &lt;code&gt;neurondb&lt;/code&gt; schema.&lt;/p&gt;

&lt;p&gt;Tenant quota tables and views support workflows such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;enforce per tenant vector count limits&lt;/li&gt;
&lt;li&gt;track per tenant storage usage&lt;/li&gt;
&lt;li&gt;query tenant usage and quota percent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NeuronDB includes tenant-aware HNSW helper functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;hnsw_tenant_create&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hnsw_tenant_search&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hnsw_tenant_quota&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
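&lt;p&gt;As a sketch, the helpers above combine into a tenant-scoped workflow. The function names come from the extension, but the argument shapes below (tenant id, query vector, result count) are assumptions; verify the real signatures with &lt;code&gt;\df hnsw_tenant_*&lt;/code&gt; before relying on them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical signatures: check \df hnsw_tenant_* in your installation.
SELECT hnsw_tenant_create('items', 'embedding', 42) AS created\g

SELECT id, distance
FROM hnsw_tenant_search(42, '[1,2,3]'::vector, 10)\g

SELECT hnsw_tenant_quota(42) AS quota_state\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;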

&lt;h2&gt;
  
  
  Monitoring in NeuronDB
&lt;/h2&gt;

&lt;p&gt;NeuronDB exposes Prometheus-compatible metrics via SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;queries_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;queries_success&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;queries_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query_duration_sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vectors_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cache_hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cache_misses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;workers_active&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;neurondb_prometheus_metrics&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Background workers and queues in NeuronDB
&lt;/h2&gt;

&lt;p&gt;NeuronDB stores the queue and metrics state in SQL tables under the &lt;code&gt;neurondb&lt;/code&gt; schema. Background workers process those queues and sample those metrics when the workers are enabled in the PostgreSQL configuration.&lt;/p&gt;

&lt;p&gt;Queue workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_queue&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'{"text":"hello"}'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;job_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retry_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_queue&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Manual trigger helpers exist for testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;neuranq_run_once&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;queued_work&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;neuranmon_sample&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;tuner_sample&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;neurandefrag_run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;defrag_ran&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  LLM configuration and jobs in NeuronDB
&lt;/h2&gt;

&lt;p&gt;NeuronDB stores LLM provider configuration in &lt;code&gt;neurondb.llm_config&lt;/code&gt; and stores jobs in &lt;code&gt;neurondb.llm_jobs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Configuration workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_llm_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'https://api-inference.huggingface.co'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'REPLACE_WITH_KEY'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'REPLACE_WITH_MODEL'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;api_base&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_llm_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Job enqueue workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ndb_llm_enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'embed'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'REPLACE_WITH_MODEL'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'hello world'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'tenant0'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Index tuning helpers in NeuronDB
&lt;/h2&gt;

&lt;p&gt;NeuronDB exposes index tuning and diagnostics helpers in SQL. These helpers return JSONB.&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;index_tune_hnsw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'items'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hnsw_recommendation&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;index_tune_ivf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'items'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ivf_recommendation&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;index_recommend_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'items'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_choice&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;index_tune_query_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'items_embedding_hnsw_l2'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_knobs&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Migration
&lt;/h2&gt;

&lt;p&gt;Migration replaces one extension with the other. Be careful with the &lt;code&gt;CASCADE&lt;/code&gt; step: &lt;code&gt;DROP EXTENSION vector CASCADE&lt;/code&gt; drops every dependent object, which includes ANN indexes built on pgvector opclasses and, in stock PostgreSQL, table columns whose type belongs to the extension. Dump or convert embedding columns first unless you have verified that your column types survive the drop.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Drop pgvector
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="k"&gt;CASCADE&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Install NeuronDB
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Verify data
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;row_count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rows in surviving columns remain, but any ANN indexes were dropped along with pgvector, so you still need to rebuild them.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Recreate indexes
For large tables (&amp;gt;100GB), increase &lt;code&gt;maintenance_work_mem&lt;/code&gt; before building.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;maintenance_work_mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'4GB'&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_l2_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ef_construction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;GPU
Use the NeuronDB GPU functions and settings shown earlier. The SQL surface exposes GPU kNN functions for both HNSW and IVF.&lt;/li&gt;
&lt;/ol&gt;
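&lt;p&gt;As a quick post-migration check, you can exercise the GPU path with the functions shown earlier in this post; a sketch, assuming a GPU is present (the extension reports fallbacks via &lt;code&gt;neurondb_gpu_stats()&lt;/code&gt; otherwise):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Enable the GPU path, then run a GPU kNN search.
-- The last two arguments follow the earlier example in this post.
SELECT neurondb_gpu_enable() AS gpu_enabled\g

SELECT id, distance
FROM hnsw_knn_search_gpu('[1,2,3]'::vector, 10, 100)\g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;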

&lt;h2&gt;
  
  
  Use Case Recommendations
&lt;/h2&gt;

&lt;p&gt;Match the tool to your infrastructure and requirements: start from your workload, then map your constraints onto a short decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use pgvector when your goal is simple vector search
&lt;/h3&gt;

&lt;p&gt;Pick pgvector when you want fewer moving parts and fewer extension-specific features.&lt;/p&gt;

&lt;p&gt;Use pgvector when you meet most of these conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You run CPU only workloads.&lt;/li&gt;
&lt;li&gt;You want &lt;code&gt;ivfflat&lt;/code&gt; and &lt;code&gt;hnsw&lt;/code&gt; naming across docs, examples, and client libraries.&lt;/li&gt;
&lt;li&gt;You want distance operators and two ANN access methods, with minimal extra SQL objects.&lt;/li&gt;
&lt;li&gt;You want a smaller operational surface area inside the extension.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use pgvector when your query pattern looks like this most of the time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use pgvector when you tune with pgvector GUCs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ef_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;ivfflat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;probes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use NeuronDB when your goal is a larger in-database surface area
&lt;/h3&gt;

&lt;p&gt;Pick NeuronDB when you want the same distance operators and index patterns plus additional SQL objects for GPU workflows, quantization workflows, tuning helpers, and operational queues.&lt;/p&gt;

&lt;p&gt;Use NeuronDB when you meet most of these conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want &lt;code&gt;ivf&lt;/code&gt; as an access method name, plus &lt;code&gt;hnsw&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;You want &lt;code&gt;vectorp&lt;/code&gt; and &lt;code&gt;vecmap&lt;/code&gt; as additional storage formats.&lt;/li&gt;
&lt;li&gt;You want SQL functions for quantization, with both CPU and GPU entry points.&lt;/li&gt;
&lt;li&gt;You want SQL functions for GPU status, GPU distance, and GPU kNN helpers.&lt;/li&gt;
&lt;li&gt;You want SQL tables and views for tenant quotas, job queues, metrics, and Prometheus export.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use NeuronDB when you want NeuronDB-specific query-time tuning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hnsw_ef_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ivf_probes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use NeuronDB when you want index tuning helpers and diagnostics in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;index_tune_hnsw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'items'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hnsw_recommendation&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;index_tune_ivf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'items'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ivf_recommendation&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;index_recommend_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'items'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_choice&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Short decision flow
&lt;/h3&gt;

&lt;p&gt;Start here when you need a quick answer.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If you want &lt;code&gt;ivfflat&lt;/code&gt;, pick pgvector.&lt;/li&gt;
&lt;li&gt;If you want &lt;code&gt;ivf&lt;/code&gt;, pick NeuronDB.&lt;/li&gt;
&lt;li&gt;If you want GPU SQL entry points, pick NeuronDB.&lt;/li&gt;
&lt;li&gt;If you want fewer extension-owned tables and views, pick pgvector.&lt;/li&gt;
&lt;li&gt;If you want quantization helpers and analysis functions in SQL, pick NeuronDB.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Practical scenarios
&lt;/h3&gt;

&lt;p&gt;Use this section as a checklist.&lt;/p&gt;

&lt;p&gt;Scenario 1: Single app, single embedding model, CPU only&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use pgvector&lt;/li&gt;
&lt;li&gt;Create one HNSW index on &lt;code&gt;vector_l2_ops&lt;/code&gt; or &lt;code&gt;vector_cosine_ops&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tune &lt;code&gt;hnsw.ef_search&lt;/code&gt; per endpoint or per query&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scenario 2: Multi-tenant SaaS with per-tenant limits&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use NeuronDB&lt;/li&gt;
&lt;li&gt;Use tenant quota tables and views under the &lt;code&gt;neurondb&lt;/code&gt; schema&lt;/li&gt;
&lt;li&gt;Use tenant-aware HNSW helper functions when you want tenant-scoped index management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scenario 3: Storage pressure from large embeddings&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use NeuronDB&lt;/li&gt;
&lt;li&gt;Use quantization functions to produce compact &lt;code&gt;bytea&lt;/code&gt; outputs&lt;/li&gt;
&lt;li&gt;Compare distance preservation with &lt;code&gt;quantize_compare_distances&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scenario 4: GPU present, batch heavy workloads&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use NeuronDB&lt;/li&gt;
&lt;li&gt;Enable GPU runtime and query &lt;code&gt;neurondb_gpu_info&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Use GPU kNN helpers where your workflow matches those function signatures&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;pgvector focuses on vector search primitives. NeuronDB adds additional types, an &lt;code&gt;ivf&lt;/code&gt; access method, GPU entry points, quantization helpers, worker tables, and metrics export.&lt;/p&gt;

&lt;p&gt;Pick the extension based on your operational goal. Keep your schema and query patterns simple. Measure with the benchmark scripts in this repo, then tune one knob at a time.&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>pgvector</category>
      <category>postgres</category>
      <category>vector</category>
    </item>
    <item>
      <title>8 RAG Patterns You Should Stop Ignoring</title>
      <dc:creator>NeuronDB Support</dc:creator>
      <pubDate>Sun, 01 Feb 2026 13:00:19 +0000</pubDate>
      <link>https://dev.to/neurondb_support_d73fa7ba/retrieval-augmented-generation-architectures-patterns-and-production-reality-49g1</link>
      <guid>https://dev.to/neurondb_support_d73fa7ba/retrieval-augmented-generation-architectures-patterns-and-production-reality-49g1</guid>
      <description>&lt;p&gt;Large language models generate fluent text. They fail to meet grounding, traceability, freshness, and access control requirements. Retrieval-Augmented Generation addresses this by forcing models to answer using external evidence.&lt;/p&gt;

&lt;p&gt;Early RAG used one simple pipeline. Production systems now use multiple architecture patterns. Each pattern targets a different failure mode. This post explains eight major RAG architectures used in production today.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75oygsxfybutofphj3kp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75oygsxfybutofphj3kp.png" alt="Retrieval Augmented Generation: Architectures, Patterns, and Production Reality" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Project references&lt;br&gt;
NeuronDB site: &lt;a href="https://www.neurondb.ai" rel="noopener noreferrer"&gt;https://www.neurondb.ai&lt;/a&gt;&lt;br&gt;
Source code: &lt;a href="https://github.com/neurondb/neurondb" rel="noopener noreferrer"&gt;https://github.com/neurondb/neurondb&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is RAG
&lt;/h2&gt;

&lt;p&gt;RAG links three systems: storage, retrieval, and generation. The storage layer holds your documents, chunks, and embeddings. The retrieval layer finds relevant evidence for each query. The generation layer produces answers conditioned on the retrieved context. The pipeline flows from query through evidence retrieval, context building, answer generation, and citation return. You get factual grounding, fresh data usage, private data isolation, and audit trace support. RAG shifted AI engineering from prompt tuning toward data pipeline engineering.&lt;/p&gt;

&lt;p&gt;The storage layer supports multiple backends, including vector databases (Pinecone, Weaviate, Milvus), document stores (Elasticsearch, OpenSearch), and hybrid systems. The retrieval layer runs embedding models, keyword search, or graph traversal, depending on the architecture. The generation layer typically uses a large language model with a prompt template. The three layers communicate through a well-defined interface. You swap components without rewriting the full pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Naive RAG
&lt;/h2&gt;

&lt;p&gt;Naive RAG uses direct vector similarity retrieval with no feedback loop. The name comes from the original RAG paper. The architecture remains the baseline for comparison.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9782k79mfsevxwx5qyu6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9782k79mfsevxwx5qyu6.png" alt="Naive RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;Document ingestion loads raw text from files, databases, or APIs. Preprocessing normalizes whitespace, strips markup, and segments by logical boundaries. Text chunking splits documents into fixed-size or variable-size segments. Common choices: 256 tokens, 512 tokens, or sentence-based chunks. Embedding generation converts each chunk into a vector using a pretrained model. Vector storage writes embeddings to a vector database with metadata (source doc, chunk index, timestamp). At query time, the user submits a question. Query embedding converts the question into a vector. Vector search returns the top-k nearest chunks by cosine similarity or Euclidean distance. Context injection concatenates retrieved chunks into a prompt. Response generation passes the prompt to an LLM. Citation return attaches source references to the output.&lt;/p&gt;
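&lt;p&gt;The query-time half of this pipeline can be sketched in a few lines of Python. The bag-of-letters embedding below is a toy stand-in for a real embedding model, and every function name is illustrative rather than a specific library's API:&lt;/p&gt;

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def embed(text):
    # Toy bag-of-letters embedding, normalized to unit length.
    # A real system would call a pretrained model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch in ALPHABET:
            vec[ALPHABET.index(ch)] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Dot product of unit vectors equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def top_k(query, corpus, k=2):
    # Vector search: rank chunks by cosine similarity to the query.
    q = embed(query)
    scored = [(cosine(q, embed(chunk)), chunk) for chunk in corpus]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

def answer(query, corpus, k=2):
    # Context injection: concatenate retrieved chunks into a prompt.
    # A real system would pass this prompt to an LLM endpoint.
    context = "\n".join(top_k(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

&lt;p&gt;A production version would swap &lt;code&gt;embed&lt;/code&gt; for a model call and &lt;code&gt;top_k&lt;/code&gt; for a vector index query, but the shape of the pipeline stays the same.&lt;/p&gt;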

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Implementation takes one to two weeks for an experienced engineer. Infrastructure cost stays low: one embedding model, one vector store, one LLM endpoint. The approach works well for static knowledge domains. FAQ corpora, product documentation, and internal wikis fit this pattern. Latency stays under 2 seconds for most deployments. No feedback loops mean deterministic behavior. The same query returns the same retrieval set. Debugging is straightforward.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;No verification loop validates retrieved evidence. Irrelevant chunks slip through when embedding similarity is misleading. Ranking quality depends entirely on embedding similarity. Ambiguous queries return weak results. A query like "how do I fix the error" returns generic troubleshooting content rather than error-specific documentation. Multi-faceted queries suffer. A question about "pricing and integration" retrieves only chunks for one facet. The model hallucinates to fill gaps when retrieval fails.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chunk Size
&lt;/h3&gt;

&lt;p&gt;Chunk size selection impacts recall quality. Small chunks (128 tokens) give precise matches but miss context. A small chunk matching "connection timeout" often omits the cause and the solution. Large chunks (512 tokens) capture more context but dilute relevance. Top-k retrieval then spans fewer distinct documents. Overlap between chunks (50 tokens) helps preserve context across boundaries. Test multiple chunk sizes (128, 256, 512) against your query set. Measure recall at k=5 and k=10. Choose the size where recall plateaus.&lt;/p&gt;
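&lt;p&gt;A minimal sketch of fixed-size chunking with overlap, assuming whitespace tokens as a stand-in for a real tokenizer:&lt;/p&gt;

```python
def chunk_tokens(tokens, size=256, overlap=50):
    # Fixed-size chunking: consecutive chunks share `overlap` tokens
    # so context survives across chunk boundaries.
    step = max(1, size - overlap)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end
    return chunks
```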

&lt;h3&gt;
  
  
  Embedding Models
&lt;/h3&gt;

&lt;p&gt;Embedding model choice impacts semantic coverage. Models trained on general text (OpenAI text-embedding-ada-002, sentence-transformers/all-MiniLM) underperform on domain-specific corpora. Medical, legal, and financial texts use terminology absent from training data. Use domain-tuned embeddings when available. Fine-tune on your corpus with contrastive loss. Or use domain-specific models (e.g., BioBERT for medical applications). Embedding dimension matters. 384-dim models are faster and cheaper. 1536-dim models capture finer distinctions. Benchmark both on your data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Production Use Cases
&lt;/h3&gt;

&lt;p&gt;FAQ bots with fewer than 10,000 questions. Documentation search for product manuals and API references. Internal knowledge bases where content changes infrequently. POCs and demos where speed of implementation outweighs accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Agentic RAG
&lt;/h2&gt;

&lt;p&gt;Agentic RAG adds planning, tool selection, and iterative reasoning. The agent breaks complex questions into steps, chooses tools for each step, executes them, and synthesizes a final answer. This architecture handles workflows that a single retrieval call cannot support.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jhcjewaow224zmnppep.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jhcjewaow224zmnppep.png" alt="Agentic RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;Task planning analyzes the user query and produces a step-by-step plan. The planner uses an LLM with few-shot examples or a structured prompt. Plan steps include "retrieve documents about X," "call API Y," and "summarize results." Tool selection maps each step to a tool. Tools include vector search, keyword search, calculator, API calls, and code execution. The agent selects tools based on step descriptions and tool schemas. Multi-step retrieval executes tools in sequence. Outputs from earlier steps feed into later steps. A retrieval about "company revenue" informs a follow-up retrieval about "competitor revenue." Tool execution runs each tool and captures results. Memory update stores tool outputs, intermediate conclusions, and user feedback. Response synthesis generates the final answer from the accumulated context. The agent loops back to planning when synthesis indicates missing information.&lt;/p&gt;
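&lt;p&gt;The plan-execute-update loop above can be sketched as a bounded loop. The planner and tools here are placeholder callables, not a specific framework's API:&lt;/p&gt;

```python
def run_agent(query, tools, plan_fn, max_steps=5):
    # Bounded plan-execute loop. plan_fn inspects the query and the
    # accumulated memory and returns (tool_name, tool_input), or None
    # when it judges the context sufficient. A real planner would be
    # an LLM call with the tool schemas in its prompt.
    memory = []
    for _ in range(max_steps):  # hard cap prevents runaway loops
        step = plan_fn(query, memory)
        if step is None:
            break
        tool_name, tool_input = step
        result = tools[tool_name](tool_input)
        memory.append((tool_name, tool_input, result))  # memory update
    return memory
```

&lt;p&gt;Keeping the full &lt;code&gt;memory&lt;/code&gt; list doubles as the execution trace you need for debugging.&lt;/p&gt;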

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Handles complex workflows. A query like "compare our Q3 results to our top three competitors and summarize the gap" requires multiple retrievals, API calls, and summarization. You run multiple tools in sequence. Long-running reasoning tasks become feasible. Research assistants draw from papers, patents, and news sources. Competitive intelligence agents aggregate data from multiple sources. Autonomous analytics agents run queries, join data, and produce reports.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;Latency increases with each planning and execution step. A single query often triggers 3 to 10 model calls. End-to-end latency reaches 10 to 30 seconds. Debugging is hard. The agent chooses different tools or paths for similar queries. Reproducing a failure requires logging every decision. Infrastructure cost rises. Each step consumes tokens. State tracking, retry logic, and execution budget control add engineering overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guidance
&lt;/h3&gt;

&lt;p&gt;Set a maximum step count. Without limits, agents loop or drift. A typical cap is 5 to 10 steps. Log every tool call and plan step. Store the full execution trace. Reproducibility matters when users report errors. Use deterministic seeds where possible for plan generation. Define tool schemas with clear descriptions. The agent relies on schemas to select tools. Vague descriptions cause wrong tool selection. Implement timeouts per step. A stuck tool blocks the whole pipeline. Add fallback behavior when tools fail. The agent should degrade gracefully.&lt;/p&gt;

&lt;h3&gt;
  
  
  Production Use Cases
&lt;/h3&gt;

&lt;p&gt;Research automation: literature review, patent analysis, trend summarization. Competitive intelligence: market monitoring, competitor tracking, strategic briefs. Autonomous analytics: ad-hoc reporting, data exploration, dashboard generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. HyDE RAG
&lt;/h2&gt;

&lt;p&gt;HyDE (Hypothetical Document Embeddings) generates synthetic documents to improve retrieval matching. The idea: hypothetical answers are closer in embedding space to real answers than raw queries. This bridges the vocabulary gap between how users ask and how documents are written.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjaen5e83remqelwz4k8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjaen5e83remqelwz4k8l.png" alt="HyDE RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;The user submits a query. Hypothetical answer generation produces one or more plausible answers using an LLM. A query like "how do I configure SSL" might generate "To configure SSL, you need to generate a certificate, add the certificate path to the config file, and restart the server." Embedding generation converts the hypothetical answer into a vector. Retrieval uses this vector instead of the query vector to search the corpus. The retrieved chunks are real documents, not hypothetical. Context assembly concatenates retrieved chunks. The final generation produces the actual answer from the retrieved evidence. The model cites real sources rather than hypothetical answers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Variations
&lt;/h3&gt;

&lt;p&gt;Single HyDE generates one hypothetical answer per query. Multi-HyDE generates 3 to 5 hypothetical answers, embeds each, retrieves results for each embedding, and merges the lists. Multi-HyDE improves recall but multiplies cost. HyDE with reranking adds a reranker after retrieval. The reranker scores chunks based on their relevance to the original query. This filters false positives from the expanded retrieval set.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Recall quality improves. Benchmarks report 10-30% recall gains over naive retrieval. Vocabulary mismatch between queries and corpus documents drops. Users ask, "Why is my app slow?" while docs say "performance degradation" and "latency issues." Hypothetical answers use doc-like language. Technical search benefits most. Developer questions, error messages, and API usage patterns align better after HyDE.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;Extra inference: the model must generate a hypothetical answer before retrieval. Expect 1.5x to 2x token usage per query. Synthetic bias is a risk. Generated documents sometimes skew retrieval toward certain document types. A model trained on tutorials often generates tutorial-style hypotheticals and over-retrieves tutorials. Production use cases include developer search, technical troubleshooting, and scientific literature retrieval. HyDE works best when combined with reranking models. The reranker filters false positives from the expanded retrieval set.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guidance
&lt;/h3&gt;

&lt;p&gt;Use a fast, cheap model for hypothetical generation. You do not need the best model. A 7B parameter model often suffices. Keep hypothetical answers concise. Long hypotheticals add noise. 50 to 100 tokens per hypothetical works well. Consider caching. Repeated queries (e.g., popular FAQs) reuse cached hypotheticals. Cache key: query embedding.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Graph RAG
&lt;/h2&gt;

&lt;p&gt;Graph RAG retrieves context from entity relationships in knowledge graphs. Documents become nodes and edges. Queries traverse the graph to assemble context. This architecture excels when relationships matter as much as raw text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gjvc2cv0c9d5huw8pxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gjvc2cv0c9d5huw8pxf.png" alt="Graph RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;Entity extraction identifies named entities in a document, such as people, organizations, products, and concepts. Extraction uses NER models, rule-based patterns, or LLM-based parsing. Entity linking resolves extracted entities to canonical IDs. "Apple Inc" and "Apple" map to the same node. Linking uses knowledge bases (Wikidata, DBpedia) or custom ontologies. Graph construction creates nodes for entities and edges for relationships. Relationships come from co-occurrence, dependency parsing, or relation extraction models. Graph storage writes to a graph database (e.g., Neo4j or Amazon Neptune) or to an in-memory graph. At query time, query understanding identifies entities mentioned in the query. Graph traversal starts from those entities and follows edges. Traversal strategies include k-hop neighborhood, path finding, and community detection. Context assembly pulls text from documents associated with traversed nodes. Generation produces an answer from the assembled context.&lt;/p&gt;
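&lt;p&gt;The k-hop neighborhood traversal mentioned above can be sketched with a plain adjacency list standing in for a graph database:&lt;/p&gt;

```python
from collections import deque

def k_hop(graph, seeds, k=2):
    # Breadth-first k-hop neighborhood: collect every entity reachable
    # from the query entities within k edges. `graph` maps a node to
    # its neighbor list, an adjacency-list stand-in for Neo4j etc.
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand past the hop budget
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen
```

&lt;p&gt;Context assembly would then pull the documents attached to each node in the returned set.&lt;/p&gt;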

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Multi-hop reasoning becomes tractable. "What drugs interact with the patient's current medication?" requires chaining drug-to-drug relationships across multiple hops. Explainability is strong. The reasoning path follows explicit graph edges. You show users the path from query entities to answer entities. Relationship-aware retrieval surfaces related concepts that naive vector search misses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;Graph construction is expensive. Entity extraction and linking require trained models or rules. Expect weeks of tuning for a new domain. Schema design is complex. You must decide which relationship types matter for retrieval. Too many relationship types create noise. Too few miss connections. Graph refresh pipelines must align with source data refresh cycles. Stale graphs return stale answers. Production use cases include healthcare decision support, fraud detection, and scientific research, where relationship structure matters as much as raw text.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guidance
&lt;/h3&gt;

&lt;p&gt;Start with a minimal schema. Two or three relationship types (e.g., "treats," "interacts with") often suffice. Add more as you validate the need. Use hybrid retrieval. Combine graph traversal with vector search. Graph finds structure. Vectors find semantic similarity. Run incremental updates. Rebuild the full graph only when schema changes. For daily doc updates, add or update affected nodes and edges.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Corrective RAG
&lt;/h2&gt;

&lt;p&gt;Corrective RAG adds self-validation and iterative refinement. The system generates an answer, critiques it, and re-retrieves or regenerates when the critique identifies issues. The loop continues until the answer meets a quality threshold. This architecture is well-suited to high-stakes domains where errors are costly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei74ed5gulovvn7e4ms9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei74ed5gulovvn7e4ms9.png" alt="Corrective RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;The initial retrieval fetches the top-k chunks for the query. Initial generation produces a draft answer. Critique evaluates the draft. The critic checks: does the answer cite retrieved evidence? Are claims supported? Are there contradictions? The critic uses an LLM with a structured prompt or a trained classifier. Scoring produces a numeric score (0 to 1) or a pass/fail. Re-query triggers when the critique finds missing evidence or unsupported claims. The re-query reformulates the search or expands k. Re-generation produces a new draft from the expanded context. The loop repeats until the score exceeds a threshold or the maximum iterations (e.g., 3) are reached. Final output returns the best-scoring answer with citations.&lt;/p&gt;
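&lt;p&gt;A minimal sketch of the critique loop, assuming placeholder callables for retrieval, generation, and the critic. The score threshold and iteration cap follow the numbers above:&lt;/p&gt;

```python
def corrective_answer(query, retrieve_fn, generate_fn, critique_fn,
                      threshold=0.8, max_iters=3):
    # Generate-critique-retry loop. critique_fn returns a score in
    # [0, 1]; retrieve_fn takes (query, k) so the re-query step can
    # expand k. All callables are stand-ins for LLM and search calls.
    k = 5
    best_answer, best_score = None, -1.0
    for _ in range(max_iters):
        context = retrieve_fn(query, k)
        draft = generate_fn(query, context)
        score = critique_fn(draft, context)
        if score > best_score:
            best_answer, best_score = draft, score
        if score >= threshold:
            break  # draft passed the critique
        k = k * 2  # expand retrieval and try again
    return best_answer, best_score
```

&lt;p&gt;Returning the best-scoring draft, not the last one, protects against a final pass that regresses.&lt;/p&gt;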

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Factual accuracy improves: benchmarks report a 15 to 25 percent reduction in hallucination rate. The critic catches unsupported claims before they reach the user. The architecture suits applications that require robust audit trails. Financial analytics, legal research, and compliance systems need traceable reasoning. Each answer comes with a critique log.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;Higher latency. Most implementations run 2 to 4 generation passes per query. Latency doubles or triples. Token usage rises proportionally. Engineering teams must design scoring functions for the critique stage. The critic must reliably detect factual errors or missing evidence. A weak critic adds cost without benefit. False negatives let errors through. A harsh critic triggers unnecessary re-retrieval. False positives waste tokens and time. Tuning the critic is non-trivial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guidance
&lt;/h3&gt;

&lt;p&gt;Start with a simple critique: "Does each claim have a citation?" Then add checks for contradiction and hallucination. Use chain-of-thought prompting for the critic. Ask the critic to explain its reasoning before scoring. This improves reliability. Set a conservative max iteration count. Three passes usually suffice. More passes yield diminishing returns. Log critique scores over time. Track the distribution. Drift indicates a need to retune.&lt;/p&gt;

&lt;h3&gt;
  
  
  Production Use Cases
&lt;/h3&gt;

&lt;p&gt;Financial analytics: earnings summaries, risk reports, compliance checks. Legal research: case law retrieval, contract analysis, and regulatory lookup. Compliance systems: policy verification, audit support, regulatory reporting.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Contextual RAG
&lt;/h2&gt;

&lt;p&gt;Contextual RAG uses conversation state and session memory. Retrieval considers prior turns. Generation maintains continuity. This architecture supports multi-turn dialogues where each question depends on context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yl8go8dz6oqz1tqpab1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yl8go8dz6oqz1tqpab1.png" alt="Contextual RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;Session storage keeps a log of user messages and assistant responses. Each turn appends to the log. Context summarization runs when the log exceeds a token limit. Summarization compresses old turns into a shorter summary. The summary plus recent turns form the active context. Context-aware retrieval uses the full conversation, not only the latest message. A query like "what about the second one?" resolves only when "second one" is combined with the prior discussion of the list. Some systems embed the full conversation. Others extract key entities and concepts for retrieval. Response generation receives the retrieved context plus conversation history. The model produces answers referencing prior turns. The memory update stores new information from the current turn for future retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Multi-turn consistency improves. Follow-up questions receive correct answers. "What is the price?" after "Tell me about Product X" returns Product X's price. Personalization based on user history becomes possible. Preferences, prior queries, and corrections influence retrieval and generation. Session continuity supports long interactions. Meeting assistants, customer success tools, and personal knowledge systems rely on this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;Memory drift is a risk. Stale or irrelevant context accumulates over long sessions. A conversation about "Product A" often drifts to "Product B," but retrieval remains biased toward A. Context contamination occurs when prior turns bias retrieval in unwanted ways. A user correction ("I meant Product B, not A") must override prior context. Implementation is tricky. Memory compaction must run periodically. Without compaction, context windows overflow, and relevance degrades. Summarization loses detail. Aggressive summarization drops information needed for later turns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guidance
&lt;/h3&gt;

&lt;p&gt;Define a context window budget. Reserve tokens for conversation history, retrieval context, and generation. When history exceeds the budget, summarize the oldest turns. Use a sliding window with a summary: keep the last N turns verbatim and summarize the rest. Store user corrections explicitly. "User clarified X" should override prior assumptions. Test with long sessions. Simulate 20-turn conversations. Measure consistency and relevance at turn 5, 10, 15, 20.&lt;/p&gt;
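&lt;p&gt;The sliding-window-with-summary strategy can be sketched directly. The summarizer is a placeholder; a real system would call an LLM:&lt;/p&gt;

```python
def compact_history(turns, keep_last=4, summarize_fn=None):
    # Sliding window with summary: keep the last N turns verbatim and
    # compress everything older into one summary entry. The default
    # summarizer is a placeholder that only counts the folded turns.
    if keep_last >= len(turns):
        return list(turns)  # history fits the budget, nothing to fold
    old, recent = turns[:-keep_last], turns[-keep_last:]
    if summarize_fn is None:
        summary = f"[summary of {len(old)} earlier turns]"
    else:
        summary = summarize_fn(old)
    return [summary] + recent
```

&lt;p&gt;User corrections should be written into the summary explicitly so they keep overriding prior assumptions after compaction.&lt;/p&gt;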

&lt;h3&gt;
  
  
  Production Use Cases
&lt;/h3&gt;

&lt;p&gt;Meeting assistants: summarization, action items, follow-up questions. Customer success tools: support dialogues, onboarding flows, and feature discovery. Personal knowledge systems: note-taking, research assistants, learning companions.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Modular RAG
&lt;/h2&gt;

&lt;p&gt;Modular RAG splits retrieval into independent components. Each component has a single responsibility. You swap, upgrade, or bypass components without rewriting the pipeline. This architecture supports complex enterprise needs where one-size-fits-all fails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuv6kjjn6gg1iyo88l3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuv6kjjn6gg1iyo88l3j.png" alt="Modular RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;Query rewriting normalizes and expands the user query. Spelling correction, query expansion, and multi-query generation (HyDE-style) run here. Hybrid retrieval runs multiple search strategies in parallel. Vector search, keyword search, and graph traversal execute concurrently. Results feed into a fusion step. Filtering removes irrelevant results. Filters apply metadata constraints (date range, source, access control). Deduplication merges near-duplicate chunks. Reranking scores the filtered set with a cross-encoder or learned ranker. Reranking is expensive, so you run reranking on the top 20 to 50 candidates. Tool routing sends queries to specialized tools. A legal query goes to the legal corpus. A support query goes to the support corpus. Routing uses classifiers or keyword rules. Response synthesis assembles the final answer. Synthesis calls the LLM once or multiple times. Some architectures add a citation verification step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Each module upgrades or replaces independently. Swap the embedding model without touching the retrieval logic. Add a new data source by adding a retrieval module. The architecture supports flexible enterprise workflows. Different departments need different corpora and rules. Modular RAG accommodates this. Adding new data sources or retrieval strategies is straightforward. Implement a new module. Add the module to the pipeline. Configure routing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;System complexity rises. A full modular pipeline has 6 to 10 components. Each component has its own config, dependencies, and failure modes. Maintenance cost rises. Observability across modules becomes critical. Failures occur at any stage. A bug in query rewriting silently corrupts downstream retrieval. You need per-module metrics and tracing. Latency adds up. Each module adds milliseconds. End-to-end latency requires careful optimization. Production use cases include enterprise AI platforms, large data pipeline systems, and research automation systems where modularity is a core requirement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guidance
&lt;/h3&gt;

&lt;p&gt;Define clear interfaces between modules. Each module accepts a standard input format and produces a standard output format. Use a pipeline framework (e.g., LangChain, LlamaIndex, or a custom DAG) to enforce this. Instrument every module. Log inputs, outputs, and latency. Add tracing IDs to follow a query across modules. Version your pipeline. When you change a module, record the version. A/B test module changes before full rollout. Start minimal. Add modules only when you have a concrete problem. A 3-module pipeline (retrieve, rerank, generate) often suffices for early deployments.&lt;/p&gt;
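&lt;p&gt;The standard-interface idea can be sketched as modules that each take and return a state dict. The module names and fields here are illustrative, not a specific framework's schema:&lt;/p&gt;

```python
def run_pipeline(query, modules):
    # Minimal modular pipeline: each module is a callable taking and
    # returning the shared state dict, so modules can be swapped or
    # reordered without touching their neighbors.
    state = {"query": query, "candidates": [], "trace": []}
    for name, module in modules:
        state = module(state)
        state["trace"].append(name)  # per-module tracing for debugging
    return state

def rewrite(state):
    # Query rewriting module: normalize the query in place.
    state["query"] = state["query"].strip().lower()
    return state

def retrieve(state):
    # Stand-in retriever: echoes the query as a single candidate.
    state["candidates"] = ["doc about " + state["query"]]
    return state
```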

&lt;h3&gt;
  
  
  Production Use Cases
&lt;/h3&gt;

&lt;p&gt;Enterprise AI platforms: multi-tenant, multi-corpus, role-based access. Large data pipeline systems: billions of documents, multiple retrieval backends. Research automation: federated search, specialized tools, reproducibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Hybrid RAG
&lt;/h2&gt;

&lt;p&gt;Hybrid RAG combines keyword retrieval and semantic retrieval. Keyword search finds exact and lexical matches. Semantic search finds conceptual matches. Together, they cover cases where either alone fails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcabyj7e6rtsb4f6bb31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcabyj7e6rtsb4f6bb31.png" alt="Hybrid RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;Query parsing extracts keywords and optionally generates a semantic query. Keyword search runs on an inverted index (e.g., BM25 or Elasticsearch). Semantic search runs against a vector index. Both return ranked lists. Rank fusion merges the two lists. Reciprocal Rank Fusion (RRF) is the common baseline: score = sum(1/(k + rank)) across lists. k is typically 60. Other methods include weighted linear combination and learned fusion. Optionally, reranking scores the fused list. Reranking uses a cross-encoder or a learned model. Generation receives the top chunks and produces the answer.&lt;/p&gt;
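The RRF baseline described above fits in a few lines. A minimal sketch; the document ids and the two input rankings are placeholders:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum of 1/(k + rank) over every
    ranked list that contains d. Ranks are 1-based; k=60 is the common default."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["d3", "d1", "d7"]    # e.g. a BM25 ranking
semantic_hits = ["d1", "d5", "d3"]   # e.g. a vector ranking
fused = rrf_fuse([keyword_hits, semantic_hits])
```

Note how d1 wins: it ranks high in both lists, so its reciprocal-rank contributions add up even though it tops neither list by a wide margin.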

&lt;h3&gt;
  
  
  Keyword vs Semantic
&lt;/h3&gt;

&lt;p&gt;Keyword search excels at exact matches: product IDs, error codes, and proper nouns. "ERR_SSL_PROTOCOL_ERROR" retrieves the right doc. Semantic search fails here if the embedding does not capture the token. Semantic search excels at paraphrasing and conceptual queries. "How do I fix connection problems" matches "troubleshooting network connectivity." Keyword search misses this. Hybrid covers both. A query about "Q3 revenue" gets keyword hits on "Q3" and "revenue" plus semantic hits on earnings reports and financial summaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Precision comes from keyword matching. Recall comes from semantic search. Structured and unstructured data both work. Keyword search handles tables, metadata, and structured fields. Semantic search handles free text. Production use cases include legal search, compliance audits, and enterprise search platforms, where both precision and recall matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;Ranking tuning is complex. Rank fusion models require continuous optimization. You must balance keyword and semantic signals. RRF assumes equal contribution. Your data often needs different weights. Learned fusion models often outperform RRF but need training data. You need labeled query-document pairs. Tuning is iterative. Increase keyword weight when users complain about missed exact matches. Increase semantic weight when users report missed conceptual matches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Guidance
&lt;/h3&gt;

&lt;p&gt;Start with RRF. No training required. Tune k (typically 40-80) on a small validation set. Add metadata filters. Both keyword and semantic results benefit from source, date, and access filters. Consider query-type routing. Short queries (1 to 3 words) often need more keyword weight. Long, conceptual queries need more semantic weight. Implement both paths in parallel. Parallel execution keeps latency low. Fusion adds minimal overhead.&lt;/p&gt;
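The query-type routing heuristic above can be made concrete as a weight function for a weighted fusion step. The thresholds and weight values below are illustrative starting points to tune, not fixed rules:

```python
def fusion_weights(query: str):
    """Heuristic query-type routing: short queries lean on keyword search,
    long conceptual queries lean on semantic search.
    Returns (keyword_weight, semantic_weight)."""
    n_words = len(query.split())
    if n_words > 8:
        return (0.3, 0.7)   # long natural-language question
    if n_words > 3:
        return (0.5, 0.5)
    return (0.7, 0.3)       # likely an ID, error code, or proper noun
```

Feed these weights into a weighted linear combination of the keyword and semantic scores, and revisit the thresholds once you have user feedback on missed matches.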

&lt;h3&gt;
  
  
  Production Use Cases
&lt;/h3&gt;

&lt;p&gt;Legal search: case law, contracts, regulations. Compliance audit: policy lookup, regulatory check. Enterprise search: intranet, document management, knowledge base.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Architecture Comparison
&lt;/h2&gt;

&lt;p&gt;Naive RAG: low complexity, medium accuracy, low cost, low latency. Implementation in days. Best for static, narrow corpora.&lt;/p&gt;

&lt;p&gt;Agentic RAG and Modular RAG: high complexity, high accuracy, high cost, higher latency. Implementation in weeks or months. Best for complex workflows and enterprise needs.&lt;/p&gt;

&lt;p&gt;Corrective RAG: high accuracy, high latency, high token usage. Best for high-stakes domains where verification matters.&lt;/p&gt;

&lt;p&gt;HyDE, Contextual, and Hybrid RAG: medium complexity, cost, and latency with accuracy gains over Naive RAG. Implementation in one to two weeks. Best for technical search, multi-turn dialogue, or mixed precision-recall needs.&lt;/p&gt;

&lt;p&gt;Choose an architecture by failure mode. Naive RAG solves simplicity. Agentic RAG solves autonomy. HyDE solves vocabulary mismatch. Graph RAG solves relationship reasoning. Corrective RAG solves verification. Contextual RAG solves memory. Modular RAG solves enterprise workflow composition. Hybrid RAG solves the balance between precision and semantic coverage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision Framework
&lt;/h3&gt;

&lt;p&gt;Ask: Does your corpus change frequently? Yes favors Naive or Modular. Does your domain have rich entity relationships? Yes favors Graph. Do users ask multi-turn questions? Yes favors Contextual. Do you need high factual accuracy and audit trails? Yes favors Corrective. Do users and docs use different terminology? Yes favors HyDE or Hybrid. Do you need multiple tools and complex workflows? Yes favors Agentic or Modular.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;RAG is no longer a single architecture. Each pattern solves a specific problem. Production success depends on pipeline design, data quality, and evaluation discipline. The strongest systems combine multiple RAG patterns into a single, orchestrated platform. A single system might use Hybrid retrieval, Corrective verification, and Contextual memory. Future RAG systems will look less like search pipelines and more like distributed data operating systems.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Running AI on premises with Postgres</title>
      <dc:creator>NeuronDB Support</dc:creator>
      <pubDate>Thu, 01 Jan 2026 19:30:42 +0000</pubDate>
      <link>https://dev.to/neurondb_support_d73fa7ba/running-ai-on-premises-with-postgres-50g6</link>
      <guid>https://dev.to/neurondb_support_d73fa7ba/running-ai-on-premises-with-postgres-50g6</guid>
      <description>&lt;p&gt;Many AI systems struggle with unpredictable latency and excessive data movement. Documents, embeddings, and vector search often live in different systems, adding hops, cost, and failure modes. This post explains when running vector search and RAG directly in PostgreSQL on-premises makes sense, and how to design it for stable production behavior.&lt;/p&gt;

&lt;p&gt;Project references&lt;br&gt;
NeuronDB site: &lt;a href="https://www.neurondb.ai" rel="noopener noreferrer"&gt;https://www.neurondb.ai&lt;/a&gt;&lt;br&gt;
Source code: &lt;a href="https://github.com/neurondb/neurondb" rel="noopener noreferrer"&gt;https://github.com/neurondb/neurondb&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Decide if you should run on premises
&lt;/h2&gt;

&lt;p&gt;Pick on premises when you must control where data lives. Use it when you need to keep traffic private. Pick it when you must hit a strict latency target. Pick it when costs grow with API calls and egress. If you need a fast setup for a small pilot, start in the cloud, then move the data plane later.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compliance: HIPAA, GDPR, PCI, residency rules, audit rules&lt;/li&gt;
&lt;li&gt;Security: private networks, strict access, limited outbound traffic&lt;/li&gt;
&lt;li&gt;Latency: stable p95 and p99, fewer hops&lt;/li&gt;
&lt;li&gt;Cost: high volume usage, where per-call fees add up&lt;/li&gt;
&lt;li&gt;Control: standard Postgres and a clear ops surface&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Cloud vs on-premises, quick view
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdp9xiv16yov2qd51fx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdp9xiv16yov2qd51fx2.png" alt="On-premises vs cloud AI comparison" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure: Comparison of data flow, latency paths, and operational boundaries between cloud and on-premises AI systems.&lt;/p&gt;

&lt;p&gt;Watch your data movement. In many systems, you fetch documents in one place, run embeddings in another, and run vector search in a third place. Each hop adds latency and failure modes. If you keep these steps within a single network, you reduce variance and debug faster.&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qfx2reli2d0sm5836ll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qfx2reli2d0sm5836ll.png" alt="On-premises AI architecture overview" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
Figure: On-premises AI architecture with documents, embeddings, vector indexes, and retrieval inside PostgreSQL.&lt;/p&gt;

&lt;p&gt;Keep the data plane local. Store documents and metadata in Postgres. Store embeddings next to the rows they describe. Build vector indexes in the same database. Run retrieval queries over private links. Expose results through your app services.&lt;/p&gt;

&lt;p&gt;Keep three paths clear. Ingest is write-heavy. Retrieval is read-heavy. Admin work is rare but sensitive. Split these paths by network rules and by roles.&lt;/p&gt;

&lt;p&gt;Put ingestion on a schedule. Batch it. Keep queries stable. Do not let ad hoc scripts write to the central database. Use a queue or a worker process. Record each run.&lt;/p&gt;
&lt;h2&gt;
  
  
  What you run
&lt;/h2&gt;

&lt;p&gt;Keep the component list short. Assign an owner to each part. If you cannot name the host and the pager, you are not done.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Postgres with NeuronDB for storage, embeddings, indexes, and retrieval&lt;/li&gt;
&lt;li&gt;Ingestion workers for cleaning, chunking, and loads&lt;/li&gt;
&lt;li&gt;Embedding execution on CPU or GPU, batch jobs, steady throughput&lt;/li&gt;
&lt;li&gt;App services that call Postgres and return citations&lt;/li&gt;
&lt;li&gt;Monitoring for latency, load, pool use, lag, and backups&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Deployment patterns
&lt;/h2&gt;

&lt;p&gt;Start simple. Prove retrieval quality. Prove latency. Add resilience only when you need it. Keep changes small so you can reverse them.&lt;/p&gt;
&lt;h2&gt;
  
  
  Single server
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmjdmr6jevaunyzf4ydy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmjdmr6jevaunyzf4ydy.png" alt="Single server deployment diagram" width="800" height="351"&gt;&lt;/a&gt;&lt;br&gt;
Figure: Single-host deployment for early-stage or low-scale workloads.&lt;/p&gt;

&lt;p&gt;Use this for your first release. You get one host to secure. You get one Postgres instance to tune. You get a single failure domain to reason about. Add backups and dashboards before you add more servers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Document content'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embed_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Document content'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'sentence-transformers/all-MiniLM-L6-v2'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;embed_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'query'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'sentence-transformers/all-MiniLM-L6-v2'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add filters early. It keeps results stable. It keeps cost stable. It keeps latency stable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data model and chunking
&lt;/h2&gt;

&lt;p&gt;Store chunks, not whole files. Keep the original document id. Store offsets. Store a version. Keep chunk size stable. Start with 300 to 800 tokens per chunk and a 50 to 150 token overlap. Measure answer quality. Then change one variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc_chunks&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;chunk_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;doc_chunks_tenant_doc_idx&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;doc_chunks&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Track a content hash. It lets you skip re-embedding on retries. It enables you to detect duplicates. Use a text hash or a stable id from your upstream system.&lt;/p&gt;
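The chunking and hashing rules above can be sketched in a few lines. This assumes tokenization has already happened; the dict layout and the 500/100 defaults are illustrative choices inside the suggested ranges:

```python
import hashlib

def chunk_tokens(tokens, size=500, overlap=100):
    """Split a token list into overlapping chunks, keeping offsets.
    size and overlap sit inside the suggested starting ranges
    (300 to 800 tokens per chunk, 50 to 150 token overlap)."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + size]
        text = " ".join(piece)
        # Content hash: skip re-embedding on retries, detect duplicates.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        chunks.append({"offset": start, "text": text, "hash": digest})
        if start + size >= len(tokens):
            break
    return chunks

chunks = chunk_tokens([f"tok{i}" for i in range(1200)])
```

A 1,200-token document with these settings yields three chunks at offsets 0, 400, and 800, each sharing 100 tokens with its neighbor.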

&lt;h2&gt;
  
  
  Hybrid search with metadata and vectors
&lt;/h2&gt;

&lt;p&gt;Filter with metadata, then rank by vector distance. Use this per tenant. Use it per source. Use it per time window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'default'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'acme'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'spam'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;embed_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'query'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'sentence-transformers/all-MiniLM-L6-v2'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Ingestion workflow
&lt;/h2&gt;

&lt;p&gt;Use one workflow. Keep it the same across development, testing, and production. Run it in batches. Track each run. Start with these steps.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch raw documents&lt;/li&gt;
&lt;li&gt;Normalize text, strip boilerplate&lt;/li&gt;
&lt;li&gt;Split into chunks, keep offsets&lt;/li&gt;
&lt;li&gt;Insert rows without embeddings&lt;/li&gt;
&lt;li&gt;Compute embeddings in batches of 32 to 256&lt;/li&gt;
&lt;li&gt;Update embeddings&lt;/li&gt;
&lt;li&gt;Build or refresh indexes&lt;/li&gt;
&lt;li&gt;Run a sample query set, record p95&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Set one target. Ingest 100k chunks in under 30 minutes. Then tune. If you cannot meet that target, reduce the batch size, increase the number of workers, or move the embedding computation to a GPU.&lt;/p&gt;
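As a sanity check on that target, the arithmetic and the batching loop look like this; the batch size of 128 is an illustrative value inside the 32 to 256 range above:

```python
def required_throughput(total_chunks, window_minutes):
    """Chunks per second needed to hit the ingest target."""
    return total_chunks / (window_minutes * 60)

def batches(items, batch_size=128):
    """Yield fixed-size embedding batches (32 to 256 per the text)."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 100k chunks in 30 minutes needs roughly 56 chunks per second.
target = required_throughput(100_000, 30)
```

If your measured embedding rate falls short of the target, that number tells you how many parallel workers you need before you reach for a GPU.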

&lt;h2&gt;
  
  
  Primary and replicas
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3whkxs9oih4dyu4t3x8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3whkxs9oih4dyu4t3x8.png" alt="Multi server cluster diagram" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use this when you need uptime and read scale. Keep writing on the primary. Send retrieval reads to replicas. Use a pooler. Track replication lag. Set a rule for stale reads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;neurondb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;hnsw_create_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'documents'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'documents_embedding_hnsw'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Connection pooling
&lt;/h2&gt;

&lt;p&gt;Use a pooler for app traffic. Set a hard limit on connections. Keep idle connections low. Track pool saturation. Start with 20 to 50 connections per app node. Raise it only after you measure.&lt;/p&gt;
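PgBouncer is one common choice of pooler. A minimal configuration sketch: the option names are real PgBouncer settings, but the database name and values are illustrative starting points to measure against, not prescriptions.

```ini
; Illustrative PgBouncer settings; measure before raising any of these.
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_port = 6432
pool_mode = transaction    ; reuse server connections across clients
default_pool_size = 25     ; within the 20 to 50 per app node guidance
max_client_conn = 500      ; hard ceiling on client connections
```

Transaction pooling keeps the number of real Postgres backends small even when many app pods connect, which is exactly the failure mode the rule above warns about.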

&lt;p&gt;Keep one rule. Do not let each app pod open hundreds of direct connections to Postgres. It will fail under load.&lt;/p&gt;

&lt;h2&gt;
  
  
  Indexing and maintenance
&lt;/h2&gt;

&lt;p&gt;Indexes drift. Stats drift. Tables bloat. Plan for it. Batch ingestion. Refresh stats. Watch index size. Watch vacuum behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check query plans. Do it before and after each major ingest. You want an index scan for retrieval queries. You do not want a full table scan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;embed_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'query'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'sentence-transformers/all-MiniLM-L6-v2'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Replication checks
&lt;/h2&gt;

&lt;p&gt;Track lag. Track replay delay. Set an alert. Use a number. Start with 5 seconds for p95 lag. Use reads from the primary if lag exceeds your limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;application_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;write_lag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;flush_lag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;replay_lag&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_replication&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
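The stale-read rule can then be enforced in the application. A minimal sketch, assuming the lag value is polled from the query above; the 5-second default matches the alert threshold suggested here:

```python
def choose_endpoint(replay_lag_seconds, limit_seconds=5.0):
    """Route a read: use a replica unless replication lag exceeds the
    limit, or the lag is unknown (e.g. the replica stopped reporting)."""
    if replay_lag_seconds is None or replay_lag_seconds > limit_seconds:
        return "primary"
    return "replica"
```

Treating unknown lag the same as high lag is deliberate: a replica that stops reporting should not keep serving reads by default.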



&lt;h2&gt;
  
  
  Sizing
&lt;/h2&gt;

&lt;p&gt;Start with three numbers. Vector count. Embedding dimension. Peak reads per second. Then add headroom. For raw float storage use vectors times dims times 4 bytes. Ten million vectors at 384 dims is about 15.4 GB for floats. Plan for more once you add row overhead and indexes.&lt;/p&gt;

&lt;p&gt;Use a simple table. It keeps planning honest.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 million vectors at 384 dims, about 1.5 GB floats&lt;/li&gt;
&lt;li&gt;10 million vectors at 384 dims, about 15.4 GB floats&lt;/li&gt;
&lt;li&gt;10 million vectors at 768 dims, about 30.7 GB floats&lt;/li&gt;
&lt;/ul&gt;
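The numbers in that table come straight from the raw-float formula, which is worth keeping as a helper so capacity planning stays honest:

```python
def raw_vector_gb(n_vectors, dims, bytes_per_float=4):
    """Raw float storage only (vectors * dims * 4 bytes);
    row overhead and index structures add more on top."""
    return n_vectors * dims * bytes_per_float / 1e9

ten_million_384 = raw_vector_gb(10_000_000, 384)   # about 15.4 GB
```

Remember this is a floor, not an estimate of total disk: tuple headers, TOAST, and HNSW index structures sit on top of it.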

&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpaur5erh7mnhvwxqfb1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpaur5erh7mnhvwxqfb1u.png" alt="Security architecture diagram" width="800" height="408"&gt;&lt;/a&gt;&lt;br&gt;
Figure: Network isolation, role separation, and access control for on-premises AI systems.&lt;/p&gt;

&lt;p&gt;Keep the database private. Restrict inbound. Restrict outbound. Limit roles. Log access. Keep backups protected.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Put the database in private subnets&lt;/li&gt;
&lt;li&gt;Use a bastion or VPN for admin access&lt;/li&gt;
&lt;li&gt;Use TLS on internal links&lt;/li&gt;
&lt;li&gt;Use disk encryption at rest&lt;/li&gt;
&lt;li&gt;Use least privilege roles for apps&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Roles
&lt;/h2&gt;

&lt;p&gt;Create one app role per service. Grant only what it needs. Avoid superuser. Avoid owner roles in apps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;app_reader&lt;/span&gt; &lt;span class="n"&gt;NOINHERIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;CONNECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;postgres&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_reader&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_reader&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_reader&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_reader&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;Start with measurement, not assumptions. Measure query latency, index usage, and embedding throughput under realistic load. Verify that the planner uses vector indexes and that queries avoid full-table scans. Run embedding generation in controlled batches to smooth CPU or GPU usage. Apply relational filters as early as possible to reduce the candidate set before vector ranking. Keep result sets small and predictable. Monitor connection pool saturation continuously, since pool exhaustion often becomes the first bottleneck long before CPU or storage limits are reached.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mean_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;LEFT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_preview&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;mean_exec_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
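&lt;p&gt;To verify that the planner actually uses the vector index rather than a sequential scan, run the hot query under EXPLAIN. The sketch below assumes a &lt;code&gt;documents&lt;/code&gt; table with an &lt;code&gt;embedding&lt;/code&gt; vector column; &lt;code&gt;:'query_vec'&lt;/code&gt; is a psql variable holding the query embedding and is only an illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content
FROM documents
ORDER BY embedding &amp;lt;-&amp;gt; :'query_vec'::vector
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An index scan node over the vector index should appear in the plan. A sequential scan means the index is missing, or not usable for the chosen distance operator.&lt;/p&gt;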



&lt;p&gt;Pick two numbers: retrieval p95 and ingest throughput. Track them daily. Change one thing at a time.&lt;/p&gt;
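&lt;p&gt;Connection pool saturation shows up in &lt;code&gt;pg_stat_activity&lt;/code&gt; long before it shows up in CPU graphs. A quick per-state session count for the current database:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT state, COUNT(*) AS sessions
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state
ORDER BY sessions DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Compare the total against the pooler's pool size. A growing pile of idle-in-transaction sessions usually points at application code holding transactions open.&lt;/p&gt;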

&lt;h2&gt;
  
  
  Backups and recovery
&lt;/h2&gt;

&lt;p&gt;Set RPO and RTO targets. Run restore drills. Write a step-by-step runbook. Test failover in a staging environment. Keep the process repeatable.&lt;/p&gt;

&lt;p&gt;Run a restore drill each month. Time it. Record it. Fix the slow steps. Keep one target: restore your core dataset in under 60 minutes.&lt;/p&gt;
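&lt;p&gt;If WAL archiving backs the recovery plan, check that archiving keeps up. The standard &lt;code&gt;pg_stat_archiver&lt;/code&gt; view reports success and failure counts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT archived_count,
       failed_count,
       last_archived_time,
       last_failed_time
FROM pg_stat_archiver;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A rising &lt;code&gt;failed_count&lt;/code&gt; or a stale &lt;code&gt;last_archived_time&lt;/code&gt; means the next restore drill will miss its RPO.&lt;/p&gt;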

&lt;h2&gt;
  
  
  Migration from the cloud
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrqhmhbf5ovmdcgt15jg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrqhmhbf5ovmdcgt15jg.png" alt="Cloud-to-on-premises migration diagram" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Move the data plane first. Export docs and embeddings. Import into Postgres. Rebuild indexes. Mirror traffic. Compare answers and latency. Cut over with a rollback plan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;hnsw_create_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'documents'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'documents_embedding_hnsw'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
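&lt;p&gt;One way to move the data plane is plain CSV through psql's &lt;code&gt;\copy&lt;/code&gt;, sketched below. The file name is illustrative; embeddings exported in vector text form round-trip through the type's input function.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- On the source side: export ids, text, and embeddings
\copy (SELECT id, content, embedding FROM documents) TO 'documents.csv' WITH (FORMAT csv)

-- On the target side: bulk load into the new table
\copy documents FROM 'documents.csv' WITH (FORMAT csv)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Rebuild the HNSW index only after the bulk load finishes. Loading into an unindexed table and indexing afterwards is usually much faster than loading through a live index.&lt;/p&gt;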



&lt;h2&gt;
  
  
  Cost model
&lt;/h2&gt;

&lt;p&gt;Use break-even months: CapEx divided by the monthly savings (cloud monthly minus on-premises monthly). Include staff time, power, support, and depreciation on the on-premises side. Include egress and API fees on the cloud side.&lt;/p&gt;

&lt;p&gt;Use one example with numbers. Keep it simple.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capex 120000&lt;/li&gt;
&lt;li&gt;Cloud monthly 18000&lt;/li&gt;
&lt;li&gt;On premises monthly 9000&lt;/li&gt;
&lt;li&gt;Break-even months are 120000 divided by the monthly savings of 9000 (18000 minus 9000), about 13.3&lt;/li&gt;
&lt;/ul&gt;
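&lt;p&gt;The same arithmetic as a one-line query, using the numbers above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- CapEx / (cloud monthly - on-premises monthly)
SELECT ROUND(120000.0 / (18000 - 9000), 1) AS break_even_months;
-- returns 13.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;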

&lt;h2&gt;
  
  
  Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Pick a pattern: single server, cluster, hybrid, or edge&lt;/li&gt;
&lt;li&gt;Set targets for p95 latency, QPS, RPO, RTO&lt;/li&gt;
&lt;li&gt;Lock down networks, subnets, firewall, bastion&lt;/li&gt;
&lt;li&gt;Add TLS and disk encryption&lt;/li&gt;
&lt;li&gt;Add a pooler&lt;/li&gt;
&lt;li&gt;Build indexes and check query plans&lt;/li&gt;
&lt;li&gt;Add monitoring and alerts&lt;/li&gt;
&lt;li&gt;Set backups and run a restore drill&lt;/li&gt;
&lt;/ul&gt;
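&lt;p&gt;Two of the checklist items are easy to verify from a psql session using the standard statistics views:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- TLS: is this connection encrypted?
SELECT ssl, version
FROM pg_stat_ssl
WHERE pid = pg_backend_pid();

-- Index usage: are the vector indexes being scanned at all?
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An &lt;code&gt;idx_scan&lt;/code&gt; of zero on an index that should serve production queries is a planner problem, not a data problem.&lt;/p&gt;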

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;On-premises AI works best when the architecture remains simple and close to the data. Keeping embeddings, vector search, and retrieval inside PostgreSQL reduces moving parts and failure modes.&lt;br&gt;
Hybrid SQL plus vector queries deliver control, stable latency, and clear operational boundaries. For teams prioritizing data ownership, predictability, and long-term maintainability, this model fits real production needs.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>postgres</category>
      <category>vectordatabase</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
