Gabriel Anhaia

Vector Index Cold Start: Why Your First Query Takes 8 Seconds


You ship the RAG service. Tests are green, p99 retrieval is around 40 ms, the dashboard looks healthy. Then a deploy lands at 03:00 UTC, the pod restarts, and the first user query takes eight seconds. The second one takes 38 ms. Same query, same index, same code path. The graph in Grafana shows a single sharp spike on every rolling deploy, and your on-call shrugs and says it is fine because the spike never lasts.

It is not fine. The spike is the index reading itself off disk one page at a time while the user sits at a loading spinner. On a quiet morning that is one frustrated user. On a Black-Friday rollout it is a few thousand of them, hitting freshly cycled pods at once and waiting on the same page faults.

This post is about the cold-start tax: why HNSW indexes wait until the first query to actually load, what the real numbers look like across pgvector, Qdrant, and Pinecone, and four warm-up patterns that stop the spike before any user sees it.

Why HNSW loads lazily

An HNSW (Hierarchical Navigable Small World) index is a graph. Each vector is a node, each node has neighbour links at multiple layers, and a query walks the graph from a top-layer entry point down through the layers, visiting maybe a few hundred nodes for a k=10 search. The graph data lives on disk as a memory-mapped file or a set of segments. The vectors themselves usually live next to the graph.

When the process boots, it does not read the whole index into RAM. It opens the file, asks the kernel to mmap it, and that is the end of the boot work. Pages get read in only when something touches them, and "something" is the first query.

The first query walks the graph. Every neighbour pointer it follows triggers a page fault. Every vector it scores triggers another page fault for the vector data. Modern NVMe is fast, but a page fault is still 50–150 microseconds (typical NVMe random-read latency), and a single HNSW search can touch hundreds of pages: graph pages, vector pages, payload pages. Multiply: a few hundred page faults at ~100 µs each lands you in the tens of milliseconds for the lookups alone, plus the actual compute, plus any extra cold path on the embedding side.
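
To put rough numbers on that, here is the multiplication as a few lines of Python. The fault count and latencies are illustrative assumptions, not measurements from any particular index.

# Illustrative assumptions, not measurements.
PAGE_FAULTS_PER_QUERY = 400   # graph pages + vector pages + payload pages
FAULT_LATENCY_US = {"nvme": 100, "sata": 600}

for disk, fault_us in FAULT_LATENCY_US.items():
    wait_ms = PAGE_FAULTS_PER_QUERY * fault_us / 1000
    print(f"{disk}: ~{wait_ms:.0f} ms of page-fault wait before any compute")

# nvme: ~40 ms, sata: ~240 ms, and that is one query in isolation.
# Stack concurrent cold queries queueing on the same disk plus the one-shot
# setup costs described next, and multi-second first requests stop looking odd.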

That is the per-query story. The first query also pays for a few one-shot setup costs that do not repeat: thread-pool warm-up, JIT kernel selection, allocator init. Those are cheap individually and ugly when they pile up on the same request.

The eight-second number in the title is the upper end of the band on a 5M-vector pgvector index with cold page cache and slower SATA SSDs; the bench in the next section reproduces a 4.8 s point in that band on faster NVMe. On a fresh container with a cold page cache, the first query commonly lands in the 3–10 second range depending on disk speed and dataset size. Qdrant with its on-disk segments lands in a similar band per its docs, and Pinecone serverless cold-starts depend on the tier (see the Pinecone link below). The pattern is the same: the work was deferred, and somebody has to pay for it.

What the per-engine docs actually say

A short tour, with sources you can verify. Check the docs for the version you actually run, because every engine moves on this.

pgvector. The index lives in PostgreSQL pages. The first query after a restart pays for filesystem cache misses on every page the search walks. The pgvector README is explicit that the index is built over pages and that performance depends on the pages being in memory; it recommends using pg_prewarm to load relations into shared buffers. A common pattern is SELECT pg_prewarm('items_embedding_idx'); immediately after restart. Without that, your first dozen queries pay the cache-miss tax.

Qdrant. Qdrant stores vectors and the HNSW graph in segments. You can configure segments to be memory-mapped (lazy) or held fully in memory via the on_disk and on_disk_payload flags in the collection config. The Qdrant indexing docs describe the trade-off: mmap keeps RAM low at the cost of cold-page faults, in-memory loads pay the cost up front but never again. You choose where the cost lands.
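
Here is what that choice looks like with the qdrant-client Python API, as a sketch: the collection name and dimension are assumptions, and the flag names are worth checking against the Qdrant version you run.

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=models.VectorParams(
        size=1536,
        distance=models.Distance.COSINE,
        on_disk=True,   # mmap the vectors: low RAM, page faults on cold queries
    ),
    hnsw_config=models.HnswConfigDiff(on_disk=True),  # mmap the graph too
    on_disk_payload=True,  # keep payloads on disk as well
)

# Set the on_disk flags to False to hold everything in RAM and pay the
# load cost at startup instead of on the first query.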

Pinecone. Serverless indexes auto-suspend cold tenants and pay a wake-up cost on the first query after suspension. The Pinecone docs on serverless cold-start call this out as a known behaviour and recommend periodic keep-alive queries for latency-sensitive workloads. If your traffic is bursty, you are paying it on every burst.
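
For bursty traffic, the keep-alive the docs recommend can be as small as a cron-style loop. A minimal sketch, assuming the current pinecone Python client and an index named "docs"; the interval and dimension are placeholders to tune.

import random
import time

from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("docs")

while True:
    # one cheap top-1 query per minute keeps the tenant from going cold
    probe = [random.random() for _ in range(1536)]
    index.query(vector=probe, top_k=1)
    time.sleep(60)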

The shared theme: every modern vector engine has a knob that decides whether the cold cost lands at boot or on the first query. The default is almost always the first query. Boot time is the metric on the status page. p99 is buried in a tail dashboard. So engines defer the cost to the first query.

A reproducible test

A short bench that shows the spike on a real machine: run it against a freshly restarted Postgres and watch the gap between the first query and the rest.

import time
import numpy as np
import psycopg
from psycopg.rows import tuple_row

DSN = "postgresql://postgres@localhost/rag"
DIM = 1536


def query_once(conn, q_vec):
    with conn.cursor() as cur:
        t0 = time.perf_counter()
        cur.execute(
            "SELECT id FROM docs "
            "ORDER BY embedding <-> %s::vector "
            "LIMIT 10",
            (q_vec,),
        )
        cur.fetchall()
        return time.perf_counter() - t0


def run():
    # Pass the vector in pgvector's text form and cast with ::vector,
    # so no client-side adapter registration is needed.
    q = np.random.rand(DIM).astype(np.float32)
    q_text = "[" + ",".join(f"{x:.6f}" for x in q) + "]"
    with psycopg.connect(DSN, row_factory=tuple_row) as c:
        for i in range(5):
            dt = query_once(c, q_text)
            print(f"q{i}: {dt * 1000:.1f} ms")


if __name__ == "__main__":
    run()

On a freshly restarted Postgres on a developer-class laptop with a 5M-row corpus, a typical run prints something close to:

q0: 4812.3 ms
q1: 41.7 ms
q2: 38.9 ms
q3: 39.1 ms
q4: 38.4 ms

A hundred-fold gap on the same connection pool. The first query paid for the page cache to populate; everything after rode the warm cache. Numbers vary widely by disk, dataset, and shared_buffers configuration. Don't anchor on the absolute figures; the shape is what carries.

Cold first query versus warm steady-state on a 5M-vector pgvector index

Pattern 1: warm-up queries on deploy

The smallest fix is also the cheapest. After the index is loaded but before the pod accepts traffic, run a handful of synthetic queries to walk the index pages into the page cache. This goes in your readiness probe, not liveness. Readiness gates traffic; liveness gates restarts. You do not want a slow warm-up to kill the pod.

import numpy as np


async def warmup(client, k: int = 10, runs: int = 8):
    """
    Hit the index with random vectors so HNSW pages
    fault in before real traffic arrives.
    """
    dim = client.embedding_dim
    for _ in range(runs):
        v = np.random.rand(dim).astype(np.float32)
        await client.search(vector=v.tolist(), top_k=k)


async def ready():
    # `client` is whatever search client the app already holds at module
    # level; the readiness endpoint calls this before traffic is admitted.
    if not client.warmed:
        await warmup(client)
        client.warmed = True
    return True

A few notes on getting this right. Use random vectors, not a fixed query — a fixed query walks the same path every time and only warms a sliver of the graph. Random vectors fan the walks across the index. Eight to sixteen warm-up queries is usually enough to push the high-traffic neighbourhoods into RAM; running 1,000 of them does not pay for itself. If you have a cached query log from before the deploy, replay the top 50 instead of random vectors — that warms the actual hot pages instead of the statistical average.
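
A sketch of that log-replay variant, reusing the same hypothetical client as above; load_top_queries is a placeholder for wherever you persist recent query embeddings.

async def warmup_from_log(client, k: int = 10, limit: int = 50):
    # load_top_queries() is a placeholder: it should return the most frequent
    # recent queries together with their cached embedding vectors.
    for _text, vec in load_top_queries(limit=limit):
        await client.search(vector=vec, top_k=k)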

This pattern alone is commonly reported to take pgvector cold-starts from multi-second to sub-200 ms. It is the smallest possible fix and it is what most teams should ship first.

Pattern 2: pg_prewarm and mlock

Warm-up queries lazily walk the graph. The brute-force alternative is to load the file into RAM directly and pin it there.

For pgvector, the pg_prewarm extension is built into Postgres and reads index pages into shared_buffers up front. The full sequence on a fresh container looks like:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;

SELECT pg_prewarm('docs_embedding_hnsw_idx');
SELECT pg_prewarm('docs_pkey');

The second argument selects the mode: 'buffer' (the default) loads pages into shared_buffers, 'read' populates the OS page cache, and 'prefetch' only issues asynchronous prefetch hints. For a vector index that fits in shared_buffers (shared_buffers set to at least the index size plus headroom), 'buffer' is the right choice.
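
If you run the prewarm from an application startup hook rather than psql, the same sequence looks roughly like this with psycopg; the index names match the bench above, and the size check is only a guard so you notice when the index outgrows shared_buffers.

import psycopg

DSN = "postgresql://postgres@localhost/rag"


def prewarm(dsn: str = DSN) -> None:
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS pg_prewarm")

        # Guard: 'buffer' mode only helps if the index actually fits.
        cur.execute(
            "SELECT pg_size_pretty(pg_relation_size('docs_embedding_hnsw_idx'))"
        )
        index_size = cur.fetchone()[0]
        cur.execute("SHOW shared_buffers")
        print(f"index {index_size}, shared_buffers {cur.fetchone()[0]}")

        cur.execute("SELECT pg_prewarm('docs_embedding_hnsw_idx', 'buffer')")
        cur.execute("SELECT pg_prewarm('docs_pkey', 'buffer')")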

For self-hosted Qdrant or any engine that mmaps a file, mlock(2) on the mapped region pins those pages in RAM and prevents the kernel from evicting them under memory pressure. Be careful: mlock without a memory budget starves the rest of your workload. Raise RLIMIT_MEMLOCK (ulimit -l) and set the cgroup memory limits to leave room.

When this beats Pattern 1: if your fleet sees rolling deploys often enough that warm-up queries do not finish before traffic lands, or if the workload has long enough idle periods that the kernel evicts hot pages, pinning is the right hammer.

Pattern 3: sidecar warm-pool

Pod restarts are the norm in some environments: Kubernetes rolling deploys, autoscaler scale-ups, spot preemption. In those, even a warm-up probe loses to the cold-start race when traffic ramps fast. The pattern that scales: keep a warm pool of pre-loaded replicas behind the live ones.

The shape: an autoscaler keeps N "ready" replicas with the index loaded and the pages warm. Each replica runs a periodic warm query (every 30–60 seconds) to keep the kernel from evicting pages under memory pressure. When a live replica goes down, traffic shifts to the warm pool. The autoscaler spawns a fresh replica, and the new one promotes from "warming" to "ready" only after a warm-up probe passes. Users never hit a cold pod.
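
The periodic warm query is a small background task; a sketch reusing the warmup helper from Pattern 1:

import asyncio


async def keep_warm(client, interval_s: float = 45.0):
    """Keep hot index pages resident on an idle warm-pool replica."""
    while True:
        try:
            await warmup(client, runs=1)  # warmup() from Pattern 1
        except Exception as exc:
            print(f"keep-warm probe failed: {exc}")
        await asyncio.sleep(interval_s)


# after the readiness warm-up passes:
# asyncio.create_task(keep_warm(client))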

This costs more: you are paying for replicas that are not handling traffic. The trade is straightforward. The warm pool costs warm_pool_size × cost_per_replica, and what it buys you is removing the cold-start p99 from the share of requests that would otherwise land on a fresh pod. For a payments flow where one slow request blows an SLO, the warm pool pays for itself. For a background ingestion pipeline, it does not.

Pattern 4: write-through caches at the app layer

The previous patterns warm the engine. This one removes the engine from the hot path entirely.

Most RAG workloads have a long tail and a sharp head: a small number of queries get hit a lot, and most queries are unique. A two-tier cache catches the head before the engine sees it.

import hashlib
import time


def _now() -> float:
    return time.monotonic()


def query_key(text: str, k: int) -> str:
    h = hashlib.sha256(text.encode("utf-8"))
    return f"qry:{h.hexdigest()}:{k}"


class CachedRetriever:
    def __init__(self, backend, ttl=300):
        self.backend = backend
        self.ttl = ttl
        self._memo = {}

    def search(self, text, k=10):
        key = query_key(text, k)
        hit = self._memo.get(key)
        if hit is not None:
            ts, results = hit
            if ts + self.ttl > _now():
                return results
        results = self.backend.search(text, k)
        self._memo[key] = (_now(), results)
        return results

A short TTL (60–300 seconds) is usually right. RAG corpora do not turn over fast enough to justify anything shorter, and longer TTLs risk serving stale results once the corpus or the reranking behind it changes. Pair this with the embedding cache from Caching Pre-Computed Embeddings so neither side of the retrieval call pays full price on repeats.

The query cache is not a substitute for the warm-up patterns. It catches the head of the distribution; the long tail still hits a cold engine. It's the outer layer. Warm-up is the inner one.
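
Wiring it in is one line around whatever retriever you already have; VectorBackend here is a stand-in for your real search client.

retriever = CachedRetriever(VectorBackend(), ttl=120)  # VectorBackend is a placeholder

hits = retriever.search("refund policy for enterprise plans", k=10)  # hits the engine
hits = retriever.search("refund policy for enterprise plans", k=10)  # served from the memo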

What to do on Monday

If you have nothing today, the order is: warm-up queries in the readiness probe, pg_prewarm or equivalent if your engine supports it, query cache at the app layer, and a warm pool only when the math says so.

The trap: skipping straight to the warm pool because cold starts feel scary. A four-line warm-up probe usually fixes 90% of the spike at zero infra cost. The pool is for the last 10%, when the SLO demands it and the bill agrees.

Measure the cold tail before and after each pattern. The metric to chart: time-to-first-successful-query after pod ready. If it is a single-digit-second number on cold pods, you have not shipped the fix yet.


If this was useful

The RAG Pocket Guide walks the retrieval stack end to end with the production knobs that decide whether your p99 looks like a flat line or a sawtooth. The chapters on index selection, warm-pool sizing, and embedding-pipeline economics are the ones to open first if you are about to push your first vector DB to production traffic.

RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
