DEV Community

Jake Casto

Signal-driven health monitoring for HNSW indices w/ pgvector

Vector search in PostgreSQL has come a long way, and pgvector makes it simple:

insert embeddings, create an HNSW index, and run ANN queries.

At least for a while.

As our system scaled across stores, regions, and product volumes, something became clear:

HNSW is not “fire and forget.”

Indexes age. They bloat. They fragment. They silently lose recall.

When they degrade, search results degrade too — sometimes gradually, sometimes catastrophically.

For months, we assumed our approach was fine:

  • Rebuild every night at 1 AM
  • Rebuild after imports
  • Rebuild after new embeddings
  • Rebuild everything, always

It felt safe.

It was absolutely not.


When Scheduled Rebuilds Turned Into a Problem

Our first approach was periodic instead of condition-based — every night, every tenant, every index.

And here's what happened:

| What we saw | Why it was bad |
| --- | --- |
| We rebuilt indexes with zero new writes | Massive waste of compute + IO |
| Large tenants saw daily index decay | Recall dropped, latency increased |
| Crash mid-rebuild left index invalid but present | PostgreSQL stopped using it entirely |
| Replicas diverged from primaries | ANN returned different results per region |

The worst case was the INVALID index scenario.

A PostgreSQL restart happened mid-way through a CREATE INDEX CONCURRENTLY. During this operation the index is first created as INVALID, then built and validated over two full table scans.

A crash meant the scans never completed — leaving a half-built index that still shows up in the catalog, but is permanently marked INVALID and skipped by the planner.

Result:

An index that looks present but is unusable, silently ruining search until someone notices.
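
This failure mode is cheap to detect after the fact: PostgreSQL records it in the `pg_index` catalog, where an interrupted concurrent build leaves `indisvalid = false`. A monitoring query along these lines (a sketch — scope it to your own schemas as needed) would have caught it immediately:

```sql
-- List indexes left INVALID by an interrupted CREATE INDEX CONCURRENTLY.
-- These still appear in \d output but are never used by the planner.
SELECT
    i.indexrelid::regclass AS index_name,
    i.indrelid::regclass   AS table_name
FROM pg_index i
WHERE NOT i.indisvalid;
```

Dropping and recreating the index (or running a plain `REINDEX`) clears the state.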

That incident forced a rethink.

We didn’t need scheduled rebuilding.

We needed index health monitoring — rebuild only when necessary.


What Does “Healthy” Mean for an HNSW Index?

B-trees give you bloat %, page churn, dead tuples.

HNSW is different — it’s a navigable multi-layer graph whose health depends on:

  • how many vectors were replaced without pruning
  • how many nodes reference embeddings that no longer exist
  • whether graph layers remain navigable under memory pressure
  • whether the index fits in cache vs spilling to disk

There is no single fragmentation metric.

So we built one.


The Three-Metric Health Model

We generate one score per index using:

| Metric | Detects |
| --- | --- |
| Dead tuple ratio | Graph churn, stale nodes, recall degradation |
| Cache hit ratio | Memory residency vs disk traversal |
| Bytes per vector | HNSW structural bloat / unnecessary overhead |

If an index inflates, falls out of cache, or accumulates dead references — ANN quality drops.

PostgreSQL already exposes everything we need.


📊 Metric 1 — Dead Tuple Ratio

Dead tuples = deleted or overwritten embeddings still referenced inside the graph.

They increase hop count and reduce recall.

```
dead_tuple_ratio = n_dead_tup / n_live_tup
```

Threshold: anything above roughly 10–15% is worth rebuilding.

```sql
SELECT
    relname AS table_name,
    n_dead_tup,
    n_live_tup,
    -- Multiply by numeric 100.0 first so ROUND(numeric, int) applies;
    -- ROUND(double precision, int) does not exist in PostgreSQL.
    ROUND(n_dead_tup * 100.0 / GREATEST(n_live_tup, 1), 2) AS dead_tuple_ratio_pct
FROM pg_stat_all_tables
WHERE relname = 'your_embedding_table_name';
```

📊 Metric 2 — Cache Hit Ratio

Most performance issues weren't degraded structure —

they were cache eviction.

If HNSW pages fall out of shared_buffers → traversal becomes disk-bound → planner may stop using the index.

```sql
SELECT
    indexrelid::regclass AS index_name,
    idx_blks_hit,
    idx_blks_read,
    ROUND((idx_blks_hit * 100.0 / GREATEST(idx_blks_hit + idx_blks_read, 1)), 2) AS cache_hit_ratio_pct
FROM pg_statio_user_indexes
WHERE indexrelid::regclass::text LIKE '%hnsw%';
```

| Ratio | Status |
| --- | --- |
| > 90% | Healthy |
| 75–90% | Yellow |
| < 75% | Red — likely disk-bound, planner risk |

📊 Metric 3 — Bytes Per Vector

Heuristic for structural bloat.

If stored bytes per embedding drift too high → rebuild pays off.

```sql
WITH idx AS (
    SELECT c.relname AS index_name, pg_relation_size(c.oid) AS index_bytes
    FROM pg_class c
    JOIN pg_index i ON i.indexrelid = c.oid
    WHERE c.relname LIKE '%hnsw%'
)
SELECT
    index_name,
    pg_size_pretty(index_bytes) AS index_size,
    (
        index_bytes / NULLIF(
            (SELECT reltuples FROM pg_class WHERE relname = 'your_embedding_table_name'),
            0
        )
    ) AS bytes_per_vector
FROM idx;
```

Example: expected 5.0 KB/vector, observed 9.2 KB/vector → rebuild.

Not perfect, but highly correlated with real-world recall loss.


The Health Score

Instead of binary rebuild vs no rebuild, we compute:

```
score = (dead_score * 0.4) + (cache_score * 0.4) + (bloat_score * 0.2)
```

Weighting rationale:

  • Dead tuples = recall degradation → heaviest signal
  • Cache residency = latency + planner reliability
  • Bloat matters, but less across large embeddings

If score < threshold → rebuild just that index.
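
Putting the three signals together, the scoring can be sketched as a single catalog query. This is illustrative, not our exact production query: the 15% dead-tuple saturation point and the ~5 KB expected bytes-per-vector are assumed constants, and the table/index name patterns are placeholders.

```sql
-- Sketch: one health score per HNSW index; 1.0 = healthy, 0.0 = rebuild now.
-- Assumptions: dead-tuple ratio saturates at 15%, expected size ~5 KB/vector.
WITH dead AS (
    SELECT n_dead_tup::float / GREATEST(n_live_tup, 1) AS dead_ratio
    FROM pg_stat_all_tables
    WHERE relname = 'your_embedding_table_name'
),
cache AS (
    SELECT indexrelid AS idx_oid,
           idx_blks_hit::float / GREATEST(idx_blks_hit + idx_blks_read, 1) AS cache_ratio
    FROM pg_statio_user_indexes
    WHERE indexrelid::regclass::text LIKE '%hnsw%'
),
bloat AS (
    SELECT i.indexrelid AS idx_oid,
           pg_relation_size(i.indexrelid)::float
               / GREATEST((SELECT reltuples FROM pg_class
                           WHERE relname = 'your_embedding_table_name'), 1)
               AS bytes_per_vector
    FROM pg_index i
)
SELECT
    cache.idx_oid::regclass AS index_name,
    ROUND((
        (1 - LEAST(dead.dead_ratio / 0.15, 1)) * 0.4                  -- dead_score
      + cache.cache_ratio * 0.4                                       -- cache_score
      + LEAST(5120.0 / GREATEST(bloat.bytes_per_vector, 1), 1) * 0.2  -- bloat_score
    )::numeric, 3) AS health_score
FROM dead
CROSS JOIN cache
JOIN bloat USING (idx_oid);
```

Any index scoring below the threshold gets a `REINDEX INDEX CONCURRENTLY` (PostgreSQL 12+); everything else is left alone.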


Result

We stopped rebuilding everything.

We rebuild only when needed.

Smarter. Faster. Cheaper.

And search quality no longer surprises us.

Like the writing? Follow product development at Layers
