DEV Community

Jake Casto

Signal-driven health monitoring for HNSW indices w/ pgvector

Vector search in PostgreSQL has come a long way, and pgvector makes it simple:

insert embeddings, create an HNSW index, and run ANN queries.

At least for a while.

As our system scaled across stores, regions, and product volumes, something became clear:

HNSW is not “fire and forget.”

Indexes age. They bloat. They fragment. They silently lose recall.

When they degrade, search results degrade too — sometimes gradually, sometimes catastrophically.

For months, we assumed our approach was fine:

  • Rebuild every night at 1 AM
  • Rebuild after imports
  • Rebuild after new embeddings
  • Rebuild everything, always

It felt safe.

It was absolutely not.


When Scheduled Rebuilds Turned Into a Problem

Our first approach was periodic instead of condition-based — every night, every tenant, every index.

And here's what happened:

| What we saw | Why it was bad |
| --- | --- |
| We rebuilt indexes with zero new writes | Massive waste of compute + IO |
| Large tenants saw daily index decay | Recall dropped, latency increased |
| Crash mid-rebuild left index invalid but present | PostgreSQL stopped using it entirely |
| Replicas diverged from primaries | ANN returned different results per region |

The worst case was the INVALID index scenario.

A PostgreSQL restart happened mid-way through a CREATE INDEX CONCURRENTLY. During this operation the index is first created as INVALID, then built and validated over two full table scans.

A crash meant the scans never completed — leaving a half-built index that still shows up in the catalog, but is permanently marked INVALID and skipped by the planner.

Result:

An index that looks present but is unusable, silently ruining search until someone notices.
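
This failure mode is cheap to detect after the fact: PostgreSQL records it in the `pg_index` catalog, where an interrupted concurrent build leaves `indisvalid = false`. A monitoring query along these lines (a sketch — scope it to your own schemas as needed) would have caught it immediately:

```sql
-- List indexes left INVALID by an interrupted CREATE INDEX CONCURRENTLY.
-- These still appear in \d output but are never used by the planner.
SELECT
    i.indexrelid::regclass AS index_name,
    i.indrelid::regclass   AS table_name
FROM pg_index i
WHERE NOT i.indisvalid;
```

Dropping and recreating the index (or running a plain `REINDEX`) clears the state.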

That incident forced a rethink.

We didn’t need scheduled rebuilding.

We needed index health monitoring — rebuild only when necessary.


What Does “Healthy” Mean for an HNSW Index?

B-trees give you bloat %, page churn, dead tuples.

HNSW is different — it’s a navigable multi-layer graph whose health depends on:

  • how many vectors were replaced without pruning
  • how many nodes reference embeddings that no longer exist
  • whether graph layers remain navigable under memory pressure
  • whether the index fits in cache vs spilling to disk

There is no single fragmentation metric.

So we built one.


The Three-Metric Health Model

We generate one score per index using:

| Metric | Detects |
| --- | --- |
| Dead tuple ratio | Graph churn, stale nodes, recall degradation |
| Cache hit ratio | Memory residency vs disk traversal |
| Bytes per vector | HNSW structural bloat / unnecessary overhead |

If an index inflates, falls out of cache, or accumulates dead references — ANN quality drops.

PostgreSQL already exposes everything we need.


📊 Metric 1 — Dead Tuple Ratio

Dead tuples = deleted or overwritten embeddings still referenced inside the graph.

They increase hop count and reduce recall.

```
dead_tuple_ratio = n_dead_tup / n_live_tup
```

Threshold: anything above roughly 10–15% is worth rebuilding.

```sql
SELECT
    relname AS table_name,
    n_dead_tup,
    n_live_tup,
    -- Multiply by numeric 100.0 first so ROUND(numeric, int) applies;
    -- ROUND(double precision, int) does not exist in PostgreSQL.
    ROUND(n_dead_tup * 100.0 / GREATEST(n_live_tup, 1), 2) AS dead_tuple_ratio_pct
FROM pg_stat_all_tables
WHERE relname = 'your_embedding_table_name';
```

📊 Metric 2 — Cache Hit Ratio

Most performance issues weren't degraded structure —

they were cache eviction.

If HNSW pages fall out of shared_buffers → traversal becomes disk-bound → planner may stop using the index.

```sql
SELECT
    indexrelid::regclass AS index_name,
    idx_blks_hit,
    idx_blks_read,
    ROUND((idx_blks_hit * 100.0 / GREATEST(idx_blks_hit + idx_blks_read, 1)), 2) AS cache_hit_ratio_pct
FROM pg_statio_user_indexes
WHERE indexrelid::regclass::text LIKE '%hnsw%';
```

| Ratio | Status |
| --- | --- |
| > 90% | Healthy |
| 75–90% | Yellow |
| < 75% | Red — likely disk-bound, planner risk |

📊 Metric 3 — Bytes Per Vector

Heuristic for structural bloat.

If stored bytes per embedding drift too high → rebuild pays off.

```sql
WITH idx AS (
    SELECT c.relname AS index_name, pg_relation_size(c.oid) AS index_bytes
    FROM pg_class c
    JOIN pg_index i ON i.indexrelid = c.oid
    WHERE c.relname LIKE '%hnsw%'
)
SELECT
    index_name,
    pg_size_pretty(index_bytes) AS index_size,
    (
        index_bytes / NULLIF(
            (SELECT reltuples FROM pg_class WHERE relname = 'your_embedding_table_name'),
            0
        )
    ) AS bytes_per_vector
FROM idx;
```

Example: expected 5.0 KB/vector, observed 9.2 KB/vector → rebuild.

Not perfect, but highly correlated with real-world recall loss.


The Health Score

Instead of binary rebuild vs no rebuild, we compute:

```
score = (dead_score * 0.4) + (cache_score * 0.4) + (bloat_score * 0.2)
```

Weighting rationale:

  • Dead tuples = recall degradation → heaviest signal
  • Cache residency = latency + planner reliability
  • Bloat matters, but less across large embeddings

If score < threshold → rebuild just that index.
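
Putting the three signals together, the scoring can be sketched as a single catalog query. This is illustrative, not our exact production query: the 15% dead-tuple saturation point and the ~5 KB expected bytes-per-vector are assumed constants, and the table/index name patterns are placeholders.

```sql
-- Sketch: one health score per HNSW index; 1.0 = healthy, 0.0 = rebuild now.
-- Assumptions: dead-tuple ratio saturates at 15%, expected size ~5 KB/vector.
WITH dead AS (
    SELECT n_dead_tup::float / GREATEST(n_live_tup, 1) AS dead_ratio
    FROM pg_stat_all_tables
    WHERE relname = 'your_embedding_table_name'
),
cache AS (
    SELECT indexrelid AS idx_oid,
           idx_blks_hit::float / GREATEST(idx_blks_hit + idx_blks_read, 1) AS cache_ratio
    FROM pg_statio_user_indexes
    WHERE indexrelid::regclass::text LIKE '%hnsw%'
),
bloat AS (
    SELECT i.indexrelid AS idx_oid,
           pg_relation_size(i.indexrelid)::float
               / GREATEST((SELECT reltuples FROM pg_class
                           WHERE relname = 'your_embedding_table_name'), 1)
               AS bytes_per_vector
    FROM pg_index i
)
SELECT
    cache.idx_oid::regclass AS index_name,
    ROUND((
        (1 - LEAST(dead.dead_ratio / 0.15, 1)) * 0.4                  -- dead_score
      + cache.cache_ratio * 0.4                                       -- cache_score
      + LEAST(5120.0 / GREATEST(bloat.bytes_per_vector, 1), 1) * 0.2  -- bloat_score
    )::numeric, 3) AS health_score
FROM dead
CROSS JOIN cache
JOIN bloat USING (idx_oid);
```

Any index scoring below the threshold gets a `REINDEX INDEX CONCURRENTLY` (PostgreSQL 12+); everything else is left alone.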


Result

We stopped rebuilding everything.

We rebuild only when needed.

Smarter. Faster. Cheaper.

And search quality no longer surprises us.

Like the writing? Follow product development at Layers
