
Abhishek Gautam

halfvec: Half the Bits, Twice the Speed?

How we slashed storage in half—one byte at a time

When I first heard about float16 “half‑precision,” my reaction was probably the same as yours: “Sounds like hype. Can it really save half the memory without wrecking recall?” In part 1 we saw how RAM‑hungry embeddings become as you scale.

Enter Scalar Quantization, the first technique in our compression trilogy. Today, we’ll journey from zero to hero on halfvec, Postgres’s built‑in float16 vector type.


Why Half‑Precision Feels Like “Cheating”—But Isn’t

Imagine shooting photos on your phone. In “high quality” mode, each image might be 12 MB. Switch to “medium”, and it shrinks to 6 MB with barely noticeable loss. Drop to “low”, and you see compression artifacts. Embeddings follow the same pattern:

  • Float32 (32‑bit) = “high quality”
  • Float16 (16‑bit) = “medium”
  • Int8, Binary = “low”

Think of a 32-bit float as a very long ruler with 4,294,967,296 tick marks. Float32 uses 1 sign bit + 8 exponent bits + 23 mantissa bits = 32 bits (4 bytes).

Now a 16‑bit float is a much shorter ruler with only 65,536 marks. Float16 uses 1 sign + 5 exponent + 10 mantissa bits = 16 bits (2 bytes).

For most embedding workloads, the extra ticks between 0.000123 and 0.000124 don’t change which document is “closest”; they just waste cache lines.
By keeping the sign bit, five exponent bits, and ten fraction bits, we still capture 99 % of the geometric nuance while halving the payload.
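You can see the halving directly with Python’s standard `struct` module, which supports IEEE‑754 binary16 via the `'e'` format. This is just a sketch of the storage format, not pgvector’s actual code path:

```python
import struct

def to_float16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 binary16 (2 bytes)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

v = 0.123456789                    # a typical embedding component
half = to_float16(v)

print(len(struct.pack('<e', v)))   # 2 bytes instead of float32's 4
print(abs(v - half))               # rounding error of a few 1e-5 at most
```

The round-trip loses only the “extra ticks”: the nearest binary16 value is a few 1e-5 away, which is exactly the kind of noise that doesn’t reorder nearest neighbors.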


Inside halfvec: What Really Happens When You Switch to Float16

When you tell pgvector to use halfvec(1536), you’re simply asking it to store each of your 1,536 dimensions in half‑precision (16 bits) instead of full‑precision (32 bits). Here’s how that plays out behind the scenes—step by step:

1. Storing Your Vectors on Disk

  • Full‑precision (vector): each dimension is a 32‑bit (4‑byte) float. The core payload is 1,536 × 4 = 6,144 bytes; with an 8‑byte header, that’s 6,152 bytes per row.
  • Half‑precision (halfvec): each dimension is a 16‑bit (2‑byte) float. That cuts the core payload to 1,536 × 2 = 3,072 bytes, and with the same 8‑byte header you end up with 3,080 bytes per row.
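The per-row arithmetic is easy to check. The 1,536 dimensions and 8-byte per-value header are the figures used in this article:

```python
DIMS = 1536       # dimensions per embedding, as in the article
HEADER = 8        # per-value header bytes used in the article's figures

full_row = DIMS * 4 + HEADER   # float32: 4 bytes per dimension
half_row = DIMS * 2 + HEADER   # float16: 2 bytes per dimension

print(full_row)   # 6152 bytes per row
print(half_row)   # 3080 bytes per row
```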

2. Loading into Postgres’s Shared Memory

Postgres manages a fixed pool of memory called shared_buffers to cache table and index pages. With halfvec:

  • The on‑disk pages containing your float16 embeddings are read straight into shared_buffers.
  • There’s no extra copying or buffer transformation—Postgres simply treats those pages as its cache, whether they contain 16‑bit or 32‑bit floats.

In other words, once your halfvec rows exist on disk, they go into RAM “as is.” You’re not paying any runtime penalty to unpack or reorganize them.

3. Building and Querying Your ANN Index

When pgvector builds an ANN index (like HNSW or IVFFlat), it needs to work directly with all your embedding values:

  1. Reading the raw bytes: pgvector reads the same 3,072‑byte slices for each embedding directly from shared memory.
  2. Interpreting them as float16:
  • On x86 servers with AVX‑512 FP16, the CPU can perform distance calculations natively on 16‑bit floats.
  • On platforms without FP16 instructions, the runtime will widen each 16‑bit value into a 32‑bit float in a register before computing.

Because the conversion (if needed) happens in CPU registers and vector units, it’s almost invisible next to the gains from halving your I/O traffic.
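The widening step can be sketched in pure Python: `struct.unpack` performs the equivalent of the register-level float16 → wider-float conversion before the arithmetic runs. The 8-dim vectors here are a toy stand-in for 1,536-dim embeddings:

```python
import random
import struct

random.seed(0)

# Two small embeddings stored as packed float16 bytes, standing in for a
# halfvec page slice (8 dims here instead of 1,536 for brevity).
a = [random.uniform(-1, 1) for _ in range(8)]
b = [random.uniform(-1, 1) for _ in range(8)]
a_bytes = struct.pack('<8e', *a)
b_bytes = struct.pack('<8e', *b)

# Without native FP16 arithmetic, each 16-bit value is widened to a wider
# float in a register before computing; unpack does the equivalent here.
a16 = struct.unpack('<8e', a_bytes)
b16 = struct.unpack('<8e', b_bytes)
dot16 = sum(x * y for x, y in zip(a16, b16))

dot32 = sum(x * y for x, y in zip(a, b))
print(abs(dot16 - dot32))   # tiny: rounding happened only at storage time
```

The key observation: precision is lost once, at storage time, and the distance computation itself runs at full register width.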

Why This Design Is Elegant

  • Zero manual conversion: You never write code to “convert” 32‑bit vectors to 16‑bit. Inserting into a halfvec column automatically casts for you.
  • Index metadata halves, too: All the parts of the ANN index that store numeric values—node coordinates in HNSW, centroids in IVFFlat—shrink by 50 percent.
  • Faster queries for free: Fewer bytes read from disk and fewer pages to cache means less I/O and fewer cache misses, on top of any CPU‑level speedups when working with half‑precision.

The Migration Tale: From vector to halfvec

You maintain a Postgres table:

CREATE TABLE docs (
  id  BIGSERIAL PRIMARY KEY,
  emb VECTOR(1536)
);

One evening, you decide to cut your RAM bill in half—here’s your no‑downtime script.

Step 1: Add the new halfvec column

ALTER TABLE docs
ADD COLUMN emb_half halfvec(1536);

This is a metadata‑only change (< 1 s), so your table remains online.

Step 2: Batch‑copy existing embeddings

Copy in chunks of 100 k rows to avoid WAL bloat:

DO $$
DECLARE
  batch_size BIGINT := 100000;
  min_id BIGINT;   -- id is BIGSERIAL, so use BIGINT, not INT
  max_id BIGINT;
BEGIN
  SELECT MIN(id), MAX(id) INTO min_id, max_id FROM docs;
  FOR start_id IN min_id..max_id BY batch_size LOOP
    UPDATE docs
    SET emb_half = emb        -- automatic vector → halfvec cast
    WHERE id BETWEEN start_id
                   AND LEAST(start_id + batch_size - 1, max_id);
    -- Optional throttle:
    PERFORM pg_sleep(0.1);
  END LOOP;
END;
$$;

Monitor progress and dead tuples:

SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'docs';

Step 3: Build the new index concurrently

CREATE INDEX CONCURRENTLY idx_docs_hnsw_half
  ON docs
  USING hnsw (emb_half vector_cosine_ops)
  WITH (m = 16, ef_construction = 256);

Track build:

SELECT * FROM pg_stat_progress_create_index;

Step 4: Swap reads

Option A: Rename columns in one transaction:

BEGIN;
ALTER TABLE docs RENAME COLUMN emb TO emb_full;
ALTER TABLE docs RENAME COLUMN emb_half TO emb;
-- Indexes follow their columns automatically; rename
-- idx_docs_hnsw_half here only if you want the name to match.
COMMIT;

Option B: Use a view:

CREATE OR REPLACE VIEW docs_active AS
SELECT id, COALESCE(emb_half, emb::halfvec) AS emb
FROM docs;

Point your application at docs_active.

Step 5: Cleanup

Once confident, drop the old column:

ALTER TABLE docs DROP COLUMN emb_full;
VACUUM FULL docs;  -- reclaims the space, but takes an ACCESS EXCLUSIVE lock

Putting Numbers on It: Benchmarks That Tell the Story

Official pgvector Benchmark

Dataset: dbpedia-openai-1000k-angular (1,000,000 vectors × 1,536 dimensions)
Source: ANN‑Benchmarks configuration for dbpedia-openai-1000k-angular (arXiv)

| Metric | fullvec (32‑bit) | halfvec (16‑bit) | Δ |
| --- | --- | --- | --- |
| Table size | 7.7 GB | 3.9 GB | −50 % |
| HNSW index size | 7.7 GB | 3.9 GB | −50 % |
| Build time (ef_construction = 256) | 377 s | 163 s | −57 % |
| Recall @ K=10 | 0.945 | 0.945 | 0 % |
| QPS (ef_search = 40) | 627 | 642 | +2.4 % |
| p99 latency | 2.7 ms | 1.9 ms | −30 % |

Insight: Identical recall, faster builds & queries, and 50 % storage savings.


Why halfvec Feels Faster: A Shelf and A Page Analogy

To achieve true millisecond-scale ANN lookups, your entire index must live in RAM. Here’s why halfvec’s 50 % size reduction translates into even greater speed gains:


1. PostgreSQL’s 8 KB Page Model

Postgres stores every table row in fixed-size “heap pages,” 8 KB by default. Rows cannot span pages, so each embedding—plus its row header—must fit entirely within a page:

  • Fullvec (float32)

    • Payload: 1,536 dims × 4 bytes = 6,144 bytes
    • + 8‑byte row header = 6,152 bytes → 1 vector/page
  • Halfvec (float16)

    • Payload: 1,536 dims × 2 bytes = 3,072 bytes
    • + 8‑byte header = 3,080 bytes → 2 vectors/page

🥊 Result: halfvec doubles the packing density. Twice as many vectors fit in the same 8 KB page, halving the number of pages you need to load for any given search.


2. Fewer Pages → Fewer I/O and Cache Misses

  1. I/O operations
  • Every page load from disk (or a cold OS page cache) costs ~50–100 µs on NVMe SSDs, and milliseconds on HDDs.
  • With halfvec, your ANN search touches half as many pages, cutting total I/O latency.
  2. Buffer cache pressure
  • Postgres’s shared_buffers (and the OS page cache) can hold a finite number of pages.
  • Halfvec indexes consume half the pages, so a higher fraction of your working set stays resident—fewer evictions and fewer page faults.
  3. Page pre-warming
  • To “pre-warm” an index into RAM, you typically scan all pages (e.g., SELECT count(*) FROM docs;).
  • Half as many pages means pre-warming completes in half the time, getting you to full performance faster.
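Under the article’s per-row sizes (6,152 and 3,080 bytes) and the default 8 KB page, the page-count difference for the 1M-vector benchmark table works out as follows:

```python
import math

PAGE_SIZE = 8192          # default PostgreSQL heap page size
N_VECTORS = 1_000_000     # the benchmark table size used above

def pages_needed(row_bytes: int) -> int:
    # Ignores the ~24-byte page header and per-tuple overhead for simplicity;
    # they don't change the 1-vs-2 rows-per-page outcome here.
    rows_per_page = PAGE_SIZE // row_bytes
    return math.ceil(N_VECTORS / rows_per_page)

full_pages = pages_needed(6152)   # 1 row/page -> 1,000,000 pages
half_pages = pages_needed(3080)   # 2 rows/page ->  500,000 pages
print(full_pages, half_pages)
```

Every one of those saved 500,000 pages is an I/O you never issue and a cache slot you never evict.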

3. CPU-Level FP16 Support

Modern CPUs can process half-precision floats with minimal overhead, often at the same throughput as single-precision:

  • Intel AVX-512 FP16

    • From 4th-gen Xeon Scalable onward, Intel added native FP16 instructions in the AVX-512 extension, allowing 16-bit operations directly in 512-bit registers (WikiChip).
    • Distance computations (e.g., dot products, cosine similarity) can run without widening to 32 bits, cutting instruction counts.
  • ARMv8.2+ FP16

    • ARM’s AArch64 architecture offers IEEE-754 binary16 via NEON and SVE, supporting load/store, arithmetic, and conversions on __fp16 types (developer.arm.com).
    • On Graviton3 (Neoverse-based) cores, FP16 pipelines can even outrun FP32 thanks to narrower data paths and lower power per operation.

4. End-to-End Speed Impact

Putting it all together:

| Factor | Fullvec (32-bit) | Halfvec (16-bit) | Impact |
| --- | --- | --- | --- |
| Vectors per 8 KB page | 1 | 2 | 2× fewer pages to load |
| I/O latency per search | N·ν | (N/2)·ν | ~50 % reduction in cumulative I/O time |
| Cache hits in shared_buffers | H | ≈ 2H | Fewer evictions → steadier in-RAM performance |
| CPU cycles per FP op | C₃₂ | C₁₆ ≲ C₃₂ | Up to 1:1 throughput on AVX-512/NEON |
Where N = number of pages probed, ν = per-page I/O cost, H = hit ratio, C₃₂/C₁₆ = cycles per FP32/FP16 operation.

The net effect is more than just a 2× speedup: you gain on I/O, cache locality, and—in some architectures—on pure compute throughput. That’s why practitioners often report 30–50 % lower query latencies after switching to halfvec, on top of the storage savings.


Verifying Precision Isn’t Lost

Even though embeddings usually lie in [−1.0, +1.0], it’s wise to sanity‑check:

-- Assumes a recent pgvector with vector→real[] and halfvec→real[] casts;
-- adjust if your version lacks them. ("full" is a reserved word, so the
-- columns get explicit aliases.)
WITH sample AS (
  SELECT id,
         emb::real[]      AS emb_full,
         emb_half::real[] AS emb_half
  FROM docs
  ORDER BY random()
  LIMIT 100
)
SELECT
  avg(abs(emb_full[i] - emb_half[i])) AS avg_error,
  max(abs(emb_full[i] - emb_half[i])) AS max_error
FROM sample,
     generate_series(1, array_length(sample.emb_full, 1)) AS i;

Expected results:

  • avg_error: ≲ 0.00002
  • max_error: ≲ 0.001

These tiny deltas won’t change nearest‑neighbor rankings in practice.
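The expected figures follow from binary16’s 10-bit mantissa: values in [0.5, 1) have a spacing of 2⁻¹¹, so round-to-nearest loses at most 2⁻¹² ≈ 0.000244 per component, and smaller magnitudes lose proportionally less. A quick stdlib sanity check on synthetic values (uniform noise, not real embeddings, whose components cluster near zero and thus show smaller average error):

```python
import random
import struct

random.seed(42)

def to_float16(x: float) -> float:
    """Round-trip through IEEE-754 binary16."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

vals = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
errors = [abs(v - to_float16(v)) for v in vals]

avg_error = sum(errors) / len(errors)
max_error = max(errors)
print(avg_error)   # roughly 1e-4 for uniform values in [-1, 1)
print(max_error)   # at most 2**-12 ≈ 0.000244 for |v| < 1
```

Compared to typical cosine distances between distinct documents (order 0.1 and up), errors this small essentially never flip a ranking.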


Advanced Tips & Best Practices

  1. Batch-size tuning
  • 100 k–200 k rows per UPDATE balances WAL throughput and lock duration.
  2. Replica health
  • Monitor pg_stat_replication; throttle batch updates with pg_sleep() if lag spikes.
  3. View-based rollbacks
  • Use COALESCE(emb_half, emb) views for seamless fallback to full precision.
  4. HNSW parameter tweaks
  • With halfvec, try reducing ef_construction by 10 % or increasing m for marginal recall gains.
  5. Memory settings
  • Set shared_buffers to roughly your dataset size.
  • Raise maintenance_work_mem for index builds.


Considerations & Caveats

  • Range limits: IEEE‑754 binary16 covers ±6.5×10⁴; verify your data’s min/max if you embed outliers.
  • bfloat16 vs. binary16: halfvec uses binary16—do not mix with bfloat16 weights.
  • ORM compatibility: Some ORMs may not recognize halfvec; plan custom migrations.
  • Replication lag: concurrent CREATE INDEX still logs writes—monitor and throttle.

Stay tuned for the next part!
