How we slashed storage in half—one byte at a time
When I first heard about float16 “half‑precision,” my reaction mirrored many of yours: “Sounds like hype—can it really save half the memory without wrecking recall?” In part 1 we saw how RAM‑hungry embeddings become as you scale.
Enter Scalar Quantization, the first technique in our compression trilogy. Today, we’ll journey from zero to hero on halfvec, pgvector’s built‑in float16 vector type.
Why Half‑Precision Feels Like “Cheating”—But Isn’t
Imagine shooting photos on your phone. In “high quality” mode, each image might be 12 MB. Switch to “medium”, and it shrinks to 6 MB with barely noticeable loss. Drop to “low”, and you see compression artifacts. Embeddings follow the same pattern:
- Float32 (32‑bit) = “high quality”
- Float16 (16‑bit) = “medium”
- Int8, Binary = “low”
Think of a 32-bit float as a very long ruler with 4,294,967,296 tick marks. Float32 uses 1 sign bit + 8 exponent bits + 23 mantissa bits = 32 bits (4 bytes).
Now a 16‑bit float is a much shorter ruler - only 65,536 marks. Float16 uses 1 sign + 5 exponent + 10 mantissa bits = 16 bits (2 bytes).
For most embedding values, the extra ticks between 0.000123 and 0.000124 don’t change which document is “closest”; they just waste cache lines.
By keeping the sign bit, five exponent bits, and ten fraction bits, we still capture 99 % of the geometric nuance while halving the payload.
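You can see this rounding behavior using nothing but Python’s standard library: the struct module speaks IEEE‑754 binary16 via the 'e' format code. A quick sketch (not pgvector code, just the number format itself):

```python
import struct

def roundtrip_f16(x: float) -> float:
    """Round-trip a value through IEEE-754 binary16 ('e' format)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Typical embedding values live in [-1, 1]; the rounding error is tiny.
for x in (0.123456, -0.98765, 0.000123):
    y = roundtrip_f16(x)
    print(f"{x:>9} -> {y:.6f} (error {abs(x - y):.2e})")
```

Every value comes back within a few parts in ten thousand of the original, which is exactly why nearest‑neighbor rankings survive the conversion.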
Inside halfvec: What Really Happens When You Switch to Float16
When you tell pgvector to use halfvec(1536), you’re simply asking it to store each of your 1,536 dimensions in half‑precision (16 bits) instead of full‑precision (32 bits). Here’s how that plays out behind the scenes—step by step:
1. Storing Your Vectors on Disk
- Full‑precision (vector): each dimension is a 32‑bit (4‑byte) float, so the core payload is 1,536 × 4 = 6,144 bytes, plus an 8‑byte header = 6,152 bytes per row.
- Half‑precision (halfvec): each dimension is a 16‑bit (2‑byte) float. That cuts the core payload to 1,536 × 2 = 3,072 bytes, and with the same 8‑byte header you end up with 3,080 bytes per row.
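You can check these numbers on your own instance with pg_column_size. A sketch, assuming pgvector ≥ 0.7.0 (the release that introduced halfvec):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Build one random 1,536-dim vector and measure both representations.
WITH one AS (
  SELECT array_agg(random())::vector(1536) AS v
  FROM generate_series(1, 1536)
)
SELECT pg_column_size(v)                AS full_bytes,  -- ≈ 6,152
       pg_column_size(v::halfvec(1536)) AS half_bytes   -- ≈ 3,080
FROM one;
```

Measuring a freshly computed value (rather than a stored row) avoids any TOAST compression muddying the comparison.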
2. Loading into Postgres’s Shared Memory
Postgres manages a fixed pool of memory called shared_buffers to cache table and index pages. With halfvec:
- The on‑disk pages containing your float16 embeddings are loaded straight into shared_buffers.
- There’s no extra copying or buffer transformation—Postgres simply treats those pages as its cache, whether they contain 16‑bit or 32‑bit floats.
In other words, once your halfvec rows exist on disk, they go into RAM “as is.” You’re not paying any runtime penalty to unpack or reorganize them.
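If you’re curious how much of the table is actually resident, the pg_buffercache contrib extension lets you count cached pages. A sketch, assuming you can install contrib modules on your instance:

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Count the 8 KB buffers that currently hold pages of "docs".
SELECT count(*)                        AS cached_pages,
       pg_size_pretty(count(*) * 8192) AS cached_bytes
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE c.relname = 'docs';
```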
3. Building and Querying Your ANN Index
When pgvector builds an ANN index (like HNSW or IVFFlat), it needs to work directly with all your embedding values:
- Reading the raw bytes: pgvector reads the same 3,072‑byte slices for each embedding directly from shared memory.
- Interpreting them as float16:
  - On x86 servers with AVX‑512 FP16, the CPU can perform distance calculations natively on 16‑bit floats.
  - On platforms without FP16 instructions, the runtime widens each 16‑bit value into a 32‑bit float in a register before computing.
Because the conversion (if needed) happens in CPU registers and vector units, it’s almost invisible next to the gains from halving your I/O traffic.
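That “widen, then compute” path is easy to mimic in plain Python: unpack the 16‑bit values to full floats, then accumulate. A toy sketch of the idea only; the real work happens in pgvector’s C code and the CPU’s vector units:

```python
import struct

def dot_f16(buf_a: bytes, buf_b: bytes) -> float:
    """Dot product over two buffers of packed IEEE-754 binary16 values.

    Mirrors the no-FP16-hardware path: each 16-bit value is widened
    to a full float before multiply-accumulate."""
    n = len(buf_a) // 2
    a = struct.unpack(f'<{n}e', buf_a)  # widen to full-precision floats
    b = struct.unpack(f'<{n}e', buf_b)
    return sum(x * y for x, y in zip(a, b))

a = struct.pack('<3e', 1.0, 0.5, -0.25)  # all exactly representable in fp16
b = struct.pack('<3e', 2.0, 2.0, 4.0)
print(dot_f16(a, b))  # 2.0
```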
Why This Design Is Elegant
- Zero manual conversion: You never write code to “convert” 32‑bit vectors to 16‑bit. Inserting into a halfvec column automatically casts for you.
- Index metadata halves, too: All the parts of the ANN index that store numeric values—node coordinates in HNSW, centroids in IVFFlat—shrink by 50 percent.
- Faster queries for free: Fewer bytes read from disk and fewer pages to cache means less I/O and fewer cache misses, on top of any CPU‑level speedups when working with half‑precision.
The Migration Tale: From vector to halfvec
You maintain a Postgres table:
CREATE TABLE docs (
id BIGSERIAL PRIMARY KEY,
emb VECTOR(1536)
);
One evening, you decide to cut your RAM bill in half—here’s your no‑downtime script.
Step 1: Add the new halfvec column
ALTER TABLE docs
ADD COLUMN emb_half halfvec(1536);
This is a metadata‑only change (< 1 s), so your table remains online.
Step 2: Batch‑copy existing embeddings
Copy in chunks of 100 k rows to avoid WAL bloat:
DO $$
DECLARE
  batch_size BIGINT := 100000;
  start_id   BIGINT;
  max_id     BIGINT;
BEGIN
  SELECT MIN(id), MAX(id) INTO start_id, max_id FROM docs;
  IF start_id IS NULL THEN
    RETURN;  -- empty table, nothing to copy
  END IF;
  WHILE start_id <= max_id LOOP
    UPDATE docs
       SET emb_half = emb  -- automatic vector → halfvec cast
     WHERE id BETWEEN start_id
               AND LEAST(start_id + batch_size - 1, max_id);
    COMMIT;  -- PG 11+: commit each batch (run the DO outside a transaction block)
    PERFORM pg_sleep(0.1);  -- optional throttle
    start_id := start_id + batch_size;
  END LOOP;
END;
$$;
Monitor progress and dead tuples:
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'docs';
Step 3: Build the new index concurrently
CREATE INDEX CONCURRENTLY idx_docs_hnsw_half
ON docs
USING hnsw (emb_half halfvec_cosine_ops)
WITH (m = 16, ef_construction = 256);
Track build:
SELECT * FROM pg_stat_progress_create_index;
Step 4: Swap reads
Option A: Rename columns in one transaction:
BEGIN;
ALTER TABLE docs RENAME COLUMN emb TO emb_full;
ALTER TABLE docs RENAME COLUMN emb_half TO emb;
-- Optionally rename the index to match, e.g.:
-- ALTER INDEX idx_docs_hnsw_half RENAME TO idx_docs_hnsw;
COMMIT;
Option B: Use a view:
CREATE OR REPLACE VIEW docs_active AS
SELECT id, COALESCE(emb_half, emb::halfvec(1536)) AS emb
FROM docs;
Point your application at docs_active.
Step 5: Cleanup
Once confident, drop the old column:
ALTER TABLE docs DROP COLUMN emb_full;
VACUUM FULL docs;  -- reclaims the space, but takes an exclusive lock on docs
Putting Numbers on It: Benchmarks That Tell the Story
Official pgvector Benchmark
Dataset: dbpedia-openai-1000k-angular (1,000,000 vectors × 1,536 dimensions)
Source: ANN‑Benchmarks configuration for dbpedia-openai-1000k-angular (arXiv)
| Metric | fullvec (32‑bit) | halfvec (16‑bit) | Δ |
|---|---|---|---|
| Table size | 7.7 GB | 3.9 GB | –50 % |
| HNSW index size | 7.7 GB | 3.9 GB | –50 % |
| Build time (ef_construction = 256) | 377 s | 163 s | –57 % |
| Recall @ K=10 | 0.945 | 0.945 | 0 % |
| QPS (ef_search = 40) | 627 | 642 | +2.4 % |
| p99 latency | 2.7 ms | 1.9 ms | –30 % |
Insight: Identical recall, faster builds & queries, and 50 % storage savings.
Why halfvec Feels Faster: A Shelf and A Page Analogy
To achieve true millisecond-scale ANN lookups, your entire index must live in RAM. Here’s why halfvec’s 50 % size reduction translates into even greater speed gains:
1. PostgreSQL’s 8 KB Page Model
Postgres stores every table row in fixed-size “heap pages,” 8 KB by default. Rows cannot span pages, so each embedding—plus its row header—must fit entirely within a page:
- Fullvec (float32)
  - Payload: 1,536 dims × 4 bytes = 6,144 bytes
  - + 8 bytes row header = 6,152 bytes → 1 vector/page
- Halfvec (float16)
  - Payload: 1,536 dims × 2 bytes = 3,072 bytes
  - + 8 bytes header = 3,080 bytes → 2 vectors/page
🥊 Result: halfvec doubles the packing density. Twice as many vectors fit in the same 8 KB page, halving the number of pages you need to load for any given search.
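The packing arithmetic above is simple enough to check by hand. This sketch deliberately ignores Postgres’s page header and per‑tuple overheads, so it slightly overstates capacity for small vectors, but matches the numbers in the text:

```python
PAGE_BYTES = 8192   # default Postgres heap page size
ROW_HEADER = 8      # per-vector header assumed in the text

def vectors_per_page(dims: int, bytes_per_dim: int) -> int:
    """How many embeddings fit in one heap page (simplified model)."""
    return PAGE_BYTES // (dims * bytes_per_dim + ROW_HEADER)

print(vectors_per_page(1536, 4))  # float32: 1 per page
print(vectors_per_page(1536, 2))  # float16: 2 per page
```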
2. Fewer Pages → Fewer I/O and Cache Misses
- I/O operations
  - Every page load from disk (or a cold OS page cache) costs ~50–100 µs on NVMe SSDs—and milliseconds on HDDs.
  - With halfvec, your ANN search touches half as many pages, cutting total I/O latency.
- Buffer cache pressure
  - Postgres’s shared_buffers (and the OS page cache) can hold a finite number of pages.
  - Halfvec indexes consume half the pages, so a higher fraction of your working set stays resident: fewer evictions and fewer page faults.
- Page pre-warming
  - To “pre-warm” an index into RAM, you typically scan all pages (e.g., SELECT count(*) FROM docs;).
  - Half as many pages means pre-warming completes in half the time, getting you to full performance faster.
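Rather than a full scan, the pg_prewarm contrib extension can load a relation’s pages into shared_buffers directly. A sketch, assuming contrib modules are installable and using the index name from the migration above:

```sql
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Returns the number of 8 KB pages read into shared_buffers.
SELECT pg_prewarm('idx_docs_hnsw_half');
```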
3. CPU-Level FP16 Support
Modern CPUs can process half-precision floats with minimal overhead, often at the same throughput as single-precision:
- Intel AVX-512 FP16
  - From 4th-gen Xeon Scalable onward, Intel added native FP16 instructions in the AVX-512 extension, allowing 16-bit operations directly in 512-bit registers ([WikiChip][3]).
  - Distance computations (e.g., dot products, cosine similarity) can run without widening to 32 bits, cutting instruction counts.
- ARMv8.2+ FP16
  - ARM’s AArch64 architecture offers IEEE-754 binary16 via NEON and SVE, supporting load/store, arithmetic, and conversions on __fp16 types ([developer.arm.com][4]).
  - On Graviton3 (Neoverse-based) cores, FP16 pipelines can even outrun FP32 thanks to narrower data paths and lower power per operation.
4. End-to-End Speed Impact
Putting it all together:
| Factor | Fullvec (32-bit) | Halfvec (16-bit) | Impact |
|---|---|---|---|
| Vectors per 8 KB page | 1 | 2 | 2× fewer pages to load |
| I/O latency per search | N·ν | (N/2)·ν | ~50 % reduction in cumulative I/O time |
| Cache hits in shared_buffers | H | ≈ 2H | Fewer evictions → steadier in-RAM performance |
| CPU cycles per FP op | C₃₂ | C₁₆ ≲ C₃₂ | Up to 1:1 throughput on AVX-512/NEON |

Where N = number of pages probed, ν = per-page I/O cost, H = hit ratio, and C₃₂/C₁₆ = cycles per FP32/FP16 operation.
The net effect is more than a flat 2× improvement: you gain on I/O, cache locality, and—on some architectures—pure compute throughput. That’s why practitioners often report 30–50 % lower query latencies after switching to halfvec, on top of the storage savings.
Verifying Precision Isn’t Lost
Even though embeddings usually lie in [−1.0, +1.0], it’s wise to sanity‑check:
WITH sample AS (
  -- "full" is a reserved word, and vector/halfvec values can't be
  -- subscripted directly, so alias and cast both columns to real[]
  SELECT emb::real[]                AS f32,
         (emb_half::vector)::real[] AS f16
  FROM docs
  ORDER BY random()
  LIMIT 100
)
SELECT
  avg(abs(f32[i] - f16[i])) AS avg_error,
  max(abs(f32[i] - f16[i])) AS max_error
FROM sample,
     generate_series(1, array_length(sample.f32, 1)) AS i;
Expected results:
- avg_error: ≲ 0.00002
- max_error: ≲ 0.001
These tiny deltas won’t change nearest‑neighbor rankings in practice.
Advanced Tips & Best Practices
- Batch-size tuning: 100 k–200 k rows per UPDATE balances WAL throughput and lock duration.
- Replica health: Monitor pg_stat_replication; throttle batch updates with pg_sleep() if lag spikes.
- View-based rollbacks: Use COALESCE(emb_half, emb) views for seamless fallback to full precision.
- HNSW parameter tweaks: With halfvec, try reducing ef_construction by 10 % or increasing m for marginal recall gains.
- Memory settings: Set shared_buffers ≈ dataset size; raise maintenance_work_mem for faster index builds.
Considerations & Caveats
- Range limits: IEEE‑754 binary16 covers ±6.5×10⁴; verify your data’s min/max if you embed outliers.
- bfloat16 vs. binary16: halfvec uses binary16—do not mix with bfloat16 weights.
- ORM compatibility: Some ORMs may not recognize halfvec; plan custom migrations.
- Replication lag: concurrent CREATE INDEX still logs writes—monitor and throttle.
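The binary16 range limit above is easy to demonstrate with Python’s struct module (a standalone sketch of the number format, unrelated to pgvector itself):

```python
import struct

# binary16 tops out at 65504; larger magnitudes cannot be encoded.
struct.pack('<e', 65504.0)    # fine: the largest finite binary16 value
try:
    struct.pack('<e', 7.0e4)  # beyond the binary16 range
except (OverflowError, struct.error):
    print('overflow: 70000.0 does not fit in binary16')
```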
Stay tuned for the next part!