Design, implementation, and benchmarks of a native BM25 index for Postgres. Now generally available to all Tiger Cloud customers and freely available via open source.
If you have used Postgres's built-in ts_rank for full-text search at any meaningful scale, you already know the limitations. Ranking quality degrades as your corpus grows. There is no inverse document frequency, so common words carry the same weight as rare ones. There is no term frequency saturation, so a document that mentions "database" 50 times outranks one that mentions it once. There is no efficient top-k path: scoring requires touching every matching row.
Most teams work around this by bolting on Elasticsearch or Typesense as a sidecar. That works, but now you are syncing data between two systems, operating two clusters, and debugging consistency issues when they diverge.
pg_textsearch takes a different approach: real BM25 scoring, built from scratch in C on top of Postgres's own storage layer. You create an index, write a query, and get results ranked by relevance:
CREATE INDEX ON articles USING bm25(content) WITH (text_config = 'english');
SELECT title, content <@> 'database ranking' AS score
FROM articles
ORDER BY content <@> 'database ranking'
LIMIT 10;
The <@> operator returns a BM25 relevance score. Scores are negated so that Postgres's default ascending ORDER BY returns the most relevant results first. The index is stored entirely in standard Postgres pages managed by the buffer cache. It participates in WAL, works with pg_dump and streaming replication, and requires no external storage or special backup procedures.
What shipped in 1.0

From preview to production. In October 2025, we released a preview that held the entire inverted index in shared memory, rebuilt from the heap on restart (preview blog). In the five months and 180+ commits since, the extension has been substantially rewritten:
- Disk-based segments replaced the memory-only architecture
- Block-Max WAND + WAND optimization for fast top-k queries
- Posting list compression with SIMD-accelerated decoding (41% smaller indexes)
- Parallel index builds (138M documents in under 18 minutes)
- 2.4x to 6.5x faster than ParadeDB/Tantivy for 2-4 term queries at 138M scale
- 8.7x higher concurrent throughput

This post covers the architecture, query optimization strategy, and benchmark results. We include a candid discussion of where ParadeDB is faster and a full accounting of current limitations.
Background: Why BM25 in Postgres?
Postgres ships tsvector/tsquery with ts_rank for full-text ranking. ts_rank uses an ad-hoc scoring function that lacks the three properties that make BM25 effective:
- Inverse document frequency (IDF): downweights common terms so that rarer, more informative terms drive the ranking.
- Term frequency saturation: prevents a document from scoring arbitrarily high by repeating a term many times. A document mentioning "database" 50 times is not 50 times more relevant than one mentioning it once.
- Document length normalization: accounts for the fact that a term match in a short document is more informative than the same match in a long one [1].
For applications where ranking quality matters (RAG pipelines, search-driven UIs, hybrid retrieval), this is a material limitation. At scale, ts_rank also has no top-k optimization path: ranking by relevance requires scoring every matching row.
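The three missing properties map directly onto the BM25 formula. Here is a minimal sketch of a single term's contribution, using standard BM25 with the usual defaults (this illustrates the math, not pg_textsearch's internal code):

```python
import math

def bm25_term_score(tf, doc_len, avg_len, n_docs, doc_freq, k1=1.2, b=0.75):
    # One term's BM25 contribution, showing all three properties:
    # - idf downweights terms that appear in many documents,
    # - tf saturates because tf also appears in the denominator,
    # - doc_len / avg_len penalizes matches in long documents.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / (tf + norm)
```

With these defaults, a document repeating a term 50 times scores well under 50x the single-mention score, and a rare term outweighs a common one at equal frequency.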
The primary existing BM25 extension for Postgres is ParadeDB/pg_search, which wraps the Tantivy search library written in Rust. Early versions stored the index in auxiliary files outside the WAL; current versions use Postgres pages.
pg_textsearch takes a different approach: rather than wrapping an external search library, the entire search engine (tokenization, compression, query optimization) is built from scratch in C on top of Postgres's storage layer.
Architecture

Fig. 1: pg_textsearch Architecture diagram
The hybrid memtable + segment design
pg_textsearch uses an LSM-tree-inspired architecture [4]. Incoming writes go to an in-memory inverted index (the memtable), which periodically spills to immutable on-disk segments. Segments compact in levels: when a level accumulates enough segments (default 8), they merge into the next level. Fewer segments means fewer posting lists to consult per query term, which directly reduces query latency. This is the same write-optimized-memtable / read-optimized-segment pattern used in RocksDB [5] and other LSM-based engines, adapted here for Postgres's page-based storage.
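The spill-and-merge cycle can be sketched in a few lines. Segments are modeled here as sorted lists of (term, doc_id) postings; the fanout default comes from the description above, and everything else is illustrative:

```python
import heapq

def spill_and_compact(levels, new_segment, fanout=8):
    # A spilled memtable lands in level 0 as a new segment. Whenever a level
    # accumulates `fanout` segments, they merge into one segment at the next
    # level, keeping the number of segments consulted per query term small.
    levels[0].append(sorted(new_segment))
    level = 0
    while level < len(levels) and len(levels[level]) >= fanout:
        merged = list(heapq.merge(*levels[level]))  # N-way sorted merge
        levels[level] = []
        if level + 1 == len(levels):
            levels.append([])
        levels[level + 1].append(merged)
        level += 1
    return levels
```

After eight level-0 spills, level 0 is empty and level 1 holds a single merged segment, mirroring the cascade described above.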
The write path: memtable
The memtable lives in Postgres shared memory, one per index, accessible to all backends. It contains a string-interning hash table that stores each unique term exactly once; per-term posting lists recording document IDs and term frequencies; and corpus statistics (document count and average document length) maintained incrementally so that BM25 scores can be computed without a separate pass over the index.
When the memtable exceeds a configurable threshold (default: 32M posting entries), it spills to a Level-0 disk segment at transaction commit. A secondary trigger (default: 100K unique terms per transaction) handles large single-transaction loads like bulk imports.
The memtable is rebuilt from the heap on startup. Since the heap is WAL-logged, no data is lost if Postgres crashes before a spill completes. This is analogous to how a write-ahead log protects an LSM memtable, except here the WAL is Postgres's own. The rebuild cost is proportional to the amount of data not yet spilled to segments; for indexes where most data has been spilled, startup is fast.
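As a toy model of the structures just described (illustrative only; the real memtable is a C structure in shared memory):

```python
class Memtable:
    # Interned terms, per-term posting lists of (doc_id, tf), and corpus
    # statistics maintained incrementally so BM25 scoring needs no separate
    # pass over the index.
    def __init__(self):
        self.terms = {}       # term -> posting list [(doc_id, tf), ...]
        self.doc_count = 0
        self.total_len = 0

    def add(self, doc_id, tokens):
        self.doc_count += 1
        self.total_len += len(tokens)
        tfs = {}
        for t in tokens:      # the dict keys play the role of string interning
            tfs[t] = tfs.get(t, 0) + 1
        for term, tf in tfs.items():
            self.terms.setdefault(term, []).append((doc_id, tf))

    @property
    def avg_len(self):
        return self.total_len / self.doc_count
```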

Fig. 2: pg_textsearch memtable write path
The read path: segments
Segments are immutable and stored in standard Postgres pages. Each segment contains:
- A term dictionary: a sorted array of offsets into a string pool, binary-searchable for O(log n) term lookup.
- Posting blocks of up to 128 documents each, containing delta-encoded doc IDs, packed term frequencies, and quantized document lengths (fieldnorms). A separate skip index stores one entry per posting block with upper-bound score metadata used by Block-Max WAND optimization (described below).
- A fieldnorm table mapping document lengths to 1-byte quantized values using Lucene/Tantivy's SmallFloat encoding [6]. This encoding is exact for lengths 0-39 (covering most short documents); for longer documents, quantization error increases from ~5% to ~11%. In practice, the impact on ranking is smaller than these numbers suggest: BM25 scores depend on the ratio of document length to average document length, which dampens quantization error, and the b parameter (default 0.75) further reduces length's influence.
- A doc ID to CTID mapping that translates internal document IDs to Postgres tuple identifiers for heap fetches.
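The fieldnorm behavior can be approximated with a simple keep-the-top-bits scheme. This is a hypothetical stand-in for SmallFloat, not the actual encoding, but it shows why the relative error stays bounded once lengths leave the exact region:

```python
def quantize_length(n):
    # Illustrative quantization in the spirit of SmallFloat (NOT the real
    # encoding): small lengths are stored exactly; larger lengths keep only
    # their top 4 significant bits, which bounds relative error below 12.5%.
    # Returns the decoded approximation rather than the packed byte, to keep
    # the sketch short.
    if n < 40:
        return n
    shift = n.bit_length() - 4
    return (n >> shift) << shift
```

Because BM25 uses length only as a ratio against the average length, and b = 0.75 scales that ratio down further, an error of this size moves final scores far less than the raw percentages suggest.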

Fig. 3: pg_textsearch segment internal structure
Minimizing page access
Storing data in Postgres pages means every access goes through the buffer manager. Even for pages already in cache, each access involves a buffer table lookup, pin acquisition, and lock handling. That overhead adds up in a scoring loop processing millions of postings. This constraint shaped several design decisions.
Each segment assigns compact 4-byte, segment-local document IDs (0 to N-1), which map to Postgres's 6-byte CTIDs (heap tuple identifiers). After collecting all documents for a segment, doc IDs are reassigned so that doc_id order matches CTID order. Sequential iteration through posting lists then produces sequential access to the CTID mapping, maximizing cache locality. CTIDs themselves are stored as two separate arrays (4-byte page numbers and 2-byte offsets) rather than interleaved 6-byte records, doubling cache line utilization.
The scoring loop works entirely with doc IDs, term frequencies, and fieldnorms. It never touches the CTID arrays. CTIDs are resolved only for the final top-k results in a single batched pass. A top-10 query that scores thousands of candidates resolves ten CTIDs, not thousands.
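A sketch of the reassignment step (a hypothetical helper; the real code operates on C arrays during segment construction):

```python
def assign_doc_ids(ctids):
    # Reassign segment-local doc IDs so that doc_id order matches CTID order,
    # and split CTIDs into two parallel arrays (page numbers and offsets)
    # instead of interleaved records, as described above.
    order = sorted(range(len(ctids)), key=lambda i: ctids[i])
    pages = [ctids[i][0] for i in order]
    offsets = [ctids[i][1] for i in order]
    remap = {old: new for new, old in enumerate(order)}  # old id -> new id
    return pages, offsets, remap
```

After this pass, walking a posting list in doc_id order touches the pages/offsets arrays sequentially, which is what maximizes cache locality.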
Postgres integration
Because the index is stored in standard buffer-managed pages, pg_textsearch participates in Postgres infrastructure without special handling: MVCC visibility, proper rollback on abort, WAL and physical replication, pg_dump / pg_upgrade, VACUUM with correct dead-entry removal, and planner hooks that detect the <@> operator and select index scans automatically. Logical replication works in the usual way: row changes are replicated and the index is rebuilt on the subscriber.
Query Optimization: Block-Max WAND
The top-k problem
Naive BM25 evaluation scores every document matching any query term. For a 3-term query on MS-MARCO v2 (138M documents), this means decoding and scoring posting lists with tens of millions of entries. Most applications need only the top 10 or 100 results. The challenge is finding them without scoring everything.
Block-Max WAND
pg_textsearch implements Block-Max WAND (BMW) [2], which uses block-level upper bounds to skip non-contributing posting blocks during top-k evaluation. Lucene adopted a similar approach in version 8.0 [7]. The core idea: maintain the score of the k-th best result seen so far as a threshold, and skip any posting block whose upper-bound score cannot exceed it.
Each 128-document posting block has a corresponding skip entry storing the maximum term frequency in the block and the minimum fieldnorm (the shortest document, which would score highest for a given term frequency). From these two values, BMW can compute a tight upper bound on the block's BM25 contribution without decompressing it. If the upper bound falls below the current threshold, the entire block (all 128 documents) is skipped.
To illustrate: consider a single-term top-10 query on a large corpus. After scanning a few thousand postings, the algorithm has accumulated 10 results with a minimum score of, say, 12.3. It now encounters a block where the upper-bound BM25 score (computed from the block's stored metadata) is 9.1. Since 9.1 < 12.3, no document in this block can enter the top 10, and the entire block is skipped without decompression. For short queries on large corpora, the vast majority of blocks are skipped this way.
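The single-term skip logic can be sketched as follows (block layout and field names are illustrative; real blocks are compressed):

```python
import heapq

def bm25(idf, tf, length, avg_len, k1=1.2, b=0.75):
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))

def scan_term(blocks, idf, avg_len, k=10):
    # Each block's skip entry carries its max term frequency and min fieldnorm
    # (the shortest document), which together bound any BM25 score inside.
    # Blocks that cannot beat the current k-th best score are skipped whole,
    # without being decoded.
    top = []          # min-heap of the k best scores seen so far
    skipped = 0
    for blk in blocks:
        bound = bm25(idf, blk["max_tf"], blk["min_len"], avg_len)
        if len(top) == k and bound <= top[0]:
            skipped += 1          # entire block pruned
            continue
        for tf, length in blk["postings"]:
            s = bm25(idf, tf, length, avg_len)
            if len(top) < k:
                heapq.heappush(top, s)
            elif s > top[0]:
                heapq.heapreplace(top, s)
    return sorted(top, reverse=True), skipped
```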

Fig. 4: pg_textsearch Block-Max WAND visualization
WAND pivot selection
For multi-term queries, pg_textsearch adds the WAND algorithm [3] for cross-term skipping. Terms are ordered by their current document ID, and the algorithm identifies a pivot term: the first term whose cumulative maximum score exceeds the current threshold. All terms before the pivot advance to at least the pivot's current doc ID, skipping entire ranges of documents across multiple posting lists simultaneously, before block-level BMW bounds are even checked. For multi-term queries, BMW compares the sum of per-term block upper bounds against the threshold, extending the single-term logic described above.
The combination of WAND (cross-term skipping) and BMW (within-list block skipping) is most effective for short queries (1-4 terms), which account for the majority of real-world search traffic. In the full MS-MARCO v1 query set (1,010,916 queries from Bing), 72.6% have 2-4 lexemes after English stemming and stopword removal, with a mean of 3.7 and a mode of 3. The speedup narrows for longer queries, where more blocks contain at least one term with a potentially high-scoring document. Grand et al. [7] observe the same pattern in Lucene's BMW implementation.
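Pivot selection itself is compact. A sketch under the same assumptions (cursors modeled as (doc_id, max_score) pairs, one per query term; illustrative, not the extension's code):

```python
def wand_pivot(cursors, threshold):
    # Walk the term cursors in order of current doc ID, accumulating each
    # term's maximum possible score. The pivot is the first point where the
    # running sum exceeds the threshold: no document before the pivot's doc ID
    # can qualify, so all earlier cursors may skip ahead to it.
    acc = 0.0
    for doc_id, max_score in sorted(cursors):
        acc += max_score
        if acc > threshold:
            return doc_id
    return None  # no remaining document can beat the threshold
```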
Compression and Storage
Posting blocks use a compression scheme designed for fast random-access decoding. Doc IDs are delta-encoded (storing differences between consecutive IDs rather than absolute values), then packed with variable-width bitpacking: the maximum delta in the block determines the bit width, and all deltas use that width. Term frequencies are packed separately with their own bit width. Fieldnorms are the 1-byte SmallFloat values described above.
The bitpack decode path uses branchless direct-indexed uint64 loads rather than a byte-at-a-time accumulator, eliminating branch misprediction in the inner decode loop. Where available, SIMD intrinsics (SSE2 on x86-64, NEON on ARM64) accelerate the mask-and-store step. A scalar fallback handles other platforms.
Compression reduces index size by 41% compared to uncompressed storage. Decode overhead is approximately 6% of query time (measured by profiling), which is more than offset by reduced buffer cache pressure. The scheme prioritizes decode speed over compression ratio.
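A sketch of the delta-plus-bitpack round trip (a Python integer stands in for the byte buffer; no SIMD path, purely illustrative):

```python
def pack_block(doc_ids):
    # Delta-encode a sorted block of doc IDs, then bit-pack every delta at
    # the width required by the largest delta in the block.
    deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    width = max(d.bit_length() for d in deltas) or 1
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * width)
    return packed, width, len(deltas)

def unpack_block(packed, width, count):
    # Fixed-width decode with a prefix sum to restore absolute doc IDs.
    mask = (1 << width) - 1
    out, doc_id = [], 0
    for i in range(count):
        doc_id += (packed >> (i * width)) & mask
        out.append(doc_id)
    return out
```

Because every delta in a block uses the same width, the decoder needs no per-value branching, which is what makes the branchless and SIMD variants described above possible.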
A note on index size comparisons: pg_textsearch does not store term positions, so it cannot support phrase queries natively (see Limitations). This makes its indexes inherently smaller than engines like Tantivy that store positions by default. The 19-26% size advantage reported in our benchmarks reflects both compression and this feature difference.
Parallel Index Build
For large tables, serial index construction can take hours. pg_textsearch uses Postgres's built-in parallel worker infrastructure to distribute the work.
The leader launches workers and assigns each a range of heap blocks. Workers scan their assigned blocks, tokenize documents via to_tsvector, build local in-memory indexes, and write intermediate segments to temporary BufFiles. The leader then performs an N-way merge of all worker output, writing a single merged segment directly to index pages.

Fig. 5: pg_textsearch Parallel Index Build
Workers run concurrently in the scan/tokenize/build phase; the leader merges sequentially. The expensive part (heap scanning, tokenization, posting list assembly) is CPU-bound and parallelizes well. The merge/write phase is comparatively cheap, so a serial merge captures most of the speedup with minimal complexity. It also produces a single fully-compacted segment that is optimal for query performance.
On MS-MARCO v2 (138M passages), 15 workers complete the build in 17 minutes 37 seconds:
SET max_parallel_maintenance_workers = 15;
SET maintenance_work_mem = '256MB';
CREATE INDEX ON passages USING bm25(content) WITH (text_config = 'english');
Benchmarks
Methodology
All benchmarks use the MS-MARCO passage ranking dataset [8], a standard information retrieval benchmark drawn from real Bing search queries. We compare pg_textsearch against ParadeDB v0.21.6 (which wraps Tantivy). Both extensions use their default configurations; Postgres tuning is specified per experiment. Both systems configure English stemming and stopword removal.
Queries are drawn uniformly from 8 token-count buckets (100 queries per bucket on v1; up to 100 per bucket on v2). Weighted-average metrics use the MS-MARCO v1 lexeme distribution as weights, reflecting real search traffic.
Cache state. All query benchmarks are warm-cache: a warmup pass runs before timing begins, and the working set fits in the OS page cache and shared_buffers for all configurations tested. Results reflect CPU and algorithmic efficiency, not I/O. We have not benchmarked memory-constrained configurations where the index exceeds available cache.
Ranking. Both systems produce BM25 rankings using the same tokenization (English stemming and stopwords). We have not performed a systematic ranking equivalence comparison; both implement standard BM25 with the same default parameters (k1 = 1.2, b = 0.75), but differences in IDF computation and tokenization edge cases may produce different orderings for some queries.
MS-MARCO query length distribution
The following histogram shows the distribution of query lengths in the full MS-MARCO v1 query set (1,010,916 queries), measured in lexemes after English stopword removal and stemming via Postgres to_tsvector('english'):

Fig. 6: MS-MARCO query length histogram
This distribution is broadly consistent with web search query length studies [9, 10]. The MS-MARCO mean of 3.7 lexemes (after stemming/stopword removal) corresponds to roughly 5–6 raw words, consistent with the corpus statistics reported by Nguyen et al. [8]. We use the v1 distribution for weighting throughout as it provides the largest sample.
Results: MS-MARCO v2 (138M passages)
Environment. Dedicated c6i.4xlarge EC2 instance: Intel Xeon Platinum 8375C, 8 cores / 16 threads, 123 GB RAM, NVMe SSD. Postgres 17.4 with shared_buffers = 31 GB. Both indexes fit in the buffer cache.
Index build:
| Metric | pg_textsearch | ParadeDB |
|---|---|---|
| Index size | 17 GB | 23 GB |
| Build time | 17 min 37 sec | 8 min 55 sec |
| Documents | 138,364,158 | 138,364,158 |
| Parallel workers | 15 | 14 |
The pg_textsearch index is 26% smaller; ParadeDB builds approximately 2x faster.
Single-client query latency (p50 median, top-10 queries):
| Lexemes | pg_textsearch (ms) | ParadeDB (ms) | Speedup |
|---|---|---|---|
| 1 | 5.11 | 59.83 | 11.7x |
| 2 | 9.14 | 59.65 | 6.5x |
| 3 | 20.04 | 77.62 | 3.9x |
| 4 | 41.92 | 98.89 | 2.4x |
| 5 | 67.76 | 125.38 | 1.9x |
| 6 | 102.82 | 148.78 | 1.4x |
| 7 | 159.37 | 169.65 | 1.1x |
| 8+ | 177.95 | 190.47 | 1.1x |
pg_textsearch is fastest on short queries, and the two systems converge at longer lengths. Weighted by the MS-MARCO v1 query length distribution, the overall p50 is 40.6 ms for pg_textsearch vs. 94.4 ms for ParadeDB, a 2.3x advantage.
Concurrent throughput. We ran pgbench with 16 parallel clients for 60 seconds (after a 5-second warmup). Each client repeatedly executes a query drawn at random from a weighted pool of 1,000 queries:
| Metric | pg_textsearch | ParadeDB |
|---|---|---|
| Transactions/sec | 198.7 | 22.8 |
| Average latency | 81 ms | 701 ms |
| Total transactions (60s) | 11,969 | 1,387 |
pg_textsearch sustains 8.7x higher throughput under concurrent load.
Results: MS-MARCO v1 (8.8M passages)
On the smaller dataset (GitHub Actions runner, 7 GB RAM, Postgres 17), the advantages are more pronounced: 26x speedup for single-token queries, 14x for 2-token, 7.3x for 4-token. Total sequential execution time for all 800 queries: 6.5 seconds for pg_textsearch vs. 25.2 seconds for ParadeDB. Full results and methodology are available at the benchmarks page.
Discussion
Latency vs. query length
The speedup correlates strongly with query length: 11.7x for single-token queries on v2, narrowing to 1.1x at 8+ tokens. This is the expected behavior of dynamic pruning algorithms like BMW and WAND. Grand et al. [7] observe the same pattern in Lucene's BMW implementation.
The practical significance depends on the workload's query length distribution. 72.6% of MS-MARCO queries have 2-4 lexemes, the range where pg_textsearch shows its largest advantage (6.5x to 2.4x on v2). Weighted by this distribution, the overall speedup is 2.3x on v2 and 3.9x on v1.
Concurrent throughput
The concurrent throughput advantage (8.7x) substantially exceeds the single-client advantage (2.3x weighted p50). pg_textsearch queries execute as C code operating on Postgres buffer pages, with all memory management handled by Postgres's buffer cache. ParadeDB routes queries through Rust/C FFI into Tantivy, which manages its own memory and I/O outside the buffer pool. We have not profiled ParadeDB's internals, so we cannot attribute the concurrency gap to specific causes, but the architectural difference (shared buffer cache vs. separate memory management) is a plausible contributor. ParadeDB's concurrent performance may also improve in future versions.
Where ParadeDB is faster
Index build time. ParadeDB builds indexes 1.6-2x faster across both datasets. Tantivy's indexer is highly optimized Rust code with its own I/O management, not constrained by Postgres's page-based storage. Build time is a one-time cost per index (or per REINDEX); it does not affect query performance.
Long queries. At 7+ lexemes, the two systems converge. On v2, the 8+ lexeme p50 is 178 ms for pg_textsearch vs. 190 ms for ParadeDB. These long queries represent ~3.7% of the MS-MARCO distribution.
Index size caveat. pg_textsearch indexes are 19-26% smaller, but this comparison is not apples-to-apples: pg_textsearch does not store term positions, while ParadeDB stores positions to support phrase queries.
Benchmark limitations
All measurements are warm-cache on datasets that fit in memory. The 100-query sample per bucket provides directional results but limited statistical power for tail latencies. ParadeDB v0.21.6 was current at time of testing; future versions may improve. We compare against ParadeDB because it is the primary Postgres-native BM25 alternative; standalone engines like Elasticsearch operate in a different deployment model. We have not benchmarked write-heavy workloads with concurrent queries.
Limitations
We want to be clear about what pg_textsearch does not support in 1.0.
No phrase queries. The index stores term frequencies but not term positions, so it cannot natively evaluate queries like "database system" as a phrase. Phrase matching can be done with a post-filter:
SELECT * FROM (
  SELECT * FROM documents
  ORDER BY content <@> 'database system'
  LIMIT 100  -- over-fetch to compensate for the post-filter
) sub
WHERE content ILIKE '%database system%'
LIMIT 10;
OR-only query semantics. All query terms are implicitly OR'd. A query for "database system" matches documents containing either term. We plan to add AND/OR/NOT operators via a dedicated boolean query syntax in a post-1.0 release.
No highlighting or snippet generation. Use Postgres's ts_headline() on the result set for highlighting.
No expression indexing. Each BM25 index covers a single text column. Workaround: create a generated column concatenating multiple fields.
Partition-local statistics. Each partition maintains its own IDF and average document length. Cross-partition queries return scores computed independently per partition.
No background compaction. Segment compaction runs synchronously during memtable spill. Write-heavy workloads may observe compaction latency. Background compaction is planned.
PL/pgSQL requires explicit index names. The implicit text <@> 'query' syntax relies on planner hooks that do not fire inside PL/pgSQL, DO blocks, or stored procedures. Use to_bm25query('query', 'index_name') explicitly. This is a practical limitation many developers will hit.
shared_preload_libraries required. pg_textsearch must be listed in shared_preload_libraries, requiring a server restart to install. On Tiger Cloud, this is handled automatically.
No fuzzy matching or typo tolerance. pg_textsearch uses Postgres's standard text search configurations for tokenization and stemming but does not provide built-in fuzzy matching. Typo-tolerant search requires a separate approach (e.g., pg_trgm).
What's Next
Planned work for post-1.0 releases:
- Boolean query operators: AND, OR, NOT via a dedicated query syntax
- Background compaction: decouple compaction from the write path
- Expression index support: index computed expressions, not just bare columns
- Dictionary compression: front-coding for terms, reducing dictionary size
- Improved write concurrency: better throughput for sustained insert-heavy workloads
Try It
pg_textsearch requires Postgres 17 or 18. The fastest way to try it is on Tiger Cloud, where it is already installed and configured. No setup, no shared_preload_libraries. Create a service and run the example below.
For self-hosted installations, pre-built binaries for Linux and macOS (amd64, arm64) are available on the GitHub Releases page. Add it to shared_preload_libraries and restart:
shared_preload_libraries = 'pg_textsearch'
Source code and full documentation: github.com/timescale/pg_textsearch
Part 2 of this series covers getting started with pg_textsearch, hybrid search with pgvectorscale, and production patterns.
References
[1] Robertson et al. "Okapi at TREC-3." 1994. See also: Robertson, Zaragoza. "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in IR, 3(4):333-389, 2009.
[2] Ding, Suel. "Faster top-k document retrieval using block-max indexes." SIGIR 2011, pp. 993-1002.
[3] Broder et al. "Efficient query evaluation using a two-level retrieval process." CIKM 2003, pp. 426-434.
[4] O'Neil et al. "The log-structured merge-tree (LSM-tree)." Acta Informatica, 33(4):351-385, 1996.
[5] Facebook. "RocksDB: A Persistent Key-Value Store for Fast Storage Environments." https://rocksdb.org/
[6] SmallFloat encoding: Apache Lucene SmallFloat.java. Tantivy uses an equivalent implementation.
[7] Grand et al. "From MAXSCORE to Block-Max Wand: The Story of How Lucene Significantly Improved Query Evaluation Performance." ECIR 2020.
[8] Nguyen et al. "MS MARCO: A Human Generated MAchine Reading COmprehension Dataset." 2016.
[9] Statista. "Distribution of online search queries in the US, February 2020, by number of search terms."
[10] Dean. "We Analyzed 306M Keywords." Backlinko, 2024.