DEV Community

Maya S.
Maya S.

Posted on

基于 Cloudberry 的全文检索方案对比:ParadeDB BM25 vs GIN vs ZomboDB

Top comments (1)

Collapse
 
maya_sun_29e7bf629e5dd7b3 profile image
Maya S.

Apache Cloudberry is an advanced and mature open-source Massively Parallel Processing (MPP) database, derived from the open-source version of the Pivotal Greenplum Database® but built on a more modern PostgreSQL kernel and with more advanced enterprise capabilities. Cloudberry can serve as a data warehouse and can also be used for large-scale analytics and AI/ML workloads.

This benchmark is based on the latest Apache Cloudberry main branch 3.0 (a development branch that has not yet been officially released). It presents a systematic performance comparison of three full-text search implementations within the PostgreSQL ecosystem: ParadeDB BM25 (tested via pg_search), Postgres GIN (trigram), and ZomboDB.

The benchmark covers key metrics, including index build time, query performance, and memory consumption, with the goal of assessing their real-world behavior in large-scale text retrieval workloads. The results aim to offer practical advice to Cloudberry users and database developers when selecting an appropriate full-text search solution within the PostgreSQL ecosystem.


Test Environment

All tests were conducted on a system equipped with a 4-core CPU and 16 GB of memory, running Apache Cloudberry version 3.0 (main branch).
The dataset used in this evaluation is All The News 2.1 (dropbox.com/s/cn2utnr5ipathhh/all-...), a publicly available news corpus containing 2,688,878 articles sourced from 27 U.S. publications.

The searchable fields include:
date, year, month, day, author, title, article, url, section, and publication.

The table schema is defined as follows:

CREATE TABLE news (
  id bigserial PRIMARY KEY,
  pub_date timestamp,
  pub_year integer,
  pub_month char(5),
  pub_day integer,
  author text,
  title text,
  article text,
  url text,
  section text,
  publication text
) DISTRIBUTED BY (id);

postgres=# select count(*) from news;
  count  
---------
 2688878
(1 row)
Enter fullscreen mode Exit fullscreen mode

1. ParadeDB BM25 Index

ParadeDB BM25 is a full-text search extension developed by ParadeDB.
It is built on a vectorized BM25 ranking algorithm, enabling efficient relevance-based retrieval over large text corpora.
Within Cloudberry, it can be integrated via the pg_search interface and is characterized by high query performance and low memory footprint.

However, the current version does not yet support JOIN operations.

CREATE INDEX news_article_bm25_idx ON news USING bm25 (id, article)
WITH (key_field = 'id', text_fields = '{"article": {"fast": true}}');
Enter fullscreen mode Exit fullscreen mode

2. GIN (trigram) Index

Postgres GIN (trigram) is the built-in full-text search option in Apache Cloudberry, implemented through PostgreSQL’s core pg_trgm module.

It supports similarity matching and text search using trigram indexes, requires no external dependencies, and offers high integration and ease of deployment.

As such, it serves as one of Cloudberry’s default, out-of-the-box search mechanisms.

CREATE INDEX gin_idx ON news USING gin (article gin_trgm_ops);
Enter fullscreen mode Exit fullscreen mode

3. ZomboDB Index

ZomboDB is an extension that integrates PostgreSQL with Elasticsearch, allowing users to execute Elasticsearch-powered full-text searches directly within SQL queries.

It combines PostgreSQL’s transactional consistency with Elasticsearch’s rich full-text and boolean search capabilities, making it particularly suitable for workloads that require complex search logic or external search integration.

At present, ZomboDB for Apache Cloudberry has not yet been open-sourced and remains in the planning stage.

This evaluation uses a compatible build solely to assess its potential performance within the Cloudberry environment.

CREATE INDEX news_zdb_idx ON news USING zombodb ((news.*))
WITH (url='http://localhost:9200/');
Enter fullscreen mode Exit fullscreen mode

Index Build Time Comparison


Performance Benchmark Results

Word-Level Search Latency

Search Term ParadeDB BM25 GIN (trigram) ZomboDB Result Count
‘the’ 71.07 s 99.15 s 78.37 s ≈ 2.4 M
‘table’ 23.55 s 92.80 s 27.50 s ≈ 74 K
‘abcd’ 0.09 s 0.11 s 0.21 s 152

Boolean (Composite) Queries

Query Condition ParadeDB BM25 GIN (trigram) ZomboDB Result Count
‘table OR data’ 25.66 s 50.74 s* 43.66 s ≈ 327 K
‘table AND data’ 4.08 s 69.80 s 4.95 s ≈ 9.7 K

Detailed Analysis

High-frequency term (“the”)

  • ParadeDB BM25 — 71.07 s (fastest)
  • ZomboDB — 78.37 s
  • GIN (trigram) — 99.15 s (slowest)

Medium-frequency term (“table”)

  • ParadeDB BM25 — 23.55 s (fastest)
  • ZomboDB — 27.50 s
  • GIN (trigram) — 92.80 s (slowest)

Low-frequency term (“abcd”)

  • All three performed similarly; differences were negligible.
  • GIN (trigram) was marginally faster (0.11 s).

Boolean query performance

  • AND queries → ZomboDB showed a clear lead (4.08 s).
  • OR queries → ParadeDB BM25 was fastest (25.66 s).
  • GIN (trigram) had the slowest OR query performance.

Memory Usage During Query Execution

Index Type Approx. Memory Usage
ParadeDB BM25 ≈ 120 MB
GIN (trigram) ≈ 1.4 GB
ZomboDB ≈ 120 MB

Conclusions

Each of the three full-text search solutions has their own dist :

  • ZomboDB: Offers the fastest index build time, making it well-suited for environments that require frequent index rebuilding or rapid deployment. It delivers stable query performance and supports complex Boolean search operations, making it ideal for applications that prioritize indexing speed and extensibility.

  • GIN (trigram): Provides moderate index build times but exhibits weaker performance in query execution—particularly with high-frequency terms and composite OR queries—while consuming significantly more memory. Its key advantages lie in simplicity and tight integration with PostgreSQL, which makes it suitable for lightweight workloads or use cases that depend on native PostgreSQL extensions.

  • ParadeDB BM25: Delivers the best overall query performance, especially for relevance-based searches, high-frequency terms, and complex queries. It also maintains low memory usage. However, its index build process takes longer, and current feature limitations—such as the lack of JOIN support—should be considered during system design.

In conclusion, ParadeDB BM25 is best suited for full-text search scenarios that emphasize text relevance and query performance; ZomboDB is more appropriate for applications requiring complex Boolean logic and extended search capabilities; and GIN (trigram) is most fitting for lightweight, low-frequency text matching tasks.

These benchmarking results provide valuable insights for Cloudberry users and PostgreSQL developers when selecting an appropriate full-text search solution within the PostgreSQL ecosystem.


Welcome to Apache Cloudberry