GBase 8a Full-Text Index: Features, Queries, and Configuration

The full-text index built into GBase 8a can index and search all text-type columns of a table, with support for boolean expressions, proximity searches, and online index updates. This guide covers the feature set, query examples, and the configuration parameters that control tokenization, indexing, and search in GBase 8a.

Core Features

Index all text-type columns in a table.
Run queries while the index is being built, with no downtime required.
Add new data incrementally to an existing index with UPDATE INDEX, avoiding full rebuilds:

  UPDATE INDEX index_name ON table_name;
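
As a rough sketch of the incremental workflow, the statements below load a new row, refresh the index, and then search it. The table t1 and column memo are borrowed from the query examples in this guide; the index name idx_memo and the inserted text are hypothetical, and the sketch assumes that a full-text index with that name already exists on t1.memo.

```sql
-- Assumption: a full-text index named idx_memo (hypothetical) already exists on t1.memo,
-- and t1 accepts an insert that supplies only the memo column.
INSERT INTO t1 (memo) VALUES ('TianJin trading ltd');

-- Fold the newly loaded rows into the existing index instead of rebuilding it.
UPDATE INDEX idx_memo ON t1;

-- After the update, the new row should be reachable through a full-text query.
SELECT * FROM t1 WHERE contains(memo, '"TianJin" & "ltd"');
```

Running UPDATE INDEX after each batch load, rather than after every single insert, is probably the more economical pattern, but that is a design choice to validate against your own workload.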

Query Examples

Combine logical operators and the NEAR function to express precise search conditions.

Boolean and Phrase Queries

-- Must contain both "TianJin" AND "ltd"
SELECT * FROM t1 WHERE contains(memo, '"TianJin" & "ltd"');
-- A space between terms is treated as AND
SELECT * FROM t1 WHERE contains(memo, 'TianJin ltd');
-- Contains "张三" OR "TianJin"
SELECT * FROM t1 WHERE contains(memo, '"张三" | "TianJin"');
-- Contains "张三" OR "TianJin" but NOT "人"
SELECT * FROM t1 WHERE contains(memo, '"张三" | "TianJin" - "人"');

NEAR Function: Word Distance and Order

NEAR((term1, term2), num [, order])

term1, term2: the search words, separated by commas and combined with AND; each must match exactly.
num: maximum word distance (integer), counted inclusive of the matched terms.
order (optional): 0 allows the terms in any order (default); 1 enforces the order in which they are listed.
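
To make the parameters concrete, here is a hedged sketch of two NEAR queries. It assumes that the NEAR expression is passed inside the contains() search string, in the same way as the boolean operators shown earlier; verify the exact quoting rules against the documentation for your GBase 8a version. The table t1 and column memo are reused from the examples above, and the distance of 5 is an arbitrary illustrative value.

```sql
-- Assumption: NEAR(...) is written inside the contains() search string.
-- "TianJin" and "ltd" within a word distance of 5, in either order (order defaults to 0).
SELECT * FROM t1 WHERE contains(memo, 'NEAR((TianJin, ltd), 5)');

-- Same distance limit, but "TianJin" must appear before "ltd" (order = 1).
SELECT * FROM t1 WHERE contains(memo, 'NEAR((TianJin, ltd), 5, 1)');
```

Because num counts the matched terms themselves, a small value such as 2 would only match the two words when they are adjacent.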

Configuring the Index and Tokenizer

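The behavior of the full-text engine is controlled through a configuration file in the following directory (the IP address in the path belongs to the example node and will likely differ on other installations):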

/opt/gbase/192.168.163.3/gcluster/server/lib/gbase/plugin/gbfti/cfg/

Key Parameters

Parameter	Description
multisegmask	Tokenization mode: 0 natural (default), 1 numeric n‑gram, 2 English n‑gram
mixedcase	Case sensitivity: 0 insensitive (default), 1 sensitive
step	N‑gram step: 0 uses default (trigram), >0 sets actual step, max 127
dict	Enable dictionary-based tokenization (requires a dictionary path)
hitflush	Maximum data volume processed per tokenization run
dictSlotPerUnit	Dictionary hash bucket count; larger values speed up word lookup at the cost of memory
quickUpdate	0 off (default); 1 enables parallel file writes, suitable for large documents and vocabularies
segThreads	Number of tokenizer threads
sortThreads	Number of sorting threads
outThreads	Number of output threads
maxDocPerUnit	Maximum rows per index segment
maxLineSize	Maximum text length per row
reduceMemMode	0 keeps index resident in memory (default); 1 flushes to disk to save memory with slightly higher latency
dictDynamicLoad	Toggle dynamic dictionary loading
maxMatch	Maximum concurrent search operations
maxThreadPerTask	Maximum per‑search‑task parallelism
dsoPath	Path to the tokenizer shared library
outCharset	Character set emitted by the tokenizer

Tuning these parameters lets you balance search performance against resource consumption in your GBase 8a environment.