Search Index Architecture for Privacy-First Engines

#nextjs #react #frontend #privacy

Elasticsearch vs. PostgreSQL Full-Text: Search Engine Indexing Deep Dive

Every search engine needs an index. The question is: which technology builds it?

PostgreSQL Full-Text Search

Built-in since PostgreSQL 8.3. Surprisingly capable.

Pros:

Zero additional infrastructure
ACID compliance (index stays consistent with data)
Good for < 1M documents
Supports German stemming, compound word handling
tsvector/tsquery is fast for simple needs

Cons:

No distributed search
Limited faceting and aggregation
Relevance scoring is basic
No built-in synonym support

Elasticsearch

The industry standard for search.

Pros:

Distributed by design (shards, replicas)
Advanced relevance scoring (BM25, custom functions)
Real-time indexing
Faceting, aggregation, geospatial
Plugin ecosystem (synonyms, analyzers)

Cons:

Additional infrastructure (cluster management)
Memory hungry (heap size tuning required)
Eventually consistent (not ACID)
Complex query DSL learning curve

What We Chose (and Why)

We use both:

PostgreSQL: Primary data store, user accounts, query logs (minimal), metadata
Elasticsearch: Search index only, rebuilt from PostgreSQL nightly

This hybrid gives us ACID for critical data and search performance for queries. If Elasticsearch fails, we can rebuild from PostgreSQL. If PostgreSQL is slow for search, Elasticsearch handles it.

German Language Challenges

German search is harder than English:

Compound words ("Datenschutzgrundverordnung")
Umlaut normalization (ä → ae or a?)
Case inflection
Dialect variations

Our solution: Custom Elasticsearch analyzer chain:

ICU tokenizer (handles compound words)
German stemmer (Snowball)
Umlaut normalizer (ä → a, ö → o, ü → u)
Synonym filter (DSGVO → Datenschutzgrundverordnung)

Graham Miranda builds search infrastructure at Graham Miranda UG (Berlin, HRB 36794).

DEV Community