DEV Community

ahmet gedik
ahmet gedik

Posted on

PostgreSQL Full-Text Search for Multi-Language Video Content

The Multi-Language Search Challenge

When your video platform indexes content from Poland, the Netherlands, Sweden, Norway, and Austria, a single-language search engine will not cut it. At ViralVidVault, we needed search that correctly stems Polish words, handles Swedish compound nouns, and still works for English queries. PostgreSQL's built-in full-text search delivered all of this without adding Elasticsearch to the stack.

Setting Up Language-Aware tsvectors

The key is storing a pre-computed tsvector column with the appropriate language dictionary per video:

ALTER TABLE videos ADD COLUMN search_vector tsvector;
ALTER TABLE videos ADD COLUMN content_language TEXT DEFAULT 'english';

-- Map regions to PostgreSQL dictionaries
-- PL=simple (no Polish dict in default PG), NL=dutch, SE=swedish, NO=norwegian, AT=german
UPDATE videos SET content_language = CASE
    WHEN region = 'NL' THEN 'dutch'
    WHEN region = 'SE' THEN 'swedish'
    WHEN region = 'NO' THEN 'norwegian'
    WHEN region IN ('AT', 'DE') THEN 'german'
    ELSE 'english'
END;

UPDATE videos SET search_vector =
    setweight(to_tsvector(content_language::regconfig, coalesce(title, '')), 'A') ||
    setweight(to_tsvector('simple', coalesce(channel_title, '')), 'B') ||
    setweight(to_tsvector(content_language::regconfig, coalesce(description, '')), 'C');
Enter fullscreen mode Exit fullscreen mode

The 'simple' dictionary for channel names is intentional — channel names are proper nouns that should not be stemmed. Using the language-specific dictionary for titles and descriptions means "löpning" (Swedish for "running") correctly matches "löpa" and "springer" (Dutch/German for "runner") gets stemmed properly.

GIN Index and Trigger

CREATE INDEX idx_videos_fts ON videos USING GIN(search_vector);

CREATE OR REPLACE FUNCTION videos_search_update() RETURNS trigger AS $$
BEGIN
    NEW.search_vector :=
        setweight(to_tsvector(
            COALESCE(NEW.content_language, 'english')::regconfig,
            coalesce(NEW.title, '')
        ), 'A') ||
        setweight(to_tsvector('simple', coalesce(NEW.channel_title, '')), 'B') ||
        setweight(to_tsvector(
            COALESCE(NEW.content_language, 'english')::regconfig,
            coalesce(NEW.description, '')
        ), 'C');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_videos_fts
    BEFORE INSERT OR UPDATE OF title, description, channel_title, content_language
    ON videos
    FOR EACH ROW EXECUTE FUNCTION videos_search_update();
Enter fullscreen mode Exit fullscreen mode

The trigger fires on relevant column changes, keeping the search vector in sync without application code.

Cross-Language Search Queries

Users search in their own language. A Dutch user searching for "muziek" should find Dutch music videos. An English user searching "music" should find English-tagged content. Here is the query we use:

-- Search with automatic language detection
SELECT video_id, title, channel_title, region,
       ts_rank(search_vector, query) AS rank
FROM videos,
     to_tsquery('simple', 'music') AS query
WHERE search_vector @@ query
ORDER BY rank DESC, views DESC
LIMIT 20;
Enter fullscreen mode Exit fullscreen mode

Using the simple configuration for the query means it matches against all language vectors without stemming the search term. For language-specific searches, pass the dictionary:

-- Dutch-specific search: "muziek" stems correctly
SELECT video_id, title, ts_rank(search_vector, query) AS rank
FROM videos,
     to_tsquery('dutch', 'muziek') AS query
WHERE search_vector @@ query
ORDER BY rank DESC;
Enter fullscreen mode Exit fullscreen mode

Comparison with SQLite FTS5

ViralVidVault started on SQLite FTS5. Here is what drove the migration:

Feature SQLite FTS5 PostgreSQL FTS
Language dictionaries English only (default) 30+ built-in
Weighted ranking Manual BM25 config Native A/B/C/D weights
Compound word handling None Language-aware splitting
Diacritics Manual unaccent Dictionary-handled
Index updates Rebuild or merge Trigger-based, live

For a European-focused platform, the multi-language support alone justified the switch. Swedish compound nouns like "fotbollsspelare" (football player) get tokenized correctly with the Swedish dictionary. FTS5 treats it as one opaque token.

Performance on European Content

With ~12,000 videos across 7 regions indexed with GIN:

  • Single-term search (any language): 1-3ms
  • Multi-term AND query: 3-5ms
  • Cross-language search (simple config): 2-4ms
  • Language-specific stemmed search: 2-4ms
  • Full reindex after bulk insert: ~1.5 seconds

The GIN index adds about 3MB of storage but makes the difference between a table scan and an instant lookup.


This article is part of the Building ViralVidVault series. Check out ViralVidVault to see these techniques in action.

Top comments (0)