The Multi-Language Search Challenge
When your video platform indexes content from Poland, the Netherlands, Sweden, Norway, and Austria, a single-language search engine will not cut it. At ViralVidVault, we needed search that correctly stems Polish words, handles Swedish compound nouns, and still works for English queries. PostgreSQL's built-in full-text search delivered all of this without adding Elasticsearch to the stack.
Setting Up Language-Aware tsvectors
The key is storing a pre-computed tsvector column with the appropriate language dictionary per video:
ALTER TABLE videos ADD COLUMN search_vector tsvector;
ALTER TABLE videos ADD COLUMN content_language TEXT DEFAULT 'english';
-- Map regions to PostgreSQL dictionaries
-- PL=simple (no Polish dict in default PG), NL=dutch, SE=swedish, NO=norwegian, AT=german
UPDATE videos SET content_language = CASE
WHEN region = 'NL' THEN 'dutch'
WHEN region = 'SE' THEN 'swedish'
WHEN region = 'NO' THEN 'norwegian'
WHEN region IN ('AT', 'DE') THEN 'german'
ELSE 'english'
END;
UPDATE videos SET search_vector =
setweight(to_tsvector(content_language::regconfig, coalesce(title, '')), 'A') ||
setweight(to_tsvector('simple', coalesce(channel_title, '')), 'B') ||
setweight(to_tsvector(content_language::regconfig, coalesce(description, '')), 'C');
The 'simple' dictionary for channel names is intentional — channel names are proper nouns that should not be stemmed. Using the language-specific dictionary for titles and descriptions means "löpning" (Swedish for "running") correctly matches "löpa" and "springer" (Dutch/German for "runner") gets stemmed properly.
GIN Index and Trigger
CREATE INDEX idx_videos_fts ON videos USING GIN(search_vector);
CREATE OR REPLACE FUNCTION videos_search_update() RETURNS trigger AS $$
BEGIN
NEW.search_vector :=
setweight(to_tsvector(
COALESCE(NEW.content_language, 'english')::regconfig,
coalesce(NEW.title, '')
), 'A') ||
setweight(to_tsvector('simple', coalesce(NEW.channel_title, '')), 'B') ||
setweight(to_tsvector(
COALESCE(NEW.content_language, 'english')::regconfig,
coalesce(NEW.description, '')
), 'C');
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_videos_fts
BEFORE INSERT OR UPDATE OF title, description, channel_title, content_language
ON videos
FOR EACH ROW EXECUTE FUNCTION videos_search_update();
The trigger fires on relevant column changes, keeping the search vector in sync without application code.
Cross-Language Search Queries
Users search in their own language. A Dutch user searching for "muziek" should find Dutch music videos. An English user searching "music" should find English-tagged content. Here is the query we use:
-- Search with automatic language detection
SELECT video_id, title, channel_title, region,
ts_rank(search_vector, query) AS rank
FROM videos,
to_tsquery('simple', 'music') AS query
WHERE search_vector @@ query
ORDER BY rank DESC, views DESC
LIMIT 20;
Using the simple configuration for the query means it matches against all language vectors without stemming the search term. For language-specific searches, pass the dictionary:
-- Dutch-specific search: "muziek" stems correctly
SELECT video_id, title, ts_rank(search_vector, query) AS rank
FROM videos,
to_tsquery('dutch', 'muziek') AS query
WHERE search_vector @@ query
ORDER BY rank DESC;
Comparison with SQLite FTS5
ViralVidVault started on SQLite FTS5. Here is what drove the migration:
| Feature | SQLite FTS5 | PostgreSQL FTS |
|---|---|---|
| Language dictionaries | English only (default) | 30+ built-in |
| Weighted ranking | Manual BM25 config | Native A/B/C/D weights |
| Compound word handling | None | Language-aware splitting |
| Diacritics | Manual unaccent | Dictionary-handled |
| Index updates | Rebuild or merge | Trigger-based, live |
For a European-focused platform, the multi-language support alone justified the switch. Swedish compound nouns like "fotbollsspelare" (football player) get tokenized correctly with the Swedish dictionary. FTS5 treats it as one opaque token.
Performance on European Content
With ~12,000 videos across 7 regions indexed with GIN:
- Single-term search (any language): 1-3ms
- Multi-term AND query: 3-5ms
- Cross-language search (simple config): 2-4ms
- Language-specific stemmed search: 2-4ms
- Full reindex after bulk insert: ~1.5 seconds
The GIN index adds about 3MB of storage but makes the difference between a table scan and an instant lookup.
This article is part of the Building ViralVidVault series. Check out ViralVidVault to see these techniques in action.
Top comments (0)