Why Fuzzy Autocomplete Matters for European Content
At ViralVidVault, we index video titles from 7 European regions. Users searching for a Swedish video might type "goteborg" instead of "Goteborg" or misspell a Dutch title. Standard LIKE queries fail hard on typos. PostgreSQL trigrams handle this gracefully.
Setting Up pg_trgm
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE TABLE videos (
id SERIAL PRIMARY KEY,
video_id VARCHAR(16) UNIQUE NOT NULL,
title TEXT NOT NULL,
channel_title TEXT,
region VARCHAR(4) NOT NULL DEFAULT 'GB',
view_count BIGINT DEFAULT 0,
virality_score FLOAT DEFAULT 0,
fetched_at TIMESTAMP DEFAULT NOW()
);
-- GIN trigram indexes for fast fuzzy matching
CREATE INDEX idx_videos_title_trgm ON videos USING GIN (title gin_trgm_ops);
CREATE INDEX idx_videos_channel_trgm ON videos USING GIN (channel_title gin_trgm_ops);
Region-Weighted Autocomplete
The unique challenge for a European video vault: a search should prefer results from the user's browsed region but still include cross-region matches. Here's the function:
CREATE OR REPLACE FUNCTION autocomplete_videos(
query TEXT,
user_region VARCHAR(4) DEFAULT 'GB',
max_results INT DEFAULT 10
)
RETURNS TABLE (
video_id VARCHAR,
title TEXT,
channel_title TEXT,
region VARCHAR,
view_count BIGINT,
score FLOAT
) AS $$
BEGIN
RETURN QUERY
SELECT v.video_id, v.title, v.channel_title, v.region, v.view_count,
(
similarity(v.title, query) * 0.6
+ similarity(COALESCE(v.channel_title, ''), query) * 0.2
-- Boost same-region results
+ CASE WHEN v.region = user_region THEN 0.15 ELSE 0 END
-- Boost viral content
+ LEAST(v.virality_score / 100.0 * 0.05, 0.05)
) AS score
FROM videos v
WHERE v.title % query
OR v.channel_title % query
OR lower(v.title) LIKE lower(query) || '%'
ORDER BY score DESC, v.view_count DESC
LIMIT max_results;
END;
$$ LANGUAGE plpgsql STABLE;
Usage examples:
-- Dutch user searching with a typo
SELECT * FROM autocomplete_videos('muziek videoclp', 'NL', 8);
-- Matches "Muziek Videoclip" despite the missing 'i'
-- Swedish user, prefix match
SELECT * FROM autocomplete_videos('svensk', 'SE', 8);
-- Boosts Swedish results in the ranking
Python API Endpoint
Wire it to a FastAPI endpoint:
from fastapi import FastAPI, Query, Request
import asyncpg
app = FastAPI()
pool: asyncpg.Pool = None
@app.on_event("startup")
async def startup():
global pool
pool = await asyncpg.create_pool("postgresql://user:pass@localhost/viralvidvault")
@app.get("/api/autocomplete")
async def autocomplete(
q: str = Query(min_length=2, max_length=100),
region: str = Query(default="GB", max_length=4),
):
rows = await pool.fetch(
"SELECT * FROM autocomplete_videos($1, $2, $3)",
q, region.upper(), 8,
)
return {
"query": q,
"region": region,
"suggestions": [
{
"video_id": r["video_id"],
"title": r["title"],
"channel": r["channel_title"],
"region": r["region"],
"views": r["view_count"],
"score": round(r["score"], 3),
}
for r in rows
],
}
Frontend with Debouncing
const searchInput = document.getElementById('search-input');
let timer;
searchInput.addEventListener('input', (e) => {
clearTimeout(timer);
timer = setTimeout(async () => {
const q = e.target.value.trim();
if (q.length < 2) return;
// Detect user's region from page context
const region = document.documentElement.dataset.region || 'GB';
const res = await fetch(`/api/autocomplete?q=${encodeURIComponent(q)}®ion=${region}`);
const data = await res.json();
const dropdown = document.getElementById('autocomplete-list');
dropdown.innerHTML = data.suggestions.map(s => `
<a href="/watch/${s.video_id}" class="suggestion">
<span class="title">${s.title}</span>
<span class="meta">${s.channel} · ${s.region}</span>
</a>
`).join('');
}, 200);
});
Performance on European Content
On ViralVidVault's dataset with titles in English, Polish, Dutch, Swedish, and Norwegian:
| Approach | Avg Latency | Fuzzy Matching |
|---|---|---|
ILIKE '%query%' |
38ms | No |
pg_trgm GIN |
2.5ms | Yes |
pg_trgm + region boost |
3.1ms | Yes + regional |
Trigram similarity works across Latin-alphabet languages without any language-specific configuration. Polish characters like "ś", "ż", and "ó" are handled naturally because trigrams operate on raw character sequences.
Tuning the Similarity Threshold
The default threshold of 0.3 works for English but may need adjustment for shorter European words:
-- For Dutch and Swedish titles, lower threshold catches more matches
SET pg_trgm.similarity_threshold = 0.2;
-- Test threshold impact
SELECT title, similarity(title, 'zweed') as score
FROM videos
WHERE title % 'zweed'
ORDER BY score DESC;
At ViralVidVault, we settled on 0.22 as the sweet spot -- low enough to catch European language typos, high enough to avoid irrelevant noise. You can also make this dynamic per-region by adjusting the threshold in the application layer before each query.
The combination of trigram indexes and region-weighted scoring gives users a search experience that feels intelligent. Misspellings, partial words, and cross-language queries all return relevant results in under 5 milliseconds.
This article is part of the Building ViralVidVault series. Check out ViralVidVault to see these techniques in action.
Top comments (0)