A deep dive into one of the most clever tricks in database engineering - explained so simply, your grandmother could follow along.
The Problem: Computers Are Too Exact
Imagine you're searching for a restaurant called "Maharaja Palace" on an app. You type "Maharja Palce" - two small typos. A traditional search engine looks at your input and says:
"No results found."
Zero. Nothing. Because the text doesn't exactly match.
This is the fundamental problem with how computers search for text. They compare strings character by character, and if even one letter is out of place, the match fails. Humans, on the other hand, are remarkably good at understanding that "Maharja" probably means "Maharaja". We do it instinctively, without even thinking.
Fuzzy search is the art of teaching computers to think a little more like humans - to find things that are close enough, not just exactly right.
And in PostgreSQL, the most powerful tool for this is called Trigram Search, powered by an extension called pg_trgm.
Part 1: What Is a Trigram? (The Building Block of Everything)
Before we dive into how PostgreSQL searches, we need to understand one simple concept: the trigram.
A trigram is just a group of 3 consecutive characters from a word.
That's it. Three letters in a row.
Let's break the word "cat" into trigrams:
Word: "cat"
PostgreSQL pads the word with spaces for edge handling: " cat "
Trigrams:
" c"
" ca"
"cat"
"at "
"t "
💡 PostgreSQL secretly adds two spaces before and one space after every word. This is to give the beginning and end of words their own identity, so "cat" and "concatenate" don't blur together.
Let's try a longer word: "elephant"
Word: " elephant "
Trigrams:
" e"
" el"
"ele"
"lep"
"eph"
"pha"
"han"
"ant"
"nt "
Every word gets chopped into these little 3-character windows, sliding one character at a time across the word from left to right.
Think of it like a magnifying glass that's 3 characters wide, slowly sliding across a word and taking a snapshot at every position.
Part 2: Why Trigrams? Why Not 2 or 4 Letters?
This is a fair question. Why exactly three?
Bigrams (2 letters): Too short. The bigram "ca" appears in "cat", "car", "can", "castle", "scar", "occasion"... it's so common it doesn't carry much meaning. Matching would produce too many false positives.
4-grams or longer: Too specific. A 4-gram like "elep" only really appears in "elephant" and close relatives. If you make a typo in that 4-gram, you lose the match entirely. Too brittle.
Trigrams (3 letters): The sweet spot. Common enough to appear across related words, specific enough to carry meaning. A single typo only destroys a few trigrams, leaving most of the others intact to still find a match.
It's the Goldilocks zone of string matching.
Part 3: The Magic - How Similarity Is Calculated
Here's where the real cleverness lives.
Once every word is broken into trigrams, PostgreSQL measures how similar two words are by counting how many trigrams they share.
The formula is:
Similarity = (Number of shared trigrams) / (Total unique trigrams across both words)
This gives a number between 0 (completely different) and 1 (identical).
Let's work through a real example
Say we're comparing "night" and "knight".
Step 1: Generate trigrams for "night"
" n", " ni", "nig", "igh", "ght", "ht "
→ 6 trigrams
Step 2: Generate trigrams for "knight"
" k", " kn", "kni", "nig", "igh", "ght", "ht "
→ 7 trigrams
Step 3: Find the shared trigrams
Shared: "nig", "igh", "ght", "ht "
→ 4 shared trigrams
Step 4: Calculate similarity
Total unique trigrams = 6 + 7 - 4 = 9 (subtract shared ones to avoid counting them twice)
Shared = 4
Similarity = 4 / 9 = 0.44
So "night" and "knight" are about 44% similar according to trigrams. That seems low - they sound the same! But trigrams are measuring character patterns, not pronunciation. The extra "kn" at the start of "knight" introduces several new unique trigrams that "night" doesn't have.
Now let's try a typo: "niight" vs "night"
Trigrams for "niight":
" n", " ni", "nii", "iig", "igh", "ght", "ht "
Trigrams for "night":
" n", " ni", "nig", "igh", "ght", "ht "
Shared: " n", " ni", "igh", "ght", "ht " → 5 shared
Total unique: 7 + 6 - 5 = 8
Similarity = 5/8 = 0.625
Despite the typo ("niight"), the similarity is still 62.5% - high enough that a search would still find the match. This is the magic of trigrams.
Part 4: The Threshold - Deciding What "Close Enough" Means
Now that we can calculate similarity as a number, we need to decide: how similar is similar enough?
PostgreSQL uses a default threshold of 0.3 (30%). If two strings share at least 30% of their trigrams, they're considered a match.
You can think of it like a teacher grading papers:
- Score ≥ 0.3 → Pass (it's a match)
- Score < 0.3 → Fail (not a match)
You can adjust this threshold depending on how strict or lenient you want your search to be:
-- Make it stricter (only very close matches)
SET pg_trgm.similarity_threshold = 0.5;
-- Make it more lenient (catch more distant matches, but more noise)
SET pg_trgm.similarity_threshold = 0.2;
Practical guidance:
| Threshold | Best For |
|-----------|----------|
| 0.1 – 0.2 | Very forgiving search, names from other languages |
| 0.3 | General purpose (default, works well most of the time) |
| 0.5 | Names, product codes where near-exact match is needed |
| 0.7+ | Almost exact matching only |
Part 5: Under the Hood - How PostgreSQL Actually Executes the Search
Understanding the math is one thing. But how does PostgreSQL actually run this efficiently when you have millions of rows? Recalculating trigrams for every row on every query would be painfully slow.
The answer is: the GIN Index.
What is a GIN Index?
GIN stands for Generalized Inverted Index. The name sounds intimidating, but the concept is simple.
Think of it like the index at the back of a book. Instead of reading every page to find where "photosynthesis" is mentioned, you flip to the back, find "photosynthesis" alphabetically, and it tells you exactly which pages contain it.
A GIN index does the same thing for trigrams:
GIN Index Structure (simplified):
" ca" → [row 5, row 12, row 89, row 203]
" co" → [row 3, row 7, row 44]
"cat" → [row 5, row 89]
"cof" → [row 3]
"ofe" → [row 3]
...
For every possible trigram, the index stores a list of which rows contain that trigram. When you search for "coffee", PostgreSQL:
- Breaks "coffee" into its trigrams:
" co","cof","off","ffe","fee","ee " - Looks up each trigram in the GIN index
- Finds rows that contain several of these trigrams (intersection)
- Calculates the exact similarity score only for those candidate rows
- Returns results above the threshold
Instead of scanning millions of rows, it scans a short list of candidates. This is what makes trigram search fast - often hundreds of times faster than a brute-force scan.
Visualizing the Query Flow
Your Query: "SELECT * WHERE name % 'cofee'"
│
â–¼
Break 'cofee' into trigrams
[" co", "cof", "ofe", "fee", "ee "]
│
â–¼
Look up each trigram in GIN index
→ Candidate rows: [3, 7, 44, 89, 102]
│
â–¼
Calculate similarity() for each candidate
→ Row 3: 0.71 ✓ (above threshold)
→ Row 7: 0.18 ✗
→ Row 44: 0.65 ✓
→ Row 89: 0.22 ✗
→ Row 102: 0.80 ✓
│
â–¼
Return rows 3, 44, 102 - ordered by score
Without the GIN index, step 2 would require checking every row. With it, only a handful of candidates get the expensive similarity calculation.
Part 6: Setting It Up in PostgreSQL
Enough theory - here's how you actually use it.
Step 1: Enable the Extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;
This is a one-time setup per database. The extension comes bundled with PostgreSQL - no installation needed.
Step 2: Create Your Table and Index
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL
);
-- Create a GIN index on the name column
CREATE INDEX idx_products_name_trgm
ON products
USING GIN (name gin_trgm_ops);
The gin_trgm_ops part tells PostgreSQL: "use trigram logic for this index".
Step 3: Query Using Similarity
-- The % operator: returns rows above the similarity threshold
SELECT * FROM products
WHERE name % 'cofee maker';
-- The similarity() function: returns the score, lets you sort by closeness
SELECT name, similarity(name, 'cofee maker') AS score
FROM products
WHERE name % 'cofee maker'
ORDER BY score DESC;
Sample output:
name | score
-------------------+-------
Coffee Maker Pro | 0.82
Coffee Maker Mini | 0.78
Coffee Mug Set | 0.31
Step 4: Tune the Threshold (Optional)
-- See the current threshold
SHOW pg_trgm.similarity_threshold;
-- Change it for this session
SET pg_trgm.similarity_threshold = 0.4;
Part 7: Real-World Use Cases
1. Search Autocomplete / "Did you mean?"
-- User typed "postgress" - find the closest match
SELECT name, similarity(name, 'postgress') AS score
FROM technologies
WHERE name % 'postgress'
ORDER BY score DESC
LIMIT 5;
2. Deduplication - Finding Near-Duplicate Records
-- Find customer records that might be the same person
SELECT a.name, b.name, similarity(a.name, b.name) AS score
FROM customers a
JOIN customers b ON a.id < b.id
WHERE similarity(a.name, b.name) > 0.7
ORDER BY score DESC;
3. Product Search with Typo Tolerance
-- Search even with misspellings
SELECT id, name, price,
similarity(name, $1) AS relevance
FROM products
WHERE name % $1
OR description % $1
ORDER BY relevance DESC
LIMIT 20;
4. Combining with Full-Text Search for Best Results
-- Use full-text search for speed + trigrams for typo tolerance
SELECT name, ts_rank(tsv, query) AS text_rank,
similarity(name, 'databse') AS fuzzy_rank
FROM articles,
plainto_tsquery('english', 'database') query
WHERE tsv @@ query
OR name % 'databse'
ORDER BY (ts_rank(tsv, query) + similarity(name, 'databse')) DESC;
Part 8: The word_similarity and strict_word_similarity Functions
PostgreSQL's pg_trgm also provides two variants of the similarity function that handle multi-word strings differently.
| Function | What It Does |
|---|---|
similarity(a, b) |
Compares the full strings as a whole |
word_similarity(a, b) |
Checks if a is similar to any word in b
|
strict_word_similarity(a, b) |
Like word_similarity but requires whole-word boundaries |
Example:
SELECT similarity('word', 'two words');
-- Returns: 0.18 (compares full strings)
SELECT word_similarity('word', 'two words');
-- Returns: 0.8 (finds that 'word' closely matches 'words')
This is invaluable when users search for a single product word inside a longer product name.
Part 9: Performance Tips
Always Create the GIN Index Before Querying Large Tables
Without the index, PostgreSQL does a full sequential scan - checking every row. On a table with 10 million rows, this can take minutes. With the index, the same query can complete in milliseconds.
-- Check if your query is using the index
EXPLAIN ANALYZE
SELECT * FROM products WHERE name % 'laptop';
Look for Index Scan using idx_products_name_trgm in the output. If you see Seq Scan instead, your index isn't being used - check that it was created correctly.
Index Maintenance
GIN indexes can become bloated over time with many inserts and updates. Periodically run:
VACUUM ANALYZE products;
Consider a GIST Index for Certain Workloads
-- Alternative index type
CREATE INDEX idx_products_name_gist
ON products
USING GIST (name gist_trgm_ops);
| Index Type | Build Speed | Query Speed | Update Speed | Choose When |
|---|---|---|---|---|
| GIN | Slower | Faster | Slower | Read-heavy workloads |
| GiST | Faster | Slower | Faster | Write-heavy workloads |
Part 10: Limitations - When Trigrams Don't Work Well
Trigrams are powerful but not magic. They struggle in a few situations:
Short Strings (Less Than 3 Characters)
A 1 or 2 character string can't form a meaningful set of trigrams. Searching for "AI" or "DB" will produce poor results.
-- Workaround: Fall back to ILIKE for very short queries
SELECT * FROM tags
WHERE CASE
WHEN length('AI') < 3 THEN name ILIKE '%AI%'
ELSE name % 'AI'
END;
Very Different Languages
Trigrams work on character patterns. For languages with different scripts (Arabic, Chinese, Thai), the trigram approach still works but may need different tokenization strategies.
Phonetic Similarity
Trigrams match visual similarity, not sound similarity. "Night" and "knight" sound identical but have low trigram similarity because they look different. For phonetic matching, consider the fuzzystrmatch extension with soundex or metaphone.
CREATE EXTENSION fuzzystrmatch;
SELECT * FROM names
WHERE metaphone(name, 10) = metaphone('knight', 10);
-- This would match "night" and "knight"
Summary: The Trigram Story in One Paragraph
Every word in your database gets sliced into overlapping 3-character chunks called trigrams. These trigrams are stored in a GIN index - like a giant lookup table mapping each trigram to the rows containing it. When you search for a word, PostgreSQL slices that word into trigrams too, quickly finds candidate rows via the index, then calculates a similarity score (0 to 1) based on how many trigrams the two words share. If that score passes a threshold (default 0.3), it's considered a match. The result: fast, typo-tolerant, human-friendly search - baked right into your database, no external search engine required.
Quick Reference Cheat Sheet
-- Setup
CREATE EXTENSION pg_trgm;
CREATE INDEX idx ON table USING GIN (column gin_trgm_ops);
-- Basic search
SELECT * FROM table WHERE column % 'search term';
-- With score
SELECT *, similarity(column, 'search term') AS score
FROM table WHERE column % 'search term'
ORDER BY score DESC;
-- Adjust threshold
SET pg_trgm.similarity_threshold = 0.3;
-- Word-level similarity
SELECT word_similarity('search', 'long string with search inside');
-- Check index usage
EXPLAIN ANALYZE SELECT * FROM table WHERE column % 'term';
Fuzzy search is one of those features that, once you add it to your application, users can't imagine living without. That tolerant, forgiving search experience - the one that doesn't punish you for a stray keystroke - is powered by these elegant little 3-character windows sliding across your data, quietly doing the math.
Top comments (0)