Dhananjay Haridas

Posted on Jun 3

Fuzzy Text Search in PostgreSQL: How Trigrams Make Computers "Almost Right"

A deep dive into one of the most clever tricks in database engineering - explained so simply, your grandmother could follow along.

The Problem: Computers Are Too Exact

Imagine you're searching for a restaurant called "Maharaja Palace" on an app. You type "Maharja Palce" - two small typos. A traditional search engine looks at your input and says:

"No results found."

Zero. Nothing. Because the text doesn't exactly match.

This is the fundamental problem with how computers search for text. They compare strings character by character, and if even one letter is out of place, the match fails. Humans, on the other hand, are remarkably good at understanding that "Maharja" probably means "Maharaja". We do it instinctively, without even thinking.

Fuzzy search is the art of teaching computers to think a little more like humans - to find things that are close enough, not just exactly right.

And in PostgreSQL, the most powerful tool for this is called Trigram Search, powered by an extension called pg_trgm.

Part 1: What Is a Trigram? (The Building Block of Everything)

Before we dive into how PostgreSQL searches, we need to understand one simple concept: the trigram.

A trigram is just a group of 3 consecutive characters from a word.

That's it. Three letters in a row.

Let's break the word "cat" into trigrams:

Word: "cat"
PostgreSQL pads the word with spaces for edge handling: "  cat  "

Trigrams:
  "  c"
  " ca"
  "cat"
  "at "
  "t  "

💡 PostgreSQL secretly adds two spaces before and one space after every word. This is to give the beginning and end of words their own identity, so "cat" and "concatenate" don't blur together.

Let's try a longer word: "elephant"

Word: "  elephant "

Trigrams:
  "  e"
  " el"
  "ele"
  "lep"
  "eph"
  "pha"
  "han"
  "ant"
  "nt "

Every word gets chopped into these little 3-character windows, sliding one character at a time across the word from left to right.

Think of it like a magnifying glass that's 3 characters wide, slowly sliding across a word and taking a snapshot at every position.

Part 2: Why Trigrams? Why Not 2 or 4 Letters?

This is a fair question. Why exactly three?

Bigrams (2 letters): Too short. The bigram "ca" appears in "cat", "car", "can", "castle", "scar", "occasion"... it's so common it doesn't carry much meaning. Matching would produce too many false positives.
4-grams or longer: Too specific. A 4-gram like "elep" only really appears in "elephant" and close relatives. If you make a typo in that 4-gram, you lose the match entirely. Too brittle.
Trigrams (3 letters): The sweet spot. Common enough to appear across related words, specific enough to carry meaning. A single typo only destroys a few trigrams, leaving most of the others intact to still find a match.

It's the Goldilocks zone of string matching.

Part 3: The Magic - How Similarity Is Calculated

Here's where the real cleverness lives.

Once every word is broken into trigrams, PostgreSQL measures how similar two words are by counting how many trigrams they share.

The formula is:

Similarity = (Number of shared trigrams) / (Total unique trigrams across both words)

This gives a number between 0 (completely different) and 1 (identical).

Let's work through a real example

Say we're comparing "night" and "knight".

Step 1: Generate trigrams for "night"

"  n", " ni", "nig", "igh", "ght", "ht "

→ 6 trigrams

Step 2: Generate trigrams for "knight"

"  k", " kn", "kni", "nig", "igh", "ght", "ht "

→ 7 trigrams

Step 3: Find the shared trigrams

Shared: "nig", "igh", "ght", "ht "

→ 4 shared trigrams

Step 4: Calculate similarity

Total unique trigrams = 6 + 7 - 4 = 9   (subtract shared ones to avoid counting them twice)
Shared = 4

Similarity = 4 / 9 = 0.44

So "night" and "knight" are about 44% similar according to trigrams. That seems low - they sound the same! But trigrams are measuring character patterns, not pronunciation. The extra "kn" at the start of "knight" introduces several new unique trigrams that "night" doesn't have.

Now let's try a typo: "niight" vs "night"

Trigrams for "niight":

"  n", " ni", "nii", "iig", "igh", "ght", "ht "

Trigrams for "night":

"  n", " ni", "nig", "igh", "ght", "ht "

Shared: " n", " ni", "igh", "ght", "ht " → 5 shared

Total unique: 7 + 6 - 5 = 8

Similarity = 5/8 = 0.625

Despite the typo ("niight"), the similarity is still 62.5% - high enough that a search would still find the match. This is the magic of trigrams.

Part 4: The Threshold - Deciding What "Close Enough" Means

Now that we can calculate similarity as a number, we need to decide: how similar is similar enough?

PostgreSQL uses a default threshold of 0.3 (30%). If two strings share at least 30% of their trigrams, they're considered a match.

You can think of it like a teacher grading papers:

Score ≥ 0.3 → Pass (it's a match)
Score < 0.3 → Fail (not a match)

You can adjust this threshold depending on how strict or lenient you want your search to be:

-- Make it stricter (only very close matches)
SET pg_trgm.similarity_threshold = 0.5;

-- Make it more lenient (catch more distant matches, but more noise)
SET pg_trgm.similarity_threshold = 0.2;

Practical guidance:
| Threshold | Best For |
|-----------|----------|
| 0.1 – 0.2 | Very forgiving search, names from other languages |
| 0.3 | General purpose (default, works well most of the time) |
| 0.5 | Names, product codes where near-exact match is needed |
| 0.7+ | Almost exact matching only |

Part 5: Under the Hood - How PostgreSQL Actually Executes the Search

Understanding the math is one thing. But how does PostgreSQL actually run this efficiently when you have millions of rows? Recalculating trigrams for every row on every query would be painfully slow.

The answer is: the GIN Index.

What is a GIN Index?

GIN stands for Generalized Inverted Index. The name sounds intimidating, but the concept is simple.

Think of it like the index at the back of a book. Instead of reading every page to find where "photosynthesis" is mentioned, you flip to the back, find "photosynthesis" alphabetically, and it tells you exactly which pages contain it.

A GIN index does the same thing for trigrams:

GIN Index Structure (simplified):

" ca" → [row 5, row 12, row 89, row 203]
" co" → [row 3, row 7, row 44]
"cat" → [row 5, row 89]
"cof" → [row 3]
"ofe" → [row 3]
...

For every possible trigram, the index stores a list of which rows contain that trigram. When you search for "coffee", PostgreSQL:

Breaks "coffee" into its trigrams: " co", "cof", "off", "ffe", "fee", "ee "
Looks up each trigram in the GIN index
Finds rows that contain several of these trigrams (intersection)
Calculates the exact similarity score only for those candidate rows
Returns results above the threshold

Instead of scanning millions of rows, it scans a short list of candidates. This is what makes trigram search fast - often hundreds of times faster than a brute-force scan.

Visualizing the Query Flow

Your Query: "SELECT * WHERE name % 'cofee'"
                    │
                    ▼
         Break 'cofee' into trigrams
         [" co", "cof", "ofe", "fee", "ee "]
                    │
                    ▼
         Look up each trigram in GIN index
         → Candidate rows: [3, 7, 44, 89, 102]
                    │
                    ▼
         Calculate similarity() for each candidate
         → Row 3: 0.71 ✓  (above threshold)
         → Row 7: 0.18 ✗
         → Row 44: 0.65 ✓
         → Row 89: 0.22 ✗
         → Row 102: 0.80 ✓
                    │
                    ▼
         Return rows 3, 44, 102 - ordered by score

Without the GIN index, step 2 would require checking every row. With it, only a handful of candidates get the expensive similarity calculation.

Part 6: Setting It Up in PostgreSQL

Enough theory - here's how you actually use it.

Step 1: Enable the Extension

CREATE EXTENSION IF NOT EXISTS pg_trgm;

This is a one-time setup per database. The extension comes bundled with PostgreSQL - no installation needed.

Step 2: Create Your Table and Index

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

-- Create a GIN index on the name column
CREATE INDEX idx_products_name_trgm
    ON products
    USING GIN (name gin_trgm_ops);

The gin_trgm_ops part tells PostgreSQL: "use trigram logic for this index".

Step 3: Query Using Similarity

-- The % operator: returns rows above the similarity threshold
SELECT * FROM products
WHERE name % 'cofee maker';

-- The similarity() function: returns the score, lets you sort by closeness
SELECT name, similarity(name, 'cofee maker') AS score
FROM products
WHERE name % 'cofee maker'
ORDER BY score DESC;

Sample output:

name               | score
-------------------+-------
Coffee Maker Pro   | 0.82
Coffee Maker Mini  | 0.78
Coffee Mug Set     | 0.31

Step 4: Tune the Threshold (Optional)

-- See the current threshold
SHOW pg_trgm.similarity_threshold;

-- Change it for this session
SET pg_trgm.similarity_threshold = 0.4;

Part 7: Real-World Use Cases

1. Search Autocomplete / "Did you mean?"

-- User typed "postgress" - find the closest match
SELECT name, similarity(name, 'postgress') AS score
FROM technologies
WHERE name % 'postgress'
ORDER BY score DESC
LIMIT 5;

2. Deduplication - Finding Near-Duplicate Records

-- Find customer records that might be the same person
SELECT a.name, b.name, similarity(a.name, b.name) AS score
FROM customers a
JOIN customers b ON a.id < b.id
WHERE similarity(a.name, b.name) > 0.7
ORDER BY score DESC;

3. Product Search with Typo Tolerance

-- Search even with misspellings
SELECT id, name, price,
       similarity(name, $1) AS relevance
FROM products
WHERE name % $1
   OR description % $1
ORDER BY relevance DESC
LIMIT 20;

4. Combining with Full-Text Search for Best Results

-- Use full-text search for speed + trigrams for typo tolerance
SELECT name, ts_rank(tsv, query) AS text_rank,
       similarity(name, 'databse') AS fuzzy_rank
FROM articles,
     plainto_tsquery('english', 'database') query
WHERE tsv @@ query
   OR name % 'databse'
ORDER BY (ts_rank(tsv, query) + similarity(name, 'databse')) DESC;

Part 8: The `word_similarity` and `strict_word_similarity` Functions

PostgreSQL's pg_trgm also provides two variants of the similarity function that handle multi-word strings differently.

Function	What It Does
`similarity(a, b)`	Compares the full strings as a whole
`word_similarity(a, b)`	Checks if `a` is similar to any word in `b`
`strict_word_similarity(a, b)`	Like `word_similarity` but requires whole-word boundaries

Example:

SELECT similarity('word', 'two words');
-- Returns: 0.18 (compares full strings)

SELECT word_similarity('word', 'two words');
-- Returns: 0.8 (finds that 'word' closely matches 'words')

This is invaluable when users search for a single product word inside a longer product name.

Part 9: Performance Tips

Always Create the GIN Index Before Querying Large Tables

Without the index, PostgreSQL does a full sequential scan - checking every row. On a table with 10 million rows, this can take minutes. With the index, the same query can complete in milliseconds.

-- Check if your query is using the index
EXPLAIN ANALYZE
SELECT * FROM products WHERE name % 'laptop';

Look for Index Scan using idx_products_name_trgm in the output. If you see Seq Scan instead, your index isn't being used - check that it was created correctly.

Index Maintenance

GIN indexes can become bloated over time with many inserts and updates. Periodically run:

VACUUM ANALYZE products;

Consider a `GIST` Index for Certain Workloads

-- Alternative index type
CREATE INDEX idx_products_name_gist
    ON products
    USING GIST (name gist_trgm_ops);

Index Type	Build Speed	Query Speed	Update Speed	Choose When
GIN	Slower	Faster	Slower	Read-heavy workloads
GiST	Faster	Slower	Faster	Write-heavy workloads

Part 10: Limitations - When Trigrams Don't Work Well

Trigrams are powerful but not magic. They struggle in a few situations:

Short Strings (Less Than 3 Characters)

A 1 or 2 character string can't form a meaningful set of trigrams. Searching for "AI" or "DB" will produce poor results.

-- Workaround: Fall back to ILIKE for very short queries
SELECT * FROM tags
WHERE CASE
    WHEN length('AI') < 3 THEN name ILIKE '%AI%'
    ELSE name % 'AI'
END;

Very Different Languages

Trigrams work on character patterns. For languages with different scripts (Arabic, Chinese, Thai), the trigram approach still works but may need different tokenization strategies.

Phonetic Similarity

Trigrams match visual similarity, not sound similarity. "Night" and "knight" sound identical but have low trigram similarity because they look different. For phonetic matching, consider the fuzzystrmatch extension with soundex or metaphone.

CREATE EXTENSION fuzzystrmatch;

SELECT * FROM names
WHERE metaphone(name, 10) = metaphone('knight', 10);
-- This would match "night" and "knight"

Summary: The Trigram Story in One Paragraph

Every word in your database gets sliced into overlapping 3-character chunks called trigrams. These trigrams are stored in a GIN index - like a giant lookup table mapping each trigram to the rows containing it. When you search for a word, PostgreSQL slices that word into trigrams too, quickly finds candidate rows via the index, then calculates a similarity score (0 to 1) based on how many trigrams the two words share. If that score passes a threshold (default 0.3), it's considered a match. The result: fast, typo-tolerant, human-friendly search - baked right into your database, no external search engine required.

Quick Reference Cheat Sheet

-- Setup
CREATE EXTENSION pg_trgm;
CREATE INDEX idx ON table USING GIN (column gin_trgm_ops);

-- Basic search
SELECT * FROM table WHERE column % 'search term';

-- With score
SELECT *, similarity(column, 'search term') AS score
FROM table WHERE column % 'search term'
ORDER BY score DESC;

-- Adjust threshold
SET pg_trgm.similarity_threshold = 0.3;

-- Word-level similarity
SELECT word_similarity('search', 'long string with search inside');

-- Check index usage
EXPLAIN ANALYZE SELECT * FROM table WHERE column % 'term';

Fuzzy search is one of those features that, once you add it to your application, users can't imagine living without. That tolerant, forgiving search experience - the one that doesn't punish you for a stray keystroke - is powered by these elegant little 3-character windows sliding across your data, quietly doing the math.

DEV Community

Fuzzy Text Search in PostgreSQL: How Trigrams Make Computers "Almost Right"

The Problem: Computers Are Too Exact

Part 1: What Is a Trigram? (The Building Block of Everything)

Part 2: Why Trigrams? Why Not 2 or 4 Letters?

Part 3: The Magic - How Similarity Is Calculated

Let's work through a real example

Now let's try a typo: "niight" vs "night"

Part 4: The Threshold - Deciding What "Close Enough" Means

Part 5: Under the Hood - How PostgreSQL Actually Executes the Search

What is a GIN Index?

Visualizing the Query Flow

Part 6: Setting It Up in PostgreSQL

Step 1: Enable the Extension

Step 2: Create Your Table and Index

Step 3: Query Using Similarity

Step 4: Tune the Threshold (Optional)

Part 7: Real-World Use Cases

1. Search Autocomplete / "Did you mean?"

2. Deduplication - Finding Near-Duplicate Records

3. Product Search with Typo Tolerance

4. Combining with Full-Text Search for Best Results

Part 8: The `word_similarity` and `strict_word_similarity` Functions

Part 9: Performance Tips

Always Create the GIN Index Before Querying Large Tables

Index Maintenance

Consider a `GIST` Index for Certain Workloads

Part 10: Limitations - When Trigrams Don't Work Well

Short Strings (Less Than 3 Characters)

Very Different Languages

Phonetic Similarity

Summary: The Trigram Story in One Paragraph

Quick Reference Cheat Sheet

Top comments (0)

The Problem: Computers Are Too Exact

Part 1: What Is a Trigram? (The Building Block of Everything)

Part 2: Why Trigrams? Why Not 2 or 4 Letters?

Part 3: The Magic - How Similarity Is Calculated

Let's work through a real example

Now let's try a typo: "niight" vs "night"

Part 4: The Threshold - Deciding What "Close Enough" Means

Part 5: Under the Hood - How PostgreSQL Actually Executes the Search

What is a GIN Index?

Visualizing the Query Flow

Part 6: Setting It Up in PostgreSQL

Step 1: Enable the Extension

Step 2: Create Your Table and Index

Step 3: Query Using Similarity

Step 4: Tune the Threshold (Optional)

Part 7: Real-World Use Cases

1. Search Autocomplete / "Did you mean?"

2. Deduplication - Finding Near-Duplicate Records

3. Product Search with Typo Tolerance

4. Combining with Full-Text Search for Best Results

Part 8: The word_similarity and strict_word_similarity Functions

Part 9: Performance Tips

Always Create the GIN Index Before Querying Large Tables

Index Maintenance

Consider a GIST Index for Certain Workloads

Part 10: Limitations - When Trigrams Don't Work Well

Short Strings (Less Than 3 Characters)

Very Different Languages

Phonetic Similarity

Summary: The Trigram Story in One Paragraph

Quick Reference Cheat Sheet

Part 8: The `word_similarity` and `strict_word_similarity` Functions

Consider a `GIST` Index for Certain Workloads