How We Build User Search at Outside: A Case Study in Database Indexing

#backenddevelopment #postgres #sql

Intro

In our effort to create user search functionality for Outside Activity Feed, we needed to implement a robust and scalable search mechanism. Our user base, which totals over 11 million users, presented a few challenges for search performance and quality.

My colleague @pjhoberman tipped me to experiment with PostgreSQL’s pg_trgm extension for trigram similarity scoring, which is instrumental in handling fuzzy text matching.

Exploring Trigram Similarity

To understand how trigram similarity scoring works, particularly the WORD_SIMILARITY function, I created a query to visualize the process:

with data(t) as (
values
    ('Blair Braverman')
)

select 
    t as "text", 
    show_trgm('blai') as "query trigrams", 
    show_trgm(t) as "document trigrams", 
    cardinality(array_intersect( show_trgm('blai'), show_trgm(t))) as "common trgms",
    cardinality(show_trgm('blai')) as "query trigrams count",
    word_similarity('blai', t)
from data

This query generates and compares trigrams from both the query string and document to evaluate similarity.

Your similarity threshold can be checked using:

SHOW pg_trgm.word_similarity_threshold;
0.6

Records with a similarity score above this threshold will appear in search results.

Phase 1: Getting Trigram Similarity Search to Work

In the first iteration, we implemented a trigram similarity search using the pg_trgm index on relevant fields. The <% operator is used to compare the similarity between two text values, where a higher similarity score indicates that the values are more alike.

The query looked like this:

SELECT
     ...
FROM
    "socialprofile"
WHERE
    'Blai' <% "socialprofile"."username"
    OR 'Blai' <% "socialprofile"."display_name"
ORDER BY
    WORD_SIMILARITY('Blai', "socialprofile"."username") DESC,
    WORD_SIMILARITY('Blai', "socialprofile"."display_name") DESC;

Output of EXPLAIN

Indexes were hit, and performance was acceptable as a beta release.

Phase 2: Addressing New Requirements

As product requirements evolved, the search results needed to include:

Only users who have acknowledged the privacy policy.
Displaying users' real first and last names to foster authentic engagement.

To meet these requirements, I created a partial index to optimize queries for users who acknowledged the privacy policy. The query also requires a JOIN from two other tables.

CREATE INDEX idx_profile_privacy_policy_acknowledged_at_not_null
ON profile (privacy_policy_acknowledged_at)
WHERE privacy_policy_acknowledged_at IS NOT NULL;

The query leverages the new partial index:

    SELECT
       ...
    FROM
        socialprofile
    JOIN
        profile ON profile.uuid = socialprofile.uuid
        AND profile.privacy_policy_acknowledged_at IS NOT NULL  
    JOIN
        auth_user ON profile.user_id = auth_user.id
    WHERE
        'Blai' <% socialprofile.username
        OR 'Blai' <% auth_user.first_name
        OR 'Blai' <% auth_user.last_name

With the partial index, the number of searchable users was reduced to around 2,000+, improving search latency to under 200ms. However, as we introduced Activity Feed to the world, our searchable user base quickly grew to 200,000+. It was apparent that the query does not scale.

Looking back at my query, I realized that while the partial index is being used, I had lost the trigram similarity index!

Phase 3: Optimization using multi-column indexing and materialized view

I stumbled across this article where the author has run into a similar issue. Turns out doing multiple trigram similarity comparison on multiple columns is slow, and his solution was to build an index that is a concatenation of the columns that we want to search for. In this case we combine the author's username and first name/last name into the column to search against:

blair-braverman Blair Braverman

The author did not use materialized views in the end. However for us, since we query multiple tables, we cannot build the multi-column index without a materialized view.

So the materialized view and index creation looks something like this:

-- Create the materialized view
CREATE MATERIALIZED VIEW mv_search AS
WITH ack_profiles AS (
    SELECT user_id, uuid, privacy_policy_acknowledged_at
    FROM user_profile
    WHERE privacy_policy_acknowledged_at IS NOT NULL
)
SELECT
    sp.ID,
    ...
    (COALESCE(sp.username, '') || ' ' ||
    COALESCE(au.first_name, '') || ' ' ||
    COALESCE(au.last_name, '')) AS search_text,
FROM ack_profiles ap
JOIN socialprofile sp
    ON sp.uuid = ap.uuid
    AND sp.type = 'user'
JOIN auth_user au
    ON ap.user_id = au.id;

-- Add trigram similarity index on the search text
CREATE INDEX idx_mv_search_text_trigram_idx ON mv_search USING gist (search_text gist_trgm_ops);

The query itself is simple:

SELECT
 ...
FROM mv_search
WHERE
    %s <% search_text

One of the downsides of this solution is that search data is not live. This requires us to maintain an extra process to REFRESH the materialized view. However this is a worthy trade off to make given the performance gains. Currently we are clocking less than 200ms latency.

What's next?

To paraphrase a commenter on Stack Overflow, "if your index is slow, you're likely doing something wrong". So far I've tested the solution with about 11M records, and the pg_trgm index does deliver on its promise. I have seen an problem that we'll need to solve down the road that is caused by ORDER BY and LIMIT when the text being searched is too common, for example the name "John". The query seems to struggle with finding the top 100 "John" out of 100,000. I would also like to introduce caching, and play with the similarity search and ranking a little bit more. Thanks for reading and let me know your thoughts!