This was originally posted oh my blog where I post all things tech and sysadmin and the like!.
What if I told you that searching millions of records didn't have to be complex, and take minutes to complete? Most organizations jump ship to a dedicated searching engine like Elastic Search, but they greatly underestimate the power of PostgreSQL.
Let's suppose we have a table with 1.1 million records, and we want to apply a text search vector to it. We care about the author and publisher names. In order to create the vectors, we need to take out old data, and put it into a new column that has a vector.
Time to break it down! Our author and publisher names are just strings, for example author is "Jack Sparrow", publisher is "A Movie Corporation" - we want to be able to start typing "jac" or "mov cor" and have it pickup Jack Sparrow and/or A Movie Corporation. Makes sense! The issue we face is we don't want to have two columns, we need an index so we can rapidly search, so we make a new column in the database table, let's call this column "tsv" for short.
First, add the column, then apply a GIN index on it.
ALTER TABLE searchable ADD COLUMN tsv tsvector;
CREATE INDEX searchable_gin_idx ON searchable USING GIN(tsv);
Now, let's update our table records accordingly, you can use this SQL to create an English TSVector and instantly apply it to the TSV field.
UPDATE searchable SET tsv = ( to_tsvector('english', author) || to_tsvector('english', publisher ) );
This will begin to build a simple GIN index using our concatenated author and publisher. We will now be able to perform both simple and complex fast search queries. It's the same data, without compromising speed. For example, if we want to find all items containing 'jac parr' as the author, we can do:
SELECT * FROM searchable WHERE tsv @@ to_tsquery('jac:* parr:*');
Instantly, we're presented with a ton of results. It's interesting to note, that even though we've attached a wildcard to the query, PostgreSQL is still able to rapidly search and get our result set. We could further apply a ts_rank_cd in the ORDER BY clause to rank them according to weights.
Speed and consistency matters.
If we run explain analyze on it, we can see that the query took 0.013ms to plan, and 0.067ms to execute.
If we were to run the same query without a GIN index, we see it took 0.121ms to plan, and 132.14ms to execute.
As you can see, this is a major performance boost, and we get the same results we did previously, much faster this time around. Thus, another reason you may not need a searching product - if you optimize your queries and database table with the proper indexes you can leverage the full power of PostgreSQL.
Top comments (0)