DEV Community: Tudor Golubenco

Create a search engine with PostgreSQL: Postgres vs Elasticsearch

Tudor Golubenco — Mon, 31 Jul 2023 10:49:03 +0000

In Part 1, we delved into the capabilities of PostgreSQL's full-text search and explored how advanced search features such as relevancy boosters, typo-tolerance, and faceted search can be implemented. In this part, we'll compare it with Elasticsearch.

First, let's note that Postgres and Elasticsearch are generally not in competition with each other. In fact, it's very common to see them together in architecture diagrams, often in a configuration like this:

In this architecture, the source of truth for the data lives in Postgres, which serves the transactional CRUD operations. The data is continuously synced to Elasticsearch, either via something like Postgres logical replication events (change-data-capture) or by the application itself via custom code. During this data replication, denormalization might be required. The search functionality, including facets and aggregations, is served from Elasticsearch.

While this architecture is as common as it is for very good reasons, it does have a few challenges:

1. Dealing with two types of stores means more operational burden and higher infrastructure costs.
2. Keeping the data in sync is more challenging than you might think. I'm planning a dedicated blog post for this problem because it's quite interesting. Let's just say, it's pretty hard to get it completely right.
3. The data replication is, at best, near real-time, meaning that there can be consistency issues in the search service.

Point 2 is generally solvable via engineering effort and careful dedicated code. Among the existing tools, PGSync is an open source project that aims to specifically solve this problem. ZomboDB is an interesting Postgres extension that tackles point 2 (and I think partially point 3), by controlling and querying Elasticsearch through Postgres. I haven't yet tried either of these two projects, so I can't comment on their trade-offs, but I wanted to mention them.

And yes, a data platform like Xata solves most of points 1 and 2, by taking that complexity and offering it as a service, together with other goodies.

That said, if the Postgres full-text search functionality is enough for your use case, making use of it promises to significantly simplify your architecture and application. In this version, Postgres serves both the CRUD app needs and the full-text search needs:

This means you don't need to operate two types of stores, no more data replication, no more denormalization, no more eventual consistency. The search engine built into Postgres happens to support ACID transactions, joins between tables, constraints (e.g. not null or unique), referential integrity (foreign keys), and all the other Postgres goodies that make application development simpler.

Therefore, it's no wonder the Hacker News thread for our part 1 blog post had a lively discussion about the pros and cons of this approach. Can we go for the Postgres-only solution, or does the best-tool-for-the-job argument win?

We're going to compare the convenience, search relevancy, performance, and scalability of the two options.

DIY versus built-in

As we showed in part 1, you can replicate a lot of the Elasticsearch functionality in Postgres, even more advanced things like relevancy boosters, typo-tolerance, suggesters/autocomplete, or semantic/vector search. However, it's not always straightforward.

An example where it's not too simple is with typo-tolerance (called fuzziness in Elasticsearch). It's not available out-of-the-box in Postgres, but you can implement it with the following steps:

index all lexemes (words) from all documents in a separate table
for each word in the query, use similarity or Levenshtein distance to search in this table
modify the search query to include any words that are found
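The three steps above can be sketched outside the database. The following is a minimal Python illustration, not the Postgres implementation from part 1: the in-memory `lexemes` set stands in for the separate table, and the `levenshtein` and `expand_query` helpers are hypothetical names introduced only for this example.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def expand_query(query, lexemes, max_distance=1):
    """For each query word, list the indexed words within max_distance edits."""
    expanded = []
    for word in query.lower().split():
        close = sorted(w for w in lexemes if levenshtein(word, w) <= max_distance)
        # Keep the original word when nothing close enough is indexed.
        expanded.append(close or [word])
    return expanded


# Step 1: the "separate table" of words, here just an in-memory set.
lexemes = {"biscuits", "biscuit", "pancake", "chicken", "nuggets"}
# Steps 2 and 3: find close words and build the rewritten query string.
alternatives = expand_query("biscaits", lexemes)
rewritten = " ".join("(" + " OR ".join(words) + ")" for words in alternatives)
print(rewritten)
```

Comparing every query word against every indexed word is fine for a sketch, but a real table would need an index-assisted similarity search to stay fast.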

While the above is quite doable, in dedicated search engines like Elasticsearch, you can enable typo-tolerance with a simple flag:

// POST /recipes/_search
{
  "query": {
    "multi_match": {
      "query": "biscaits",
      "fuzziness": 1
    }
  }
}

Search relevancy: BM25 and TF-IDF

The default ranking algorithm for keyword search in Elasticsearch is BM25. With the release of Elasticsearch 5.0 in 2016, it dethroned TF-IDF as the default ranking algorithm. Postgres doesn't support either of them, mainly because its ranking functions (explained here) don't have access to global word frequency data which is needed by these algorithms. To see how relevant (pun intended) or not so relevant that might be, let's look at the ranking functions and algorithms from simple to complex:

ts_rank (Postgres function) - ranks based on the term frequency. In other words, it does the “TF” (term frequency) part of TF-IDF. The principle is that if you are searching for a word, the more often that word shows up in the matching document, the higher the score. In addition to using simple TF, Postgres provides ways to normalize the term frequency into a score. For instance, one approach is to divide it by the document length.
ts_rank_cd (Postgres function) - rank + cover density. In addition to the term frequency, this function also takes into account the “cover density”, meaning the proximity of the terms in the document.
TF-IDF - term frequency + inverse document frequency. In addition to the term frequency, this algorithm “penalizes” words that are very common in the overall data set. So if the word “egg” matches, but that word is super common because we have a recipes dataset, it is valued less compared to other words in the query.
BM25 - this algorithm is based on a probabilistic model of relevancy. While the TF-IDF formula is mostly based on intuition and practical experiments, BM25 is the result of more formal mathematical research. If you're curious about the said mathematical research, I recommend this talk that makes it accessible. Interestingly, the resulting BM25 formula is not all that different from TF-IDF but it incorporates a couple more concepts: the frequency saturation and the document length. Ultimately, this gives better results over a wider range of document types.
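To make the difference between these scoring schemes concrete, here is a small Python sketch over a made-up three-document corpus. It is a simplified illustration, not the code used by Postgres or Elasticsearch: the same smoothed IDF is reused for both TF-IDF and BM25, and `k1=1.2` and `b=0.75` are commonly used BM25 defaults that I am assuming here.

```python
import math

# Hypothetical corpus: each document is a list of already-normalized terms.
corpus = {
    "omelette": "egg egg egg cheese".split(),
    "pancake": "egg flour milk".split(),
    "cocktail": "curacao orange juice ice".split(),
}


def tf(term, doc):
    """Term frequency: how often the term occurs in one document."""
    return doc.count(term)


def idf(term, docs):
    """Smoothed inverse document frequency: rarer terms score higher."""
    n = len(docs)
    df = sum(1 for d in docs.values() if term in d)
    return math.log(1 + (n - df + 0.5) / (df + 0.5))


def tf_idf(query, doc, docs):
    return sum(tf(t, doc) * idf(t, docs) for t in query)


def bm25(query, doc, docs, k1=1.2, b=0.75):
    """BM25: TF-IDF plus frequency saturation (k1) and length normalization (b)."""
    avg_len = sum(len(d) for d in docs.values()) / len(docs)
    score = 0.0
    for t in query:
        f = tf(t, doc)
        norm = f + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf(t, docs) * f * (k1 + 1) / norm
    return score


for name, doc in corpus.items():
    print(name, round(tf_idf(["egg"], doc, corpus), 3), round(bm25(["egg"], doc, corpus), 3))
```

With plain TF-IDF, three occurrences of "egg" score exactly three times as much as one, while with BM25 the gain from each additional occurrence shrinks. That is the frequency saturation mentioned above.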

There's no question that BM25 is a more advanced relevancy algorithm than what ts_rank or ts_rank_cd use. BM25 uses more input signals, it's based on better heuristics, and it typically doesn't require tuning.

One practical effect of BM25 is that it automatically penalizes the very common words (“the”, “in”, “or”, etc.), also called “stop words”, which means that they don't need to be excluded from the index. This is why the Postgres english configuration for to_tsvector removes the stop words (details here in part 1), but the Elasticsearch standard analyzer doesn't. It doesn't need to.

While BM25 is superior, there are some pro-Postgres arguments to be considered:

if you aggressively exclude the stop words, like the english configuration in Postgres does, that compensates for the lack of IDF in some cases.
in practice there might be stronger signals of relevancy in the data itself (upvotes, reviews, etc). See the section on boosters from part 1 on how to make use of them in Postgres.

Could BM25 or TF-IDF be implemented on top of the existing Postgres functionality? Actually, yes. See this blog post that uses ts_stat and ts_debug to compute TF-IDF. It's not very simple, but possible (as usual with Postgres).

Performance and scalability considerations

Let's start by noting that the two systems couldn't be more different:

PostgreSQL has a single master and multiple read replicas, Elasticsearch has horizontal scalability via sharding.
Postgres is relational, supports joining tables, has ACID transactions, and offers constraints, while Elasticsearch is document oriented and offers consistency guarantees only per document.
Postgres is row oriented, while Elasticsearch has an internal column store in the form of doc values.
Postgres is native C code, while Elasticsearch runs on the JVM.
Postgres has a connection-oriented wire protocol; Elasticsearch has a REST-like DSL over HTTP.

All of these impact performance and scalability, and it's no surprise then that the two tend to shine in different areas: PostgreSQL is commonly used as a primary data store, whereas Elasticsearch is usually utilized as a secondary store, particularly for search and analytics on time-series data such as logs. And yet, they do overlap on the use case of full-text search, which is the point of this blog post.

I was curious to know at roughly what amount of data Postgres slows down compared to Elasticsearch. On the movies dataset (34K rows) that we used in part 1, all queries were reasonably fast (<300 ms). So for the testing here, I chose a larger data set: a recipes dataset from Kaggle, containing 2.3M recipes. The commands to load the CSV file in PostgreSQL can be found in this gist. For Elasticsearch, I've loaded the same CSV file using this tool.

After loading the data, I started by running searches similar to the ones used in part 1:

SELECT title, ts_rank(search, websearch_to_tsquery('english', 'darth vader')) rank
   FROM recipes WHERE search @@ websearch_to_tsquery('english','darth vader')
   ORDER BY rank DESC limit 10;
        title         |    rank
----------------------+------------
 Darth Vader Biscuits | 0.09910322
 Cloud 9 Pancakes     | 0.09910322
(2 rows)

Time: 100.468 ms

For Elasticsearch I've used the following to run the search:

// POST /recipes/_search
{
  "query": {
    "query_string": {
      "query": "darth AND vader"
    }
  }
}

I ran each query five times and recorded the best and worst times. Typically, the first query of a kind was the slowest because the following queries benefited from having the relevant pages already in memory. While this approach is rather unscientific, and you should conduct your own benchmarking on your data before drawing definitive conclusions, it should be sufficient for drawing some initial conclusions.

Here are the results on a few queries:

query	Elasticsearch worst time (ms)	Elasticsearch best time (ms)	Postgres worst time (ms)	Postgres best time (ms)
darth vader	52	4	100	3
chicken nuggets	85	10	313	13
pancake	60	4	618	157
curacao	286	7	230	10
mix	67	5	25182	8267

As you can see, Postgres performs well on some queries such as "darth vader" or "curacao," responding within milliseconds. However, on other queries like "pancake" or "mix," it performs significantly worse than Elasticsearch, with response times measured in seconds. It gets as bad as 25 seconds of latency! What's going on here?

The difference lies in how many rows match the query terms. Searching for “darth vader” in a recipes dataset matches 2 rows. But searching “mix” in a recipes dataset matches a million rows (literally, 1,038,914 to be precise). Since we order by rank, Postgres needs to call the ts_rank function for each of the million rows. The Postgres docs even warn about this:

Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.

Indeed, the issue comes from ranking. If we're only interested in matching and we order by an (indexed) column, it is fast:

SELECT title FROM recipes
  WHERE search @@ websearch_to_tsquery('english','mix')
  ORDER BY title ASC LIMIT 10;

Time: 24.681 ms

But we're working on the assumption that ranking is necessary for a good search experience. One idea is to use what I call "sampling": before computing the ranks, take a sample of 10K rows that match. The assumption is that if your query matches so many documents, the ranking is likely to be ineffective anyway, so it's better to prioritize the response time.

The SQL to do this looks like this:

WITH search_sample AS (
  SELECT title, search FROM recipes
    WHERE search @@ websearch_to_tsquery('english','mix')
    LIMIT 10000)
SELECT title, ts_rank(search, websearch_to_tsquery('english', 'mix')) rank
  FROM search_sample
  ORDER BY rank DESC limit 10;

Re-running the tests with this sample approach gives us closer results:

query	Elasticsearch worst time (ms)	Elasticsearch best time (ms)	Postgres worst time (ms)	Postgres best time (ms)
darth vader	52	4	100	3
chicken nuggets	85	10	195	14
pancake	60	4	145	13
curacao	286	7	225	11
mix	67	5	400	144

Much better! Of course, we did sacrifice on the relevancy, which might or might not be ok in your case.
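The relevancy trade-off of the sampling trick can be shown with a small Python sketch. This is only an analogy for the SQL above: `top_k_sampled` is a hypothetical helper, the rows are made up, and the ranking function simply counts occurrences instead of calling ts_rank.

```python
import heapq
from itertools import islice


def top_k_sampled(rows, matches, rank, sample_size=10_000, k=10):
    """Rank only the first `sample_size` matching rows, then keep the best k."""
    matching = (row for row in rows if matches(row))
    sample = islice(matching, sample_size)  # plays the role of LIMIT 10000
    return heapq.nlargest(k, sample, key=rank)


# Hypothetical rows: (title, number of times "mix" occurs in the recipe).
rows = [("Trail mix", 1), ("Cake mix", 2), ("Spice mix deluxe", 5)]


def is_match(row):
    return row[1] > 0


def rank(row):
    return row[1]


full = top_k_sampled(rows, is_match, rank, sample_size=10_000, k=1)
sampled = top_k_sampled(rows, is_match, rank, sample_size=2, k=1)
print(full, sampled)
```

When the sample is smaller than the number of matches, the best-ranked row can fall outside the sample and never be considered. In SQL, a LIMIT without ORDER BY returns whichever rows the executor produces first, so the sample is arbitrary rather than representative.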

Here are some conclusions and more considerations on the topic of performance and scalability:

on search use cases over smaller datasets (<100K rows) both systems will perform well, but a Postgres-only solution will require fewer resources.
on medium datasets (a few million rows), Elasticsearch is already faster. However, Postgres can perform within a 200 ms latency budget if you use the sampling trick explained above.
when the number of documents is really large, for example logs or other time series, Elasticsearch has the additional advantage of horizontal scalability.
if you need a lot of aggregations or analytics (e.g. display a dashboard full of graphs) and the data set is large enough, Elasticsearch's columnar store will give it an advantage.
giving Postgres an extra workload can affect the performance of your main instance. The solution is to move the searching to a replica, but then you lose some of the consistency guarantees.

Semantic and hybrid search

Both part 1 and this blog post focused on keyword searching techniques. However, in the last few years, semantic/vector search has taken the world of search by storm, so I feel like I need to touch on this aspect as well in comparing the two.

Semantic search leverages language models to generate embeddings for each document. Embeddings are arrays of numbers that represent the text on a number of dimensions. Pieces of text that have similar embeddings have a similar meaning. In other words, semantic search can “search by meaning”, rather than “by keywords”. This is quite exciting now, because large language models (LLMs) give us a very accurate understanding of meaning. It means you don't have to maintain a list of synonyms or add different keywords to your documents to match how your users are searching.
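As a rough illustration of "similar embeddings, similar meaning", the Python sketch below ranks a few texts by cosine similarity to a query vector. The three-dimensional vectors are invented for the example; real embeddings come from a language model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings, for illustration only.
embeddings = {
    "fluffy pancakes": [0.9, 0.1, 0.0],
    "crepes with syrup": [0.8, 0.2, 0.1],
    "spicy chicken wings": [0.1, 0.9, 0.3],
}
# Pretend this is the embedding of the query "breakfast flapjacks".
query_embedding = [0.85, 0.15, 0.05]

ranked = sorted(
    embeddings,
    key=lambda title: cosine_similarity(query_embedding, embeddings[title]),
    reverse=True,
)
print(ranked)
```

Note that the query shares no keywords with any of the titles; the ordering comes purely from the vectors.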

Postgres supports vector search via the pgvector extension, while Elasticsearch has it built-in via the KNN search. You can find benchmarks on ann-benchmarks (look for pgvector and luceneknn) but keep in mind that both implementations are under active development and their performance is being improved.

While exciting, it turns out that semantic search alone doesn't really work great on the typical search experiences that we have today - at least not on a majority of datasets. If you are curious, I recently wrote a comparison between keyword and semantic search for the particular use case of selecting the context for ChatGPT.

For search use cases like the recipes one in this blog post, hybrid search might give better results: use a combination of keyword and semantic search to improve the ranking.

Elastic has recently announced their “Elasticsearch Relevance Engine”, which includes hybrid search. In Postgres, given that it's all building blocks, you can combine the full-text search functionality and pgvector. I'm looking forward to diving deeper into this topic as well, but I'll leave that for a follow-up blog post.
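One possible way to combine a keyword result list with a semantic one is reciprocal rank fusion (RRF). The Python sketch below is my own illustration of that technique, not something this post has implemented in Postgres; the result lists are made up, and `k=60` is a conventional constant that I am assuming.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists: each list contributes 1 / (k + position)."""
    scores = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + position)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results of a keyword search and a vector search.
keyword_results = ["pancakes", "waffles", "crepes"]
semantic_results = ["crepes", "pancakes", "flapjacks"]

fused = reciprocal_rank_fusion([keyword_results, semantic_results])
print(fused)
```

RRF only needs the positions in each list, so it avoids having to bring ts_rank scores and vector distances onto a comparable scale.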

Conclusion

Choosing between a Postgres-only architecture and a Postgres + Elasticsearch architecture will depend on your use case and scale.

For example, if you have a table or list in your application on which you support CRUD operations and you want to add full-text search functionality to it, Postgres will likely work well for you for quite some time.

On the other hand, if you have a large data set to search and search relevancy is critical to your application (for example, in e-commerce), using a dedicated search engine like Elasticsearch is going to perform better both in latency and relevancy.

In many cases, it might make sense to start with the simpler Postgres-only approach, but be ready to pivot to the Postgres + Elasticsearch architecture when needed.

If you read this far, you might want to give Xata a try. It offers both Postgres and Elasticsearch in the same data platform, and can also handle the syncing between them with no extra effort. If you have any feedback on this blog post, or are interested in the follow-up blog posts, you can follow us on Twitter or join us in Discord.

Create an advanced search engine with PostgreSQL

Tudor Golubenco — Mon, 24 Jul 2023 10:45:24 +0000

This is part 1 of a blog mini-series, in which we explore the full-text search functionality in PostgreSQL and investigate how much of the typical search engine functionality we can replicate. In part 2, we'll do a comparison between PostgreSQL's full-text search and Elasticsearch.

If you want to follow along and try out the sample queries (which we recommend; it's more fun that way), the code samples are executed against the Wikipedia Movie Plots data set from Kaggle. To import it, download the CSV file, then create this table:

CREATE TABLE movies(
    ReleaseYear int,
    Title text,
    Origin text,
    Director text,
    Casting text,
    Genre text,
    WikiPage text,
    Plot text);

And import the CSV file like this:

\COPY movies(ReleaseYear, Title, Origin, Director, Casting, Genre, WikiPage, Plot)
    FROM 'wiki_movie_plots_deduped.csv' DELIMITER ',' CSV HEADER;

The dataset contains 34,000 movie titles and is about 81 MB in CSV format.

PostgreSQL full-text search primitives

The Postgres approach to full-text search offers building blocks that you can combine to create your own search engine. This is quite flexible but it also means it generally feels lower-level compared to search engines like Elasticsearch, Typesense, or Meilisearch, for which full-text search is the primary use case.

The main building blocks, which we'll cover via examples, are:

The tsvector and tsquery data types
The match operator @@ to check if a tsquery matches a tsvector
Functions to rank each match (ts_rank, ts_rank_cd)
The GIN index type, an inverted index to efficiently query tsvector

We'll start by looking at these building blocks and then we'll get into more advanced topics, covering relevancy boosters, typo-tolerance, and faceted search.

tsvector

The tsvector data type stores a sorted list of lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as s or ing in English). Here is an example, using the to_tsvector function to parse an English phrase into a tsvector.

SELECT * FROM unnest(to_tsvector('english',
    'I''m going to make him an offer he can''t refuse. Refusing is not an option.'));

 lexeme | positions | weights
--------+-----------+---------
 go     | {3}       | {D}
 m      | {2}       | {D}
 make   | {5}       | {D}
 offer  | {8}       | {D}
 option | {17}      | {D}
 refus  | {12,13}   | {D,D}
(6 rows)

As you can see, stop words like “I”, “to” or “an” are removed, because they are too common to be useful for search. The words are normalized and reduced to their root (e.g. “refuse” and “Refusing” are both transformed into “refus”). The punctuation signs are ignored. For each word, the positions in the original phrase are recorded (e.g. “refus” is the 12th and the 13th word in the text), along with the weights (which are useful for ranking and which we'll discuss later).
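To get an intuition for what to_tsvector does, here is a toy Python imitation. It is deliberately naive and is not how Postgres works internally: the stop word list is a tiny hand-picked set, and `toy_stem` is a crude suffix stripper that I made up in place of the real stemmer, so some lexemes come out differently (for example, it turns "make" into "mak").

```python
import re

# Tiny, hand-picked stop word list; the real english configuration has many more.
STOP_WORDS = {"i", "to", "him", "an", "he", "can", "t", "is", "not", "the", "a"}


def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def toy_tsvector(text):
    """Map each lexeme to the 1-based positions of the words it came from."""
    vector = {}
    tokens = re.findall(r"[a-z]+", text.lower())
    for position, token in enumerate(tokens, start=1):
        if token in STOP_WORDS:
            continue  # stop words are dropped but still consume a position
        vector.setdefault(toy_stem(token), []).append(position)
    return dict(sorted(vector.items()))


phrase = "I'm going to make him an offer he can't refuse. Refusing is not an option."
for lexeme, positions in toy_tsvector(phrase).items():
    print(lexeme, positions)
```

The key points carry over to the real thing: positions are counted before stop words are removed, and different forms of a word collapse into one lexeme with several positions.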

In the example above, the transformation rules from words to lexemes are based on the english search configuration. Running the same query with the simple search configuration results in a tsvector that includes all the words as they were found in the text:

SELECT * FROM unnest(to_tsvector('simple',
    'I''m going to make him an offer he can''t refuse. Refusing is not an option.'));

  lexeme  | positions | weights
----------+-----------+---------
 an       | {7,16}    | {D,D}
 can      | {10}      | {D}
 going    | {3}       | {D}
 he       | {9}       | {D}
 him      | {6}       | {D}
 i        | {1}       | {D}
 is       | {14}      | {D}
 m        | {2}       | {D}
 make     | {5}       | {D}
 not      | {15}      | {D}
 offer    | {8}       | {D}
 option   | {17}      | {D}
 refuse   | {12}      | {D}
 refusing | {13}      | {D}
 t        | {11}      | {D}
 to       | {4}       | {D}
(16 rows)

As you can see, “refuse” and “refusing” now result in different lexemes. The simple configuration is particularly useful when you have columns that contain labels or tags.

PostgreSQL has built-in configurations for a pretty good set of languages. You can see the list by running:

SELECT cfgname FROM pg_ts_config;

Notably, however, there is no configuration for CJK (Chinese-Japanese-Korean), which is worth keeping in mind if you need to create a search query in those languages. While the simple configuration should work in practice quite well for unsupported languages, I'm not sure if that is enough for CJK.

tsquery

The tsquery data type is used to represent a normalized query. A tsquery contains search terms, which must be already-normalized lexemes, and may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators. There are functions like to_tsquery, plainto_tsquery, and websearch_to_tsquery that are helpful in converting user-written text into a proper tsquery, primarily by normalizing words appearing in the text.

To get a feeling of tsquery, let's see a few examples using websearch_to_tsquery:

SELECT websearch_to_tsquery('english', 'the darth vader');
 websearch_to_tsquery
----------------------
'darth' & 'vader'

That is a logical AND, meaning that the document needs to contain both "darth" and "vader" in order to match. You can do logical OR as well:

SELECT websearch_to_tsquery('english', 'darth OR vader');
 websearch_to_tsquery
----------------------
 'darth' | 'vader'

And you can exclude words:

SELECT websearch_to_tsquery('english', 'darth vader -wars');
   websearch_to_tsquery
---------------------------
 'darth' & 'vader' & !'war'

Also, you can represent phrase searches:

SELECT websearch_to_tsquery('english', '"the darth vader son"');
     websearch_to_tsquery
------------------------------
 'darth' <-> 'vader' <-> 'son'

This means: “darth”, followed by “vader”, followed by “son”.

Note, however, that the word “the” is ignored, because it's a stop word as per the english search configuration. This can be an issue on phrases like this:

SELECT websearch_to_tsquery('english', '"do or do not, there is no try"');
 websearch_to_tsquery
----------------------
 'tri'
(1 row)

Oops, almost the entire phrase was lost. Using the simple config gives the expected result:

SELECT websearch_to_tsquery('simple', '"do or do not, there is no try"');
                           websearch_to_tsquery
--------------------------------------------------------------------------
 'do' <-> 'or' <-> 'do' <-> 'not' <-> 'there' <-> 'is' <-> 'no' <-> 'try'

You can check whether a tsquery matches a tsvector by using the match operator @@.

SELECT websearch_to_tsquery('english', 'darth vader') @@
    to_tsvector('english',
        'Darth Vader is my father.');

?column?
----------
 t

While the following example doesn't match:

SELECT websearch_to_tsquery('english', 'darth vader -father') @@
    to_tsvector('english',
        'Darth Vader is my father.');

?column?
----------
 f
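The matching rules shown in this section (AND, NOT, and FOLLOWED BY) can be mimicked with a few lines of Python. This is a simplified model with hypothetical helper names, where a tsvector is represented as a dictionary from lexeme to positions; the positions below are what I would roughly expect for the sentence used above.

```python
def matches_all(vector, required=(), excluded=()):
    """AND / NOT: every required lexeme is present and no excluded one is."""
    return (all(term in vector for term in required)
            and not any(term in vector for term in excluded))


def matches_phrase(vector, phrase):
    """FOLLOWED BY: the lexemes occur at consecutive positions, in order."""
    first, *rest = phrase
    for start in vector.get(first, []):
        if all(start + offset in vector.get(term, [])
               for offset, term in enumerate(rest, start=1)):
            return True
    return False


# Roughly the content of to_tsvector('english', 'Darth Vader is my father.')
vector = {"darth": [1], "vader": [2], "father": [5]}

print(matches_all(vector, required=["darth", "vader"]))
print(matches_all(vector, required=["darth", "vader"], excluded=["father"]))
print(matches_phrase(vector, ["darth", "vader"]))
```

Phrase matching is the reason the tsvector keeps word positions: without them, "darth vader" and "vader darth" would be indistinguishable.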

GIN

Now that we've seen tsvector and tsquery at work, let's look at another key building block, the one that makes search fast: the GIN index type. GIN stands for Generalized Inverted Index. GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items. This means that GIN can be used for more than just text search, notably for JSON querying.
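The idea behind an inverted index can be sketched in Python. This is a conceptual model only and says nothing about how GIN is laid out on disk; the documents and helper names are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lexeme to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, lexemes in documents.items():
        for lexeme in lexemes:
            index[lexeme].add(doc_id)
    return index


def search_all(index, terms):
    """Return the ids of documents containing every term (logical AND)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents, already reduced to lexemes.
documents = {
    1: ["darth", "vader", "father"],
    2: ["darth", "maul"],
    3: ["pancake"],
}
index = build_inverted_index(documents)
print(search_all(index, ["darth", "vader"]))
```

A search only touches the entries for the query terms instead of scanning every document, which is why an index of this shape speeds up the @@ operator so much.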

You can create a GIN index on a set of columns, or you can first create a column of type tsvector, to include all the searchable columns. Something like this:

ALTER TABLE movies ADD search tsvector GENERATED ALWAYS AS
    (to_tsvector('english', Title) || ' ' ||
     to_tsvector('english', Plot) || ' ' ||
     to_tsvector('simple', Director) || ' ' ||
     to_tsvector('simple', Genre) || ' ' ||
     to_tsvector('simple', Origin) || ' ' ||
     to_tsvector('simple', Casting)
) STORED;

And then create the actual index:

CREATE INDEX idx_search ON movies USING GIN(search);

You can now perform a simple test search like this:

SELECT title FROM movies WHERE search @@ websearch_to_tsquery('english','darth vader');

                        title
--------------------------------------------------
 Star Wars Episode IV: A New Hope (aka Star Wars)
 Return of the Jedi
 Star Wars: Episode III – Revenge of the Sith
(3 rows)

To see the effects of the index, you can compare the timings of the above query with and without the index. The GIN index takes it from over 200 ms to about 4 ms on my computer.

ts_rank

So far, we've seen how tsvector and tsquery can match search queries. However, for a good search experience, it is important to show the best results first - meaning that the results need to be sorted by relevancy.

Taking it directly from the docs:

PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur. However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.

The two ranking functions mentioned are ts_rank and ts_rank_cd. The difference between them is that while they both take into account the frequency of the term, ts_rank_cd also takes into account the proximity of matching lexemes to each other.

To use them in a query, you can do something like this:

SELECT title,
       ts_rank(search, websearch_to_tsquery('english', 'darth vader')) rank
  FROM movies
  WHERE search @@ websearch_to_tsquery('english','darth vader')
  ORDER BY rank DESC
  LIMIT 10;

                      title                       |    rank
--------------------------------------------------+-------------
 The Empire Strikes Back                          |  0.26263964
 Star Wars Episode IV: A New Hope (aka Star Wars) |  0.18902963
 Star Wars: Episode III – Revenge of the Sith     |  0.10292397
 Rogue One: A Star Wars Story (film)              |  0.10049681
 Return of the Jedi                               |  0.09910346
 American Honey                                   |  0.09910322

One thing to note about ts_rank is that it needs to access the search column for each result. This means that if the WHERE condition matches a lot of rows, PostgreSQL needs to visit them all in order to do the ranking, and that can be slow. To exemplify, the above query returns in 5-7 ms on my computer. If I modify the query to search for darth OR vader, it returns in about 80 ms, because there are now over 1000 matching results that need ranking and sorting.

Relevancy tuning

While relevancy based on word frequency is a good default for the search sorting, quite often the data contains important indicators that are more relevant than simply the frequency.

Here are some examples for a movies dataset:

Matches in the title should be given higher importance than matches in the description or plot.
More popular movies can be promoted based on ratings and/or the number of votes they receive.
Certain categories can be boosted more, considering user preferences. For instance, if a particular user enjoys comedies, those movies can be given a higher priority.
When ranking search results, newer titles can be considered more relevant than very old titles.

This is why dedicated search engines typically offer ways to use different columns or fields to influence the ranking. Here are example tuning guides from Elastic, Typesense, and Meilisearch.

If you want a visual demo of the impact of relevancy tuning, here is a quick 4-minute video about it:

Numeric, date, and exact value boosters

While Postgres doesn't have direct support for boosting based on other columns, the rank is ultimately just a sort expression, so you can add your own signals to it.

For example, if you want to add a boost for the number of votes, you can do something like this:

SELECT title,
  ts_rank(search, websearch_to_tsquery('english', 'jedi'))
    -- numeric booster example
    + log(NumberOfVotes)*0.01 AS rank
 FROM movies
 WHERE search @@ websearch_to_tsquery('english','jedi')
 ORDER BY rank DESC LIMIT 10;

The logarithm is there to smooth out the impact, and the 0.01 factor brings the booster to a comparable scale with the ranking score.
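A quick Python calculation shows why the logarithm helps. It mirrors the arithmetic of the query above under two assumptions of mine: the text rank is a fixed 0.09, and vote counts are clamped to at least 1 so that the logarithm is defined. Postgres' log() on a single argument is base 10, hence `math.log10`.

```python
import math


def boosted_rank(text_rank, votes, factor=0.01):
    """Text rank plus a vote-based boost, dampened by a base-10 logarithm."""
    # max(votes, 1) guards against log10(0); an assumption, not in the SQL.
    return text_rank + math.log10(max(votes, 1)) * factor


for votes in (10, 1_000, 1_000_000):
    print(votes, round(boosted_rank(0.09, votes), 4))
```

Going from a thousand votes to a million only adds 0.03 to the score, so very popular movies get a nudge without drowning out the text relevancy.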

You can also design more complex boosters, for example, boost by the rating, but only if the movie has a certain number of votes. To do this, you can create a function like this:

create function numericBooster(rating numeric, votes numeric, voteThreshold numeric)
    returns numeric as $$
        select case when votes < voteThreshold then 0 else rating end;
$$ language sql;

And use it like this:

SELECT title,
  ts_rank(search, websearch_to_tsquery('english', 'jedi'))
    -- numeric booster example
    + numericBooster(Rating, NumberOfVotes, 100)*0.005 AS rank
 FROM movies
 WHERE search @@ websearch_to_tsquery('english','jedi')
 ORDER BY rank DESC LIMIT 10;

Let's take another example. Say we want to boost the ranking of comedies. You can create a valueBooster function that looks like this:

create function valueBooster (col text, val text, factor numeric)
    returns numeric as $$
        select case when col = val then factor else 0 end;
$$ language sql;

The function returns the factor if the column matches a particular value and 0 otherwise. Use it in a query like this:

SELECT title, genre,
   ts_rank(search, websearch_to_tsquery('english', 'jedi'))
   -- value booster example
   + valueBooster(Genre, 'comedy', 0.05) rank
FROM movies
   WHERE search @@ websearch_to_tsquery('english','jedi')
   ORDER BY rank DESC LIMIT 10;
                      title                       |               genre                |        rank
--------------------------------------------------+------------------------------------+---------------------
 The Men Who Stare at Goats                       | comedy                             |  0.1107927106320858
 Clerks                                           | comedy                             |  0.1107927106320858
 Star Wars: The Clone Wars                        | animation                          | 0.09513916820287704
 Star Wars: Episode I – The Phantom Menace 3D     | sci-fi                             | 0.09471701085567474
 Star Wars: Episode I – The Phantom Menace        | space opera                        | 0.09471701085567474
 Star Wars: Episode II – Attack of the Clones     | science fiction                    | 0.09285612404346466
 Star Wars: Episode III – Revenge of the Sith     | science fiction, action            | 0.09285612404346466
 Star Wars: The Last Jedi                         | action, adventure, fantasy, sci-fi |  0.0889768898487091
 Return of the Jedi                               | science fiction                    | 0.07599088549613953
 Star Wars Episode IV: A New Hope (aka Star Wars) | science fiction                    | 0.07599088549613953
(10 rows)

Column weights

Remember when we talked about the tsvector lexemes and that they can have weights attached? Postgres supports 4 weights, named A, B, C, and D. A is the biggest weight while D is the lowest and the default. You can control the weights via the setweight function which you would typically call when building the tsvector column:

ALTER TABLE movies ADD search tsvector GENERATED ALWAYS AS
   (setweight(to_tsvector('english', Title), 'A') || ' ' ||
   to_tsvector('english', Plot) || ' ' ||
   to_tsvector('simple', Director) || ' ' ||
   to_tsvector('simple', Genre) || ' ' ||
   to_tsvector('simple', Origin) || ' ' ||
   to_tsvector('simple', Casting)
) STORED;

Let's see the effects of this. Without setweight, a search for jedi returns:

SELECT title, ts_rank(search, websearch_to_tsquery('english', 'jedi')) rank
   FROM movies
   WHERE search @@ websearch_to_tsquery('english','jedi')
   ORDER BY rank DESC;
                      title                       |    rank
--------------------------------------------------+-------------
 Star Wars: The Clone Wars                        |  0.09513917
 Star Wars: Episode I – The Phantom Menace        |  0.09471701
 Star Wars: Episode I – The Phantom Menace 3D     |  0.09471701
 Star Wars: Episode III – Revenge of the Sith     | 0.092856124
 Star Wars: Episode II – Attack of the Clones     | 0.092856124
 Star Wars: The Last Jedi                         |  0.08897689
 Return of the Jedi                               | 0.075990885
 Star Wars Episode IV: A New Hope (aka Star Wars) | 0.075990885
 Clerks                                           |  0.06079271
 The Empire Strikes Back                          |  0.06079271
 The Men Who Stare at Goats                       |  0.06079271
 How to Deal                                      |  0.06079271
(12 rows)

And with the setweight on the title column:

SELECT title, ts_rank(search, websearch_to_tsquery('english', 'jedi')) rank
   FROM movies
   WHERE search @@ websearch_to_tsquery('english','jedi')
   ORDER BY rank DESC;
                      title                       |    rank
--------------------------------------------------+-------------
 Star Wars: The Last Jedi                         |   0.6361112
 Return of the Jedi                               |   0.6231253
 Star Wars: The Clone Wars                        |  0.09513917
 Star Wars: Episode I – The Phantom Menace        |  0.09471701
 Star Wars: Episode I – The Phantom Menace 3D     |  0.09471701
 Star Wars: Episode III – Revenge of the Sith     | 0.092856124
 Star Wars: Episode II – Attack of the Clones     | 0.092856124
 Star Wars Episode IV: A New Hope (aka Star Wars) | 0.075990885
 The Empire Strikes Back                          |  0.06079271
 Clerks                                           |  0.06079271
 The Men Who Stare at Goats                       |  0.06079271
 How to Deal                                      |  0.06079271
(12 rows)

Note how the movie titles with “jedi” in their name have jumped to the top of the list, and their rank has increased.

It's worth pointing out that having only four weight “classes” is somewhat limiting, and that they need to be applied when computing the tsvector.

Typo-tolerance / fuzzy search

PostgreSQL doesn't directly support fuzzy search or typo-tolerance when using tsvector and tsquery. However, working on the assumption that the typo is in the query part, we can implement the following idea:

index all lexemes from the content in a separate table
for each word in the query, use similarity or Levenshtein distance to search in this table
modify the query to include any words that are found
perform the search

Here is how it works. First, use ts_stat to get all words in a materialized view:

CREATE MATERIALIZED VIEW unique_lexeme AS
   SELECT word FROM ts_stat('SELECT search FROM movies');

Now, for each word in the query, check if it is in the unique_lexeme view. If it's not, do a fuzzy-search in that view to find possible misspellings of it:

SELECT * FROM unique_lexeme
   WHERE levenshtein_less_equal(word, 'pregant', 2) < 2;

   word
----------
 premant
 pregrant
 pregnant
 paegant

In the above we use the Levenshtein distance (levenshtein_less_equal is provided by the fuzzystrmatch extension) because that's what search engines like Elasticsearch use for fuzzy search.

Once you have the candidate list of words, you need to adjust the query to include them all.
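One way to do that adjustment is sketched below in Python. It is only an illustration under a few assumptions: the lexemes have already been read from the unique_lexeme view into memory, the helper names levenshtein and expand_query are made up for this example, and stemming of the query words is ignored. Each query word that is not a known lexeme is replaced by an OR-group of the lexemes within a small edit distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def expand_query(query: str, lexemes: set, max_distance: int = 1) -> str:
    """Build a to_tsquery-style string, OR-ing in close lexemes for unknown words."""
    groups = []
    for word in query.lower().split():
        if word in lexemes:
            groups.append(word)
            continue
        candidates = sorted(
            lex for lex in lexemes if levenshtein(word, lex) <= max_distance
        )
        # Keep the original word if nothing close enough was found.
        groups.append("(" + " | ".join(candidates) + ")" if candidates else word)
    return " & ".join(groups)


# Hypothetical lexemes, as if they had been read from the unique_lexeme view.
lexemes = {"pregnant", "pregrant", "premant", "paegant", "woman", "star"}
expanded = expand_query("pregant woman", lexemes)
print(expanded)
```

The resulting string uses the & and | operators, so it could be passed to to_tsquery('english', ...) instead of websearch_to_tsquery. In a real implementation the distance computation would more likely stay in SQL, as in the query above.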

Faceted search

Faceted search is popular especially on e-commerce sites because it helps customers to iteratively narrow their search. Here is an example from amazon.com:

The above can be implemented by manually defining categories and then adding them as WHERE conditions to the search. Another approach is to create the categories algorithmically based on the existing data. For example, you can use the following to create a “Decade” facet:

SELECT ReleaseYear/10*10 decade, count(Title) cnt FROM movies
  WHERE search @@ websearch_to_tsquery('english','star wars')
  GROUP BY decade ORDER BY cnt DESC;

decade | cnt
--------+-----
   2000 |  39
   2010 |  31
   1990 |  29
   1950 |  28
   1940 |  26
   1980 |  22
   1930 |  13
   1960 |  11
   1970 |   7
   1910 |   3
   1920 |   3
(11 rows)

This also provides counts of matches for each decade, which you can display in brackets.

If you want to get multiple facets in a single query, you can combine them, for example, by using CTEs:

WITH releaseYearFacets AS (
  SELECT 'Decade' facet, (ReleaseYear/10*10)::text val, count(Title) cnt
  FROM movies
  WHERE search @@ websearch_to_tsquery('english','star wars')
  GROUP BY val ORDER BY cnt DESC),
genreFacets AS (
  SELECT 'Genre' facet, Genre val, count(Title) cnt FROM movies
  WHERE search @@ websearch_to_tsquery('english','star wars')
  GROUP BY val ORDER BY cnt DESC LIMIT 5)
SELECT * FROM releaseYearFacets UNION SELECT * FROM genreFacets;

 facet  |   val   | cnt
--------+---------+-----
 Decade | 1910    |   3
 Decade | 1920    |   3
 Decade | 1930    |  13
 Decade | 1940    |  26
 Decade | 1950    |  28
 Decade | 1960    |  11
 Decade | 1970    |   7
 Decade | 1980    |  22
 Decade | 1990    |  29
 Decade | 2000    |  39
 Decade | 2010    |  31
 Genre  | comedy  |  21
 Genre  | drama   |  35
 Genre  | musical |   9
 Genre  | unknown |  13
 Genre  | war     |  15
(16 rows)

The above should work quite well on small to medium data sets. However, it can become slow on very large data sets, because every facet runs its own search and aggregation over the matching rows.

Conclusion

We've seen the PostgreSQL full-text search primitives, and how we can combine them to create a pretty advanced full-text search engine, which also happens to support things like joins and ACID transactions. In other words, it has functionality that the other search engines typically don't have.

There are more advanced search topics that would be worth covering in detail:

suggesters / auto-complete
exact phrase matching
hybrid search (semantic + keyword) by combining with pg-vector

Each of these would be worth their own blog post (coming!), but by now you should have an intuitive feeling about them: they are quite possible using PostgreSQL, but they require you to do the work of combining the primitives and in some cases the performance might suffer on very large datasets.

In part 2, we'll make a detailed comparison with Elasticsearch, looking to answer the question on when is it worth it to implement search into PostgreSQL versus adding Elasticsearch to your infrastructure and syncing the data. If you want to be notified when this gets published, you can follow us on Twitter or join our Discord.

Postgres schema changes are still a PITA

Tudor Golubenco — Sat, 08 Jul 2023 21:23:42 +0000

We software engineers don’t agree on much, but we agree on this one: database schema changes are a pain in the a**.

Part of my job at Xata is to talk with as many developers as possible - from fresh bootcamp graduates, to indie developers, to principal engineers working in large teams. We talk about databases in general, what issues they face, the tools they use, and so on.

From the people that we’ve talked with, almost everyone said that schema changes and schema management are one of their least favorite parts of working with databases. While this sentiment is pretty universal, the reasons that they bring up are not always the same. Small companies or developers working on hobby projects, for example, have to make lots of changes as their applications grow and they discover new requirements. They’d like their schema changes workflow to be as simple and straight-forward as pull requests on GitHub.

For larger companies, with more data and high-traffic tables, schema changes may happen less often, but they still need to worry about things like downtime caused by locking. They require long internal guides on performing schema changes correctly (e.g. GitLab, PayPal), custom tools (e.g. Meta, Square), and they often document incidents or near-incidents caused by schema migrations (e.g. GitHub, Doctolib, GoCardless).

Across the board, developers complain about schema changes affecting their velocity: they require more communication and more steps, and they come with backwards compatibility concerns. As a result, some developers never modify and never remove columns, they only add new ones. This creates “schema debt”, which creates bugs, confuses new team members, and keeps compatibility code around way past its use-by date.

Issues with schema changes

The following are specific to PostgreSQL, because this is the database system we use at Xata, but some of them apply to other database systems as well.

Locking gotchas

PostgreSQL has many different types of locks and most ALTER TABLE statements (but not all) take an ACCESS EXCLUSIVE lock on the table, which conflicts with all other lock types. This means that the table is essentially inaccessible while the lock is held, so it’s important to hold it for the minimum possible duration.

Even when the lock is taken for a short time, it can still make the table seem unavailable. During normal operations, reads and writes run on your tables concurrently. However, when an ALTER TABLE query requiring ACCESS EXCLUSIVE runs, Postgres needs to create an opening where no other queries or transactions are running. To achieve this, all queries issued after the ALTER TABLE are put in a queue to run after the ALTER TABLE completes. Here's the issue: if your table has a running query or a running transaction, your schema change cannot begin. While Postgres is waiting to run your schema change, all of the queries that you would've assumed would run instantly are now being queued up. This gives your table the appearance of being unavailable.

The trick to avoid the above is to explicitly take the lock before running the ALTER TABLE and set a lock_timeout to avoid queueing queries for a long period of time. If the timeout hits, you keep retrying until the lock succeeds, and only then perform the ALTER TABLE.
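The retry loop described above could look roughly like the following Python sketch. It is not tied to a specific driver: execute is assumed to be any callable that runs one SQL statement, and LockTimeout is a hypothetical exception that you would map from your driver's lock-not-available error (the one Postgres reports when lock_timeout expires). The function name and parameters are made up for this illustration.

```python
import time


class LockTimeout(Exception):
    """Raised by the (hypothetical) execute callable when lock_timeout expires."""


def alter_with_lock_retry(execute, table, alter_sql, lock_timeout="2s",
                          max_attempts=10, pause_seconds=1.0, sleep=time.sleep):
    """Take the table lock explicitly, with a short timeout, before altering.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        execute("BEGIN")
        try:
            # SET LOCAL only lasts until the end of this transaction.
            execute(f"SET LOCAL lock_timeout = '{lock_timeout}'")
            execute(f"LOCK TABLE {table} IN ACCESS EXCLUSIVE MODE")
            execute(alter_sql)    # the lock is already held at this point
            execute("COMMIT")
            return attempt
        except LockTimeout:
            execute("ROLLBACK")   # give up our place in the lock queue
            sleep(pause_seconds)  # let the queued queries drain before retrying
    raise RuntimeError(f"could not lock {table} after {max_attempts} attempts")
```

The short timeout is the important design choice here: each failed attempt blocks other queries for at most lock_timeout, instead of for as long as the longest running transaction on the table.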

There are other gotchas as well. For example, adding an index without blocking writes requires the magic keyword CONCURRENTLY, but CREATE INDEX CONCURRENTLY doesn’t work inside a transaction block. When adding a constraint like NOT NULL, Postgres needs to first check that there are no NULLs in the table, and it has to do that while holding the lock. You can work around it by adding a CHECK constraint with NOT VALID, which means that the constraint is applied to new rows but not to old ones. Then you can run VALIDATE CONSTRAINT later, which doesn’t need the ACCESS EXCLUSIVE lock because it assumes new rows are respecting the constraint.
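As a rough illustration of the NOT NULL workaround, the Python helper below returns the SQL statements in the order they would be run. The helper name and the constraint naming scheme are made up. Steps 3 and 4 go slightly beyond the text above: they rely on Postgres 12 and later being able to skip the table scan for SET NOT NULL when a validated CHECK constraint already proves that the column has no NULLs.

```python
def not_null_migration(table: str, column: str) -> list:
    """Return the SQL steps, in order, for adding NOT NULL with short locks."""
    constraint = f"{table}_{column}_not_null"  # hypothetical naming scheme
    return [
        # 1. Enforced for new and updated rows only; existing rows are not scanned.
        f"ALTER TABLE {table} ADD CONSTRAINT {constraint} "
        f"CHECK ({column} IS NOT NULL) NOT VALID",
        # 2. Checks the existing rows without taking ACCESS EXCLUSIVE.
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {constraint}",
        # 3. On Postgres 12+, the validated CHECK lets this skip the full scan.
        f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL",
        # 4. The CHECK constraint is now redundant and can be removed.
        f"ALTER TABLE {table} DROP CONSTRAINT {constraint}",
    ]


for statement in not_null_migration("movies", "title"):
    print(statement + ";")
```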

In short, PostgreSQL offers good ways to control locking and avoid keeping locks for a long time, but it’s fair to say that it is a minefield; it’s easy to make mistakes, and every mistake can be costly. For this reason, teams need to have internal docs on how to do it correctly, and be extra careful during reviews. Staging systems are helpful to catch some of the gotchas, as long as they have the exact same data as the prod database and similar levels of traffic, which is typically not the case.

Application deploys and the 6 stages of rename

The previous section mostly applies to teams that have large tables, but this one applies to any team that cares about things not breaking. It’s also not Postgres specific.

Schema changes don’t happen in isolation, they are part of a new feature or fix that includes application code changes. If we could deploy the schema change and the application code to all application servers at the exact same moment, this wouldn’t be an issue. However, in practice, there’s a window of time where the old code needs to work with the new schema, or the other way around.

The strategy to do things correctly depends on the change you’d like to make. For example, if you add a column: you would first perform the schema change, backfill the data, then deploy the application code. After, you would apply constraints, which itself can be a multiple-step process.

If you want to remove a column, it’s the other way around. You first make sure no code is accessing that column, then you drop it from the schema.

What about renames? You would follow these steps (hat tip to the PlanetScale docs):

Create a new column with the new name.
Update and deploy the application to write data to both columns.
Backfill missing data from the old column to the new column.
Optionally, add constraints like NOT NULL to the new column once all the data is backfilled.
Update the application to only use the new column, and remove any references to the old column name.
Drop the old column.
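To make the six steps concrete, here is a small runnable sketch. It uses Python's built-in sqlite3 module as a stand-in for Postgres so that it can run anywhere, and the table and column names (users, fullname, display_name) are invented for the example. On a real system each step would be a separate deploy or migration.

```python
import sqlite3

# SQLite stands in for Postgres here; table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.executemany("INSERT INTO users (fullname) VALUES (?)", [("Ada",), ("Linus",)])

# Step 1: create a new column with the new name.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


# Step 2: the deployed application writes to both columns.
def insert_user(name):
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


insert_user("Grace")

# Step 3: backfill the rows that were written before step 2.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Step 4 precondition: no row may be missing the new value before adding NOT NULL.
missing = conn.execute(
    "SELECT count(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]

# Step 5: the application now reads and writes only the new column.
names = [row[0] for row in conn.execute("SELECT display_name FROM users ORDER BY id")]
print(missing, names)

# Step 6 (not executed here): ALTER TABLE users DROP COLUMN fullname;
```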

In summary, in order to make schema changes correctly, even ignoring locking issues, you often need a multi-step process. This both slows you down and creates a large surface area for expensive mistakes.

Rollback? Rollback your expectations

If there’s one thing scarier than performing schema changes, it’s the thought that you might have to undo them under time constraints.

If schema changes require careful consideration and a multi-step process, rolling them back is not any simpler. You typically have to carefully apply the same steps in the reverse order.

Because this is complicated and takes a long time, most of us tend to not test rollbacks before starting a schema migration. This makes it doubly scary; you’re operating an incident, with your database in an unknown state, and now you need to run through a set of untested steps in production.

For the disciplined developers out there who did test their rollbacks, you may still be stuck waiting a long time for your rollback to complete. This can leave you staring at a terminal for minutes or hours, waiting for your application to come back online for your users.

What if we weren’t so afraid of schema changes?

One of my favorite parts about working on Xata is that we can take these types of workflow issues and reconsider them from first principles.

Let’s imagine that we have a magic wand. What would your ideal schema management system look like? This is mine:

There is a standard workflow that is followed every time, regardless of the type of schema change.
Schema changes are lock-free or take table locks for minimal amounts of time.
Schema changes don’t cause downtime by breaking application code.
You can apply all types of changes in a single-step, or at most a couple of steps.
Schema changes are undoable, and the undo is quick.

In short, a system that makes schema changes standardized, zero-downtime, lock-free, one-step, and undoable.

Is this possible?

The first insight is that when it comes to schema changes, we put a lot of onus on the application code in order to keep the database simple. If we move the complexity to the database side, we implement it once and all applications benefit from it.

Instead of having to make the application code forwards/backwards compatible, we make the database schema backwards compatible. The database could serve both the old schema (before the change) and the new schema (after the change) simultaneously. You can apply the schema change, and both old code and new code can work in parallel until the rolling upgrade is complete.

Putting it on a timeline, it would look like this:

The schema change is started, for example when the PR is merged. It can take some time until the new schema is available, so the application deployment doesn’t yet start.
Once the new schema is ready, the application deployment starts. It can be a rolling restart, so the old and the new version of the application might coexist for a period of time. That is ok because the old schema is still available.
Once the application deployment is completed, the old schema can be deleted. However, you might want to leave it live for a while longer, in case a rollback is needed.

If a rollback is needed while the old schema is still available, you can safely roll back the application code. From its point of view, the schema was never modified.

During the time period in which both the old and the new schema are valid, the system keeps temporary hidden columns, and uses views to represent the old and the new schema. New inserts and updates are automatically “upgraded” or “downgraded” between the two schemas, so both the old and the new versions of the application can work normally.
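The read side of this idea can be sketched with two views over one physical table. The example below again uses Python's sqlite3 module purely as a runnable stand-in, and every name in it (users_data, users_v1, users_v2) is hypothetical. It shows a column rename being served under both names at once. It does not show the write path, which in Postgres would rely on updatable views or triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Physical table: stores the column under its new name.
conn.execute("CREATE TABLE users_data (id INTEGER PRIMARY KEY, display_name TEXT)")
conn.execute("INSERT INTO users_data (display_name) VALUES ('Ada')")

# One view per schema version; each exposes the column name its clients expect.
conn.execute(
    "CREATE VIEW users_v1 AS SELECT id, display_name AS fullname FROM users_data"
)
conn.execute("CREATE VIEW users_v2 AS SELECT id, display_name FROM users_data")

# Old application code keeps using "fullname"; new code uses "display_name".
old_cols = [d[0] for d in conn.execute("SELECT * FROM users_v1").description]
new_cols = [d[0] for d in conn.execute("SELECT * FROM users_v2").description]
old_rows = conn.execute("SELECT fullname FROM users_v1").fetchall()
new_rows = conn.execute("SELECT display_name FROM users_v2").fetchall()
print(old_cols, new_cols, old_rows == new_rows)
```

Dropping the old schema then amounts to dropping the users_v1 view, which is why both the cleanup and the rollback can be quick.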

With the above, the basic column operations (add/remove/rename) all become standardized and safe. No more having to schedule schema changes, no more avoiding renames, or postponing deleting unused columns for weeks “just to be sure”.

Adding constraints is more challenging to implement, because the old and the new schema can be in conflict. For example, let’s say you are adding a NOT NULL constraint on a column. If a new INSERT comes over the old schema with a NULL value, it needs to be accepted, because it respects the old schema. However, the resulting row cannot be exposed in the new schema, because it wouldn’t respect the constraint. In this case, it’s best to hide the new row in the new schema, and block the migration from completing until this issue is solved.

From what I know, there is only one project that tries something close to this: the relatively recent Reshape. It uses Postgres views to expose the two versions of the schema and triggers to upgrade/downgrade the new data. It doesn’t do the constraints part as described above, but shows that this approach is possible. Combined with the Xata pull request based workflow, I think the ideal system described above is possible!

Next Steps

If you’re feeling the pain of schema changes and think that there must be a better way, we’d love to talk to you! We plan to work on this in the form of an open-source project for PostgreSQL, and if you want to be notified when we release it, you can follow us on Twitter, join us on Discord, or simply sign up for Xata.

On the performance of REPLICA IDENTITY FULL in Postgres

Tudor Golubenco — Wed, 07 Jun 2023 11:49:39 +0000

At Xata, we’re relying on the Postgres logical replication events quite a bit. Having a reliable stream with everything that has changed in the database is really useful in a number of cases, including data syncing to Elasticsearch, web-hooks, attachments clean-up, and more. One of the configuration parameters for logical replication is the so-called “replica identity”:

A published table must have a “replica identity” configured in order to be able to replicate UPDATE and DELETE operations, so that appropriate rows to update or delete can be identified on the subscriber side. By default, this is the primary key, if there is one.

You’d typically want to use the primary key or an index as the “replica identity” and avoid setting the “replica identity” to full. The reason is this quote from the Postgres docs (emphasis mine):

If the table does not have any suitable key, then it can be set to replica identity “full”, which means the entire row becomes the key. This, however, is very inefficient and should only be used as a fallback if no other solution is possible.

That sounds quite discouraging. However, there are some good reasons why you might still want to consider enabling it (see next section), so it’s worth digging a bit deeper and analyzing what exactly happens when you enable it and how it affects performance. This is what this blog post is about.

Ok, but why would you enable it?

In our case there were two things that convinced us that we might want to consider REPLICA IDENTITY FULL:

On UPDATEs and DELETEs, Postgres only includes the old values of the “identity columns”. So if you want to get the values after the change as well as the values before the change, setting replica identity to “full” tricks Postgres into passing the old values as well.
UPDATE WAL events don’t include TOASTed columns, unless that column has changed in the current operation. In practice this means that if you have large column values that Postgres stores out of line (TOAST kicks in once a row grows beyond roughly 2 kB), they will be missing from the update events. Again, by setting the replica identity to FULL, we make sure they are always in the WAL event.

To illustrate the above, let’s look at an example, using wal2json for convenience. An update event looks something like this:

{
   "kind": "update",
   "schema": "xata",
   "table": "foo",
   "columnnames": ["id", "name", "age"],
   "columntypes": ["text", "text", "integer"],
   "columnvalues": ["michael", "Michael Scott", 42],
   "oldkeys": {
     "keynames": ["id"],
     "keytypes": ["text"],
     "keyvalues": ["michael"]
   }
}

The above was generated after a command like this one:

UPDATE foo SET age=age+1 WHERE id='michael';

The table has a primary key (the id column) and that is used as the replica identity.

Things to note:

Under columnvalues the new values for the id, name, and age columns are included.
Under oldkeys only the id column is included, because that’s what the REPLICA IDENTITY is set to.

The table, however, has another column, named description, which is not included at all in the event. Why? Because its value is large and it is TOASTed. If the UPDATE command had changed the description, then it would have been in the columnvalues, but because it was not touched, it is skipped.

Now let’s compare it with the same event generated with REPLICA IDENTITY FULL:

ALTER TABLE foo REPLICA IDENTITY FULL;

The event for the same UPDATE command above looks like this:

{
    "kind": "update",
    "schema": "xata",
    "table": "foo",
    "columnnames": ["id", "name", "age"],
    "columntypes": ["text", "text", "integer"],
    "columnvalues": ["michael", "Michael Scott", 43],
    "oldkeys": {
      "keynames": ["id", "name", "age", "description"],
      "keytypes": ["text", "text", "integer", "text"],
      "keyvalues": ["michael", "Michael Scott", 42, "<redacted very large description>"]
    }
}

Things to note:

The columnvalues looks the same. It contains all the columns and their new values, but not the TOASTed column/value pairs.
The oldkeys now contains all the columns with their values before the update, including the TOAST value for description.
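With both the old and the new values in the event, a consumer can work out exactly what changed. The Python sketch below assumes the event layout shown above (columnnames/columnvalues plus oldkeys.keynames/keyvalues) and a table with REPLICA IDENTITY FULL. The function name changed_columns is made up for the example.

```python
import json


def changed_columns(event: dict) -> dict:
    """Map column name -> (old, new) for every column whose value changed.

    Assumes an update event in the layout shown above, produced with
    REPLICA IDENTITY FULL so that oldkeys carries the full old row.
    """
    new = dict(zip(event["columnnames"], event["columnvalues"]))
    old = dict(zip(event["oldkeys"]["keynames"], event["oldkeys"]["keyvalues"]))
    # Columns absent from columnvalues (untouched TOASTed ones) are skipped.
    return {
        name: (old.get(name), value)
        for name, value in new.items()
        if old.get(name) != value
    }


event = json.loads("""
{
  "kind": "update", "schema": "xata", "table": "foo",
  "columnnames": ["id", "name", "age"],
  "columnvalues": ["michael", "Michael Scott", 43],
  "oldkeys": {
    "keynames": ["id", "name", "age", "description"],
    "keyvalues": ["michael", "Michael Scott", 42, "<large description>"]
  }
}
""")
changes = changed_columns(event)
print(changes)
```

The unchanged description can still be read from oldkeys when the consumer needs the complete row, for example to re-index a document in Elasticsearch.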

It should be said that with the replica identity set to the primary key or another unique index, you get enough data out in the replication stream to keep a Postgres replica completely in sync. This means that you can overcome the limitations by keeping state outside of Postgres. However, the amount of work that you need to do outside of Postgres might become significant both in terms of resources needed and added code (compared to a single ALTER TABLE statement).

Impact on the replica and changes in Postgres 16

The quote from the beginning of the blog post with the “very inefficient” wording is actually from the Postgres 15 docs (currently stable), but it was changed in the Postgres 16 docs (currently under development):

If the table does not have any suitable key, then it can be set to replica identity FULL, which means the entire row becomes the key. When replica identity FULL is specified, indexes can be used on the subscriber side for searching the rows. […] If there are no such suitable indexes, the search on the subscriber side can be very inefficient, therefore replica identity FULL should only be used as a fallback if no other solution is possible.

The above hints that the “very inefficient” part refers to the subscriber that needs to identify the rows that were updated / deleted, and that this is being improved in version 16 so that it’s able to use indexed columns even if they are not set explicitly in the replica identity.

What is not clear from the docs above is that in previous versions of Postgres, too, if the table has a primary key, it is used by the subscriber to find the affected row. You can see this in the code: it looks for a replica identity index, and if none is found, it uses the PK instead.

/*
 * Get replica identity index or if it is not defined a primary key.
 *
 * If neither is defined, returns InvalidOid
 */
static Oid
GetRelationIdentityOrPK(Relation rel)
{
    Oid         idxoid;

    idxoid = RelationGetReplicaIndex(rel);

    if (!OidIsValid(idxoid))
        idxoid = RelationGetPrimaryKeyIndex(rel);

    return idxoid;
}

This is good news because it means that the “very inefficient” case is only relevant to tables that have no primary key.

With Postgres 16, the “very inefficient” case is reduced even further, because now Postgres can use other indexes beyond the PK, even if they are not explicitly set in the REPLICA IDENTITY. The patch for this change can be found here. After the patch, the GetRelationIdentityOrPK function stays the same, but the caller routine makes another effort into finding a suitable index to use:

    /*
     * Simple case, we already have a primary key or a replica identity index.
     */
    idxoid = GetRelationIdentityOrPK(localrel);
    if (OidIsValid(idxoid))
        return idxoid;

    if (remoterel->replident == REPLICA_IDENTITY_FULL)
    {
       /*
        * We are looking for one more opportunity for using an index. If
        * there are any indexes defined on the local relation, try to pick a
        * suitable index.
        *
        * The index selection safely assumes that all the columns are going
        * to be available for the index scan given that remote relation has
        * replica identity full.
        *
        * Note that we are not using the planner to find the cheapest method
        * to scan the relation as that would require us to either use lower
        * level planner functions which would be a maintenance burden in the
        * long run or use the full-fledged planner which could cause
        * overhead.
        */
        return FindUsableIndexForReplicaIdentityFull(localrel, attrMap);
    }

Impact on the size of the WAL events

While the worst potential impact of REPLICA IDENTITY FULL is on the replica, as we’ve seen above, it does also impact the primary because it requires more data to be stored in the WAL and it causes it to send more data over the network to the subscribers. This means more IO bandwidth consumed, more CPU usage, and more RAM usage.

How much more? It depends on the type of write traffic you have and how many large values you have, as well as how often they are updated.

Let’s look at it by the event type:

Inserts

Inserts are not impacted at all, because there are no “old values” and the large values are included regardless of the REPLICA IDENTITY setting. Easy.

Updates

If REPLICA IDENTITY is set to FULL, Postgres needs to keep the old values for all columns while executing the transaction and then write them to the WAL.

If there are TOAST values in the row, they need to be loaded in memory, decompressed, and then logged to the WAL. The relevant code that loads them into memory is shown below, with most of the hard work happening in toast_flatten_tuple.

    if (replident == REPLICA_IDENTITY_FULL)
    {
        /*
         * When logging the entire old tuple, it very well could contain
         * toasted columns. If so, force them to be inlined.
         */
        if (HeapTupleHasExternal(tp))
        {
            *copy = true;
            tp = toast_flatten_tuple(tp, desc);
        }
        return tp;
    }

In conclusion, the size of the WAL events for updates is ~doubled if there are no TOASTed values, and potentially way more than doubled if there are TOASTed values.

Deletes

For deletes normally only the key is logged in the WAL, and by enabling REPLICA IDENTITY FULL, all columns become part of the key, meaning that the whole row is logged and sent over the wire on delete.

Benchmarking

Based on the analysis above, we were guessing that the cost of REPLICA IDENTITY FULL would be worth it in our case, so we decided to put it to the test. We wanted to simulate something close to the worst-case scenario but still realistic, so we extended our benchmarking test suite with a test that has a lot of large values and performs a lot of updates. The test didn’t push the resources to the limit but was heavy enough for us to measure the impact.

Here are the results of running the benchmark test before and after enabling REPLICA IDENTITY FULL.

The CPU time showed a modest increase in peak times from 30% to about 35%. There was a similar modest increase in the peak replication slot lag.

Summary

If you’re using logical replication between instances, the impact from REPLICA IDENTITY FULL will likely be manageable as long as the replicated tables have a Primary Key. If you are on Postgres 16 already, having no Primary Key but another unique index might also be ok.

In any case, there will be impact in the amount of data written to the WAL and sent over the network. The more UPDATE and DELETE traffic that you have, the more significant this impact will be.

If you find the above interesting and want to work on similar challenges, make sure to check out the Xata careers page.

Semantic Search With Xata, OpenAI, TypeScript, and Deno

Tudor Golubenco — Tue, 21 Mar 2023 21:47:19 +0000

At the same time we launched our ChatGPT integration, we also added a new vector type to Xata, making it easy to store embeddings. Additionally, we have added a new vectorSearch endpoint, which performs similarity search on embeddings.

Let’s take a quick tour to see how you can use these new capabilities to implement semantic search. We’re going to use the OpenAI embeddings API, TypeScript and Deno. This tutorial assumes prior knowledge of TypeScript, but no prior knowledge of Xata, Deno, or OpenAI.

What is semantic search?

Instead of just matching keywords, as traditional search engines do, semantic search attempts to understand the context and intent of the query and the relationships between the words used in it.

For example, let’s say you have the following sample sentences:

"The quick brown fox jumps over the lazy dog story"
“The vehicle collided with the tree”
“The cat growled towards the dog”
“Lorem ipsum dolor sit amet, consectetur adipiscing elit”
“The sunset painted the sky with vibrant colors”

If you search for “sample text in latin”, traditional keyword search won’t match the “lorem ipsum” text, but semantic search will (and we’re going to demo it in this article).

Similarly, if you search for “the kitty hissed at the puppy”, semantic search will see that the phrase has the same meaning as “the cat growled towards the dog”, even if they use none of the same words. Or, for another example, “vanilla sky” should bring up the “The sunset painted the sky with vibrant colors” sentence. Pretty cool, right? This is now quite possible thanks to large language models and vector search.

A quick intro to embeddings

From a data point of view, embeddings are simply arrays of floats. They are the output of ML models, where the input can be a word, a sentence, a paragraph of text, an image, an audio file, a video file, and so on.

Each number in the array of floats represents the input text on a particular dimension, which depends on the model. This means that the more similar the embeddings are, the more “similar” the input objects are. I’ve put “similar” in quotes because it depends on the model what kind of similar it means. When it comes to text, it’s usually about “similar in meaning”, even if different words, expressions, or languages are used.
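To make “similar embeddings” a bit more tangible, here is a small Python sketch using cosine similarity, which is one common way to compare two vectors. It is shown only as an illustration and is not necessarily the measure used by any particular model or search endpoint. The three-dimensional vectors are toy values invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
cat = [0.9, 0.1, 0.0]
kitty = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

# Vectors for related inputs should score higher than vectors for unrelated ones.
print(cosine_similarity(cat, kitty) > cosine_similarity(cat, car))
```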

Reducing complex data to an array of numbers representing its qualities turns out to be very useful for a number of use cases. Think of reverse image search, recommendation algorithms for video and music, product recommendations, categorizing products, and so on.

Vector type in Xata

If you want to follow along, start with these steps:

Sign up or sign into Xata here (the usage from this tutorial fits well within the free tier, so you don’t need to set up billing)
Create a database named vectors
Create a table named Sentences
Add two columns:
- sentence of type string
- embedding of type vector. Use 1536 as the dimension

When you are done, the schema should look like this:

Initialize the Xata project

To get ready for running the TypeScript code, install the Xata CLI:

npm install -g @xata.io/cli@latest

Run xata auth login to authenticate the CLI. This will open a browser window and prompt you to generate a new API key. Give it any name you’d like.

xata auth login

Create a folder for the code:

mkdir sentences
cd sentences

And run xata init to connect it to the Xata DB:

xata init

The Xata CLI will ask you to select the database and then ask you how to generate the types. Select Generate TypeScript code with Deno imports. Use default settings for the rest of the questions.

Prepare OpenAI and Deno

Create an OpenAI account and generate a key. Note that you need to set up billing for OpenAI in order to run these examples, but the cost will be tiny (under $1).

Add the OpenAI key to the .env file which was created by the xata init command above. Your .env should look something like this:

# API key used by the CLI and the SDK
# Make sure your framework/tooling loads this file on startup to have it available for the SDK
XATA_API_KEY=xau_<redacted>
OPENAI_API_KEY=sk-<redacted>

Install the Deno CLI. See this page for the various install options. On macOS with Homebrew, it is:

brew install deno

Load data

It’s now time to write a bit of TypeScript code. Create a loadWithEmbeddings.ts file at the top level of your project with the following contents:

import { Configuration, OpenAIApi } from "npm:openai";
import { getXataClient } from "./src/xata.ts";
import { config as dotenvConfig } from "https://deno.land/x/dotenv@v1.0.1/mod.ts";

dotenvConfig({ export: true });

const openAIConfig = new Configuration({
  apiKey: Deno.env.get("OPENAI_API_KEY"),
});
const openAI = new OpenAIApi(openAIConfig);
const xata = getXataClient();

const sentences: string[] = [
  "The quick brown fox jumps over the lazy dog story",
  "The vehicle collided with the tree",
  "The cat growled towards the dog",
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit",
  "The sunset painted the sky with vibrant colors",
];

for (const sentence of sentences) {
  const resp = await openAI.createEmbedding({
    input: sentence,
    model: "text-embedding-ada-002",
  });
  const [{ embedding }] = resp.data.data;

  // Await the insert so that failures surface here instead of as unhandled rejections
  await xata.db.Sentences.create({
    sentence,
    embedding,
  });
}

Step by step, this is what the script is doing:

Uses dotenv to load the .env file into the current environment. This makes sure the XATA_API_KEY and the OPENAI_API_KEY are available to the rest of the script.
Initializes the OpenAI and Xata client libraries.
Defines the test data in the sentences array.
For each sentence, calls the OpenAI embeddings API to get the embedding for it.
Inserts a record containing the sentence and the embedding into Xata.

Execute the script like this:

deno run --allow-net --allow-env --allow-read --allow-run ./loadWithEmbeddings.ts

If you visit the Xata UI now, you should see the data loaded, together with the embeddings.

Run search queries

Let’s write another simple script that performs a search based on an input query. Name this script search.ts.

import { Configuration, OpenAIApi } from "npm:openai";
import { getXataClient } from "./src/xata.ts";
import { config as dotenvConfig } from "https://deno.land/x/dotenv@v1.0.1/mod.ts";

dotenvConfig({ export: true });

const openAIConfig = new Configuration({
  apiKey: Deno.env.get("OPENAI_API_KEY"),
});
const openAI = new OpenAIApi(openAIConfig);
const xata = getXataClient();

if (Deno.args.length !== 1) {
  console.log("Please provide a search query");
  console.log(
    "Example: deno run --allow-net --allow-env search.ts 'the quick brown fox'"
  );
  Deno.exit(1);
}

const query = Deno.args[0];
const resp = await openAI.createEmbedding({
  input: query,
  model: "text-embedding-ada-002",
});
const [{ embedding }] = resp.data.data;

const results = await xata.db.Sentences.vectorSearch("embedding", embedding);

for (const result of results) {
  console.log(result.getMetadata().score, "|", result.sentence);
}

Here is what’s going on in the script:

The beginning is similar to the previous script: it uses dotenv to load the .env file and then initializes the client libraries for OpenAI and Xata
Read the query as the first argument passed to the script
Use the OpenAI embeddings API to create embeddings for the query
Run a vector search using the Xata Vector Search API. This finds vectors that are similar to the provided embedding
Print the results, together with the similarity score

To run the script, execute it like this:

$ deno run --allow-net --allow-env --allow-read --allow-run \
  ./search.ts 'sample text in latin'

1.8154079 | Lorem ipsum dolor sit amet, consectetur adipiscing elit
1.7424928 | The quick brown fox jumps over the lazy dog story
1.7360129 | The cat growled towards the dog
1.7311659 | The sunset painted the sky with vibrant colors
1.7038174 | The vehicle collided with the tree

As you can see, searching for “sample text in latin” results in the “Lorem ipsum” text as we hoped. You can also try some variations, for example, “sample sentence” still brings the “Lorem ipsum” one as the top result:

$ deno run --allow-net --allow-env --allow-read --allow-run \
  ./search.ts 'sample sentence'

1.805396 | Lorem ipsum dolor sit amet, consectetur adipiscing elit
1.7715557 | The quick brown fox jumps over the lazy dog story
1.7608802 | The sunset painted the sky with vibrant colors
1.7573793 | The cat growled towards the dog
1.7493906 | The vehicle collided with the tree

The scores on the left are numbers between 0 and 2 which indicate how close each sentence is to the provided query. If you run with a sentence that exists in the data, you’ll get a score very close to 2:

$ deno run --allow-net --allow-env --allow-read --allow-run \
  ./search.ts 'The quick brown fox jumps over the lazy dog story'

1.9999993 | The quick brown fox jumps over the lazy dog story
1.8063612 | The cat growled towards the dog
1.769694 | Lorem ipsum dolor sit amet, consectetur adipiscing elit
1.7673006 | The vehicle collided with the tree
1.7605586 | The sunset painted the sky with vibrant colors

Now let’s try the “vanilla sky” query:

$ deno run --allow-net --allow-env --allow-read --allow-run \
  ./search.ts 'vanilla sky'

1.8137007 | The sunset painted the sky with vibrant colors
1.7584264 | Lorem ipsum dolor sit amet, consectetur adipiscing elit
1.7579571 | The quick brown fox jumps over the lazy dog story
1.7505519 | The vehicle collided with the tree
1.738231 | The cat growled towards the dog

Bingo, top result matches what we expected.

Another one to try:

$ deno run --allow-net --allow-env --allow-read --allow-run \
  ./search.ts 'The car crashed into the oak.'

1.907266 | The vehicle collided with the tree
1.7763984 | The cat growled towards the dog
1.7755028 | The quick brown fox jumps over the lazy dog story
1.7745116 | The sunset painted the sky with vibrant colors
1.7570261 | Lorem ipsum dolor sit amet, consectetur adipiscing elit

Again, the sentence with the same meaning shows up at the top with a score close to 2.

Conclusion

Large language models are powerful tools that open up new use cases. Semantic search is one of these use cases, and we’ve seen how Xata vector search can be used to implement it. You can also use it to build recommendation engines, or to find similar entries in a knowledge base or similar questions on a Q&A website.

If you’re running this tutorial, please join us on the Xata Discord and let us know what you are building!

Semantic or keyword search for finding ChatGPT context. Who searched it better?

Tudor Golubenco — Wed, 08 Mar 2023 11:46:02 +0000

Last week we added a Q&A bot that answers questions from our documentation. This leverages the ChatGPT tech to answer questions from the Xata documentation, even though the OpenAI GPT model was never trained on the Xata docs.

The way we do this is by using an approach suggested by Simon Willison in this blog post. The same approach can also be found in an OpenAI cookbook. The idea is the following:

Run a text search against the documentation to find the content that is most relevant to the question asked by the user.
Produce a prompt with this general form:

With these rules: {rules}
And this text: {context}
Given the above text, answer the question: {question}
Answer:

Send the prompt to the ChatGPT API and let the model complete the answer.
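As a rough sketch of the second step, the prompt template above can be filled in with a small helper. The function name and the way the rules are joined into one string are assumptions made for illustration, not the actual implementation behind the bot.

```typescript
// Hypothetical helper: fills in the prompt template shown above.
function buildPrompt(rules: string[], context: string, question: string): string {
  return [
    `With these rules: ${rules.join(" ")}`,
    `And this text: ${context}`,
    `Given the above text, answer the question: ${question}`,
    "Answer:",
  ].join("\n");
}

// Example values, invented for this sketch.
const prompt = buildPrompt(
  ["Only answer from the provided text."],
  "Install the CLI with: npm install -g @xata.io/cli@latest",
  "How do I install the Xata CLI?",
);
console.log(prompt);
```

The resulting string is what would be sent to the model in the third step.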

We found out that this works quite well and, combined with a relatively low model temperature (the concept of temperature is explained in this blog post), this tends to produce correct results and code snippets, as long as the answer can be found in the documentation.

A key limitation of this approach is that the prompt you build in the second step above can have at most 4000 tokens (~3000 words). This means that the first step, the text search to select the most relevant documents, becomes really important. If the search step does a good job and provides the right context, ChatGPT tends to also do a good job in producing a correct and to-the-point result.
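One way to respect this limit is to add search results to the context, best first, until the budget runs out. The sketch below assumes the chunks are already ranked by relevance and estimates tokens from the rough ratio mentioned above (4000 tokens for about 3000 words). A real tokenizer would give exact counts; the helper names are hypothetical.

```typescript
// Rough estimate based on the ratio in the text: 4000 tokens is about 3000 words.
const WORDS_PER_TOKEN = 3000 / 4000;

function estimateTokens(text: string): number {
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0).length;
  return Math.ceil(words / WORDS_PER_TOKEN);
}

// Takes chunks sorted by relevance (best first) and keeps adding them
// until the next chunk would exceed the token budget.
function fitToBudget(rankedChunks: string[], maxTokens: number): string[] {
  const selected: string[] = [];
  let used = 0;
  for (const chunk of rankedChunks) {
    const cost = estimateTokens(chunk);
    if (used + cost > maxTokens) break;
    selected.push(chunk);
    used += cost;
  }
  return selected;
}

// Each three-word chunk is estimated at 4 tokens, so a budget of 8 fits two of them.
console.log(fitToBudget(["a b c", "d e f", "g h i"], 8).length); // 2
```

In practice, part of the budget also has to be reserved for the rules, the question, and the generated answer.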

So what’s the best way to find the most relevant pieces of content in the documentation? The OpenAI cookbook, as well as Simon’s blog, use what is called semantic search. Semantic search leverages the language model to generate embeddings for both the question and the content. Embeddings are arrays of numbers that represent the text on a number of dimensions. Pieces of text that have similar embeddings have a similar meaning. This means a good strategy is to find the pieces of content that have the most similar embeddings to the question embedding.
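A brute-force version of this strategy can be sketched in a few lines. It assumes the embeddings are unit-length, so that the dot product can serve as the similarity measure; the Chunk type and the toy two-dimensional vectors are invented for illustration. A vector database does the same job with an index instead of scanning every row.

```typescript
interface Chunk {
  text: string;
  embedding: number[]; // assumed to be unit-length
}

function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Returns the k chunks whose embeddings are most similar to the question embedding.
function topK(question: number[], chunks: Chunk[], k: number): Chunk[] {
  return [...chunks]
    .sort((x, y) => dot(question, y.embedding) - dot(question, x.embedding))
    .slice(0, k);
}

// Toy data, not real model output.
const docs: Chunk[] = [
  { text: "Installing the CLI", embedding: [1, 0] },
  { text: "Filtering records", embedding: [0, 1] },
  { text: "CLI reference", embedding: [0.8, 0.6] },
];
const best = topK([1, 0], docs, 2).map((c) => c.text);
console.log(best); // ["Installing the CLI", "CLI reference"]
```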

Another possible strategy, based on the more classical keyword search, looks like this:

Ask ChatGPT to extract the keywords from the question, with a prompt like this:

Extract keywords for a search query from the text provided. 
Add synonyms for words where a more common one exists.

Use the provided keywords to run a free-text-search and pick the top results
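The first step of the keyword strategy could look roughly like the sketch below. It only builds the chat messages and post-processes a reply; sending the messages to the model is left out. Passing the instruction as a system message and expecting a comma- or newline-separated reply are both assumptions, and the helper names are hypothetical.

```typescript
// Role/content pairs, the usual shape of messages for chat-style completion APIs.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Hypothetical helper: wraps the user question in the keyword-extraction prompt.
function keywordExtractionMessages(question: string): ChatMessage[] {
  return [
    {
      role: "system",
      content:
        "Extract keywords for a search query from the text provided. " +
        "Add synonyms for words where a more common one exists.",
    },
    { role: "user", content: question },
  ];
}

// Hypothetical post-processing: turns a comma- or newline-separated reply
// into a single query string for the free-text search.
function keywordsToQuery(reply: string): string {
  return reply
    .split(/[,\n]/)
    .map((k) => k.trim())
    .filter((k) => k.length > 0)
    .join(" ");
}

console.log(keywordsToQuery("install, Xata CLI,\n setup")); // "install Xata CLI setup"
```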

Putting it in a single diagram, the two methods look like this:

We have tried both on our documentation and have noticed some pros and cons.

Let’s start by comparing a few results. Both were run against the same database, and both use the ChatGPT gpt-3.5-turbo model. As there is randomness involved, I ran each question 2-3 times and picked what looked to me like the best result.

Question: How do I install the Xata CLI?

Answer with vector search:

Answer with keyword search:

Verdict: Both versions provided the correct answer; however, the vector search one is a bit more complete. They both found the correct docs page for it, but I think our highlights-based heuristic selected a shorter chunk of text in the case of the keyword strategy. Winner: vector search.

Score: 1-0

Question: How do you use Xata with Deno?

Answer with vector search:

Answer with keyword search:

Verdict: Disappointing result for vector search, which somehow missed the dedicated Deno page in our docs. It did find some other Deno-relevant content, but not the page that contained the very useful example. Winner: keyword search.

Score: 1-1

Question: How can I import a CSV file with custom column types?

With vector search:

With keyword search:

Verdict: Both found the right page (“Import a CSV file”), but the keyword search version managed to get a more complete answer. I did run this multiple times to make sure it’s not a fluke. I think the difference comes from how the text fragment is selected (neighbouring the keywords in the case of keyword search, from the beginning of the page in the case of vector search). Winner: keyword search.

Score: 1-2

Question: How can I filter a table named Users by the email column?

With vector search:

With keyword search:

Verdict: The vector search did better on this one, because it found the “Filtering” page on which there were more examples that ChatGPT could use to compose the answer. The keyword search answer is subtly broken, because it uses “query” instead of “filter” for the method name. Winner: vector search.

Score: 2-2

Question: What is Xata?

With vector search:

With keyword search:

Verdict: This one is a draw, because both answers are quite good. The two picked different pages to summarize in an answer, but both did a good job and I can’t pick a winner.

Score: 3-3

Configuration and tuning

This is a sample Xata request used for keyword search:

// POST https://workspace-id.eu-west-1.xata.sh/db/docs:main/tables/search/ask
{
    "question": "What is Xata?",
    "rules": [
        "Do not answer questions about pricing or the free tier. Respond that Xata has several options available, please check https://xata.io/pricing for more information.",
        "If the user asks a how-to question, provide a code snippet in the language they asked for with TypeScript as the default.",
        "Only answer questions that are relating to the defined context or are general technical questions. If asked about a question outside of the context, you can respond with \"It doesn't look like I have enough information to answer that. Check the documentation or contact support.\"",
        "Results should be relevant to the context provided and match what is expected for a cloud database.",
        "If the question doesn't appear to be answerable from the context provided, but seems to be a question about TypeScript, Javascript, or REST APIs, you may answer from outside of the provided context.",
        "If you answer with Markdown snippets, prefer the GitHub flavour.",
        "Your name is DanGPT"
    ],
    "searchType": "keyword",
    "search": {
        "fuzziness": 1,
        "target": [
            "slug",
            {
                "column": "title",
                "weight": 4
            },
            "content",
            "section",
            {
                "column": "keywords",
                "weight": 4
            }
        ],
        "boosters": [
            {
                "valueBooster": {
                    "column": "section",
                    "value": "guide",
                    "factor": 18
                }
            }
        ]
    }
}

And this is what we use for vector search:

// POST https://workspace-id.eu-west-1.xata.sh/db/docs:main/tables/search/ask
{
    "question": "How do I get a record by id?",
    "rules": [
        "Do not answer questions about pricing or the free tier. Respond that Xata has several options available, please check https://xata.io/pricing for more information.",
        "If the user asks a how-to question, provide a code snippet in the language they asked for with TypeScript as the default.",
        "Only answer questions that are relating to the defined context or are general technical questions. If asked about a question outside of the context, you can respond with \"It doesn't look like I have enough information to answer that. Check the documentation or contact support.\"",
        "Results should be relevant to the context provided and match what is expected for a cloud database.",
        "If the question doesn't appear to be answerable from the context provided, but seems to be a question about TypeScript, Javascript, or REST APIs, you may answer from outside of the provided context.",
        "Your name is DanGPT"
    ],
    "searchType": "vector",
    "vectorSearch": {
        "column": "embeddings",
        "contentColumn": "content",
        "filter": {
            "section": "guide"
        }
    }
}

As you can see, the keyword search version has more settings, configuring fuzziness and boosters and column weights. The vector search only uses a filter. I would call this a plus for keyword search: you have more dials to tune the search and therefore get better answers. But it’s also more work, and the results from vector search are quite good without this tuning.
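For illustration, a request like the vector search one above could be assembled from TypeScript roughly as follows. The endpoint URL and body fields are taken from the samples above; the Bearer authorization header, the placeholder API key, and the helper and type names are assumptions, so check the Xata API documentation before relying on them.

```typescript
// Body shape for the vector search variant, mirroring the sample request above.
interface VectorAskBody {
  question: string;
  rules: string[];
  searchType: "vector";
  vectorSearch: {
    column: string;
    contentColumn: string;
    filter?: Record<string, unknown>;
  };
}

// Hypothetical helper: builds the arguments for fetch() without sending anything.
function buildAskRequest(endpoint: string, apiKey: string, body: VectorAskBody) {
  return {
    url: endpoint,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // assumed authentication scheme
      },
      body: JSON.stringify(body),
    },
  };
}

const askRequest = buildAskRequest(
  "https://workspace-id.eu-west-1.xata.sh/db/docs:main/tables/search/ask",
  "xau_example", // placeholder, not a real key
  {
    question: "How do I get a record by id?",
    rules: ["Your name is DanGPT"],
    searchType: "vector",
    vectorSearch: {
      column: "embeddings",
      contentColumn: "content",
      filter: { section: "guide" },
    },
  },
);
// To send it: const resp = await fetch(askRequest.url, askRequest.init);
console.log(askRequest.init.method, askRequest.url);
```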

In our case, we had already tuned the keyword search for our, well, docs search functionality. So it wasn’t necessarily extra work, and while playing with ChatGPT we discovered improvements to our docs and search as well. Also, Xata just happens to have a very nice UI for tuning your keyword search, so the work wasn’t hard to begin with (planning a separate blog post about that).

There is no reason why vector search couldn’t also have boosters, column weights, and the like, but we don’t have them yet in Xata, and I don’t know of any other solution that makes that as easy as we make keyword search tuning. And, in general, there is more prior art for keyword search, but it is quite possible that vector search will catch up.

For now, I’m going to call keyword search a winner on this one.

Score: 3-4

Convenience

Our documentation already had a search function, dog-fooding Xata, so that was quite simple to extend to a chat bot. Xata now also supports vector search natively, but using it required adding embeddings for all the documentation pages and figuring out a good chunking strategy. We have used the OpenAI embeddings API for producing the text embeddings, which had a minimal cost. Winner: Keyword search

Score: 3-5

Latency

The keyword search approach needs an extra round-trip to the ChatGPT API. This adds latency before the result starts streaming in the UI. By my measurements, it adds around 1.8s.

With vector search:

With keyword search:

Note: The total and the content download times here are not relevant, because they mostly depend on how long the generated response is. Look at the “Waiting for server response” bar (the green one) to compare.

Winner: Vector search

Score: 4-5

Cost

The keyword search version needs to make an extra call to the ChatGPT API. On the other hand, the vector search version needs to produce embeddings for all the documents in the database, plus the question. Unless we’re talking about a lot of documents, I’m going to call this a tie.

Score: 5-6

Conclusion

The score is tight! In our case, we have gone with keyword search for now, mostly because we have more ways of tuning it and, as a result, it generates slightly better answers for our set of test questions. Also, any improvements that we make to search automatically benefit both the search and the chat use cases. As we’re improving our vector search capabilities with more tuning options, we might switch to vector search, or a hybrid approach, in the future.

If you’d like to set up a similar chat bot for your own documentation, or any kind of knowledge base, you can easily implement the above using the Xata ask endpoint. Create an account for free and join us on Discord. I’d be happy to personally help you get it up and running!