<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tudor Golubenco</title>
    <description>The latest articles on DEV Community by Tudor Golubenco (@tsg).</description>
    <link>https://dev.to/tsg</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F878205%2F93b402a6-03ef-4cd6-a9fd-0b264ad4c596.jpeg</url>
      <title>DEV Community: Tudor Golubenco</title>
      <link>https://dev.to/tsg</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tsg"/>
    <language>en</language>
    <item>
      <title>Create a search engine with PostgreSQL: Postgres vs Elasticsearch</title>
      <dc:creator>Tudor Golubenco</dc:creator>
      <pubDate>Mon, 31 Jul 2023 10:49:03 +0000</pubDate>
      <link>https://dev.to/xata/create-a-search-engine-with-postgresql-postgres-vs-elasticsearch-495k</link>
      <guid>https://dev.to/xata/create-a-search-engine-with-postgresql-postgres-vs-elasticsearch-495k</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/xata/create-an-advanced-search-engine-with-postgresql-23e5"&gt;Part 1&lt;/a&gt;, we delved into the capabilities of PostgreSQL's full-text search and explored how advanced search features such as relevancy boosters, typo-tolerance, and faceted search can be implemented. In this part, we'll compare it with Elasticsearch.&lt;/p&gt;

&lt;p&gt;First, let's note that Postgres and Elasticsearch are generally not in competition with each other. In fact, it's very common to see them together in architecture diagrams, often in a configuration like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34fj1oodv2q69eqb1r9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34fj1oodv2q69eqb1r9j.png" alt="Typical architecture with PostgreSQL and Elasticsearch" width="646" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this architecture, the source of truth for the data lives in Postgres, which serves the transactional CRUD operations. The data is continuously synced to Elasticsearch, either via something like Postgres logical replication events (change-data-capture) or by the application itself via custom code. During this data replication, denormalization might be required. The search functionality, including facets and aggregations, is served from Elasticsearch.&lt;/p&gt;

&lt;p&gt;While this architecture is as common as it is for very good reasons, it does have a few challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Dealing with two types of stores means more operational burden and higher infrastructure costs.&lt;/li&gt;
&lt;li&gt;Keeping the data in sync is more challenging than you might think. I'm planning a dedicated blog for this problem because it's quite interesting. Let's just say, it's pretty hard to get it completely right.&lt;/li&gt;
&lt;li&gt;The data replication is, at best, &lt;strong&gt;near&lt;/strong&gt; real-time, meaning that there can be consistency issues in the search service.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Point 2 is generally solvable via engineering effort and careful dedicated code. From the existing tools, &lt;a href="https://pgsync.com/" rel="noopener noreferrer"&gt;PGSync&lt;/a&gt; is an open source project that aims to specifically solve this problem. &lt;a href="https://github.com/zombodb/zombodb" rel="noopener noreferrer"&gt;ZomboDB&lt;/a&gt; is an interesting Postgres extension that tackles point 2 (and I think partially point 3), by controlling and querying Elasticsearch &lt;em&gt;through&lt;/em&gt; Postgres. I haven't yet tried either of these two projects, so I can't comment on their trade-offs, but I wanted to mention them.&lt;/p&gt;

&lt;p&gt;And yes, a data platform like &lt;a href="https://xata.io" rel="noopener noreferrer"&gt;Xata&lt;/a&gt; solves most of points 1 and 2, by taking that complexity and offering it as a service, together with other goodies.&lt;/p&gt;

&lt;p&gt;That said, if the Postgres full-text search functionality is enough for your use case, making use of it promises to significantly simplify your architecture and application. In this version, Postgres serves both the CRUD app needs and the full-text search needs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzcfiebf8p7fervkpeu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzcfiebf8p7fervkpeu7.png" alt="Postgres-only search architecture" width="407" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means you don't need to operate two types of stores, no more data replication, no more denormalization, no more eventual consistency. The search engine built into Postgres happens to support ACID transactions, joins between tables, constraints (e.g. not null or unique), referential integrity (foreign keys), and all the other Postgres goodies that make application development simpler.&lt;/p&gt;

&lt;p&gt;Therefore, it's no wonder the &lt;a href="https://news.ycombinator.com/item?id=36699016" rel="noopener noreferrer"&gt;Hacker News thread&lt;/a&gt; for our &lt;a href="https://xata.io/blog/postgres-full-text-search-engine" rel="noopener noreferrer"&gt;part 1 blog post&lt;/a&gt; had a lively discussion about the pros and cons of this approach. Can we go for the Postgres-only solution, or does the best-tool-for-the-job argument wins?&lt;/p&gt;

&lt;p&gt;We're going to compare the convenience, search relevancy, performance, and scalability of the two options.&lt;/p&gt;

&lt;h2&gt;
  
  
  DIY versus built-in
&lt;/h2&gt;

&lt;p&gt;As we showed in part 1, you can replicate a lot of the Elasticsearch functionality in Postgres, even more advanced things like relevancy boosters, typo-tolerance, suggesters/autocomplete, or semantic/vector search. However, it's not always straight forward.&lt;/p&gt;

&lt;p&gt;An example where it's not too simple is with typo-tolerance (called fuzziness in Elasticsearch). It's not available out-of-the-box in Postgres, but you can &lt;a href="https://xata.io/blog/postgres-full-text-search-engine#typo-tolerance-fuzzy-search" rel="noopener noreferrer"&gt;implement it with the following steps&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;index all &lt;em&gt;lexemes (words)&lt;/em&gt; from all documents in a separate table&lt;/li&gt;
&lt;li&gt;for each word in the query, use similarity or Levenshtein distance to search in this table&lt;/li&gt;
&lt;li&gt;modify the search query to include any words that are found&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While the above is quite doable, in dedicated search engines like Elasticsearch, you can enable typo-tolerance with a simple flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/recipes/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"multi_match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"biscaits"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"fuzziness"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Search relevancy: BM25 and TF-IDF
&lt;/h2&gt;

&lt;p&gt;The default ranking algorithm for keyword search in Elasticsearch is BM25. With the release of Elasticsearch 5.0 in 2016, it dethroned TF-IDF as the default ranking algorithm. Postgres doesn't support either of them, mainly because its ranking functions (explained in &lt;a href="https://xata.io/blog/postgres-full-text-search-engine#tsrank" rel="noopener noreferrer"&gt;here&lt;/a&gt;) don't have access to global word frequency data which is needed by these algorithms. To see how &lt;em&gt;relevant (pun intended) or&lt;/em&gt; not so relevant that might be, let's look at the ranking functions and algorithms from simple to complex:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ts_rank&lt;/code&gt; (Postgres function) - ranks based on the term frequency. In other words, it does the “TF” (term frequency) part of TF-IDF. The principle is that if you are searching for a word, the more often that word shows up in the matching document, the higher the score. In addition to using simple TF, Postgres provides ways to normalize the term frequency into a score. For instance, &lt;a href="https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING" rel="noopener noreferrer"&gt;one approach&lt;/a&gt; is to divide it by the document length.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ts_rank_cd&lt;/code&gt; (Postgres function) - rank + cover density. In addition to the term frequency, this function also takes into account the “cover density”, meaning the proximity of the terms in the document.&lt;/li&gt;
&lt;li&gt;TF-IDF - term frequency + inverse document frequency. In addition to the term frequency, this algorithm “penalizes” words that are very common in the overall data set. So if the word “egg” matches, but that word is super common because we have a recipes dataset, it is valued less compared to other words in the query.&lt;/li&gt;
&lt;li&gt;BM25 - this algorithm is based on a probabilistic model of relevancy. While the &lt;a href="https://lucene.apache.org/core/8_8_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html" rel="noopener noreferrer"&gt;TF-IDF formula&lt;/a&gt; is mostly based on intuition and practical experiments, BM25 is the result of more &lt;a href="https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf" rel="noopener noreferrer"&gt;formal mathematical research&lt;/a&gt;. If you're curious about the said mathematical research, I recommend this &lt;a href="https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25" rel="noopener noreferrer"&gt;talk&lt;/a&gt; that makes it accessible. Interestingly, the resulting BM25 formula is not all that different from TF-IDF but it incorporates a couple more concepts: the frequency saturation and the document length. Ultimately, this gives better results over a wider range of document types.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's no question that BM25 is a more advanced relevancy algorithm than what &lt;code&gt;ts_rank&lt;/code&gt; or &lt;code&gt;ts_rank_cd&lt;/code&gt; use. BM25 uses more input signals, it's based on better heuristics, and it typically doesn't require tuning.&lt;/p&gt;

&lt;p&gt;One practical effect of BM25 is that it automatically penalizes the very common words (”the”, “in”, “or”, etc.), also called “stop words”, which means that they don't need to be excluded from the index. This is why the Postgres &lt;code&gt;english&lt;/code&gt; configuration for &lt;code&gt;to_tsvector&lt;/code&gt; removes the stop words (details &lt;a href="https://xata.io/blog/postgres-full-text-search-engine#tsvector" rel="noopener noreferrer"&gt;here in part 1&lt;/a&gt;), but the Elasticsearch &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html" rel="noopener noreferrer"&gt;standard analyzer&lt;/a&gt; doesn't. It doesn't need to.&lt;/p&gt;

&lt;p&gt;While BM25 is superior, there are some pro-Postgres arguments to be considered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if you aggressively exclude the stop words, like the &lt;code&gt;english&lt;/code&gt; configuration in Postgres does, that compensates for the lack of IDF in some cases.&lt;/li&gt;
&lt;li&gt;in practice there might be stronger signals of relevancy in the data itself (upvotes, reviews, etc). See the &lt;a href="https://xata.io/blog/postgres-full-text-search-engine#numeric-date-and-exact-value-boosters" rel="noopener noreferrer"&gt;section on boosters from part 1&lt;/a&gt; on how to make use of them in Postgres.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Could BM25 or TF-IDF be implemented on top of the existing Postgres functionality? Actually, yes. See this &lt;a href="https://www.alibabacloud.com/blog/keyword-analysis-with-postgresql-cosine-and-linear-correlation-algorithms-for-text-analysis_595793" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; that uses &lt;code&gt;ts_stats&lt;/code&gt; and &lt;code&gt;ts_debug&lt;/code&gt; to compute TF-IDF. It's not very simple, but possible (as usual with Postgres).&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance and scalability considerations
&lt;/h2&gt;

&lt;p&gt;Let's start by noting that the two systems couldn't be more different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL has a single master and multiple read replicas, Elasticsearch has horizontal scalability via sharding.&lt;/li&gt;
&lt;li&gt;Postgres is relational, supports joining tables, has ACID transactions, and offers constraints, while Elasticsearch is document oriented and offers consistency guarantees only per document.&lt;/li&gt;
&lt;li&gt;Postgres is row oriented, while Elasticsearch has an internal column store in the form of &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/doc-values.html" rel="noopener noreferrer"&gt;doc values&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Postgres is native C code, while Elasticsearch runs on the JVM.&lt;/li&gt;
&lt;li&gt;Postgres has a connection-oriented wire protocol; Elasticsearch has a REST-like DSL over HTTP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these impact performance and scalability, and it's no surprise then that the two tend to shine in different areas: PostgreSQL is commonly used as a primary data store, whereas Elasticsearch is usually utilized as a secondary store, particularly for search and analytics on time-series data such as logs. And yet, they do overlap on the use case of full-text search, which is the point of this blog post.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fip8qp0a7kzk6ogwzlfe3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fip8qp0a7kzk6ogwzlfe3.png" alt="Postgres and Elasticsearch use cases" width="628" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was curious to know at roughly what amount of data Postgres slows down compared to Elasticsearch. On the movies dataset (34K rows) that we used in part 1, all queries were reasonably fast (&amp;lt;300 ms). So for the testing here, I chose a larger data set: a recipes dataset from Kaggle, containing &lt;a href="https://www.kaggle.com/datasets/wilmerarltstrmberg/recipe-dataset-over-2m?resource=download" rel="noopener noreferrer"&gt;2.3M recipes&lt;/a&gt;. The commands to load the CSV file in PostgreSQL can be found in this &lt;a href="https://gist.github.com/tsg/bc64fcea28fca844db11c65b9b1cb4ca" rel="noopener noreferrer"&gt;gist&lt;/a&gt;. For Elasticsearch, I've loaded the same CSV file using this &lt;a href="https://github.com/ovidiugiorgi/csv2opensearch" rel="noopener noreferrer"&gt;tool&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After loading the data, I started by running searches similar to the ones used in part 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'darth vader'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
   &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;recipes&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'darth vader'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;title&lt;/span&gt;         &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="n"&gt;rank&lt;/span&gt;
&lt;span class="c1"&gt;----------------------+------------&lt;/span&gt;
 &lt;span class="n"&gt;Darth&lt;/span&gt; &lt;span class="n"&gt;Vader&lt;/span&gt; &lt;span class="n"&gt;Biscuits&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09910322&lt;/span&gt;
 &lt;span class="n"&gt;Cloud&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt; &lt;span class="n"&gt;Pancakes&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09910322&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;468&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Elasticsearch I've used the following to run the search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/recipes/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query_string"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"darth AND vader"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ran each query five times and recorded the best and worst times. Typically, the first query of a kind was the slowest because the following queries benefited from having the relevant pages already in memory. While this approach is rather unscientific, and you should conduct your own benchmarking on your data before drawing definitive conclusions, it should be sufficient for drawing some initial conclusions.&lt;/p&gt;

&lt;p&gt;Here are the results on a few queries:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;query&lt;/th&gt;
&lt;th&gt;Elasticsearch worst time (ms)&lt;/th&gt;
&lt;th&gt;Elasticsearch best time (ms)&lt;/th&gt;
&lt;th&gt;Postgres worst time (ms)&lt;/th&gt;
&lt;th&gt;Postgres best time (ms)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;darth vader&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chicken nuggets&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;313&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pancake&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;618&lt;/td&gt;
&lt;td&gt;157&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;curacao&lt;/td&gt;
&lt;td&gt;286&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;230&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mix&lt;/td&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;25182&lt;/td&gt;
&lt;td&gt;8267&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As you can see, Postgres performs well on some queries such as "darth vader" or "curacao," responding within milliseconds. However, on other queries like "pancake" or "mix," it performs significantly worse than Elasticsearch, with response times measured in seconds. It gets as bad as 25 seconds latency! What's going on here?&lt;/p&gt;

&lt;p&gt;The difference lies in how many rows match the query terms. Searching for “darth vader” in a recipes dataset matches 2 rows. But searching “mix” in a recipes dataset matches a million rows (literally, 1,038,914 to be precise). Since we order by rank, Postgres needs to call the &lt;code&gt;ts_rank&lt;/code&gt; function for each of the million rows. The Postgres docs even &lt;a href="https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING" rel="noopener noreferrer"&gt;warn about this&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Ranking can be expensive since it requires consulting the &lt;code&gt;tsvector&lt;/code&gt; of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Indeed, the issue is from ranking. If we're only interested in matching and we order by an (indexed) column, it is fast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;recipes&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'mix'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;681&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But we're working on the assumption that ranking is necessary for a good search experience. One idea is to use what I call "sampling": before computing the ranks, take a sample of 10K rows that match. The assumption is that if your query matches so many documents, the ranking is likely to be ineffective anyway, so it's better to prioritize the response time.&lt;/p&gt;

&lt;p&gt;The SQL to do this looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;search_sample&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;recipes&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'mix'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'mix'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;search_sample&lt;/span&gt;
  &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Re-running the tests with this sample approach gives us closer results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;query&lt;/th&gt;
&lt;th&gt;Elasticsearch worst time (ms)&lt;/th&gt;
&lt;th&gt;Elasticsearch best time (ms)&lt;/th&gt;
&lt;th&gt;Postgres worst time (ms)&lt;/th&gt;
&lt;th&gt;Postgres best time (ms)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;darth vader&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chicken nuggets&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;195&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pancake&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;145&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;curacao&lt;/td&gt;
&lt;td&gt;286&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;225&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mix&lt;/td&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;td&gt;144&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Much better! Of course, we did sacrifice on the relevancy, which might or might not be ok in your case.&lt;/p&gt;

&lt;p&gt;Here are some conclusions and more considerations on the topic of performance and scalability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;on search use cases over smaller datasets (&amp;lt;100K rows) both systems will perform well, but a Postgres-only solution will require less resources.&lt;/li&gt;
&lt;li&gt;on medium datasets (a few million rows), Elasticsearch is already faster, however, Postgres can perform within a 200ms latency budged if you use the sampling trick explained above.&lt;/li&gt;
&lt;li&gt;when the number of documents is really large, for example logs or other time series, Elasticsearch has the additional advantage of horizontal scalability.&lt;/li&gt;
&lt;li&gt;if you need a lot of aggregations or analytics (e.g. display a dashboard full of graphs) and the data set is large enough, Elasticsearch's columnar store will give it an advantage.&lt;/li&gt;
&lt;li&gt;giving Postgres an extra workload can affect the performance of your main instance. The solution is to move the searching to a replica, but then you lose some of the consistency guarantees.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Semantic and hybrid search
&lt;/h2&gt;

&lt;p&gt;Both part 1 and this blog post focused on keyword searching techniques. However, in the last few years, semantic/vector search has taken the world of search by storm, so I feel like I need to touch on this aspect as well in comparing the two.&lt;/p&gt;

&lt;p&gt;Semantic search leverages language models to generate embeddings for each document. Embeddings are arrays of numbers that represent the text on a number of dimensions. Pieces of text that have similar embeddings have a similar &lt;em&gt;meaning&lt;/em&gt;. In other words, semantic search can “search by meaning”, rather than “by keywords”. This is quite exciting now, because large language models (LLMs) give us very accurate understanding of meaning. It means you don't have to maintain list of synonyms or add different keywords to your documents to match how your users are searching.&lt;/p&gt;

&lt;p&gt;Postgres supports vector search via the &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;pgvector&lt;/a&gt; extension, while Elasticsearch has it built-in via the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html" rel="noopener noreferrer"&gt;KNN search&lt;/a&gt;. You can find benchmarks on &lt;a href="https://ann-benchmarks.com/index.html" rel="noopener noreferrer"&gt;ann-benchmarks&lt;/a&gt; (look for &lt;code&gt;pgvector&lt;/code&gt; and &lt;code&gt;luceneknn&lt;/code&gt;) but keep in mind that both implementations are under active development and their performance is being improved.&lt;/p&gt;

&lt;p&gt;While exciting, it turns out that semantic search alone doesn't really work great on the typical search experiences that we have today - at least not on a majority of datasets. If you are curious, I recently wrote a &lt;a href="https://xata.io/blog/keyword-vs-semantic-search-chatgpt" rel="noopener noreferrer"&gt;comparison between keyword and semantic search&lt;/a&gt; for the particular use case of selecting the context for ChatGPT.&lt;/p&gt;

&lt;p&gt;For search use cases like the recipes one in this blog post, hybrid search might give better results: use a combination of keyword and semantic search to improve the ranking.&lt;/p&gt;

&lt;p&gt;Elastic has recently announced their “&lt;a href="https://www.elastic.co/blog/may-2023-launch-announcement" rel="noopener noreferrer"&gt;Elasticsearch Relevance Engine&lt;/a&gt;”, which includes hybrid search. In Postgres, given that it's all building blocks, you can combine the full-text search functionality and pgvector. I'm looking forward to diving deeper into this topic as well, but I'll leave that for a follow-up blog post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing between a Postgres-only architecture and a Postgres + Elasticsearch architecture will depend on your use case and scale.&lt;/p&gt;

&lt;p&gt;For example, if you have a table or list in your application on which you support CRUD operations and you want to add full-text search functionality to it, Postgres will likely work well for you for quite some time.&lt;/p&gt;

&lt;p&gt;On the other hand, if you have a large data set search and search relevancy is critical to your application (for example, in e-commerce), using a dedicated search engine like Elasticsearch is going to perform better both in latency and relevancy.&lt;/p&gt;

&lt;p&gt;In many cases, it might make sense to start with the simpler Postgres-only approach, but be ready to pivot to the Postgres + Elasticsearch architecture when needed.&lt;/p&gt;

&lt;p&gt;If you read this far, you might want to give &lt;a href="https://xata.io/" rel="noopener noreferrer"&gt;Xata a try&lt;/a&gt;. It offers both Postgres and Elasticsearch in the same data platform, and can also handle the syncing between them with no extra effort. If you have any feedback on this blog post, or are interested in the follow-up blog posts, you can follow us on &lt;a href="https://twitter.com/xata" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or join us in &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>elasticsearch</category>
      <category>webdev</category>
      <category>postgressql</category>
    </item>
    <item>
      <title>Create an advanced search engine with PostgreSQL</title>
      <dc:creator>Tudor Golubenco</dc:creator>
      <pubDate>Mon, 24 Jul 2023 10:45:24 +0000</pubDate>
      <link>https://dev.to/xata/create-an-advanced-search-engine-with-postgresql-23e5</link>
      <guid>https://dev.to/xata/create-an-advanced-search-engine-with-postgresql-23e5</guid>
      <description>&lt;p&gt;This is part 1 of a blog mini-series, in which we explore the full-text search functionality in PostgreSQL and investigate how much of the typical search engine functionality we can replicate. In part 2, we'll do a comparison between PostgreSQL's full-text search and Elasticsearch.&lt;/p&gt;

&lt;p&gt;If you want to follow along and try out the sample queries (which we recommend; it's more fun that way), the code samples are executed against the &lt;a href="https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots" rel="noopener noreferrer"&gt;Wikipedia Movie Plots&lt;/a&gt; data set from Kaggle. To import it, download the CSV file, then create this table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ReleaseYear&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Title&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Origin&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Director&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Casting&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Genre&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;WikiPage&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Plot&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And import the CSV file like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ReleaseYear&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Origin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Director&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Casting&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Genre&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WikiPage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Plot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'wiki_movie_plots_deduped.csv'&lt;/span&gt; &lt;span class="k"&gt;DELIMITER&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt; &lt;span class="n"&gt;CSV&lt;/span&gt; &lt;span class="n"&gt;HEADER&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dataset contains 34,000 movie titles and is about 81 MB in CSV format.&lt;/p&gt;

&lt;h2&gt;
  
  
  PostgreSQL full-text search primitives
&lt;/h2&gt;

&lt;p&gt;The Postgres approach to full-text search offers building blocks that you can combine to create your own search engine. This is quite flexible but it also means it generally feels lower-level compared to search engines like Elasticsearch, Typesense, or Mellisearch, for which full-text search is the primary use case.&lt;/p&gt;

&lt;p&gt;The main building blocks, which we'll cover via examples, are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;tsvector&lt;/code&gt; and &lt;code&gt;tsquery&lt;/code&gt; data types&lt;/li&gt;
&lt;li&gt;The match operator &lt;code&gt;@@&lt;/code&gt; to check if a &lt;code&gt;tsquery&lt;/code&gt; matches a &lt;code&gt;tsvector&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Functions to rank each match (&lt;code&gt;ts_rank&lt;/code&gt;, &lt;code&gt;ts_rank_cd&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The GIN index type, an inverted index to efficiently query &lt;code&gt;tsvector&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll start by looking at these building blocks and then we'll get into more advanced topics, covering relevancy boosters, typo-tolerance, and faceted search.&lt;/p&gt;

&lt;h3&gt;
  
  
  tsvector
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;tsvector&lt;/code&gt; data type stores a sorted list of &lt;em&gt;lexemes&lt;/em&gt;. A &lt;em&gt;lexeme&lt;/em&gt; is a string, just like a token, but it has been &lt;em&gt;normalized&lt;/em&gt; so that different forms of the same word are made. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as &lt;code&gt;s&lt;/code&gt; or &lt;code&gt;ing&lt;/code&gt; in English). Here is an example, using the &lt;code&gt;to_tsvector&lt;/code&gt; function to parse an English phrase into a &lt;code&gt;tsvector&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;unnest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'I&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;m going to make him an offer he can&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;t refuse. Refusing is not an option.'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

 &lt;span class="n"&gt;lexeme&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;positions&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;
&lt;span class="c1"&gt;--------+-----------+---------&lt;/span&gt;
 &lt;span class="k"&gt;go&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;m&lt;/span&gt;      &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;make&lt;/span&gt;   &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;offer&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="k"&gt;option&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;      &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;refus&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;   &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, stop words like “I”, “to” or “an” are removed, because they are too common to be useful for search. The words are normalized and reduced to their root (e.g. “refuse” and “Refusing” are both transformed into “refus”). The punctuation signs are ignored. For each word, the &lt;strong&gt;positions&lt;/strong&gt; in the original phrase are recorded (e.g. “refus” is the 12th and the 13th word in the text) and the &lt;strong&gt;weights&lt;/strong&gt; (which are useful for ranking and we'll discuss later).&lt;/p&gt;

&lt;p&gt;In the example above, the transformation rules from words to &lt;em&gt;lexemes&lt;/em&gt; are based on the &lt;code&gt;english&lt;/code&gt; search configuration. Running the same query with the &lt;code&gt;simple&lt;/code&gt; search configuration results in a &lt;code&gt;tsvector&lt;/code&gt; that includes all the words as they were found in the text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;unnest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'simple'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'I&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;m going to make him an offer he can&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;t refuse. Refusing is not an option.'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="n"&gt;lexeme&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;positions&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;
&lt;span class="c1"&gt;----------+-----------+---------&lt;/span&gt;
 &lt;span class="n"&gt;an&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;can&lt;/span&gt;      &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;      &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;going&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;he&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;him&lt;/span&gt;      &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;i&lt;/span&gt;        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="k"&gt;is&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;      &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;m&lt;/span&gt;        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;make&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="k"&gt;not&lt;/span&gt;      &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;      &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;offer&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="k"&gt;option&lt;/span&gt;   &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;      &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;refuse&lt;/span&gt;   &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;      &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;refusing&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;      &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;t&lt;/span&gt;        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;      &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="k"&gt;to&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, “refuse” and “refusing” now result in different lexemes. The &lt;code&gt;simple&lt;/code&gt; configuration is particularly useful when you have columns that contain labels or tags.&lt;/p&gt;

&lt;p&gt;PostgreSQL has built-in configurations for a pretty good set of languages. You can see the list by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;cfgname&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_ts_config&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notably, however, there is no configuration for CJK (Chinese-Japanese-Korean), which is worth keeping in mind if you need to create a search query in those languages. While the &lt;code&gt;simple&lt;/code&gt; configuration should work in practice quite well for unsupported languages, I'm not sure if that is enough for CJK.&lt;/p&gt;

&lt;h3&gt;
  
  
  tsquery
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;tsquery&lt;/code&gt; data type is used to represent a normalized query. A &lt;code&gt;tsquery&lt;/code&gt; contains search terms, which must be already-normalized lexemes, and may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators. There are functions like &lt;code&gt;to_tsquery&lt;/code&gt;, &lt;code&gt;plainto_tsquery&lt;/code&gt;, and &lt;code&gt;websearch_to_tsquery&lt;/code&gt; that are helpful in converting user-written text into a proper &lt;code&gt;tsquery&lt;/code&gt;, primarily by normalizing words appearing in the text.&lt;/p&gt;

&lt;p&gt;To get a feeling of &lt;code&gt;tsquery&lt;/code&gt;, let's see a few examples using &lt;code&gt;websearch_to_tsquery&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'the darth vader'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;
&lt;span class="c1"&gt;----------------------&lt;/span&gt;
&lt;span class="s1"&gt;'darth'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="s1"&gt;'vader'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is a logical AND, meaning that the document needs to contain both "darth" and "vader" in order to match. You can do logical OR as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'darth OR vader'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;
&lt;span class="c1"&gt;----------------------&lt;/span&gt;
 &lt;span class="s1"&gt;'darth'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="s1"&gt;'vader'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And you can exclude words:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'darth vader -wars'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;
&lt;span class="c1"&gt;---------------------------&lt;/span&gt;
 &lt;span class="s1"&gt;'darth'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="s1"&gt;'vader'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="s1"&gt;'war'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, you can represent phrase searches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'"the darth vader son"'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
     &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;
&lt;span class="c1"&gt;------------------------------&lt;/span&gt;
 &lt;span class="s1"&gt;'darth'&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'vader'&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'son'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means: “darth”, followed by “vader”, followed by “son”.&lt;/p&gt;

&lt;p&gt;Note, however, that the “the” word is ignored, because it's a stop word as per the &lt;code&gt;english&lt;/code&gt; search configuration. This can be an issue on phrases like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'"do or do not, there is no try"'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;
&lt;span class="c1"&gt;----------------------&lt;/span&gt;
 &lt;span class="s1"&gt;'tri'&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Oops, almost the entire phrase was lost. Using the &lt;code&gt;simple&lt;/code&gt; config gives the expected result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'simple'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'"do or do not, there is no try"'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                           &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;
&lt;span class="c1"&gt;--------------------------------------------------------------------------&lt;/span&gt;
 &lt;span class="s1"&gt;'do'&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'or'&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'do'&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'not'&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'there'&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'is'&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'no'&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'try'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check whether a &lt;code&gt;tsquery&lt;/code&gt; matches a &lt;code&gt;tsvector&lt;/code&gt; by using the match operator &lt;code&gt;@@&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'darth vader'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt;
    &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;'Darth Vader is my father.'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="k"&gt;column&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;
&lt;span class="c1"&gt;----------&lt;/span&gt;
 &lt;span class="n"&gt;t&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While the following example doesn't match:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'darth vader -father'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt;
    &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;'Darth Vader is my father.'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="k"&gt;column&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;
&lt;span class="c1"&gt;----------&lt;/span&gt;
 &lt;span class="n"&gt;f&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  GIN
&lt;/h3&gt;

&lt;p&gt;Now that we've seen &lt;code&gt;tsvector&lt;/code&gt; and &lt;code&gt;tsquery&lt;/code&gt; at work, let's look at another key building block: the GIN index type is what makes it fast. GIN stands for &lt;em&gt;Generalized Inverted Index.&lt;/em&gt; GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items. This means that GIN can be used for more than just text search, notably for JSON querying.&lt;/p&gt;

&lt;p&gt;You can create a GIN index on a set of columns, or you can first create a column of type &lt;code&gt;tsvector&lt;/code&gt;, to include all the searchable columns. Something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="n"&gt;tsvector&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
   &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Plot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
   &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'simple'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Director&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
     &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'simple'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Genre&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
   &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'simple'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Origin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
   &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'simple'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Casting&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;STORED&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then create the actual index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_search&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;GIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now perform a simple test search like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'darth vader'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

                        &lt;span class="n"&gt;title&lt;/span&gt;
&lt;span class="c1"&gt;--------------------------------------------------&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;IV&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="k"&gt;New&lt;/span&gt; &lt;span class="n"&gt;Hope&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aka&lt;/span&gt; &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;Return&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Jedi&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;III&lt;/span&gt; &lt;span class="err"&gt;–&lt;/span&gt; &lt;span class="n"&gt;Revenge&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Sith&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To see the effects of the index, you can compare the timings of the above query with and without the index. The GIN index takes it from over 200 ms to about 4 ms on my computer.&lt;/p&gt;

&lt;h3&gt;
  
  
  ts_rank
&lt;/h3&gt;

&lt;p&gt;So far, we've seen how &lt;code&gt;ts_vector&lt;/code&gt; and &lt;code&gt;ts_query&lt;/code&gt; can match search queries. However, for a good search experience, it is important to show the best results first - meaning that the results need to be sorted by &lt;em&gt;relevancy&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Taking it directly from the &lt;a href="https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING" rel="noopener noreferrer"&gt;docs&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur. However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The two ranking functions mentioned are &lt;code&gt;ts_rank&lt;/code&gt; and &lt;code&gt;ts_rank_cd&lt;/code&gt;. The difference between them is that while they both take into account the frequency of the term, &lt;code&gt;ts_rank_cd&lt;/code&gt; also takes into account the proximity of matching lexemes to each other.&lt;/p&gt;

&lt;p&gt;To use them in a query, you can do something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ts_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'darth vader'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'darth vader'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
  &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

                      &lt;span class="n"&gt;title&lt;/span&gt;                       &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="n"&gt;rank&lt;/span&gt;
&lt;span class="c1"&gt;--------------------------------------------------+-------------&lt;/span&gt;
 &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Empire&lt;/span&gt; &lt;span class="n"&gt;Strikes&lt;/span&gt; &lt;span class="n"&gt;Back&lt;/span&gt;                          &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;26263964&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;IV&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="k"&gt;New&lt;/span&gt; &lt;span class="n"&gt;Hope&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aka&lt;/span&gt; &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;18902963&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;III&lt;/span&gt; &lt;span class="err"&gt;–&lt;/span&gt; &lt;span class="n"&gt;Revenge&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Sith&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;10292397&lt;/span&gt;
 &lt;span class="n"&gt;Rogue&lt;/span&gt; &lt;span class="n"&gt;One&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt; &lt;span class="n"&gt;Story&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;film&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;              &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;10049681&lt;/span&gt;
 &lt;span class="k"&gt;Return&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Jedi&lt;/span&gt;                               &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09910346&lt;/span&gt;
 &lt;span class="n"&gt;American&lt;/span&gt; &lt;span class="n"&gt;Honey&lt;/span&gt;                                   &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09910322&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing to note about &lt;code&gt;ts_rank&lt;/code&gt; is that it needs to access the &lt;code&gt;search&lt;/code&gt; column for each result. This means that if the &lt;code&gt;WHERE&lt;/code&gt; condition matches a lot of rows, PostgreSQL needs to visit them all in order to do the ranking, and that can be slow. To exemplify, the above query returns in 5-7 ms on my computer. If I modify the query to do search for &lt;code&gt;darth OR vader&lt;/code&gt;, it returns in about 80 ms, because there are now over 1000 matching result that need ranking and sorting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Relevancy tuning
&lt;/h2&gt;

&lt;p&gt;While relevancy based on word frequency is a good default for the search sorting, quite often the data contains important indicators that are more relevant than simply the frequency.&lt;/p&gt;

&lt;p&gt;Here are some examples for a movies dataset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Matches in the title should be given higher importance than matches in the description or plot.&lt;/li&gt;
&lt;li&gt;More popular movies can be promoted based on ratings and/or the number of votes they receive.&lt;/li&gt;
&lt;li&gt;Certain categories can be boosted more, considering user preferences. For instance, if a particular user enjoys comedies, those movies can be given a higher priority.&lt;/li&gt;
&lt;li&gt;When ranking search results, newer titles can be considered more relevant than very old titles.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why dedicated search engines typically offer ways to use different columns or fields to influence the ranking. Here are example tuning guides from &lt;a href="https://www.elastic.co/guide/en/app-search/current/relevance-tuning-guide.html" rel="noopener noreferrer"&gt;Elastic&lt;/a&gt;, &lt;a href="https://typesense.org/docs/guide/ranking-and-relevance.html" rel="noopener noreferrer"&gt;Typesense&lt;/a&gt;, and &lt;a href="https://www.meilisearch.com/docs/learn/core_concepts/relevancy" rel="noopener noreferrer"&gt;Meilisearch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want a visual demo of the impact of relevancy tuning, here is a quick 4 minutes video about it:&lt;/p&gt;



&lt;h3&gt;
  
  
  Numeric, date, and exact value boosters
&lt;/h3&gt;

&lt;p&gt;While Postgres doesn't have direct support for boosting based on other columns, the rank is ultimately just a sort expression, so you can add your own signals to it.&lt;/p&gt;

&lt;p&gt;For example, if you want to add a boost for the number of votes, you can do something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ts_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'jedi'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;-- numeric booster example&lt;/span&gt;
    &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NumberOfVotes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'jedi'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The logarithm is there to smoothen the impact, and the 0.01 factor brings the booster to a comparable scale with the ranking score.&lt;/p&gt;

&lt;p&gt;You can also design more complex boosters, for example, boost by the rating, but only if the ranking has a certain number of votes. To do this, you can create a function like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;numericBooster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;votes&lt;/span&gt; &lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voteThreshold&lt;/span&gt; &lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="nb"&gt;numeric&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;votes&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;voteThreshold&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;language&lt;/span&gt; &lt;span class="k"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And use it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ts_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'jedi'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;-- numeric booster example&lt;/span&gt;
    &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;numericBooster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NumberOfVotes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;005&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'jedi'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's take another example. Say we want to boost the ranking of comedies. You can create a &lt;code&gt;valueBooster&lt;/code&gt; function that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;valueBooster&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;factor&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="n"&gt;factor&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;language&lt;/span&gt; &lt;span class="k"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function returns a factor if the column matches a particular value and 0 instead. Use it in a query like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;ts_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'jedi'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
   &lt;span class="c1"&gt;-- value booster example&lt;/span&gt;
   &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;valueBooster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Genre&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'comedy'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;
   &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'jedi'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                                                                                 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                      &lt;span class="n"&gt;title&lt;/span&gt;                       &lt;span class="o"&gt;|&lt;/span&gt;               &lt;span class="n"&gt;genre&lt;/span&gt;                &lt;span class="o"&gt;|&lt;/span&gt;        &lt;span class="n"&gt;rank&lt;/span&gt;
&lt;span class="c1"&gt;--------------------------------------------------+------------------------------------+---------------------&lt;/span&gt;
 &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Men&lt;/span&gt; &lt;span class="n"&gt;Who&lt;/span&gt; &lt;span class="n"&gt;Stare&lt;/span&gt; &lt;span class="k"&gt;at&lt;/span&gt; &lt;span class="n"&gt;Goats&lt;/span&gt;                       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;comedy&lt;/span&gt;                             &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1107927106320858&lt;/span&gt;
 &lt;span class="n"&gt;Clerks&lt;/span&gt;                                           &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;comedy&lt;/span&gt;                             &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1107927106320858&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Clone&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;                        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;animation&lt;/span&gt;                          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09513916820287704&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="err"&gt;–&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Phantom&lt;/span&gt; &lt;span class="n"&gt;Menace&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;sci&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;fi&lt;/span&gt;                             &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09471701085567474&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="err"&gt;–&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Phantom&lt;/span&gt; &lt;span class="n"&gt;Menace&lt;/span&gt;        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;space&lt;/span&gt; &lt;span class="n"&gt;opera&lt;/span&gt;                        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09471701085567474&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;II&lt;/span&gt; &lt;span class="err"&gt;–&lt;/span&gt; &lt;span class="n"&gt;Attack&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Clones&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;science&lt;/span&gt; &lt;span class="n"&gt;fiction&lt;/span&gt;                    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09285612404346466&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;III&lt;/span&gt; &lt;span class="err"&gt;–&lt;/span&gt; &lt;span class="n"&gt;Revenge&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Sith&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;science&lt;/span&gt; &lt;span class="n"&gt;fiction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;            &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09285612404346466&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="k"&gt;Last&lt;/span&gt; &lt;span class="n"&gt;Jedi&lt;/span&gt;                         &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adventure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fantasy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sci&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;fi&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0889768898487091&lt;/span&gt;
 &lt;span class="k"&gt;Return&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Jedi&lt;/span&gt;                               &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;science&lt;/span&gt; &lt;span class="n"&gt;fiction&lt;/span&gt;                    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;07599088549613953&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;IV&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="k"&gt;New&lt;/span&gt; &lt;span class="n"&gt;Hope&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aka&lt;/span&gt; &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;science&lt;/span&gt; &lt;span class="n"&gt;fiction&lt;/span&gt;                    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;07599088549613953&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Column weights
&lt;/h3&gt;

&lt;p&gt;Remember when we talked about the &lt;code&gt;tsvector&lt;/code&gt; lexemes and that they can have weights attached? Postgres supports 4 weights, named A, B, C, and D. A is the biggest weight while D is the lowest and the default. You can control the weights via the &lt;code&gt;setweight&lt;/code&gt; function which you would typically call when building the &lt;code&gt;tsvector&lt;/code&gt; column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="n"&gt;tsvector&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;setweight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'A'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
   &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Plot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
   &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'simple'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Director&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
   &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'simple'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Genre&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
   &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'simple'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Origin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
   &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'simple'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Casting&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;STORED&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's see the effects of this. Without &lt;code&gt;setweight&lt;/code&gt;, a search for &lt;code&gt;jedi&lt;/code&gt; returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'jedi'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
   &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;
   &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'jedi'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                      &lt;span class="n"&gt;title&lt;/span&gt;                       &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="n"&gt;rank&lt;/span&gt;
&lt;span class="c1"&gt;--------------------------------------------------+-------------&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Clone&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;                        &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09513917&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="err"&gt;–&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Phantom&lt;/span&gt; &lt;span class="n"&gt;Menace&lt;/span&gt;        &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09471701&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="err"&gt;–&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Phantom&lt;/span&gt; &lt;span class="n"&gt;Menace&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09471701&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;III&lt;/span&gt; &lt;span class="err"&gt;–&lt;/span&gt; &lt;span class="n"&gt;Revenge&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Sith&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;092856124&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;II&lt;/span&gt; &lt;span class="err"&gt;–&lt;/span&gt; &lt;span class="n"&gt;Attack&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Clones&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;092856124&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="k"&gt;Last&lt;/span&gt; &lt;span class="n"&gt;Jedi&lt;/span&gt;                         &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;08897689&lt;/span&gt;
 &lt;span class="k"&gt;Return&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Jedi&lt;/span&gt;                               &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;075990885&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;IV&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="k"&gt;New&lt;/span&gt; &lt;span class="n"&gt;Hope&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aka&lt;/span&gt; &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;075990885&lt;/span&gt;
 &lt;span class="n"&gt;Clerks&lt;/span&gt;                                           &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;06079271&lt;/span&gt;
 &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Empire&lt;/span&gt; &lt;span class="n"&gt;Strikes&lt;/span&gt; &lt;span class="n"&gt;Back&lt;/span&gt;                          &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;06079271&lt;/span&gt;
 &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Men&lt;/span&gt; &lt;span class="n"&gt;Who&lt;/span&gt; &lt;span class="n"&gt;Stare&lt;/span&gt; &lt;span class="k"&gt;at&lt;/span&gt; &lt;span class="n"&gt;Goats&lt;/span&gt;                       &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;06079271&lt;/span&gt;
 &lt;span class="n"&gt;How&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;Deal&lt;/span&gt;                                      &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;06079271&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And with the &lt;code&gt;setweight&lt;/code&gt; on the title column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'jedi'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
   &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;
   &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'jedi'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                      &lt;span class="n"&gt;title&lt;/span&gt;                       &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="n"&gt;rank&lt;/span&gt;
&lt;span class="c1"&gt;--------------------------------------------------+-------------&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="k"&gt;Last&lt;/span&gt; &lt;span class="n"&gt;Jedi&lt;/span&gt;                         &lt;span class="o"&gt;|&lt;/span&gt;   &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6361112&lt;/span&gt;
 &lt;span class="k"&gt;Return&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Jedi&lt;/span&gt;                               &lt;span class="o"&gt;|&lt;/span&gt;   &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6231253&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Clone&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;                        &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09513917&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="err"&gt;–&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Phantom&lt;/span&gt; &lt;span class="n"&gt;Menace&lt;/span&gt;        &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09471701&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="err"&gt;–&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Phantom&lt;/span&gt; &lt;span class="n"&gt;Menace&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;09471701&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;III&lt;/span&gt; &lt;span class="err"&gt;–&lt;/span&gt; &lt;span class="n"&gt;Revenge&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Sith&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;092856124&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;II&lt;/span&gt; &lt;span class="err"&gt;–&lt;/span&gt; &lt;span class="n"&gt;Attack&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Clones&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;092856124&lt;/span&gt;
 &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt; &lt;span class="n"&gt;Episode&lt;/span&gt; &lt;span class="n"&gt;IV&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="k"&gt;New&lt;/span&gt; &lt;span class="n"&gt;Hope&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aka&lt;/span&gt; &lt;span class="n"&gt;Star&lt;/span&gt; &lt;span class="n"&gt;Wars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;075990885&lt;/span&gt;
 &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Empire&lt;/span&gt; &lt;span class="n"&gt;Strikes&lt;/span&gt; &lt;span class="n"&gt;Back&lt;/span&gt;                          &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;06079271&lt;/span&gt;
 &lt;span class="n"&gt;Clerks&lt;/span&gt;                                           &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;06079271&lt;/span&gt;
 &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Men&lt;/span&gt; &lt;span class="n"&gt;Who&lt;/span&gt; &lt;span class="n"&gt;Stare&lt;/span&gt; &lt;span class="k"&gt;at&lt;/span&gt; &lt;span class="n"&gt;Goats&lt;/span&gt;                       &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;06079271&lt;/span&gt;
 &lt;span class="n"&gt;How&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;Deal&lt;/span&gt;                                      &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;06079271&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note how the movie titles with “jedi” in their name have jumped to the top of the list, and their rank has increased.&lt;/p&gt;

&lt;p&gt;It's worth pointing out that having only four weight “classes” is somewhat limiting, and that they need to be applied when computing the &lt;code&gt;tsvector&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Typo-tolerance / fuzzy search
&lt;/h2&gt;

&lt;p&gt;PostgreSQL doesn't support fuzzy search or typo-tolerance directly, when using &lt;code&gt;tsvector&lt;/code&gt; and &lt;code&gt;tsquery&lt;/code&gt;. However, working on the assumptions that the typo is in the query part, we can implement the following idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;index all &lt;em&gt;lexemes&lt;/em&gt; from the content in a separate table&lt;/li&gt;
&lt;li&gt;for each word in the query, use similarity or Levenshtein distance to search in this table&lt;/li&gt;
&lt;li&gt;modify the query to include any words that are found&lt;/li&gt;
&lt;li&gt;perform the search&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is how it works. First, use &lt;code&gt;ts_stats&lt;/code&gt; to get all words in a materialized view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;unique_lexeme&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
   &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ts_stat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'SELECT search FROM movies'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, for each word in the query, check if it is in the &lt;code&gt;unique_lexeme&lt;/code&gt; view. If it's not, do a fuzzy-search in that view to find possible misspellings of it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;unique_lexeme&lt;/span&gt;
   &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;levenshtein_less_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'pregant'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

   &lt;span class="n"&gt;word&lt;/span&gt;
&lt;span class="c1"&gt;----------&lt;/span&gt;
 &lt;span class="n"&gt;premant&lt;/span&gt;
 &lt;span class="n"&gt;pregrant&lt;/span&gt;
 &lt;span class="n"&gt;pregnant&lt;/span&gt;
 &lt;span class="n"&gt;paegant&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above we use the &lt;a href="https://en.wikipedia.org/wiki/Levenshtein_distance" rel="noopener noreferrer"&gt;Levenshtein distance&lt;/a&gt; because that's what search engines like Elasticsearch use for fuzzy search.&lt;/p&gt;

&lt;p&gt;Once you have the candidate list of words, you need to adjust the query include them all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Faceted search
&lt;/h2&gt;

&lt;p&gt;Faceted search is popular especially on e-commerce sites because it helps customers to iteratively narrow their search. Here is an example from amazon.com:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3lnldd30k8gspna305o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3lnldd30k8gspna305o.png" alt="Faceted search on Amazon" width="440" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above can implemented by manually defining categories and then adding them as &lt;code&gt;WHERE&lt;/code&gt; conditions to the search. Another approach is to create the categories algorithmically based on the existing data. For example, you can use the following to create a “Decade” facet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ReleaseYear&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;decade&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'star wars'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;decade&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;decade&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt;
&lt;span class="c1"&gt;--------+-----&lt;/span&gt;
   &lt;span class="mi"&gt;2000&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;39&lt;/span&gt;
   &lt;span class="mi"&gt;2010&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;31&lt;/span&gt;
   &lt;span class="mi"&gt;1990&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;29&lt;/span&gt;
   &lt;span class="mi"&gt;1950&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;28&lt;/span&gt;
   &lt;span class="mi"&gt;1940&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;26&lt;/span&gt;
   &lt;span class="mi"&gt;1980&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;22&lt;/span&gt;
   &lt;span class="mi"&gt;1930&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;13&lt;/span&gt;
   &lt;span class="mi"&gt;1960&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;11&lt;/span&gt;
   &lt;span class="mi"&gt;1970&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;   &lt;span class="mi"&gt;7&lt;/span&gt;
   &lt;span class="mi"&gt;1910&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;   &lt;span class="mi"&gt;3&lt;/span&gt;
   &lt;span class="mi"&gt;1920&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;   &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This also provides counts of matches for each decade, which you can display in brackets.&lt;/p&gt;

&lt;p&gt;If you want to get multiple facets in a single query, you can combine them, for example, by using CTEs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;releaseYearFacets&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'Decade'&lt;/span&gt; &lt;span class="n"&gt;facet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ReleaseYear&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'star wars'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;genreFacets&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'Genre'&lt;/span&gt; &lt;span class="n"&gt;facet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Genre&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'star wars'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;releaseYearFacets&lt;/span&gt; &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;genreFacets&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

 &lt;span class="n"&gt;facet&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt;   &lt;span class="n"&gt;val&lt;/span&gt;   &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt;
&lt;span class="c1"&gt;--------+---------+-----&lt;/span&gt;
 &lt;span class="n"&gt;Decade&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1910&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;   &lt;span class="mi"&gt;3&lt;/span&gt;
 &lt;span class="n"&gt;Decade&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1920&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;   &lt;span class="mi"&gt;3&lt;/span&gt;
 &lt;span class="n"&gt;Decade&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1930&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;13&lt;/span&gt;
 &lt;span class="n"&gt;Decade&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1940&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;26&lt;/span&gt;
 &lt;span class="n"&gt;Decade&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1950&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;28&lt;/span&gt;
 &lt;span class="n"&gt;Decade&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1960&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;11&lt;/span&gt;
 &lt;span class="n"&gt;Decade&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1970&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;   &lt;span class="mi"&gt;7&lt;/span&gt;
 &lt;span class="n"&gt;Decade&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1980&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;22&lt;/span&gt;
 &lt;span class="n"&gt;Decade&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1990&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;29&lt;/span&gt;
 &lt;span class="n"&gt;Decade&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;39&lt;/span&gt;
 &lt;span class="n"&gt;Decade&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;2010&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;31&lt;/span&gt;
 &lt;span class="n"&gt;Genre&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;comedy&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;21&lt;/span&gt;
 &lt;span class="n"&gt;Genre&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;drama&lt;/span&gt;   &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;35&lt;/span&gt;
 &lt;span class="n"&gt;Genre&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;musical&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;   &lt;span class="mi"&gt;9&lt;/span&gt;
 &lt;span class="n"&gt;Genre&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;unknown&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;13&lt;/span&gt;
 &lt;span class="n"&gt;Genre&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;war&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above should work quite well on small to medium data sets, however it can become slow on very large data sets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We've seen the PostgreSQL full-text search primitives, and how we can combine them to create a pretty advanced full-text search engine, which also happens to support things like joins and ACID transactions. In other words, it has functionality that the other search engines typically don't have.&lt;/p&gt;

&lt;p&gt;There are more advanced search topics that would be worth covering in detail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;suggesters / auto-complete&lt;/li&gt;
&lt;li&gt;exact phrase matching&lt;/li&gt;
&lt;li&gt;hybrid search (semantic + keyword) by combining with pg-vector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these would be worth their own blog post (coming!), but by now you should have an intuitive feeling about them: they are quite possible using PostgreSQL, but they require you to do the work of combining the primitives and in some cases the performance might suffer on very large datasets.&lt;/p&gt;

&lt;p&gt;In part 2, we'll make a detailed comparison with Elasticsearch, looking to answer the question on when is it worth it to implement search into PostgreSQL versus adding Elasticsearch to your infrastructure and syncing the data. If you want to be notified when this gets published, you can follow us on &lt;a href="https://twitter.com/xata" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or join our &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>postgressql</category>
      <category>search</category>
      <category>database</category>
    </item>
    <item>
      <title>Postgres schema changes are still a PITA</title>
      <dc:creator>Tudor Golubenco</dc:creator>
      <pubDate>Sat, 08 Jul 2023 21:23:42 +0000</pubDate>
      <link>https://dev.to/xata/postgres-schema-changes-are-still-a-pita-46bf</link>
      <guid>https://dev.to/xata/postgres-schema-changes-are-still-a-pita-46bf</guid>
      <description>&lt;p&gt;We software engineers don’t agree on much, but we agree on this one: database schema changes are a pain in the a**. &lt;/p&gt;

&lt;p&gt;Part of my job at Xata is to talk with as many developers as possible - from fresh bootcamp graduates, to indie developers, to principal engineers working in large teams. We talk about databases in general, what issues they face, the tools they use, and so on.&lt;/p&gt;

&lt;p&gt;From the people that we’ve talked with, almost everyone said that schema changes and schema management are one of their &lt;em&gt;least favorite&lt;/em&gt; parts of working with databases. While this sentiment is pretty universal, the reasons that they bring up are not always the same. Small companies or developers working on hobby projects, for example, have to make lots of changes as their applications grow and they discover new requirements. They’d like their schema changes workflow to be as simple and straight-forward as pull requests on GitHub.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6q8bj0eq8u6vqiurhdrx.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6q8bj0eq8u6vqiurhdrx.gif" alt="Simpsons he's miserable" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For larger companies, with more data and high-traffic tables, schema changes may happen less often, but they still need to worry about things like downtime caused by locking. They require long internal guides on performing schema changes correctly (e.g. &lt;a href="https://docs.gitlab.com/ee/development/migration_style_guide.html" rel="noopener noreferrer"&gt;GitLab&lt;/a&gt;, &lt;a href="https://medium.com/paypal-tech/postgresql-at-scale-database-schema-changes-without-downtime-20d3749ed680" rel="noopener noreferrer"&gt;PayPal&lt;/a&gt;), custom tools (e.g. &lt;a href="https://github.com/facebookincubator/OnlineSchemaChange" rel="noopener noreferrer"&gt;Meta&lt;/a&gt;,  &lt;a href="https://developer.squareup.com/blog/shift-safe-and-easy-database-migrations/" rel="noopener noreferrer"&gt;Square&lt;/a&gt;), and they often document incidents or near-incidents caused by schema migrations (e.g. &lt;a href="https://github.blog/2021-12-01-github-availability-report-november-2021/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://medium.com/doctolib/adding-a-not-null-constraint-on-pg-faster-with-minimal-locking-38b2c00c4d1c" rel="noopener noreferrer"&gt;Doctolib&lt;/a&gt;, &lt;a href="https://gocardless.com/blog/zero-downtime-postgres-migrations-the-hard-parts/" rel="noopener noreferrer"&gt;GoCardless&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Across the board, developers complain about schema changes affecting their &lt;strong&gt;velocity&lt;/strong&gt;: they require more communication, more steps, come with backwards compatibility concerns. As a result, some developers never modify and never remove columns, they only add new ones. This creates “schema-debt”- which creates bugs, confuses new team members, and keeps compatibility code around way past its use-by date.&lt;/p&gt;

&lt;h1&gt;
  
  
  Issues with schema changes
&lt;/h1&gt;

&lt;p&gt;The following are specific to PostgreSQL, because this is the database system we use at Xata, but some of them apply to other database systems as well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Locking gotchas
&lt;/h3&gt;

&lt;p&gt;PostgreSQL has many &lt;a href="https://www.postgresql.org/docs/current/explicit-locking.html" rel="noopener noreferrer"&gt;different types of locks&lt;/a&gt; and most &lt;code&gt;ALTER TABLE&lt;/code&gt; statements (but &lt;a href="https://github.com/postgres/postgres/blob/REL_15_STABLE/src/backend/commands/tablecmds.c#L4161" rel="noopener noreferrer"&gt;not all&lt;/a&gt;) take the table &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock, which conflicts with all other lock types. This means that the table is essentially inaccessible and it’s important to keep this lock for the minimum possible duration. &lt;/p&gt;

&lt;p&gt;Even when the lock is taken for a short time, it can still make the table seem unavailable. During normal operations, reads and writes run on your tables concurrently. However, when an &lt;code&gt;ALTER TABLE&lt;/code&gt; query requiring &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; runs, Postgres needs to create an opening where no other queries or transactions are running. To achieve this, all queries issued after the &lt;code&gt;ALTER TABLE&lt;/code&gt; are put in a queue to run after the &lt;code&gt;ALTER TABLE&lt;/code&gt; completes. Here's the issue: if your table has a running query or a running transaction, your schema change cannot begin. While Postgres is waiting to run your schema change, all of the queries that you would've assumed would run instantly are now being queued up. This gives your table the appearance of being unavailable.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.depesz.com/2019/09/26/how-to-run-short-alter-table-without-long-locking-concurrent-queries/" rel="noopener noreferrer"&gt;trick&lt;/a&gt; to avoid the above is to explicitly take the lock before running the &lt;code&gt;ALTER TABLE&lt;/code&gt; and set a &lt;code&gt;lock_timeout&lt;/code&gt; to avoid queueing queries for a long period of time. If the timeout hits, you keep retrying until the lock succeeds, and only then perform the &lt;code&gt;ALTER TABLE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There are other gotchas as well, like using the magic keyword &lt;code&gt;CONCURRENTLY&lt;/code&gt; when adding indexes, but that doesn’t work inside transactions. When adding a constraint like &lt;code&gt;NOT NULL&lt;/code&gt;, Postgres needs to first check that there are no NULLs in the table, and it has to do that while holding the lock. You can &lt;a href="https://medium.com/doctolib/adding-a-not-null-constraint-on-pg-faster-with-minimal-locking-38b2c00c4d1c" rel="noopener noreferrer"&gt;work around&lt;/a&gt; it by adding a &lt;code&gt;CHECK CONSTRAINT&lt;/code&gt; with &lt;code&gt;NOT VALID&lt;/code&gt;, which means that the constraint is applied to new rows but not to old ones. Then you can run &lt;code&gt;VALIDATE&lt;/code&gt; later, which doesn’t need the &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock because it assumes new rows are respecting the constraint.&lt;/p&gt;

&lt;p&gt;In short, PostgreSQL offers good ways to control locking and avoid keeping locks for a long time, but it’s fair to say that it is a minefield; it’s easy to make mistakes, and every mistake can be costly. For this reason, teams need to have internal docs on how to do it correctly, and be extra careful during reviews. Staging systems are helpful to catch some of the gotchas, as long as they have the exact same data as the prod database and similar levels of traffic, which is typically not the case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2iizigqkm7krmfydf5dj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2iizigqkm7krmfydf5dj.jpg" alt="Schema changes staging meme" width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Application deploys and the 6 stages of rename
&lt;/h3&gt;

&lt;p&gt;The previous section mostly applies to teams that have large tables, but this one applies to any team that cares about things not breaking. It’s also not Postgres specific.&lt;/p&gt;

&lt;p&gt;Schema changes don’t happen in isolation, they are part of a new feature or fix that includes application code changes. If we could deploy the schema change and the application code to all application servers at the exact same moment, this wouldn’t be an issue. However, in practice, there’s a window of time where the old code needs to work with the new schema, or the other way around.&lt;/p&gt;

&lt;p&gt;The strategy to do things correctly depends on the change you’d like to make. For example, if you add a column: you would first perform the schema change, backfill the data, then deploy the application code. After, you would apply constraints, which itself can be a multiple-step process.&lt;/p&gt;

&lt;p&gt;If you want to remove a column, it’s the other way around. You first make sure no code is accessing that column, then you drop it from the schema.&lt;/p&gt;

&lt;p&gt;What about renames? You would follow these steps (hat tip to the &lt;a href="https://planetscale.com/docs/learn/handling-table-and-column-renames#how-to-rename-a-column-on-planetscale" rel="noopener noreferrer"&gt;PlanetScale docs&lt;/a&gt;):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new column with the new name.&lt;/li&gt;
&lt;li&gt;Update and deploy the application to write data to both columns.&lt;/li&gt;
&lt;li&gt;Backfill missing data from the old column to the new column.&lt;/li&gt;
&lt;li&gt;Optionally, add constraints like &lt;strong&gt;&lt;code&gt;NOT NULL&lt;/code&gt;&lt;/strong&gt; to the new column once all the data is backfilled.&lt;/li&gt;
&lt;li&gt;Update the application to only use the new column, and remove any references to the old column name.&lt;/li&gt;
&lt;li&gt;Drop the old column.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In summary, in order to make schema changes correctly, even ignoring locking issues, you often need a multi-step process. This is both slowing you down and has a large surface area for expensive mistakes to be made.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rollback? Rollback your expectations
&lt;/h3&gt;

&lt;p&gt;If there’s one thing scarier than performing schema changes, it’s the thought that you might have to undo them under time constraints.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpchptybcwzqryqpj11xq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpchptybcwzqryqpj11xq.jpg" alt="Schema changes rollback choice meme" width="500" height="756"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If schema changes require careful consideration and a multi-step process, rolling them back is not any simpler. You typically have to carefully apply the same steps in the reverse order.&lt;/p&gt;

&lt;p&gt;Because this is complicated and takes a long time, most of us tend to not test rollbacks before starting a schema migration. This makes it doubly scary; you’re operating an incident, with your database in an unknown state, and now you need to run through a set of untested steps in production. &lt;/p&gt;

&lt;p&gt;For the disciplined developers out there who did test their rollbacks, you may still be stuck waiting a long time for your rollback to complete. This can leave you staring at a terminal for minutes or hours, waiting for your application to come back online for your users.&lt;/p&gt;

&lt;h1&gt;
  
  
  What if we weren’t so afraid of schema changes?
&lt;/h1&gt;

&lt;p&gt;One of my favorite parts about working on Xata is that we can take this type of workflow issues and reconsider them from first principles.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let’s imagine that we have a magic wand.&lt;/em&gt; How would your ideal schema management system look like? This is mine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is a &lt;strong&gt;standard workflow&lt;/strong&gt; that is followed every time, regardless of the schema type.&lt;/li&gt;
&lt;li&gt;Schema changes are &lt;strong&gt;lock-free&lt;/strong&gt; or take table locks for minimal amounts of time.&lt;/li&gt;
&lt;li&gt;Schema changes &lt;strong&gt;don’t cause downtime&lt;/strong&gt; by breaking application code.&lt;/li&gt;
&lt;li&gt;You can apply all types of changes in a &lt;strong&gt;single-step&lt;/strong&gt;, or at most a couple of steps.&lt;/li&gt;
&lt;li&gt;Schema changes are &lt;strong&gt;undoable&lt;/strong&gt;, and the undo is quick.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, a system that makes schema changes &lt;strong&gt;standardized&lt;/strong&gt;, &lt;strong&gt;zero-downtime, lock-free, one-step, and undoable.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Is this possible?
&lt;/h3&gt;

&lt;p&gt;The first insight is that when it comes to schema changes, we put a lot of onus on the application code in order to keep the database simple. If we move the complexity on the database side, we implement it once and all applications benefit from it.&lt;/p&gt;

&lt;p&gt;Instead of having to make the application code forwards/backwards compatible, we make the database schema backwards compatible. The database could serve both the old schema (before the change) and the new schema (after the change) simultaneously. You can apply the schema change, and both old code and new code can work in parallel until the rolling upgrade is complete. &lt;/p&gt;

&lt;p&gt;Putting it on a timeline, it would look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuflok36ecjnphcxifki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuflok36ecjnphcxifki.png" alt="The timeline of a schema change + application roll-out." width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The schema change is started, for example when the PR is merged. It can take some time until the new schema is available, so the application deployment doesn’t yet start.&lt;/li&gt;
&lt;li&gt;Once the new schema is ready, the application deployment starts. It can be a rolling restart, so the old and the new version of the application might coexist for a period of time. That is ok because the old schema is still available.&lt;/li&gt;
&lt;li&gt;Once the application deployment is completed, the old schema can be deleted. However, you might want to leave it live for a while longer, in case a rollback is needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a rollback is needed while the old schema is still available, you can safely roll-back the application code. From its point of view, the schema was never modified. &lt;/p&gt;

&lt;p&gt;During the time period in which both the old and the new schema are valid, the system keeps temporary hidden columns, and uses views to represent the old and the new schema. New inserts and updates are automatically “upgraded” or “downgraded” between the two schemas, so both the old and the new versions of the application can work normally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvekyknuzai1albv2p0dy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvekyknuzai1albv2p0dy.png" alt="Views are used to represent the old and the new schema to the application." width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the above, the basic column operations (add/remove/rename) all become standardized and safe. No more having to schedule schema changes, no more avoiding renames, or postponing deleting unused columns for weeks “just to be sure”.&lt;/p&gt;

&lt;p&gt;Adding constraints is more challenging to implement, because the old and the new schema can be in conflict. For example, let’s say you are adding a &lt;code&gt;NOT NULL&lt;/code&gt; constraint on a column. If a new &lt;code&gt;INSERT&lt;/code&gt; comes over the old schema with a &lt;code&gt;NULL&lt;/code&gt; value, it needs to be accepted, because it respects the old schema. However, the resulting row cannot be exposed in the new schema, because it wouldn’t respect the constraint. In this case, it’s best to hide the new row in the new schema, and block the migration from completing until this issue is solved.&lt;/p&gt;

&lt;p&gt;From what I know, there is only one project that tries something close to this: the relatively recent &lt;a href="https://github.com/fabianlindfors/reshape" rel="noopener noreferrer"&gt;Reshape&lt;/a&gt;. It uses Postgres views to expose the two versions of the schema and triggers to upgrade/downgrade the new data. It doesn’t do the constraints part as described above, but shows that this approach is possible. Combined with the &lt;a href="https://xata.io/blog/workflow-github-vercel-netlify-xata" rel="noopener noreferrer"&gt;Xata pull request based workflow&lt;/a&gt;, I think the ideal system described above is possible!&lt;/p&gt;

&lt;h1&gt;
  
  
  Next Steps
&lt;/h1&gt;

&lt;p&gt;If you’re feeling the pain of schema changes and think that there must be a better way, we’d love to talk to you! We plan to work on this in the form of an open-source project for PostgreSQL, and if you want to be notified when we release it, you can follow us on &lt;a href="https://twitter.com/xata" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;, join us on &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;, or simply &lt;a href="https://app.xata.io/" rel="noopener noreferrer"&gt;sign up for Xata&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>postgressql</category>
      <category>database</category>
      <category>developer</category>
    </item>
    <item>
      <title>On the performance of REPLICA IDENTITY FULL in Postgres</title>
      <dc:creator>Tudor Golubenco</dc:creator>
      <pubDate>Wed, 07 Jun 2023 11:49:39 +0000</pubDate>
      <link>https://dev.to/xata/on-the-performance-of-replica-identity-full-in-postgres-340o</link>
      <guid>https://dev.to/xata/on-the-performance-of-replica-identity-full-in-postgres-340o</guid>
      <description>&lt;p&gt;At &lt;a href="https://xata.io/" rel="noopener noreferrer"&gt;Xata&lt;/a&gt;, we’re relying on the Postgres logical replication events quite a bit. Having a reliable stream with everything that has changed in the database is really useful in a number of cases, including data syncing&lt;br&gt;
to Elasticsearch, web-hooks, attachments clean-up, and more. One of the configuration parameters for logical replication is the so called “replica identity”:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A published table must have a “replica identity” configured in order to be&lt;br&gt;
able to replicate &lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; operations, so that appropriate rows&lt;br&gt;
to update or delete can be identified on the subscriber side. By default, this&lt;br&gt;
is the primary key, if there is one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You’d typically want to use the primary key or an index as the “replica identity” and avoid setting the “replica identity” to full. The reason is this quote from the &lt;a href="https://www.postgresql.org/docs/current/logical-replication-publication.html" rel="noopener noreferrer"&gt;Postgres docs&lt;/a&gt; (emphasis mine):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the table does not have any suitable key, then it can be set to replica identity “full”, which means the entire row becomes the key. This, however, &lt;strong&gt;is very inefficient and should only be used as a fallback if no other solution is possible&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sounds quite discouraging, however, there are some good reasons for which you might still want to consider enabling it (see next section), so it’s worth digging a bit deeper and analyzing what exactly happens when you enable it and how it affects performance. This is what this blog post is about.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzqr58nm1g6hmijm4tca4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzqr58nm1g6hmijm4tca4.jpg" alt="Replica identity full meme" width="653" height="382"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Ok, but why would you enable it?
&lt;/h1&gt;

&lt;p&gt;In our case there were two things that convinced us that we might want to consider REPLICA IDENTITY FULL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On UPDATEs and DELETEs, Postgres only includes the old values of the “identity columns”. So if you want to get the values &lt;strong&gt;after&lt;/strong&gt; the change as well as the values &lt;strong&gt;before&lt;/strong&gt; the change, setting replica identity to “full” tricks Postgres into passing the old values as well.&lt;/li&gt;
&lt;li&gt;UPDATE WAL events don’t include &lt;a href="https://www.postgresql.org/docs/current/storage-toast.html" rel="noopener noreferrer"&gt;TOASTed columns&lt;/a&gt;, unless that column has changed in the current operation. In practice this means that if you have column values over 8KB, you will miss them from the update events. Again, by setting the replica identity to FULL, we make sure they are always in the WAL event.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To illustrate the above, let’s look at an example, using &lt;a href="https://github.com/eulerto/wal2json" rel="noopener noreferrer"&gt;wal2json&lt;/a&gt; for convenience. An update event looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"update"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"xata"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"foo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"columnnames"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"columntypes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"columnvalues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"michael"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Michael Scott"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"oldkeys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"keynames"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"keytypes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"keyvalues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"michael"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above was generated after a command like this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;foo&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'michael'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The table has a primary key (the &lt;code&gt;id&lt;/code&gt; column) and that is used as the replica identity.&lt;/p&gt;

&lt;p&gt;Things to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under &lt;code&gt;columnvalues&lt;/code&gt; the new values for the &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, and &lt;code&gt;age&lt;/code&gt;columns are included.&lt;/li&gt;
&lt;li&gt;Under &lt;code&gt;oldkeys&lt;/code&gt; only the &lt;code&gt;id&lt;/code&gt; column is included, because that’s what the REPLICA IDENTITY is set to.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The table, however, has another column, named &lt;code&gt;description&lt;/code&gt; which is not included at all in the event. Why? Because the value of it is large and it is TOASTed. If the UPDATE command would have changed the description, then it would have been in the &lt;code&gt;columnvalues&lt;/code&gt; but because it was not touched, it is skipped.&lt;/p&gt;

&lt;p&gt;Now let’s compare it with the same event generated with REPLICA IDENTITY FULL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;foo&lt;/span&gt; &lt;span class="n"&gt;REPLICA&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;FULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The event for the same UPDATE command above looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"update"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"xata"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"foo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"columnnames"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"columntypes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"columnvalues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"michael"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Michael Scott"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;43&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"oldkeys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keynames"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keytypes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keyvalues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"michael"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Michael Scott"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;redacted very large description&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Things to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;columnvalues&lt;/code&gt; looks the same. It contains all the columns and their new values, but not the TOASTed column/value pairs.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;oldkeys&lt;/code&gt; now contains really all the columns with their values &lt;strong&gt;before&lt;/strong&gt; the update, including the TOAST value for &lt;code&gt;description&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It should be said that with the replica identity set to the primary key or another unique index, you get enough data out in the replication stream to completely sync a Postgres replica in sync. This means that you can overcome the limitations by keeping state outside of Postgres. However, the amount of work that you need to do outside of Postgres might become significant both in terms of resources needed and added code (compared to a single ALTER TABLE statement).&lt;/p&gt;

&lt;h1&gt;
  
  
  Impact on the replica and changes in Postgres 16
&lt;/h1&gt;

&lt;p&gt;The quote from the beginning of the  blog post with the “very inefficient” wording is actually from the — currently stable — Postgres 15 docs , but it was changed in the &lt;a href="https://www.postgresql.org/docs/16/logical-replication-publication.html" rel="noopener noreferrer"&gt;Postgres 16 docs&lt;/a&gt; — currently under development:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the table does not have any suitable key, then it can be set to replica identity &lt;code&gt;FULL&lt;/code&gt;, which means the entire row becomes the key. When replica identity &lt;code&gt;FULL&lt;/code&gt; is specified, indexes can be used on the subscriber side for searching the rows. […] &lt;strong&gt;If there are no such suitable indexes, the search on the subscriber side can be very inefficient,&lt;/strong&gt; therefore replica identity &lt;code&gt;FULL&lt;/code&gt; should only be used as a fallback if no other solution is possible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The above hints that the “very inefficient” part refers to the subscriber that needs to identify the rows that were updated / deleted, and that this is being improved in version 16 so that it’s able to use indexed columns even if they are not set explicitly in the replica identity.&lt;/p&gt;

&lt;p&gt;What is not clear from the docs above is that also in previous versions of Postgres, if the table has a primary key, it is used by the subscriber to find the affected row. You can see the code &lt;a href="https://github.com/postgres/postgres/blob/REL_14_STABLE/src/backend/replication/logical/worker.c#L1225-L1235" rel="noopener noreferrer"&gt;here&lt;/a&gt;, it looks for a replica index, and if it’s not found, it uses the PK instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
 * Get replica identity index or if it is not defined a primary key.
 *
 * If neither is defined, returns InvalidOid
 */&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Oid&lt;/span&gt;
&lt;span class="nf"&gt;GetRelationIdentityOrPK&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Relation&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Oid&lt;/span&gt;         &lt;span class="n"&gt;idxoid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;idxoid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RelationGetReplicaIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;OidIsValid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idxoid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;idxoid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RelationGetPrimaryKeyIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;idxoid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is good news because it means that the “very inefficient” case is only relevant to tables that have no primary key.&lt;/p&gt;

&lt;p&gt;With Postgres 16, the “very inefficient” case is reduced even further, because now Postgres can use other indexes beyond the PK, even if they are not explicitly set in the REPLICA IDENTITY. The patch for this change can be found &lt;a href="https://www.postgresql.org/message-id/CACawEhVLqmAAyPXdHEPv1ssU2c=dqOniiGz7G73HfyS7+nGV4w@mail.gmail.com" rel="noopener noreferrer"&gt;here&lt;/a&gt;. After the patch, the &lt;code&gt;GetRelationIdentityOrPK&lt;/code&gt; function stays the same, but the caller routine makes another effort into finding a suitable index to use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;    &lt;span class="cm"&gt;/*
     * Simple case, we already have a primary key or a replica identity index.
     */&lt;/span&gt;
    &lt;span class="n"&gt;idxoid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GetRelationIdentityOrPK&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;localrel&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OidIsValid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idxoid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;idxoid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remoterel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;replident&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;REPLICA_IDENTITY_FULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="cm"&gt;/*
        * We are looking for one more opportunity for using an index. If
        * there are any indexes defined on the local relation, try to pick a
        * suitable index.
        *
        * The index selection safely assumes that all the columns are going
        * to be available for the index scan given that remote relation has
        * replica identity full.
        *
        * Note that we are not using the planner to find the cheapest method
        * to scan the relation as that would require us to either use lower
        * level planner functions which would be a maintenance burden in the
        * long run or use the full-fledged planner which could cause
        * overhead.
        */&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;FindUsableIndexForReplicaIdentityFull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;localrel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrMap&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Impact on the size of the WAL events
&lt;/h1&gt;

&lt;p&gt;While the worst potential impact of REPLICA IDENTITY FULL is on the replica, as we’ve seen above, it does also impact the primary because it requires more data to be stored in the WAL and it causes it to send more data over the network to the subscribers. This means more IO bandwidth consumed, more CPU usage, and more RAM usage.&lt;/p&gt;

&lt;p&gt;How much more? It depends on the type of write traffic you have and how many large values you have, as well as how often they are updated.&lt;/p&gt;

&lt;p&gt;Let’s look at it by the event type:&lt;/p&gt;

&lt;h3&gt;
  
  
  Inserts
&lt;/h3&gt;

&lt;p&gt;Inserts are not impacted at all, because there are no “old values” and the large values are included regardless of the REPLICA IDENTITY setting. Easy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Updates
&lt;/h3&gt;

&lt;p&gt;If REPLICA IDENTITY is set to FULL, Postgres needs to keep the old values for all columns while executing the transaction and then write them to the WAL.&lt;/p&gt;

&lt;p&gt;If there are TOAST values in the row, they need to be loaded in memory, decompressed, and then logged to the WAL. The relevant code that loads them to the memory is this, with most of the hard work happening in &lt;code&gt;toast_flatten_tuple&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;replident&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;REPLICA_IDENTITY_FULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="cm"&gt;/*
         * When logging the entire old tuple, it very well could contain
         * toasted columns. If so, force them to be inlined.
         */&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HeapTupleHasExternal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;toast_flatten_tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In conclusion, the size of the WAL events for updates is ~doubled if there are no TOASTed values, and potentially way more than doubled if there are TOASTed values. &lt;/p&gt;

&lt;h3&gt;
  
  
  Deletes
&lt;/h3&gt;

&lt;p&gt;For deletes normally only the key is logged in the WAL, and by enabling REPLICA IDENTITY FULL, all columns become part of the key, meaning that the whole row is logged and sent over the wire on delete.&lt;/p&gt;

&lt;h1&gt;
  
  
  Benchmarking
&lt;/h1&gt;

&lt;p&gt;Based on the analysis above, we were guessing that the cost of REPLICA IDENTITY FULL would be worth it in our case, so decided to put it to test. We wanted to simulate something close to the worst case scenario but still realistic, so we have extended our benchmarking test suite to have a test with a lot of large values and performing a lot of updates. The test didn’t push the resources to the limit but was heavy enough for us to measure the impact.&lt;/p&gt;

&lt;p&gt;Here are the result of running the benchmark test before and after enabling REPLICA IDENTITY FULL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5t9cwy0ai5jq1t8x2f3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5t9cwy0ai5jq1t8x2f3.png" alt="CPU time before and after" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptlz7a0talfkwiwunotj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptlz7a0talfkwiwunotj.png" alt="Replication slot lag before and after" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The CPU time showed a modest increase in peak times from 30% to about 35%. There was a similar modest increase in the peak replication slot lag.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;If you’re using logical replication between instances, the impact from REPLICA IDENTITY FULL will likely be manageable as long as the replicated tables have a Primary Key. If you are on Postgres 16 already, having no Primary Key but another unique index might also be ok.&lt;/p&gt;

&lt;p&gt;In any case, there will be impact in the amount of data written to the WAL and sent over the network. The more UPDATE and DELETE traffic that you have, the more significant this impact will be.&lt;/p&gt;

&lt;p&gt;If you find the above interesting and want to work on similar challenges, make sure to check out the &lt;a href="https://xata.io/careers" rel="noopener noreferrer"&gt;Xata careers&lt;/a&gt; page.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>performance</category>
      <category>database</category>
    </item>
    <item>
      <title>Semantic Search With Xata, OpenAI, TypeScript, and Deno</title>
      <dc:creator>Tudor Golubenco</dc:creator>
      <pubDate>Tue, 21 Mar 2023 21:47:19 +0000</pubDate>
      <link>https://dev.to/xata/semantic-search-with-xata-openai-typescript-and-deno-31lb</link>
      <guid>https://dev.to/xata/semantic-search-with-xata-openai-typescript-and-deno-31lb</guid>
      <description>&lt;p&gt;At the same time we launched our &lt;a href="https://xata.io/chatgpt" rel="noopener noreferrer"&gt;ChatGPT integration&lt;/a&gt;, we also added a new vector type to Xata, making it easy to store embeddings. Additionally, we have added a new &lt;code&gt;vectorSearch&lt;/code&gt; endpoint, which performs similarity search on embeddings.&lt;/p&gt;

&lt;p&gt;Let’s take a quick tour to see how you can use these new capabilities to implement semantic search. We’re going to use the OpenAI embeddings API, TypeScript and Deno. This tutorial assumes prior knowledge of TypeScript, but no prior knowledge of Xata, Deno, or OpenAI.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is semantic search?
&lt;/h2&gt;

&lt;p&gt;Instead of just matching keywords, as traditional search engines do, semantic search attempts to understand the context and intent of the query and the relationships between the words used in it.&lt;/p&gt;

&lt;p&gt;For example, let’s say you have the following sample sentences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"The quick brown fox jumps over the lazy dog story"&lt;/li&gt;
&lt;li&gt;“The vehicle collided with the tree”&lt;/li&gt;
&lt;li&gt;“The cat growled towards the dog”&lt;/li&gt;
&lt;li&gt;"Lorem ipsum dolor sit amet, consectetur adipiscing elit”&lt;/li&gt;
&lt;li&gt;“The sunset painted the sky with vibrant colors”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you search for “sample text in latin”, traditional keyword search won’t match the “lorem ipsum” text, but semantic search will (and we’re going to demo it in this article).&lt;/p&gt;

&lt;p&gt;Similarly, if you search for “the kitty hissed at the puppy”, semantic search will see that the phrase has the same meaning as “the cat growled towards the dog”, even if they use none of the same words. Or, for another example, “vanilla sky” should bring up the “The sunset painted the sky with vibrant colors” sentence. Pretty cool, right? This is now quite possible thanks large-language models and vector search.&lt;/p&gt;

&lt;h2&gt;
  
  
  A quick intro to embeddings
&lt;/h2&gt;

&lt;p&gt;From a data point of view, embeddings are simply arrays of floats. They are the output of ML models, where the input can be a word, a sentence, a paragraph of text, an image, an audio file, a video file, and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bwks2ug9d2909m8vmge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bwks2ug9d2909m8vmge.png" alt="Embeddings diagram" width="800" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each number in the array of floats represents the input text on a particular dimension, which depends on the model. This means that the more similar the embeddings are, the more “similar” the input objects are. I’ve put “similar” in quotes because it depends on the model what kind of similar it means. When it comes to text, it’s usually about “similar in meaning”, even if different words, expressions, or languages are used.&lt;/p&gt;

&lt;p&gt;Reducing complex data to an array of numbers representing its qualities turns out to be very useful for a number of use cases. Think of reverse image search, recommendations algorithms for video and music, product recommendations, categorizing products, and so on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vector type in Xata
&lt;/h2&gt;

&lt;p&gt;If you want to follow along, start with steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sign up or sign into Xata &lt;a href="https://app.xata.io/signin" rel="noopener noreferrer"&gt;here&lt;/a&gt; (the usage from this tutorial fits well within the free tier, so you don’t need to set up billing)&lt;/li&gt;
&lt;li&gt;Create a database named &lt;code&gt;vectors&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Create a table named &lt;code&gt;Sentences&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Add two columns:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sentence&lt;/code&gt; of type string&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;embedding&lt;/code&gt; of type vector. Use 1536 as the dimension&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frmszk33kbfkpbaf8anqs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frmszk33kbfkpbaf8anqs.png" alt="Add vector type screenshot" width="800" height="897"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you are done, the schema should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8i6ije0mhmmuscfv6f5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8i6ije0mhmmuscfv6f5.png" alt="View schema screenshot" width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Initialize the Xata project
&lt;/h2&gt;

&lt;p&gt;To get ready for running the typescript code, install the Xata CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @xata.io/cli@latest

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;xata auth login&lt;/code&gt; to authenticate the CLI. This will open a browser window and prompt you to generate a new API key. Give it any name you’d like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xata auth login

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a folder for the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;sentences
&lt;span class="nb"&gt;cd &lt;/span&gt;sentences

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And run &lt;code&gt;xata init&lt;/code&gt; to connect it to the Xata DB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xata init

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Xata CLI will ask you to select the database and then ask you how to generate the types. Select &lt;strong&gt;Generate TypeScript code with Deno imports&lt;/strong&gt;. Use default settings for the rest of the questions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkaudr5kp1qesmwstwzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkaudr5kp1qesmwstwzg.png" alt="Xata init screenshot" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prepare OpenAI and Deno
&lt;/h2&gt;

&lt;p&gt;Create an &lt;a href="https://platform.openai.com/signup" rel="noopener noreferrer"&gt;OpenAI account&lt;/a&gt; and generate a key. Note that you need to set up billing for OpenAI in order for to run these examples, but the cost will be tiny (under $1).&lt;/p&gt;

&lt;p&gt;Add the OpenAI key to the &lt;code&gt;.env&lt;/code&gt; file which was created by the &lt;code&gt;xata init&lt;/code&gt; command above. Your &lt;code&gt;.env&lt;/code&gt; should look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# API key used by the CLI and the SDK&lt;/span&gt;
&lt;span class="c"&gt;# Make sure your framework/tooling loads this file on startup to have it available for the SDK&lt;/span&gt;
&lt;span class="nv"&gt;XATA_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;xau_&amp;lt;redacted&amp;gt;
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-&amp;lt;redacted&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install the Deno CLI. See &lt;a href="https://deno.land/manual@v1.31.3/getting_started/installation" rel="noopener noreferrer"&gt;this page&lt;/a&gt; for the various install options, on macOs with Homebrew it is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;deno

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Load data
&lt;/h2&gt;

&lt;p&gt;It’s now the time to write a bit of TypeScript code. Create a &lt;code&gt;loadWithEmbeddings.ts&lt;/code&gt; file top level in your project with the following contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Configuration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;OpenAIApi&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;getXataClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./src/xata.ts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;dotenvConfig&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;&amp;lt;https://deno.land/x/dotenv@v1.0.1/mod.ts&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;dotenvConfig&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;export&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openAIConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Deno&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openAI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAIApi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;openAIConfig&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;xata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getXataClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;The quick brown fox jumps over the lazy dog story&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;The vehicle collided with the tree&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;The cat growled towards the dog&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Lorem ipsum dolor sit amet, consectetur adipiscing elit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;The sunset painted the sky with vibrant colors&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sentence&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openAI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createEmbedding&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="nx"&gt;embedding&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nx"&gt;xata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Sentences&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step by step, this is what the script is doing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses &lt;code&gt;dotenv&lt;/code&gt; to load the &lt;code&gt;.env&lt;/code&gt; file into the current environment. This makes sure the &lt;code&gt;XATA_API_KEY&lt;/code&gt; and the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; are available to the rest of the script.&lt;/li&gt;
&lt;li&gt;Initializes the OpenAI and Xata client libraries.&lt;/li&gt;
&lt;li&gt;Defines the test data in the &lt;code&gt;sentences&lt;/code&gt; array.&lt;/li&gt;
&lt;li&gt;For each &lt;code&gt;sentence&lt;/code&gt; , calls the &lt;a href="https://platform.openai.com/docs/guides/embeddings" rel="noopener noreferrer"&gt;OpenAI embeddings API&lt;/a&gt; to get the embedding for it.&lt;/li&gt;
&lt;li&gt;Inserts a record containing the sentence and the embedding into Xata.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Execute the script like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;deno run &lt;span class="nt"&gt;--allow-net&lt;/span&gt; &lt;span class="nt"&gt;--allow-env&lt;/span&gt; &lt;span class="nt"&gt;--allow-read&lt;/span&gt; &lt;span class="nt"&gt;--allow-run&lt;/span&gt; ./loadWithEmbeddings.ts

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you visit the Xata UI now, you should see the data loaded, together with the embeddings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fible20p5wx4swipa6kmk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fible20p5wx4swipa6kmk.png" alt="View data screenshot" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Run search queries
&lt;/h2&gt;

&lt;p&gt;Let’s write another simple script that performs a search based on an input query. Name this script &lt;code&gt;search.ts&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Configuration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;OpenAIApi&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npm:openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;getXataClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./src/xata.ts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;dotenvConfig&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;&amp;lt;https://deno.land/x/dotenv@v1.0.1/mod.ts&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;dotenvConfig&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;export&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openAIConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Deno&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openAI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAIApi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;openAIConfig&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;xata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getXataClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Deno&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Please provide a search query&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Example: deno run --allow-net --allow-env search.ts 'the quick brown fox'&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;Deno&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Deno&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openAI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createEmbedding&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="nx"&gt;embedding&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;xata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Sentences&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vectorSearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;embedding&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getMetadata&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;|&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what’s going on in the script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The beginning is similar as for the previous script, using &lt;code&gt;dotenv&lt;/code&gt; to load the &lt;code&gt;.env&lt;/code&gt; file and then initializing the client libraries for OpenAI and Xata&lt;/li&gt;
&lt;li&gt;Read the query as the first argument passed to the script&lt;/li&gt;
&lt;li&gt;Use the &lt;a href="https://platform.openai.com/docs/guides/embeddings" rel="noopener noreferrer"&gt;OpenAI embeddings API&lt;/a&gt; to create embeddings for the query&lt;/li&gt;
&lt;li&gt;Run a vector search using the Xata &lt;a href="https://xata.io/docs/typescript-client/vector-search" rel="noopener noreferrer"&gt;Vector Search API.&lt;/a&gt; This finds vectors that are similar to the provided embedding&lt;/li&gt;
&lt;li&gt;Print the results, together with the similarity score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To run the script, execute it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;deno run &lt;span class="nt"&gt;--allow-net&lt;/span&gt; &lt;span class="nt"&gt;--allow-env&lt;/span&gt; &lt;span class="nt"&gt;--allow-read&lt;/span&gt; &lt;span class="nt"&gt;--allow-run&lt;/span&gt; &lt;span class="se"&gt;\\&lt;/span&gt;
  ./search.ts &lt;span class="s1"&gt;'sample text in latin'&lt;/span&gt;

1.8154079 | Lorem ipsum dolor sit amet, consectetur adipiscing elit
1.7424928 | The quick brown fox jumps over the lazy dog story
1.7360129 | The &lt;span class="nb"&gt;cat &lt;/span&gt;growled towards the dog
1.7311659 | The sunset painted the sky with vibrant colors
1.7038174 | The vehicle collided with the tree

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, searching for “sample text in latin” results in the “Lorem ipsum” text as we hoped. You can also try some variations, for example, “sample sentence” still brings the “Lorem ipsum” one as the top result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;deno run &lt;span class="nt"&gt;--allow-net&lt;/span&gt; &lt;span class="nt"&gt;--allow-env&lt;/span&gt; &lt;span class="nt"&gt;--allow-read&lt;/span&gt; &lt;span class="nt"&gt;--allow-run&lt;/span&gt; &lt;span class="se"&gt;\\&lt;/span&gt;
  ./search.ts &lt;span class="s1"&gt;'sample sentence'&lt;/span&gt;

1.805396 | Lorem ipsum dolor sit amet, consectetur adipiscing elit
1.7715557 | The quick brown fox jumps over the lazy dog story
1.7608802 | The sunset painted the sky with vibrant colors
1.7573793 | The &lt;span class="nb"&gt;cat &lt;/span&gt;growled towards the dog
1.7493906 | The vehicle collided with the tree

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scores on the left are numbers between 0 and 2 which indicate how close each sentence is to the provided query. If you run with a sentence that exits in the data, you’ll get a score very close to 2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;deno run &lt;span class="nt"&gt;--allow-net&lt;/span&gt; &lt;span class="nt"&gt;--allow-env&lt;/span&gt; &lt;span class="nt"&gt;--allow-read&lt;/span&gt; &lt;span class="nt"&gt;--allow-run&lt;/span&gt; &lt;span class="se"&gt;\\&lt;/span&gt;
  ./search.ts &lt;span class="s1"&gt;'The quick brown fox jumps over the lazy dog story'&lt;/span&gt;

1.9999993 | The quick brown fox jumps over the lazy dog story
1.8063612 | The &lt;span class="nb"&gt;cat &lt;/span&gt;growled towards the dog
1.769694 | Lorem ipsum dolor sit amet, consectetur adipiscing elit
1.7673006 | The vehicle collided with the tree
1.7605586 | The sunset painted the sky with vibrant colors

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s try the “vanilla sky” query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;deno run &lt;span class="nt"&gt;--allow-net&lt;/span&gt; &lt;span class="nt"&gt;--allow-env&lt;/span&gt; &lt;span class="nt"&gt;--allow-read&lt;/span&gt; &lt;span class="nt"&gt;--allow-run&lt;/span&gt; &lt;span class="se"&gt;\\&lt;/span&gt;
  ./search.ts &lt;span class="s1"&gt;'vanilla sky'&lt;/span&gt;

1.8137007 | The sunset painted the sky with vibrant colors
1.7584264 | Lorem ipsum dolor sit amet, consectetur adipiscing elit
1.7579571 | The quick brown fox jumps over the lazy dog story
1.7505519 | The vehicle collided with the tree
1.738231 | The &lt;span class="nb"&gt;cat &lt;/span&gt;growled towards the dog

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bingo, top result matches what we expected.&lt;/p&gt;

&lt;p&gt;Another one to try:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;deno run &lt;span class="nt"&gt;--allow-net&lt;/span&gt; &lt;span class="nt"&gt;--allow-env&lt;/span&gt; &lt;span class="nt"&gt;--allow-read&lt;/span&gt; &lt;span class="nt"&gt;--allow-run&lt;/span&gt; &lt;span class="se"&gt;\\&lt;/span&gt;
  ./search.ts &lt;span class="s1"&gt;'The car crashed into the oak.'&lt;/span&gt;

1.907266 | The vehicle collided with the tree
1.7763984 | The &lt;span class="nb"&gt;cat &lt;/span&gt;growled towards the dog
1.7755028 | The quick brown fox jumps over the lazy dog story
1.7745116 | The sunset painted the sky with vibrant colors
1.7570261 | Lorem ipsum dolor sit amet, consectetur adipiscing elit

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, the sentence with the same meaning shows up at the top with a score close to 2.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The large-language-models are powerful tools that open up new use cases. Semantic search is one of these use cases, and we’ve seen how the Xata vector search can be used to implement it. You can also use it to build recommendation engines, or finding similar entries in a knowledge-base, or questions in a Q&amp;amp;A website.&lt;/p&gt;

&lt;p&gt;If you’re running this tutorial, please join us on the Xata &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; and let us know what you are building!&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>deno</category>
      <category>openai</category>
      <category>ai</category>
    </item>
    <item>
      <title>Semantic or keyword search for finding ChatGPT context. Who searched it better?</title>
      <dc:creator>Tudor Golubenco</dc:creator>
      <pubDate>Wed, 08 Mar 2023 11:46:02 +0000</pubDate>
      <link>https://dev.to/xata/semantic-or-keyword-search-for-finding-chatgpt-context-who-searched-it-better-4bdp</link>
      <guid>https://dev.to/xata/semantic-or-keyword-search-for-finding-chatgpt-context-who-searched-it-better-4bdp</guid>
      <description>&lt;p&gt;Last week we’ve added a Q&amp;amp;A bot that answers questions from &lt;a href="https://xata.io/docs/overview?show-chat" rel="noopener noreferrer"&gt;our documentation&lt;/a&gt;. This leverages the ChatGPT tech to answer questions from the Xata documentation, even though the OpenAI GPT model was never trained on the Xata docs. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiy4idyeggnubmgmjabhz.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiy4idyeggnubmgmjabhz.gif" alt="GIF of the Xata bot answering a question" width="786" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The way we do this is by using an approach suggested by Simon Willison in this &lt;a href="https://simonwillison.net/2023/Jan/13/semantic-search-answers/" rel="noopener noreferrer"&gt;blog post&lt;/a&gt;. The same approach can be found also in an &lt;a href="https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb" rel="noopener noreferrer"&gt;OpenAI cookbook&lt;/a&gt;. The idea is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run a text search against the documentation to find the content that is most relevant to the question asked by the user.&lt;/li&gt;
&lt;li&gt;Produce a prompt with this general form:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;With these rules: {rules}
And this text: {context}
Given the above text, answer the question: {question}
Answer:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Send the prompt to the ChatGPT API and let the model complete the answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We found out that this works quite well and, combined with a relatively low model temperature (the concept of temperature is explained in this &lt;a href="https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/" rel="noopener noreferrer"&gt;blog post&lt;/a&gt;), this tends to produce correct results and code snippets, as long as the answer can be found in the documentation.&lt;/p&gt;

&lt;p&gt;A key limitation to this approach is that the prompt that you build in the second step above needs to have max 4000 tokens (~3000 words). This means that the first step, the text search to select the most relevant documents, becomes really important. If the search step does a good job and provides the right context, ChatGPT tends to also do a good job in producing a correct and to-the-point result.&lt;/p&gt;

&lt;p&gt;So what’s the best way to find the most relevant pieces of content in the documentation? The OpenAI cookbook, as well as Simon’s blog, use what is called semantic search. Semantic search leverages the language model to generate embeddings for both the question and the content. Embeddings are arrays of numbers that represent the text on a number of dimension. Pieces of text that have similar embeddings have a similar meaning. This means a good strategy is to find the pieces of content that the most similar embeddings to the question embeddings.&lt;/p&gt;

&lt;p&gt;Another possible strategy, based on the more classical keyword search, looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ask ChatGPT to extract the keywords from the question, with a prompt like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Extract keywords for a search query from the text provided. 
Add synonyms for words where a more common one exists.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Use the provided keywords to run a free-text-search and pick the top results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Putting it in a single diagram, the two methods look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsf68a86pf4yklo1goxt2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsf68a86pf4yklo1goxt2.png" alt="Diagram showing how we use keyword or vector search with ChatGPT" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have tried both on our documentation and have noticed some pros and cons. &lt;/p&gt;

&lt;p&gt;Let’s start by comparing a few results. Both are ran against the same database, and they both use the ChatGPT &lt;code&gt;gpt-3.5-turbo&lt;/code&gt; model. As there is randomness involved, I ran each question 2-3 times and picked what looked to me like the best result.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question: How do I install the Xata CLI?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Answer with vector search:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqva4ueu51qzileyhx5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqva4ueu51qzileyhx5g.png" alt="Question for the Xata bot: How do I install the Xata CLI" width="800" height="824"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer with keyword search:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbeyr5ydm4pn0692vm49h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbeyr5ydm4pn0692vm49h.png" alt="Question for the Xata bot: How do I install the Xata CLI" width="800" height="757"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Verdict&lt;/em&gt;: Both versions provided the correct answer, however the vector search one is a bit more complete. They both found the correct docs page for it, but I think our highlights-based heuristic selected a shorter chunk of text in case of the keyword strategy. Winner: vector search. &lt;/p&gt;

&lt;p&gt;Score: 1-0&lt;/p&gt;

&lt;h2&gt;
  
  
  Question: How do you use Xata with Deno?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Answer with vector search:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc066fjfzo5l4ileo3ybo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc066fjfzo5l4ileo3ybo.png" alt="Question for the Xata bot: How do you use Xata with Deno?" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer with keyword search:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7roalais0yspcikttn4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7roalais0yspcikttn4d.png" alt="Question for the Xata bot: How do you use Xata with Deno?" width="800" height="763"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Disappointing result for vector search, who somehow missed the dedicated Deno page in our docs. It did find some other Deno relevant content, but not the page that contained the very useful example. Winner: keyword search.&lt;/p&gt;

&lt;p&gt;Score: 1-1&lt;/p&gt;

&lt;h2&gt;
  
  
  Question: How can I import a CSV file with custom column types?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;With vector search:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8h12zq2hm49zbhwasu9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8h12zq2hm49zbhwasu9e.png" alt="Image description" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With keyword search:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fof30wz0rnwv36dqonu8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fof30wz0rnwv36dqonu8n.png" alt="Image description" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Both have found the right page (”Import a CSV file”), but the keyword search version managed to get a more complete answer. I did run this multiple times to make sure it’s not a fluke. I think the difference comes from how the text fragment is selected (neighbouring the keywords in case of keyword search, from the beginning of the page in case of vector search). Winner: keyword search.&lt;/p&gt;

&lt;p&gt;Score: 1-2&lt;/p&gt;

&lt;h2&gt;
  
  
  Question: How can I filter a table named Users by the email column?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;With vector search:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Famrdiks2zrfvudsejznr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Famrdiks2zrfvudsejznr.png" alt="Image description" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With keyword search:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjz3cnb5ivx0zxjopq6ig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjz3cnb5ivx0zxjopq6ig.png" alt="Image description" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Verdict:&lt;/em&gt; The vector search did better on this one, because it found the “Filtering” page on which there were more examples that ChatGPT could use to compose the answer. The keyword search answer is subtly broken, because it uses “query” instead of “filter” for the method name. Winner: vector search.&lt;/p&gt;

&lt;p&gt;Score: 2-2&lt;/p&gt;

&lt;h2&gt;
  
  
  Question: What is Xata?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;With vector search:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fluti2n9lue29r582ke02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fluti2n9lue29r582ke02.png" alt="Image description" width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With keyword search:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnwxn69sfejrnhgd21t1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnwxn69sfejrnhgd21t1.png" alt="Image description" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; This one is a draw, because both answers are quite good. The two picked different pages to summarize in an answer, but both did a good job and I can’t pick a winner.&lt;/p&gt;

&lt;p&gt;Score: 3-3&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration and tuning
&lt;/h2&gt;

&lt;p&gt;This is a sample Xata request used for keyword search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;https://workspace-id.eu-west&lt;/span&gt;&lt;span class="mi"&gt;-1&lt;/span&gt;&lt;span class="err"&gt;.xata.sh/db/docs:main/tables/search/ask&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"question"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What is Xata?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"Do not answer questions about pricing or the free tier. Respond that Xata has several options available, please check https://xata.io/pricing for more information."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"If the user asks a how-to question, provide a code snippet in the language they asked for with TypeScript as the default."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"Only answer questions that are relating to the defined context or are general technical questions. If asked about a question outside of the context, you can respond with &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;It doesn't look like I have enough information to answer that. Check the documentation or contact support.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"Results should be relevant to the context provided and match what is expected for a cloud database."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"If the question doesn't appear to be answerable from the context provided, but seems to be a question about TypeScript, Javascript, or REST APIs, you may answer from outside of the provided context."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"If you answer with Markdown snippets, prefer the GitHub flavour."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Your name is DanGPT"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"searchType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"keyword"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"search"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"fuzziness"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"slug"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"section"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"keywords"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"boosters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"valueBooster"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"section"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"guide"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"factor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this what we use for vector search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;https://workspace-id.eu-west&lt;/span&gt;&lt;span class="mi"&gt;-1&lt;/span&gt;&lt;span class="err"&gt;.xata.sh/db/docs:main/tables/search/ask&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"question"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"How do I get a record by id?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"Do not answer questions about pricing or the free tier. Respond that Xata has several options available, please check https://xata.io/pricing for more information."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"If the user asks a how-to question, provide a code snippet in the language they asked for with TypeScript as the default."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"Only answer questions that are relating to the defined context or are general technical questions. If asked about a question outside of the context, you can respond with &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;It doesn't look like I have enough information to answer that. Check the documentation or contact support.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"Results should be relevant to the context provided and match what is expected for a cloud database."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"If the question doesn't appear to be answerable from the context provided, but seems to be a question about TypeScript, Javascript, or REST APIs, you may answer from outside of the provided context."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"Your name is DanGPT"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"searchType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"vectorSearch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"embeddings"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"contentColumn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"section"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"guide"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the keyword search version has more settings, configuring fuzziness and boosters and column weights. The vector search only uses a filter. I would call this a plus for keyword search: you have more dials to tune the search and therefore get better answers. But it’s also more work, and the results from vector search are quite good without this tuning. &lt;/p&gt;

&lt;p&gt;In our case, we already have tuned the keyword search for our, well, docs search functionality. So it wasn’t necessarily extra work, and while playing with ChatGPT we discovered improvements to our docs and search as well. Also, Xata just happens to have a very nice UI for tuning your keyword search, so the work wasn’t hard to begin with (planning a separate blog post about that).&lt;/p&gt;

&lt;p&gt;There is no reason for which vector search couldn’t also have boosters and column weights and the like, but we don’t have it yet in Xata and I don’t know of any other solution that makes that as easy as we make keyword search tuning. And, in general, there is more prior art to keyword search, but it is quite possible that vector search will catch up.&lt;/p&gt;

&lt;p&gt;For now, I’m going to call keyword search a winner on this one.&lt;/p&gt;

&lt;p&gt;Score: 3-4&lt;/p&gt;

&lt;h2&gt;
  
  
  Convenience
&lt;/h2&gt;

&lt;p&gt;Our documentation already had a search function, dog-fooding Xata, so that was quite simple to extend to a chat bot. Xata now also supports vector search natively, but using it required adding embeddings for all the documentation pages and figuring out a good chunking strategy. We have used the OpenAI embeddings API for producing the text embeddings, which had a minimal cost. Winner: Keyword search &lt;/p&gt;

&lt;p&gt;Score 3-5&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency
&lt;/h2&gt;

&lt;p&gt;The keyword search approach needs an extra round-trip to the ChatGPT API. This adds in terms of latency to the result started to be streamed in the UI. By my measurements, this adds around 1.8s extra time &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With vector search:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qrrcb01lbsxy2pumipe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qrrcb01lbsxy2pumipe.png" alt="Xata bot request timing with vector search" width="800" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With keyword search:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogn0cbkv8a5pcskqfon0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogn0cbkv8a5pcskqfon0.png" alt="Image description" width="800" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: The total and the content download times here are not relevant, because they mostly depend on how long the generated response is. Look at the “Waiting for server response” bar (the green one) to compare.&lt;/p&gt;

&lt;p&gt;Winner: Vector search&lt;/p&gt;

&lt;p&gt;Score: 4-5&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost
&lt;/h2&gt;

&lt;p&gt;The keyword search version needs to do an extra API call to the ChatGPT API, on the other hand, the vector search version needs to produce embeddings for all the documents in the database plus the question. Unless we’re talking about a lot of documents, I’m going to call this a tie.&lt;/p&gt;

&lt;p&gt;Score: 5-6&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The score is tight! In our case we have gone with using the keyword search for now, mostly because we have more ways of tuning it and as a result of that it generates slightly better answers for our set of test questions. Also, any improvements that we make to search automatically benefit both the search and the chat use cases. As we’re improving our vector search capabilities with more tuning options, we might switch to vector search, or a hybrid approach, in the future.&lt;/p&gt;

&lt;p&gt;If you’d like to set up a similar chat bot for your own documentation, or any kind of knowledge base, you can easily implement the above using the Xata ask endpoint. &lt;a href="https://app.xata.io/signup" rel="noopener noreferrer"&gt;Create an account&lt;/a&gt; for free and join us on &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;. I’d be happy to personally help you get it up and running!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>chatgpt</category>
    </item>
  </channel>
</rss>
