loading...
Cover image for PostgreSQL (Rails) - Reverse search (tagging, subscription)

PostgreSQL (Rails) - Reverse search (tagging, subscription)

zealot128 profile image Stefan Wienert Originally published at stefanwienert.de on ・5 min read

Using PostgresQL’s search capabilities (tsvector etc.) is a well-known low-key alternative to using a full blown search engine, like Elasticsearch/Solr/Algolia. But it turns out, it is also quite easily usable as a “reverse search engine”, what Elasticsearch calls “Percolate”. One use case could be user supplied “Subscriptions”, think of “search subscriptions” of classified that you want to trigger whenever a new item reaches the database.

After fiddling around one day with Elasticsearch percolators I was not so satisfied with the results, that I decided to trying PG’s tsvector/tsquery.

Recently, I’ve wanted to build a keyword tagger , that given a blob of text uses a predefined vocabulary database to give you the “most relevant” tags for that tags. The process should be absolutely the same as with search subscriptions by reversing Query (queries are stored in the DB) and Documents (only one document: the document to match with the queries). For this example I will use keyword matching as we implemented it in Empfehlungsbund in various places:

  • Having a table keywords with a string column keyword, e.g. a managed table of most important keywords for your domain. In our cases that includes all the technolgy terms of developers/administrators (like “Javascript” or “Ruby on Rails”)
  • there is also a little hierarchie of keywords in “competence groups” and we also add the “search relevance” which expresses how often this term is used by Job Seekers on our job platforms of Empfehlungsbund.

1. Adding TS-columns

  1. We add a tsquery column keyword_query and a normalized token keyword_search_token
CREATE TABLE public.keywords (
    id bigint NOT NULL,
    keyword character varying,
    keyword_search_token character varying,
    keyword_query tsquery,
);

Enter fullscreen mode Exit fullscreen mode

keyword_query if you follow several guides about PG’s fulltext search, I never came across one that proposed to just use a tsquery instead of a tsvector. But in our case (percalote) this makes total sense to add the string as tsquery.

keyword_search_token is a normalized version of our query, a processing pipeline that we ran before saving a record AND before tagging a document on that document later on. In our case the Pipeline consists of:

  • Remove All HTML,
  • normalize collocations (“java-developer” and “java developer”)
  • Remove common (German) stop words,
  • REPLACE important special chars that our search engine will need, like “.” (‘.NET’, “Vue.js”) or prefix/suffix ‘#’, ‘+’ (C++, C#). Many search engines does get this not right and strip out all special chars by default.

Here a Ruby snippet from our codebase:

STOPWORDS = YAML.load_file('config/stopwords.yml')                                                                                                 
WORD_BOUNDARY = '(^|\s|[,\.\-\!\?])'
REPLACE_REGEX = /#{WORD_BOUNDARY}(#{STOPWORDS.join('|')})#{WORD_BOUNDARY}/i

REMOVE_STRING = [
  REPLACE_REGEX,
  %r{\([mwd/ ]+\)}
]

def preprocess_query(text)
   ActionController::Base.helpers.strip_tags(text).
      yield_self { |s| replace_special_chars(s) }.
      yield_self { |s| REMOVE_STRING.reduce(s) { |agg, elem| agg.gsub(elem, ' ') } }.
      # "Foo/Bar"
      gsub('/', ' / ').
      # EcmaScript5 -> Ecmascript 5
      gsub(/(\w+)(\d+)#{WORD_BOUNDARY}/, '\\1 \\2 \\3').
      tr('-', ' ').
      gsub(/\s+/, ' ').
      strip
end

def self.replace_special_chars(string)
  # C# C++, F#, ...
  string.
    gsub(/[#\+]/, '#' => "RAUTE", '+' => 'PLUS').
    # vue.js node.js, .net but not sentence dot
    gsub(/(\w+)?\.(\w+)/, '\\1DOT\\2')
  end

Enter fullscreen mode Exit fullscreen mode

2. Adding a tsquery column and trigger

Now, we want to build a trigger that manages the keyword_query conversion automatically. We can to this inline by splitting the words by whitespace and joining them back together with the infix “<->” operator, which means “the words on either side of the operator should be near together”. (Introduced in PG 9.6). This is extremely useful for searching for collocations like “Ruby on Rails”

CREATE FUNCTION public.keyword_results_before_insert_update_row_tr() RETURNS trigger
  LANGUAGE plpgsql
  AS $$
BEGIN
  new.keyword_query := to_tsquery(array_to_string(string_to_array(new.keyword_search_token, ' '), ' <-> '));
  RETURN NEW;
END;
$$;
CREATE TRIGGER keyword_results_before_insert_update_row_tr 
BEFORE INSERT OR UPDATE ON public.keyword_results 
FOR EACH ROW EXECUTE PROCEDURE  public.keyword_results_before_insert_update_row_tr();

Enter fullscreen mode Exit fullscreen mode

2a. Rails tsquery migration

If you are using Rails you can use hairtrigger Gem to more easily create the trigger in a migration:

add_column :keyword_results, :keyword_search_token, :string
add_column :keyword_results, :keyword_query, :tsquery
add_index :keyword_results, :keyword_query, using: 'gist'

create_trigger(compatibility: 1).on(:keyword_results).before(:insert, :update) do
  "new.keyword_query := to_tsquery(array_to_string(string_to_array(new.keyword_search_token, ' '), ' <-> '))"
end

Enter fullscreen mode Exit fullscreen mode

BUT: ActiveRecord’s PG-Adapter at this moment does not know at the moment to handle tsquery columns and will fail with:

unknown OID 3615: failed to recognize type of 'keyword_query'. It will be treated as String.

Enter fullscreen mode Exit fullscreen mode

And the table will be missing from db/schema.rb, and thus not loaded in test, with this comment:

# Could not dump table "keywords" because of following StandardError
# Unknown type 'tsquery' for column 'keyword_query'

Enter fullscreen mode Exit fullscreen mode

The normal solution which you find for this class of problem is generally to switch to “SQL”-mode of Schema dumper. I don’t like that, because it makes the diffs gigantic and merges/rebases fail more often. I found, I could just tell Rails via patching how to handle that problem:

# config/initializers/active_record_pg_types.rb
require "active_record/connection_adapters/postgresql_adapter"

module PGAddTypes
  # 1st place: how to handle this column = it's a string!
  def initialize_type_map(m = type_map)
    super
    m.register_type "tsquery", ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::OID::SpecializedString.new(:tsquery)
    m.register_type(3615, ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::OID::SpecializedString.new(:tsquery))
  end
end
# 2nd place:
ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:tsquery] = { name: "tsquery" }

ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend PGAddTypes

# 3rd place: mapping a schema / migration line to a database mapping
module ActiveRecord::ConnectionAdapters::PostgreSQL::ColumnMethods
  def tsquery(*args, **options)
    args.each { |name| column(name, :tsquery, options) }
  end
end

Enter fullscreen mode Exit fullscreen mode

With this code in place, schema-dumping and loading will work again!

3. Reverse searching against table

preprocessed_query = preprocess_query(params[:query])
# create intermediate from table with our keywords and the preprocessed full_text:
from = KeywordResult.sanitize_sql_array ['keywords, to_tsvector(?) full_text', preprocessed_query]
KeywordResult.
  # where: having a match
  where(%{full_text @@ tsv_body}).
  from(from).
  order('score desc').
  # select: all normal columns + a "score"
  select("keywords.*, ts_rank(full_text, tsv_body) as score")

Enter fullscreen mode Exit fullscreen mode

In SQL:

SELECT keywords.*, ts_rank(full_text, tsv_body) as score
FROM keyword_results, to_tsvector('Tag this string with keywords like Ruby on Rails') full_text
WHERE (full_text @@ tsv_body)
ORDER BY score desc

Enter fullscreen mode Exit fullscreen mode

(Of course, you should NOT pass the string in raw like this but use prepared statements with a placeholder variable)

Tips

Improving the accuracy by scoping

If you have a large spectrum of keywords there is the problem that some words can mean different things for different audiences. For example, when parsing a job ad or a applicant CV, imagine you find the word “Amazon” - For an IT person that probably is related to “AWS”, for a Retail staff maybe the Amazon Ad or Shops. Or think of generic greek/latin names for all kind of technologies (“Atlas”, “Prometheus” etc.) which could have all kind of meanings in different contexts, or roles like “Architect” or “Designer”. In our case we add the general sector to a keyword, like “IT” (programmer, architects, admins, tech designers), “Engineering”, “medicine”, “white-collar office” (Marketing, Sales, HR, Accounting) and similar. In a first step we first find, which of the Tags would have the most direct keyword matches and then only search for those keywords in this sector.

Suggest new keywords

As mentioned in the beginning, we have a 2 tier hierarchy of our keywords. Each keyword belongs to a “competence” (E.g. “Ruby on Rails”, “RSpec” belongs to “Ruby”). To suggest keywords we just take the keywords found, go up to the competence and then take the most relevant N keywords ordered by the “search popularity”.

Discussion

pic
Editor guide