
Honeybadger Staff for Honeybadger

Originally published at honeybadger.io

Full-text Search with Elasticsearch in Rails

This article was originally written by Ianis Triandafilov on the Honeybadger Developer Blog.

Elasticsearch is one of the most popular search engines out there. Many big companies rely on it in production, including giants such as Netflix, Medium, and GitHub.

Elasticsearch is very powerful; its main use cases include full-text search, real-time log analysis, and security analysis.

Unfortunately, Elasticsearch doesn't get much attention from the Rails community, so this article attempts to change this with two goals in mind: introduce the reader to the Elasticsearch concepts and show how to use it with Ruby on Rails.

You can find the source code of an example project that we're going to build here. The commit history more or less corresponds to the order of the sections in this article.

Introduction

From a broader perspective, Elasticsearch is a search engine that

  • is built on top of Apache Lucene;
  • stores and effectively indexes JSON documents;
  • is open-source;
  • provides a set of REST APIs for interacting with it;
  • in versions before 8.0, ships with security disabled by default (anyone who can reach its endpoints can query it);
  • scales horizontally pretty well.

Let's take a quick look at some of the basic concepts.

With Elasticsearch, we put documents into indices, which are then queried for data.

An index is similar to a table in a relational database; it is a store where we put documents (rows) that can later be queried.

A document is a collection of fields (similar to a row in a relational database).

A mapping is like a schema definition in a relational database. A mapping can be defined explicitly or guessed by Elasticsearch at insert time; it's always better to define the index mapping upfront.

With that covered, let's now set up our environment.

Installing Elasticsearch

The easiest way to install Elasticsearch on macOS is to use brew:

brew tap elastic/tap
brew install elastic/tap/elasticsearch-full

As an alternative, we can run it via Docker:

docker run \
  -p 127.0.0.1:9200:9200 \
  -p 127.0.0.1:9300:9300 \
  -e "discovery.type=single-node" \
  docker.elastic.co/elasticsearch/elasticsearch:7.16.2

For other options, please refer to the official reference.

Elasticsearch accepts requests on port 9200 by default. You can check that it's running with a simple curl request (or open it in a browser):

curl http://localhost:9200

APIs

Elasticsearch exposes REST APIs for practically every type of task. For example, we can create a document by sending a POST request with a JSON body:

curl -X POST http://localhost:9200/my-index/_doc \
  -H 'Content-Type: application/json' \
  -d '{"title": "Banana Cake"}'

In this case, my-index is the name of an index (if it's not present, it gets created automatically).

The _doc is a system route (all system routes start with an underscore).
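For readers who prefer Ruby to curl, the same request can be sketched with the standard library alone. This is only an illustration: the helper name build_index_request is hypothetical, and the request is built but not sent, so the snippet runs without an Elasticsearch node.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Builds (but does not send) a POST request that indexes one document.
# `build_index_request` is a hypothetical helper, not part of any gem.
def build_index_request(base_url, index, document)
  uri = URI.join(base_url, "/#{index}/_doc")
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate(document)
  request
end

request = build_index_request('http://localhost:9200', 'my-index', { title: 'Banana Cake' })
puts "#{request.method} #{request.path}"
puts request.body

# To actually send it (requires a running Elasticsearch node):
# uri = request.uri
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
# puts response.body
```

Keeping the sending step separate makes it easy to inspect the path and body before anything touches the cluster.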

There are multiple ways to interact with the APIs:

  1. Using curl from the command-line (you might find jq handy).
  2. Running GET queries from the browser using some extension for pretty printing JSON.
  3. Installing Kibana and using the Dev Tools console, which is my favorite way.
  4. Finally there are also some great Chrome extensions.

For the sake of this article, it doesn't matter which you choose, since we are not going to interact with the APIs directly anyway. Instead, we will use a gem, which talks to the REST API under the hood.

Starting a new app

The idea is to create a song lyrics application using a public dataset of 26K+ songs. Each song has title, artist, genre, and lyrics fields. We will use Elasticsearch for full-text search.

Let's start by creating a simple Rails application:

rails new songs_api --api -d postgresql

As we will only use it as an API, we provide the --api flag to limit the set of middleware used.

Let's scaffold our app:

bin/rails generate scaffold Song title:string artist:string genre:string lyrics:text

Now, let's run the migrations and start the server:

bin/rails db:create db:migrate
bin/rails server

After that, we verify that the GET endpoint works:

curl http://localhost:3000/songs

This returns an empty array, which is no surprise, as there's no data yet.

Introducing Elasticsearch

Let's add Elasticsearch into the mix. To do so, we will need the elasticsearch-model gem. It's an official Elasticsearch gem that integrates nicely with Rails models.

Add the following to your Gemfile:

gem 'elasticsearch-model'

By default, it will connect to port 9200 on localhost, which suits us perfectly, but if you want to change that, you can configure the client like this:

Song.__elasticsearch__.client = Elasticsearch::Client.new host: 'myserver.com', port: 9876

Next, to make our model indexable by Elasticsearch, we need to do two things. First, we need to prepare a mapping (which is essentially telling Elasticsearch about our data structure), and second, we should construct a search request. Our gem can do both, so let's see how to use it.

It's always a good idea to keep Elasticsearch-related code in a separate module, so let's create a concern at app/models/concerns/searchable.rb and add the following:

# app/models/concerns/searchable.rb

module Searchable
  extend ActiveSupport::Concern

  included do
    include Elasticsearch::Model
    include Elasticsearch::Model::Callbacks

    mapping do
      # mapping definition goes here
    end

    def self.search(query)
      # build and run search
    end
  end
end

Even though it's just a skeleton, there's something to unpack here.

The first and most important piece is Elasticsearch::Model, which adds the functionality for interacting with Elasticsearch. The Elasticsearch::Model::Callbacks module ensures that when we create, update, or delete a record, the corresponding data in Elasticsearch is updated automatically.

The mapping block is where we put the Elasticsearch index mapping, which defines which fields are going to be stored in Elasticsearch and what types they should have.

Finally, there is a search method that we will use to actually search Elasticsearch for song lyrics. The gem provides its own search method that can be used with a simple query like Song.search("genesis"), but we will use it with a more complex search query constructed using the query DSL (more on that later).

Let's not forget to include the concern in our model class:

# app/models/song.rb

class Song < ApplicationRecord
  include Searchable
end

Mappings

In Elasticsearch, mapping is like a schema definition in a relational database. We describe the structure of the documents that we want to store. Unlike a typical relational database, we don't have to define our mapping upfront: Elasticsearch will do its best to guess the type for us. Still, as we don't want any surprises, we will explicitly define our mapping beforehand.

A mapping can be updated via the REST endpoint using PUT /my-index/_mapping and read via GET /my-index/_mapping, but the elasticsearch-model gem abstracts that for us, so all we need to do is provide the mapping block:

# app/models/concerns/searchable.rb

mapping do
  indexes :artist, type: :text
  indexes :title, type: :text
  indexes :lyrics, type: :text
  indexes :genre, type: :keyword
end

We are going to index the artist, title, and lyrics fields using the text type, which is the type that is analyzed and indexed for full-text search. For the genre, we will use the keyword type, which is ideal for filtering by an exact value.
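To make the connection to the REST API concrete, here is a rough sketch of the JSON body that this mapping corresponds to on Elasticsearch 7.x. The constant name SONG_MAPPING is hypothetical, and the exact hash the gem generates may differ in detail; the snippet only prints the structure and does not talk to Elasticsearch.

```ruby
require 'json'

# Hypothetical constant mirroring the mapping block above:
# three full-text fields and one exact-value field.
SONG_MAPPING = {
  properties: {
    artist: { type: 'text' },
    title:  { type: 'text' },
    lyrics: { type: 'text' },
    genre:  { type: 'keyword' }
  }
}.freeze

# This is approximately what would be sent as the body of PUT /songs/_mapping.
puts JSON.pretty_generate(SONG_MAPPING)
```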

Now open the Rails console with bin/rails console and run

Song.__elasticsearch__.create_index!

This will create our index in Elasticsearch. The __elasticsearch__ object is our gateway to the Elasticsearch world, packed with a lot of useful methods for interacting with Elasticsearch.

Importing the data

Every time we create a record, it will automatically send the data to Elasticsearch. So, we are going to download a dataset with song lyrics and import it into our app. First, download it from this link (a dataset under Creative Commons Attribution 4.0 International license). This CSV file contains more than 26,000 records, which we will import into our database and Elasticsearch with the code below:

require 'csv'

class Song < ApplicationRecord
  include Searchable

  # Creates one Song per CSV row; the callbacks index each record in Elasticsearch.
  def self.import_csv!
    filepath = "/path/to/your/file/tcc_ceds_music.csv"
    # Stream the file row by row instead of loading it all into memory.
    CSV.foreach(filepath, headers: true) do |row|
      Song.create!(
        artist: row["artist_name"],
        title: row["track_name"],
        genre: row["genre"],
        lyrics: row["lyrics"]
      )
    end
  end
end

Open the Rails console and run Song.import_csv! (this will take some time). Alternatively, we could import our data in bulk, which is much faster, but in this case, we want to make sure we create records in both our PostgreSQL database and Elasticsearch.

When the import finishes, we have a great deal of lyrics to search.
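For context on the bulk alternative: Elasticsearch's bulk API takes newline-delimited JSON, where each document is preceded by an action line. The sketch below only builds such a payload in plain Ruby; the helper name bulk_index_payload and the second sample document are made up for illustration, and nothing is sent to a server. In a Rails app you would more likely lean on the gem, which as far as I know also offers an import class method for bulk-indexing existing records.

```ruby
require 'json'

# Builds a newline-delimited body for POST /_bulk.
# Each document becomes two lines: an action line and the document itself.
# `bulk_index_payload` is a hypothetical helper, not part of any gem.
def bulk_index_payload(index, documents)
  lines = documents.flat_map do |doc|
    action = { index: { _index: index, _id: doc.fetch(:id).to_s } }
    [JSON.generate(action), JSON.generate(doc)]
  end
  # The bulk API requires the body to end with a newline.
  lines.join("\n") + "\n"
end

# Sample data: the first document mirrors the article's example hit,
# the second is a placeholder.
songs = [
  { id: 22676, title: 'genesis', artist: 'grimes', genre: 'pop' },
  { id: 1, title: 'placeholder title', artist: 'placeholder artist', genre: 'rock' }
]

payload = bulk_index_payload('songs', songs)
puts payload
```

Sending one request per batch instead of one per record is what makes the bulk route so much faster.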

Searching the data

The elasticsearch-model gem adds a search method that allows us to search among all indexed fields. Let's use it in our searchable concern:

# app/models/concerns/searchable.rb

# ...
def self.search(query)
  self.__elasticsearch__.search(query)
end
# ...


Open the rails console and run res = Song.search('genesis'). The response object contains a lot of meta information: how much time the request took, what nodes were used, etc. We are interested in hits, at res.response["hits"]["hits"].

Let's change our controller's index method to query ES instead.

# app/controllers/songs_controller.rb

def index
  query = params["query"] || ""
  res = Song.search(query)
  render json: res.response["hits"]["hits"]
end

Now we can try loading it in a browser or using curl http://localhost:3000/songs?query=genesis. The response will look like this:


[
  {
    "_index": "songs",
    "_type": "_doc",
    "_id": "22676",
    "_score": 12.540506,
    "_source": {
      "id": 22676,
      "title": "genesis",
      "artist": "grimes",
      "genre": "pop",
      "lyrics": "heart know heart ...",
      "created_at": "...",
      "updated_at": "..."
    }
  },
  ...
]

As you can see, the actual data is returned under the _source key. The other fields are metadata, the most important of which is _score, which shows how relevant the document is for the particular search. We'll get to it soon, but first let's learn how to make queries.
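If the API consumer does not need the metadata, the hits can be flattened before rendering. The following is a minimal sketch on a plain-hash copy of the hit shown above; simplify_hits is a hypothetical helper name, and the choice of fields to keep is an assumption.

```ruby
# Sample hit copied from the response shown above (timestamps omitted).
hits = [
  {
    '_index' => 'songs',
    '_id' => '22676',
    '_score' => 12.540506,
    '_source' => {
      'id' => 22676, 'title' => 'genesis', 'artist' => 'grimes',
      'genre' => 'pop', 'lyrics' => 'heart know heart ...'
    }
  }
]

# Keeps a few fields from _source and attaches the relevance score.
# `simplify_hits` is a hypothetical helper, not part of the gem.
def simplify_hits(hits)
  hits.map do |hit|
    hit.fetch('_source')
       .slice('id', 'title', 'artist', 'genre')
       .merge('score' => hit['_score'])
  end
end

results = simplify_hits(hits)
puts results.inspect
```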

The query DSL

The Elasticsearch query DSL provides a way to construct complex queries, and we can use it from Ruby code as well. For example, let's modify the search method to only search the artist field:

# app/models/concerns/searchable.rb

module Searchable
  extend ActiveSupport::Concern

  included do
    # ...

    def self.search(query)
      params = {
        query: {
          match: {
            artist: query,
          },
        },
      }

      self.__elasticsearch__.search(params)
    end
  end
end

The query-match construct allows us to search only a particular field (in this case, the artist). Now, if we query the songs again with "genesis" (try it out by loading http://localhost:3000/songs?query=genesis), we will only get the songs by the band "Genesis", and not the songs that have "genesis" in their title. If we want to query multiple fields, which is often the case, we can use a multi-match query:

# app/models/concerns/searchable.rb

def self.search(query)
  params = {
    query: {
      multi_match: {
        query: query, 
        fields: [ :title, :artist, :lyrics ] 
      },
    },
  }

  self.__elasticsearch__.search(params)
end

Filtering

What if we only want to search, say, among rock songs? Then, we need to filter by genre! This is going to make our search a bit more complex, but don't worry—we will explain everything step by step!

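  def self.search(query, genre = nil)
    # Only filter by genre when one is given; a term query with a nil value is rejected.
    filters = genre.present? ? [{ term: { genre: genre } }] : []

    params = {
      query: {
        bool: {
          must: [
            {
              multi_match: {
                query: query,
                fields: [ :title, :artist, :lyrics ]
              }
            },
          ],
          filter: filters
        }
      }
    }

    self.__elasticsearch__.search(params)
  end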

The first new keyword is bool, which is just a way to combine multiple queries into one. In our case, we are combining must and filter. The first one (must) contributes to the score and contains the same query we have already used before. The second one (filter) doesn't contribute to the score; it just does what it says: it filters out documents that don't match the query. We want to filter our records by genre, so we use the term query.

It's important to note that the filter-term combination has nothing to do with full-text search. It's just a regular filter by an exact value, the same way that the WHERE clause works in SQL (WHERE genre = 'rock'). It's good to know how term filtering works, but we won't need it in the rest of this article.

Scoring

The search results are ordered by _score, which shows how relevant an item is for a particular search. The higher the score, the more relevant the document. You have probably noticed that when we searched for the word genesis, the first result that popped up was the song by Grimes, whereas I was actually more interested in the band Genesis. So, can we alter the scoring mechanism to pay more attention to the artist field? Yes, we can, but to do that, we first need to tweak our query:

  def self.search(query)
    params = {
      query: {
        bool: {
          should: [
            { match: { title: query }},
            { match: { artist: query }},
            { match: { lyrics: query }},
          ],
        }
      },
    }

    self.__elasticsearch__.search(params)
  end

This query matches the same documents as the earlier multi_match query, but it uses the bool keyword to combine one match query per field. (The scores are not identical: bool with should adds up the scores of the matching clauses, whereas multi_match by default uses the score of the best-matching field.) The three queries under should are essentially combined using logical OR. If we used must instead, they would be combined using logical AND. Why do we need a separate match per field? Because now we can specify the boost property, which is a coefficient that multiplies the score from that particular query:

  def self.search(query)
    params = {
      query: {
        bool: {
          should: [
            { match: { title: query }},
            { match: { artist: { query: query, boost: 5 } }},
            { match: { lyrics: query }},
          ],
        }
      },
    }

    self.__elasticsearch__.search(params)
  end

Other things being equal, a match on the artist field now contributes five times as much to the score as it did before. Try the genesis query again, with http://localhost:3000/songs?query=genesis, and you will see songs by the band Genesis coming first. Sweet!
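If you expect to tune the weights, the should clauses can be generated from a hash of boosts rather than written out by hand. This is a sketch in plain Ruby; the helper names and the default weights are assumptions, and a boost of 1 is the same as specifying no boost at all.

```ruby
# Builds one match clause per field, each with its own boost.
# Both helpers are hypothetical; they only construct the query hash.
def weighted_should_clauses(query, boosts)
  boosts.map do |field, boost|
    { match: { field => { query: query, boost: boost } } }
  end
end

def weighted_search_params(query, boosts = { title: 1, artist: 5, lyrics: 1 })
  { query: { bool: { should: weighted_should_clauses(query, boosts) } } }
end

params = weighted_search_params('genesis')
puts params.inspect
```

Inside the concern, the resulting hash could be passed to self.__elasticsearch__.search just like the hand-written one.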

Highlighting

Another helpful feature of Elasticsearch is being able to highlight the match within the document, which allows the user to understand better why a particular result appeared in the search.

Elasticsearch can mark the matched text for us by automatically wrapping it in an HTML tag (em by default).

Let's open the searchable.rb concern again and add a new keyword:

def self.search(query)
  params = {
    query: {
      bool: {
        should: [
          { match: { title: query }},
          { match: { artist: { query: query, boost: 5 } }},
          { match: { lyrics: query }},
        ],
      }
    },
    highlight: { fields: { title: {}, artist: {}, lyrics: {} } }
  }

  self.__elasticsearch__.search(params)
end

The new highlight field specifies which fields should be highlighted. We select all of them. Now, if we load http://localhost:3000/songs?query=genesis, we should see a new field called "highlight" in each hit that contains the document fields with matched phrases wrapped in the em tag.

For more on highlighting, please refer to the official guide.
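As a small illustration of consuming the highlight field, the sketch below pulls the highlighted fragments out of a single hit. The sample hit is hypothetical (it is shaped like the response format, not copied from real output), and highlighted_fields is a made-up helper name.

```ruby
# Hypothetical hit, shaped like an Elasticsearch response with highlighting.
hit = {
  '_id' => '1',
  '_source' => { 'title' => 'placeholder title', 'artist' => 'genesis' },
  'highlight' => { 'artist' => ['<em>genesis</em>'] }
}

# Each highlighted field maps to an array of fragments; join them for display.
# Hits without a highlight key simply yield an empty hash.
def highlighted_fields(hit)
  hit.fetch('highlight', {}).transform_values { |fragments| fragments.join(' ... ') }
end

puts highlighted_fields(hit).inspect
```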

Fuzziness

All right, what if we mistakenly typed benesis instead of genesis? This is not going to return any results, but we can tell Elasticsearch to be less picky and allow fuzzy search, so it will return the genesis results as well.

Here is how it is done. Just change the artist query from { match: { artist: { query: query, boost: 5 } }} to { match: { artist: { query: query, boost: 5, fuzziness: "AUTO" } }}. The exact fuzziness mechanics can be configured. Please refer to the official docs for more details.
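To give a feel for what "AUTO" means: as I understand the documented defaults, terms of up to 2 characters must match exactly, terms of 3 to 5 characters may differ by one edit, and longer terms by two. The sketch below mirrors that rule in plain Ruby and spells out the modified artist clause; both helper names are hypothetical, and the function is an illustration of the rule, not Elasticsearch's implementation.

```ruby
# Mirrors the default AUTO rule (equivalent to "AUTO:3,6"):
# length < 3 => 0 edits, length 3..5 => 1 edit, length >= 6 => 2 edits.
def auto_fuzziness_edits(term, low: 3, high: 6)
  length = term.length
  return 0 if length < low
  return 1 if length < high

  2
end

# The artist clause from the query above, with fuzziness enabled.
def fuzzy_artist_clause(query)
  { match: { artist: { query: query, boost: 5, fuzziness: 'AUTO' } } }
end

puts auto_fuzziness_edits('benesis')
puts fuzzy_artist_clause('benesis').inspect
```

Since benesis is seven characters long and differs from genesis by a single substitution, it falls comfortably within the allowed distance.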

Where to next?

I hope this article has convinced you that Elasticsearch is a powerful tool that can and should be used when you need to implement a non-trivial search. If you are ready to learn more, here are some helpful links:

Resources

Alternative gems
