We lean heavily on Elasticsearch here at CompanyCam, especially to serve our highly filterable project feed. It's incredibly fast, even when you apply multiple filters to your query while searching a largish data set. Our primary interface to Elasticsearch is the Searchkick gem. Searchkick is a powerhouse with many features out of the box, but we bump up against its edges when trying to reindex a large collection.
Mo' Projects, Mo' Problems
CompanyCam houses just under 21 million projects with tens of thousands of new projects added daily. On occasion we change the fields that we want to be available for filtering. In order to make the new field(s) available, we have to reindex the entire collection of records. Reindexing a collection of this size - where each record additionally pulls in values from associated records - can be quite slow. If we run the reindex synchronously it takes about 10 hours, and that is with eager loading of the associations and other optimizations. Never fear though, Searchkick accounts for this and can use ActiveJob to reindex asynchronously. The one thing that isn't accounted for is how to promote that index once the indexing is complete. You can run the task like reindex(async: { wait: true }), which runs the indexing operation asynchronously, polls periodically until indexing completes, and then promotes the index. This almost works, but it can still take hours and I hate to sit on a server instance waiting for it to complete - what if I get disconnected or the instance terminates due to a deploy? We decided that it was time to build a small workflow around indexing large collections asynchronously that automatically promotes the new index upon completion.
Enter Tooling
My goals were to start a collection indexing operation from our internal admin tool, Dash, and to monitor its progress. With those two simple goals in mind, this is what I came up with.
I needed two jobs - one job to enqueue the indexing operation and another to monitor the operation until completion. Once those are built, I can use Rails basics to wrap this in a minimalistic UI.
The Jobs
The first job needed to:
- Accept a class name and start the reindex for the given class
- Store the pending index name for later usage, optimally in Redis
- Enqueue another job to monitor the progress of the indexing operation
This is what I came up with:
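module Searchkick
  class PerformAsyncReindexWorker
    include Sidekiq::Worker
    sidekiq_options retry: 0

    # klass is the model's class name as a string, e.g. the name of a searchable model.
    def perform(klass)
      # Start the async reindex; the result includes the name of the new, unpromoted index.
      result = klass.constantize.reindex(async: true)
      index_name = result[:index_name]

      # Remember the pending index in Redis, then hand off to the monitor job.
      Searchkick::AsyncReindexStatus.new.currently_reindexing << index_name
      Searchkick::MonitorAsyncReindexWorker.perform_in(5.seconds, klass, index_name)
    end
  end
end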
The second job, as you may have guessed, monitors the indexing progress until completion. The basic functionality needs to check the operation and:
- If incomplete, re-enqueue itself to check again in 10 seconds
- If complete, promote the index for the given class and remove the index name from the collection in Redis
module Searchkick
  class MonitorAsyncReindexWorker
    include Sidekiq::Worker

    def perform(klass, index_name)
      status = Searchkick.reindex_status(index_name)

      if status[:completed]
        # All batches are done: point the alias at the new index and clear the Redis entry.
        klass.constantize.search_index.promote(index_name)
        Searchkick::AsyncReindexStatus.new.currently_reindexing.delete(index_name)
      else
        # Still indexing: check again in 10 seconds.
        self.class.perform_in(10.seconds, klass, index_name)
      end
    end
  end
end
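To see the handoff between the two jobs without Elasticsearch, Sidekiq, or Redis running, here is a rough, dependency-free simulation. Everything in it is a stand-in invented for illustration: FakeSearch plays the role of Searchkick (its status hash only mimics the :completed flag the monitor job reads, and :batches_left is an assumption about the shape), the jobs array plays the role of Sidekiq's scheduled queue, and a plain Ruby array plays the role of the Redis list. It is a sketch of the control flow, not the real workers.

```ruby
# Hypothetical stand-in for Searchkick: counts down batches on each status check.
class FakeSearch
  attr_reader :promoted

  def initialize(batches)
    @batches_left = batches
    @promoted = nil
  end

  # Mimics a reindex status hash with a :completed flag.
  def reindex_status(_index_name)
    @batches_left -= 1 if @batches_left > 0
    { completed: @batches_left.zero?, batches_left: @batches_left }
  end

  def promote(index_name)
    @promoted = index_name
  end
end

# Runs the monitor logic until the index is promoted.
# Returns how many status checks were needed.
def run_monitor(search, currently_reindexing, index_name)
  jobs = [[:monitor, index_name]] # stands in for the scheduled Sidekiq queue
  checks = 0

  until jobs.empty?
    _job, name = jobs.shift
    checks += 1

    if search.reindex_status(name)[:completed]
      search.promote(name)
      currently_reindexing.delete(name)
    else
      jobs << [:monitor, name] # re-enqueue, like perform_in(10.seconds, ...)
    end
  end

  checks
end

search = FakeSearch.new(3)
in_flight = ['projects_example_index'] # made-up index name
checks = run_monitor(search, in_flight, 'projects_example_index')
```

After the run, the fake index has been promoted and the in-flight list is empty, which is the same end state the real monitor job leaves behind in Redis.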
By default, Searchkick::BulkReindexJob uses the same queue as regular async reindexing, which blocks user-generated content from being indexed while a full reindex is running. So I also patched Searchkick::BulkReindexJob to use a custom queue that we reserve for full collection indexing operations. In an initializer I simply did:
class Searchkick::BulkReindexJob
  queue_as { 'searchkick_full_reindex' }
end
The Status Object
You may be wondering what Searchkick::AsyncReindexStatus is. It is a simple class that includes the Redis::Objects library so that we can store a list of currently reindexing collections. It looks like this:
module Searchkick
  class AsyncReindexStatus
    include Redis::Objects

    def id
      'searchkick-async-reindex-status'
    end

    list :currently_reindexing
  end
end
Note: I opted to use Redis::Objects since it was already in our codebase and it is a bit simpler than interacting with Redis directly using Searchkick.redis.
How to Kick off the Job
An indexing operation can be kicked off in one of two ways. If you have console access, you can start it from the command line with Searchkick::PerformAsyncReindexWorker.perform_async(model_class), where model_class is the model's class name as a string (the job calls constantize on it). We also built a crude interface into our internal admin tool. The UI allows us to select a model, start the indexing operation, and then track its status until completion.
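Because the job constantizes whatever string it receives, an admin UI that enqueues it may want to check the selected model against an allow-list first. This is a minimal sketch of that idea, not code from our tool: REINDEXABLE_MODELS, the 'Project' entry, and validate_reindex_target are hypothetical names chosen for illustration.

```ruby
# Hypothetical allow-list of model names that may be reindexed from the UI.
REINDEXABLE_MODELS = %w[Project].freeze

# Returns the class name as a string if it is allowed, otherwise raises.
def validate_reindex_target(name)
  name = name.to_s
  unless REINDEXABLE_MODELS.include?(name)
    raise ArgumentError, "#{name.inspect} is not reindexable"
  end
  name
end

# In a controller action, the validated name would then be enqueued, e.g.:
#   Searchkick::PerformAsyncReindexWorker.perform_async(validate_reindex_target(params[:model]))
```

Passing the name as a string also keeps the job arguments JSON-serializable, which Sidekiq expects.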
The Code
For the full code that we use you can look at this gist. Always happy to hear improvements that could be made as well!
Recap
Searchkick saves us serious time and energy developing features. By taking one of its existing features like async reindexing and wrapping it in a bit of workflow, we can get even more out of it. In the end, we were able to scratch our own itch for truly async indexing operations.