exAspArk

Posted on Aug 16, 2017

Batching – A powerful way to solve N+1 queries every Rubyist should know

#batching #ruby #restfulapi #graphql

In this post, I’m going to tell you about batching as a technique to help avoid N+1 queries, existing battle-tested tools like Haskell Haxl and JavaScript DataLoader, and how similar approaches can be used in any Ruby program.

What are N+1 queries?

First, let’s find out what N+1 queries are and why they are called that. Imagine, we have 2 SQL tables: users and posts. If we run the following code by using ActiveRecord models:

posts = Post.where(id: [1, 2, 3])
# SELECT * FROM posts WHERE id IN (1, 2, 3)

users = posts.map { |post| post.user }
# SELECT * FROM users WHERE id = 1
# SELECT * FROM users WHERE id = 2
# SELECT * FROM users WHERE id = 3

This makes 1 query to select all posts, then N queries to select users for each post. So, the code produces 1+N queries. Because changing the order of the addends does not change the sum, it’s common to call it N+1 queries.

How do people usually solve N+1 queries?

Generally, there are 2 popular ways to solve the problem of N+1 queries in the Ruby world:

Eager loading data in models

posts = Post.where(id: [1, 2, 3]).includes(:user)
# SELECT * FROM posts WHERE id IN (1, 2, 3)
# SELECT * FROM users WHERE id IN (1, 2, 3)

users = posts.map { |post| post.user }

It preloads the specified associations under the hood, which makes it easy to use. But ORM can’t always help. For example, in a case when we need to load data from different sources such as another database.

Preloading data and passing it through arguments

class Post < ApplicationRecord
  def rating(like_count, angry_count)
    like_count * 2 - angry_count
  end
end

posts = Post.where(id: [1, 2, 3])
# SELECT * FROM posts WHERE id IN (1, 2, 3)

post_emoticons = Emoticon.where(post_id: posts.map(&:id))

like_count_by_post_id = post_emoticons.like.group(:post_id).count
angry_count_by_post_id = post_emoticons.angry.group(:post_id).count
# SELECT COUNT(*) FROM emoticons WHERE name = 'like' AND post_id IN (1, 2, 3) GROUP BY post_id
# SELECT COUNT(*) FROM emoticons WHERE name = 'angry' AND post_id IN (1, 2, 3) GROUP BY post_id

posts.map do |post|
  post.rating(
    like_count_by_post_id[post.id],
    angry_count_by_post_id[post.id]
  )
end

This method is very flexible and can work when the simpler method of using includes can’t. And it may also be more memory efficient – in our example above we don’t load all emoticons to calculate a rating for each post. Instead, our database can do all the hard work and return just a count for each post, which is passed as an argument. However, passing the data through arguments can be complicated, especially when there are several layers below (e.g. load Emoticons, pass through Users to Posts).

Both of these approaches can help us to avoid N+1 queries. But the problem is that we should know in advance on the top level which data we need to preload. And we will preload it every time, even if it’s not necessary. For example, with GraphQL these approaches don’t work since with GraphQL we can’t predict which fields users are going to ask in a query. Is there any other way to avoid N+1 queries? Yes, it’s called batching.

What is batching?

Batching isn’t a new way to solve N+1 queries. Facebook released the Haskel Haxl library in 2014, but the technique has been used long before. It uses such concepts as Monad, Applicative and Functor. I won’t get into explaining these ideas as it deserves a separate post (would you be interested in learning more about functional programming in Ruby?).

The idea of batching was also implemented in other programming languages. One of the most well-known libraries is JavaScript DataLoader, which became very popular with the rise of GraphQL. Here is a great video about its source code by Lee Byron, an engineer at Facebook. The code is pretty straight forward and contains just 300 lines.

These are the general steps for batching:

Passing an item to load in any part of the app.
Loading and caching values for the passed items in batch.
Getting the loaded value where the item was passed.

The main advantage of using this technique is that batching is isolated. It allows to load data where and when it’s needed.

Here is a basic example of using JavaScript DataLoader:

var batch = (userIds) => ...;
var loader = new DataLoader(userIds => batch(userIds));

// “load” schedules a job to dispatch a queue with
// Node.js “process.nextTick” and returns a promise

loader.load(userId1).then(user1 => console.log(user1));
loader.load(userId2).then(user2 => console.log(user2));
loader.load(userId3).then(user3 => console.log(user3));

First, we are creating a loader with a function which accepts all the collected items to load (userIds). These items are passed to our batch function which loads all users at once. Then we can call the loader.load function which returns a promise with a loaded value (user). OK, but what about Ruby?

Batching in Ruby

At Universe, we have monthly hackathons when everyone is free to experiment with any ideas and technologies such as Ethereum, Elixir, Progressive Web App, etc. During my last hackathon, which I won in 1 of the nominations, I was learning about existing techniques to avoid N+1 queries in GraphQL and building a tool to transform GraphQL query to MongoDB Aggregation Pipeline for batching. Check out our previous post which describes how to use Aggregation Pipeline to “join” different collections, filter them, serialize and so on.

At the same time, while we’re migrating to GraphQL, we’re still supporting our RESTful APIs. Usually, N+1 DB queries and HTTP requests are the main problems which cause bottlenecks in our applications as we continue to scale our platform.

That is why we decided to create a new tool which will allow us to solve N+1 queries in our existing RESTful APIs as well as in GraphQL. A simple tool which every Ruby developer will be able to understand and use. It’s called BatchLoader.

Laziness

class Post < ApplicationRecord
  belongs_to :user

  def user_lazy
    # something cool with BatchLoader
  end
end

posts = Post.where(id: [1, 2, 3])
# SELECT * FROM posts WHERE id IN (1, 2, 3)

users_lazy = posts.map { |post| post.user_lazy }

BatchLoader.sync!(users_lazy)
# SELECT * FROM users WHERE id IN (1, 2, 3)

BatchLoader doesn’t try to mimic implementations in other programming languages which have an asynchronous nature. So, it doesn’t use any extra primitives such as Promises. There is no reason to use them in Ruby unless you’re using something like EventMachine.

Instead, it uses the idea of “lazy objects”, which are used in Ruby standard library. For example, “lazy arrays” – they allow to manipulate with elements and resolve them at the end when it’s necessary:

range = 1..Float::INFINITY

values_lazy = range.lazy.map { |i| i * i }.take(10)

values_lazy.force
# => [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

As you can see, the 2 code blocks above have the same pattern:

Collect lazy objects.
Resolve them at the end.

Batching

Now let’s have a closer look at Post#user_lazy, which returns a lazy BatchLoader instance:

# app/models/post.rb
def user_lazy
  BatchLoader.for(user_id).batch do |user_ids|
    User.where(id: user_ids)
  end
end

BatchLoader.for accepts an item (user_id) which should be collected and used for batching later. Then we call the batch method where we pass a block which will use all the collected items (user_ids). Inside the block, we execute a batch query for our items (User.where).

JavaScript DataLoader maps the passed item and loaded value implicitly. But it relies on these 2 constraints:

The array of passed items (user_ids) must be the same length as the array of loaded values (users). This usually means that we have to add nils in the array for absent values.
Each index in the array of passed items must correspond to the same index in the array of loaded values. This usually means that we should sort the loaded values.

BatchLoader, in this case, provides a load method which can be simply called to map a passed item (user_id) to loaded value (user):

# app/models/post.rb
def user_lazy
  BatchLoader.for(user_id).batch do |user_ids, batch_loader|
    User.where(id: user_ids).each { |u| batch_loader.load(u.id, u) }
  end
end

RESTful API example

Now imagine we have a regular Rails application with N+1 HTTP requests:

# app/models/post.rb
class Post < ApplicationRecord
  def rating
    HttpClient.request(:get, "https://example.com/ratings/#{id}")
  end
end

# app/controllers/posts_controller.rb
class PostsController < ApplicationController
  def index
    posts = Post.limit(10)
    serialized_posts = posts.map do |post| 
      {id: post.id, rating: post.rating} # <== N+1 HTTP requests
      end

    render json: serialized_posts
  end
end

We can batch the requests with a gem called parallel by executing all HTTP requests concurrently in threads. Thankfully, MRI releases GIL (global interpreter lock) when a thread hits blocking I/O – HTTP request in our case.

# app/models/post.rb
def rating_lazy
  BatchLoader.for(post).batch do |posts, batch_loader|
    Parallel.each(posts, in_threads: 10) do |post|
      batch_loader.load(post, post.rating)
    end
  end
end

# app/controllers/posts_controller.rb
class PostsController < ApplicationController
  def index
    posts = Post.limit(10)
    serialized_posts = posts.map do |post| 
      {id: post.id, rating: post.lazy_rating}
    end

    render json: BatchLoader.sync!(serialized_posts)
  end
end

Thread-safety

The example with concurrent HTTP requests in threads will work only if HttpClient is thread-safe. BatchLoader#load is thread-safe out of the box, so it doesn’t need any extra dependencies.

GraphQL example

Batching is particularly useful with GraphQL. Using such techniques as preloading data in advance to avoid N+1 queries can be very complicated since a user can ask for any available fields in a query. Let’s take a look at the simple graphql-ruby schema example:

Schema = GraphQL::Schema.define do
  query QueryType
end

QueryType = GraphQL::ObjectType.define do
  name "Query"
  field :posts, !types[PostType], resolve: ->(obj, args, ctx) do
    Post.all
  end
end

PostType = GraphQL::ObjectType.define do
  name "Post"
  field :user, !UserType, resolve: ->(post, args, ctx) do
    post.user # <== N+1 queries
  end
end

UserType = GraphQL::ObjectType.define do
  name "User"
  field :name, !types.String
end

If we want to execute a simple query like the following, we will get N+1 queries for each post.user:

query = "
{
  posts {
    user {
      name
    }
  }
}
"
Schema.execute(query)
# SELECT * FROM posts WHERE id IN (1, 2, 3)
# SELECT * FROM users WHERE id = 1
# SELECT * FROM users WHERE id = 2
# SELECT * FROM users WHERE id = 3

To avoid this problem, all we have to do is to change the resolver to use BatchLoader:

PostType = GraphQL::ObjectType.define do
  name "Post"
  field :user, !UserType, resolve: ->(post, args, ctx) do
    BatchLoader.for(post.user_id).batch do |ids, batch_loader|
      User.where(id: ids).each { |u| batch_loader.load(u.id, u) }
    end
  end
end

And setup GraphQL with the built-in lazy_resolve method:

Schema = GraphQL::Schema.define do
  query QueryType
  lazy_resolve BatchLoader, :sync
end

That’s it. GraphQL lazy_resolve will basically call a resolve lambda on the field. If it returns an instance of a lazy BatchLoader, it’ll call BatchLoader#sync later to get the actual loaded value automatically.

Caching

BatchLoader also provides a caching mechanism out of the box. So, it won’t make queries for already loaded values. For example:

def user_lazy(id)
  BatchLoader.for(id).batch do |ids, batch_loader|
    User.where(id: ids).each { |u| batch_loader.load(u.id, u) }
  end
end

user_lazy(1)      # no request
# => <#BatchLoader>

user_lazy(1).sync # SELECT * FROM users WHERE id IN (1)
# => <#User>

user_lazy(1).sync # no request
# => <#User>

Conclusion

Overall, batching is a powerful technique to avoid N+1 queries. I believe that every Ruby developer should know about it, and not just people who use GraphQL or other programming languages. It will allow you to decouple unrelated parts of your application and load your data where and when it’s needed in batch without sacrificing the performance.

At Universe, we’re using BatchLoader on production for both RESTful API and GraphQL by sharing the same code. For more information, check out the README and the source code, which has just 150 lines.

Originally published on Medium.

DEV Community