jmc

Posted on Jun 17, 2019

Understanding Rails Eager-Loading

#ruby #rails #activerecord #sql

Note: this is an article that I posted a few years ago on Medium. I'm reposting here as it's likely to be more useful to someone that way.

If you're like me, nothing makes your eyes glaze over faster than a wet, steaming pile of machine-generated SQL. My aim was to write a Rails eager-loading guide that's heavy on motivation and light on log statements. Hopefully I can provide you with the explanation I never had.

This post assumes that you understand model associations, are comfortable with JOINs, and are familiar with the N+1 query problem. If any of that sounds foreign, I recommend playing with plain SQL first (SELECT, JOIN, & WHERE).

(That's the real trick to making this simple. While the way Active Record makes objects from records is intuitive enough, its query-generation abilities won't save you from having to understand how the queries execute in the db.)

Let's begin!

Why do we need eager loading?

In other words: why are N+1 queries bad? It's because database trips are expensive. Queries have overhead, transactions have overhead, and talking to a remote database server is slow. Not something you want to do in a loop, but ORMs make it easy to shoot yourself in the foot here.

The good news is that if we tell Active Record about the associations we plan to use later, it can preload the associated records with a small number of queries, so that any looping can happen over records that are already in memory.

`eager_load`: One query, with JOINs

Look at this example:

users = User.eager_load(:address)
cities = users.map do |user|
  user.addresses.city
end

Using the magic of a SQL JOIN, Active Record can fetch the users and addresses with a single query. And because eager_load uses a left join, you'll get all the users, not just the ones with addresses.

I'll spare you the full SQL output, but know that the SELECT clause contains many entries like "user.name" AS t0_r0. Now you may have noticed that t0_r0 is an emoji who cries at a right angle, but did you know that he and his numbered friends make every column from both tables available during the query? This means we can reference any attributes from an associated model in chained where or group calls, e.g.:

users = User.eager_load(:address).where(address: { city: 'New Haven' })
streets = users.map do |user|
  user.address.street
end

So we've got everything in one query, and our where calls work. Are we done?

Maybe not: it turns out that those factory-farm eager_load queries aren't just hard on the eyes -- they can also be slower than the queries our next option produces (by an order of magnitude or more!). This is partly because joining is expensive, and partly because joins can introduce duplication that increases the number of objects to create.

And Codd help you if the joins aren't on indexed columns.

`preload`: One query per table

This one is simple : AR generates a query for each referenced table. The lack of joins means that this tends to be faster even when taking the extra query overhead into account.

But while our first example works fine with preload swapped out for eager_load, our second, which needs to use addresses.city in the generated WHERE clause, will fail. Since the users query only SELECTs from users, the WHERE clause for that query can't refer to columns from the address table. (There addresses query won't help here because it's done separately.)

`includes`: `preload` by default, `eager_load` if necessary

This is the the most commonly-used method, and if you've read up to this point, it hopefully requires the least explanation. When using includes, by default AR loads each table separately, just like with preload. However, includes will switch to eager_load's join-all-the-things behavior if you mention the associated table in a chained where or group call.

Tip: if you're calling where or group with SQL strings rather than hashes, you can use references to trigger a join.

`joins`: ¯\_(ツ)_/¯

Unsurprisingly, joins also generates a query which uses JOINs (inner joins, this time).

The main advantage here is that you get more control over the generated query. Remember that eager_load and includes will return every column from the joined tables. That's true even when select is called!

Often you'll reach for joins when the result set dictates it -- when you only want the records described by an INNER JOIN, or when a where relies on columns from associated tables, but you don't want to drag in the rest of those tables' columns.

So although joins tends to appear in articles about eager-loading, it's mostly used to restrict what gets loaded. It can be used together with select for some surgical loads. Let's look at one:

users = User.joins(:address).select('addresses.city AS city')
cities = users.map do |user|
  user.city
end

Here, city becomes a new method on the user instances, because we used an AS alias in the argument to select.

Tip: this will lead to some objects having methods that others of the same class don't. This can potentially lead to confusing code -- you can mitigate that by also defining the extra method on the model class. That way, preloading is just an optional speedup, rather than required to make the calls work. Might be a good time to introduce a query object, though.

Note that with joins, the "select + alias → new method" trick is the only way to use the eager-loaded values. Calling user.address.city instead in that loop will ignore the in-memory values and fire as many new queries as there are users.

Other optimizations

It can be confusing to pick apart which speedups come from which parts of Active Record. For the sake of completeness, let's look at a couple of unrelated tricks that AR uses to avoid querying.

First, AR uses proxy objects to represent queries, and only replaces them with result sets when you call a method like all, first, count, or each. This design is also what enables us to build queries via chained calls.

AR also caches results for recent queries. If you fire the same query multiple times, the database is only contacted once. (When the log says CACHE instead of LOAD, the results are being read from memory rather than the db.)

Unfortunately, neither of these helps with the N+1 problem, where loops in application code fire multiple unique queries. To solve that we still need eager-loading.

Final tips

There are usually multiple ways to fetch the same data. Understanding each means that you can pick the fastest, or the most readable.
When benchmarking, try to use a setup resembling your production environment. A local database will have much less overhead per-query than a remote one.
You should also index on columns you intend to JOIN on, but note that indexing isn't a cure-all. Indexes slow down INSERTs and UPDATEs, and can even slow down reads (if "storage order" suffices, e.g. fetching the entire table).
When eager-loading still isn't enough, a useful pattern is to build a hash map connecting one model to another, and look up associated values "manually" using the map. This can be especially useful in a Presenter / View Object.
There are also libraries like goldiloader which attempt to automate eager-loading beyond what AR provides.
When you only want certain columns, use pluck.

DEV Community

Understanding Rails Eager-Loading

Why do we need eager loading?

`eager_load`: One query, with JOINs

`preload`: One query per table

`includes`: `preload` by default, `eager_load` if necessary

`joins`: ¯\_(ツ)_/¯

Other optimizations

Final tips

Suggested reading

Top comments (0)

Read next

NoSQL vs NewSQL vs Distributed SQL: A Comprehensive Comparison

LINQPad for everything(or at least a lot)

TypedSQL in Prisma: Type-Safe SQL Made Simple

SQL, get median of all columns in each row #eg21

Why do we need eager loading?

eager_load: One query, with JOINs

preload: One query per table

includes: preload by default, eager_load if necessary

joins: ¯\_(ツ)_/¯

Other optimizations

Final tips

Suggested reading

Read next

NoSQL vs NewSQL vs Distributed SQL: A Comprehensive Comparison

LINQPad for everything(or at least a lot)

TypedSQL in Prisma: Type-Safe SQL Made Simple

SQL, get median of all columns in each row #eg21

`eager_load`: One query, with JOINs

`preload`: One query per table

`includes`: `preload` by default, `eager_load` if necessary

`joins`: ¯\_(ツ)_/¯