Introduction
The SQL provides us many features to query with efficiency: sub-queries, custom joins, aggregations, etc. We can use all of this with an ORM like ActiveRecord, however often we ignore it. Why? I believe there are three main reasons for this:
- You're not an advanced user of SQL;
- You don't know enough of
ActiveRecordto use its advanced features; - You simply are attained to some bad habits because of the facilities of your programming language (after all, it's much simpler to iterate over ruby objects doing simply comparisons than thinking in a complex custom join, right?).
All good with that until you don't face the side-effects of poor writing code: memory leak and low performance.
This post has the intention of deepening a little bit more in what ActiveRecord can do to help you avoid the problems mentioned above, and benchmark the ActiveRecord approach with some other solution using a ruby Enumerable.
Note: If you wanna follow along with the coding of this post, you can use this script with all the data, queries, and comparisons I used.
Domain
In this post, a simplified version of a Cryptocurrency Exchange will be used as a domain. To better understand what it's being queried in the next sections, here is an explanation of the meaning of each table:
- networks: The networks with a running blockchain, each one working with a specific set of currencies;
- currencies: Cryptocurrencies;
- currency_networks: An association table linking a currency to a network, meaning that this currency can be processed in the specified network;
- pairs: A pair of currencies that can be traded, e.g., "BTC-ETH" means "buying ETH (quote currency) with BTC (base currency)";
- tickers: Last 24h information from a specific currency pair.
It's not in the scope of this post to discuss anything related to the right way of modeling this domain. The domain was built just thinking in exemplifying queries.
Querying belongs_to associations
Consider the following domain models:
class Network < ActiveRecord::Base
has_many :currency_networks
has_many :currencies, through: :currency_networks
end
class Currency < ActiveRecord::Base
has_many :currency_networks
has_many :networks, through: :currency_networks
end
class CurrencyNetwork < ActiveRecord::Base
belongs_to :currency
belongs_to :network
end
Let's say we wanna have all currency networks where the network is active. We can solve this using a ruby Enumerable:
CurrencyNetwork.all.select { |cn| cn.network.active? }
Although that solves our problem, we're potentially scaling to new ones:
- Our
selectblock is hitting the database for eachCurrencyNetworkwe retrieved; - We're retrieving more data than we really need with
cn.network: we don't wanna retrieve theNetworkdata, just theCurrencyNetwork. Bare in mind that when we callcn.networkwe are instantiating a newActiveRecordobject to store theNetworkresult, consuming more memory than needed.
Those aren't big problems to a situation where we don't have many records in the database, as in the case of the script I'm running to exemplify this command, but you can imagine what that cause for your application depending on how frequent you run this action and the number of rows in your database.
We can improve this by delegating great part of the heavy lifting to our database doing this:
CurrencyNetwork.joins(:network).where(networks: { active: true })
If we append to_sql at the end of this query we can see the SQL that's producing:
SELECT "currency_networks".*
FROM "currency_networks"
INNER JOIN "networks"
ON "networks"."id" = "currency_networks"."network_id"
joins is a powerful resource responsible for joining our tables (INNER JOIN), what we can do because CurrencyNetwork belongs_to :network (specified on the ON clause). ActiveRecord provides us a clean way to access the table attributes from our belongs_to association with the hash syntax: { networks: { active: true } }.
What are the benefits of your query approach:
- Performance: We only hit the database once, compared to the potentially thousands of times with the
Enumerablesolution; - Spare memory: We only receive
ActiveRecordinstances fromCurrencyNetworkinstead of theNetworkinstances that came as plus with theEnumerablesolution.
We don't need to rely upon a priori reasoning only, we can use memory_profiles and benchmark_ips to compare the memory consumption and iterations per second of each solution.
Belongs to query - Enumerable solution (memory profiler):
Total allocated: 120236 bytes (1290 objects)
Total retained: 27100 bytes (224 objects)
Belongs to query - ActiveRecord query solution (memory profiler):
Total allocated: 4584 bytes (52 objects)
Total retained: 360 bytes (7 objects)
Considering how the database was populated in the script used for this test, with the query solution we consume ~26x less memory with the total allocations and ~75x less memory with retained allocations.
IPS - Belongs to query - ActiveRecord query solution: 17716.1 i/s
IPS - Belongs to query - Enumerable solution: 656.8 i/s - 26.97x (± 0.00) slower
Comparing the iterations per second, we can check that the Enumerable solution is 26.97x slower.
Querying has_many associations
Consider the same domain models of the previous section. We wanna do a slightly similar query that the previous one, but now our result must be the currencies belonging to active networks.
Enumerable solution:
Currency.all.select { |currency| currency.networks.where(active: true).any? }
In this example, I'm doing things propositional worse mixing a query with a where clause, but this is probably with what you'd come up to if not using joins.
ActiveRecord query solution:
Currency
.joins(currency_networks: :network)
.where(networks: { active: true })
You can notice that we use the same interface to query has_many as the one used with belongs_to, so there's nothing new here besides the pluralized relationship name in the joins.
The benefits are the same we saw in the previous section, and we can benchmark both approaches here too:
Has many query - Enumerable solution (memory profiler):
Total allocated: 171024 bytes (2107 objects)
Total retained: 24566 bytes (243 objects)
Has many query - ActiveRecord query solution (memory profiler):
Total allocated: 4464 bytes (50 objects)
Total retained: 280 bytes (6 objects)
We can see an, even more, accentuated memory consumption in this case with the Enumerable solution. The ActiveRecord query solution consumed ~38x less memory in the total allocated, and ~87x less memory in the total retained.
IPS - Has many query - ActiveRecord query solution: 23395.8 i/s
IPS - Has many query - Enumerable solution: 569.6 i/s - 41.08x (± 0.00) slower
No surprise, the Enumerator solution is 41.08x slower.
Querying With Aggregations and Subqueries
Aggregations are actions that imply in perform any mathematical operation in your data, be it counts, sums, averages calculations, or comparisons to retrieve maximums and minimums.
Consider the following domain models:
class Currency < ActiveRecord::Base
has_many :currency_networks
has_many :networks, through: :currency_networks
has_many :pairs_as_base, class_name: "Pair", foreign_key: :base_currency_id
has_many :pairs_as_quote, class_name: "Pair", foreign_key: :quote_currency_id
end
class Pair < ActiveRecord::Base
has_many :tickers
belongs_to :base_currency, class_name: "Currency", foreign_key: :base_currency_id
belongs_to :quote_currency, class_name: "Currency", foreign_key: :quote_currency_id
end
class Ticker < ActiveRecord::Base
belongs_to :pair
validates_presence_of :trade_count
end
Using aggregation we can answer how many trades were performed in all pairs:
Ticker.avg(:trade_count)
# => 458926.666666667
We deliver the heavy lifting of looking at each ticker record and performing the calculation to our database, which is good. At this point, you already have a notion of the performance improvement of ActiveRecord queries over Enumerable based solutions to do the same, so I'll spare you of the performance metrics in more simple examples like this one.
Bear in mind the result of this query I added as commentary in the snippet above: it's an integer, not an ActiveRecord relation. So it's not possible to keep chaining queries after calling an aggregation on that way.
You can expect another outcome if you're using aggregation over grouped columns. Let's say we wanna retrieve the sum of trade_count for each currency that is a base_currency of a Pair:
Currency
.joins(pairs_as_base: :tickers)
.group("currencies.name").sum(:trade_count)
# => {"Bitcoin"=>930518, "Ethereum"=>103466, "Polkadot"=>342796}
As a result, you have a hash with your aggregated answer.
Subqueries
Subqueries are queries inside queries, and can be placed basically anywhere in your query: SELECT, WHERE, FROM, etc. You can perform interesting queries combining aggregations with subqueries.
Let's say we wanna list all pairs are inside tickers that have the trade_count above the average trade counts of all tickers:
# Hits the database twice
Pair
.joins(:tickers)
.where("trade_count >= :avg", avg: Ticker.average(:trade_count))
# Hits the database only once
Pair
.joins(:tickers)
.where("trade_count >= (:avg)", avg: Ticker.select("AVG(trade_count)"))
Why the first query hits the database twice, and the second query only once? It's a detail of the ActiveRecord syntax.
Remember what we saw just before: calling the aggregation method as a ruby method directly from ActiveRecord object returns to us a number, and that can only be done by executing the query. So Ticker.average(:trade_count) is returning a number, and ActiveRecord is replacing :avg with it.
If you wanna have lazy using aggregation inside a query, we explicitly express the aggregation as an SQL function like here Ticker.select("AVG(trade_count)"). In the example of the single database hit, ActiveRecord is taking this query and injecting it in place of :avg. Be aware that here we're calling (:avg)! That's because a subquery must be evaluated inside (), otherwise an error will be raised. Appending to_sql in the single database hit query we can see the query that's performed:
SELECT "tickers".*
FROM "tickers"
WHERE (trade_count > (SELECT AVG(trade_count) FROM "tickers"))
Following to another query, we could ask now for the tickers that have their trade_count above or equal the average trade_count of all listed tickers:
avg_trade_count_sql = Ticker.select("AVG(trade_count)").to_sql
Ticker
.group(:pair_id)
.having("tickers.trade_count >= (#{avg_trade_count_sql})")
Here we use the subquery in the having clause, producing the following query:
SELECT "tickers".*
FROM "tickers"
GROUP BY "tickers"."pair_id"
HAVING (tickers.trade_count >= (SELECT AVG(trade_count) FROM "tickers"))
Or we could do a query to analyze this info in another way: get the average ticker trade_count for each pair, and only return the tickers that have trade_count above or equal the average trade_count of the tickers belonging to given pair:
avg_trade_count_by_pairs_sql = Ticker.select("pair_id, AVG(trade_count) as average").group(:pair_id).to_sql
Ticker
.joins("INNER JOIN (" + average_trade_count_by_pairs_sql + ") trades ON tickers.pair_id = trades.pair_id")
.where("tickers.trade_count >= trades.average")
This produces the following SQL:
SELECT "tickers".*
FROM "tickers"
INNER JOIN (
SELECT pair_id, AVG(trade_count) as average
FROM "tickers" GROUP BY "tickers"."pair_id"
) trades
ON tickers.pair_id = trades.pair_id
WHERE (tickers.trade_count >= trades.average)
This is a tricky one, so let's break it down to understand how we're doing this. First off, we have our subquery SQL in avg_trade_count_by_pairs_sql. If executed individually, will return to you an aggregated result like the following:
Each pair_id is listed with an aggregated result that represents the average of trade_count of all of the tickets belonging to the pair (i.e., SELECT * FROM tickets WHERE pair_id = [THE PAIR ID]). This SQL is merged inside the Ticker custom join (INNER JOIN) we named as trades. In the end, we have tickers.trade_count from the current ticker row being iterated, and trades.average that we have from our subquery, and represents the average ticker_count of the ticker pair of the current iteration (association made because of our ON tickers.pair_id = trades.pair_id from the custom join).
We're already writing custom joins, anything beyond that it's more SQL specifics than ActiveRecord, so we can end up here. To finish, though, let's benchmark this query with a simple solution using ruby Enumerable:
tickers = []
Pair.all.each do |pair|
avg = Ticker.where(pair: pair).average(:trade_count)
selectable_tickers = Ticker.where(pair: pair).where("trade_count >= ?", avg)
tickers.push(*selectable_tickers.to_a)
end
Avg trade count by pairs - Enumerable solution (memory profiler)
Total allocated: 72357 bytes (872 objects)
Total retained: 573 bytes (9 objects)
Avg trade count by pairs - ActiveRecord query solution (memory profiler)
Total allocated: 6348 bytes (86 objects)
Total retained: 296 bytes (5 objects)
Our ActiveRecord solution consumes ~11x less memory in the total allocated, and ~2x less memory in the total retained.
IPS - Avg trade count by pairs - ActiveRecord query solution: 15095.6 i/s
IPS - Avg trade count by pairs - Enumerable solution: 740.6 i/s - 20.38x (± 0.00) slower
Comparing the iterations per second, we can check that the Enumerable solution is 20.38x slower.
Conclusion
In this post, we saw how to use advanced ActiveRecord features to go beyond a simple where clause. We saw how to use ActiveRecord to query belongs_to and has_many associations, using joins, aggregations, and subqueries, combining all.
We also compared some queries with the equivalent using a ruby Enumerable based solution. For benchmarking both approaches we checked the memory consumption and iterations per second, which showed without surprise the Enumerable based solutions consuming in most cases dozens of times more memory because of extra allocations and being also dozens of times slower.
Of course, we should keep in mind that the benchmark was done using only a very small database. We weren't worried about checking how much worse it can get with larger databases, or searching for points of stabilization (a.k.a., "it doesn't get much worse than that"). The idea of benchmarking was to show you how bad it can be when you rely upon ruby Enumerator to do things that our database is capable of doing through ActiveRecord, which is a common bad "code smell".
This is it! If you have any comments or suggestions, don't hold back, let me know.
Options if you like my content and would like to support me directly (never required, but much appreciated):
BTC address: bc1q5l93xue3hxrrwdjxcqyjhaxfw6vz0ycdw2sg06


Top comments (2)
Great post really helpful and well written.
I was researching it and I was surprised when I look that you @rwehresmann are the author.
Haha What a coincidence! I'm glad to see that you liked, it's great to see you here