Dane Wu

Posted on Jun 26 • Edited on Jun 27

ActiveRecord: You Wanted a Number, but Loaded 500,000 Rows Into Memory

#ruby #rails #performance #database

Rails Performance: Lessons from Production — #3

The first two posts were about making queries cheaper — fewer N+1s, indexes, not dragging data back into Ruby. This one turns that into a more practical habit: let the database do what the database can do. A .count in the wrong place, a misused .present?, and you can pull an entire table into memory just to get a single number. Same example throughout (a shipments table).

💥 A report page that ate the server's memory

We had an admin report showing, per courier, the number of delivered shipments. It was fine — until someone tweaked the filtering logic. After that, opening the page spiked memory, and under traffic it OOM-killed the server.

The culprit was this line:

courier.shipments.select { |s| s.status == "delivered" }.size

Looks reasonable — "it's just select." But this .select { ... } (with a block) is Ruby's Array method, not ActiveRecord's select(:col). It loads all of that courier's shipments into memory first, then filters them one by one in Ruby. Fine on a small table; on a big one it eats all the RAM.

One word, select, two completely different meanings — one builds SQL, the other loads everything and filters in Ruby. It's one of the easiest traps to fall into and the hardest to spot. The fix is to let the DB filter and count:

courier.shipments.where(status: "delivered").count
# → SELECT COUNT(*) FROM shipments WHERE courier_id = ? AND status = 'delivered'

The DB filters, counts, and returns a single number. The memory problem vanishes.

Behind this is a recurring pattern: for a single number, boolean, or sum, you drag a whole batch of rows back into Ruby. This post walks through its common variants.

Continuing the four layers from series #1: moving data from ③ the DB back to ④ memory has a cost. The whole theme is — don't move it if you don't have to, and if you must, move less.

🔢 Counting: `count` vs `size` vs `length`

All three "count things," but they work differently — pick wrong and you hit the trap above.

count — always sends SELECT COUNT(*) to the DB and returns a number.

Shipment.where(status: "delivered").count
# → SELECT COUNT(*) FROM shipments WHERE status = 'delivered'   always queries the DB

Downside: even if the records are already loaded, it still queries the DB again.

length — can only count an in-memory array, so it force-loads everything then counts.

shipments = Shipment.where(status: "delivered")
shipments.length
# → SELECT * FROM shipments WHERE status = 'delivered'   pulls the whole batch into memory, then counts the array

Downside: if not loaded, it drags the whole batch into memory (the opening trap).

size — picks based on state: not loaded → acts like count; already loaded → acts like length. Always the cheapest option.

shipments = Shipment.where(status: "delivered")
shipments.size      # not loaded → runs count, SELECT COUNT(*), no rows moved
shipments.to_a      # (this loads the data into memory)
shipments.size      # loaded → counts the in-memory array, no DB query

	Behavior	Risk
`count`	always queries the DB	re-queries even when loaded
`length`	always uses memory	drags the whole batch in when not loaded
`size`	not loaded → DB, loaded → memory	almost none, safest

How to choose: only want a number, won't use the rows → count; data already (or about to be) loaded and you count while using it → size; length almost never.

This also explains why series #1 used c.shipments.size in the view, not .count — shipments were preloaded into memory, so size counts memory; .count would re-query the DB per courier, bringing the N+1 right back.

❓ Existence: `exists?` vs `present?`

You want to know whether a courier has any delivered shipments — just true / false.

❌ present? / any? — loads the matching rows into memory first, then checks if the array is empty:

courier.shipments.where(status: "delivered").present?
# → SELECT * FROM shipments WHERE courier_id = 1 AND status = 'delivered'
#   (loads all matches, just to decide "any?")

✅ exists? — asks the DB and stops at the first hit:

courier.shipments.where(status: "delivered").exists?
# → SELECT 1 FROM shipments WHERE courier_id = 1 AND status = 'delivered' LIMIT 1

LIMIT 1 is the key: the DB returns "yes" on the first match — no full scan, no rows moved.

Exception: if you'll use those rows right after the check, load them first with .to_a and use present? — the load isn't wasted:

shipments = courier.shipments.where(status: "delivered").to_a
if shipments.present?   # ✅ present? is right here, the data is already needed
  shipments.each { ... }
end

In one line: just checking existence, won't use the rows → exists?; using the rows right after → load first, then present?.

➕ Aggregation: sum, average, group — hand them to the DB

Compute "total weight of all delivered shipments."

❌ Pull rows back and loop in Ruby:

Shipment.where(status: "delivered").sum { |s| s.weight }   # block = Ruby method

sum with a block is Ruby's Array method — it loads 500k objects into memory, then loops and adds them one by one.

✅ Let the DB sum:

Shipment.where(status: "delivered").sum(:weight)   # column name = ActiveRecord method
# → SELECT SUM(weight) FROM shipments WHERE status = 'delivered'

The difference is the argument: sum(:weight) passes a column symbol → builds SQL for the DB ✅; sum { |s| s.weight } passes a block → loads then loops in Ruby ❌.

Other aggregates work the same — ActiveRecord wraps them:

Shipment.average(:weight)   # AVG
Shipment.maximum(:weight)   # MAX
Shipment.minimum(:weight)   # MIN

Grouped counts (how many each courier shipped) also go to the DB:

# ❌ pull everything, group in Ruby
Shipment.all.group_by(&:courier_id).transform_values(&:count)

# ✅ DB groups & counts, returns a tiny result
Shipment.group(:courier_id).count
# → SELECT courier_id, COUNT(*) FROM shipments GROUP BY courier_id
# → { 1 => 1200, 2 => 800, ... }

Rule of thumb: check whether sum / count is given a column name or a block — column name goes to the DB (right), block runs in Ruby (usually wrong).

In one line: sum, average, count, group — operations that "collapse a pile of rows into one result" are exactly what the DB does, right next to the data. Pulling rows back to compute = hauling a truckload home just to read one number off it.

📦 When you really must: `find_each` loads in batches

Sometimes you genuinely have to pull every row back into Ruby (sending emails, generating reports, exporting). For that unavoidable case, the point shifts to: don't load all at once — load in batches.

❌ .each — loads everything into memory at once:

Shipment.where(status: "delivered").each do |s|
  SendMailJob.perform_later(s)
end
# → SELECT * ... pulls all 500k rows; every object in RAM at once → possible OOM

✅ find_each — fetches batch by batch under the hood (1000 by default), processing one batch before loading the next:

Shipment.where(status: "delivered").find_each do |s|
  SendMailJob.perform_later(s)
end
# → SELECT * ... ORDER BY id LIMIT 1000
#   SELECT * ... AND id > 1000 ORDER BY id LIMIT 1000
#   ...uses the last id of the previous batch as the next starting point

At any moment only 1000 rows are in memory — 500k won't blow up. Batch size is tunable: find_each(batch_size: 500).

For operating on a whole batch (bulk updates), use in_batches:

Shipment.where(status: "pending").in_batches do |batch|
  batch.update_all(status: "processing")   # update a whole batch at once, not row by row
end

🏁 Wrap-up

You want	❌ pull into Ruby	✅ keep in the DB / batch
a count	`.to_a.count`	`.count` / `.size`
existence	`.present?`	`.exists?`
sum / avg / group	pull back and loop	`sum(:column)`, `group.count`
process many rows	`.each` (load all)	`find_each` (batch)

One principle runs through all of it:

Let the database do what the database can do; when you must pull into Ruby, do it in batches, not all at once.

This is the same thing as series #1 — moving data into memory has a cost. That opening OOM wasn't caused by "too much data," but by "moving too much data back for a single number." Before you write count, sum, or present?, ask: do I want a result, or do I actually need this whole batch of rows? It's usually the former — so let the DB hand you the result.

DEV Community

ActiveRecord: You Wanted a Number, but Loaded 500,000 Rows Into Memory

💥 A report page that ate the server's memory

🔢 Counting: `count` vs `size` vs `length`

❓ Existence: `exists?` vs `present?`

➕ Aggregation: sum, average, group — hand them to the DB

📦 When you really must: `find_each` loads in batches

🏁 Wrap-up

Top comments (0)

💥 A report page that ate the server's memory

🔢 Counting: count vs size vs length

❓ Existence: exists? vs present?

➕ Aggregation: sum, average, group — hand them to the DB

📦 When you really must: find_each loads in batches

🏁 Wrap-up

🔢 Counting: `count` vs `size` vs `length`

❓ Existence: `exists?` vs `present?`

📦 When you really must: `find_each` loads in batches