Robert Nubel

Posted on Jan 10, 2023

A Tale of Hashery and Woe: How Mutable Hash Keys Led to an ActiveRecord Bug

#rails #ruby #programming

In the changelog of Active Record right after version 7.0.4 is a small little bugfix that would, if I wasn't about to quote it and follow it up with this whole article, probably not catch your attention:

Fix a case where the query cache can return wrong values. See #46044

Aaron Patterson

Behind this innocuous release note, though, is a fascinating tale of how hashes in Ruby really work that will take us from the Rails codebase down to the MRI source code. And it starts with a simple question:

Will this Ruby snippet print "hit" or "miss"?

key = {}
hash = { key => "hit" }
key[:foo] = rand(1e6)

puts hash[key] || "miss"

Well... it depends.

The only correct answer is "yes". It could print either value! If you think that's crazy, well, let me prove it:

# ruby 3.2; same results on 2.7
1_000_000.times.each_with_object(Hash.new(0)) do |_, result|
  key = {}
  hash = { key => "hit" }
  key[:foo] = rand(1e6)

  result[hash[key] || "miss"] += 1
end

=> {"miss"=>996050, "hit"=>3950}

Wild, right? To get to the bottom of this, we'll have to dig into the Ruby source code and find the Hash implementation.

Pop open the hood

Ruby hashes are implemented in the succinctly named hash.c. This bug is happening when we look up a value, which ultimately requires us to find the entry in the hash's underlying array. The ar_find_entry function is responsible for that.

The arguments to it are:

hash: the hash object itself
hash_value: the numerical hash of the key object (in Ruby, any object responds to .hash and produces a deterministic, numerical hash -- try it out! This hash value is critical to how hashes work, as we'll see.)
key: the key object

static unsigned
ar_find_entry(VALUE hash, st_hash_t hash_value, st_data_t key)
{
    ar_hint_t hint = ar_do_hash_hint(hash_value);
    return ar_find_entry_hint(hash, hint, key);
}
# https://github.com/ruby/ruby/blob/ruby_3_2/hash.c#L744-L749

Your first thought (other than "whoa, how long has it been since I worked in C?") is probably to wonder what a "hint" is. The ar_do_hash_hint method, plus a quick lookup of the type definition in hash.h, tells us:

static inline ar_hint_t
ar_do_hash_hint(st_hash_t hash_value)
{
    return (ar_hint_t)hash_value;
}
# https://github.com/ruby/ruby/blob/ruby_3_2/hash.c#L403-L407

typedef unsigned char ar_hint_t;
# https://github.com/ruby/ruby/blob/ruby_3_2/internal/hash.h#L20

This typecast to an unsigned char means that a hint is just the last byte of the numerical hash. So, a number from 0 to 255.

Delving ever deeper

Armed with that knowledge, we can proceed to ar_find_entry_hint. I'll strip the debug code out of this for clarity:

static unsigned
ar_find_entry_hint(VALUE hash, ar_hint_t hint, st_data_t key)
{
    unsigned i, bound = RHASH_AR_TABLE_BOUND(hash);
    const ar_hint_t *hints = RHASH(hash)->ar_hint.ary;

    for (i = 0; i < bound; i++) {
        if (hints[i] == hint) {
            ar_table_pair *pair = RHASH_AR_TABLE_REF(hash, i);
            if (ar_equal(key, pair->key)) {
                return i;
            }
        }
    }
    return RHASH_AR_TABLE_MAX_BOUND;
}
# https://github.com/ruby/ruby/blob/ruby_3_2/hash.c#L701-L742

There are two checks this code makes for each entry to decide if it's our value:

if (hints[i] == hint) -- Ruby uses the hint as a way to quickly check if the key is likely to be our key or not.
if (ar_equal(key, pair->key)) -- but before we actually return the row as a hit, we check the actual equality of the given key and the one for this entry.

What this means is that two objects having keys with the same hint is perfectly normal: the code would spend a little wasted time before realizing the keys are different, but we wouldn't expect a false positive.

So when we mutate the hash key in the snippet at the top of this post, there's a 1 out of 256 chance that our hash still keeps the same hint.

But won't the equality check save us?

No! Even after we mutate a hash, the hash always stays equal with any references to it (because it's the same object and therefore still has the same memory address). And the hash table isn't storing the key object by value; it's only storing a pointer to it. So in the snippet above, the reference to key inside hash will always be equal to the that same key object no matter how much we mutate it; meaning the only thing standing between us and a false-positive hit is the hint check.

That gives us a 255 out of 256 chance to see our "expected behavior of a miss, and a 1 out of 256 chance to see the unexpected hit. I ran the snippet at the top of the post for 100,000,000 iterations -- where we'd expect to see 390,625 hits -- and got 390,956. I'm no statistician, but I think that proves our logic is correct.

Big deal. Clearly mutating a hash key is just a bad idea.

Well, you're not wrong. Python doesn't let dictionaries use mutable keys, which is very sensible. Go and JavaScript do, but they don't exhibit this bug and instead will always produce a hit after the key is mutated (without investigating, I'll bet they're doing the hashing based purely on the pointer address). So Ruby seems to stand alone in both allowing this behavior and having a hash implementation that leads to this indeterminate behavior.

How does this relate to Rails?

Ah, I almost forgot. This connects to Active Record because of how ActiveRecord::QueryCache is implemented:

def cache_sql(sql, name, binds)
  @lock.synchronize do
    result =
        if @query_cache[sql].key?(binds)
          @query_cache[sql][binds]
        else
          @query_cache[sql][binds] = yield
        end
    result.dup
  end
end

# https://github.com/rails/rails/blob/v7.0.4/activerecord/lib/active_record/connection_adapters/abstract/query_cache.rb#L127-L141

If you don't remember what the query cache is, it's an enabled-by-default middleware for ActiveRecord that caches read queries to prevent suboptimal code from hammering the database unnecessary. It gets reset after every request, and for the most part, it works great with no need to notice it working. But at Enova, we saw a case where the cache was producing false positives, which led me down this whole path.

Anyway, taking a look at the code again, the key thing is that binds will (if bound_parameters is on) hold a reference to a hash passed into a where clause. So this snippet demonstrates the same actual Ruby hash bug that we've just solved:

criteria = { thing: "one" }
results_one = MyModel.where(some_col: criteria)

criteria.merge!( other: "stuff" }
results_two = MyModel.where(some_col: criteria)
# 1/256 chance that results_two incorrectly gets the same data as results_one!

This is the bug that I reported in rails/rails#46044:

Querying with mutable bound parameters can produce false-positive query cache hits #46044

rnubel posted on Sep 15, 2022

In production at Enova, in one of our apps, we were seeing an incredibly rare issue where sometimes a query would improperly perform a cache load from the query cache and return the wrong value. After several days of debugging and load-testing and more debugging, I was eventually able to track down the issue to a place in our code where we doing, essentially:

search = { key: "value" }
r = Record.where(criteria: search).first
# ...
search.merge!(key2: "value2")
r2 = Record.where(criteria: search).first

As it turns out, the mutation of the search key produces a situation where -- occasionally, as demonstrated below -- the query cache returns the wrong value. The root cause is how Ruby hashes work internally: objects are hashed into a bucket in the internal structure based on a modulus of their numerical hash, and retrieved by searching the list of objects in the matching bucket for an equal object. When we mutate the search object, it is possible that the new numerical hash still falls into the same bucket, and because the key is a pointer to the object, the == check passes and the old object is returned.

In normal Ruby code, it would be obvious to an experienced Ruby developer that using a mutable hash key (and then mutating it) is a bad idea. Since the Rails query cache is under the covers, however, it is not obvious to a developer that the code above would be problematic. And because it happens so rarely, I suspect this is occurring in many Rails applications across the world without anyone noticing (other than the occasional head-scratching Sentry error, perhaps).

I do have a possible fix for this, which I'll add in a comment below.

Steps to reproduce

# frozen_string_literal: true

require "bundler/inline"

gemfile(true) do
  source "https://rubygems.org"

  git_source(:github) { |repo| "https://github.com/#{repo}.git" }

  gem "rails", github: "rails/rails", branch: "main"
  gem "sqlite3"
end

require "active_record"
require "minitest/autorun"
require "logger"

# This connection will do for database-independent bug reports.
ActiveRecord::Base.establish_connection(adapter: "sqlite3", database: ":memory:", prepared_statements: true)
# ActiveRecord::Base.logger = Logger.new(STDOUT) # you can enable this to see the cache loads, but it's noisy

ActiveRecord::Schema.define do
  create_table :my_records, force: true do |t|
    t.json :value
    t.text :description
  end
end

class MyRecord < ActiveRecord::Base; end

class QueryCacheMutableSearchTest < Minitest::Test
  def test_bug
    iterations = 10000
    false_positives = 0

    MyRecord.connection.enable_query_cache!

    iterations.times do
      key, val = rand(100000), rand(100000)

      record = MyRecord.create(value: { key => val }, description: "The record we want to find")

      search = { key => val }
      the_record = MyRecord.where(value: search).first # this should populate the cache
      assert the_record.present?

      # cache now looks like this, essentially:
      #  { "SELECT * FROM my_records WHERE value = $1" =>
      #    { [search] => the_record }
      #  }

      new_val = rand(100000) until new_val != val

      search.merge!(key => new_val) # this mutates the key inside the query cache

      # normally: because the hash of the key has changed, this is a cache miss
      # however, if the new hash key's numerical hash falls into the same bucket
      # as the original, the hash lookup will a) find the first query's entry and
      # b) use it, because the objects are equal b/c the `search` hash was mutated
      # is equal to key_obj (since it's a reference)

      should_not_exist = MyRecord.where(value: search).first # this SHOULD not return a value
      false_positives += 1 if should_not_exist.present?

      record.destroy
      MyRecord.connection.clear_query_cache
    end

    assert_equal 0, false_positives
  end
end

Expected behavior

The second query should never return a value, since the value it's supposed to look for does not exist in the database.

Actual behavior

The second query sometimes performs a CACHE MyRecord Load and returns the original record, incorrectly:

Failure:
QueryCacheMutableSearchTest#test_bug [minimal_case.rb:69]:
Expected: 0
  Actual: 43

This happens because the mutated search hash inside the query cache ends up a) hashing into the same bucket as the original search hash did inside the query cache's hash, and b) still remains equivalent to the original search hash since the query cache stores a reference to it. Technically, the query cache is keying off of the list of binds which is a list of objects like ActiveRecord::QueryAttribute, but ultimately they end up with a reference to the search variable itself and thus the problem still manifests.

System configuration

Rails version: edge (7.1.0.alpha), also occurs in 6.x and probably older versions as well Ruby version: 2.7.6

View on GitHub

Tenderlove (aka Aaron Patterson) was able to quickly fix it, though, by dup-and-freezing any hash values passed into ActiveRecord query conditions:

Dup and freeze complex types when making query attributes #46048

tenderlove posted on Sep 15, 2022

This avoids problems when complex data structures are mutated after being handed to ActiveRecord for processing. For example false hits in the query cache.

Possible fix for #46044

View on GitHub

Conclusion

Hopefully you found this entertaining and a little informative. Maybe you're wondering, is this actually a bug in Ruby itself? Certainly it feels like a design flaw. When I get around to it, perhaps soon, I plan to post this to the Ruby mailing list to at least see some core developers' thoughts on it.

DEV Community