Discussion on: Incident Retro: Failing Comment Creation + Erroneous Push Notifications

Josh Cheek

Looks like you can do redis.keys('rpush:notifications:*') in place of the first script. It should be much quicker, because it doesn't send anything over the wire that doesn't match the pattern. If you want to do it in a paginated way, you could do r.scan_each(match: 'rpush:notifications:*') { |key| }, which basically does what your script did, except that the match is executed on the server, so it doesn't send nonmatching keys over the wire. You can also pass it a count option (this tells Redis how many keys you want it to scan per iteration), but in my experiments that was always slower.

Locally I added 1.2 million matching keys and 1.2 million nonmatching keys, and then played around with it for a bit:

require "redis"

def time
  start = Time.now
  yield.tap { $stderr.printf "%.2f seconds\n", Time.now - start }
end

r = Redis.new

# Seed it with records
1_200_000.times do |i|
  r.set "rpush:notifications:#{i}", i
  r.set "other:thing:#{i}", i
end

# Check total number of keys
r.dbsize # => 1200014

# Check how many match our pattern (this is a Lua script that calls the `keys`
# command on the Redis server, and then returns how many it found). It's much
# faster just b/c it doesn't need to ship all the keys over the wire
r.eval "return #redis.call('keys', 'rpush:notifications:*')", [0] # => 1200000

# Get the list of keys using https://redis.io/commands/keys
keys = time { r.keys('rpush:notifications:*') } # 7.77 seconds
keys.size # => 1200000

# Iterate over each key with successive scans (here we just increment a counter)
time { r.scan_each(match: 'rpush:notifications:*').count } # 19.16 seconds
# => 1200000

# Delete the keys in batches of 10k
time { r.keys('rpush:notifications:*').each_slice(10_000) { |s| p r.del s } } # 32.90 seconds

# They're deleted
time { r.keys('rpush:notifications:*').size } # 0.57 seconds
# => 0
Jamie Gaskins

I like that you enjoy optimizing things and golfing code down to fit in tweets. 🙂 However, like many optimizations, this one is more complex than simple latency. KEYS would indeed have had lower latency, but it locks the Redis event loop for the duration, so we avoid that command wherever possible.

To put it into perspective, your KEYS query took 7 seconds in your benchmark (we had almost 5M keys in Redis, so it would've actually taken longer, but I'll work with your numbers for the moment). In that amount of time we normally execute over 10,000 Redis queries, so running KEYS would introduce a pretty sizable head-of-line-blocking issue. On top of that, every HTTP request that makes it past Fastly to our backend hits this Redis instance at least once, so blocking Redis means blocking all HTTP requests. Since we don't want to do that, Redis needs to be available af. We also don't allow other O(n) calls where n can ever be nontrivial for that reason, so nothing like LRANGE my-list 0 -1, either.
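To sketch the shape of what we reach for instead (illustrative only, not our actual worker; the key pattern and batch size here are just examples):

require "redis"

redis = Redis.new

# SCAN bounds the work Redis does per round trip, so nothing monopolizes
# the event loop. UNLINK (Redis 4+) reclaims memory on a background
# thread, unlike DEL, which can block on large values.
redis.scan_each(match: "rpush:notifications:*").each_slice(1_000) do |batch|
  redis.unlink(*batch)
end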

In fairness, it almost certainly doesn't lock the main Redis thread for the full amount of time it takes the Ruby app to receive the entire list of keys (we're using Redis 6, which does I/O in a separate thread), but without any insight into how much of that time is spent in the main Redis event loop, we don't take chances. Since this issue impacted only a tiny fraction of our total traffic, we needed to continue optimizing for availability of Redis over shaving a few seconds off of the latency of the solution.

The code Molly posted was adapted from this worker. I didn't know about Redis#scan_each, though. That completely obviates our manual tracking of the cursor, which I really like.
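i.e. instead of threading a cursor through successive SCAN calls ourselves, it collapses to something like this (a sketch; the per-key handling is a stand-in for what the worker really does):

redis.scan_each(match: "rpush:notifications:*") do |key|
  redis.expire(key, 0) # stand-in for the worker's real per-key handling
end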

Josh Cheek

> I like that you enjoy optimizing things and golfing code down to fit in tweets.

Thanks ^_^

> [comment about keys blocking]

I agree with the analysis and conclusions here.

> The code Molly posted was adapted from this worker. I didn't know about Redis#scan_each, though. That completely obviates our manual tracking of the cursor, which I really like.

Aye, +1 for scan_each. Also note that on the last iteration, when the returned cursor is "0", that final response can still contain keys, which the worker won't expire, b/c it doesn't execute the loop body on that last response.

eg:

$ ruby -r redis -e '
  redis = Redis.new

  p count_from_keys: redis.eval(%(return #redis.call("keys", "rpush:notifications:*")), [0])

  cursor = count = 0
  until (cursor, keys = redis.scan(cursor, match: "rpush:notifications:*")).first.to_i.zero?
    count += keys.size
  end
  p count_from_loop: count, final_key_size: keys.size
'
{:count_from_keys=>1200000}
{:count_from_loop=>1199998, :final_key_size=>2}
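A sketch of the same loop with the final batch included: process the batch first, then stop once the returned cursor wraps back to "0". (Or just use scan_each, which handles this for you.)

cursor = "0"
count  = 0
loop do
  cursor, keys = redis.scan(cursor, match: "rpush:notifications:*")
  count += keys.size     # handle this batch before checking the cursor
  break if cursor == "0" # the final response can still carry keys
end
count # => 1200000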
Jamie Gaskins

Ooh, nice catch! I'll go ahead and port that worker over to use scan_each. Thanks Josh!