Earlier this year, I was working on an event bus for the Rage framework. The event bus was designed to allow publishing both synchronous and asynchronous events. With synchronous events, the request would wait for event subscribers to finish. With asynchronous events, subscribers would be executed after the request had been served.
But what happens if an application publishes a bunch of asynchronous events and then receives a SIGTERM? To prevent losing these events and to ensure they can be processed after a server restart, the framework needed to store asynchronous events in persistent storage.
By default, Rage would store asynchronous events on disk in an append-only log. This allowed for a seamless setup, eliminating the need for any specific configurations for the event bus to function.
Let's explore how the storage was implemented, walk through possible optimisations for otherwise straightforward code, and see how seemingly minor code choices can significantly impact performance.
While I was working on the event bus, it evolved into something very different - a message queue. However, the code described in this article is still largely used within the framework.
Saving Data on Disk
Let's examine the following code, which stores asynchronous events on disk:
require "zlib"
class DiskBackend
def initialize
@file = File.open("storage/#{Time.now.strftime("%Y%m%d")}-#{Process.pid}", "a+b")
@entries = {}
end
def add(event, subscribers)
entry_ids = Array.new(subscribers.length) { generate_entry_id }
serialized_event = Marshal.dump(event)
subscribers.zip(entry_ids) do |subscriber, entry_id|
entry = "add:#{entry_id}:#{subscriber}:#{serialized_event}"
crc = Zlib.crc32(entry).to_s(16).rjust(8, "0")
@file.write("#{crc}:#{entry}\n")
end
entry_ids.each { |entry_id| @entries[entry_id] = true }
end
private
def generate_entry_id
Process.clock_gettime(Process::CLOCK_MONOTONIC).to_s
end
end
The DiskBackend class stores event subscribers in a file. Since each event can have multiple subscribers, the class generates a unique entry for each subscriber.
In the constructor, we open the storage file and initialise the @entries hash (more on this later). The add method then performs the following steps:
- Generates unique IDs for all entries;
- Serialises the event;
- For each subscriber:
  - Generates an entry in the format of add:<entryID>:<subscriber>:<event>;
  - Generates a CRC32 signature for the entry;
  - Writes the entry to the file;
- Updates the @entries hash with the list of generated entry IDs. This information is used during storage file rotation to identify in-progress events that need to be copied to a new storage file.
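To make the on-disk format concrete, here is a minimal sketch of how a stored line could be checked on recovery. This is illustrative only - the framework's actual recovery code isn't shown in this article - and it simply reverses the write format above:

require "zlib"

# Each stored line has the format "<crc>:<entry>\n", where <crc> is the
# zero-padded hex CRC32 of everything after the first colon.
def valid_line?(line)
  crc, entry = line.chomp.split(":", 2)
  return false if crc.nil? || entry.nil?

  Zlib.crc32(entry).to_s(16).rjust(8, "0") == crc
end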
Let's now run a simple benchmark to see how the code performs:
RubyVM::YJIT.enable

require "benchmark/ips"
require_relative "disk_backend"

backend = DiskBackend.new

# sample data: any event object and list of subscriber names will do
event = { name: "user.created", payload: { id: 1 } }
subscribers = ["SendWelcomeEmail"]

Benchmark.ips do |x|
  x.report("add") do
    backend.add(event, subscribers)
  end
end
On my machine, with Ruby 3.3.8, I get the following results:
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
add 44.335k i/100ms
Calculating -------------------------------------
add 226.405k (±16.5%) i/s (4.42 μs/i) - 1.153M in 5.224209s
226k operations per second! That's good. But I think I can make it better.
Profiling
Let's profile this code and see if we can extract some useful information from it.
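The profiling setup isn't shown here; one minimal way to produce a flat profile like the one below is the ruby-prof gem (using the same assumed sample data as in the benchmark):

require "ruby-prof"
require_relative "disk_backend"

backend = DiskBackend.new
event = { name: "user.created", payload: { id: 1 } }
subscribers = ["SendWelcomeEmail"]

profile = RubyProf::Profile.profile do
  1_000_000.times { backend.add(event, subscribers) }
end

# print a flat profile sorted by self time
RubyProf::FlatPrinter.new(profile).print($stdout)

Running this prints a flat profile: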
%self total self wait child calls name
26.80 1.343 1.343 0.000 0.000 1000000 <Module::Marshal>#dump
16.33 1.286 0.818 0.000 0.468 1000000 Array#zip
9.24 4.604 0.463 0.000 4.141 1000000 DiskBackend#add
7.63 0.382 0.382 0.000 0.000 1000000 Float#to_s
6.85 0.343 0.343 0.000 0.000 1000000 Hash#[]=
6.15 5.011 0.308 0.000 4.703 1 Integer#times
3.83 0.192 0.192 0.000 0.000 1000000 IO#write
3.68 0.640 0.184 0.000 0.455 1000000 DiskBackend#generate_entry_id
3.62 0.821 0.181 0.000 0.640 1000000 Array#initialize
3.44 0.516 0.172 0.000 0.343 1000000 Array#each
2.46 0.944 0.123 0.000 0.821 1000000 <Class::Array>#new
1.90 0.095 0.095 0.000 0.000 1000000 Integer#to_s
1.82 0.091 0.091 0.000 0.000 1000000 <Module::Zlib>#crc32
1.79 0.090 0.090 0.000 0.000 1000000 String#rjust
1.45 0.073 0.073 0.000 0.000 1000000 <Module::Process>#clock_gettime
As expected, most of the time is spent serialising the event with Marshal.dump. One surprising observation, however, is the Float#to_s call coming from Process.clock_gettime(Process::CLOCK_MONOTONIC).to_s. This call consumes 7.63% of the total execution time.

On the one hand, this is understandable - we are converting the float to a string to then use it as part of the event entry written to the file. On the other hand, we don't have to do this, as the conversion will happen automatically when the number is interpolated into a string anyway. What if we removed the to_s call then?
@@ -23,7 +23,7 @@ class DiskBackend
   private

   def generate_entry_id
-    Process.clock_gettime(Process::CLOCK_MONOTONIC).to_s
+    Process.clock_gettime(Process::CLOCK_MONOTONIC)
   end
 end
Let's run the benchmark again:
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
add 64.575k i/100ms
Calculating -------------------------------------
add 625.264k (± 6.5%) i/s (1.60 μs/i) - 3.164M in 5.093242s
Whoa! Removing to_s, which doesn't affect functionality at all, and allowing Ruby to implicitly convert the float, increased the performance of our code by 2.76 times!
If you think this is a quirk of YJIT, it's not; the numbers are roughly the same without YJIT. And it's not even about implicit string conversions; the speed-up comes from this code:
entry_ids.each { |entry_id| @entries[entry_id] = true }
It turns out that setting elements with numeric keys in a hash is much more performant than setting elements with string keys. Less is more - lesson learnt.
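This is easy to verify in isolation. The following micro-benchmark isn't from the original code, but it compares float keys against their string counterparts:

RubyVM::YJIT.enable

require "benchmark/ips"

float_keys = {}
string_keys = {}

Benchmark.ips do |x|
  x.report("float key") do
    float_keys[Process.clock_gettime(Process::CLOCK_MONOTONIC)] = true
  end

  x.report("string key") do
    string_keys[Process.clock_gettime(Process::CLOCK_MONOTONIC).to_s] = true
  end

  x.compare!
end

Besides the cheaper hashing, Ruby duplicates and freezes unfrozen string keys on insertion, while floats are stored as-is.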
Constant-time Writes
Reading the following section can lead to uncontrollable anger if you're a Product Manager.
Let's see if we can push the code a bit further. In our profiling results, we also see that the Hash#[]= call, used when populating the @entries hash, consumes 6.85% of the time.

Hashes are fast; setting elements in a hash has O(1) complexity. Moreover, we've already optimised that code by removing to_s. But what if we comment out the loop entirely?
@@ -17,7 +17,7 @@ class DiskBackend
       @file.write("#{crc}:#{entry}\n")
     end

-    entry_ids.each { |entry_id| @entries[entry_id] = true }
+    # entry_ids.each { |entry_id| @entries[entry_id] = true }
   end

   private
The benchmark shows the following results:
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
add 66.647k i/100ms
Calculating -------------------------------------
add 702.800k (± 2.4%) i/s (1.42 μs/i) - 3.532M in 5.029299s
That's quite an improvement, but how do we get there if we still need the information in the @entries hash? We change the requirements, of course!
Instead of copying pending events during storage file rotation, we can modify the logic to rotate the storage file only when there are no in-progress events. This would require changes to the storage file rotation mechanism. On the plus side, we can now maintain a simple counter instead of a hash:
@@ -3,7 +3,7 @@ require "zlib"
 class DiskBackend
   def initialize
     @file = File.open("storage/#{Time.now.strftime("%Y%m%d")}-#{Process.pid}", "a+b")
-    @entries = {}
+    @pending_events = 0
   end

   def add(event, subscribers)
@@ -17,7 +17,7 @@ class DiskBackend
       @file.write("#{crc}:#{entry}\n")
     end

-    entry_ids.each { |entry_id| @entries[entry_id] = true }
+    @pending_events += entry_ids.length
   end

   private
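The rotation mechanism itself is outside the scope of this article, but with the counter in place it could boil down to something like the following sketch (the method name and trigger are assumptions, not the framework's actual code):

# hypothetical rotation helper: rotate only when no events are in flight
def maybe_rotate
  return unless @pending_events.zero?

  @file.close
  @file = File.open("storage/#{Time.now.strftime("%Y%m%d")}-#{Process.pid}", "a+b")
end

The counter would be decremented as subscribers finish, so a value of zero means the current file contains no in-progress events and is safe to rotate away.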
The benchmark results for this version:
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
add 67.455k i/100ms
Calculating -------------------------------------
add 704.674k (± 0.7%) i/s (1.42 μs/i) - 3.575M in 5.073673s
Micro-optimisations
There's one final optimisation I'd like to implement. In event buses, events are often quite specific, which means that, most of the time, an event has only one subscriber. Let's optimise the add method for this particular use case:
def add(event, subscribers)
  serialized_event = Marshal.dump(event)

  if subscribers.length == 1
    entry = "add:#{generate_entry_id}:#{subscribers[0]}:#{serialized_event}"
    crc = Zlib.crc32(entry).to_s(16).rjust(8, "0")
    @file.write("#{crc}:#{entry}\n")
    @pending_events += 1
  else
    entry_ids = Array.new(subscribers.length) { generate_entry_id }

    subscribers.zip(entry_ids) do |subscriber, entry_id|
      entry = "add:#{entry_id}:#{subscriber}:#{serialized_event}"
      crc = Zlib.crc32(entry).to_s(16).rjust(8, "0")
      @file.write("#{crc}:#{entry}\n")
    end

    @pending_events += entry_ids.length
  end
end
Yes, the code looks terrible, but this provides an additional 10% boost:
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
add 75.275k i/100ms
Calculating -------------------------------------
add 770.936k (± 0.7%) i/s (1.30 μs/i) - 3.914M in 5.077604s
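If the duplication bothers you, the fast path can be kept while extracting the shared write logic into a helper. Here's a sketch of one possible cleanup - my refactoring, not the framework's actual code:

def add(event, subscribers)
  serialized_event = Marshal.dump(event)

  if subscribers.length == 1
    write_entry(generate_entry_id, subscribers[0], serialized_event)
    @pending_events += 1
  else
    subscribers.each do |subscriber|
      write_entry(generate_entry_id, subscriber, serialized_event)
    end
    @pending_events += subscribers.length
  end
end

private

# hypothetical helper encapsulating the "<crc>:add:<entryID>:<subscriber>:<event>" format
def write_entry(entry_id, subscriber, serialized_event)
  entry = "add:#{entry_id}:#{subscriber}:#{serialized_event}"
  crc = Zlib.crc32(entry).to_s(16).rjust(8, "0")
  @file.write("#{crc}:#{entry}\n")
end

Whether the extra method call erases the 10% gain is something you'd want to re-benchmark before committing to it.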
Wrapping Up
With several simple (and one ugly) changes, we improved the performance of our code by 3.4 times! As it often happens with performance optimisations, the most significant improvement came from the smallest and simplest change.
Does it matter though? Aren't all of these micro-optimisations? Well, yes and no.
Let's consider our @entries hash example, where we were inserting elements into a hash:
entry_ids.each { |entry_id| @entries[entry_id] = true }
In theory, Ruby can insert 16M elements per second into a hash:
RubyVM::YJIT.enable

require "benchmark/ips"

entries = {}

Benchmark.ips do |x|
  x.report("Hash#[]=") do
    entries[Process.clock_gettime(Process::CLOCK_MONOTONIC)] = true
  end
end
Results:
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
Hash#[]= 1.450M i/100ms
Calculating -------------------------------------
Hash#[]= 16.066M (± 6.4%) i/s (62.24 ns/i) - 81.173M in 5.083554s
It can't possibly affect our code, which manages fewer than 1M ops/s, can it? Yet, this very call slows down our code by 12%! So much for micro-optimisations.
The good news is that you likely don't need to worry about this. When it comes to web development, it's far more important to write maintainable code that is easy to read and change. Performance is not nearly as critical as maintainability. Gems, however, are a very different thing.
Gems provide abstractions that should be not only convenient but highly efficient, too. This enables applications to scale and handle more load using the same slow-but-maintainable user-level code.
While most optimisations might seem insignificant, their cumulative impact can be substantial, especially in areas like gem development. A difference of a few microseconds per operation can translate into thousands of operations per second, directly influencing an application's ability to scale and handle increased load efficiently.