Earlier this year, I was working on an event bus for the Rage framework. The event bus was designed to allow publishing both synchronous and asynchronous events. With synchronous events, the request would wait for event subscribers to finish. With asynchronous events, subscribers would be executed after the request had been served.
But what happens if an application publishes a bunch of asynchronous events and then receives a SIGTERM? To prevent losing these events and to ensure they can be processed after a server restart, the framework needed to store asynchronous events in persistent storage.
By default, Rage would store asynchronous events on disk in an append-only log. This allowed for a seamless setup, eliminating the need for any specific configurations for the event bus to function.
Let's explore how the storage was implemented, walk through possible optimisations for otherwise straightforward code, and see how seemingly minor code choices can significantly impact performance.
While I was working on the event bus, it evolved into something very different - a message queue. However, the code described in this article is still largely used within the framework.
Saving Data on Disk
Let's examine the following code, which stores asynchronous events on disk:
require "zlib"
class DiskBackend
def initialize
@file = File.open("storage/#{Time.now.strftime("%Y%m%d")}-#{Process.pid}", "a+b")
@entries = {}
end
def add(event, subscribers)
entry_ids = Array.new(subscribers.length) { generate_entry_id }
serialized_event = Marshal.dump(event)
subscribers.zip(entry_ids) do |subscriber, entry_id|
entry = "add:#{entry_id}:#{subscriber}:#{serialized_event}"
crc = Zlib.crc32(entry).to_s(16).rjust(8, "0")
@file.write("#{crc}:#{entry}\n")
end
entry_ids.each { |entry_id| @entries[entry_id] = true }
end
private
def generate_entry_id
Process.clock_gettime(Process::CLOCK_MONOTONIC).to_s
end
end
The DiskBackend class stores event subscribers in a file. Since each event can have multiple subscribers, the class generates a unique entry for each subscriber.
In the constructor, we open the storage file and initialise the @entries hash (more on this later). The add method then performs the following steps:
- Generates unique IDs for all entries;
- Serialises the event;
- For each subscriber:
  - Generates an entry in the format of add:<entryID>:<subscriber>:<event>;
  - Generates a CRC32 signature for the entry;
  - Writes the entry to the file;
- Updates the @entries hash with the list of generated entry IDs. This information is used during storage file rotation to identify in-progress events that need to be copied to a new storage file.
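To make the on-disk format concrete, here is a minimal sketch of how a stored line could be checked on recovery. This is illustrative only - the framework's actual recovery code isn't shown in this article - and it simply reverses the write format above:

require "zlib"

# Each stored line has the format "<crc>:<entry>\n", where <crc> is the
# zero-padded hex CRC32 of everything after the first colon.
def valid_line?(line)
  crc, entry = line.chomp.split(":", 2)
  return false if crc.nil? || entry.nil?

  Zlib.crc32(entry).to_s(16).rjust(8, "0") == crc
end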
Let's now run a simple benchmark to see how the code performs:
RubyVM::YJIT.enable

require "benchmark/ips"
require_relative "disk_backend"

backend = DiskBackend.new

# sample data: any event object and list of subscriber names will do
event = { name: "user.created", payload: { id: 1 } }
subscribers = ["SendWelcomeEmail"]

Benchmark.ips do |x|
  x.report("add") do
    backend.add(event, subscribers)
  end
end
On my machine, with Ruby 3.3.8, I get the following results:
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
add 44.335k i/100ms
Calculating -------------------------------------
add 226.405k (±16.5%) i/s (4.42 μs/i) - 1.153M in 5.224209s
226k operations per second! That's good. But I think I can make it better.
Profiling
Let's profile this code and see if we can extract some useful information from it.
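The profiling setup isn't shown here; one minimal way to produce a flat profile like the one below is the ruby-prof gem (using the same assumed sample data as in the benchmark):

require "ruby-prof"
require_relative "disk_backend"

backend = DiskBackend.new
event = { name: "user.created", payload: { id: 1 } }
subscribers = ["SendWelcomeEmail"]

profile = RubyProf::Profile.profile do
  1_000_000.times { backend.add(event, subscribers) }
end

# print a flat profile sorted by self time
RubyProf::FlatPrinter.new(profile).print($stdout)

Running this prints a flat profile: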
%self total self wait child calls name
26.80 1.343 1.343 0.000 0.000 1000000 <Module::Marshal>#dump
16.33 1.286 0.818 0.000 0.468 1000000 Array#zip
9.24 4.604 0.463 0.000 4.141 1000000 DiskBackend#add
7.63 0.382 0.382 0.000 0.000 1000000 Float#to_s
6.85 0.343 0.343 0.000 0.000 1000000 Hash#[]=
6.15 5.011 0.308 0.000 4.703 1 Integer#times
3.83 0.192 0.192 0.000 0.000 1000000 IO#write
3.68 0.640 0.184 0.000 0.455 1000000 DiskBackend#generate_entry_id
3.62 0.821 0.181 0.000 0.640 1000000 Array#initialize
3.44 0.516 0.172 0.000 0.343 1000000 Array#each
2.46 0.944 0.123 0.000 0.821 1000000 <Class::Array>#new
1.90 0.095 0.095 0.000 0.000 1000000 Integer#to_s
1.82 0.091 0.091 0.000 0.000 1000000 <Module::Zlib>#crc32
1.79 0.090 0.090 0.000 0.000 1000000 String#rjust
1.45 0.073 0.073 0.000 0.000 1000000 <Module::Process>#clock_gettime
As expected, most of the time is spent serialising the event with Marshal.dump. One surprising observation, however, is the Float#to_s call coming from Process.clock_gettime(Process::CLOCK_MONOTONIC).to_s. This call consumes 7.63% of the total execution time.

On the one hand, this is understandable - we are converting the float to a string to then use it as part of the event entry written to the file. On the other hand, we don't have to do this, as the conversion will happen automatically when the number is interpolated into a string anyway. What if we removed the to_s call then?
@@ -23,7 +23,7 @@ class DiskBackend
   private

   def generate_entry_id
-    Process.clock_gettime(Process::CLOCK_MONOTONIC).to_s
+    Process.clock_gettime(Process::CLOCK_MONOTONIC)
   end
 end
Let's run the benchmark again:
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
add 64.575k i/100ms
Calculating -------------------------------------
add 625.264k (± 6.5%) i/s (1.60 μs/i) - 3.164M in 5.093242s
Whoa! Removing to_s, which doesn't affect functionality at all, and allowing Ruby to implicitly convert the float, increased the performance of our code by 2.76 times!
If you think this is a quirk of YJIT, it's not; the numbers are roughly the same without YJIT. And it's not even about implicit string conversions; the speed-up comes from this code:
entry_ids.each { |entry_id| @entries[entry_id] = true }
It turns out that setting elements with numeric keys in a hash is much more performant than setting elements with string keys. Less is more - lesson learnt.
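This is easy to verify in isolation. The following micro-benchmark isn't from the original code, but it compares float keys against their string counterparts:

RubyVM::YJIT.enable

require "benchmark/ips"

float_keys = {}
string_keys = {}

Benchmark.ips do |x|
  x.report("float key") do
    float_keys[Process.clock_gettime(Process::CLOCK_MONOTONIC)] = true
  end

  x.report("string key") do
    string_keys[Process.clock_gettime(Process::CLOCK_MONOTONIC).to_s] = true
  end

  x.compare!
end

Besides the cheaper hashing, Ruby duplicates and freezes unfrozen string keys on insertion, while floats are stored as-is.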
Constant-time Writes
Reading the following section can lead to uncontrollable anger if you're a Product Manager.
Let's see if we can push the code a bit further. In our profiling results, we also see that the Hash#[]= call, used when populating the @entries hash, consumes 6.85% of the time.

Hashes are fast; setting elements in a hash has O(1) complexity. Moreover, we've already optimised that code by removing to_s. But what if we comment out the loop entirely?
@@ -17,7 +17,7 @@ class DiskBackend
       @file.write("#{crc}:#{entry}\n")
     end

-    entry_ids.each { |entry_id| @entries[entry_id] = true }
+    # entry_ids.each { |entry_id| @entries[entry_id] = true }
   end

   private
The benchmark shows the following results:
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
add 66.647k i/100ms
Calculating -------------------------------------
add 702.800k (± 2.4%) i/s (1.42 μs/i) - 3.532M in 5.029299s
That's quite an improvement, but how do we get there if we still need the information in the @entries hash? We change the requirements, of course!
Instead of copying pending events during storage file rotation, we can modify the logic to rotate the storage file only when there are no in-progress events. This would require changes to the storage file rotation mechanism. On the plus side, we can now maintain a simple counter instead of a hash:
@@ -3,7 +3,7 @@ require "zlib"
 class DiskBackend
   def initialize
     @file = File.open("storage/#{Time.now.strftime("%Y%m%d")}-#{Process.pid}", "a+b")
-    @entries = {}
+    @pending_events = 0
   end

   def add(event, subscribers)
@@ -17,7 +17,7 @@ class DiskBackend
       @file.write("#{crc}:#{entry}\n")
     end

-    entry_ids.each { |entry_id| @entries[entry_id] = true }
+    @pending_events += entry_ids.length
   end

   private
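The rotation mechanism itself is outside the scope of this article, but with the counter in place it could boil down to something like the following sketch (the method name and trigger are assumptions, not the framework's actual code):

# hypothetical rotation helper: rotate only when no events are in flight
def maybe_rotate
  return unless @pending_events.zero?

  @file.close
  @file = File.open("storage/#{Time.now.strftime("%Y%m%d")}-#{Process.pid}", "a+b")
end

The counter would be decremented as subscribers finish, so a value of zero means the current file contains no in-progress events and is safe to rotate away.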
The benchmark results for this version:
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
add 67.455k i/100ms
Calculating -------------------------------------
add 704.674k (± 0.7%) i/s (1.42 μs/i) - 3.575M in 5.073673s
Micro-optimisations
There's one final optimisation I'd like to implement. In event buses, events are often quite specific, which means that, most of the time, an event has only one subscriber. Let's optimise the add method for this particular use case:
def add(event, subscribers)
  serialized_event = Marshal.dump(event)

  if subscribers.length == 1
    entry = "add:#{generate_entry_id}:#{subscribers[0]}:#{serialized_event}"
    crc = Zlib.crc32(entry).to_s(16).rjust(8, "0")
    @file.write("#{crc}:#{entry}\n")
    @pending_events += 1
  else
    entry_ids = Array.new(subscribers.length) { generate_entry_id }

    subscribers.zip(entry_ids) do |subscriber, entry_id|
      entry = "add:#{entry_id}:#{subscriber}:#{serialized_event}"
      crc = Zlib.crc32(entry).to_s(16).rjust(8, "0")
      @file.write("#{crc}:#{entry}\n")
    end

    @pending_events += entry_ids.length
  end
end
Yes, the code looks terrible, but this provides an additional 10% boost:
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
add 75.275k i/100ms
Calculating -------------------------------------
add 770.936k (± 0.7%) i/s (1.30 μs/i) - 3.914M in 5.077604s
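If the duplication bothers you, the fast path can be kept while extracting the shared write logic into a helper. Here's a sketch of one possible cleanup - my refactoring, not the framework's actual code:

def add(event, subscribers)
  serialized_event = Marshal.dump(event)

  if subscribers.length == 1
    write_entry(generate_entry_id, subscribers[0], serialized_event)
    @pending_events += 1
  else
    subscribers.each do |subscriber|
      write_entry(generate_entry_id, subscriber, serialized_event)
    end
    @pending_events += subscribers.length
  end
end

private

# hypothetical helper encapsulating the "<crc>:add:<entryID>:<subscriber>:<event>" format
def write_entry(entry_id, subscriber, serialized_event)
  entry = "add:#{entry_id}:#{subscriber}:#{serialized_event}"
  crc = Zlib.crc32(entry).to_s(16).rjust(8, "0")
  @file.write("#{crc}:#{entry}\n")
end

Whether the extra method call erases the 10% gain is something you'd want to re-benchmark before committing to it.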
Wrapping Up
With several simple (and one ugly) changes, we improved the performance of our code by 3.4 times! As it often happens with performance optimisations, the most significant improvement came from the smallest and simplest change.
Does it matter though? Aren't all of these micro-optimisations? Well, yes and no.
Let's consider our @entries hash example, where we were inserting elements into a hash:
entry_ids.each { |entry_id| @entries[entry_id] = true }
In theory, Ruby can insert 16M elements per second into a hash:
RubyVM::YJIT.enable

require "benchmark/ips"

entries = {}

Benchmark.ips do |x|
  x.report("Hash#[]=") do
    entries[Process.clock_gettime(Process::CLOCK_MONOTONIC)] = true
  end
end
Results:
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
Hash#[]= 1.450M i/100ms
Calculating -------------------------------------
Hash#[]= 16.066M (± 6.4%) i/s (62.24 ns/i) - 81.173M in 5.083554s
It can't possibly affect our code, which manages fewer than 1M ops/s, can it? Yet, this very call slows down our code by 12%! So much for micro-optimisations.
The good news is that you likely don't need to worry about this. When it comes to web development, it's far more important to write maintainable code that is easy to read and change. Performance is not nearly as critical as maintainability. Gems, however, are a very different thing.
Gems provide abstractions that should be not only convenient but highly efficient, too. This enables applications to scale and handle more load using the same slow-but-maintainable user-level code.
While most optimisations might seem insignificant, their cumulative impact can be substantial, especially in areas like gem development. A difference of a few microseconds per operation can translate into thousands of operations per second, directly influencing an application's ability to scale and handle increased load efficiently.