Yes, Ruby is fast, but…

Beta Ziliani — Thu, 09 May 2024 13:53:31 +0000

John Hawthorn wrote a nice post discussing a recent tool to incorporate Crystal into your Ruby app. While JH raises an important point, the post overlooks certain aspects that are worth considering. I'll discuss Crystal's real performance and benefits, highlighting why such Ruby/Crystal integration is an indispensable tool to have on the bench.

This is also a structured presentation of some comments made on the Hacker News post.

tl;dr

JH makes the case that Ruby has a just-in-time compiler, and that optimizing the Ruby version of the code yields a great performance improvement.
Crystal code doesn't need wrestling to be optimal.
The comparison is performed within Ruby, that is, incorporating the cost of calling Crystal within Ruby.
Pure Crystal shows something radically different!

Yes, Ruby is fast

The first point I want to make is that JH is right: we need to be fair to Ruby's JIT compiler (--yjit), and only consider benchmarks that include it. And, indeed, with it, Ruby gets very nice performance.

And let me be blunt here: I love Ruby! Ruby is one of my top 5 languages of choice. And a great community with many big companies agrees, so I expect Ruby's performance will only increase with time as more improvements get incorporated into it.

🔴 First point: Ruby's YJIT is fast!

The real performance of JITs and Crystal

Let's compare the execution of Ruby's YJIT, Python PyPy (another JIT compiler), and pure Crystal (that is, without the integration).

Ruby: On my computer, the numbers for Ruby's YJIT are on par with those in the post. Each line corresponds to one of the optimizations proposed:

> ruby --yjit fib.rb
       user     system      total        real
   3.464166   0.022979   3.487145 (  3.491493)
   1.705869   0.002169   1.708038 (  1.710117)
   0.187083   0.000318   0.187401 (  0.187578)
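
For readers who don't want to open the gist, here is a minimal sketch of what the last Ruby variant (the plain while loop) and its timing harness might look like. It is not the code from either post: the names `fib_while`, `sum_of_runs`, and `RUNS` are mine, and the exact shape of the loop is an assumption.

```ruby
require "benchmark"

RUNS = 1_000_000 # the benchmark repeats the computation a million times

# Iterative Fibonacci written as a plain while loop
# (assumed shape; the real code lives in the gist).
def fib_while(n)
  a = 0
  b = 1
  i = 0
  while i < n
    a, b = b, a + b
    i += 1
  end
  a
end

# Sum the results of all runs so the work is actually used.
def sum_of_runs(n, runs)
  total = 0
  runs.times { total += fib_while(n) }
  total
end

Benchmark.bm do |x|
  x.report { puts sum_of_runs(45, RUNS) }
end
```

Saved as fib.rb, this would be run the same way as above, with `ruby --yjit fib.rb`.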

Python: My Python-fu is limited, so I only ported the last version (a simple while loop) and ran it with PyPy. It takes a bit less time:

>  pypy fib.py
0.12447810173

Crystal: When we compile the code with --release, the numbers are negligible! Not only that, I've added some extra code to make sure the optimizations weren't throwing away important code. So not only do I calculate the Fibonacci number of 45 (using a UInt128, to stretch this even further), but I also print the sum of the million runs!

> crystal build --release fib.cr; ./fib
        user     system      total        real
  1134903170000000
  0.000002   0.000004   0.000006 (  0.000004)
  1134903170000000
  0.000001   0.000002   0.000003 (  0.000003)
  1134903170000000
  0.000002   0.000002   0.000004 (  0.000003)

⚫ Second point: Pure Crystal is really, really fast in this benchmark!

Reference: The code I'm using for the benchmarks is listed in this gist.

Note: As mentioned, the Crystal version uses a primitive number type (UInt128). That explains a lot of the performance difference.
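
One way to read this note from the Ruby side: Ruby's `Integer` is arbitrary precision, so its arithmetic must be prepared to grow beyond a machine word, whereas a fixed-width type such as UInt128 never does. The short sketch below (the helper name `fib` is mine) only illustrates that Ruby integers grow past 128 bits; it makes no claim about how much of the timing gap this accounts for.

```ruby
# Iterative Fibonacci; Ruby's Integer grows as needed.
def fib(n)
  a = 0
  b = 1
  n.times { a, b = b, a + b }
  a
end

small = fib(45)   # the value used in the benchmark
large = fib(200)  # far beyond what 128 bits can hold

puts small.bit_length <= 64  # fits in a 64-bit machine word
puts large > 2**128          # no longer fits in a UInt128-sized value
puts large.class             # still just Integer, no overflow error
```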

Crystal compilation optimizes your code

In the timings of the Crystal programs, the first one takes a couple more microseconds. However, if we swap the order in which the examples are run, the output is identical: the first one, whichever that is, takes a few microseconds more.

In conclusion, none of the proposed changes to the Ruby version of the code makes a dent in the Crystal version. This is not entirely Crystal's doing: it uses the LLVM backend, which generates highly optimized binaries.

Quite frankly, I'm puzzled as to why Ruby's YJIT doesn't optimize this as well. Perhaps it will get there with time (I tested Ruby 3.3.1).

⚫ Third point: Crystal code is fast, even without tweaks

Maybe it's the plumbing that's slow?

Doesn't seem so. But to understand why, we need to discuss an important point: by default, the integration compiles the Crystal code without the --release flag. This makes sense: during development, you don't want the compilation to take a lot of time. Compiling in release mode makes efficient binaries, but at the cost of significantly increasing the compilation time.

When I tested the Prime Counting example from the README file of the crystalruby page using release mode, the time it takes to run the Crystal code was the same as that of pure Crystal. For that, one needs to add the following code:

CrystalRuby.configure do |config|
  config.debug = false
end

So perhaps the timings from the Fibonacci example would look the same as with pure Crystal. I say perhaps because I stumbled across an issue that made the integration unusable on that particular example.

🔴⚫ Fourth point: The integration doesn't produce efficient Crystal code by default.

Crystal/Ruby integration revisited

Crystal and Ruby are two wonderful languages, each with their pros and cons. Crystal's performance and low memory footprint are hardly contested, and can be studied further in the benchmarks of languages and compilers (but be critical about benchmarks!).

Performance is not the only advantage of Crystal: its typechecker is another benefit that teams might want to use for safety-critical parts of an application. Or maybe there is an interesting shard to call from a gem… Whatever the reason, integrating Crystal code into Ruby is a very appealing tool to have in the dev toolbox.

It is common to call C functions from Ruby or Crystal. It's interesting to know that there are alternatives to bridge these two languages that share the same goal of writing beautiful programs, using a similar syntax. The mentioned crystalruby gem allows interfacing Ruby programs with Crystal, and the shard anyolite allows calling Ruby programs from Crystal.

🔴⚫ Fifth point: Ruby + Crystal FTW! ❤️

EDIT: I was asked a very good question twice: how do we know LLVM isn't optimizing so much that it just replaces the call to the Fibonacci function with the result? After all, the argument is fixed, so it can calculate the result and just place it there.

I missed this point in the post, although I originally thought about it. At the time of writing, I tried adding 45 + rand(1) as the argument. This ensures the argument is not a literal number. It certainly impacts the overall performance, and now it takes 1ms. Still very good, because it also counts the calls to rand! This is why I didn't see a problem and forgot to add this to the article.

However, on further inspection of the LLVM-generated code, I found more! It optimizes the code nevertheless! It produces a sum of 1134903170 (the result of fib(45)) with the million calls to rand(1)! My mind was totally blown by this. In any case, a point to LLVM, and to Crystal for using it!
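
A note on why 45 + rand(1) still asks for fib(45): with an integer bound of 1, the only value `rand` can return is 0, so the expression always evaluates to 45 while no longer being a literal. The snippet below checks this for Ruby's `Kernel#rand`; that Crystal's `rand(1)` behaves the same way is my reading of the sum reported above, not something verified here.

```ruby
# Kernel#rand(max) with an integer max returns an integer in 0...max,
# so rand(1) can only ever be 0.
args = Array.new(1_000) { 45 + rand(1) }

puts args.uniq.inspect # prints [45]
```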

EDIT 2: GitHub user @petr-fischer suggested taking the argument from the command line, in order to prevent LLVM from optimizing that much. With that change, the times change significantly; in particular, we can see a difference between the second and third versions:

        user     system      total        real
    0.034982   0.000266   0.035248 (  0.035400)
    0.034268   0.000134   0.034402 (  0.034522)
    0.023234   0.000140   0.023374 (  0.023607)
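
The idea behind this change is to make the argument a runtime value that no compiler can know in advance. Sketched in Ruby (the actual change was made to the Crystal program, and the helper names `fib_argument` and `fib_while` are mine), it could look like this:

```ruby
require "benchmark"

# Read n from the command line, falling back to a default
# when no usable integer argument is given.
def fib_argument(argv, default: 45)
  raw = argv.first
  return default if raw.nil?
  Integer(raw, exception: false) || default
end

# Iterative Fibonacci as a plain while loop (assumed shape).
def fib_while(n)
  a = 0
  b = 1
  i = 0
  while i < n
    a, b = b, a + b
    i += 1
  end
  a
end

n = fib_argument(ARGV)
runs = 1_000_000

total = 0
elapsed = Benchmark.realtime do
  runs.times { total += fib_while(n) }
end

puts total
puts format("%.6f s", elapsed)
```

It would be invoked as `ruby --yjit fib.rb 45`, with 45 now supplied by the caller instead of being written in the source.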

I don't think the takeaways are any different: we're still talking about a significant reduction w.r.t. the Ruby or Python versions. And as mentioned already, let me stress that a big part of this is using a primitive type (check this post by Ary that George Dietrich recommended in the forum).