Ready, Set, Compile... you slow Camel

"Perl is slow."

I've heard this for years, well, since I started. You probably have too. And honestly? For a long time, I didn't have a great rebuttal. Sure, Perl's fast enough for most things; it's well known for text processing, glue code, and quick scripts. But when it came to object-heavy code, the critics had a point.

We will begin by looking at the myth of Perl being slow a little more deeply. Here's a benchmark between Perl and Python using CPU seconds, a fair comparison that measures actual work done:

=== PERL (5 CPU seconds per test) ===
Integer arithmetic             1,072,800/s
Float arithmetic                 398,800/s
String concat                    970,000/s
Array push/iterate               368,800/s
Hash insert/iterate               84,800/s
Function calls                   244,000/s
Regex match                   12,921,200/s

=== PYTHON (5 CPU seconds per test) ===
Integer arithmetic               777,200/s
Float arithmetic                 512,400/s
String concat                    627,200/s
List append/iterate              476,400/s
Dict insert/iterate              140,600/s
Function calls                   331,400/s
Regex match                   10,543,713/s

The results are more nuanced than the "Perl is slow" narrative suggests:

Operation            Winner   Margin
Integer arithmetic   Perl     1.4x faster
Float arithmetic     Python   1.3x faster
String concat        Perl     1.5x faster
Array/List ops       Python   1.3x faster
Hash/Dict ops        Python   1.7x faster
Function calls       Python   1.4x faster
Regex match          Perl     1.2x faster

Perl wins at what it's always been good at: integers, strings, and regex. Python wins at floats, data structures, and function calls, areas where I am told Python 3.x has seen heavy optimisation work.
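
Numbers like these typically come from the core Benchmark module's negative-count mode, which runs each snippet for a minimum of CPU seconds rather than wall time. A minimal sketch (the test bodies here are my assumptions, not the original harness):

use Benchmark qw(cmpthese);

# a negative count means "run each sub for at least 5 CPU seconds",
# yielding per-second rates measured in CPU time, not wall time
cmpthese(-5, {
    'Integer arithmetic' => sub { my $x = 0; $x += $_ for 1 .. 1_000 },
    'String concat'      => sub { my $s = ''; $s .= 'x' for 1 .. 1_000 },
    'Regex match'        => sub { 'abc123' =~ /\d+/ for 1 .. 1_000 },
});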

But here's the thing that surprised me: neither language is dramatically faster than the other for basic operations. The differences are measured in fractions, not orders of magnitude. So where does the "Perl is slow" reputation actually come from?

Object-oriented code. Let's run that same fair comparison:

=== Object creation + 2 method calls (5M iterations) ===
Perl bless:    4,155,178/s  (1.20 sec)
Python class:  5,781,818/s  (0.86 sec)
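
For reference, "Perl bless" here means a hand-rolled class along these lines; this is my assumption about the shape of the harness, not its verbatim code:

package Cat;

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name}, age => $args{age} }, $class;
}

# plain accessors: one sub call, one hash lookup, nothing else
sub name { $_[0]{name} }
sub age  { $_[0]{age} }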

Okay, this is not so bad. Python's only about 40% faster. But now let's look at what people actually use these days: Moo.

=== Object creation + 2 method calls (5M iterations) ===
Perl bless:    4,176,222/s  (1.20 sec)
Moo class:       843,708/s  (5.93 sec)
Python class:  5,590,052/s  (0.89 sec)

Wait, what? Moo is 6.6x slower than Python. And it's 5x slower than plain bless.

Layer this with actual business logic and you have, I'd guess, where "Perl is slow" actually comes from. It all comes down to layers. Every Moo accessor isn't just a hash lookup; it's a stack of subroutine calls, each adding overhead:

$obj->name
  └─> accessor method (generated sub)
        └─> type constraint check
              └─> coercion check
                    └─> trigger check
                          └─> lazy builder check
                                └─> finally: $self->{name}

Each of those subroutine calls means:

  • Push arguments onto the stack (~3-5 ops)
  • Create a new scope (localizing variables)
  • Execute the check (even if it's just "return true")
  • Pop the stack and return (~3-5 ops)

Even a "simple" Moo accessor with just a type constraint involves roughly 30+ additional operations compared to a plain hash access. The type constraint alone might call:

  1. has_type_constraint() - is there a constraint?
  2. type_constraint() - get the constraint object
  3. check() - call the constraint's check method
  4. The actual validation logic

Multiply that by two accessors per iteration, five million iterations, and suddenly you're spending 5 seconds instead of 1.
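
To make the layering concrete, here's the sort of Moo class whose accessors pay that tax; a minimal sketch using Type::Tiny constraints from Types::Standard, my illustration rather than the benchmark's code:

package Cat;

use Moo;
use Types::Standard qw(Str Int);

# each accessor below is a generated Perl sub that routes through
# Moo's generic machinery; the Str/Int checks run via Type::Tiny
has name => ( is => 'ro', isa => Str, required => 1 );
has age  => ( is => 'rw', isa => Int, default  => 0 );

1;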

This is the trade-off Moo makes: flexibility and safety in exchange for speed. And for most applications it's the right trade-off; even Python does this with Pydantic, which roughly halves the performance of plain Python objects.

I've spent more time than I'd care to admit thinking about this question. Not in a "let's rewrite everything in Rust" kind of way, but genuinely asking: what would it take to make Perl's object system competitive with languages people actually consider fast?

The answer, it turns out, was inside a CPAN module first released on 'Mon Jul 24 11:23:25 2000'. It was highlighted to me by another author's work; I am indeed one of the three people who not only read their blog posts but also often find themselves lost within their interesting coding patterns.

So this is the story of the four modules that changed how I think about Perl performance: Marlin, Meow, Inline and XS::JIT. They're different tools with different philosophies, but together they represent something I never quite expected to see: Perl object access that's actually faster than Python's equivalent. Not "almost as fast." Faster.

The Marlin story: A faster fish in the Moose family

If you've written any serious Perl in the last fifteen years, you've probably used Moose. Or Moo. Or Mouse. The naming convention is... well, it's a thing we do now.

Marlin fits right into that tradition, and the name's not accidental. Marlins are among the fastest fish in the ocean. That's the pitch: everything you love about Moose-style OO, but with speed as a first-class concern.

Toby Inkster released Marlin in late 2025, and it caught my attention; as I said before, many of his projects do. I'd previously attempted to write a fast OO system myself (Meow), but was struggling to even compete with Moo despite being entirely XS. Partly ability, partly still learning, mostly not doing the work at the right compile-time stage.

With my interest piqued, I installed Marlin, played with the API, and ran some benchmarks:

Benchmark: 1,000,000 iterations
            Rate   Meow    Moo Marlin  Mouse
Meow    606,061/s     --    -1%   -45%   -47%
Moo     609,756/s     1%     --   -45%   -46%
Marlin 1,098,901/s    81%    80%     --    -3%
Mouse  1,136,364/s    87%    86%     3%     --

Marlin performed well. Meow, at that point, was... not impressive. But I liked Marlin's API and, understanding my own implementation's limitations, I was satisfied enough with the speed to build my Claude modules around it, while also expecting its performance would likely improve.

A few weeks later (and a lot happened in between), on a Friday evening I randomly decided to revisit my Meow directory. Could I fix some of the flaws based on my recent learnings? I managed to, and saw a huge improvement in my own benchmarks. So I updated to the latest Marlin for a fair comparison.

I was expecting Meow to be faster now, since it does much less with its minimalist approach. But what I actually found surprised me:

Benchmark: 10,000,000 iterations
            Rate    Moo  Mouse   Meow Marlin
Moo     868,810/s     --   -47%   -60%   -81%
Mouse  1,626,016/s    87%     --   -26%   -64%
Meow   2,183,406/s   151%    34%     --   -52%
Marlin 4,504,505/s   418%   177%   106%     --

Marlin had gotten dramatically faster, over 4x improvement from the version I'd first tested. Toby had clearly been busy. And while Meow had improved too, it was still only half of Marlin's speed.

This was the moment that changed everything. I needed to understand how Marlin achieved this. What was I missing?

Just-in-time optimisation

As I mentioned, I read other people's code. I read Toby's posts on Marlin and how he'd studied Mouse's optimisation strategy: only validate when you absolutely need to. But when I started tracing through Marlin's actual implementation, something clicked.

The key insight is in Marlin::Attribute::install_accessors. Here's what happens when Marlin sets up a reader:

if ( $type eq 'reader' and !$me->has_simple_reader and $me->xs_reader ) {
    $me->{_implementation}{$me->{$type}} = 'CXSR';  # Class::XSReader
}
elsif ( HAS_CXSA and $me->has_simple_reader ) {
    # Use Class::XSAccessor for simple cases
    Class::XSAccessor->import( class => $me->{package}, ... );
}

Marlin makes a compile-time decision: what kind of accessor does this attribute actually need?

  • Simple getter (no default, no lazy, no type check on read)? → Use Class::XSAccessor, which is pure XS and blindingly fast
  • Getter with lazy default or type coercion? → Use Class::XSReader, which handles the complexity in optimised C
  • Something exotic (auto_deref, custom behaviour)? → Fall back to generated Perl

This is the magic. Most Moo-style accessors go through a generic code path that handles every possible feature, even features you're not using. Marlin analyses your attribute definition at compile time and generates the minimal accessor that satisfies your requirements.
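
For the simple-getter case, the installed accessor amounts to ordinary Class::XSAccessor usage; a sketch of the idea rather than Marlin's actual generated code:

package Cat;

# a pure-XS getter: no generated Perl sub sits between the method
# call and the hash lookup
use Class::XSAccessor
    getters => { name => 'name' };   # method name => hash key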

Consider a read-only attribute with a type but no default:

# Moo accessor path:
$obj->name
   check if lazy builder needed     # nope, but we still check
   check if default needed          # nope, but we still check  
   check if coercion needed         # nope, but we still check
   finally: $self->{name}

# Marlin accessor (Class::XSAccessor):
$obj->name
   $self->{name}                    # that's it. One XS call.

The type constraint? Marlin validates it in the constructor, not the getter. Once an object is built, reading an attribute is just a hash lookup: no validation, no subroutine calls, no stack manipulation.
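
The principle is easy to sketch in plain Perl: validate once in the constructor so that reads stay bare. A hand-rolled illustration:

package Cat;

sub new {
    my ($class, %args) = @_;
    # pay for the type check once, at construction time
    die "name must be a string"
        unless defined $args{name} && !ref $args{name};
    return bless { name => $args{name} }, $class;
}

# the reader never re-validates: just a hash lookup
sub name { $_[0]{name} }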

This is why Marlin went from 1.1M ops/sec to 4.5M ops/sec between versions. Toby wasn't just optimising code. He was eliminating entire categories of runtime work by moving decisions to compile time.

The same principle applies to the constructor via Class::XSConstructor. Instead of a chain of Perl subroutines processing each attribute, Marlin generates a specialised XS constructor that knows exactly which attributes need defaults, which need type checking, and in what order to process them.

It's JIT compilation, but done at module load time rather than runtime. By the time your code calls ->new or ->name, all the decisions have been made. All that's left is the actual work.

This was my revelation: the path to fast Perl OO isn't avoiding features, it's avoiding runtime feature detection. Know what you need at compile time, generate optimised code for exactly that, and get out of the way.

Now the question became: could I apply this same principle to Meow? It was already set up to build a simple hash that represented the object, so I had what I needed, but I wanted to do this in a backwards-compatible way.

Enter Inline::C

Armed with the understanding of why Marlin was fast, I had a hypothesis: if I could generate XS accessors at compile time tailored to each attribute's needs, Meow could achieve the same performance.

I needed a way to generate custom C code and then execute it. For Perl, that tool was written by Ingy döt Net back in 2000: Inline::C.

The idea was simple: when Meow sees ro name => Str, it should generate C code for an accessor that:

  1. Takes the object
  2. Returns the value at the slot index for name
  3. That's it. No method dispatch, no type checking, no feature checking.

I didn't want to just break everything, so I leaned into the Moose catalog and added a make_immutable phase. When this is called, it compiles the C code needed to generate an optimised XS package, and this is fed into Inline::C. The first run would compile; subsequent runs would use the cached .so.
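
Stripped of Meow's internals, the mechanism looks roughly like this; Inline's documented bind interface takes a string of C and compiles it at runtime (the accessor body here is a simplified assumption, not Meow's generated code):

use Inline;

my $c_code = <<'END_C';
SV* get_name (SV* self) {
    AV *av = (AV*)SvRV(self);        /* the object is an array ref */
    SV **slot = av_fetch(av, 0, 0);  /* slot 0 = name */
    return slot ? newSVsv(*slot) : newSV(0);
}
END_C

# first call compiles and caches a .so; later runs just load it
Inline->bind( C => $c_code );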

And it worked. I had to switch the benchmark to CPU seconds to get a fair result. I've also included a Cor test here, which, unlike Marlin and Meow, does no type checking.

Benchmark: running Cor, Marlin, Meow for at least 5 CPU seconds...
       Cor:  5 wallclock secs ( 5.13 usr +  0.02 sys =  5.15 CPU) @ 2,886,788/s
    Marlin:  5 wallclock secs ( 5.01 usr +  0.11 sys =  5.12 CPU) @ 4,523,074/s
      Meow:  5 wallclock secs ( 5.16 usr +  0.02 sys =  5.18 CPU) @ 4,558,344/s

As you can see, Meow had caught Marlin. Actually, it was slightly faster, 4.56M vs 4.52M ops/sec, but that's to be expected, as Meow does a lot less than Marlin.

But my bottleneck was now Inline::C itself, and, well, nobody wants to write C/XS, let alone concatenate it:

  1. Startup overhead: First compilation was slow, several seconds for a complex class
  2. Dependencies: Inline::C pulls in Parse::RecDescent, adding complexity to the dependency chain
  3. Build process: It generates a full Makefile.PL and runs the ExtUtils::MakeMaker machinery
  4. Caching: The caching mechanism is designed for "write once" scripts, not dynamic code generation

For a proof of concept, Inline::C was perfect. But for a production module, I needed something leaner. That's when I started looking at what Inline::C actually does under the hood, and wondering how much of it I could strip away.

Under the hood: XS::JIT as the secret weapon

Inline::C proved the concept worked, but it came with baggage. Every compile spawned a full Makefile.PL build process. Dependencies bloated the install. And the caching system, designed for write-once scripts, wasn't ideal for dynamic code generation.

So I started picking apart what Inline::C actually does:

  1. Parse C code to find function signatures
  2. Generate XS wrapper code
  3. Generate a Makefile.PL
  4. Run perl Makefile.PL && make
  5. Load the resulting .so

And yes, this happens even when you use Inline->bind(C => ...) instead of the use form. The bind method just defers compilation to runtime rather than compile time. It doesn't change what gets done, only when. You still get the full Parse::RecDescent parsing, the xsubpp processing, the MakeMaker dance. The only difference is whether it happens at use time or when bind is called.

Most of this was unnecessary for my use case. I didn't need function parsing, I already knew what functions I was generating. I didn't need XS wrappers, I was writing XS-native code directly. And I definitely didn't need the Makefile.PL dance.

XS::JIT strips all of that away. It's a single-purpose tool: take C code, compile it, load it, install the functions. No parsing. No xsubpp. No make. Direct compiler invocation.

Here's what the C API looks like:

#include "xs_jit.h"

/* Function mapping - where to install what */
XS_JIT_Func funcs[] = {
    { "Cat::new",  "cat_new",  0, 1 },  /* target, source, varargs, xs_native */
    { "Cat::name", "cat_name", 0, 1 },
    { "Cat::age",  "cat_age",  0, 1 },
};

/* Compile and install in one call */
int ok = xs_jit_compile(aTHX_
    c_code,           /* Your generated C code */
    "Meow::JIT::Cat", /* Unique name for caching */
    funcs,            /* Function mapping array */
    3,                /* Number of functions */
    "_CACHED_XS",     /* Cache directory */
    0                 /* Don't force recompile */
);

That's it. One function call. The first time it runs, XS::JIT:

  1. Generates a boot function that registers all the XS functions
  2. Compiles directly with the system compiler (cc -shared -fPIC ...)
  3. Loads the .so with DynaLoader
  4. Installs each function into its target namespace

Subsequent runs? It hashes the C code, finds the cached .so, and just loads it. The compile step vanishes entirely.
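
Conceptually, the caching is content-addressed; something like this sketch, where compile_to_so is a hypothetical helper standing in for the direct compiler invocation, and SHA-256 is my assumption for the hash:

use Digest::SHA qw(sha256_hex);
require DynaLoader;

# identical C source => identical key => the compile step is skipped
my $key = sha256_hex($c_code);
my $so  = "_CACHED_XS/$key.so";

compile_to_so($c_code, $so) unless -e $so;      # hypothetical helper
my $libref = DynaLoader::dl_load_file($so, 0);  # load the cached .so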

The key insight is the is_xs_native flag. When set, XS::JIT creates a simple alias: no wrapper function, no stack manipulation, no overhead. Your C function is the XS function:

XS_EUPXS(cat_name) {
    dVAR; dXSARGS;
    SV *self = ST(0);
    AV *av = (AV*)SvRV(self);
    SV **slot = av_fetch(av, 0, 0);  /* slot 0 = name */
    ST(0) = slot ? *slot : &PL_sv_undef;
    XSRETURN(1);
}

No wrapper. No intermediate calls.

This is exactly what Meow needed. During make_immutable, it:

  1. Analyses each attribute's requirements (type constraint? coercion? trigger?)
  2. Generates minimal XS accessor code for each one
  3. Generates an optimised XS constructor that handles all attributes in one pass
  4. Hands the code to XS::JIT for compilation
  5. Gets back installed functions ready to call

The entire JIT compilation happens once per class, at module load time. By the time your code runs, everything is native XS.

Comparing the approaches

Here's what actually happens at runtime for each framework:

Moo accessor call:

$obj->name
  → Perl method dispatch
    → Generated Perl subroutine
      → has_type_constraint() check
        → type_constraint() fetch
          → check() call
            → finally: $self->{name}

Stack frames: 4-6. Operations: ~30.

Marlin accessor call (Class::XSAccessor):

$obj->name
  → Perl method dispatch
    → XS accessor
      → $self->{name}

Stack frames: 1. Operations: ~5.

Note: Toby also has some slot magic in play here.

Meow accessor call (XS::JIT):

$obj->name
  → Perl method dispatch
    → XS accessor
      → $self->[SLOT_INDEX]

Stack frames: 1. Operations: ~4 (arrays are slightly faster than hashes).

The benchmark results

With XS::JIT in place, here's where Meow now landed:

Benchmark: running Cor, Marlin, Meow for at least 5 CPU seconds... (Marlin and Meow have type constraint checking)
       Cor:  5 wallclock secs ( 5.13 usr +  0.02 sys =  5.15 CPU) @ 2886788.16/s (n=14866959)
    Marlin:  5 wallclock secs ( 5.01 usr +  0.11 sys =  5.12 CPU) @ 4523074.80/s (n=23158143)
      Meow:  5 wallclock secs ( 5.16 usr + -0.01 sys =  5.15 CPU) @ 5196218.06/s (n=26760523)
Benchmark: running Marlin, Meow, Moo, Mouse for at least 5 CPU seconds...
    Marlin:  5 wallclock secs ( 5.22 usr +  0.13 sys =  5.35 CPU) @ 4814728.04/s (n=25758795)
      Meow:  5 wallclock secs ( 5.23 usr +  0.01 sys =  5.24 CPU) @ 5203329.96/s (n=27265449)
       Moo:  4 wallclock secs ( 5.28 usr +  0.00 sys =  5.28 CPU) @ 860649.81/s (n=4544231)
     Mouse:  6 wallclock secs ( 5.29 usr +  0.01 sys =  5.30 CPU) @ 1603849.25/s (n=8500401)
            Rate    Moo  Mouse Marlin   Meow
Moo     860650/s     --   -46%   -82%   -83%
Mouse  1603849/s    86%     --   -67%   -69%
Marlin 4814728/s   459%   200%     --    -7%
Meow   5203330/s   505%   224%     8%     --

I must be honest: around this time I had not implemented the full benchmarks against Perl and Python. I didn't fully understand the difference, so I suspected I was hitting limitations of my own hardware (it was late, or early in the morning). Anyway, I kept pushing and ran a benchmark where I accessed the slot directly as an array reference. This got me excited:

Meow (direct) 7,172,481/s     778%    347%     50%     14%

I was seeing a huge improvement. I spent some time making an API that was a little nicer by exposing constants as slot indexes:

{
    package Cat;
    use Meow;
    ro name => Str;
    ro age => Int;
    make_immutable;  # Creates $Cat::NAME, $Cat::AGE
}

# Direct slot access
my $name = $cat->[$Cat::NAME];

I was now on par with Python, but I wanted more. There had to be a way to get that array access without the ugly syntax.

So I dug deeper into Perl's internals and found the missing magic: cv_set_call_checker and custom ops.

The entersub bypass: Custom ops

Here's what normally happens when you call a method in Perl:

name($cat)
  → OP_ENTERSUB (the "call function" op)
    → Push arguments onto stack
    → Look up the CV (code value)
    → Set up new stack frame
    → Execute the XS function
    → Pop stack frame
    → Return

Even for our minimal XS accessor, there's overhead: the entersub op itself, the stack frame setup, the CV lookup. What if we could eliminate all of that?

Perl provides a hook called cv_set_call_checker. It allows you to register a "call checker" function that runs at compile time when the parser sees a call to your subroutine. The checker can inspect the op tree and, crucially, replace it with something else entirely.

Here's what Meow does:

static void _register_inline_accessor(pTHX_ CV *cv, IV slot_index, int is_ro) {
    SV *ckobj = newSViv(slot_index);  /* Store slot index for later */
    cv_set_call_checker_flags(cv, S_ck_meow_get, ckobj, 0);
}

When the checker sees name($cat), it:

  1. Extracts the $cat argument from the op tree
  2. Frees the entire entersub operation
  3. Creates a new custom op with the slot index baked in
  4. Returns that instead

The custom op is trivially simple:

static OP *S_pp_meow_get(pTHX) {
    dSP;
    SV *self = TOPs;
    PADOFFSET slot_index = PL_op->op_targ;  /* Baked into the op */

    SV **ary = AvARRAY((AV*)SvRV(self));
    SETs(ary[slot_index] ? ary[slot_index] : &PL_sv_undef);

    return NORMAL;
}

That's the entire accessor. No function call. No stack frame. No CV lookup. The slot index is embedded directly in the op structure. The Perl runloop executes this op directly; it's as close to $cat->[$NAME] as you can get while still looking like name($cat).

This is the same technique that builtin::true and builtin::false use in Perl 5.36+. It's also how List::Util::first can be optimised when given a simple block.
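
You can watch this class of rewrite happen with B::Concise; on Perl 5.36+, something like the following shows a const op where an entersub would otherwise appear:

perl -MO=Concise -E 'use builtin qw(true); my $x = true'
# builtin::true's call checker replaces the entersub at compile
# time, so the listing contains a constant instead of a call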

The final benchmark

With custom ops in place via import_accessors, here's how the Perl OO frameworks compare:

Benchmark: running Marlin, Meow, Meow (direct), Meow (op), Moo, Mouse for at least 5 CPU seconds...
    Marlin:  6 wallclock secs ( 5.09 usr +  0.11 sys =  5.20 CPU) @ 4766685.58/s (n=24786765)
      Meow:  5 wallclock secs ( 5.29 usr +  0.01 sys =  5.30 CPU) @ 6289606.79/s (n=33334916)
Meow (direct):  5 wallclock secs ( 5.32 usr +  0.01 sys =  5.33 CPU) @ 7172480.86/s (n=38229323)
 Meow (op):  5 wallclock secs ( 5.16 usr +  0.01 sys =  5.17 CPU) @ 7394453.19/s (n=38229323)
       Moo:  4 wallclock secs ( 5.44 usr +  0.02 sys =  5.46 CPU) @ 816865.93/s (n=4460088)
     Mouse:  4 wallclock secs ( 5.18 usr +  0.01 sys =  5.19 CPU) @ 1605727.55/s (n=8333726)
                   Rate      Moo   Mouse  Marlin    Meow Meow (direct) Meow (op)
Moo            816866/s       --    -49%    -83%    -87%          -89%      -89%
Mouse         1605728/s      97%      --    -66%    -74%          -78%      -78%
Marlin        4766686/s     484%    197%      --    -24%          -34%      -36%
Meow          6289607/s     670%    292%     32%      --          -12%      -15%
Meow (direct) 7172481/s     778%    347%     50%     14%            --       -3%
Meow (op)     7394453/s     805%    361%     55%     18%            3%        --

Now let's test that directly against Python:

============================================================
Python Direct Benchmark (slots + property accessors)
============================================================
Python version: 3.9.6 (default, Dec  2 2025, 07:27:58)
[Clang 17.0.0 (clang-1700.6.3.2)]
Iterations: 5,000,000
Runs: 5
------------------------------------------------------------
Run 1: 0.649s (7,704,306/s)
Run 2: 0.647s (7,733,902/s)
Run 3: 0.646s (7,736,307/s)
Run 4: 0.648s (7,720,909/s)
Run 5: 0.649s (7,702,520/s)
------------------------------------------------------------
Median rate: 7,720,909/s
============================================================
============================================================
Perl/Meow Benchmark Comparison
============================================================
Perl version: 5.042000
Iterations: 5000000
Runs: 5
------------------------------------------------------------
Inline Op (one($foo)):
  Run 1: 0.638s (7,841,811/s)
  Run 2: 0.629s (7,954,031/s)
  Run 3: 0.631s (7,929,850/s)
  Run 4: 0.631s (7,926,316/s)
  Run 5: 0.633s (7,901,675/s)
  Median: 7,926,316/s
============================================================
Summary:
------------------------------------------------------------
  Inline Op:    7,926,316/s
============================================================

Conclusion: Why JIT might be the right approach

Looking back at this journey, a pattern emerges. The fastest code isn't the cleverest code. It's the code that does the least work at runtime.

Moo is slow because it makes decisions at runtime that could be made at compile time. Every accessor call checks for features that aren't being used. Every type check goes through layers of indirection that exist to support edge cases.

Marlin proved that you could have Moo's features without Moo's overhead by making smart choices at compile time. If an accessor doesn't need lazy building, don't generate code that checks for lazy building.

Meow pushed this further: if you're going to generate code at compile time anyway, why not generate exactly the code you need? Not a generic accessor that handles many cases, but a specific accessor for this specific attribute on this specific class.

And XS::JIT made that practical. Without a lightweight JIT compiler, dynamic XS generation would require shipping a C toolchain with every module, or adding multi-megabyte dependencies. XS::JIT strips the problem down to its essence: take C code, compile it, load it.

The result is object access that competes with, and sometimes beats, languages that have had decades of optimisation work. Not because Perl's interpreter got faster, but because we stopped asking it to do unnecessary work.

Is this approach right for every project? No. Most applications don't need 7 million object accesses per second.

But for the times when performance matters (hot loops, high-frequency trading, real-time systems), it's good to know the ceiling isn't as low as we thought. Perl can be fast. We just needed to get out of its way.


The modules discussed in this post: Marlin, Meow, Inline::C and XS::JIT.
