sentomk

How We Made a 128-Arm Pattern Match Run 10× Faster

The 128-branch dispatch problem, and the compile-time optimization system that solved it.


Pattern matching is coming to C++ — eventually. C++26 might bring it. But if you're writing C++17 today and need to match on variants, enums, or arbitrary values with guards, destructuring, and combinators, you're on your own.

I've been building Patternia, a C++17 header-only pattern matching library, for the past year. The API is simple enough:

#include <ptn/patternia.hpp>
using namespace ptn;

auto result = match(subject) | on(
    val<Status::Ready>     >> []{ return "ready"; },
    val<Status::Pending>   >> []{ return "pending"; },
    is<Error>(ds(status))  >> [](int s){ return "error: " + std::to_string(s); },
    _                      >> []{ return "unknown"; }
);

Clean. But the API isn't the hard part. The hard part is making it fast.

The Problem

When you write a pattern match with 128 cases (not unusual for protocol parsers or command dispatchers), every on() call constructs those 128 case objects — guards, destructuring patterns, combinators — every single time the match is evaluated.

The naive approach: 10.79 nanoseconds per call.

A hand-written switch with 128 branches: 1.09 nanoseconds per call.

That's a 10× gap. If your library is slower than the code it replaces, nobody will use it.

The Solution: Automatic Dispatch Plan Selection

The core insight is that at compile time, the library knows everything about the patterns — which ones are compile-time constants, which ones match variant types, which ones have guards. So why not let the compiler pick the optimal dispatch strategy?

Patternia does this with a six-tier dispatch plan hierarchy, selected at compile time via dispatch_plan_selector:

1. static_literal_dense   — O(1) array lookup for val<V> compile-time values
2. literal_runtime_dense  — O(1) for runtime lit() values
3. literal_linear         — Direct dispatch for integral/enum subjects
4. variant_simple         — Precomputed case-index table by variant alt
5. variant_alt_bucketed   — Binary search over alt-bucketed cases
6. sequential             — Linear fallback

The selector analyzes the pattern set and picks the best strategy at compile time — no runtime overhead for the decision:

template <typename AnalysisResult>
struct dispatch_plan_selector {
    using analysis_t = AnalysisResult;

    static constexpr dispatch_plan_kind kind =
        analysis_t::can_use_static_literal_dispatch
            ? dispatch_plan_kind::static_literal_dense
            : (analysis_t::can_use_runtime_literal_dense_dispatch
                   ? dispatch_plan_kind::literal_runtime_dense
                   : (analysis_t::can_use_simple_literal_dispatch
                          ? dispatch_plan_kind::literal_linear
                          : (analysis_t::can_use_simple_variant_dispatch
                                 ? dispatch_plan_kind::variant_simple
                                 : (analysis_t::can_use_variant_alt_dispatch
                                        ? dispatch_plan_kind::variant_alt_bucketed
                                        : dispatch_plan_kind::sequential))));
};

Each tier applies density heuristics to decide whether to engage. For example, static_literal_dense only activates when the value span is at most 4× the number of literal cases — ensuring the lookup table isn't too sparse. Tuning parameters are baked in:

k_static_literal_dense_dispatch_max_cases   = 256;
k_static_literal_dense_dispatch_max_span    = 512;
k_static_literal_dense_dispatch_max_density = 4;
k_runtime_literal_dense_dispatch_max_cases  = 256;
k_runtime_literal_dense_dispatch_max_span   = 512;
k_runtime_literal_dense_dispatch_max_density = 8;

The [[no_unique_address]] Bug

Here's where it got weird. The optimization system checks whether a pattern type is "cacheable" — empty types can be skipped, reducing the case construction overhead. It used std::is_empty for this check.

One pattern type, static_literal_pattern, holds a comparator:

template <auto V, typename Cmp = std::equal_to<>>
struct static_literal_pattern : base::pattern_base<static_literal_pattern<V, Cmp>> {
  [[no_unique_address]] Cmp cmp{};
  // ...
};

std::equal_to<> is technically an empty type. We annotated it with [[no_unique_address]]. But GCC applied the attribute inconsistently — sometimes the member was optimized away, sometimes not. When GCC left it in, std::is_empty returned false for the pattern type, and the cache path was silently skipped.

The fix: replace std::is_empty with a sizeof <= 1 heuristic. Works consistently across GCC, Clang, and MSVC.

Finding this bug required reading the generated assembly for a dozen compiler versions. The lesson: [[no_unique_address]] is a hint, not a guarantee.

Results

With automatic dispatch plan selection, the 128-branch match drops from 10.79 ns to 1.13 ns — within 4% of a hand-written switch (1.09 ns):

| Approach | Latency |
| --- | --- |
| Raw `on()` (naive) | 10.79 ns |
| `PTN_ON` macro (manual cache) | 1.16 ns |
| Auto-cache (this work) | 1.13 ns |
| Hand-written `switch` | 1.09 ns |

The key win: automatic optimization means users get the fast path without knowing dispatch plans exist. They just write match(subject) | on(...) and the compile-time analyzer does the rest.

Real-World Benchmarks

Beyond the 128-branch microbenchmark, Patternia holds its own in realistic scenarios:

| Scenario | Patternia | Baseline | Result |
| --- | --- | --- | --- |
| Command parser | 1.790 ns | Switch (2.258 ns) | +26% faster |
| Protocol router | 1.628 ns | If/else (1.866 ns) | +14% faster |
| Variant mixed | 1.997 ns | std::visit (1.991 ns) | Tied |

How It Compares

Patternia isn't the only C++ pattern matching library; matchit.cpp (C++17) and Mach7 (C++14) cover similar ground. What Patternia brings:

- Pipeline syntax: `match(subject) | on(...)`
- Compile-time dispatch optimization (matchit.cpp offers only limited support here)
- Structural bindings
- Guard expressions
- Pattern combinators (any/all/neg)
- C++17, header-only

The automatic dispatch optimization is the differentiator. Other libraries can match — Patternia matches fast.

Try It

The library is a single header include away:

# CMake FetchContent, vcpkg, or just copy the headers
git clone https://github.com/sentomk/patternia
#include <ptn/patternia.hpp>
using namespace ptn;

// Your first pattern match, automatically optimized
auto result = match(value) | on(
    val<1>    >> []{ return "one"; },
    val<2>    >> []{ return "two"; },
    is<std::string>(ds(s)) >> [](auto& s){ return s; },
    _         >> []{ return "other"; }
);

The compile-time optimizer picks the right dispatch strategy. You don't have to think about it.

Patternia on GitHub — 179 stars, 438 commits, C++17 header-only.
