sentomk
The 128-arm pattern dispatch problem and how we got from 11ns to 1.1ns

The problem

Pattern matching on a large set of literal values looks clean in code but hits a wall at runtime. Every on() call constructs case objects for every arm.

With 128 arms, that is 128 object constructions per match call. At 11ns per call, this is fine for one-off use. Inside a hot loop, it is a disaster.

```cpp
// Clean syntax, 128 case objects constructed per call
return match(x) | on(
  lit(0) >> handle_0,
  lit(1) >> handle_1,
  // ... 126 more arms
  _ >> default_handler
);
```

The cases are compile-time constants. The values do not change. But the library has to construct them every time, because on() receives them as function arguments, and C++ evaluates arguments at the call site; the optimizer cannot reliably prove that 128 constructions are removable.

This is not a bug in the library design. It is a structural cost of the pipeline syntax match(x) | on(...). The syntax gives you readable, composable matching. The cost is that the case list is a runtime object.

Or it was, until we found a way to make it not.

First attempt: static caching with is_constant_evaluated()

The idea was simple. If the match expression appears in a constexpr context, the compiler already knows the subject value and all the case values. We can unfold the entire match at compile time and bypass the runtime machinery entirely.

C++20 gave us std::is_constant_evaluated(). When this returns true, the compiler is evaluating this code in a constexpr context. We can take the static path:

```cpp
template <typename Plan, typename Subject, typename CasesTuple>
constexpr auto dispatch(Subject& subject, CasesTuple&& cases) {
  if (std::is_constant_evaluated()) {
    // Compile-time path: unfold cases statically
    return static_dispatch<Plan>(subject, std::forward<CasesTuple>(cases));
  }
  // Runtime path: use the normal evaluation engine
  return runtime_dispatch<Plan>(subject, std::forward<CasesTuple>(cases));
}
```

Nice in theory. In practice, this only helps when the entire call site is constexpr. For runtime matches with compile-time-known case values, the cases are still constructed. The constexpr branch does not fire.

We needed something more aggressive: detect at compile time that every case in the list is a static literal, then build a dense dispatch table that the runtime evaluator can use as an O(1) lookup.

The trap: [[no_unique_address]] and std::is_empty

Static literal patterns look like empty types:

```cpp
template <auto V, typename Cmp = std::equal_to<>>
struct static_literal_pattern {
  static constexpr auto value = V;
  [[no_unique_address]] Cmp cmp{};
};
```

std::equal_to<> is an empty type. [[no_unique_address]] should allow the compiler to optimize cmp to zero size, making the whole pattern trivially small.

We used std::is_empty as the gate for static caching eligibility:

```cpp
// Naive check: if the pattern is empty, we can cache it
if constexpr (std::is_empty_v<pattern_t>) {
  // ... static cache path
}
```

This broke on GCC. The reason: [[no_unique_address]] is a permission, not a requirement. GCC does not treat a class with a [[no_unique_address]] data member as empty, even when that member has an empty type. So the Cmp member, despite being stateless, prevents the pattern from satisfying std::is_empty.

The static cache path was silently skipped. The dense dispatch table was never built. The 128-case match stayed at 11ns.

The fix: sizeof <= 1

The insight was that we do not actually need the type to be empty. We need it to be small enough that caching it is worthwhile. One byte is the smallest addressable unit. If a type is one byte, it can be treated as a tag with no runtime state.

```cpp
// Fixed: check if the pattern is trivial to cache
template <typename T>
constexpr bool is_cacheable_pattern_v = sizeof(T) <= 1;
```

This works because static_literal_pattern has one static data member (value) and one potentially-zero-size member (cmp). On GCC, if [[no_unique_address]] is not applied to cmp, the empty member occupies 1 byte, so the pattern is 1 byte. On Clang, with the attribute applied, cmp occupies no storage, but sizeof of a complete type is never 0, so the pattern is still 1 byte. Either way, sizeof(T) <= 1 holds, and the check is portable across GCC, Clang, and MSVC.

With this fix, the dispatch planner correctly identifies that all 128 cases are static literals. It builds a compile-time index table mapping subject values to case indices. The runtime evaluator becomes:

```cpp
// Runtime: O(1) lookup with bounds check
if (subject >= min_value && subject <= max_value) {
  auto idx = case_index_table[subject - min_value];
  if (idx != k_invalid_case_index) {
    // std::get needs a compile-time index, so idx selects from a
    // compile-time-generated table of per-arm thunks (sketched here)
    return handler_thunks[idx](cases, subject);
  }
}
return otherwise_handler(subject);
```

The numbers

Benchmarked on an Intel i7-13700K, GCC 13.2, -O3 -march=native. Subject type: int, 128 literal arms, uniformly random input.

| Approach | Latency | vs. switch |
| --- | --- | --- |
| Original `on()` (constructs all cases) | 10.79 ns | 9.9x slower |
| `PTN_ON` macro (manual static, best practice) | 1.16 ns | 1.06x slower |
| Auto-cache (this fix) | 1.13 ns | 1.04x slower |
| Hand-written `switch` (theoretical lower bound) | 1.09 ns | baseline |

The auto-cache approach matches the previous best practice (the PTN_ON macro) and comes within 0.04ns of a hand-written switch statement. That 0.04ns gap comes from a branch predictor mispredict on the bounds check, which is not under library control.

For context: the original implementation was 9.9x slower than a hand-written switch. The fix brings it to 1.04x. A nearly 10x improvement from changing one condition in the type traits.

What this says about library design

The interesting part is not the optimization itself. It is that the entire improvement came from a type trait that was answering the wrong question.

std::is_empty asks "does this type have zero size?" That is a property of the type in the abstract C++ machine. But the actual question we needed to answer was "can this pattern be cached without measurable overhead?" That is an engineering question, not a type theory one.

sizeof <= 1 answers the engineering question. It says: this type fits in one byte, which is the smallest unit the hardware can address. Whether the compiler achieves this through [[no_unique_address]], empty base optimization, or something else — we do not care. The result is portable and measurable.

This is a pattern I have seen across performance work: the clean theoretical property is not the one that matters. The dirty empirical threshold is.

Try it

The implementation is open source at github.com/sentomk/patternia. The dispatch planner lives in include/ptn/core/common/optimize.hpp. The static literal cache logic is in include/ptn/core/common/eval.hpp.

Header-only, C++17, no dependencies beyond the standard library.
