I've been working on a FIX protocol engine in C++23. Header-only, about 5K lines, compiled with -O2 -march=native on Clang 18. Parses an ExecutionReport in ~246 ns on my bench rig. QuickFIX does the same message in ~730 ns.
Before anyone gets excited: single core, pinned affinity, warmed cache, synthetic input. Not production traffic. The 3x gap will shrink on real messages with variable-length fields and optional tags. I know.
But the code that got there was more interesting to me than the final number. Most of the gains came from replacing stuff that QuickFIX had to build by hand because C++98 didn't have the tools.
The pool that disappeared
QuickFIX has a hand-rolled object pool. About 1,000 lines of allocation logic, intrusive free lists, manual cache line alignment. Made total sense when it was written. C++98 didn't give you anything better.
Now there's std::pmr::monotonic_buffer_resource. Stack buffer, pointer bump, reset between messages:
template <size_t Size>
class MonotonicPool : public std::pmr::memory_resource {
    alignas(64) std::array<char, Size> buffer_{};
    std::pmr::memory_resource* upstream_;
    std::pmr::monotonic_buffer_resource resource_;

public:
    MonotonicPool() noexcept
        : upstream_{std::pmr::null_memory_resource()}
        , resource_{buffer_.data(), buffer_.size(), upstream_} {}

    void reset() noexcept { resource_.release(); }

    // do_allocate/do_deallocate just forward to resource_
};
Call reset() after each message. P99 went from 780 ns to 56 ns. That's 14x on the tail, and it's basically just "stop hitting the allocator."
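To make the per-message pattern concrete, here's a self-contained sketch: the pool from above with the forwarding overrides actually written out (memory_resource's virtuals are pure, so they can't stay elided in compiling code), plus a hypothetical per-message workload. `handle_one_message` and its field-offset scratch are illustrative, not the engine's real API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <vector>

template <std::size_t Size>
class MonotonicPool : public std::pmr::memory_resource {
    alignas(64) std::array<char, Size> buffer_{};
    std::pmr::monotonic_buffer_resource resource_{
        buffer_.data(), buffer_.size(), std::pmr::null_memory_resource()};

public:
    void reset() noexcept { resource_.release(); }

private:
    void* do_allocate(std::size_t n, std::size_t a) override {
        return resource_.allocate(n, a);
    }
    void do_deallocate(void* p, std::size_t n, std::size_t a) override {
        resource_.deallocate(p, n, a);  // no-op for a monotonic resource
    }
    bool do_is_equal(const std::pmr::memory_resource& o) const noexcept override {
        return this == &o;
    }
};

// Hypothetical workload: per-message scratch lives in the pool, the pool is
// rewound between messages, and nothing ever reaches malloc.
inline std::size_t handle_one_message(MonotonicPool<4096>& pool) {
    std::size_t fields = 0;
    {
        std::pmr::vector<std::uint16_t> offsets{&pool};
        offsets.reserve(40);
        for (std::uint16_t i = 0; i < 40; ++i) offsets.push_back(i);
        fields = offsets.size();
    }              // scratch destroyed before the rewind
    pool.reset();  // O(1): bump pointer back to the start of buffer_
    return fields;
}
```

The scoping matters: the pmr container must be destroyed before `reset()`, since the rewind invalidates everything allocated from the buffer.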
I also use mimalloc for per-session heaps. mi_heap_new() per session, mi_heap_destroy() on disconnect. Felt wasteful at first, like I was throwing away too much memory per session. But perf stat said otherwise so I stopped arguing.
consteval tag lookup
FIX messages are key-value pairs with integer tag numbers. Tag 35 is MsgType, tag 49 is SenderCompID, tag 55 is Symbol. QuickFIX resolves these with a switch statement, fifty-something cases.
C++23 lets you build the lookup table at compile time:
inline constexpr int MAX_COMMON_TAG = 200;

consteval std::array<TagEntry, MAX_COMMON_TAG> create_tag_table() {
    std::array<TagEntry, MAX_COMMON_TAG> table{};
    for (auto& entry : table) {
        entry = {"", false, false};
    }
    table[1]  = {TagInfo<1>::name,  TagInfo<1>::is_header,  TagInfo<1>::is_required};
    table[8]  = {TagInfo<8>::name,  TagInfo<8>::is_header,  TagInfo<8>::is_required};
    table[35] = {TagInfo<35>::name, TagInfo<35>::is_header, TagInfo<35>::is_required};
    // ~30 more entries
    return table;
}

inline constexpr auto TAG_TABLE = create_tag_table();
[[nodiscard]] inline constexpr std::string_view tag_name(int tag_num) noexcept {
    if (tag_num >= 0 && tag_num < MAX_COMMON_TAG) [[likely]] {
        return TAG_TABLE[tag_num].name;
    }
    return "";
}
Array index, O(1); the only runtime branch left is a single well-predicted range check. About 300 switch-case branches eliminated across the parser.
Field offsets use the same trick. QuickFIX stores them in a std::map<int, offset>, so every field access is a tree traversal. Here it's offsets_[tag]. Took me a while to get the constexpr initialization right for nested structs, but once it compiled it was basically free.
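The flat-array idea looks roughly like this sketch. Names (`FieldIndex`, `record`, `offset_of`, the `NO_FIELD` sentinel) are hypothetical stand-ins for the engine's real structures; the point is that the map's tree traversal becomes one indexed load.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

inline constexpr int MAX_COMMON_TAG = 200;            // as in the tag table
inline constexpr std::uint16_t NO_FIELD = 0xFFFF;     // "tag absent" sentinel

struct FieldIndex {
    std::array<std::uint16_t, MAX_COMMON_TAG> offsets_{};

    constexpr FieldIndex() { offsets_.fill(NO_FIELD); }

    // Recorded once while the delimiter scan walks the message.
    constexpr void record(int tag, std::uint16_t offset) {
        if (tag >= 0 && tag < MAX_COMMON_TAG) offsets_[tag] = offset;
    }

    // O(1) access: one load, no tree traversal, no comparisons per level.
    [[nodiscard]] constexpr std::uint16_t offset_of(int tag) const noexcept {
        return (tag >= 0 && tag < MAX_COMMON_TAG) ? offsets_[tag] : NO_FIELD;
    }
};
```

The trade-off versus std::map is a fixed 400-byte table per message regardless of how many fields are present, which the pool above hands out for free.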
SIMD: the scenic route
FIX uses SOH (0x01) as the field delimiter. Scanning for it byte-by-byte is fine until your messages have 40+ fields.
Started with raw AVX2 intrinsics. Worked. Process 32 bytes, compare against SOH, extract positions from the bitmask:
const __m256i soh_vec = _mm256_set1_epi8(fix::SOH);
for (size_t i = 0; i < simd_end; i += 32) {
    __m256i chunk = _mm256_loadu_si256(
        reinterpret_cast<const __m256i*>(ptr + i));
    __m256i cmp = _mm256_cmpeq_epi8(chunk, soh_vec);
    uint32_t mask = static_cast<uint32_t>(_mm256_movemask_epi8(cmp));
    while (mask != 0) {
        int bit = __builtin_ctz(mask);  // lowest set bit
        result.push(static_cast<uint16_t>(i + bit));
        mask &= mask - 1;               // clear it
    }
}
Then I realized I'd need an AVX-512 path, an SSE path, and an ARM NEON path. Four copies of the same logic with different intrinsic names. Maintaining that sounded miserable.
Tried Highway (Google's portable SIMD library). Nice API, but the build dependency was heavy for a header-only project. Compile times went up noticeably. I spent a couple hours trying to make it work as a submodule before giving up.
Ended up on xsimd. Header-only, template-based, picks the instruction set at compile time:
template <typename Arch>
inline SohPositions scan_soh_xsimd(std::span<const char> data) noexcept {
    using batch_t = xsimd::batch<uint8_t, Arch>;
    constexpr size_t width = batch_t::size;
    const batch_t soh_vec(static_cast<uint8_t>(fix::SOH));
    // same loop, portable across architectures
}
Raw AVX2 was maybe 5% faster on the same hardware. I kept both paths in the repo but default to xsimd. The portability is worth 5%.
SOH scan throughput: 3.32 GB/s. Sounds impressive until you realize that's just finding delimiters. Actual parsing is slower. But it means delimiter scanning isn't the bottleneck anymore, which is the whole point.
What didn't get simpler
Session state. FIX sessions have sequence numbers, heartbeat timers, gap fill logic, reject handling. I was hoping std::expected would clean up the error propagation and... it helped a little. Like 10% less boilerplate. The complexity is in the protocol, not the language. It's a state machine with a lot of branches and I don't think any C++ standard is going to fix that.
Message type coverage. I've got 9 types (NewOrderSingle, ExecutionReport, the session-level ones). QuickFIX covers all of them. Adding a new type isn't hard, just tedious. Field definitions, validation rules, serialization. About a day per message type if you include tests. I got to nine and just... stopped. Started working on the transport layer instead because that was more interesting. Not my proudest engineering decision.
Header-only at 5K lines. Compiles in 2.8s on Clang, 4.1s on GCC. That's fine on my machine. No idea what happens on a CI runner with 2GB of RAM. I keep saying I'll add a compiled-library option. Haven't done it.
Benchmarks
$ ./bench --iterations=100000 --pin-cpu=3
ExecutionReport parse: 246 ns (QuickFIX: 730 ns)
NewOrderSingle parse: 229 ns (QuickFIX: 661 ns)
Field access (4): 11 ns (QuickFIX: 31 ns)
Throughput: 4.17M msg/sec (QuickFIX: 1.19M msg/sec)
Single core, RDTSCP timing, 100K iterations, synthetic messages. Not captured from a real feed. The gap will narrow on production traffic with variable-length fields and optional tags. I'm pretty confident the parser is faster, just not sure by how much once you leave the lab.
Where I am with it
Not production-ready. Parser and session layer work well enough to benchmark, but nobody should route real orders through this.
The thing that kept surprising me was how much of QuickFIX's complexity was the language, not the problem. PMR replaced a thousand-line pool. consteval eliminated a fifty-case switch. And xsimd collapsed four architecture-specific codepaths into one template. These aren't exotic features either; they just didn't exist in C++98. I don't know if this thing will ever cover all the message types QuickFIX does, but the parser core feels solid enough that I keep coming back to it on weekends.
GitHub: github.com/Lattice9AI/NexusFIX
Still figuring out: whether header-only holds past 10K lines, how much the 3x gap closes on captured traffic, and which message types actually matter beyond the obvious nine. If you've worked with FIX and have opinions on any of that, I'm interested.
Part of NexusFIX, an open-source FIX protocol engine in C++23.