Piyush Kumar

How I Built a C++ Market Data Parser That Processes 5.5 Million Messages/Sec

Architectural Patterns for a 5.5M msgs/sec Market Data Parser in C++20

Processing raw market data feeds (like NASDAQ ITCH 5.0 or NSE FO) requires strict adherence to low-latency principles. Recently, I built a C++20 parser to ingest these feeds, normalize them, and manage a Limit Order Book (LOB).

By strictly controlling memory allocation and maximizing CPU cache locality, the parser achieves a throughput of ~5.5 Million messages/sec (169 MB/s) on an Apple Silicon (M-Series) processor, with a P50 latency of 84 nanoseconds per message.

This write-up covers the three primary technical patterns used to achieve this throughput.

High-Performance Market Data Parser

A production-grade, low-latency C++ market data engine designed for High-Frequency Trading (HFT) applications. It ingests raw exchange feeds (NASDAQ ITCH 5.0, NSE FO), normalizes them efficiently, and maintains a clean Limit Order Book (LOB) as its state of the world.

Current Benchmarks

Hardware Environment: Apple Silicon M-Series (Tested on 10-core CPU)

| Component | Throughput | Items per Second | Description |
| --- | --- | --- | --- |
| NSE Parser | ~381 MB/s | ~247k msgs/sec | Zero-copy parser |
| NASDAQ ITCH | ~169 MB/s | ~5.5 million msgs/sec | Zero-copy parser |
| Order Book (Add) | - | ~9M to 22M ops/sec | Core engine insertion rate (varies by book size) |
| Order Book (Match) | - | ~236M to 291M ops/sec | Core engine exact-match rate |

Achieved via direct buffer casting, custom memory resources (PMR), and branch-free endian conversion.

End-to-End Execution (Real-World Data)

Processing a full 11.24 GB historical NASDAQ ITCH 5.0 file (01302019.NASDAQ_ITCH50):

  • Total Messages Parsed: 368,366,634
  • Execution Time: 97.48 seconds
  • Throughput: ~3.77 million msgs/sec

1. Zero-Copy Parsing and Direct Buffer Casting

In high-throughput systems, copying data from an I/O buffer into application-level structs is prohibitively expensive.

To eliminate data copying during file ingestion, the parser uses memory-mapped files (mmap). This maps the entire binary PCAP/exchange file directly into the application's virtual address space.
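
As a rough sketch of that step (error handling omitted, and the variable name mapped_ptr is simply what the later snippets assume):

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

// Map the whole ITCH file read-only into virtual memory (error checks omitted).
int fd = open("01302019.NASDAQ_ITCH50", O_RDONLY);
struct stat st{};
fstat(fd, &st);
const auto* mapped_ptr = static_cast<const std::byte*>(
    mmap(nullptr, static_cast<std::size_t>(st.st_size), PROT_READ, MAP_PRIVATE, fd, 0));
```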

Instead of parsing fields sequentially, we rely on direct buffer casting. Because exchange protocols like ITCH define strict, fixed-length binary message formats, we can define packed C++ structs that perfectly mirror the wire protocol.
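
For illustration, a packed struct mirroring the ITCH 5.0 "Add Order" (type 'A') message could look like the sketch below; the field names are mine, and the official spec remains the source of truth for offsets and widths.

```cpp
#include <cstdint>

#pragma pack(push, 1)            // no padding: layout must match the wire exactly
struct ItchAddOrderMessage {
    char     message_type;       // 'A'
    uint16_t stock_locate;       // big-endian on the wire
    uint16_t tracking_number;
    uint8_t  timestamp[6];       // 48-bit nanoseconds since midnight
    uint64_t order_reference;
    char     buy_sell_indicator; // 'B' or 'S'
    uint32_t shares;
    char     stock[8];           // space-padded ticker
    uint32_t price;              // fixed-point, 4 implied decimal places
};
#pragma pack(pop)

static_assert(sizeof(ItchAddOrderMessage) == 36, "layout must match the wire format");
```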

```cpp
// Example of direct casting from the mapped memory pointer
const auto* message = reinterpret_cast<const ItchAddOrderMessage*>(mapped_ptr);
```

For string fields (like 8-byte stock tickers), the parser uses std::string_view. This avoids heap allocations entirely by simply wrapping a pointer to the mapped memory and a length.
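
Continuing the sketch above, wrapping the ticker field costs nothing more than a pointer and a length:

```cpp
#include <string_view>

// View over the 8-byte, space-padded ticker inside the mapped message.
// No copy, no heap allocation: it points straight into the mmap'd file.
std::string_view ticker{message->stock, sizeof(message->stock)};
```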

Finally, to prevent instruction pipeline stalls during endianness conversion (ITCH uses Big-Endian), the parser relies on branch-free bitwise operations (__builtin_bswap or std::byteswap in C++23) to swap byte orders in registers.
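
A minimal version of such a conversion helper, assuming a little-endian host like Apple Silicon, might look like this:

```cpp
#include <bit>        // std::byteswap (C++23)
#include <cstdint>

// Branch-free big-endian -> host conversion for a 32-bit wire field.
inline uint32_t be32_to_host(uint32_t v) {
#ifdef __cpp_lib_byteswap
    return std::byteswap(v);       // C++23
#else
    return __builtin_bswap32(v);   // GCC/Clang builtin fallback
#endif
}

uint32_t shares = be32_to_host(message->shares);
```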

2. Eliminating Heap Allocations with std::pmr

Updating the Limit Order Book requires constructing standard Order objects. Using plain new or malloc here introduces unacceptable, non-deterministic latency from allocator lock contention, occasional system calls, and heap fragmentation.

To bypass the standard heap, the parser uses Polymorphic Memory Resources (PMR) introduced in C++17, specifically the std::pmr::monotonic_buffer_resource.

This acts as an Arena Allocator. At initialization, we allocate a massive, contiguous block of memory.

```cpp
#include <array>
#include <cstddef>
#include <memory_resource>

// 500 MB pre-allocated arena; static storage duration keeps it off the stack
static std::array<std::byte, 1024 * 1024 * 500> buffer;
std::pmr::monotonic_buffer_resource pool{buffer.data(), buffer.size()};
```

When a new order arrives, memory is "allocated" simply by bumping a pointer forward within this arena. Deallocation is a no-op; the entire arena is simply discarded or reset at the end of the trading session. This guarantees O(1) allocation time and ensures that newly created objects reside in contiguous memory addresses.
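
A sketch of how the arena gets consumed (the Order fields here are placeholders, and pool is the resource created above):

```cpp
#include <cstdint>
#include <memory_resource>
#include <vector>              // std::pmr::vector

// Placeholder order record; the real engine's layout will differ.
struct Order {
    uint64_t id;
    uint32_t shares;
    uint32_t price;
};

// Every push_back draws from the arena: allocation is just a pointer bump.
std::pmr::vector<Order> live_orders{&pool};
live_orders.push_back({42, 100, 1'015'000});

// No per-order frees; pool.release() hands the whole arena back in one shot.
```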

3. Cache-Aligned Hash Maps (DenseMap)

When processing an order execution or cancellation, the engine must look up the original order by its unique Order ID.

Standard std::unordered_map uses separate chaining (buckets of linked-list nodes). Traversing a linked list destroys CPU cache locality, and a cache miss that falls all the way through to main memory costs on the order of ~100 ns, longer than our entire target P50 latency.

To solve this, the parser implements DenseMap, a custom open-addressing hash map.

In open-addressing, all key-value pairs are stored inline within a single flat std::vector. When a hash collision occurs, the map linearly probes the adjacent memory slots.

Because modern CPUs fetch memory in 64-byte cache lines, a linear probe almost guarantees that the probed memory address is already sitting in the L1 or L2 cache. This transforms a potentially expensive main-memory fetch into an ultra-fast L1 cache hit, keeping the instruction pipeline saturated.
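
To make the idea concrete, here is a stripped-down sketch of an open-addressing, linear-probing map keyed by order ID. It is not the repository's DenseMap (no resizing, no deletion/tombstones, and it assumes a power-of-two capacity that never completely fills), just the probing pattern described above:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

class FlatOrderMap {
    struct Slot {
        uint64_t key   = 0;
        uint64_t value = 0;
        bool occupied  = false;
    };
    std::vector<Slot> slots_;

public:
    // capacity must be a power of two and larger than the expected load
    explicit FlatOrderMap(std::size_t capacity) : slots_(capacity) {}

    void insert(uint64_t key, uint64_t value) {
        std::size_t i = index_of(key);
        while (slots_[i].occupied && slots_[i].key != key)
            i = (i + 1) & (slots_.size() - 1);   // linear probe into the next slot
        slots_[i] = {key, value, true};
    }

    std::optional<uint64_t> find(uint64_t key) const {
        std::size_t i = index_of(key);
        while (slots_[i].occupied) {             // probe until an empty slot
            if (slots_[i].key == key) return slots_[i].value;
            i = (i + 1) & (slots_.size() - 1);
        }
        return std::nullopt;
    }

private:
    std::size_t index_of(uint64_t key) const {
        // Multiplicative hash, masked to the power-of-two table size.
        return (key * 0x9E3779B97F4A7C15ull) & (slots_.size() - 1);
    }
};
```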


Conclusion

Maximizing single-thread throughput is largely an exercise in mechanical sympathy. By utilizing memory-mapping, arena allocators (std::pmr), and cache-friendly data structures, you can bypass the OS and the heap entirely on the hot path.

You can view the full implementation and run the benchmarks yourself here: GitHub Repository.

Feedback on the architecture or suggestions for further micro-optimizations are welcome.
