<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Piyush Kumar</title>
    <description>The latest articles on DEV Community by Piyush Kumar (@piyush_kumar_1809).</description>
    <link>https://dev.to/piyush_kumar_1809</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898407%2F1a1cc6fc-c3e7-4cd0-8594-57c53883c966.jpg</url>
      <title>DEV Community: Piyush Kumar</title>
      <link>https://dev.to/piyush_kumar_1809</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/piyush_kumar_1809"/>
    <language>en</language>
    <item>
      <title>How I Built a C++ Market Data Parser That Processes 5.5 Million Messages/Sec</title>
      <dc:creator>Piyush Kumar</dc:creator>
      <pubDate>Sun, 26 Apr 2026 07:21:03 +0000</pubDate>
      <link>https://dev.to/piyush_kumar_1809/how-i-built-a-c-market-data-parser-that-processes-55-million-messagessec-1d6d</link>
      <guid>https://dev.to/piyush_kumar_1809/how-i-built-a-c-market-data-parser-that-processes-55-million-messagessec-1d6d</guid>
      <description>&lt;h2&gt;
  
  
  Architectural Patterns for a 5.5M msgs/sec Market Data Parser in C++20
&lt;/h2&gt;

&lt;p&gt;Processing raw market data feeds (like NASDAQ ITCH 5.0 or NSE FO) requires strict adherence to low-latency principles. Recently, I built a C++20 parser to ingest these feeds, normalize them, and manage a Limit Order Book (LOB). &lt;/p&gt;

&lt;p&gt;By strictly controlling memory allocation and maximizing CPU cache locality, the parser achieves a throughput of &lt;strong&gt;~5.5 Million messages/sec&lt;/strong&gt; (169 MB/s) on an Apple Silicon (M-Series) processor, with a P50 latency of &lt;strong&gt;84 nanoseconds&lt;/strong&gt; per message.&lt;/p&gt;

&lt;p&gt;This write-up covers the three primary technical patterns used to achieve this throughput.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/PIYUSH-KUMAR1809" rel="noopener noreferrer"&gt;
        PIYUSH-KUMAR1809
      &lt;/a&gt; / &lt;a href="https://github.com/PIYUSH-KUMAR1809/market-data-parser" rel="noopener noreferrer"&gt;
        market-data-parser
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;High-Performance Market Data Parser&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;A production-grade, low-latency C++ market data engine designed for High-Frequency Trading (HFT) applications. It ingests raw exchange feeds (NASDAQ ITCH 5.0, NSE FO), standardizes them efficiently, and maintains a clean Limit Order Book (LOB) State of the World.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Current Benchmarks&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Hardware Environment&lt;/strong&gt;: Apple Silicon M-Series (Tested on 10-core CPU)&lt;/p&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Items per Second&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NSE Parser&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~381 MB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~247k msgs/sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zero-Copy Parser&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NASDAQ ITCH&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~169 MB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~5.5 Million msgs/sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zero-Copy Parser&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Order Book (Add)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~9M to 22M ops/sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Core Engine insertion latency (varies by book size)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Order Book (Match)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~236M to 291M ops/sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Core Engine exact match latency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Achieved via direct buffer casting, custom memory resources (PMR), and branch-free endian conversion.&lt;/em&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;End-to-End Execution (Real-World Data)&lt;/h3&gt;
&lt;/div&gt;
&lt;p&gt;Processing a full &lt;strong&gt;11.24 GB&lt;/strong&gt; historical NASDAQ ITCH 5.0 file (&lt;code&gt;01302019.NASDAQ_ITCH50&lt;/code&gt;):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total Messages Parsed&lt;/strong&gt;: 368,366,634&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Time&lt;/strong&gt;: 97.48 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: 3.77 Million…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/PIYUSH-KUMAR1809/market-data-parser" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  1. Zero-Copy Parsing and Direct Buffer Casting
&lt;/h2&gt;

&lt;p&gt;In high-throughput systems, copying data from an I/O buffer into application-level structs is prohibitively expensive.&lt;/p&gt;

&lt;p&gt;To eliminate data copying during file ingestion, the parser uses memory-mapped files (&lt;code&gt;mmap&lt;/code&gt;). This maps the entire binary PCAP/exchange file directly into the application's virtual address space.&lt;/p&gt;
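&lt;p&gt;A minimal sketch of this mapping step (the helper name and error handling are mine, not the project's exact code):&lt;/p&gt;

```cpp
// Maps a feed file read-only into the process's address space via mmap.
// Illustrative only: a production reader would also advise the kernel
// (madvise) and unmap on shutdown.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <stdexcept>

struct MappedFile {
    const std::byte* data = nullptr;
    std::size_t size = 0;
};

MappedFile map_feed(const char* path) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) throw std::runtime_error("open failed");
    struct stat st{};
    if (::fstat(fd, &st) != 0) {
        ::close(fd);
        throw std::runtime_error("fstat failed");
    }
    void* p = ::mmap(nullptr, static_cast<std::size_t>(st.st_size),
                     PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) throw std::runtime_error("mmap failed");
    return {static_cast<const std::byte*>(p),
            static_cast<std::size_t>(st.st_size)};
}
```

&lt;p&gt;From here on, the parser only ever advances a pointer through &lt;code&gt;data&lt;/code&gt;; the kernel pages bytes in on demand.&lt;/p&gt;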

&lt;p&gt;Instead of parsing fields sequentially, we rely on &lt;strong&gt;direct buffer casting&lt;/strong&gt;. Because exchange protocols like ITCH define strict, fixed-length binary message formats, we can define packed C++ structs that perfectly mirror the wire protocol.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example of direct casting from the mapped memory pointer&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;ItchAddOrderMessage&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mapped_ptr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
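&lt;p&gt;For context, a packed struct mirroring the ITCH 5.0 "Add Order" layout might look like the following. The field names are illustrative, but the widths follow the spec's fixed-size big-endian fields, so the struct size matches the wire size exactly:&lt;/p&gt;

```cpp
// Illustrative packed layout for the ITCH 5.0 "Add Order" ('A') message.
// #pragma pack(1) removes padding so sizeof() equals the wire size (36 bytes).
#include <cstdint>

#pragma pack(push, 1)
struct ItchAddOrderMessage {
    char          message_type;        // 'A'
    std::uint16_t stock_locate;        // big-endian on the wire
    std::uint16_t tracking_number;
    std::uint8_t  timestamp[6];        // 48-bit nanoseconds since midnight
    std::uint64_t order_reference;
    char          buy_sell_indicator;  // 'B' or 'S'
    std::uint32_t shares;
    char          stock[8];            // space-padded ticker
    std::uint32_t price;               // fixed-point, 4 implied decimals
};
#pragma pack(pop)

static_assert(sizeof(ItchAddOrderMessage) == 36, "must match wire size");
```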



&lt;p&gt;For string fields (like 8-byte stock tickers), the parser uses &lt;code&gt;std::string_view&lt;/code&gt;. This avoids heap allocations entirely by simply wrapping a pointer to the mapped memory and a length.&lt;/p&gt;
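&lt;p&gt;A small illustration of the idea (the trimming helper is hypothetical): wrapping the padded 8-byte field costs a pointer and a length, nothing more:&lt;/p&gt;

```cpp
// Wraps a fixed-width, space-padded ticker field without allocating:
// the view points straight into the mapped buffer.
#include <cstddef>
#include <string_view>

std::string_view ticker_view(const char* field, std::size_t len = 8) {
    std::string_view sv{field, len};
    // ITCH tickers are right-padded with spaces; trim the padding off.
    while (!sv.empty() && sv.back() == ' ')
        sv.remove_suffix(1);
    return sv;
}
```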

&lt;p&gt;Finally, because ITCH fields arrive Big-Endian, every integer must be byte-swapped on little-endian hardware. To keep this hot path branch-free, the parser uses the compiler intrinsics &lt;code&gt;__builtin_bswap32&lt;/code&gt;/&lt;code&gt;__builtin_bswap64&lt;/code&gt; (or &lt;code&gt;std::byteswap&lt;/code&gt;, standardized in C++23), which compile down to a single byte-swap instruction operating in registers.&lt;/p&gt;
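&lt;p&gt;A sketch of the conversion, assuming a little-endian host (as on Apple Silicon or x86); the portable fallback is shown only for illustration:&lt;/p&gt;

```cpp
// Branch-free big-endian -> host conversion for 32-bit wire fields.
// Swaps unconditionally, which is correct on a little-endian host.
#include <cstdint>

inline std::uint32_t be32_to_host(std::uint32_t v) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap32(v);  // single REV (ARM) / BSWAP (x86) instruction
#else
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
#endif
}
```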

&lt;h2&gt;
  
  
  2. Eliminating Heap Allocations with &lt;code&gt;std::pmr&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Updating the Limit Order Book means constructing &lt;code&gt;Order&lt;/code&gt; objects at a very high rate. Using standard &lt;code&gt;new&lt;/code&gt; or &lt;code&gt;malloc&lt;/code&gt; on this path introduces non-deterministic latency: the general-purpose allocator may take locks, fall back to a syscall to grow the heap, and fragment memory over a long session.&lt;/p&gt;

&lt;p&gt;To bypass the standard heap, the parser uses &lt;strong&gt;Polymorphic Memory Resources (PMR)&lt;/strong&gt; introduced in C++17, specifically the &lt;code&gt;std::pmr::monotonic_buffer_resource&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This acts as an Arena Allocator. At initialization, we allocate a massive, contiguous block of memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 500MB pre-allocated arena&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;pmr&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;monotonic_buffer_resource&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;()};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a new order arrives, memory is "allocated" simply by bumping a pointer forward within this arena. Deallocation is a no-op; the entire arena is simply discarded or reset at the end of the trading session. This guarantees O(1) allocation time and ensures that newly created objects reside in contiguous memory addresses.&lt;/p&gt;
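&lt;p&gt;A minimal sketch of the bump-allocation pattern, using a &lt;code&gt;polymorphic_allocator&lt;/code&gt; drawing from the arena (the &lt;code&gt;Order&lt;/code&gt; fields here are illustrative, not the project's actual layout):&lt;/p&gt;

```cpp
// Bump allocation from a monotonic arena: allocate() advances a pointer,
// and individual deallocation is a no-op. The whole arena is released at once.
#include <array>
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <new>

struct Order {
    std::uint64_t id;
    std::uint32_t shares;
    std::uint32_t price;
};

Order* make_order(std::pmr::polymorphic_allocator<Order>& alloc,
                  std::uint64_t id, std::uint32_t shares, std::uint32_t price) {
    Order* o = alloc.allocate(1);             // pointer bump inside the arena
    return new (o) Order{id, shares, price};  // placement-construct in place
}
```

&lt;p&gt;Resetting is equally cheap: destroying or calling &lt;code&gt;release()&lt;/code&gt; on the &lt;code&gt;monotonic_buffer_resource&lt;/code&gt; reclaims the entire arena in one step.&lt;/p&gt;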

&lt;h2&gt;
  
  
  3. Cache-Aligned Hash Maps (DenseMap)
&lt;/h2&gt;

&lt;p&gt;When processing an order execution or cancellation, the engine must look up the original order by its unique Order ID. &lt;/p&gt;

&lt;p&gt;Standard &lt;code&gt;std::unordered_map&lt;/code&gt; uses separate chaining: each bucket points to a linked list of heap-allocated nodes. Chasing those pointers destroys CPU cache locality, and a miss that falls all the way through to main memory costs roughly 100ns, longer than the entire 84ns P50 latency target.&lt;/p&gt;

&lt;p&gt;To solve this, the parser implements &lt;code&gt;DenseMap&lt;/code&gt;, a custom open-addressing hash map.&lt;/p&gt;

&lt;p&gt;In open-addressing, all key-value pairs are stored inline within a single flat &lt;code&gt;std::vector&lt;/code&gt;. When a hash collision occurs, the map linearly probes the adjacent memory slots. &lt;/p&gt;

&lt;p&gt;Because modern CPUs fetch memory in 64-byte cache lines, a linear probe almost guarantees that the probed memory address is already sitting in the L1 or L2 cache. This transforms a potentially expensive main-memory fetch into an ultra-fast L1 cache hit, keeping the instruction pipeline saturated.&lt;/p&gt;
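&lt;p&gt;To make the idea concrete, here is a stripped-down open-addressing map with linear probing, in the spirit of &lt;code&gt;DenseMap&lt;/code&gt; (my sketch, not the project's implementation; it never resizes, so the table must stay well below full):&lt;/p&gt;

```cpp
// Flat open-addressing hash map: all slots live inline in one vector,
// so a probe walks adjacent memory that is likely already in cache.
#include <cstddef>
#include <cstdint>
#include <vector>

class FlatMap {
    struct Slot {
        std::uint64_t key = 0;
        std::uint64_t value = 0;
        bool used = false;
    };
    std::vector<Slot> slots_;
    std::size_t mask_;

public:
    // Capacity must be a power of two so `& mask_` replaces the modulo.
    explicit FlatMap(std::size_t capacity = 1024)
        : slots_(capacity), mask_(capacity - 1) {}

    void insert(std::uint64_t key, std::uint64_t value) {
        std::size_t i = key & mask_;
        while (slots_[i].used && slots_[i].key != key)
            i = (i + 1) & mask_;            // linear probe into the next slot
        slots_[i] = Slot{key, value, true};
    }

    const std::uint64_t* find(std::uint64_t key) const {
        std::size_t i = key & mask_;
        while (slots_[i].used) {            // an empty slot ends the probe chain
            if (slots_[i].key == key) return &slots_[i].value;
            i = (i + 1) & mask_;
        }
        return nullptr;
    }
};
```

&lt;p&gt;Note that erasure in open addressing needs tombstones or backward-shift deletion, which this sketch omits for brevity.&lt;/p&gt;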




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Maximizing single-thread throughput is largely an exercise in mechanical sympathy. By combining memory-mapped I/O, arena allocation (&lt;code&gt;std::pmr&lt;/code&gt;), and cache-friendly data structures, you can keep syscalls and the general-purpose heap off the hot path.&lt;/p&gt;

&lt;p&gt;You can view the full implementation and run the benchmarks yourself here: &lt;a href="https://github.com/PIYUSH-KUMAR1809/market-data-parser" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Feedback on the architecture or suggestions for further micro-optimizations are welcome.&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>performance</category>
      <category>showdev</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
