<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Saksham Arora </title>
    <description>The latest articles on DEV Community by Saksham Arora  (@c0sbyy).</description>
    <link>https://dev.to/c0sbyy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3854210%2F5f0edccf-620b-4e2e-b670-069111a37e6f.png</url>
      <title>DEV Community: Saksham Arora </title>
      <link>https://dev.to/c0sbyy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/c0sbyy"/>
    <language>en</language>
    <item>
      <title>How I Built an HFT Matching Engine (And All The Things I Got Wrong)</title>
      <dc:creator>Saksham Arora </dc:creator>
      <pubDate>Mon, 13 Apr 2026 18:21:22 +0000</pubDate>
      <link>https://dev.to/c0sbyy/how-i-built-an-hft-matching-engine-and-all-the-things-i-got-wrong-e23</link>
      <guid>https://dev.to/c0sbyy/how-i-built-an-hft-matching-engine-and-all-the-things-i-got-wrong-e23</guid>
      <description>&lt;p&gt;I'm a second-year CSE undergrad. A few months ago I decided to build a limit order book matching engine in C++, the kind that sits inside exchanges and matches buy/sell orders.&lt;/p&gt;

&lt;p&gt;Here's the honest story of how it went from a fake benchmark to a real system processing over 2.5 million orders per second.&lt;/p&gt;




&lt;h2&gt;The starting point: I had no idea what I was doing&lt;/h2&gt;

&lt;p&gt;I knew I wanted to build "something HFT" because it sounded impressive. I didn't really understand what an order book was. I Googled "limit order book C++" and cobbled together a basic version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;std::map&lt;/code&gt; for storing price levels (because sorted, right?)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;std::vector&lt;/code&gt; for storing orders at each price&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;double&lt;/code&gt; for prices (because money has decimals, right?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It compiled. It ran. I was proud.&lt;/p&gt;

&lt;p&gt;Then I wrote a "stress test" that looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;processOrders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;volatile&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;dummy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's not a stress test. That's a for loop that multiplies by 2. Nothing touches the order book. But it said "2.5 billion TPS" and I put it on LinkedIn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 1: If your benchmark seems too good to be true, it probably is.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;The first real bug: floating point prices&lt;/h2&gt;

&lt;p&gt;Once I started actually matching orders, I hit a weird bug where identical prices wouldn't match. Two orders at $100.50 would sometimes not cross:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;100.50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;100.20&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;100.30&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Should be 100.50&lt;/span&gt;
&lt;span class="c1"&gt;// a == b → FALSE (in some cases)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the classic IEEE 754 floating point problem. Real exchanges don't use floats for this exact reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; I switched everything to integer "ticks." $100.50 becomes &lt;code&gt;10050&lt;/code&gt;. Integer comparison is always exact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 2: Never use floating point for money. I thought I knew this but had to learn it the hard way.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;The humbling rewrite: actually measuring performance&lt;/h2&gt;

&lt;p&gt;When I put the real matching logic under a proper timer, the numbers were... not great. About 800K orders/sec. Not terrible, but not the "millions of TPS" I had been claiming.&lt;/p&gt;

&lt;p&gt;I ran a profiler (well, I added a bunch of &lt;code&gt;chrono&lt;/code&gt; timing). The bottleneck was clear: &lt;code&gt;std::map&lt;/code&gt;. Every time I inserted or looked up a price level, it was doing O(log n) tree traversal with pointer chasing.&lt;/p&gt;

&lt;p&gt;I read Ulrich Drepper's "What Every Programmer Should Know About Memory" and realized the problem: every &lt;code&gt;std::map&lt;/code&gt; node is heap-allocated at an unpredictable address. The CPU can't prefetch the next node, so traversal is dominated by cache misses (a trip to main memory costs on the order of 100ns).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; I replaced &lt;code&gt;std::map&lt;/code&gt; with a flat &lt;code&gt;std::vector&lt;/code&gt;, one entry per possible price tick. Price 10050 goes into &lt;code&gt;array[10050 - min_tick]&lt;/code&gt;. O(1) access, and the entries are contiguous in memory so the cache prefetcher pulls in adjacent levels for free.&lt;/p&gt;

&lt;p&gt;This alone got me from 800K to 1.3M orders/sec. A 1.6x improvement just from changing the data structure for cache friendliness; the algorithm (addOrder + match) was basically the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 3: Cache effects matter more than algorithmic complexity for small n.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;The cancel problem: O(n) was killing me&lt;/h2&gt;

&lt;p&gt;In real markets, a huge percentage of orders get cancelled. Market makers send an order, the price moves, they cancel it and send a new one. This happens constantly.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;std::deque&lt;/code&gt; as my order queue, cancelling meant scanning the entire queue to find the order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;order_to_cancel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;erase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;O(n) per cancel. With thousands of orders per price level, this was brutal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Intrusive doubly-linked lists. Instead of putting orders INTO a container, each order carries its own &lt;code&gt;prev&lt;/code&gt; and &lt;code&gt;next&lt;/code&gt; pointers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;Order&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ... data fields ...&lt;/span&gt;
    &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;nullptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;nullptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To cancel, you just relink the neighbors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few pointer writes, no scanning. O(1). This was the single biggest performance win in the project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 4: Sometimes adding 16 bytes to a struct saves you from O(n) operations. The tradeoff is always worth thinking about.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;The matching hot path: death by allocation&lt;/h2&gt;

&lt;p&gt;I profiled again (more &lt;code&gt;chrono&lt;/code&gt; timestamps) and found another bottleneck: the matching function was returning &lt;code&gt;std::vector&amp;lt;Trade&amp;gt;&lt;/code&gt;. Every call to &lt;code&gt;match()&lt;/code&gt; was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Construct a vector (malloc internally)&lt;/li&gt;
&lt;li&gt;Push trades into it (possibly realloc)
&lt;/li&gt;
&lt;li&gt;Return it (move semantics help, but still)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Over 2 million orders, that's 2 million vector constructions. Each one hits the heap allocator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Callback templates. Instead of returning a vector, the caller passes a function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submitOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[](&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;Trade&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// do whatever you want with the trade&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For benchmarking, you pass &lt;code&gt;[](const Trade&amp;amp;) {}&lt;/code&gt;, a no-op. The compiler inlines it and removes it entirely. Zero cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 5: On the hot path, every allocation counts. Callbacks beat return values when you need maximum throughput.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;The result&lt;/h2&gt;

&lt;p&gt;After all three changes (flat array + intrusive lists + callback matching), the benchmark looked like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;p99&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tree + Deque (original)&lt;/td&gt;
&lt;td&gt;861K/s&lt;/td&gt;
&lt;td&gt;3000ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Array + Deque (intermediate)&lt;/td&gt;
&lt;td&gt;1.36M/s&lt;/td&gt;
&lt;td&gt;2200ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Array + Intrusive (final)&lt;/td&gt;
&lt;td&gt;2.57M/s&lt;/td&gt;
&lt;td&gt;900ns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;3x faster. But the tail latency improvement was even more dramatic: p999 went from 45μs to 3.4μs. That's 13x better worst-case behavior.&lt;/p&gt;

&lt;p&gt;The most satisfying part was that the improvements were cumulative and each one had a clear explanation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flat array: cache locality&lt;/li&gt;
&lt;li&gt;Intrusive list: O(1) cancel&lt;/li&gt;
&lt;li&gt;Callback: zero allocation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;What I'd do differently next time&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Write the benchmark first.&lt;/strong&gt; If I had a proper benchmark from day one, I wouldn't have wasted time with the fake stress test.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read about real exchange architectures earlier.&lt;/strong&gt; The LMAX Disruptor paper and the Drepper memory paper should have been my starting points.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add proper observability.&lt;/strong&gt; I still don't have nanosecond-precision logging. Adding timing around every operation would have helped me find bottlenecks faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't put impressive-sounding numbers on LinkedIn before verifying them.&lt;/strong&gt; "2.5 billion TPS" from an empty for-loop is not a metric anyone should brag about.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;What I actually learned&lt;/h2&gt;

&lt;p&gt;This project taught me more about systems programming than any course or textbook. Not because the code is particularly complex (it's ~1000 lines of C++17), but because every optimization forced me to understand a layer deeper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why flat arrays beat trees → led me to understand CPU cache architecture&lt;/li&gt;
&lt;li&gt;Why intrusive lists beat containers → led me to understand memory layout&lt;/li&gt;
&lt;li&gt;Why callbacks beat return values → led me to understand compiler inlining&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're a student looking to learn systems programming, I'd recommend building something like this. Not because the world needs another matching engine, but because the process of making it fast teaches you things you can't learn any other way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/saksham10arora-dotcom/Simple-HFT-Engine" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>cpp</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
