<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Awneesh Tiwari</title>
    <description>The latest articles on DEV Community by Awneesh Tiwari (@awneesh_tiwari_84445a8ceb).</description>
    <link>https://dev.to/awneesh_tiwari_84445a8ceb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3763466%2F722f91d0-d702-47f8-a99f-f097c23bc1ce.jpeg</url>
      <title>DEV Community: Awneesh Tiwari</title>
      <link>https://dev.to/awneesh_tiwari_84445a8ceb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/awneesh_tiwari_84445a8ceb"/>
    <language>en</language>
    <item>
      <title>StrikeMQ vs Kafka: Benchmarking a 735KB Broker Against a 200MB JVM Giant</title>
      <dc:creator>Awneesh Tiwari</dc:creator>
      <pubDate>Tue, 10 Feb 2026 10:44:57 +0000</pubDate>
      <link>https://dev.to/awneesh_tiwari_84445a8ceb/strikemq-vs-kafka-benchmarking-a-735kb-broker-against-a-200mb-jvm-giant-491f</link>
      <guid>https://dev.to/awneesh_tiwari_84445a8ceb/strikemq-vs-kafka-benchmarking-a-735kb-broker-against-a-200mb-jvm-giant-491f</guid>
      <description>&lt;p&gt;Kafka is the gold standard for production event streaming. But for local development and testing, it's like driving a semi truck to the grocery store. I built &lt;a href="https://github.com/awneesht/Strike-mq" rel="noopener noreferrer"&gt;StrikeMQ&lt;/a&gt; — a Kafka-compatible broker in C++20 — specifically for the &lt;code&gt;localhost:9092&lt;/code&gt; use case. Here's how they compare with real numbers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;StrikeMQ v0.1.4&lt;/strong&gt; — C++20, zero dependencies, single binary.&lt;br&gt;
&lt;strong&gt;Apache Kafka 3.7&lt;/strong&gt; — Running via &lt;code&gt;docker compose&lt;/code&gt; with KRaft (no ZooKeeper), default configuration.&lt;br&gt;
&lt;strong&gt;Hardware&lt;/strong&gt; — Apple M-series MacBook, 10 cores, 16GB RAM.&lt;/p&gt;

&lt;p&gt;All tests measure the same thing: a process listening on port 9092 that Kafka clients can produce to and consume from.&lt;/p&gt;


&lt;h2&gt;
  
  
  Binary Size
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;StrikeMQ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Runtime files&lt;/td&gt;
&lt;td&gt;~200MB (JVM + jars + config)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;735KB&lt;/strong&gt; (stripped, statically linked)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies&lt;/td&gt;
&lt;td&gt;JDK 11+, scripts, config dirs&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;StrikeMQ is &lt;strong&gt;272x smaller&lt;/strong&gt;. The entire binary — networking, Kafka protocol codec, storage engine, REST API, HTTP server — fits in less space than a single JPEG.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -lh strikemq
-rwxr-xr-x  1 user  staff  735K  strikemq

$ du -sh kafka_2.13-3.7.0/
207M    kafka_2.13-3.7.0/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Startup Time
&lt;/h2&gt;

&lt;p&gt;I measured the time from process start to the first successful produce (using &lt;code&gt;kcat&lt;/code&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;StrikeMQ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cold start to ready&lt;/td&gt;
&lt;td&gt;~8-15 seconds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 10ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First produce accepted&lt;/td&gt;
&lt;td&gt;~10-20 seconds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 50ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# StrikeMQ: instant&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;./strikemq &amp;amp; &lt;span class="nb"&gt;sleep &lt;/span&gt;0.1 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"test"&lt;/span&gt; | kcat &lt;span class="nt"&gt;-b&lt;/span&gt; 127.0.0.1:9092 &lt;span class="nt"&gt;-P&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; bench&lt;span class="o"&gt;)&lt;/span&gt;
real    0m0.112s

&lt;span class="c"&gt;# Kafka: wait for JVM warmup, controller election, log recovery...&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="k"&gt;until &lt;/span&gt;kcat &lt;span class="nt"&gt;-b&lt;/span&gt; 127.0.0.1:9092 &lt;span class="nt"&gt;-L&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;0.5&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
real    0m12.438s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you're iterating on code and restarting your broker 50 times a day, those 12 seconds add up to &lt;strong&gt;10 minutes of daily waiting&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Memory Usage
&lt;/h2&gt;

&lt;p&gt;Measured after startup with no topics, then after producing 10,000 messages:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;StrikeMQ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Idle (no topics)&lt;/td&gt;
&lt;td&gt;~350MB RSS&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~1.5MB&lt;/strong&gt; RSS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After 10K messages&lt;/td&gt;
&lt;td&gt;~400MB RSS&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~2MB&lt;/strong&gt; + mmap'd segments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Theoretical minimum&lt;/td&gt;
&lt;td&gt;~200MB (JVM heap floor)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;&amp;lt; 1MB&lt;/strong&gt; (code + stack)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;StrikeMQ uses &lt;code&gt;mmap&lt;/code&gt; for storage segments. The OS manages page residency — only pages being read or written are in physical memory. The broker itself barely allocates heap. Kafka, by contrast, needs a JVM with a minimum heap, GC metadata, thread stacks for 50+ threads, and page cache for its own log segments.&lt;/p&gt;
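
&lt;p&gt;As a rough sketch of that storage model (illustrative only: &lt;code&gt;Segment&lt;/code&gt; and its members are invented for this post, not taken from the StrikeMQ source), an append is just a &lt;code&gt;memcpy&lt;/code&gt; into a mapped file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;sys/mman.h&amp;gt;
#include &amp;lt;fcntl.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;
#include &amp;lt;cstring&amp;gt;
#include &amp;lt;cstdint&amp;gt;

// Hypothetical segment: one pre-sized file, mapped read/write once.
struct Segment {
    uint8_t* base = nullptr;
    size_t   cap  = 0;
    size_t   end  = 0;   // append offset

    bool open_file(const char* path, size_t cap_bytes) {
        int fd = ::open(path, O_RDWR | O_CREAT, 0644);
        if (fd &amp;lt; 0) return false;
        if (ftruncate(fd, (off_t)cap_bytes) != 0) { close(fd); return false; }
        void* p = mmap(nullptr, cap_bytes, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);                      // the mapping keeps the file alive
        if (p == MAP_FAILED) return false;
        base = (uint8_t*)p;
        cap = cap_bytes;
        return true;
    }

    // The "disk write" is a memcpy; the kernel flushes dirty pages lazily.
    bool append(const void* msg, size_t len) {
        if (end + len &amp;gt; cap) return false;
        memcpy(base + end, msg, len);
        end += len;
        return true;
    }
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Only the pages actually touched by &lt;code&gt;append&lt;/code&gt; become resident, which is why RSS stays in the low megabytes even with segments on disk.&lt;/p&gt;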




&lt;h2&gt;
  
  
  Idle CPU
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;StrikeMQ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU at idle&lt;/td&gt;
&lt;td&gt;1-3% (GC cycles, thread scheduling)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;StrikeMQ uses &lt;code&gt;kqueue&lt;/code&gt; (macOS) / &lt;code&gt;epoll&lt;/code&gt; (Linux) event loops that block when there's nothing to do. No background GC, no periodic timers, no busy loops. The process is literally suspended by the kernel until a packet arrives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# StrikeMQ idle for 60 seconds&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;top &lt;span class="nt"&gt;-pid&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;pgrep strikemq&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; 1
PID    COMMAND  %CPU  MEM
12345  strikemq  0.0   1.5M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
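
&lt;p&gt;The core of such a loop in its Linux &lt;code&gt;epoll&lt;/code&gt; flavor (a minimal sketch, not StrikeMQ's actual code; macOS would use the equivalent &lt;code&gt;kqueue&lt;/code&gt; calls):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;sys/epoll.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;

// Register fd for readability on a new epoll instance.
int make_watcher(int fd) {
    int ep = epoll_create1(0);
    epoll_event ev{};
    ev.events  = EPOLLIN;
    ev.data.fd = fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &amp;amp;ev);
    return ep;
}

// One turn of the event loop. With timeout = -1 and no timers
// registered, epoll_wait parks the thread in the kernel until a
// descriptor is ready: 0.0% CPU while idle, no busy loop.
int wait_for_events(int ep, epoll_event* ready, int cap) {
    return epoll_wait(ep, ready, cap, -1);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;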






&lt;h2&gt;
  
  
  Produce Latency — Microbenchmarks
&lt;/h2&gt;

&lt;p&gt;StrikeMQ's built-in benchmark suite measures the raw latency of core operations using TSC (Time Stamp Counter) for nanosecond-precision timing. 1 million samples each after a 10K warmup:&lt;/p&gt;

&lt;h3&gt;
  
  
  SPSC Ring Buffer (push + pop)
&lt;/h3&gt;

&lt;p&gt;The lock-free queue that passes connections from the acceptor thread to workers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Percentile&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;avg&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;19 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p50&lt;/td&gt;
&lt;td&gt;&amp;lt; 42 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99.9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;42 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;max&lt;/td&gt;
&lt;td&gt;13 us&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
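
&lt;p&gt;The general shape of such a queue (a textbook sketch with acquire/release atomics, not the StrikeMQ implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;atomic&amp;gt;
#include &amp;lt;cstddef&amp;gt;

// Minimal single-producer/single-consumer ring; capacity is a power
// of two so the index wrap is a mask. Exactly one thread pushes and
// exactly one pops, so acquire/release atomics suffice: no locks.
template &amp;lt;typename T, size_t N&amp;gt;
class SpscRing {
    static_assert((N &amp;amp; (N - 1)) == 0, "N must be a power of two");
    T buf_[N];
    std::atomic&amp;lt;size_t&amp;gt; head_{0};   // advanced by the consumer
    std::atomic&amp;lt;size_t&amp;gt; tail_{0};   // advanced by the producer
public:
    bool push(const T&amp;amp; v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false; // full
        buf_[t &amp;amp; (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T&amp;amp; out) {
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;    // empty
        out = buf_[h &amp;amp; (N - 1)];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each index is written by exactly one thread, so no compare-and-swap is needed; the hot path is two atomic loads and one store.&lt;/p&gt;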

&lt;h3&gt;
  
  
  Memory Pool (alloc + free)
&lt;/h3&gt;

&lt;p&gt;Pre-allocated block pool with intrusive freelist:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Percentile&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;avg&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p50&lt;/td&gt;
&lt;td&gt;&amp;lt; 42 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99.9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;42 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;max&lt;/td&gt;
&lt;td&gt;7 us&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
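
&lt;p&gt;The trick in miniature (a hedged sketch; the block size and names are invented): each free block stores the pointer to the next free block inside itself, so allocation and release are each one pointer swap:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;cstddef&amp;gt;

// Fixed pool with an intrusive freelist: the "next" pointer lives in
// the free block itself, so bookkeeping costs no extra memory and no
// allocation ever touches the system heap after construction.
class BlockPool {
    union Block { Block* next; unsigned char bytes[256]; };
    Block* storage_;
    Block* free_;
public:
    explicit BlockPool(size_t n) : storage_(new Block[n]), free_(storage_) {
        for (size_t i = 0; i + 1 &amp;lt; n; ++i) storage_[i].next = &amp;amp;storage_[i + 1];
        storage_[n - 1].next = nullptr;
    }
    ~BlockPool() { delete[] storage_; }

    void* alloc() {                          // pop the freelist head
        Block* b = free_;
        if (b) free_ = b-&amp;gt;next;
        return b;
    }
    void release(void* p) {                  // push back onto the freelist
        Block* b = static_cast&amp;lt;Block*&amp;gt;(p);
        b-&amp;gt;next = free_;
        free_ = b;
    }
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;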

&lt;h3&gt;
  
  
  Log Append (1KB message)
&lt;/h3&gt;

&lt;p&gt;The full produce path — lock partition, memcpy into mmap'd segment, update offset index, unlock:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Percentile&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;avg&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;145 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p50&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;667 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99.9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.4 us&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;max&lt;/td&gt;
&lt;td&gt;15 us&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Kafka Header Decode
&lt;/h3&gt;

&lt;p&gt;Parsing a complete Kafka request header from raw bytes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Percentile&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;avg&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p50&lt;/td&gt;
&lt;td&gt;&amp;lt; 42 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99.9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;42 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;max&lt;/td&gt;
&lt;td&gt;15 us&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
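
&lt;p&gt;The fixed prefix of every Kafka request header is three big-endian integers (&lt;code&gt;api_key&lt;/code&gt;, &lt;code&gt;api_version&lt;/code&gt;, &lt;code&gt;correlation_id&lt;/code&gt;), so decoding it is a handful of byte loads and shifts. A sketch (the client-id string and tagged fields that follow are omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;optional&amp;gt;

struct RequestHeader {
    int16_t api_key;          // e.g. 0 = Produce, 1 = Fetch
    int16_t api_version;
    int32_t correlation_id;   // echoed back in the response
};

static uint16_t be16(const uint8_t* p) {
    return (uint16_t)((p[0] &amp;lt;&amp;lt; 8) | p[1]);
}
static uint32_t be32(const uint8_t* p) {
    return ((uint32_t)p[0] &amp;lt;&amp;lt; 24) | ((uint32_t)p[1] &amp;lt;&amp;lt; 16) |
           ((uint32_t)p[2] &amp;lt;&amp;lt; 8)  |  (uint32_t)p[3];
}

// No allocation, nothing beyond a length check and a few shifts:
// this is why the decode sits in the tens-of-nanoseconds range.
std::optional&amp;lt;RequestHeader&amp;gt; decode_header(const uint8_t* buf, size_t len) {
    if (len &amp;lt; 8) return std::nullopt;
    return RequestHeader{
        (int16_t)be16(buf),       // api_key
        (int16_t)be16(buf + 2),   // api_version
        (int32_t)be32(buf + 4),   // correlation_id
    };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;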

&lt;p&gt;&lt;strong&gt;Every operation clears the sub-millisecond p99.9 threshold by orders of magnitude.&lt;/strong&gt; The log append — the actual disk write — averages 145 ns. That's because &lt;code&gt;mmap&lt;/code&gt; turns disk writes into memory copies; the OS flushes dirty pages to disk asynchronously.&lt;/p&gt;




&lt;h2&gt;
  
  
  End-to-End Produce Latency
&lt;/h2&gt;

&lt;p&gt;For the full network round-trip (client -&amp;gt; TCP -&amp;gt; parse -&amp;gt; store -&amp;gt; respond -&amp;gt; client), measured with &lt;code&gt;kcat&lt;/code&gt; producing 1,000 individual messages:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;StrikeMQ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;p50&lt;/td&gt;
&lt;td&gt;~1-2ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 0.5ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99&lt;/td&gt;
&lt;td&gt;~5-10ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 1ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99.9&lt;/td&gt;
&lt;td&gt;~15-50ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 1ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;StrikeMQ's end-to-end produce stays under 1ms at p99.9. The path is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recv() → parse Kafka header (16ns) → decode batch → lock partition mutex →
memcpy into mmap (145ns) → unlock → encode response → send()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No GC pauses. No thread context switches in the common case. No JIT warmup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Consume Latency
&lt;/h2&gt;

&lt;p&gt;The fetch path is even faster because it's completely lock-free:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recv() → parse header → binary search offset index → pointer into mmap → send()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero copies of actual message data. The kernel's &lt;code&gt;send()&lt;/code&gt; reads directly from the mmap'd file pages. No deserialization, no buffer allocation, no locking.&lt;/p&gt;
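
&lt;p&gt;A sketch of that index lookup (names invented for illustration): the sparse index maps message offsets to byte positions in the segment, and fetch binary-searches for the greatest entry at or below the requested offset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;vector&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;algorithm&amp;gt;

// Sparse index entry: message offset -&amp;gt; byte position in the segment.
struct IndexEntry { int64_t offset; uint64_t file_pos; };

// Find the file position to start reading from for `target`:
// greatest indexed offset &amp;lt;= target. O(log n), no locks, no copies.
uint64_t lookup(const std::vector&amp;lt;IndexEntry&amp;gt;&amp;amp; idx, int64_t target) {
    auto it = std::upper_bound(idx.begin(), idx.end(), target,
        [](int64_t t, const IndexEntry&amp;amp; e) { return t &amp;lt; e.offset; });
    if (it == idx.begin()) return 0;   // before the first entry: segment start
    return (it - 1)-&amp;gt;file_pos;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The returned position is an offset into the mmap'd file, so the bytes go straight to &lt;code&gt;send()&lt;/code&gt; without ever being copied into broker buffers.&lt;/p&gt;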




&lt;h2&gt;
  
  
  Resource Comparison Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;StrikeMQ&lt;/th&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Binary size&lt;/td&gt;
&lt;td&gt;200MB&lt;/td&gt;
&lt;td&gt;735KB&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;272x&lt;/strong&gt; smaller&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Startup time&lt;/td&gt;
&lt;td&gt;12s&lt;/td&gt;
&lt;td&gt;10ms&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1,200x&lt;/strong&gt; faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idle memory&lt;/td&gt;
&lt;td&gt;350MB&lt;/td&gt;
&lt;td&gt;1.5MB&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;233x&lt;/strong&gt; less&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idle CPU&lt;/td&gt;
&lt;td&gt;1-3%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Produce p99.9&lt;/td&gt;
&lt;td&gt;~15ms&lt;/td&gt;
&lt;td&gt;&amp;lt; 1ms&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;15x+&lt;/strong&gt; faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies&lt;/td&gt;
&lt;td&gt;JDK, scripts&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Threads at idle&lt;/td&gt;
&lt;td&gt;50+&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;4x&lt;/strong&gt; fewer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What This Means For You
&lt;/h2&gt;

&lt;p&gt;If you're running Kafka in &lt;code&gt;docker-compose.yml&lt;/code&gt; for local development, you're paying a &lt;strong&gt;12-second startup tax&lt;/strong&gt; and a &lt;strong&gt;350MB memory overhead&lt;/strong&gt; on every restart. Multiply that across your team and your CI pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developer laptop:&lt;/strong&gt; Swap Kafka for StrikeMQ in docker-compose. Same port, same protocol, same client code. Free up 350MB for your IDE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD integration tests:&lt;/strong&gt; Start StrikeMQ in 10ms instead of waiting 15 seconds for Kafka to boot. Your pipeline gets faster without changing a single test.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prototyping:&lt;/strong&gt; Want to test if Kafka is right for your architecture? Try the idea with StrikeMQ in seconds, not minutes.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What StrikeMQ Doesn't Do
&lt;/h2&gt;

&lt;p&gt;This isn't a production Kafka replacement. It deliberately trades durability and fault tolerance for speed and simplicity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No replication (single broker)&lt;/li&gt;
&lt;li&gt;No authentication (no SASL/SSL)&lt;/li&gt;
&lt;li&gt;Consumer group offsets are in-memory (lost on restart)&lt;/li&gt;
&lt;li&gt;No log compaction or retention enforcement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a &lt;strong&gt;development tool&lt;/strong&gt;, like SQLite is to PostgreSQL or LocalStack is to AWS.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS&lt;/span&gt;
brew tap awneesht/strike-mq
brew &lt;span class="nb"&gt;install &lt;/span&gt;strikemq

&lt;span class="c"&gt;# Or build from source (any platform)&lt;/span&gt;
git clone https://github.com/awneesht/Strike-mq.git
&lt;span class="nb"&gt;cd &lt;/span&gt;Strike-mq &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build
./build/strikemq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then point any Kafka client at &lt;code&gt;127.0.0.1:9092&lt;/code&gt;. Or use the built-in REST API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Produce via curl&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST localhost:8080/v1/topics/demo/messages &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"messages":[{"value":"hello"},{"key":"user-1","value":"world"}]}'&lt;/span&gt;

&lt;span class="c"&gt;# Peek at messages&lt;/span&gt;
curl &lt;span class="s2"&gt;"localhost:8080/v1/topics/demo/messages?offset=0&amp;amp;limit=10"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the benchmarks yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/strikemq_bench
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/awneesht/Strike-mq" rel="noopener noreferrer"&gt;github.com/awneesht/Strike-mq&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All benchmarks run on Apple M-series, macOS, compiled with Clang -O2. Your numbers will vary. Kafka numbers are representative of default configurations — tuned Kafka will perform better, but will still carry the JVM baseline overhead. StrikeMQ numbers are from its built-in benchmark suite using TSC-based nanosecond timing.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>backend</category>
      <category>cpp</category>
      <category>performance</category>
      <category>kafka</category>
    </item>
    <item>
      <title>StrikeMQ vs Kafka: Benchmarking a 735KB Broker Against a 200MB JVM Giant</title>
      <dc:creator>Awneesh Tiwari</dc:creator>
      <pubDate>Tue, 10 Feb 2026 09:57:41 +0000</pubDate>
      <link>https://dev.to/awneesh_tiwari_84445a8ceb/blazemq-vs-kafka-benchmarking-a-735kb-broker-against-a-200mb-jvm-giant-1ma7</link>
      <guid>https://dev.to/awneesh_tiwari_84445a8ceb/blazemq-vs-kafka-benchmarking-a-735kb-broker-against-a-200mb-jvm-giant-1ma7</guid>
      <description>&lt;p&gt;Kafka is the gold standard for production event streaming. But for local development and testing, it's like driving a semi truck to the grocery store. I built &lt;a href="https://github.com/awneesht/Strike-mq" rel="noopener noreferrer"&gt;StrikeMQ&lt;/a&gt; — a Kafka-compatible broker in C++20 — specifically for the &lt;code&gt;localhost:9092&lt;/code&gt; use case. Here's how they compare with real numbers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;StrikeMQ v0.1.4&lt;/strong&gt; — C++20, zero dependencies, single binary.&lt;br&gt;
&lt;strong&gt;Apache Kafka 3.7&lt;/strong&gt; — Running via &lt;code&gt;docker compose&lt;/code&gt; with KRaft (no ZooKeeper), default configuration.&lt;br&gt;
&lt;strong&gt;Hardware&lt;/strong&gt; — Apple M-series MacBook, 10 cores, 16GB RAM.&lt;/p&gt;

&lt;p&gt;All tests measure the same thing: a process listening on port 9092 that Kafka clients can produce to and consume from.&lt;/p&gt;


&lt;h2&gt;
  
  
  Binary Size
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;StrikeMQ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Runtime files&lt;/td&gt;
&lt;td&gt;~200MB (JVM + jars + config)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;735KB&lt;/strong&gt; (stripped, statically linked)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies&lt;/td&gt;
&lt;td&gt;JDK 11+, scripts, config dirs&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;StrikeMQ is &lt;strong&gt;272x smaller&lt;/strong&gt;. The entire binary — networking, Kafka protocol codec, storage engine, REST API, HTTP server — fits in less space than a single JPEG.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -lh strikemq
-rwxr-xr-x  1 user  staff  735K  strikemq

$ du -sh kafka_2.13-3.7.0/
207M    kafka_2.13-3.7.0/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Startup Time
&lt;/h2&gt;

&lt;p&gt;I measured the time from process start to the first successful produce (using &lt;code&gt;kcat&lt;/code&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;StrikeMQ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cold start to ready&lt;/td&gt;
&lt;td&gt;~8-15 seconds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 10ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First produce accepted&lt;/td&gt;
&lt;td&gt;~10-20 seconds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 50ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# StrikeMQ: instant&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;./strikemq &amp;amp; &lt;span class="nb"&gt;sleep &lt;/span&gt;0.1 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"test"&lt;/span&gt; | kcat &lt;span class="nt"&gt;-b&lt;/span&gt; 127.0.0.1:9092 &lt;span class="nt"&gt;-P&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; bench&lt;span class="o"&gt;)&lt;/span&gt;
real    0m0.112s

&lt;span class="c"&gt;# Kafka: wait for JVM warmup, controller election, log recovery...&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="k"&gt;until &lt;/span&gt;kcat &lt;span class="nt"&gt;-b&lt;/span&gt; 127.0.0.1:9092 &lt;span class="nt"&gt;-L&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;0.5&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
real    0m12.438s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you're iterating on code and restarting your broker 50 times a day, those 12 seconds add up to &lt;strong&gt;10 minutes of daily waiting&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Memory Usage
&lt;/h2&gt;

&lt;p&gt;Measured after startup with no topics, then after producing 10,000 messages:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;StrikeMQ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Idle (no topics)&lt;/td&gt;
&lt;td&gt;~350MB RSS&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~1.5MB&lt;/strong&gt; RSS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After 10K messages&lt;/td&gt;
&lt;td&gt;~400MB RSS&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~2MB&lt;/strong&gt; + mmap'd segments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Theoretical minimum&lt;/td&gt;
&lt;td&gt;~200MB (JVM heap floor)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;&amp;lt; 1MB&lt;/strong&gt; (code + stack)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;StrikeMQ uses &lt;code&gt;mmap&lt;/code&gt; for storage segments. The OS manages page residency — only pages being read or written are in physical memory. The broker itself barely allocates heap. Kafka, by contrast, needs a JVM with a minimum heap, GC metadata, thread stacks for 50+ threads, and page cache for its own log segments.&lt;/p&gt;




&lt;h2&gt;
  
  
  Idle CPU
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;StrikeMQ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU at idle&lt;/td&gt;
&lt;td&gt;1-3% (GC cycles, thread scheduling)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;StrikeMQ uses &lt;code&gt;kqueue&lt;/code&gt; (macOS) / &lt;code&gt;epoll&lt;/code&gt; (Linux) event loops that block when there's nothing to do. No background GC, no periodic timers, no busy loops. The process is literally suspended by the kernel until a packet arrives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# StrikeMQ idle for 60 seconds&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;top &lt;span class="nt"&gt;-pid&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;pgrep strikemq&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; 1
PID    COMMAND  %CPU  MEM
12345  strikemq  0.0   1.5M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Produce Latency — Microbenchmarks
&lt;/h2&gt;

&lt;p&gt;StrikeMQ's built-in benchmark suite measures the raw latency of core operations using TSC (Time Stamp Counter) for nanosecond-precision timing. 1 million samples each after a 10K warmup:&lt;/p&gt;

&lt;h3&gt;
  
  
  SPSC Ring Buffer (push + pop)
&lt;/h3&gt;

&lt;p&gt;The lock-free queue that passes connections from the acceptor thread to workers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Percentile&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;avg&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;19 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p50&lt;/td&gt;
&lt;td&gt;&amp;lt; 42 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99.9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;42 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;max&lt;/td&gt;
&lt;td&gt;13 us&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Memory Pool (alloc + free)
&lt;/h3&gt;

&lt;p&gt;Pre-allocated block pool with intrusive freelist:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Percentile&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;avg&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p50&lt;/td&gt;
&lt;td&gt;&amp;lt; 42 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99.9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;42 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;max&lt;/td&gt;
&lt;td&gt;7 us&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Log Append (1KB message)
&lt;/h3&gt;

&lt;p&gt;The full produce path — lock partition, memcpy into mmap'd segment, update offset index, unlock:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Percentile&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;avg&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;145 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p50&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;667 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99.9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.4 us&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;max&lt;/td&gt;
&lt;td&gt;15 us&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Kafka Header Decode
&lt;/h3&gt;

&lt;p&gt;Parsing a complete Kafka request header from raw bytes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Percentile&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;avg&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p50&lt;/td&gt;
&lt;td&gt;&amp;lt; 42 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99.9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;42 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;max&lt;/td&gt;
&lt;td&gt;15 us&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Every operation clears the sub-millisecond p99.9 threshold by orders of magnitude.&lt;/strong&gt; The log append — the actual disk write — averages 145 ns. That's because &lt;code&gt;mmap&lt;/code&gt; turns disk writes into memory copies; the OS flushes dirty pages to disk asynchronously.&lt;/p&gt;




&lt;h2&gt;
  
  
  End-to-End Produce Latency
&lt;/h2&gt;

&lt;p&gt;For the full network round-trip (client -&amp;gt; TCP -&amp;gt; parse -&amp;gt; store -&amp;gt; respond -&amp;gt; client), measured with &lt;code&gt;kcat&lt;/code&gt; producing 1,000 individual messages:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;StrikeMQ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;p50&lt;/td&gt;
&lt;td&gt;~1-2ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 0.5ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99&lt;/td&gt;
&lt;td&gt;~5-10ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 1ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99.9&lt;/td&gt;
&lt;td&gt;~15-50ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 1ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;StrikeMQ's end-to-end produce stays under 1ms at p99.9. The path is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recv() → parse Kafka header (16ns) → decode batch → lock partition mutex →
memcpy into mmap (145ns) → unlock → encode response → send()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No GC pauses. No thread context switches in the common case. No JIT warmup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Consume Latency
&lt;/h2&gt;

&lt;p&gt;The fetch path is even faster because it's completely lock-free:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recv() → parse header → binary search offset index → pointer into mmap → send()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero copies of actual message data. The kernel's &lt;code&gt;send()&lt;/code&gt; reads directly from the mmap'd file pages. No deserialization, no buffer allocation, no locking.&lt;/p&gt;
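
&lt;p&gt;A quick way to see what "pointer into the mmap" means in practice (a hypothetical Python analogue, not the broker's code): slicing a &lt;code&gt;memoryview&lt;/code&gt; of a mapped file hands out a reference to the same file pages, and &lt;code&gt;socket.sendall()&lt;/code&gt; accepts that view directly via the buffer protocol, so no byte of message data is copied in user space:&lt;/p&gt;

```python
# A memoryview over an mmap'd segment references the mapped pages in place.
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "0.log")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

f = open(path, "r+b")
seg = mmap.mmap(f.fileno(), 4096)
seg[100:105] = b"hello"              # pretend this is a stored record batch

view = memoryview(seg)[100:105]      # what a fetch would hand to send()
```

Passing `view` to `socket.sendall()` would transmit straight from those pages; the bytes are only materialized if you explicitly ask, as `bytes(view)` does.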




&lt;h2&gt;
  
  
  Resource Comparison Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;StrikeMQ&lt;/th&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Binary size&lt;/td&gt;
&lt;td&gt;200MB&lt;/td&gt;
&lt;td&gt;735KB&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;272x&lt;/strong&gt; smaller&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Startup time&lt;/td&gt;
&lt;td&gt;12s&lt;/td&gt;
&lt;td&gt;10ms&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1,200x&lt;/strong&gt; faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idle memory&lt;/td&gt;
&lt;td&gt;350MB&lt;/td&gt;
&lt;td&gt;1.5MB&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;233x&lt;/strong&gt; less&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idle CPU&lt;/td&gt;
&lt;td&gt;1-3%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Produce p99.9&lt;/td&gt;
&lt;td&gt;~15ms&lt;/td&gt;
&lt;td&gt;&amp;lt; 1ms&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;15x+&lt;/strong&gt; faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies&lt;/td&gt;
&lt;td&gt;JDK, scripts&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Threads at idle&lt;/td&gt;
&lt;td&gt;50+&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;4x&lt;/strong&gt; fewer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What This Means For You
&lt;/h2&gt;

&lt;p&gt;If you're running Kafka in &lt;code&gt;docker-compose.yml&lt;/code&gt; for local development, you're paying a &lt;strong&gt;12-second startup tax&lt;/strong&gt; and &lt;strong&gt;350MB memory overhead&lt;/strong&gt; every time. Multiply that across your team and your CI pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developer laptop:&lt;/strong&gt; Swap Kafka for StrikeMQ in docker-compose. Same port, same protocol, same client code. Free up 350MB for your IDE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD integration tests:&lt;/strong&gt; Start StrikeMQ in 10ms instead of waiting 12 seconds for Kafka to boot. Your pipeline gets faster without changing a single test.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prototyping:&lt;/strong&gt; Want to test if Kafka is right for your architecture? Try the idea with StrikeMQ in seconds, not minutes.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What StrikeMQ Doesn't Do
&lt;/h2&gt;

&lt;p&gt;This isn't a production Kafka replacement. It deliberately trades durability and fault tolerance for speed and simplicity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No replication (single broker)&lt;/li&gt;
&lt;li&gt;No authentication (no SASL/SSL)&lt;/li&gt;
&lt;li&gt;Consumer group offsets are in-memory (lost on restart)&lt;/li&gt;
&lt;li&gt;No log compaction or retention enforcement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a &lt;strong&gt;development tool&lt;/strong&gt;: what SQLite is to PostgreSQL, or what LocalStack is to AWS.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS&lt;/span&gt;
brew tap awneesht/strike-mq
brew &lt;span class="nb"&gt;install &lt;/span&gt;strikemq

&lt;span class="c"&gt;# Or build from source (any platform)&lt;/span&gt;
git clone https://github.com/awneesht/Strike-mq.git
&lt;span class="nb"&gt;cd &lt;/span&gt;Strike-mq &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build
./build/strikemq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then point any Kafka client at &lt;code&gt;127.0.0.1:9092&lt;/code&gt;. Or use the built-in REST API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Produce via curl&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST localhost:8080/v1/topics/demo/messages &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"messages":[{"value":"hello"},{"key":"user-1","value":"world"}]}'&lt;/span&gt;

&lt;span class="c"&gt;# Peek at messages&lt;/span&gt;
curl &lt;span class="s2"&gt;"localhost:8080/v1/topics/demo/messages?offset=0&amp;amp;limit=10"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the benchmarks yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/strikemq_bench
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/awneesht/Strike-mq" rel="noopener noreferrer"&gt;github.com/awneesht/Strike-mq&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All benchmarks run on Apple M-series, macOS, compiled with Clang -O2. Your numbers will vary. Kafka numbers are representative of default configurations — tuned Kafka will perform better, but will still carry the JVM baseline overhead. StrikeMQ numbers are from its built-in benchmark suite using TSC-based nanosecond timing.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>backend</category>
      <category>cpp</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I replaced a 200MB JVM process with a 52KB binary that speaks Kafka</title>
      <dc:creator>Awneesh Tiwari</dc:creator>
      <pubDate>Tue, 10 Feb 2026 06:14:14 +0000</pubDate>
      <link>https://dev.to/awneesh_tiwari_84445a8ceb/i-replaced-a-200mb-jvm-process-with-a-52kb-binary-that-speaks-kafka-5cm3</link>
      <guid>https://dev.to/awneesh_tiwari_84445a8ceb/i-replaced-a-200mb-jvm-process-with-a-52kb-binary-that-speaks-kafka-5cm3</guid>
      <description>&lt;p&gt;Every time I spin up Kafka for local development, the same ritual plays out: start ZooKeeper (or KRaft), wait for the JVM to warm up, watch 2GB of RAM disappear, and then finally — after 30 seconds — send my first message.&lt;/p&gt;

&lt;p&gt;I got tired of it. So I built &lt;strong&gt;StrikeMQ&lt;/strong&gt; — a 52KB message broker written in C++20 that speaks the Kafka wire protocol. Any Kafka client library works with it out of the box. No code changes. No JVM. No ZooKeeper. Start in milliseconds, 0% CPU when idle.&lt;/p&gt;

&lt;p&gt;Think of it like &lt;a href="https://localstack.cloud/" rel="noopener noreferrer"&gt;LocalStack&lt;/a&gt; for Kafka — develop locally against StrikeMQ, deploy to real Kafka in production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;build
cmake &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release ..
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Run&lt;/span&gt;
./strikemq

&lt;span class="c"&gt;# Produce and consume with any Kafka client&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"hello&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;world&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;strike"&lt;/span&gt; | kcat &lt;span class="nt"&gt;-b&lt;/span&gt; 127.0.0.1:9092 &lt;span class="nt"&gt;-P&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; my-topic
kcat &lt;span class="nt"&gt;-b&lt;/span&gt; 127.0.0.1:9092 &lt;span class="nt"&gt;-C&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; my-topic &lt;span class="nt"&gt;-e&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This post is about how I built it, the bugs that nearly broke me, and what I learned about implementing a real wire protocol from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Not Just Use Kafka?
&lt;/h2&gt;

&lt;p&gt;Kafka is incredible for production. But for local development and testing, it's overkill:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;StrikeMQ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Binary size&lt;/td&gt;
&lt;td&gt;~200MB+ (JVM + libs)&lt;/td&gt;
&lt;td&gt;52KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Startup time&lt;/td&gt;
&lt;td&gt;10-30 seconds&lt;/td&gt;
&lt;td&gt;&amp;lt; 10ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idle CPU&lt;/td&gt;
&lt;td&gt;1-5% (JVM GC, threads)&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;1-2GB minimum&lt;/td&gt;
&lt;td&gt;~1MB + mmap'd segments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies&lt;/td&gt;
&lt;td&gt;JVM, ZooKeeper/KRaft&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I didn't want to build a Kafka replacement. I wanted something that &lt;em&gt;pretends&lt;/em&gt; to be Kafka well enough that &lt;code&gt;kafka-python&lt;/code&gt;, &lt;code&gt;librdkafka&lt;/code&gt;, &lt;code&gt;kcat&lt;/code&gt;, and &lt;code&gt;confluent-kafka-go&lt;/code&gt; can't tell the difference.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;StrikeMQ has four layers, all in pure C++20 with zero third-party dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        Kafka Clients (kcat, librdkafka, kafka-python, ...)
                            |
                       TCP :9092
                            |
              +----------------------------+
              |     Acceptor Thread        |
              |  kqueue/epoll (accept only)|
              +----------------------------+
                  |     |     |        |
            SPSC ring buffers (lock-free)
                  |     |     |        |
          Worker 0  Worker 1  ...  Worker N-1
          (own kqueue/epoll per thread)
                  |     |     |        |
              +----------------------------+
              |      Protocol Layer        |
              |  Kafka wire protocol       |
              |  encode/decode/route       |
              +----------------------------+
                    |     |     |     |
              Produce  Fetch  List   Metadata
                              Offsets
                    |     |
              +----------------------------+
              |   Consumer Group Handlers  |
              |  JoinGroup, SyncGroup,     |
              |  Heartbeat, OffsetCommit   |
              +----------------------------+
                    |     |
              +----------------------------+
              |      Storage Layer         |
              |  mmap'd log segments       |
              |  sparse offset index       |
              |  (per-partition mutex)     |
              +----------------------------+
                            |
                    /tmp/strikemq/data/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multi-Threaded I/O
&lt;/h3&gt;

&lt;p&gt;The network layer uses an &lt;strong&gt;acceptor + N worker threads&lt;/strong&gt; architecture. The acceptor thread runs its own &lt;code&gt;kqueue&lt;/code&gt; (macOS) or &lt;code&gt;epoll&lt;/code&gt; (Linux) loop that does nothing but &lt;code&gt;accept()&lt;/code&gt; new connections and distribute them round-robin to worker threads via lock-free SPSC ring buffers. Each worker thread runs its own event loop with its own &lt;code&gt;kqueue&lt;/code&gt;/&lt;code&gt;epoll&lt;/code&gt; instance, its own connection map, and a pipe-based wakeup mechanism for cross-thread notification.&lt;/p&gt;

&lt;p&gt;N defaults to &lt;code&gt;std::thread::hardware_concurrency()&lt;/code&gt; — on a 10-core machine, that's 10 independent event loops processing requests in parallel. A slow consumer fetch on worker 3 no longer blocks a fast producer on worker 7.&lt;/p&gt;

&lt;p&gt;Every socket gets &lt;code&gt;TCP_NODELAY&lt;/code&gt; for minimum latency, and each worker processes up to 64 events per iteration. Frame extraction happens inline — we read the 4-byte big-endian size prefix, accumulate bytes until a full Kafka frame arrives, then route it to the protocol layer. Connection state is thread-local to each worker, so there's no locking on the I/O hot path.&lt;/p&gt;
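
&lt;p&gt;The frame-extraction step is easy to sketch. Here's an illustrative Python version (the broker does this in C++; names are mine): buffer incoming bytes, and peel off a complete frame whenever the 4-byte big-endian size prefix says one has fully arrived:&lt;/p&gt;

```python
# Incremental Kafka frame extraction: [4-byte big-endian size][body], repeated.
import struct

class FrameDecoder:
    def __init__(self):
        self.buf = bytearray()

    def feed(self, data: bytes):
        """Append freshly received bytes; yield each complete frame body."""
        self.buf += data
        while len(self.buf) >= 4:
            (size,) = struct.unpack_from(">i", self.buf, 0)
            if len(self.buf) - 4 < size:
                return                       # frame body still incomplete
            frame = bytes(self.buf[4:4 + size])
            del self.buf[:4 + size]          # consume it and keep scanning
            yield frame

dec = FrameDecoder()
frames = list(dec.feed(b"\x00\x00\x00\x03abc\x00\x00"))  # one frame + a partial
frames += list(dec.feed(b"\x00\x02xy"))                  # completes the second
```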

&lt;h3&gt;
  
  
  Zero-Copy Storage
&lt;/h3&gt;

&lt;p&gt;Messages are stored in memory-mapped log segments, pre-allocated to 1GB each. Writes are sequential &lt;code&gt;memcpy&lt;/code&gt; into the mapped region. Reads are zero-copy — the Fetch handler returns a raw pointer directly into the mmap'd segment. No serialization, no copying, no allocation on the read path.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/tmp/strikemq/data/
  my-topic-0/
    0.log         # 1GB pre-allocated, mmap'd
  another-topic-0/
    0.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A sparse offset index (one entry per 4KB boundary) maps logical Kafka offsets to byte positions. Lookups use &lt;code&gt;std::lower_bound&lt;/code&gt; for O(log n) performance, then scan forward through batch headers to find the exact starting position.&lt;/p&gt;
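
&lt;p&gt;The lookup is easy to sketch in Python (&lt;code&gt;bisect&lt;/code&gt; plays the role of &lt;code&gt;std::lower_bound&lt;/code&gt;; the index values below are made up): binary-search the sparse entries for the last one at or before the requested offset, which is where the forward scan through batch headers begins:&lt;/p&gt;

```python
# Sparse offset index lookup: logical Kafka offset -> starting byte position.
import bisect

# Parallel arrays, sorted by logical offset (hypothetical entries ~4KB apart).
offsets   = [0, 120, 250, 400]       # logical Kafka offsets
positions = [0, 4096, 8192, 12288]   # byte positions in the segment file

def floor_position(target_offset: int) -> int:
    """Byte position of the last index entry at or before target_offset;
    the real code then scans forward through batch headers from here."""
    i = bisect.bisect_right(offsets, target_offset) - 1
    return positions[i]

assert floor_position(0) == 0
assert floor_position(130) == 4096   # covered by the entry for offset 120
```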

&lt;h3&gt;
  
  
  Lock-Free Data Structures
&lt;/h3&gt;

&lt;p&gt;The lock-free primitives aren't theoretical — they're load-bearing infrastructure for the multi-threaded architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SPSC ring buffer&lt;/strong&gt; — Used to pass accepted file descriptors from the acceptor thread to each worker. Wait-free, cache-line aligned (64 bytes) to prevent false sharing. Uses separate cached head/tail copies to minimize cross-core cache traffic. One ring buffer per worker (acceptor = producer, worker = consumer), so no contention between workers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MPSC ring buffer&lt;/strong&gt; — Compare-and-swap loop for multi-producer safety with a committed flag per slot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory pool&lt;/strong&gt; — Pre-allocated block pool with an intrusive freelist. On Linux, it tries &lt;code&gt;MAP_HUGETLB&lt;/code&gt; for 2MB pages, with automatic fallback to regular pages. The constructor touches every page to force materialization and prevent page faults on the hot path.&lt;/li&gt;
&lt;/ul&gt;
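
&lt;p&gt;The index arithmetic of the SPSC ring is worth seeing on its own. This is a toy Python model of the handoff (the real C++ version uses atomics, acquire/release ordering, and cache-line alignment, none of which Python expresses): monotonically increasing head/tail counters masked into a power-of-two slot array:&lt;/p&gt;

```python
# Toy single-producer/single-consumer ring, mirroring the fd handoff from the
# acceptor to one worker. Only the index logic is modeled here.
class SpscRing:
    def __init__(self, capacity_pow2: int = 8):
        self.mask = capacity_pow2 - 1     # capacity must be a power of two
        self.slots = [None] * capacity_pow2
        self.head = 0                     # next slot to read  (consumer side)
        self.tail = 0                     # next slot to write (producer side)

    def push(self, item) -> bool:         # called only by the producer
        if self.tail - self.head > self.mask:
            return False                  # ring full; caller retries
        self.slots[self.tail & self.mask] = item
        self.tail += 1                    # publish after the slot is written
        return True

    def pop(self):                        # called only by the consumer
        if self.head == self.tail:
            return None                   # ring empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item

ring = SpscRing()
assert ring.push(42)                      # acceptor: hand off an accepted fd
assert ring.pop() == 42                   # worker: pick it up
```

In the C++ version the increment of &lt;code&gt;tail&lt;/code&gt; is a release store and the consumer reads it with an acquire load; the cached head/tail copies mentioned above exist so each side avoids re-reading the other's counter on every operation.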

&lt;p&gt;Storage is protected with per-partition mutexes — &lt;code&gt;PartitionLog::append()&lt;/code&gt; holds a lock only for its own partition, so concurrent writes to different topics never contend. The read path (&lt;code&gt;PartitionLog::read()&lt;/code&gt;) is completely lock-free, using only acquire loads on atomics to see committed data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementing the Kafka Wire Protocol
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. The Kafka protocol is a binary, big-endian, version-aware request/response protocol over TCP. Every request starts with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[4 bytes] message size
[2 bytes] API key (which operation)
[2 bytes] API version
[4 bytes] correlation ID
[variable] client ID string
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
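
&lt;p&gt;Decoding that fixed header is a few lines of &lt;code&gt;struct&lt;/code&gt; in Python (a sketch of the idea, assuming the older non-flexible header with an INT16-prefixed client ID; StrikeMQ's decoder is C++ and the frame below is synthetic):&lt;/p&gt;

```python
# Parse the fixed Kafka request header: all fields are big-endian ("&gt;").
import struct

def parse_request_header(frame: bytes):
    """frame = bytes after the 4-byte size prefix (non-flexible header)."""
    api_key, api_version, correlation_id = struct.unpack_from(">hhi", frame, 0)
    (client_id_len,) = struct.unpack_from(">h", frame, 8)  # INT16 length
    client_id = frame[10:10 + client_id_len].decode("utf-8")
    return api_key, api_version, correlation_id, client_id

# ApiVersions is API key 18; correlation 7; client id "kcat" (made-up frame).
frame = struct.pack(">hhih", 18, 3, 7, 4) + b"kcat"
```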



&lt;p&gt;I implemented five core APIs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;API&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ApiVersions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"What do you support?" — Client's first request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"What topics exist? Where are the brokers?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Produce&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Store these messages"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fetch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Give me messages starting from offset X"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ListOffsets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"What's the earliest/latest offset for this partition?"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each API has multiple versions with different field layouts. Produce alone has versions 0 through 5, each adding fields like &lt;code&gt;transactional_id&lt;/code&gt; or changing how &lt;code&gt;acks&lt;/code&gt; works. The encoder and decoder are version-aware — they check the API version and include/skip fields accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bugs That Nearly Broke Me
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bug #1: The ~34GB Malloc
&lt;/h3&gt;

&lt;p&gt;When I first connected &lt;code&gt;librdkafka&lt;/code&gt;, the broker crashed immediately. Not a segfault in my code — a &lt;code&gt;malloc&lt;/code&gt; assertion failure &lt;em&gt;inside librdkafka&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Here's what happened: librdkafka sends &lt;code&gt;ApiVersions v3&lt;/code&gt;, which uses Kafka's "flexible versions" encoding. This means compact arrays (varint-prefixed instead of int32-prefixed) and tagged fields at the end of each section.&lt;/p&gt;

&lt;p&gt;My encoder dutifully added a &lt;code&gt;tagged_fields&lt;/code&gt; byte (0x00 = no tags) to the response header. But the Kafka protocol spec has a &lt;strong&gt;special exception&lt;/strong&gt;: ApiVersions responses must NOT include header tagged fields, for backwards compatibility with older clients.&lt;/p&gt;

&lt;p&gt;That one extra byte shifted every subsequent field by 1 position. When librdkafka parsed the "number of API entries" field, it read a garbage value that translated to approximately &lt;strong&gt;34 billion entries&lt;/strong&gt;. It tried to malloc ~34GB, the allocator returned NULL, and the process aborted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; One line removed — don't write the header tagged_fields byte for ApiVersions responses.&lt;/p&gt;
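
&lt;p&gt;The mechanics of the bug are easy to reproduce. Compact arrays prefix their length with an unsigned varint encoded as &lt;code&gt;count + 1&lt;/code&gt;; a sketch of the decoder (Python for brevity) shows how one stray &lt;code&gt;0x00&lt;/code&gt; shifts the interpretation of everything that follows:&lt;/p&gt;

```python
# Unsigned LEB128 varint decoding, as used by compact arrays in Kafka's
# flexible versions.
def read_uvarint(buf: bytes, pos: int):
    """Decode an unsigned varint; return (value, next_pos)."""
    value = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift       # low 7 bits per byte
        if not (b & 0x80):                 # high bit clear: varint ends
            return value, pos
        shift += 7

good = bytes([0x04])           # compact-array length: 3 entries, encoded as 4
bad = bytes([0x00]) + good     # same stream after a stray tagged_fields byte

assert read_uvarint(good, 0) == (4, 1)
assert read_uvarint(bad, 0) == (0, 1)   # parser now sees 0 where 4 belonged
```

Every later field then gets decoded one byte off, which is exactly how a misaligned read turned into a 34-billion-entry array length.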

&lt;h3&gt;
  
  
  Bug #2: The INT16 That Was an INT32
&lt;/h3&gt;

&lt;p&gt;After implementing Fetch, &lt;code&gt;kcat&lt;/code&gt; connected and tried to consume messages. Instead of data, I got:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rd_kafka_msgset_reader_msg_v2:764: expected 18446744073709551613 bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That number is &lt;code&gt;(uint64_t)-3&lt;/code&gt; — a clear sign that a signed value was being interpreted as unsigned, and something was off by a few bytes in the binary layout.&lt;/p&gt;

&lt;p&gt;The Kafka v2 record batch header has 49 bytes of fixed fields. Two of them — &lt;code&gt;attributes&lt;/code&gt; and &lt;code&gt;producerEpoch&lt;/code&gt; — are &lt;strong&gt;INT16&lt;/strong&gt; (2 bytes each). But my serializer was writing them as INT32 (4 bytes each):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// BEFORE (broken):&lt;/span&gt;
&lt;span class="n"&gt;w32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;      &lt;span class="c1"&gt;// wrote 4 bytes, should be 2&lt;/span&gt;
&lt;span class="n"&gt;w32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;producer_epoch&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// wrote 4 bytes, should be 2&lt;/span&gt;

&lt;span class="c1"&gt;// AFTER (fixed):&lt;/span&gt;
&lt;span class="n"&gt;w16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;      &lt;span class="c1"&gt;// correct: 2 bytes&lt;/span&gt;
&lt;span class="n"&gt;w16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;producer_epoch&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// correct: 2 bytes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those 4 extra bytes shifted every record in the batch. When librdkafka parsed with the correct field sizes, the varint decoder landed on garbage bytes and produced nonsensical lengths.&lt;/p&gt;

&lt;p&gt;This bug was particularly nasty because produces appeared to succeed — the broker accepted and stored the data. It only manifested on consume, when a client tried to parse the stored bytes with the correct field widths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How I found it:&lt;/strong&gt; I wrote a Python script to hex-dump the raw &lt;code&gt;.log&lt;/code&gt; file and manually walked through each field of the Kafka v2 batch format, byte by byte, until I found the offset where reality diverged from the spec.&lt;/p&gt;
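
&lt;p&gt;That script isn't published, but a reconstruction in the same spirit is short: walk the fixed v2 batch-header fields with &lt;code&gt;struct&lt;/code&gt;, using the widths from the Kafka spec, and a field written 2 bytes too wide shows up as garbage immediately after it:&lt;/p&gt;

```python
# Walk the fixed fields of a Kafka v2 record batch header, in wire order.
import struct

V2_FIELDS = [                           # (name, big-endian struct format)
    ("base_offset",            ">q"),
    ("batch_length",           ">i"),
    ("partition_leader_epoch", ">i"),
    ("magic",                  ">b"),
    ("crc",                    ">i"),
    ("attributes",             ">h"),   # INT16, one of the Bug #2 culprits
    ("last_offset_delta",      ">i"),
    ("first_timestamp",        ">q"),
    ("max_timestamp",          ">q"),
    ("producer_id",            ">q"),
    ("producer_epoch",         ">h"),   # INT16, the other culprit
    ("base_sequence",          ">i"),
    ("record_count",           ">i"),
]

def walk_batch_header(raw: bytes, pos: int = 0):
    out = {}
    for name, fmt in V2_FIELDS:
        (out[name],) = struct.unpack_from(fmt, raw, pos)
        pos += struct.calcsize(fmt)
    return out, pos

# Synthetic header: all zeros except magic=2 and record_count=1.
raw = bytearray(61)                     # 61 bytes total: the 49 fixed bytes
struct.pack_into(">b", raw, 16, 2)      # plus the 12-byte base_offset +
struct.pack_into(">i", raw, 57, 1)      # batch_length preamble
fields, end = walk_batch_header(bytes(raw))
```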

&lt;h3&gt;
  
  
  Bug #3: librdkafka's Version Gate
&lt;/h3&gt;

&lt;p&gt;Even after fixing the serialization, &lt;code&gt;kcat&lt;/code&gt; refused to parse the response. Debug logs showed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Feature MsgVer2: Fetch (4..32767) NOT supported by broker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;librdkafka has a &lt;strong&gt;feature gate&lt;/strong&gt;: it only uses Kafka v2 record batches if the broker advertises Fetch v4 or higher. I was advertising a version range of v0-v0 — valid, but below the gate. The client fell back to an older message format that didn't match what was stored on disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Advertise Fetch v0-v4 and update the response encoder to handle v1+ fields (&lt;code&gt;throttle_time_ms&lt;/code&gt;) and v4+ fields (&lt;code&gt;last_stable_offset&lt;/code&gt;, &lt;code&gt;aborted_transactions&lt;/code&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;On my M1 MacBook:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Produce latency (p99.9)&lt;/td&gt;
&lt;td&gt;&amp;lt; 1ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU when idle&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory footprint&lt;/td&gt;
&lt;td&gt;~1MB + mmap'd segments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Startup time&lt;/td&gt;
&lt;td&gt;&amp;lt; 10ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary size&lt;/td&gt;
&lt;td&gt;52KB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The produce path is: recv() → parse header → decode batch → lock partition mutex → memcpy into mmap → unlock → encode response → send(). The only synchronization is a per-partition mutex, so writes to different topics are fully parallel across worker threads.&lt;/p&gt;

&lt;p&gt;The consume path is even simpler: recv() → parse header → binary search the offset index → return a pointer into the mmap'd segment → send(). Zero copies of the actual message data, and completely lock-free.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Works Today
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KafkaConsumer&lt;/span&gt;

&lt;span class="c1"&gt;# Produce
&lt;/span&gt;&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello from python&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Consume
&lt;/span&gt;&lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;auto_offset_reset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;earliest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  &lt;span class="c1"&gt;# "hello from python"
&lt;/span&gt;    &lt;span class="k"&gt;break&lt;/span&gt;

&lt;span class="c1"&gt;# Consume with consumer group
&lt;/span&gt;&lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;group_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-group&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;auto_offset_reset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;earliest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Supported APIs: ApiVersions (v0-v3), Metadata (v0), Produce (v0-v5), Fetch (v0-v4), ListOffsets (v0-v2), FindCoordinator (v0-v2), JoinGroup (v0-v3), SyncGroup (v0-v2), Heartbeat (v0-v2), LeaveGroup (v0-v1), OffsetCommit (v0-v3), OffsetFetch (v0-v3). Topics are auto-created on first produce or metadata request.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log compaction and retention&lt;/strong&gt; — Segments accumulate indefinitely right now&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;StrikeMQ is MIT licensed and runs on macOS (Apple Silicon + Intel) and Linux.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/awneesht/Strike-mq" rel="noopener noreferrer"&gt;github.com/awneesht/Strike-mq&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker (easiest):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 9092:9092 strikemq/strikemq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Or build from source:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/awneesht/Strike-mq.git
&lt;span class="nb"&gt;cd &lt;/span&gt;Strike-mq
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;build
cmake &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release ..
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
./strikemq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're tired of waiting 30 seconds for Kafka to start during local development, give it a try. Stars and feedback welcome.&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>kafka</category>
      <category>performance</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
