<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lorenzo Pompili</title>
    <description>The latest articles on DEV Community by Lorenzo Pompili (@lpworks).</description>
    <link>https://dev.to/lpworks</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3974712%2F30955bbf-3c61-4960-811e-671c9354b37a.jpg</url>
      <title>DEV Community: Lorenzo Pompili</title>
      <link>https://dev.to/lpworks</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lpworks"/>
    <language>en</language>
    <item>
      <title>Building a high-throughput BGP/BMP collector in Java with virtual threads</title>
      <dc:creator>Lorenzo Pompili</dc:creator>
      <pubDate>Mon, 08 Jun 2026 19:08:39 +0000</pubDate>
      <link>https://dev.to/lpworks/building-a-high-throughput-bgpbmp-collector-in-java-with-virtual-threads-4797</link>
      <guid>https://dev.to/lpworks/building-a-high-throughput-bgpbmp-collector-in-java-with-virtual-threads-4797</guid>
      <description>&lt;p&gt;Most of the "fast data pipeline" folklore in the JVM world ends at the same place: &lt;em&gt;go reactive, or go home&lt;/em&gt;. Netty, event loops, backpressure operators, the works. I wanted to find out whether Java 25's virtual threads let you write the boring, blocking, one-thread-per-connection version — and still move hundreds of thousands of messages a second. So I built &lt;strong&gt;jBMP&lt;/strong&gt;, a collector for the &lt;strong&gt;BGP Monitoring Protocol&lt;/strong&gt;, and pushed it until the database begged for mercy.&lt;/p&gt;

&lt;p&gt;This is the story of what I built, and — more usefully — the three times I was wrong about where the time was going.&lt;/p&gt;

&lt;h2&gt;What's a BMP collector, and why care about throughput&lt;/h2&gt;

&lt;p&gt;BMP (RFC 7854) is how a router streams its BGP state to a monitoring station: it opens a TCP session and pushes a firehose of &lt;em&gt;route monitoring&lt;/em&gt; messages (every prefix it learns), peer up/down events, and periodic statistics. A single big edge router or route reflector can dump &lt;strong&gt;millions&lt;/strong&gt; of prefixes when a session comes up. A collector that watches a few hundred routers has to absorb that initial-dump thundering herd without falling over.&lt;/p&gt;

&lt;p&gt;So the shape of the problem is: many long-lived TCP connections, each occasionally bursting huge volumes of structured binary messages that must be parsed and durably stored. Classic high-fan-in ingest.&lt;/p&gt;

&lt;p&gt;jBMP splits into three services around a pure-Java protocol library:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;strong&gt;collector&lt;/strong&gt; that terminates BMP/TCP, parses the BMP envelope and the carried BGP-4 messages, and produces them to Kafka;&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;consumer&lt;/strong&gt; that drains Kafka and bulk-loads PostgreSQL/TimescaleDB;&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;mock&lt;/strong&gt; router to generate load.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Decision 1: one virtual thread per router&lt;/h2&gt;

&lt;p&gt;The collector's core is deliberately dumb:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// one of these runs per connected router, on its own virtual thread&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;readNextBmpMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;enriched&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;enricher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;enrich&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;publisher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;publish&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enriched&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// to Kafka&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Blocking reads. Blocking parse. No callbacks, no state machine, no reassembly buffer that I have to thread through an event loop. Each router gets &lt;code&gt;Thread.ofVirtual().start(...)&lt;/code&gt;, and the JVM multiplexes thousands of these onto a handful of carrier threads. When a read blocks on the socket, the virtual thread is parked and its carrier is freed — exactly what an event loop does for you, except I never had to write the event loop.&lt;/p&gt;

&lt;p&gt;This matters beyond aesthetics. BMP framing is stateful: a 6-byte common header (1-byte version, 4-byte message length, 1-byte message type) gives you the total length, header included, and then you read the rest. In a reactive pipeline that becomes a tedious incremental decoder. With a virtual thread it's a plain &lt;code&gt;readFully&lt;/code&gt;. The code reads like the spec.&lt;/p&gt;
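
&lt;p&gt;As a rough illustration of that framing, here is a minimal sketch, not jBMP's actual code. The &lt;code&gt;BmpFraming&lt;/code&gt; class, the &lt;code&gt;BmpFrame&lt;/code&gt; record and the 1 MiB size cap are assumptions made up for the example; only the header layout (1-byte version, 4-byte length that includes the header, 1-byte type) comes from RFC 7854.&lt;/p&gt;

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical names for illustration; not taken from the jBMP sources.
final class BmpFraming {
    static final int HEADER_LEN = 6;          // version(1) + length(4) + type(1)
    static final int MAX_BODY = 1024 * 1024;  // assumed sanity cap, not from the RFC

    record BmpFrame(int version, int type, byte[] body) {}

    // Blocks until one whole message is read; returns null on a clean end of stream.
    static BmpFrame readNextBmpMessage(DataInputStream in) throws IOException {
        int version = in.read();
        if (version == -1) {
            return null; // peer closed the connection between messages
        }
        int length = in.readInt();            // total length, header included
        int type = in.readUnsignedByte();
        if (HEADER_LEN > length || length - HEADER_LEN > MAX_BODY) {
            throw new IOException("implausible BMP message length: " + length);
        }
        byte[] body = new byte[length - HEADER_LEN];
        in.readFully(body);                   // on a socket, a virtual thread would park here
        return new BmpFrame(version, type, body);
    }

    public static void main(String[] args) throws IOException {
        // version 3, length 8, type 0, followed by a 2-byte body
        byte[] wire = {3, 0, 0, 0, 8, 0, 0x0A, 0x0B};
        var in = new DataInputStream(new ByteArrayInputStream(wire));
        BmpFrame frame = readNextBmpMessage(in);
        System.out.println("type=" + frame.type() + " bodyLen=" + frame.body().length);
        // prints "type=0 bodyLen=2"
    }
}
```

&lt;p&gt;The whole decoder is two blocking reads and a bounds check, which is the point: there is no partial-frame state to carry between callbacks.&lt;/p&gt;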

&lt;p&gt;The lesson that surprised me: at no point in the entire performance investigation below did the threading model show up as a bottleneck. Virtual threads did their job and got out of the way.&lt;/p&gt;

&lt;h2&gt;Decision 2: a custom binary wire format, not Protobuf&lt;/h2&gt;

&lt;p&gt;The collector and consumer talk over Kafka. The obvious move is Protobuf or Avro. I didn't.&lt;/p&gt;

&lt;p&gt;A parsed route-monitoring message is &lt;em&gt;already&lt;/em&gt; a tight binary structure — prefixes are bytes, next-hops are bytes, AS-paths are arrays of integers. Re-encoding that into Protobuf means a schema round-trip, descriptor lookups, and a second allocation of everything. So jBMP ships a hand-rolled, length-prefixed binary codec: a one-byte presence bitmask for the optional extended families, then just the fields that are present. No reflection, no schema registry, a single &lt;code&gt;byte[]&lt;/code&gt; per message.&lt;/p&gt;

&lt;p&gt;Is this the right call for every project? No — you lose schema evolution tooling, and you own the forward-compatibility tests. But for a closed producer/consumer pair on the hot path, shaving the serialization layer to the bone is free throughput. (jBMP versions the format and keeps the decoder backward-compatible; that's the tax you pay for rolling your own.)&lt;/p&gt;

&lt;h2&gt;Decision 3: bulk binary COPY into the database&lt;/h2&gt;

&lt;p&gt;The consumer's job is to get rows into PostgreSQL/TimescaleDB as fast as the disk allows. Per-row &lt;code&gt;INSERT&lt;/code&gt;s are a non-starter at these rates. jBMP renders each Kafka poll batch directly into PostgreSQL's &lt;strong&gt;binary&lt;/strong&gt; &lt;code&gt;COPY&lt;/code&gt; format and sends it in one shot: &lt;code&gt;timestamptz&lt;/code&gt; as microseconds since 2000-01-01 UTC, &lt;code&gt;cidr&lt;/code&gt;/&lt;code&gt;inet&lt;/code&gt; in their native family-tagged form, AS-paths and communities as &lt;code&gt;int[]&lt;/code&gt;/&lt;code&gt;text[]&lt;/code&gt;, the structured families as &lt;code&gt;jsonb&lt;/code&gt;. The server reads each value through the type's binary receive function instead of parsing text, and a single &lt;code&gt;COPY&lt;/code&gt; avoids the per-statement overhead of row-by-row &lt;code&gt;INSERT&lt;/code&gt;s.&lt;/p&gt;
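
&lt;p&gt;To make the binary &lt;code&gt;COPY&lt;/code&gt; layout concrete, here is a small sketch under stated assumptions. It encodes one row for a hypothetical two-column table (&lt;code&gt;timestamptz&lt;/code&gt;, &lt;code&gt;int4&lt;/code&gt;); the class and method names are invented for the example and are not jBMP's. The framing itself (11-byte signature, flags, header extension length, a per-tuple field count, length-prefixed fields, and a &lt;code&gt;-1&lt;/code&gt; trailer) follows the PostgreSQL documentation for the binary format.&lt;/p&gt;

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical helper for illustration; the column layout (timestamptz, int4) is an assumption.
final class BinaryCopySketch {
    // Fixed 11-byte signature that opens every binary COPY stream.
    static final byte[] SIGNATURE = {'P', 'G', 'C', 'O', 'P', 'Y', '\n', (byte) 0xFF, '\r', '\n', 0};
    // PostgreSQL timestamps count microseconds from 2000-01-01 UTC.
    static final Instant PG_EPOCH = Instant.parse("2000-01-01T00:00:00Z");

    static long pgMicros(Instant t) {
        return ChronoUnit.MICROS.between(PG_EPOCH, t);
    }

    static void writeHeader(DataOutputStream out) throws IOException {
        out.write(SIGNATURE);
        out.writeInt(0); // flags field
        out.writeInt(0); // header extension length
    }

    static void writeRow(DataOutputStream out, Instant seenAt, int peerAs) throws IOException {
        out.writeShort(2);                 // number of fields in this tuple
        out.writeInt(8);                   // timestamptz payload is 8 bytes
        out.writeLong(pgMicros(seenAt));
        out.writeInt(4);                   // int4 payload is 4 bytes
        out.writeInt(peerAs);
    }

    static void writeTrailer(DataOutputStream out) throws IOException {
        out.writeShort(-1);                // end-of-data marker
    }

    // Encodes header, one row and trailer into a byte array.
    static byte[] encodeSingleRow(Instant seenAt, int peerAs) {
        var bytes = new ByteArrayOutputStream();
        try (var out = new DataOutputStream(bytes)) {
            writeHeader(out);
            writeRow(out, seenAt, peerAs);
            writeTrailer(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] stream = encodeSingleRow(Instant.parse("2000-01-01T00:00:01Z"), 64512);
        System.out.println(stream.length + " bytes"); // prints "43 bytes"
    }
}
```

&lt;p&gt;In a real consumer a stream like this would be handed to the driver's COPY API (with the PostgreSQL JDBC driver, &lt;code&gt;CopyManager.copyIn&lt;/code&gt; and a &lt;code&gt;COPY ... FROM STDIN (FORMAT binary)&lt;/code&gt; statement), with one &lt;code&gt;writeRow&lt;/code&gt; call per record in the poll batch.&lt;/p&gt;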

&lt;p&gt;Alongside the append-only history there's a &lt;em&gt;current-state&lt;/em&gt; projection (&lt;code&gt;rib_state&lt;/code&gt;): one row per &lt;code&gt;(peer, prefix)&lt;/code&gt;, kept up to date with an idempotent &lt;code&gt;INSERT … ON CONFLICT DO UPDATE&lt;/code&gt; / &lt;code&gt;DELETE&lt;/code&gt;. That projection is rebuildable from the history, so it runs on a &lt;strong&gt;single background worker off the commit path&lt;/strong&gt; — the consumer commits its Kafka offsets the moment the history is durable and lets the projection catch up behind it. Remember this detail; it comes back.&lt;/p&gt;

&lt;h2&gt;Now the part where I was wrong three times&lt;/h2&gt;

&lt;p&gt;I built a benchmark — a mock pushing 50 routers × 10 peers × thousands of prefixes — and started measuring the consumer's drain rate into the database. Here's where intuition failed.&lt;/p&gt;

&lt;h2&gt;Wrong #1: "it's the CPU / the decode"&lt;/h2&gt;

&lt;p&gt;A Java Flight Recorder profile said otherwise. Out of an entire drain, there were only ~300 CPU execution samples — the consumer was barely running Java code. It was &lt;em&gt;blocked&lt;/em&gt;. Aggregating the &lt;code&gt;jdk.SocketRead&lt;/code&gt; events by remote port showed ~57 seconds of aggregate wait reading responses from the &lt;strong&gt;database&lt;/strong&gt;, and essentially nothing waiting on Kafka fetches. The bottleneck was I/O wait on the DB round trips, not the parser, not the codec. First lesson: &lt;strong&gt;profile before you optimise&lt;/strong&gt;; the wide allocation-heavy decode I was sure would dominate was a rounding error.&lt;/p&gt;

&lt;h2&gt;Wrong #2: "more parallelism will fix it"&lt;/h2&gt;

&lt;p&gt;When I scaled the mock up, throughput &lt;em&gt;collapsed&lt;/em&gt; — big bursts then multi-second stalls. I assumed a checkpoint storm and started tuning WAL and &lt;code&gt;synchronous_commit&lt;/code&gt;. It changed nothing.&lt;/p&gt;

&lt;p&gt;So I did the thing I should have done first: I took a &lt;strong&gt;thread dump during a stall&lt;/strong&gt; and looked at &lt;code&gt;pg_stat_activity&lt;/code&gt; at the same instant. The database connections were almost all &lt;code&gt;idle&lt;/code&gt;, waiting on &lt;code&gt;ClientRead&lt;/code&gt; — i.e. waiting for &lt;em&gt;my&lt;/em&gt; client to send something. The bottleneck wasn't the DB at all in that moment. And the thread dump showed three consumer threads stuck deep inside &lt;code&gt;writeStats() → commit()&lt;/code&gt;, blocked on a socket read for the commit response.&lt;/p&gt;

&lt;p&gt;There it was. The low-volume statistics/peer writers were running &lt;strong&gt;inline on the same threads that do the route-monitor bulk&lt;/strong&gt; &lt;code&gt;COPY&lt;/code&gt;. The mock generated a burst of stats; each was its own little transaction; and the COPY threads that owned those partitions were head-of-line blocked behind thousands of tiny per-message round trips.&lt;/p&gt;

&lt;p&gt;The fix in Spring Kafka was to consume the low-volume topics on a &lt;strong&gt;separate listener container&lt;/strong&gt; with its own small thread pool, so a stats burst can never stall the bulk path. On the realistic load, sustained drain went from ~17k rows/s to ~110k. Second lesson: &lt;strong&gt;a thread dump taken during the stall is worth a hundred guesses&lt;/strong&gt;, and "add threads" is not a strategy — &lt;em&gt;where&lt;/em&gt; the work runs matters more than how much of it there is.&lt;/p&gt;

&lt;h2&gt;Wrong #3: "look how much faster mine is"&lt;/h2&gt;

&lt;p&gt;For a while my numbers looked spectacular — multiples faster than a comparable implementation on the same hardware. Then I checked the database and found the comparison was a lie I'd told myself.&lt;/p&gt;

&lt;p&gt;The Kafka partition key is the router identity. I had — "cleverly" — derived that identity from the router's advertised system name, so my mock's 50 simulated routers fanned out across ~35 partitions and 12 consumer threads. The implementation I was comparing against derived the identity from the source IP; my mock's routers all shared one IP, so it collapsed to &lt;strong&gt;one&lt;/strong&gt; partition and one consumer. I wasn't measuring better code. I was measuring 12× the parallelism.&lt;/p&gt;
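
&lt;p&gt;The effect is easy to reproduce in isolation. The sketch below is an illustration only: &lt;code&gt;Arrays.hashCode&lt;/code&gt; stands in for Kafka's real key hashing (the default partitioner uses murmur2 over the key bytes), and the partition count of 48, the router names and the shared IP are all made up. What it shows is the mechanism: distinct keys spread across partitions, while one shared key always lands on a single partition.&lt;/p&gt;

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only: the hash, partition count and keys are assumptions, not jBMP's setup.
final class PartitionKeySketch {
    // Maps a key to a partition the way a hash partitioner does: hash, then modulo.
    static int partitionFor(String key, int partitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, partitions);
    }

    // Counts how many different partitions a set of keys ends up on.
    static int distinctPartitions(String[] keys, int partitions) {
        boolean[] used = new boolean[partitions];
        int distinct = 0;
        for (String key : keys) {
            int p = partitionFor(key, partitions);
            if (!used[p]) {
                used[p] = true;
                distinct++;
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        int partitions = 48;
        String[] byName = new String[50];
        Arrays.setAll(byName, i -> "router-" + i); // identity from the advertised system name
        String[] byIp = new String[50];
        Arrays.fill(byIp, "10.0.0.1");             // identity from the shared source IP
        System.out.println("keyed by name: " + distinctPartitions(byName, partitions));
        System.out.println("keyed by IP:   " + distinctPartitions(byIp, partitions));
    }
}
```

&lt;p&gt;Whatever the exact hash, the IP-keyed run can only ever report one partition, so any consumer-side parallelism is gone before the first message is read.&lt;/p&gt;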

&lt;p&gt;When I aligned the identity derivation and re-ran at honest parity — same partitions, same full schema — the gap evaporated: ~19k vs ~19k rows/s, within run-to-run noise. Third lesson, and the one I'd tattoo on a benchmark: &lt;strong&gt;most benchmark "wins" are configuration artifacts.&lt;/strong&gt; If your number is surprisingly good, the first hypothesis should be that the test is unfair, not that you're a genius.&lt;/p&gt;

&lt;h2&gt;Where it landed&lt;/h2&gt;

&lt;p&gt;jBMP sustains tens of thousands of rows/second per partition into a network-attached TimescaleDB and scales near-linearly as traffic spreads across routers/partitions — hundreds of thousands of rows/second in bursts across a few dozen partitions. The real ceiling, at parity, is the database's write path, not the JVM.&lt;/p&gt;

&lt;p&gt;The three lessons generalise well beyond BMP:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Profile before optimising&lt;/strong&gt; — your intuition about the hot path is probably wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;During a stall, dump the threads and the DB sessions&lt;/strong&gt; — they'll tell you who's actually waiting on whom.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distrust a benchmark that flatters you&lt;/strong&gt; — equalise the configuration before you believe the number.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The code is open source (Apache-2.0): &lt;a href="https://github.com/lorenzopompili/jbmp" rel="noopener noreferrer"&gt;https://github.com/lorenzopompili/jbmp&lt;/a&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>networking</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
