How Close Can C++ Inter-Process Calls Get to Raw Socket Speed?

#cpp #linux #performance #opensource

I wanted an honest answer to a question that's easy to wave hands at: when a C++ process calls another process — not raw bytes over a socket, but an actual typed call, with serialization, routing, and dispatch — how much do you actually pay for that, and how close can you get to the physical floor while still keeping it convenient?

So I measured it, end to end, and compared it against both a published competitor benchmark and the closest thing to a "bare metal" reference I could find.

The problem

Every C++ codebase that needs processes to talk to each other ends up choosing between two bad options. Something like D-Bus gives you structure but was never built for speed. So for anything performance-sensitive, people drop to raw sockets and hand-roll the rest themselves — framing, dispatch to the right thread, reconnection, all of it, every time, in every project. Cloud-native languages solved this a decade ago with batteries-included RPC and service infrastructure. C++ mostly didn't get that, and it's still common to see the same hand-rolled socket plumbing in 2026 that existed in 2010.

I've spent a long time building a C++17 service framework (areg-sdk, more on that below) specifically to remove that plumbing, and I wanted real numbers, not a feeling, for what it actually costs.

What I measured

The full call path, not a cherry-picked subset:

One-way (2 hops): sender serializes → TCP to a router → TCP to the receiver → dispatch onto the receiver's correct thread → deserialize → method call.

Round-trip (4 hops): the same path there, then back — sender → router → receiver → router → sender — with serialization and dispatch happening on both legs.

I went through a router instead of connecting processes directly because it removes hardcoded addresses and startup-order dependencies: any component can call any other, in any order, without knowing where anything physically lives. The cost is a longer path than a direct connection — which is exactly why I wanted to know how much longer.

Hardware: an i7-13700H laptop, DDR4, Ubuntu, performance governor. Not a tuned workstation, and not embedded-class ARM — I don't have that data yet (more below).

The numbers

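One-way: ~12 μs. Round-trip: ~23.5 μs. Averaged over the path, that is roughly 6 μs per hop in both cases (12 / 2 and 23.5 / 4), with serialization and dispatch included.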

For a floor reference, a completely bare blocking socket call (zero framework, nothing) still costs ~4–11 μs round-trip even on dedicated hardware (Unix domain sockets, F. Werner, Max Planck Institute for Nuclear Physics, 2021). A full typed round trip through 4 hops and a router, serializing on both legs, lands at ~23.5 μs, roughly 2x to 6x that floor.
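
To get a feel for that floor on your own machine, here is a minimal sketch of a blocking ping-pong over a Unix domain socket pair. It is not the code behind the cited measurement and not part of areg-sdk: the function name `mean_round_trip_us`, the one-byte payload, and the fork-based echo peer are all assumptions made for illustration, and it reports only a mean, with no warm-up, core pinning, or percentile handling.

```cpp
// Blocking one-byte ping-pong over a Unix domain socket pair (POSIX only).
// Illustrative sketch: names and structure are hypothetical.
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <chrono>
#include <cstddef>

// Returns the mean round-trip time in microseconds over `iterations`
// ping-pongs, or -1.0 if a system call fails or `iterations` is zero.
inline double mean_round_trip_us(std::size_t iterations)
{
    int fds[2];
    if (::socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1.0;

    const pid_t child = ::fork();
    if (child < 0) {
        ::close(fds[0]);
        ::close(fds[1]);
        return -1.0;
    }
    if (child == 0) {
        // Echo peer: read one byte, write it straight back, stop at EOF.
        ::close(fds[0]);
        char byte = 0;
        while (::read(fds[1], &byte, 1) == 1 && ::write(fds[1], &byte, 1) == 1) {
        }
        ::_exit(0);
    }
    ::close(fds[1]);   // the parent keeps only its own end

    bool ok = true;
    char byte = 'x';
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; ok && i < iterations; ++i)
        ok = (::write(fds[0], &byte, 1) == 1) && (::read(fds[0], &byte, 1) == 1);
    const auto stop = std::chrono::steady_clock::now();

    ::close(fds[0]);   // EOF for the echo peer, which then exits
    int status = 0;
    ::waitpid(child, &status, 0);

    if (!ok || iterations == 0)
        return -1.0;
    const std::chrono::duration<double, std::micro> total = stop - start;
    return total.count() / static_cast<double>(iterations);
}
```

Calling `mean_round_trip_us(100000)` from a small `main` and printing the result gives a rough local baseline. Expect it to move with the CPU governor, scheduler noise, and whether the two processes share a core.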

I also compared against a 2025 Hitachi Energy Research benchmark of ZeroMQ, NanoMsg, and NNG (arXiv:2508.07934), using the same payload size of ~1KB per message:

Stack	Min (μs)	Mean (μs)	Conditions
Bare socket (UDS, no framework)	~4	—	Dedicated hardware
areg-sdk (this framework)	12.5	16.8	2 TCP hops, full serialize + dispatch, mobile CPU
NanoMsg	18.0	21.9	Raw transport, isolated cores, Xeon
ZeroMQ	22.0	27.5	Raw transport, isolated cores, Xeon
NNG	24.3	34.9	Raw transport, isolated cores, Xeon

On mean latency: ~23% lower than NanoMsg, ~39% lower than ZeroMQ, ~52% lower than NNG.
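
Those percentages are relative reductions in mean latency, computed from the Mean column of the table. A small sketch of the arithmetic, with hypothetical helper names:

```cpp
// Relative latency reduction, using the mean values from the table above.
// Helper names are illustrative only.
#include <cmath>

// Percentage by which `measured_us` undercuts `baseline_us`.
inline double percent_lower(double baseline_us, double measured_us)
{
    return (baseline_us - measured_us) / baseline_us * 100.0;
}

// Same value, rounded to the nearest whole percent.
inline long percent_lower_rounded(double baseline_us, double measured_us)
{
    return std::lround(percent_lower(baseline_us, measured_us));
}
```

With the areg-sdk mean of 16.8 μs, `percent_lower_rounded(21.9, 16.8)` gives 23, `percent_lower_rounded(27.5, 16.8)` gives 39, and `percent_lower_rounded(34.9, 16.8)` gives 52. Expressed as ratios instead, the same means are about 1.3x, 1.6x, and 2.1x apart.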

Why this isn't a clean win

Every condition in that comparison favors them, not me. Their numbers are raw socket send/receive timestamps only (no serialization, no dispatch), with publisher and subscriber pinned to isolated cores on an 8-core Xeon workstation, sending at a steady, moderate rate. Mine is the complete call path, including two TCP hops through a router, on a mobile CPU with no core isolation. Despite that, the full call path still comes out lower on both min and mean. But it is a comparison against published numbers from different hardware, not a head-to-head run on one machine, so read it as directionally interesting, not a victory lap.

How — this is the part the other platforms didn't have room for

Three things got the latency down here, and I want to show one of them in actual code rather than just assert it.

Lock-free, move-based event queue. Every cross-thread and cross-process message in areg-sdk goes through a bounded MPSC (multi-producer, single-consumer) ring buffer — a Dmitry Vyukov-style lock-free design. Here's the real comment block from EventQueue.hpp, unedited:

/**
 * \brief   Bounded MPSC (multiple-producer / single-consumer) event queue.
 *
 *          Normal-priority events use a fixed array ring (Dmitry Vyukov bounded
 *          MPSC): push is a single CAS on a ticket plus a release store, pop is a
 *          single acquire load -- no node allocation and no heap traffic on the
 *          hot path. The ring size is fixed at construction (power of two).
 *
 *          HighPrio and CriticalPrio events use a separate SpinLock-guarded
 *          priority deque (unbounded, never dropped) to preserve front-insertion
 *          ordering semantics.
 */

And the cache-line padding that keeps producer and consumer cursors from fighting over the same cache line — this is the literal source, not a paraphrase:

//////////////////////////////////////////////////////////////////////////
// 128 bytes separates the producer written enqueue cursor from the
// consumer written dequeue cursor. This covers the widest hardware cache
// line in use: x86/x86-64 and ARM32 use 32–64 bytes; Apple Silicon uses
// 128 bytes. Padding is conservative on 32-bit targets (wastes ~248 B)
// but keeps the layout correct across all supported platforms.
//////////////////////////////////////////////////////////////////////////
static constexpr uint32_t   AREG_MPSC_CACHE_LINE_SIZE{ 128u };

//!< Producer-written enqueue cursor - own cache line.
alignas(AREG_MPSC_CACHE_LINE_SIZE) std::atomic<size_t> mEnqueuePos;

//!< Consumer-written dequeue cursor - own cache line.
alignas(AREG_MPSC_CACHE_LINE_SIZE) std::atomic<size_t> mDequeuePos;
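
To see what that padding does to the layout, here is a standalone sketch that mirrors the two `alignas` members in a hypothetical `Cursors` struct and checks the result at compile time. The struct, the constant `kCacheLine`, and the two helper functions are illustration only and do not exist in areg-sdk.

```cpp
// Layout check for two cursors padded onto separate 128-byte lines.
// Standalone illustration: Cursors, kCacheLine and the helpers are hypothetical.
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 128;   // same value as AREG_MPSC_CACHE_LINE_SIZE

struct Cursors {
    alignas(kCacheLine) std::atomic<std::size_t> enqueue{0};   // written by producers
    alignas(kCacheLine) std::atomic<std::size_t> dequeue{0};   // written by the consumer
};

// Each member starts on its own 128-byte boundary, so the struct is
// 128-byte aligned and occupies exactly two such lines.
static_assert(alignof(Cursors) == kCacheLine, "struct inherits the member alignment");
static_assert(sizeof(Cursors) == 2 * kCacheLine, "one padded line per cursor");

// Byte distance between the two cursors, measured on a live object.
inline std::size_t cursor_distance(const Cursors& c)
{
    const auto first  = reinterpret_cast<std::uintptr_t>(&c.enqueue);
    const auto second = reinterpret_cast<std::uintptr_t>(&c.dequeue);
    return static_cast<std::size_t>(second - first);
}

// Bytes of the struct that are padding rather than cursor storage.
constexpr std::size_t wasted_bytes()
{
    return sizeof(Cursors) - 2 * sizeof(std::atomic<std::size_t>);
}
```

On a typical 32-bit target, where each cursor is 4 bytes, `wasted_bytes()` comes to 248, which matches the figure in the source comment. On a typical 64-bit target it is 240. C++17 also defines `std::hardware_destructive_interference_size` in `<new>` for this purpose, but compiler support for it varies, and a fixed constant keeps the layout identical across platforms.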

push_event moves the event into the ring — its own doc comment is explicit about it: "The event's shared buffer is transferred (O(1) — no data copy)." That's C++17 move semantics doing real work on the hot path, not a buzzword in a slide deck.
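
For readers who have not met this design before, here is a compact sketch of a bounded MPSC ring in the style of Dmitry Vyukov's published bounded queue, narrowed to a single consumer. It is not the areg-sdk `EventQueue`: the names `BoundedMpscRing`, `try_push`, `try_pop`, and `Event` are hypothetical, priority handling is left out, and the cursors are not padded, to keep the listing short. It assumes `T` is default-constructible and move-assignable.

```cpp
// Bounded MPSC ring with per-slot sequence numbers (Vyukov-style).
// Illustrative sketch only; every name here is hypothetical.
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

template <typename T>
class BoundedMpscRing {
public:
    // Capacity must be a power of two so "pos & mMask" replaces a modulo.
    explicit BoundedMpscRing(std::size_t capacity)
        : mCells(capacity), mMask(capacity - 1)
    {
        assert(capacity >= 2 && (capacity & (capacity - 1)) == 0);
        for (std::size_t i = 0; i < capacity; ++i)
            mCells[i].seq.store(i, std::memory_order_relaxed);
    }

    // Any thread may call this. Returns false, leaving value untouched, when full.
    bool try_push(T&& value)
    {
        std::size_t pos = mEnqueuePos.load(std::memory_order_relaxed);
        for (;;) {
            Cell& cell = mCells[pos & mMask];
            const std::size_t seq = cell.seq.load(std::memory_order_acquire);
            const auto diff = static_cast<std::ptrdiff_t>(seq) - static_cast<std::ptrdiff_t>(pos);
            if (diff == 0) {
                // The slot is free for this ticket: claim it with one CAS.
                if (mEnqueuePos.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
                    cell.value = std::move(value);                        // pointer move, no payload copy
                    cell.seq.store(pos + 1, std::memory_order_release);   // publish to the consumer
                    return true;
                }
                // CAS failed: pos now holds the current ticket, so retry.
            } else if (diff < 0) {
                return false;   // slot still holds an unconsumed item: ring is full
            } else {
                pos = mEnqueuePos.load(std::memory_order_relaxed);   // another producer got ahead
            }
        }
    }

    // Only the single consumer thread may call this. Returns false when empty.
    bool try_pop(T& out)
    {
        Cell& cell = mCells[mDequeuePos & mMask];
        if (cell.seq.load(std::memory_order_acquire) != mDequeuePos + 1)
            return false;   // nothing published in this slot yet
        out = std::move(cell.value);
        // Hand the slot back to producers for the next lap around the ring.
        cell.seq.store(mDequeuePos + mMask + 1, std::memory_order_release);
        ++mDequeuePos;
        return true;
    }

private:
    struct Cell {
        std::atomic<std::size_t> seq{0};
        T value{};
    };
    std::vector<Cell> mCells;
    const std::size_t mMask;
    std::atomic<std::size_t> mEnqueuePos{0};   // shared by all producers
    std::size_t mDequeuePos{0};                // touched by the consumer only
};

// Hypothetical event: the payload sits in a shared heap buffer, so moving
// the event moves a pointer and never touches the payload bytes.
struct Event {
    std::shared_ptr<std::vector<std::uint8_t>> payload;
};
```

Pushing an `Event` with a 1KB payload through `BoundedMpscRing<Event>` via `std::move` leaves the source event empty, and the popped event points at the same buffer address as before the push. That is the O(1) transfer the doc comment describes, shown here on a stand-in type.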

Architecture that survived three rewrites of that same queue. The queue you're looking at above is the third implementation. What stayed constant across all three rewrites was how messages get dispatched and delivered to components — only the internal push/pop mechanics changed. That stability is what made it safe to keep optimizing the queue without destabilizing everything built on top of it.

Everything else, nanosecond by nanosecond. Lock reduction, heap allocation avoidance, variable alignment, instruction count for hot-path operations — none of these show up as one big win, they show up as a sum.

What's missing

I don't have real numbers from actual embedded ARM hardware — Cortex-A53/A72 class boards, not a laptop CPU. If you run this on a Pi or similar and get numbers, I'd genuinely like to see them.

On the dependency question people always ask: no external packages need to be installed to build this. It's STL and POSIX/OS APIs, with ncurses as an optional, off-by-default dependency for an interactive console mode. SQLite is vendored (bundled source, used only for log persistence, not on this IPC path), so you don't need to install anything for it either. It has also been adopted and built on Alpine Linux by a project collaborator who is an Alpine package maintainer. I haven't put that through CI myself yet, but it's documented here if you want to see it firsthand rather than take my word for it.

It assumes a private, trusted network — there's no TLS yet, so if your deployment needs encrypted transport even on an internal network, this isn't the right tool today.

Try it / dig in

I'm the author — areg-sdk, C++17, Apache 2.0, runs on Linux/Windows/macOS across x86, x86-64, arm32, and arm64.

Full methodology and every caveat I could think of: 08c-areg-vs-hitachi-benchmark.md
Runnable benchmark example: examples/30_publatency
Repo: github.com/aregtech/areg-sdk

If you've measured something similar — D-Bus, ZeroMQ, your own hand-rolled socket layer, anything — I'd like to compare notes. And separately: if anyone has real, measured evidence that C++20 beats C++17 on latency for this kind of workload, I want to see it. So far I haven't found any.