Samar Prakash

Posted on May 30 • Originally published at Medium

When Two Containers on the Same Host Are Shouting Through a Load Balancer

#java #docker #architecture #aws

Building a Unix-Domain-Socket IPC server for ECS-on-EC2 services that need to talk fast, cheap, and reliably

A while back I was looking at a flamegraph of a service that, on paper, should not have been having any performance problems. The producer and the consumer were the same Docker image's worth of trouble — colocated on the same EC2 host, in the same ECS cluster, sharing the same instance type, the same kernel, the same RAM. By every reasonable measure they were neighbours.

And yet every event was making a round trip that looked roughly like this: producer → kernel TCP stack → ENI on the producer task → AWS VPC → internal load balancer → ENI on the consumer task → kernel TCP stack → consumer. TLS handshake. HTTP framing. JSON over the wire. Connection pool. Retry policy. The whole circus.

I wasn't doing anything wrong. This is what the platform funnels you toward. ECS with awsvpc networking gives every task its own ENI. The default story for "service A talks to service B" is "give B a DNS name, put a load balancer in front of it, configure a security group, point A at the LB." Even if A and B are physically on the same box, the bytes are still leaving the kernel, traversing the VPC, and coming back.

There's a fix for this. It's been a fix for fifty-something years. It just hasn't been the default fix, because cloud-native architecture grew up assuming services would be scattered across hosts and the network was the abstraction that mattered.

This article is about building a proper IPC server using Unix Domain Sockets, deployed as a sidecar pattern on ECS-on-EC2, with a wire protocol robust enough to ship in production. We're going to design it from scratch: the transport choice, the wire format, the backpressure model, the failure modes, the deployment topology. I'll show you pseudo-code adapted from the real implementation and call out the small number of places where, if you get it wrong, you'll spend a weekend debugging it.

The intended outcome is something you could lift the pattern from. The article is long because the problem isn't actually that simple once you get past the "just use a socket file" stage. But none of it is mystical. If you've written Netty code or a binary protocol parser before, you'll be fine. If you haven't, the early sections will land regardless.

The problem with "it's all just localhost"

The first thing every engineer reaches for when they realise two services live on the same host is 127.0.0.1. That worked beautifully in 1999. It works less well on modern container platforms, and the reason is worth understanding properly.

When you run an ECS task in awsvpc networking mode — and you probably are, because bridge mode has its own pile of caveats — every task gets its own ENI. AWS attaches that ENI to a dedicated network namespace inside the EC2 host's kernel. Task A's loopback interface is not the same loopback interface as task B's loopback interface. They both see 127.0.0.1, but those two 127.0.0.1s are different things. A connection to 127.0.0.1:8080 from inside task A will never reach a listener inside task B, even if they're on the same physical EC2 instance.

You can work around this with host networking mode, but then you've given up port isolation and you have to coordinate every port number across every container on the host. That trade ages badly.

The second thing engineers reach for is "well, fine, put a load balancer in front of it." This works. It also costs you:

A real ENI on each side (and ENIs are a rationed resource per instance type).
A trip through the AWS VPC data plane.
Potentially TLS termination and re-encryption.
JSON serialization on the way out and deserialization on the way in.
The overhead of HTTP itself: headers, status codes, content negotiation, the eternal question of whether to use keep-alive.
And in some setups, an actual hop through an external NLB or ALB, which means the bytes leave the host entirely just to come back.

For request rates in the dozens or hundreds per second, none of this matters. For thousands per second per host, it starts mattering. For tens of thousands, it dominates your cost and your tail latency.

We need a transport that says: "I know we're on the same kernel. Just give me a pipe."

That transport is the Unix Domain Socket.

What a Unix Domain Socket actually is

If you already know, skip this section. If you've been writing distributed systems for years but never had a reason to use one, this is the five-minute version.

A UDS is a socket that uses a file path as its address instead of an IP and port. You bind to /var/run/something.sock, the kernel creates a special file at that path, and clients connect() to the same path. Once connected, both sides have a file descriptor that behaves almost exactly like a TCP socket: read(), write(), close(), the works.
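
To make the bind/connect model concrete, here is a minimal sketch using the JDK's built-in Unix-domain support (java.nio channels with StandardProtocolFamily.UNIX, available since Java 16). The class name UdsDemo, the method roundTrip, and the temp-directory path are made up for this illustration; the server described in this article uses Netty rather than raw NIO channels.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class UdsDemo {
    /** Binds a socket file, connects to it, sends one message, and returns what the server side read. */
    static String roundTrip(String message) {
        try {
            Path dir = Files.createTempDirectory("uds-demo");
            Path socketPath = dir.resolve("demo.sock");
            UnixDomainSocketAddress address = UnixDomainSocketAddress.of(socketPath);
            try (ServerSocketChannel server = ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
                server.bind(address); // the kernel creates the socket file at this path
                try (SocketChannel client = SocketChannel.open(StandardProtocolFamily.UNIX)) {
                    client.connect(address); // the "address" is just the file path
                    byte[] bytes = message.getBytes(StandardCharsets.UTF_8);
                    ByteBuffer out = ByteBuffer.wrap(bytes);
                    while (out.hasRemaining()) client.write(out);
                    try (SocketChannel accepted = server.accept()) {
                        ByteBuffer in = ByteBuffer.allocate(bytes.length);
                        while (in.hasRemaining() && accepted.read(in) >= 0) { /* keep reading */ }
                        return new String(in.array(), 0, in.position(), StandardCharsets.UTF_8);
                    }
                }
            } finally {
                Files.deleteIfExists(socketPath); // the file outlives the socket unless someone removes it
                Files.deleteIfExists(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling UdsDemo.roundTrip("hello") should hand back the same string. The whole exchange runs on one thread because the kernel queues the connection and buffers the small write until the server side accepts and reads.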

The interesting differences:

There's no TCP/IP stack in the path. No headers, no checksums, no congestion control, no retransmits. The kernel just shuffles bytes from one process's send buffer to the other's receive buffer. Two copies, both done by the kernel: user-space buffer to kernel skb, kernel skb back to user-space buffer. That's it.

There's no network. The bytes never touch a NIC, never go through iptables, never see your VPC routing tables. If you can't reach the other side, it's because the file doesn't exist or you don't have permission to open it, not because some far-away switch is having a bad day.

There's no port collision. The address is a filesystem path, so two services on the same host can each have their own socket at their own path with zero coordination.

And the performance is genuinely excellent. A single UDS channel on a modern Linux server will sustain something on the order of 50–55 Gbit/s of throughput before you start seeing CPU saturation in the syscall layer. That's "I cannot saturate this with anything I am about to throw at it" territory for almost every application.

The one piece of cloud reality you have to manage is that UDS sockets live on the filesystem. If your producer container and your server container are different containers, they don't share a filesystem by default. You have to give them a shared volume — a host bind-mount works perfectly — so both can see the same socket file. We'll come back to this when we talk about deployment.

A quick bake-off against the alternatives

Before we commit, it's worth being honest about the options. I considered five transports for this problem; here's where I landed on each.

TCP loopback. Works fine when both endpoints are in the same network namespace, which on awsvpc ECS they aren't. You can switch the task to host networking and make it work, but you've coupled every container on the host into one port namespace forever. Hard pass.

HTTP/gRPC over loopback or over the LB. Easy to write, easy to operate, miserable per-message overhead. gRPC has its place when the consumer might one day be in a different region; it's wrong when the consumer is forty inches of copper trace away from the producer. Also, gRPC over a real socket on the same host still pays the cost of HTTP/2 framing, flow control, header compression — none of which buys you anything when the round-trip is microseconds.

Shared memory. The classic answer for "ridiculously fast IPC" — map a region into both processes, treat it as a ring buffer, get sub-microsecond latency. The cost is operational: you need a discipline around term buffers, you need to size memory carefully, you need to deal with the case where one side crashes mid-write. Tools like Aeron do this very well and are the right call when you genuinely need 250-nanosecond publish latency or hundreds of millions of messages per second. For "I need to push a few gigabytes per second between two containers with single-digit-millisecond budgets", shared memory is a Ferrari for a school run.

Named pipes / FIFOs. Half-duplex, no accept() model, no easy way to fan multiple clients into one server. Fine for shell pipelines, awkward for service-to-service IPC. Not a serious contender.

Unix Domain Sockets. Full-duplex byte stream. Connection-oriented (SOCK_STREAM), so we get FIFO ordering and per-channel isolation. Every language has battle-tested support. The Linux epoll event loop treats UDS exactly like TCP, so frameworks like Netty just work. Per-channel throughput well above what we need. No new infrastructure to operate.

The decision more or less makes itself once you write it out. UDS is unsexy, well-understood, and exactly the right tool. Aeron is a reasonable Phase 2 escape hatch if measurements ever say we need it, but the design rule is "do not add infrastructure complexity until measurements demand it."

Designing the wire protocol

A socket is a byte stream. The application has to invent the concept of "a message." This is the first place where careless design will haunt you, so let's be careful.

The rule I'll defend without hesitation: length-prefixed binary frames with a fixed-size, parseable-first-thing prelude.

Here's the layout I ended up with:

The frame is three regions stitched end-to-end on the byte stream:

A 12-byte fixed prelude that every implementation reads first. It contains the total frame length (uint32, counting the header plus the payload but not the prelude itself), the header length (uint16), a flags byte, a reserved byte, and a CRC32C over the header bytes (fixed32): 4 + 2 + 1 + 1 + 4 = 12 bytes. All big-endian, all at known offsets.

A variable-length header, defined as a Protobuf message. It carries the frame type (HELLO, HELLO_ACK, DATA, DATA_BATCH, ACCEPTED, FAILED, CREDIT_UPDATE, DRAIN, PING, PONG), the protocol version, producer identity, sequence numbers, and any per-frame metadata. Because it's Protobuf, we can add new fields over time without breaking older clients.

A payload, which is opaque bytes. The server does not parse the payload. It does not know or care what's in it. From its perspective the payload is byte[], end of story. The application protocol defines what those bytes mean.

Three quick questions that always come up when I explain this design.

"Why not just one Protobuf message with the payload as a bytes field?" Because then your parser is in a chicken-and-egg situation. To know where the message ends, you have to parse the message; to parse the message, you have to have read all of it. Worse, if the first few bytes get corrupted, the Protobuf parser has no anchor: it can't tell you "the length field is wrong" because it doesn't know what the length field even is. The fixed prelude gives every reader a known landmark. Read the 12 prelude bytes, validate them, then trust the variable-length parts.

"Why is there a CRC if TCP / UDS is already reliable?" UDS is reliable across the wire. It is not reliable against your own bugs. A CRC on the header bytes will catch a serializer that wrote the wrong length, a buffer that got truncated on a partial write, a decoder that misaligned itself after an earlier malformed frame. The CRC is in there to protect you from yourself. (The flags byte reserves a second CRC slot for the payload, which we don't compute in v1 because the payload is opaque and the application owns its own integrity checks. The bit is reserved so we can turn it on later without a version bump.)

"Why CRC32C specifically?" It has hardware acceleration on modern x86-64 and ARM CPUs (CRC32C is in SSE 4.2 and the ARMv8 CRC extensions). Computing it costs effectively nothing per frame. The standard CRC32 would also work but is meaningfully slower in software.

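To see what the header checksum buys you, here is a small sketch using java.util.zip.CRC32C, the JDK class the encoder below relies on. The wrapper class HeaderChecksum and its method names are hypothetical, chosen only for this example.

```java
import java.util.zip.CRC32C;

final class HeaderChecksum {
    /** Computes CRC32C (Castagnoli) over the given bytes. */
    static long crc32c(byte[] bytes) {
        CRC32C crc = new CRC32C();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue(); // unsigned 32-bit value carried in a long
    }

    /** True when the bytes still hash to the CRC that was written into the prelude. */
    static boolean matches(byte[] headerBytes, long expectedCrc) {
        return crc32c(headerBytes) == expectedCrc;
    }
}
```

The standard check input, the ASCII string "123456789", hashes to 0xE3069283 under CRC32C, which is a handy sanity test when implementing the protocol in a second language. Flipping any single bit of the header makes matches return false.
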
Here's the encoder in pseudo-code, lightly cleaned up from what's actually running:

public static void encode(Frame frame, ByteBuf output, CRC32C scratch) {
    byte[] headerBytes = frame.header().toByteArray();
    int payloadLength = frame.payload().length;
    // Widen before adding so two large ints cannot overflow.
    long frameLength = (long) headerBytes.length + payloadLength;

    // The header length travels as a uint16, so it has its own limit.
    if (headerBytes.length > 0xFFFF)
        throw new IllegalArgumentException("header too large: " + headerBytes.length);
    if (frameLength > MAX_FRAME_BYTES)
        throw new IllegalArgumentException("frame too large: " + frameLength);

    scratch.reset();
    scratch.update(headerBytes, 0, headerBytes.length);
    long headerCrc = scratch.getValue();

    output.writeInt((int) frameLength);     // 4
    output.writeShort(headerBytes.length);  // 2
    output.writeByte(0);                    // flags  (1)
    output.writeByte(0);                    // rsv    (1)
    output.writeInt((int) headerCrc);       // 4   -- total prelude = 12 bytes
    output.writeBytes(headerBytes);
    if (payloadLength > 0) output.writeBytes(frame.payload());
}

And the decoder, which is the more interesting half because it has to defend against partial reads and malicious peers:

public static DecodeResult tryDecode(ByteBuf input, int maxFrameBytes, CRC32C scratch) {
    if (input.readableBytes() < PRELUDE_BYTES) return INSUFFICIENT_BYTES;

    // Peek with absolute getters; the reader index only moves once a whole frame is consumed.
    int readerIndex = input.readerIndex();
    long frameLength = input.getUnsignedInt(readerIndex);
    if (frameLength > maxFrameBytes)
        throw new MalformedFrameException("frame too large");           // <-- before any alloc
    if (input.readableBytes() < PRELUDE_BYTES + frameLength)
        return INSUFFICIENT_BYTES;

    int headerLength = input.getUnsignedShort(readerIndex + 4);
    short flags      = input.getUnsignedByte(readerIndex + 6);
    short reserved   = input.getUnsignedByte(readerIndex + 7);
    long  headerCrc  = input.getUnsignedInt(readerIndex + 8);

    // Helper not shown; it should at least reject headerLength > frameLength.
    validatePreludeInvariants(flags, reserved, headerLength, frameLength);

    int headerStart = readerIndex + PRELUDE_BYTES;
    scratch.reset();
    scratch.update(asReadOnlySlice(input, headerStart, headerLength));
    if (scratch.getValue() != headerCrc)
        throw new MalformedFrameException("header CRC mismatch");

    FrameHeader header = FrameHeader.parseFrom(slice(input, headerStart, headerLength));
    // frameLength is bounded by maxFrameBytes above, so the narrowing casts are safe.
    byte[] payload = copyPayload(input, headerStart + headerLength, (int) (frameLength - headerLength));
    input.skipBytes(PRELUDE_BYTES + (int) frameLength);
    return new Decoded(new Frame(header, payload));
}

The discipline that makes this safe: reject before you allocate. The frameLength > maxFrameBytes check happens before we allocate anything for the payload. If a hostile or buggy peer sends a frame claiming to be 2 GB, we throw immediately and close the connection. We never give them the satisfaction of making us reserve 2 GB of direct memory.
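
If you want to play with the partial-read and reject-before-allocate behaviour without pulling in Netty or Protobuf, the following is a stripped-down sketch on plain java.nio.ByteBuffer. Everything here (FrameCodecSketch, RawFrame, the use of IllegalStateException) is a stand-in invented for the example, and PRELUDE_BYTES is simply the sum of the five prelude fields listed earlier; adjust it if your prelude carries more.

```java
import java.nio.ByteBuffer;
import java.util.Optional;
import java.util.zip.CRC32C;

final class FrameCodecSketch {
    static final int PRELUDE_BYTES = 12; // uint32 + uint16 + flags + reserved + fixed32

    record RawFrame(byte[] header, byte[] payload) {}

    static byte[] encode(byte[] header, byte[] payload) {
        CRC32C crc = new CRC32C();
        crc.update(header, 0, header.length);
        // ByteBuffer is big-endian by default, matching the wire format.
        ByteBuffer out = ByteBuffer.allocate(PRELUDE_BYTES + header.length + payload.length);
        out.putInt(header.length + payload.length);
        out.putShort((short) header.length);
        out.put((byte) 0).put((byte) 0); // flags, reserved
        out.putInt((int) crc.getValue());
        out.put(header).put(payload);
        return out.array();
    }

    /** Returns empty when more bytes are needed; throws when the frame is malformed. */
    static Optional<RawFrame> tryDecode(ByteBuffer buffer, int maxFrameBytes) {
        if (buffer.remaining() < PRELUDE_BYTES) return Optional.empty();
        int start = buffer.position();
        long frameLength = buffer.getInt(start) & 0xFFFFFFFFL;
        if (frameLength > maxFrameBytes) // checked before any allocation
            throw new IllegalStateException("frame too large: " + frameLength);
        int headerLength = buffer.getShort(start + 4) & 0xFFFF;
        if (headerLength > frameLength)
            throw new IllegalStateException("header longer than frame");
        if (buffer.remaining() < PRELUDE_BYTES + frameLength) return Optional.empty();
        long expectedCrc = buffer.getInt(start + 8) & 0xFFFFFFFFL;

        byte[] header = new byte[headerLength];
        byte[] payload = new byte[(int) frameLength - headerLength];
        buffer.position(start + PRELUDE_BYTES);
        buffer.get(header).get(payload); // consumes exactly one frame

        CRC32C crc = new CRC32C();
        crc.update(header, 0, header.length);
        if (crc.getValue() != expectedCrc)
            throw new IllegalStateException("header CRC mismatch");
        return Optional.of(new RawFrame(header, payload));
    }
}
```

Feeding tryDecode only the first few bytes of an encoded frame returns an empty Optional and leaves the buffer position untouched, so the caller can simply wait for more bytes and try again.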

On the Netty side, this lives in a ByteToMessageDecoder subclass. One small detail that matters at scale: we use COMPOSITE_CUMULATOR instead of the default MERGE_CUMULATOR. The default copies every fragmented inbound read into a new contiguous buffer. The composite cumulator keeps the fragments as a chain and doesn't copy until something actually needs a contiguous view. At sustained high throughput, this is the difference between hot loops that allocate and hot loops that don't.

Frame types and the handshake

There are ten frame types in this protocol. They split into three groups.

The handshake pair: HELLO and HELLO_ACK. The client sends HELLO immediately on channelActive, advertising the protocol version range it can speak and its identity (producer ID, producer epoch). The server picks the highest mutually-supported version and replies with HELLO_ACK containing the negotiated version and the initial credit budget it's willing to grant. If there is no overlap, the server replies with accepted=false and a rejection reason, then closes the connection.

private OptionalInt negotiateVersion(HelloFrame hello) {
    int producerMin = hello.getMinProtocolVersion();
    int producerMax = hello.getMaxProtocolVersion();
    if (producerMax < serverConfig.minProtocolVersion()) return OptionalInt.empty();
    if (producerMin > serverConfig.maxProtocolVersion()) return OptionalInt.empty();
    return OptionalInt.of(Math.min(producerMax, serverConfig.maxProtocolVersion()));
}

This is the entire version-negotiation algorithm. It's six lines because it should be six lines. The temptation to invent something fancier here is real — feature flags per version, optional capabilities, etc. — but every layer of cleverness you add to the handshake is a layer your operations team will be debugging at 3 AM. Two integers, pick the overlap, ship it.

The data path: DATA and DATA_BATCH. A single chunk goes in DATA. Multiple chunks coalesced into one wire frame go in DATA_BATCH, with the header carrying an array of per-chunk slice descriptors (offset, length, event count, sequence). One write syscall ships a hundred chunks. The kernel never sees us spamming.
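
A rough sketch of how DATA_BATCH coalescing could work, with the slice descriptors kept as plain records instead of Protobuf fields. The names BatchSketch, Chunk, Slice and Batch are hypothetical; only the descriptor fields (offset, length, event count, sequence) come from the protocol described here.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BatchSketch {
    record Chunk(long sequence, int eventCount, byte[] bytes) {}
    record Slice(int offset, int length, int eventCount, long sequence) {}
    record Batch(List<Slice> slices, byte[] payload) {}

    /** Concatenates chunk bytes into one payload and records where each chunk landed. */
    static Batch coalesce(List<Chunk> chunks) {
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        List<Slice> slices = new ArrayList<>();
        for (Chunk chunk : chunks) {
            // The offset is wherever the payload currently ends.
            slices.add(new Slice(payload.size(), chunk.bytes().length, chunk.eventCount(), chunk.sequence()));
            payload.write(chunk.bytes(), 0, chunk.bytes().length);
        }
        return new Batch(List.copyOf(slices), payload.toByteArray());
    }

    /** Receiver side: recovers one chunk's bytes from the shared payload. */
    static byte[] sliceBytes(Batch batch, Slice slice) {
        return Arrays.copyOfRange(batch.payload(), slice.offset(), slice.offset() + slice.length());
    }
}
```

The receiver never needs delimiters inside the payload: the descriptors in the header say exactly where each chunk starts and ends, and each chunk keeps its own sequence number for acking.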

The control path: ACCEPTED, FAILED, CREDIT_UPDATE, DRAIN, PING, PONG. These are how the server tells the client what happened to its data and how the two sides keep the channel alive.

The full handshake-to-publish-to-drain story looks like this:

The producer connects to the socket file (retrying with exponential backoff and jitter if the server isn't up yet — more on that in the lifecycle section). It immediately writes HELLO. The server reads it, negotiates a version, allocates per-connection state, and writes HELLO_ACK. From that moment, the connection is in the "data plane" — both sides are free to send and receive frames in any order.

The producer then writes DATA frames as fast as its credit budget allows. Each frame has a producer_sequence field that's monotonically increasing per producer. The server reads them, runs them through its decoder, gates them against its server-side credit window, and hands them to a pluggable ChunkSink. The sink returns either Accepted or Failed(reason, retryable). On Accepted, the server appends the chunk identity to a per-channel "pending ACCEPTED" list. On Failed, it flushes the pending list and emits a FAILED frame.

Here's the thing I want to highlight. Netty hands you frames in batches: it reads as many bytes as the socket has and then walks them through the pipeline. When the batch is done, it calls channelReadComplete. We use that callback to flush the pending ACCEPTED list as a single frame containing every chunk identity that landed in this batch.

@Override
public void channelReadComplete(ChannelHandlerContext context) {
    flushPendingAcceptedAcks(context);  // one ack frame for N data frames
    context.fireChannelReadComplete();
}

At sustained billions of frames per day, this is the single most important throughput optimization in the protocol. Without it, every DATA round-trips against its own ACCEPTED. With it, a burst of N data frames generates approximately one ack frame. The ack channel stays under 1% of the forward traffic.
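
The accumulator behind flushPendingAcceptedAcks can be very small. Here is one possible shape for it, stripped of Netty; PendingAcks and AcceptedFrame are names invented for the sketch, and it assumes (as in the design above) that a given instance is only ever touched from the channel's own event loop, so it needs no locking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class PendingAcks {
    /** One wire frame acknowledging many chunks. */
    record AcceptedFrame(List<Long> sequences) {}

    private final List<Long> pending = new ArrayList<>();

    /** Called once per DATA frame the sink accepted. */
    void onAccepted(long producerSequence) {
        pending.add(producerSequence);
    }

    /** Called from channelReadComplete: emits at most one frame for the whole read batch. */
    Optional<AcceptedFrame> flush() {
        if (pending.isEmpty()) return Optional.empty();
        AcceptedFrame frame = new AcceptedFrame(List.copyOf(pending));
        pending.clear();
        return Optional.of(frame);
    }
}
```

Three onAccepted calls followed by one flush yield a single frame carrying three sequence numbers; a second flush with nothing pending yields nothing, so an idle read cycle costs no write.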

The reverse trick: when something fails, the FAILED frame carries the restored credit footprint. The producer reserved credit when it sent the chunk; the failed ack tells it "you can take that credit back, plus this amount of additional credit I freed on your behalf." A single round-trip closes the negative path. There's no ambiguity about who owns the credit footprint of a rejected chunk and no separate CREDIT_UPDATE frame needed.

Three-dimensional backpressure (this is the part most designs get wrong)

Anyone who's built a queue-like thing has shipped a one-dimensional credit system: "you can have N in-flight requests." It works fine until somebody sends a single request that contains 50 megabytes of payload while another somebody sends ten thousand requests that are 12 bytes each. Both saturate the system in completely different ways, and a one-dimensional limit can't see either one coming.

So we do it in three dimensions. A producer is admitted to send a chunk if and only if:

There is at least one chunk-slot available (caps the count of in-flight chunks).
There are enough raw-event slots available to cover the chunk's event count (caps the aggregate cardinality).
There are enough payload-byte slots available to cover the chunk's payload size (caps the aggregate bytes in flight).

The acquire is all-or-nothing. If any dimension is short, the chunk is rejected with REJECTED_NO_CREDIT and the producer is expected to retry later.

public synchronized boolean tryAcquire(int slots, long events, long payloadBytes) {
    if (availableChunkSlots < slots
            || availableRawEvents < events
            || availablePayloadBytes < payloadBytes) {
        return false;
    }
    availableChunkSlots -= slots;
    availableRawEvents -= events;
    availablePayloadBytes -= payloadBytes;
    return true;
}

The producer enforces this and the server enforces an identical copy of it on its side. If the producer is buggy or malicious and tries to send beyond its credit, the server's credit gate catches it and the offending chunk gets FAILED(retryable=true, reason="credit exhausted"). This is defence in depth: the client controls its own behaviour, but the server is the source of truth for what's actually safe.

Credit is restored on three events: ACCEPTED ack received, FAILED ack received (with the restored amount carried in the frame), or connection drop (every outstanding chunk's footprint is freed so the producer is never permanently blocked).
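
Putting acquire and restore together, a self-contained credit window might look like the sketch below. It follows the tryAcquire logic shown above, fixed to one chunk slot per call; the release method, the clamping to the initial budget, and the class name CreditWindow are my assumptions about a reasonable shape, not a copy of the production code.

```java
final class CreditWindow {
    private final long maxChunkSlots;
    private final long maxRawEvents;
    private final long maxPayloadBytes;
    private long availableChunkSlots;
    private long availableRawEvents;
    private long availablePayloadBytes;

    CreditWindow(long chunkSlots, long rawEvents, long payloadBytes) {
        this.maxChunkSlots = this.availableChunkSlots = chunkSlots;
        this.maxRawEvents = this.availableRawEvents = rawEvents;
        this.maxPayloadBytes = this.availablePayloadBytes = payloadBytes;
    }

    /** All-or-nothing: either every dimension has room for the chunk or nothing is deducted. */
    synchronized boolean tryAcquire(long events, long payloadBytes) {
        if (availableChunkSlots < 1 || availableRawEvents < events || availablePayloadBytes < payloadBytes) {
            return false;
        }
        availableChunkSlots -= 1;
        availableRawEvents -= events;
        availablePayloadBytes -= payloadBytes;
        return true;
    }

    /** Called on ACCEPTED, FAILED, or disconnect; clamped so a duplicate ack cannot inflate the budget. */
    synchronized void release(long events, long payloadBytes) {
        availableChunkSlots = Math.min(maxChunkSlots, availableChunkSlots + 1);
        availableRawEvents = Math.min(maxRawEvents, availableRawEvents + events);
        availablePayloadBytes = Math.min(maxPayloadBytes, availablePayloadBytes + payloadBytes);
    }

    synchronized long availableChunkSlots() {
        return availableChunkSlots;
    }
}
```

With a window of 2 slots, 1,000 events and 1,024 bytes, a single 50 MB chunk is refused on the byte dimension and a 5,000-event chunk is refused on the event dimension, even though both have a free slot. That is exactly the pair of cases a one-dimensional counter cannot tell apart.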

Layered on top of this is Netty's channel-level backpressure. The pipeline is configured with a WriteBufferWaterMark. When the outbound buffer crosses the high mark, channel.isWritable() flips to false; when it drains below the low mark, it flips back. A well-behaved producer checks isWritable() before submitting, which means even within its credit budget the producer naturally pauses when the socket's send buffer is full. Below all of that, the kernel's UDS send buffer applies its own pressure: if it's full, a blocking write() stalls, and a non-blocking write() (the mode Netty uses) either writes only part of the data or fails with EAGAIN. The application credit window, Netty's watermark, and the kernel buffer form three concentric rings, each catching a different misbehaviour.

The "retry-after" hint deserves a mention. When the server sends a credit update with no available room, it can attach a retry_after_millis field. The producer treats that as actionable — sleep for that duration before retrying, and if the producer is itself fronted by an HTTP API, convert it to a 429 with a Retry-After header. Backpressure should always be observable to the original caller. Silent drops are how you lose a weekend chasing data loss that isn't actually data loss.

Concurrency without sadness

Netty's threading model is famously easy to get wrong. The single cardinal rule, the one I've now seen broken in production at three different companies: do not block on a Netty event loop thread.

The event loop owns the channel. It reads bytes off the socket, runs them through the decoder, dispatches to the protocol handler, runs the encoder, writes bytes back. One thread, many channels, no contention per channel. If you block that thread — on a synchronous database call, on a blocking ring-buffer publish, on a thread sleep, on a lock that's held by an offline service — you don't just stall one channel. You stall every channel that thread owns. Under sustained load that's the entire fleet of producers connected to that server.

The way out of this is discipline about where work happens. The server pipeline does exactly this much work on the event loop:

The IO event loop runs FlushConsolidationHandler → IdleStateHandler → FrameDecoder → FrameEncoder → protocol handler. All of those are CPU-only operations. The protocol handler does Protobuf parsing, CRC checking, version negotiation, credit accounting, ack accumulation. None of it talks to disk or to the network or to a database.

When a chunk needs to actually go somewhere — be aggregated, compressed, written to durable storage, fanned out to a notification stream — the protocol handler hands it to a ChunkSink interface. The sink is pluggable. In the simplest case (the v1 implementation), it might just enqueue the chunk into a lock-free ring buffer and return immediately. The actual heavy lifting happens on separate worker pools that the event loop never touches.

public interface ChunkSink {
    SinkVerdict accept(FrameHeader header, byte[] payload);
    sealed interface SinkVerdict {
        record Accepted() implements SinkVerdict {}
        record Failed(String reason, boolean retryable) implements SinkVerdict {}
    }
}

The contract: accept returns synchronously, never blocks, never throws on transient backpressure. Backpressure is communicated by returning Failed(retryable=true). The sink may publish into a queue or a ring buffer, but it must publish non-blockingly. If the queue is full, the sink rejects the chunk. The credit system upstream means this should be a rare event; if it isn't, your queue is sized wrong or your downstream pipeline can't keep up and you should be observing it on a dashboard, not absorbing it as backpressure on the IO loop.
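
As an illustration of that contract, here is a sketch of a sink that hands chunks to a bounded queue without ever waiting. It uses java.util.concurrent.ArrayBlockingQueue as a simple stand-in for the lock-free ring buffer mentioned above, and it takes only the payload to stay self-contained; the names QueueSinkSketch, Verdict and poll are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class QueueSinkSketch {
    interface Verdict {}
    record Accepted() implements Verdict {}
    record Failed(String reason, boolean retryable) implements Verdict {}

    private final BlockingQueue<byte[]> queue;

    QueueSinkSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Runs on the event loop: must return immediately whether or not there is room. */
    Verdict accept(byte[] payload) {
        // offer() returns false at once when the queue is full; put() would park the event loop.
        return queue.offer(payload)
                ? new Accepted()
                : new Failed("sink queue full", true);
    }

    /** Worker side: takes the next chunk, or null when the queue is empty. */
    byte[] poll() {
        return queue.poll();
    }
}
```

The important choice is offer over put. A full queue turns into a retryable Failed verdict that flows back to the producer as a FAILED frame, instead of a parked IO thread that silently stalls every channel it owns.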

The pluggability matters operationally too. When you're testing the transport in isolation, you wire in a sink that just buffers everything in memory. When you're benchmarking the protocol, you wire in a sink that drops the chunks on the floor. When you're running in production, you wire in the real ingestion pipeline. The transport never changes.

A quick note on thread counts. The boss event loop accepts new connections; for UDS this is single-threaded by nature, so one thread is plenty. The worker event loops handle the actual channels; 2–4 threads on a moderately-sized host will saturate UDS throughput well before they saturate the CPU. There's no value in throwing more threads at this. More threads means more context switching, not more throughput. The expensive work happens elsewhere.

The socket file problem nobody talks about

The socket is a file. Files persist across process restarts. This sounds like a footnote until it bites you.

Two scenarios that have to be handled:

Scenario one: the server died ungracefully and left a stale socket file behind. When the new server starts up, bind() on that path will fail with "address already in use." You have to clean up the stale file. But: how do you know it's stale? What if another instance of the server is actually running on that path because of a bug in your orchestration?

The wrong fix is to blindly unlink() the file before binding. That wins you a five-second outage when you accidentally remove the live server's socket and silently strand every producer trying to talk to it.

The right fix is to probe before deleting. Try to connect() to the socket as a client. If the connect succeeds, somebody is actually listening — refuse to boot and let the operator figure out what went wrong. If the connect fails with the expected "no listener" error, the file is stale and safe to delete.

public void prepareForBind() throws IOException {
    if (!Files.exists(socketPath)) return;
    if (probeForLiveListener())
        throw new IOException("Refusing to bind: another process is listening on " + socketPath);
    Files.delete(socketPath);
}

private boolean probeForLiveListener() {
    try (SocketChannel probe = SocketChannel.open(StandardProtocolFamily.UNIX)) {
        return probe.connect(UnixDomainSocketAddress.of(socketPath));
    } catch (IOException refused) {
        return false;  // expected on stale file
    }
}

Scenario two: a rolling deployment. During a deploy, the new server might bind a new socket at the same path while the old server's shutdown cleanup is still in flight. If the old server then does Files.delete(socketPath), it just removed the new server's socket. Now every connected client gets a silent disconnect and no new connections can be established.

The fix is to record the inode (file key) of the socket file at bind time and only delete on shutdown if the current inode still matches.

public void recordBoundInode() {
    boundFileKey = readFileKey(socketPath).orElse(null);
}

public void removeAfterShutdown() {
    if (boundFileKey == null) return;
    Optional<Object> currentKey = readFileKey(socketPath);
    if (currentKey.isEmpty()) return;
    if (!Objects.equals(currentKey.get(), boundFileKey)) {
        log.info("Socket file no longer matches our inode; skipping delete");
        return;
    }
    Files.deleteIfExists(socketPath);
}

Both of these are about ten lines of code. Neither of them is in any tutorial I've ever read. Both will eventually save you an outage.
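
The snippets above lean on a readFileKey helper that isn't shown. One plausible implementation, assuming a POSIX filesystem where BasicFileAttributes.fileKey() exposes the device and inode, is sketched here. The class name FileKeys and the replacementChangesKey demo are invented for the example, and the demo uses regular files as stand-ins for socket files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Optional;

final class FileKeys {
    /** Returns the file key (device + inode on POSIX), or empty if the file is missing or has no key. */
    static Optional<Object> readFileKey(Path path) {
        try {
            BasicFileAttributes attrs =
                    Files.readAttributes(path, BasicFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
            return Optional.ofNullable(attrs.fileKey()); // null on platforms without file keys
        } catch (IOException e) {
            return Optional.empty();
        }
    }

    /** Usage example: swapping a different file into the same path changes the key seen there. */
    static boolean replacementChangesKey() {
        try {
            Path dir = Files.createTempDirectory("filekey-demo");
            Path live = Files.createFile(dir.resolve("live.sock"));
            Optional<Object> boundKey = readFileKey(live);
            Path replacement = Files.createFile(dir.resolve("new.sock"));
            Files.move(replacement, live, StandardCopyOption.REPLACE_EXISTING);
            Optional<Object> currentKey = readFileKey(live);
            Files.delete(live);
            Files.delete(dir);
            return boundKey.isPresent() && !boundKey.equals(currentKey);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

One caveat worth knowing: filesystems are free to reuse an inode number once the original file is gone, so the comparison is a strong hint rather than a proof. It reliably covers the rolling-deploy case, where the old and new socket files coexist for a moment.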

There's also the question of where to put the socket file. The answer is "on a path that's a host bind-mount visible to every container that needs it, and is on a real local filesystem." /var/run/<service>/<service>.sock is the conventional choice. Do not put it on a network filesystem. Do not put it on tmpfs shared between containers without thinking about it. Do not put it in a Docker volume that uses overlay2 semantics. A boring local-filesystem path under /var/run will not surprise you. Anything else might.

Permissions: directory mode 0770 and socket file mode 0660, both owned by a UID/GID shared between the producer and server containers. (The directory needs the execute bit so that processes can traverse it, which is why 0660 is not enough for the directory itself.) The kernel enforces filesystem permissions on the socket file: if the producer's user can't open the file, connect() fails with a permission error, much as it would fail if the path didn't exist. That's your access control, and it's stronger than any token-based scheme because it's enforced by the kernel before your code runs.

Draining, reconnecting, and not losing data when the server goes away

Failure handling deserves its own section because the easy story is wrong.

The easy story: "the server crashed and came back; the client reconnects and keeps going." That's true, but it elides the question of what happens to the chunks the client had in flight when the server died.

The model we commit to is at-least-once delivery with idempotent retry. The contract has three pieces.

First, every chunk has a stable identity. The header carries (producer_id, producer_epoch, producer_sequence). producer_id is the client's identity. producer_epoch is a monotonically-increasing counter that the client bumps every time it restarts. producer_sequence is monotonically-increasing within an epoch. The triple is unique for the lifetime of the producer.

Second, on disconnect, every in-flight chunk's future fails with LOST_ON_DISCONNECT and its credit footprint is released. The application sees the failure and chooses what to do.

public void failAll(String reason, Consumer<PendingEntry> onEachEntry) {
    for (var entry : snapshot()) {
        var removed = pending.remove(entry.getKey());
        if (removed == null) continue;
        onEachEntry.accept(removed);  // restores credit
        removed.future().complete(PublishResult.failure(LOST_ON_DISCONNECT, reason));
    }
}

Third, the application's retry policy resubmits the chunk with the same logical content but a new producer_sequence (and a new producer_epoch if the producer itself restarted in between). The server, on the other side, might write the same content twice: once before the crash, once after the retry. The downstream consumer dedups by a stable application-level ID (the request ID or event ID carried in the payload).
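
For the consumer-side dedup, a bounded "recently seen IDs" set is often enough. This is a sketch under stated assumptions: the ID is a string carried in the payload, duplicates arrive within a window small enough to keep in memory, and the name RecentIdDeduplicator is made up. A real pipeline may prefer a persistent store or an idempotent upsert.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RecentIdDeduplicator {
    private final Map<String, Boolean> seen;

    RecentIdDeduplicator(int capacity) {
        // Insertion-ordered map that evicts its oldest entry once capacity is exceeded.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** True the first time an ID is seen within the retained window, false for a duplicate. */
    synchronized boolean firstSeen(String eventId) {
        return seen.putIfAbsent(eventId, Boolean.TRUE) == null;
    }
}
```

A retried chunk carries a new transport-level sequence but the same event ID, so firstSeen returns false for it and the consumer can skip the write. The capacity bounds memory at the cost of forgetting very old IDs.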

This is not exactly-once. Exactly-once is a research topic and the people who claim to have shipped it usually haven't, or have shipped it inside a single transactional boundary that isn't your boundary. At-least-once plus idempotent consumers is the boring, working answer. It's what every well-behaved event pipeline does. It's what we do here.

Graceful shutdown is its own dance. When the server gets a SIGTERM, it:

Stops returning healthy from its container health check, so the orchestrator stops sending it new traffic.
Broadcasts a DRAIN frame to every connected client. The DRAIN carries a deadline (server's stated time to finish draining).
Stops accepting new connections.
Waits for the drain grace period to let connected clients finish what they were doing.
Closes the remaining channels.
Removes the socket file (inode-checked).
Exits.

The client, on receiving DRAIN, transitions to a DRAINING state. New publish() calls fail fast with REJECTED_DRAINING. In-flight chunks are allowed to ack out. A local timer also force-closes the channel at the smaller of (server-stated deadline, client local shutdown budget), so a server that crashes mid-drain can't leave the client sitting forever.
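
The client-side drain logic boils down to a two-state switch plus a min() over two durations. A minimal sketch, with DrainingClient, onDrain and admit as hypothetical names and the timer itself left out:

```java
import java.time.Duration;

final class DrainingClient {
    enum State { ACTIVE, DRAINING }

    private State state = State.ACTIVE;

    /** Handles a DRAIN frame; returns how long to wait before force-closing the channel. */
    synchronized Duration onDrain(Duration serverDeadline, Duration localShutdownBudget) {
        state = State.DRAINING;
        // Take the smaller of the two so a server that dies mid-drain cannot hold us open.
        return serverDeadline.compareTo(localShutdownBudget) <= 0 ? serverDeadline : localShutdownBudget;
    }

    /** Gate for publish(): new work fails fast once draining has started. */
    synchronized String admit() {
        return state == State.DRAINING ? "REJECTED_DRAINING" : "OK";
    }
}
```

In a real client the returned duration would be handed to a scheduler (for instance the channel's event loop) to close the channel when it elapses, while in-flight chunks continue to ack out in the meantime.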

Reconnect uses bounded exponential backoff with full jitter. The first attempt happens after a small initial delay (50 ms is reasonable), and each subsequent attempt doubles that delay up to a ceiling (5 seconds). The actual wait is drawn uniformly between 0 and the current capped delay. Jitter is the difference between "everyone retries at the same instant and DoSes the recovering server" and "retries are spread smoothly across the recovery window." It's also one line of code, so there is no excuse not to do it.

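public synchronized Optional<Duration> nextDelay() {
    // A negative maxAttempts means "retry forever"; empty means the attempt budget is spent.
    if (maxAttempts >= 0 && attemptCount >= maxAttempts) return Optional.empty();
    long unjittered = exponentialDelayMillis(attemptCount++);
    // Apply jitter to the capped delay, then clamp to at least 1 ms so we never busy-spin.
    long jittered = jitterSupplier.applyAsLong(Math.max(1L, unjittered));
    return Optional.of(Duration.ofMillis(Math.max(1L, jittered)));
}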
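The snippet above leans on two helpers it doesn't show. Here is one plausible shape for them, as a sketch: the class name BackoffMath and the constants are my assumptions, chosen to match the 50 ms and 5 second figures mentioned earlier, and the real helpers may well differ.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.LongUnaryOperator;

/** Hypothetical helpers for bounded exponential backoff with full jitter. */
final class BackoffMath {
    static final long INITIAL_MILLIS = 50L;
    static final long CEILING_MILLIS = 5_000L;

    /** 50, 100, 200, ... doubling per attempt and capped at the ceiling. */
    static long exponentialDelayMillis(int attempt) {
        // Clamp the shift so the left shift cannot overflow a long.
        int shift = Math.min(Math.max(attempt, 0), 20);
        return Math.min(CEILING_MILLIS, INITIAL_MILLIS << shift);
    }

    /** Full jitter: a uniform draw from [0, bound], inclusive on both ends. */
    static final LongUnaryOperator FULL_JITTER =
            bound -> ThreadLocalRandom.current().nextLong(bound + 1);
}
```

Keeping the jitter behind a LongUnaryOperator is what makes the policy testable: a unit test can pass the identity function and assert on exact delays.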

PING/PONG keeps the channel warm and lets each side detect a dead peer. The Netty IdleStateHandler is configured with a reader-idle timeout (no inbound activity for X seconds → close the channel) and a writer-idle timeout (no outbound activity for Y seconds → emit a PING). Reader-idle closes are caught by the reconnect loop. Writer-idle PINGs solve the "TCP connection is silently dead because a middlebox dropped state" problem, which UDS doesn't have, but the same code path also gives you a periodic liveness signal for free, so we keep it.

Deploying this thing on ECS

The deployment topology has a few pieces that have to fit together correctly.

The server runs as an ECS daemon service. Set schedulingStrategy=DAEMON. The cluster will place exactly one task on every container instance. This is what gives you the "one IPC server per host" invariant. Without daemon scheduling, you'd have to coordinate placement constraints by hand, and you'd eventually end up with hosts that have zero servers or hosts that have two servers competing for the same socket file.

Producers run as a regular ECS service. Spread placement is fine. The producer doesn't need to know which host it's on; it just connects to a fixed socket path. Whatever host it lands on, there'll be a server listening at that path because daemon scheduling guarantees it.

Both run in awsvpc networking mode in our setup, but nothing in the design depends on it. The hot path is UDS, so the loopback namespace separation that bites TCP localhost doesn't matter to us at all. You can pick whatever networking mode is most convenient for the other traffic the containers carry.

The socket directory is a host bind-mount. Both the producer task definition and the server task definition mount /var/run/<service> from the host into the container. Bind mount, not a task-scoped volume (the ECS counterpart of a Kubernetes emptyDir), not tmpfs. The reason is sharing and durability across container restarts: a task-scoped volume is private to its task and created fresh each time the task starts, so the producer would never see the server's socket file, and a restart would lose whatever was there. A host bind mount lives on the EC2 instance's local filesystem and persists for the life of the host.

The server's container health check returns healthy only after the socket is bound. The health check should do a real connect() to the socket and immediately close. Don't use a TCP health check; the server isn't listening on TCP. Don't use a process health check; the process can be running but not yet bound. The check needs to verify the actual integration point.
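A connect-and-close probe is short enough to sketch. This version assumes Java 16 or later for the JDK's built-in Unix-domain socket support; the class name UdsHealthProbe is hypothetical, and how you wire it into the container health check command (a tiny main, a sidecar script) is up to you.

```java
import java.io.IOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;

/** Hypothetical health probe: healthy means "something is accepting on the socket". */
final class UdsHealthProbe {
    static boolean isAccepting(Path socketPath) {
        // try-with-resources closes the channel immediately after a successful connect.
        try (SocketChannel channel = SocketChannel.open(StandardProtocolFamily.UNIX)) {
            return channel.connect(UnixDomainSocketAddress.of(socketPath));
        } catch (IOException | UnsupportedOperationException e) {
            // Missing file, stale file with no listener, or no UDS support on this platform.
            return false;
        }
    }
}
```

A stale socket file left behind by a crashed server fails this probe, because the file exists but nothing is listening. That is exactly the case a file-exists check would get wrong.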

The producer does not depend on any ECS attribute or service discovery. It does an exponential-backoff connect loop against the socket path. If the server isn't ready yet, the producer retries. If the server is restarting, the producer retries. If the server gets replaced during a rolling deploy, the producer's reconnect loop handles it. This is much simpler than wiring up cross-task readiness signals through PutAttributes or service discovery, and it works no matter what the orchestrator is doing.

Per-host resources matter. The server is a long-lived JVM with a lot of pooled direct memory. Plan for:

Native epoll transport (io.netty.transport.noNative=false, which is the default). Netty loads the native library on Linux when the matching artifact is on the classpath, so it's worth verifying you're picking up the right one for your architecture (linux-x86_64 vs linux-aarch64).
-XX:MaxDirectMemorySize sized to cover your pooled allocator high-water mark plus a comfortable safety margin. Direct memory is allocated outside the heap and will OOM your container if you under-size it.
memlock=unlimited and the IPC_LOCK capability if you're planning to lock pages, which you probably aren't in v1 but might in a future phase.
A stopTimeout long enough to cover your drain budget plus a buffer. If the orchestrator SIGKILLs you mid-drain, your producers will get hard disconnects instead of graceful drains.

A walk through one publish

Let's trace one chunk end-to-end so the pieces fit together.

The application calls client.publish(chunk) from any thread. The client checks its state (must be STARTED, not DRAINING or CLOSING), checks the handshake is complete, and tries to acquire 3-D credit for this chunk's footprint. If credit acquisition fails, the call returns a completed future with REJECTED_NO_CREDIT. If the state check fails, it likewise returns an already-completed future carrying the matching rejection.
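For readers who want to see the credit check in code, here is a minimal all-or-nothing gate. I'm assuming the three dimensions are chunks, events, and bytes, which matches the arguments passed to the registry in the snippet that follows. The class name and the synchronized implementation are illustrative; a hot path might prefer something lock-free.

```java
/** Hypothetical three-dimensional credit gate: all dimensions succeed or none do. */
final class CreditGate {
    private long chunks;
    private long events;
    private long bytes;

    CreditGate(long maxChunks, long maxEvents, long maxBytes) {
        this.chunks = maxChunks;
        this.events = maxEvents;
        this.bytes = maxBytes;
    }

    /** Reserves the footprint atomically; returns false without side effects if any dimension is short. */
    synchronized boolean tryAcquire(long chunkCount, long eventCount, long payloadBytes) {
        if (chunkCount > chunks || eventCount > events || payloadBytes > bytes) {
            return false;
        }
        chunks -= chunkCount;
        events -= eventCount;
        bytes -= payloadBytes;
        return true;
    }

    /** Returns a footprint on ack, timeout, write failure, or disconnect. */
    synchronized void release(long chunkCount, long eventCount, long payloadBytes) {
        chunks += chunkCount;
        events += eventCount;
        bytes += payloadBytes;
    }
}
```

The all-or-nothing property is the part that matters: a chunk that fits the byte budget but not the event budget must not leave bytes reserved behind it.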

Assuming everything's good, the client assigns a producer_sequence, registers the chunk in its PendingChunkRegistry keyed by (producer_epoch, producer_sequence), gets back a CompletableFuture<PublishResult>, schedules a per-chunk publish timeout, and calls channel.writeAndFlush on the Netty channel.

long sequence = producerSequence.getAndIncrement();
var key = new ChunkKey(producerEpoch, sequence);
var future = pendingChunks.register(key, 1, eventCount, payloadBytes);
schedulePublishTimeout(key);

var header = FrameHeader.newBuilder()
    .setFrameType(FRAME_TYPE_DATA)
    .setProducerId(producerId).setProducerEpoch(producerEpoch)
    .setProducerSequence(sequence)
    .setEventCount(eventCount).setPayloadBytes(payloadBytes)
    .build();

channel.writeAndFlush(Frame.of(header, request.payload()))
       .addListener(onWriteComplete);   // releases credit on write failure

return future;

Netty's encoder serializes the frame using the wire format, writes it into a pooled direct buffer, and hands it to the kernel via writev. The kernel copies it into the receiving UDS socket's buffer.

On the server side, the worker event loop wakes up on the epoll readiness notification, reads the bytes, and runs them through the pipeline. The decoder pulls one frame out of the buffer, the protocol handler routes it by frame_type to handleData, the credit gate checks that the chunk fits, the chunk sink processes it. Assuming Accepted, the chunk's identity goes onto the pending ACCEPTED list.

Other frames in the same read batch are processed the same way. At the end of the batch, channelReadComplete fires and we flush the pending list as a single ACCEPTED frame containing every chunk identity from this batch.
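Stripped of Netty, the coalescing idea is just "accumulate during the batch, flush once at the end". The sketch below models that with a plain class. The names are hypothetical, and it assumes Java 16+ for records. In the real pipeline, onAccepted would be called from the data handler and onReadComplete from channelReadComplete.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical ack coalescer; confined to one event loop, so no locking. */
final class AckCoalescer {
    record ChunkId(long producerEpoch, long producerSequence) {}

    private final List<ChunkId> pending = new ArrayList<>();

    /** Called once per accepted DATA frame inside a read batch. */
    void onAccepted(long producerEpoch, long producerSequence) {
        pending.add(new ChunkId(producerEpoch, producerSequence));
    }

    /** Called at the end of the batch; returns the IDs for a single ACCEPTED frame. */
    List<ChunkId> onReadComplete() {
        List<ChunkId> batch = List.copyOf(pending);
        pending.clear();
        return batch; // an empty list means there is nothing to send
    }
}
```

With N chunks in a read batch this turns N small writes into one, which is where most of the ack-path syscall savings come from.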

The ACCEPTED frame goes back through the same wire format, back through the kernel, back to the client. The client's decoder pulls it out, hands it to the protocol handler, which looks up each chunk identity in its registry, completes each future with PublishResult.accepted(), and restores the credit footprint.
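The client-side lookup can be sketched the same way. This is a simplified stand-in for the PendingChunkRegistry, with a hypothetical class name and a String result in place of PublishResult, meant only to show the remove-then-complete ordering.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified pending-chunk registry for the ack path. */
final class PendingAcks {
    record Key(long producerEpoch, long producerSequence) {}

    private final Map<Key, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    CompletableFuture<String> register(Key key) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(key, future);
        return future;
    }

    /** Resolves one chunk identity from an ACCEPTED frame. */
    boolean onAccepted(Key key, Runnable restoreCredit) {
        // remove() decides the winner if an ack races a timeout or a disconnect.
        CompletableFuture<String> future = pending.remove(key);
        if (future == null) {
            return false; // late or duplicate ack; credit was already restored elsewhere
        }
        restoreCredit.run();
        future.complete("ACCEPTED");
        return true;
    }
}
```

Because only the caller that wins the remove() touches credit, a chunk's footprint is restored exactly once no matter which of ack, timeout, or disconnect gets there first.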

The application's await() on the future returns. The whole round-trip, on a moderate-sized r6i instance under reasonable load, is in the low tens of microseconds. The protocol contributes effectively nothing to that — most of the time is the syscall to write the bytes, the syscall to read them, and a tiny amount of Netty bookkeeping. The 50+ Gbit/s per-channel headroom that UDS gives you means we can do this thousands of times per millisecond without breaking a sweat.

When this is the wrong answer

Honesty: not every problem wants this transport. Here's when I'd reach for something else.

Producer and consumer aren't on the same host. Then the whole premise is gone. Use a real network protocol. gRPC is fine. Plain HTTP is fine. Whatever you'd normally reach for is fine.

You need cross-language interop. Netty is JVM. The pattern translates to Go or Rust or Python without much trouble — the wire format is language-neutral by design — but if you want a single off-the-shelf library that already works in five languages, gRPC over UDS gets you part of the way there with less custom code.

You need exactly-once semantics. You don't. Nobody does. Build idempotency into your consumers and stop fighting the universe.

Your throughput is genuinely tiny. If you're shipping ten requests a second, the operational cost of any custom protocol is going to outweigh the savings versus just calling an HTTP endpoint. Use HTTP, save your weekends.

You need sub-microsecond latency. Now we're in shared-memory territory. Aeron, LMAX Disruptor patterns, kernel-bypass networking. UDS is fast but it's still doing two kernel copies; shared memory does zero. If you need that floor, build it.

You can't get a host bind-mount. Some platforms (managed Kubernetes flavours with strict security profiles, certain serverless container runtimes) don't let you bind-mount host paths. UDS is mostly off the table in those environments. You're back to network protocols.

For the broad middle — same host, container-to-container, throughput in the gigabytes-per-second range, latency budget in the milliseconds, willing to write a small amount of careful protocol code — this design is hard to beat.

Closing thoughts

The cloud-native consensus, for very good reasons, is "treat the network as the abstraction." A service has a name, you talk to it over a name-resolved endpoint, you don't care where it is. That model has earned its place.

But it has a cost, and the cost is paid every time two services that are on the same host are forced to pretend they aren't. The default tooling will route their bytes through ENIs and load balancers and TLS handshakes, and you'll see the bill at the end of the month and the tail latency on your dashboards and the strange flamegraphs in your profilers.

UDS is the escape hatch. It's been sitting there in the kernel the whole time. The amount of code needed to use it well is not large — the implementation that backs this article is something like 2,000 lines of Java including tests — but the design has to be careful about a handful of specific things: the fixed prelude trick, the three-dimensional credit model, the ack coalescing on channelReadComplete, the inode-checked socket cleanup, the discipline of never blocking on the IO event loop, the boring choice to commit to at-least-once delivery and idempotent consumers.

If you do those things right, you get a transport that is fast, cheap, observable, and doesn't lie to you. That's a good list.

I'd encourage anyone running a service-to-service workflow on the same host to at least measure what the current implementation is costing them. The number is usually larger than people expect, and the fix is much smaller than people expect.

The pattern works. Steal it.
