DEV Community: Samar Prakash

When Two Containers on the Same Host Are Shouting Through a Load Balancer

Samar Prakash — Sat, 30 May 2026 09:55:10 +0000

Building a Unix-Domain-Socket IPC server for ECS-on-EC2 services that need to talk fast, cheap, and reliably

A while back I was looking at a flamegraph of a service that, on paper, should not have been having any performance problems. The producer and the consumer were the same Docker image's worth of trouble — colocated on the same EC2 host, in the same ECS cluster, sharing the same instance type, the same kernel, the same RAM. By every reasonable measure they were neighbours.

And yet every event was making a round trip that looked roughly like this: producer → kernel TCP stack → ENI on the producer task → AWS VPC → internal load balancer → ENI on the consumer task → kernel TCP stack → consumer. TLS handshake. HTTP framing. JSON over the wire. Connection pool. Retry policy. The whole circus.

I wasn't doing anything wrong. This is what the platform funnels you toward. ECS with awsvpc networking gives every task its own ENI. The default story for "service A talks to service B" is "give B a DNS name, put a load balancer in front of it, configure a security group, point A at the LB." Even if A and B are physically on the same box, the bytes are still leaving the kernel, traversing the VPC, and coming back.

There's a fix for this. It's been a fix for fifty-something years. It just hasn't been the default fix, because cloud-native architecture grew up assuming services would be scattered across hosts and the network was the abstraction that mattered.

This article is about building a proper IPC server using Unix Domain Sockets, deployed as a sidecar pattern on ECS-on-EC2, with a wire protocol robust enough to ship in production. We're going to design it from scratch: the transport choice, the wire format, the backpressure model, the failure modes, the deployment topology. I'll show you pseudo-code adapted from the real implementation and call out the small number of places where, if you get it wrong, you'll spend a weekend debugging it.

The intended outcome is something you could lift the pattern from. The article is long because the problem isn't actually that simple once you get past the "just use a socket file" stage. But none of it is mystical. If you've written Netty code or a binary protocol parser before, you'll be fine. If you haven't, the early sections will land regardless.

The problem with "it's all just localhost"

The first thing every engineer reaches for when they realise two services live on the same host is 127.0.0.1. That worked beautifully in 1999. It works less well on modern container platforms, and the reason is worth understanding properly.

When you run an ECS task in awsvpc networking mode — and you probably are, because bridge mode has its own pile of caveats — every task gets its own ENI. AWS attaches that ENI to a dedicated network namespace inside the EC2 host's kernel. Task A's loopback interface is not the same loopback interface as task B's loopback interface. They both see 127.0.0.1, but those two 127.0.0.1s are different things. A connection to 127.0.0.1:8080 from inside task A will never reach a listener inside task B, even if they're on the same physical EC2 instance.

You can work around this with host networking mode, but then you've given up port isolation and you have to coordinate every port number across every container on the host. That trade ages badly.

The second thing engineers reach for is "well, fine, put a load balancer in front of it." This works. It also costs you:

A real ENI on each side (and ENIs are a rationed resource per instance type).
A trip through the AWS VPC data plane.
Potentially TLS termination and re-encryption.
JSON serialization on the way out and deserialization on the way in.
The overhead of HTTP itself: headers, status codes, content negotiation, the eternal question of whether to use keep-alive.
And in some setups, an actual hop through an external NLB or ALB, which means the bytes leave the host entirely just to come back.

For request rates in the dozens or hundreds per second, none of this matters. For thousands per second per host, it starts mattering. For tens of thousands, it dominates your cost and your tail latency.

We need a transport that says: "I know we're on the same kernel. Just give me a pipe."

That transport is the Unix Domain Socket.

What a Unix Domain Socket actually is

If you already know, skip this section. If you've been writing distributed systems for years but never had a reason to use one, this is the five-minute version.

A UDS is a socket that uses a file path as its address instead of an IP and port. You bind to /var/run/something.sock, the kernel creates a special file at that path, and clients connect() to the same path. Once connected, both sides have a file descriptor that behaves almost exactly like a TCP socket: read(), write(), close(), the works.

The interesting differences:

There's no TCP/IP stack in the path. No headers, no checksums, no congestion control, no retransmits. The kernel just shuffles bytes from one process's send buffer to the other's receive buffer. Two copies, both done by the kernel: user-space buffer into a kernel skb on write, kernel skb out to the receiver's user-space buffer on read. That's it.

There's no network. The bytes never touch a NIC, never go through iptables, never see your VPC routing tables. If you can't reach the other side, it's because the file doesn't exist or you don't have permission to open it, not because some far-away switch is having a bad day.

There's no port collision. The address is a filesystem path, so two services on the same host can each have their own socket at their own path with zero coordination.

And the performance is genuinely excellent. A single UDS channel on a modern Linux server will sustain something on the order of 50–55 Gbit/s of throughput before you start seeing CPU saturation in the syscall layer. That's "I cannot saturate this with anything I am about to throw at it" territory for almost every application.

The one piece of cloud reality you have to manage is that UDS sockets live on the filesystem. If your producer container and your server container are different containers, they don't share a filesystem by default. You have to give them a shared volume — a host bind-mount works perfectly — so both can see the same socket file. We'll come back to this when we talk about deployment.
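To make the bind/connect mechanics concrete, here is a minimal sketch using only the JDK. It assumes Java 16 or newer (where `StandardProtocolFamily.UNIX` and `UnixDomainSocketAddress` exist) and a Linux or macOS host. The class name `UdsRoundTrip` and the temp-directory socket path are mine, purely for illustration, and have nothing to do with the production server.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class UdsRoundTrip {
    /** Sends one message over a Unix Domain Socket and returns what the server side read. */
    static String sendOnce(String message) {
        try {
            Path dir = Files.createTempDirectory("uds-demo");
            Path socketPath = dir.resolve("demo.sock");
            try (ServerSocketChannel server = ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
                // bind() is the moment the kernel creates the socket file at this path.
                server.bind(UnixDomainSocketAddress.of(socketPath));
                // The client addresses the server by path, not by IP and port.
                try (SocketChannel client = SocketChannel.open(UnixDomainSocketAddress.of(socketPath));
                     SocketChannel accepted = server.accept()) {
                    byte[] bytes = message.getBytes(StandardCharsets.UTF_8);
                    client.write(ByteBuffer.wrap(bytes));
                    ByteBuffer inbound = ByteBuffer.allocate(bytes.length);
                    while (inbound.hasRemaining() && accepted.read(inbound) >= 0) { }
                    return new String(inbound.array(), 0, inbound.position(), StandardCharsets.UTF_8);
                }
            } finally {
                // Closing the channel does not remove the file; cleanup is our job.
                Files.deleteIfExists(socketPath);
                Files.deleteIfExists(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both ends live in one thread here only because the message is tiny and the kernel queues the connection until accept() picks it up. A real server accepts and reads on an event loop.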

A quick bake-off against the alternatives

Before we commit, it's worth being honest about the options. I considered five transports for this problem; here's where I landed on each.

TCP loopback. Works fine when both endpoints are in the same network namespace, which on awsvpc ECS they aren't. You can switch the task to host networking and make it work, but you've coupled every container on the host into one port namespace forever. Hard pass.

HTTP/gRPC over loopback or over the LB. Easy to write, easy to operate, miserable per-message overhead. gRPC has its place when the consumer might one day be in a different region; it's wrong when the consumer is forty inches of copper trace away from the producer. Also, gRPC over a real socket on the same host still pays the cost of HTTP/2 framing, flow control, header compression — none of which buys you anything when the round-trip is microseconds.

Shared memory. The classic answer for "ridiculously fast IPC" — map a region into both processes, treat it as a ring buffer, get sub-microsecond latency. The cost is operational: you need a discipline around term buffers, you need to size memory carefully, you need to deal with the case where one side crashes mid-write. Tools like Aeron do this very well and are the right call when you genuinely need 250-nanosecond publish latency or hundreds of millions of messages per second. For "I need to push a few gigabytes per second between two containers with single-digit-millisecond budgets", shared memory is a Ferrari for a school run.

Named pipes / FIFOs. Half-duplex, no accept() model, no easy way to fan multiple clients into one server. Fine for shell pipelines, awkward for service-to-service IPC. Not a serious contender.

Unix Domain Sockets. Full-duplex byte stream. Connection-oriented (SOCK_STREAM), so we get FIFO ordering and per-channel isolation. Every language has battle-tested support. The Linux epoll event loop treats UDS exactly like TCP, so frameworks like Netty just work. Per-channel throughput well above what we need. No new infrastructure to operate.

The decision more or less makes itself once you write it out. UDS is unsexy, well-understood, and exactly the right tool. Aeron is a reasonable Phase 2 escape hatch if measurements ever say we need it, but the design rule is "do not add infrastructure complexity until measurements demand it."

Designing the wire protocol

A socket is a byte stream. The application has to invent the concept of "a message." This is the first place where careless design will haunt you, so let's be careful.

The rule I'll defend without hesitation: length-prefixed binary frames with a fixed-size, parseable-first-thing prelude.

Here's the layout I ended up with:

The frame is three regions stitched end-to-end on the byte stream:

A 12-byte fixed prelude that every implementation reads first. It contains the frame length (uint32, counting the header and payload that follow the prelude), the header length (uint16), a flags byte, a reserved byte, and a CRC32C over the header bytes (fixed32). That is 4 + 2 + 1 + 1 + 4 = 12 bytes, all big-endian, all at known offsets.

A variable-length header, defined as a Protobuf message. It carries the frame type (HELLO, HELLO_ACK, DATA, DATA_BATCH, ACCEPTED, FAILED, CREDIT_UPDATE, DRAIN, PING, PONG), the protocol version, producer identity, sequence numbers, and any per-frame metadata. Because it's Protobuf, we can add new fields over time without breaking older clients.

A payload, which is opaque bytes. The server does not parse the payload. It does not know or care what's in it. From its perspective the payload is byte[], end of story. The application protocol defines what those bytes mean.

Three quick questions that always come up when I explain this design.

"Why not just one Protobuf message with the payload as a bytes field?" Because then your parser is in a chicken-and-egg situation. To know where the message ends, you have to parse the message; to parse the message, you have to have read all of it. Worse, if the first few bytes get corrupted, the Protobuf parser has no anchor: it can't tell you "the length field is wrong" because it doesn't know what the length field even is. The fixed prelude gives every reader a known landmark. Read 12 bytes, validate them, then trust the variable-length parts.

"Why is there a CRC if TCP / UDS is already reliable?" UDS is reliable across the wire. It is not reliable against your own bugs. A CRC on the header bytes will catch a serializer that wrote the wrong length, a buffer that got truncated on a partial write, a decoder that misaligned itself after an earlier malformed frame. The CRC is in there to protect you from yourself. (The flags byte reserves a second CRC slot for the payload, which we don't compute in v1 because the payload is opaque and the application owns its own integrity checks. The bit is reserved so we can turn it on later without a version bump.)

"Why CRC32C specifically?" It has hardware acceleration on modern x86-64 and ARM CPUs (CRC32C is in SSE 4.2 and the ARMv8 CRC extensions). Computing it costs effectively nothing per frame. The standard CRC32 would also work but is meaningfully slower in software.

Here's the encoder in pseudo-code, lightly cleaned up from what's actually running:

public static void encode(Frame frame, ByteBuf output, CRC32C scratch) {
    byte[] headerBytes = frame.header().toByteArray();
    int payloadLength = frame.payload().length;
    // frameLength counts header + payload only; the prelude is not included.
    // Widen before adding so two large ints cannot overflow.
    long frameLength = (long) headerBytes.length + payloadLength;

    if (frameLength > MAX_FRAME_BYTES) throw new IllegalArgumentException("frame too large");
    // The header length travels as a uint16; anything larger would be silently truncated.
    if (headerBytes.length > 0xFFFF) throw new IllegalArgumentException("header too large");

    scratch.reset();
    scratch.update(headerBytes, 0, headerBytes.length);
    long headerCrc = scratch.getValue();

    output.writeInt((int) frameLength);     // 4
    output.writeShort(headerBytes.length);  // 2
    output.writeByte(0);                    // flags  (1)
    output.writeByte(0);                    // rsv    (1)
    output.writeInt((int) headerCrc);       // 4   -- total prelude = 12 bytes
    output.writeBytes(headerBytes);
    if (payloadLength > 0) output.writeBytes(frame.payload());
}

And the decoder, which is the more interesting half because it has to defend against partial reads and malicious peers:

public static DecodeResult tryDecode(ByteBuf input, int maxFrameBytes, CRC32C scratch) {
    if (input.readableBytes() < PRELUDE_BYTES) return INSUFFICIENT_BYTES;

    // Peek with absolute getters; the reader index only moves once a whole frame is available.
    int readerIndex = input.readerIndex();
    long frameLength = input.getUnsignedInt(readerIndex);
    if (frameLength > maxFrameBytes)
        throw new MalformedFrameException("frame too large");           // <-- before any alloc
    if (input.readableBytes() < PRELUDE_BYTES + frameLength)
        return INSUFFICIENT_BYTES;

    int headerLength = input.getUnsignedShort(readerIndex + 4);
    short flags      = input.getUnsignedByte(readerIndex + 6);
    short reserved   = input.getUnsignedByte(readerIndex + 7);
    long  headerCrc  = input.getUnsignedInt(readerIndex + 8);

    // headerLength must not exceed frameLength, or the payload length below goes negative.
    validatePreludeInvariants(flags, reserved, headerLength, frameLength);

    int headerStart = readerIndex + PRELUDE_BYTES;
    scratch.reset();
    scratch.update(asReadOnlySlice(input, headerStart, headerLength));
    if (scratch.getValue() != headerCrc)
        throw new MalformedFrameException("header CRC mismatch");

    FrameHeader header = FrameHeader.parseFrom(slice(input, headerStart, headerLength));
    byte[] payload = copyPayload(input, headerStart + headerLength, frameLength - headerLength);
    input.skipBytes(PRELUDE_BYTES + (int) frameLength);
    return new Decoded(new Frame(header, payload));
}

The discipline that makes this safe: reject before you allocate. The frameLength > maxFrameBytes check happens before we allocate anything for the payload. If a hostile or buggy peer sends a frame claiming to be 2 GB, we throw immediately and close the connection. We never give them the satisfaction of making us reserve 2 GB of direct memory.

On the Netty side, this lives in a ByteToMessageDecoder subclass. One small detail that matters at scale: we use COMPOSITE_CUMULATOR instead of the default MERGE_CUMULATOR. The default copies every fragmented inbound read into a new contiguous buffer. The composite cumulator keeps the fragments as a chain and doesn't copy until something actually needs a contiguous view. At sustained high throughput, this is the difference between hot loops that allocate and hot loops that don't.
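If you want to poke at the framing rules without pulling in Netty or Protobuf, the sketch below does the same job with a plain `java.nio.ByteBuffer`. It is a stand-in and not the production codec: the header is treated as opaque bytes, `FrameCodecSketch` and its method names are made up for this example, and "not enough bytes yet" is signalled by returning null. The prelude is the 4 + 2 + 1 + 1 + 4 bytes the encoder above writes.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32C;

final class FrameCodecSketch {
    static final int PRELUDE_BYTES = 12; // 4 + 2 + 1 + 1 + 4

    record Frame(byte[] header, byte[] payload) {}

    static long crc32c(byte[] bytes) {
        CRC32C crc = new CRC32C();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    static ByteBuffer encode(byte[] header, byte[] payload) {
        if (header.length > 0xFFFF) throw new IllegalArgumentException("header too large");
        ByteBuffer out = ByteBuffer.allocate(PRELUDE_BYTES + header.length + payload.length);
        out.putInt(header.length + payload.length);   // frame length (header + payload)
        out.putShort((short) header.length);          // header length
        out.put((byte) 0);                            // flags
        out.put((byte) 0);                            // reserved
        out.putInt((int) crc32c(header));             // CRC32C over the header bytes
        out.put(header);
        out.put(payload);
        out.flip();
        return out;
    }

    /** Returns null if the buffer does not yet hold a whole frame; throws on a malformed one. */
    static Frame tryDecode(ByteBuffer in, int maxFrameBytes) {
        if (in.remaining() < PRELUDE_BYTES) return null;
        int start = in.position();
        long frameLength = Integer.toUnsignedLong(in.getInt(start));
        if (frameLength > maxFrameBytes) throw new IllegalStateException("frame too large");
        if (in.remaining() < PRELUDE_BYTES + frameLength) return null;
        int headerLength = Short.toUnsignedInt(in.getShort(start + 4));
        if (headerLength > frameLength) throw new IllegalStateException("header longer than frame");
        long expectedCrc = Integer.toUnsignedLong(in.getInt(start + 8));

        // Only now, with every length validated, do we allocate and consume.
        byte[] header = new byte[headerLength];
        byte[] payload = new byte[(int) frameLength - headerLength];
        in.position(start + PRELUDE_BYTES);
        in.get(header);
        in.get(payload);
        if (crc32c(header) != expectedCrc) throw new IllegalStateException("header CRC mismatch");
        return new Frame(header, payload);
    }
}
```

ByteBuffer is big-endian by default, which is why no byte-order call appears. The two early returns are the partial-read defence. The "frame too large" throw is the reject-before-allocate rule.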

Frame types and the handshake

There are ten frame types in this protocol. They split into three groups.

The handshake pair: HELLO and HELLO_ACK. The client sends HELLO immediately on channelActive, advertising the protocol version range it can speak and its identity (producer ID, producer epoch). The server picks the highest mutually-supported version and replies with HELLO_ACK containing the negotiated version and the initial credit budget it's willing to grant. If there is no overlap, the server replies with accepted=false and a rejection reason, then closes the connection.

private OptionalInt negotiateVersion(HelloFrame hello) {
    int producerMin = hello.getMinProtocolVersion();
    int producerMax = hello.getMaxProtocolVersion();
    if (producerMax < serverConfig.minProtocolVersion()) return OptionalInt.empty();
    if (producerMin > serverConfig.maxProtocolVersion()) return OptionalInt.empty();
    return OptionalInt.of(Math.min(producerMax, serverConfig.maxProtocolVersion()));
}

This is the entire version-negotiation algorithm. It's six lines because it should be six lines. The temptation to invent something fancier here is real — feature flags per version, optional capabilities, etc. — but every layer of cleverness you add to the handshake is a layer your operations team will be debugging at 3 AM. Two integers, pick the overlap, ship it.

The data path: DATA and DATA_BATCH. A single chunk goes in DATA. Multiple chunks coalesced into one wire frame go in DATA_BATCH, with the header carrying an array of per-chunk slice descriptors (offset, length, event count, sequence). One write syscall ships a hundred chunks. The kernel never sees us spamming.
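A rough sketch of the DATA_BATCH idea, under my own assumptions about shape: the payloads are concatenated into one buffer and each chunk is described by an (offset, length, event count, sequence) record. `BatchSketch`, `Chunk`, `Slice` and `Batch` are illustrative names, and the real header is a Protobuf message, not a Java record.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BatchSketch {
    record Chunk(byte[] payload, int eventCount, long sequence) {}
    record Slice(int offset, int length, int eventCount, long sequence) {}
    record Batch(List<Slice> slices, byte[] payload) {}

    /** Concatenates chunk payloads and records where each one starts and ends. */
    static Batch coalesce(List<Chunk> chunks) {
        ByteArrayOutputStream combined = new ByteArrayOutputStream();
        List<Slice> slices = new ArrayList<>();
        for (Chunk chunk : chunks) {
            // The current size of the combined buffer is this chunk's offset.
            slices.add(new Slice(combined.size(), chunk.payload().length,
                    chunk.eventCount(), chunk.sequence()));
            combined.write(chunk.payload(), 0, chunk.payload().length);
        }
        return new Batch(List.copyOf(slices), combined.toByteArray());
    }

    /** Recovers one chunk's bytes on the receiving side from its slice descriptor. */
    static byte[] sliceOf(Batch batch, int index) {
        Slice slice = batch.slices().get(index);
        return Arrays.copyOfRange(batch.payload(), slice.offset(), slice.offset() + slice.length());
    }
}
```

The receiver never has to scan the payload to find chunk boundaries. It reads the descriptors from the header and slices.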

The control path: ACCEPTED, FAILED, CREDIT_UPDATE, DRAIN, PING, PONG. These are how the server tells the client what happened to its data and how the two sides keep the channel alive.

The full handshake-to-publish-to-drain story looks like this:

The producer connects to the socket file (retrying with exponential backoff and jitter if the server isn't up yet — more on that in the lifecycle section). It immediately writes HELLO. The server reads it, negotiates a version, allocates per-connection state, and writes HELLO_ACK. From that moment, the connection is in the "data plane" — both sides are free to send and receive frames in any order.

The producer then writes DATA frames as fast as its credit budget allows. Each frame has a producer_sequence field that's monotonically increasing per producer. The server reads them, runs them through its decoder, gates them against its server-side credit window, and hands them to a pluggable ChunkSink. The sink returns either Accepted or Failed(reason, retryable). On Accepted, the server appends the chunk identity to a per-channel "pending ACCEPTED" list. On Failed, it flushes the pending list and emits a FAILED frame.

Here's the thing I want to highlight. Netty hands you frames in batches: it reads as many bytes as the socket has and then walks them through the pipeline. When the batch is done, it calls channelReadComplete. We use that callback to flush the pending ACCEPTED list as a single frame containing every chunk identity that landed in this batch.

@Override
public void channelReadComplete(ChannelHandlerContext context) {
    flushPendingAcceptedAcks(context);  // one ack frame for N data frames
    context.fireChannelReadComplete();
}

At sustained billions of frames per day, this is the single most important throughput optimization in the protocol. Without it, every DATA round-trips against its own ACCEPTED. With it, a burst of N data frames generates approximately one ack frame. The ack channel stays under 1% of the forward traffic.
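The coalescing itself is tiny once you strip Netty away. This is a minimal model of it, with `AckCoalescer` as a made-up name: `onAccepted` stands in for the per-frame path and `onReadComplete` for the channelReadComplete callback. The returned list plays the role of the single ACCEPTED frame.

```java
import java.util.ArrayList;
import java.util.List;

final class AckCoalescer {
    private final List<Long> pendingAccepted = new ArrayList<>();
    private int ackFramesEmitted = 0;

    /** Called once per DATA frame the sink accepted. Nothing is written yet. */
    void onAccepted(long producerSequence) {
        pendingAccepted.add(producerSequence);
    }

    /** Called once per read batch. Returns the sequences one ACCEPTED frame would carry. */
    List<Long> onReadComplete() {
        if (pendingAccepted.isEmpty()) return List.of();   // nothing landed, no frame
        List<Long> ack = List.copyOf(pendingAccepted);
        pendingAccepted.clear();
        ackFramesEmitted++;
        return ack;
    }

    int ackFramesEmitted() {
        return ackFramesEmitted;
    }
}
```

No locking is needed in this model because, on a Netty pipeline, both callbacks for a given channel run on that channel's event loop thread.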

The reverse trick: when something fails, the FAILED frame carries the restored credit footprint. The producer reserved credit when it sent the chunk; the failed ack tells it "you can take that credit back, plus this amount of additional credit I freed on your behalf." A single round-trip closes the negative path. There's no ambiguity about who owns the credit footprint of a rejected chunk and no separate CREDIT_UPDATE frame needed.

Three-dimensional backpressure (this is the part most designs get wrong)

Anyone who's built a queue-like thing has shipped a one-dimensional credit system: "you can have N in-flight requests." It works fine until somebody sends a single request that contains 50 megabytes of payload while another somebody sends ten thousand requests that are 12 bytes each. Both saturate the system in completely different ways, and a one-dimensional limit can't see either one coming.

So we do it in three dimensions. A producer is admitted to send a chunk if and only if:

There is at least one chunk-slot available (caps the count of in-flight chunks).
There are enough raw-event slots available to cover the chunk's event count (caps the aggregate cardinality).
There are enough payload-byte slots available to cover the chunk's payload size (caps the aggregate bytes in flight).

The acquire is all-or-nothing. If any dimension is short, the chunk is rejected with REJECTED_NO_CREDIT and the producer is expected to retry later.

public synchronized boolean tryAcquire(int slots, long events, long payloadBytes) {
    if (availableChunkSlots < slots
            || availableRawEvents < events
            || availablePayloadBytes < payloadBytes) {
        return false;
    }
    availableChunkSlots -= slots;
    availableRawEvents -= events;
    availablePayloadBytes -= payloadBytes;
    return true;
}

The producer enforces this and the server enforces an identical copy of it on its side. If the producer is buggy or malicious and tries to send beyond its credit, the server's credit gate catches it and the offending chunk gets FAILED(retryable=true, reason="credit exhausted"). This is defence in depth: the client controls its own behaviour, but the server is the source of truth for what's actually safe.

Credit is restored on three events: ACCEPTED ack received, FAILED ack received (with the restored amount carried in the frame), or connection drop (every outstanding chunk's footprint is freed so the producer is never permanently blocked).
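Here is a self-contained version of the three-dimensional window with the release path added, so you can see the all-or-nothing behaviour in one place. The `release` method and the accessors are my additions for the sketch. The real class may expose this differently.

```java
final class CreditWindow {
    private int availableChunkSlots;
    private long availableRawEvents;
    private long availablePayloadBytes;

    CreditWindow(int chunkSlots, long rawEvents, long payloadBytes) {
        this.availableChunkSlots = chunkSlots;
        this.availableRawEvents = rawEvents;
        this.availablePayloadBytes = payloadBytes;
    }

    /** All-or-nothing: if any one dimension is short, nothing is deducted. */
    synchronized boolean tryAcquire(int slots, long events, long payloadBytes) {
        if (availableChunkSlots < slots
                || availableRawEvents < events
                || availablePayloadBytes < payloadBytes) {
            return false;
        }
        availableChunkSlots -= slots;
        availableRawEvents -= events;
        availablePayloadBytes -= payloadBytes;
        return true;
    }

    /** Returns a chunk's footprint on ACCEPTED, on FAILED, or when the connection drops. */
    synchronized void release(int slots, long events, long payloadBytes) {
        availableChunkSlots += slots;
        availableRawEvents += events;
        availablePayloadBytes += payloadBytes;
    }

    synchronized int availableChunkSlots() { return availableChunkSlots; }
    synchronized long availableRawEvents() { return availableRawEvents; }
    synchronized long availablePayloadBytes() { return availablePayloadBytes; }
}
```

With a window of 2 slots, 100 events and 1000 bytes, a 900-byte chunk is admitted, a following 200-byte chunk is refused even though a slot is free, and the same chunk goes through once the first one's footprint is released.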

Layered on top of this is Netty's channel-level backpressure. The pipeline is configured with a WriteBufferWaterMark. When the outbound buffer crosses the high mark, channel.isWritable() flips to false; when it drains below the low mark, it flips back. A well-behaved producer checks isWritable() before submitting, which means even within its credit budget the producer naturally pauses when the socket's send buffer is full. Below all of that, the kernel's UDS send buffer applies its own pressure: if it's full, a blocking write() stalls, and a non-blocking one (which is what Netty uses) writes only part of the buffer or fails with EAGAIN. The application credit window, Netty's watermark, and the kernel buffer form three concentric rings, each catching a different misbehaviour.

The "retry-after" hint deserves a mention. When the server sends a credit update with no available room, it can attach a retry_after_millis field. The producer treats that as actionable — sleep for that duration before retrying, and if the producer is itself fronted by an HTTP API, convert it to a 429 with a Retry-After header. Backpressure should always be observable to the original caller. Silent drops are how you lose a weekend chasing data loss that isn't actually data loss.
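One small detail when converting the hint for an HTTP caller: Retry-After takes whole seconds, while the hint is in milliseconds. A sketch of the conversion, with `RetryAfterHint` as a hypothetical helper name. I round up so the caller is never told to come back earlier than the server asked.

```java
final class RetryAfterHint {
    /** Converts retry_after_millis into a Retry-After delay-seconds value, rounding up. */
    static String toRetryAfterHeaderValue(long retryAfterMillis) {
        long seconds = Math.max(1L, (retryAfterMillis + 999L) / 1000L);
        return Long.toString(seconds);
    }
}
```

Flooring a sub-second hint would produce Retry-After: 0, which tells a client to retry immediately and defeats the point.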

Concurrency without sadness

Netty's threading model is famously easy to get wrong. The single cardinal rule, the one I've now seen broken in production at three different companies: do not block on a Netty event loop thread.

The event loop owns the channel. It reads bytes off the socket, runs them through the decoder, dispatches to the protocol handler, runs the encoder, writes bytes back. One thread, many channels, no contention per channel. If you block that thread — on a synchronous database call, on a blocking ring-buffer publish, on a thread sleep, on a lock that's held by an offline service — you don't just stall one channel. You stall every channel that thread owns. Under sustained load that's the entire fleet of producers connected to that server.

The way out of this is discipline about where work happens. The server pipeline does exactly this much work on the event loop:

The IO event loop runs FlushConsolidationHandler → IdleStateHandler → FrameDecoder → FrameEncoder → protocol handler. All of those are CPU-only operations. The protocol handler does Protobuf parsing, CRC checking, version negotiation, credit accounting, ack accumulation. None of it talks to disk or to the network or to a database.

When a chunk needs to actually go somewhere — be aggregated, compressed, written to durable storage, fanned out to a notification stream — the protocol handler hands it to a ChunkSink interface. The sink is pluggable. In the simplest case (the v1 implementation), it might just enqueue the chunk into a lock-free ring buffer and return immediately. The actual heavy lifting happens on separate worker pools that the event loop never touches.

public interface ChunkSink {
    SinkVerdict accept(FrameHeader header, byte[] payload);
    sealed interface SinkVerdict {
        record Accepted() implements SinkVerdict {}
        record Failed(String reason, boolean retryable) implements SinkVerdict {}
    }
}

The contract: accept returns synchronously, never blocks, never throws on transient backpressure. Backpressure is communicated by returning Failed(retryable=true). The sink may publish into a queue or a ring buffer, but it must publish non-blockingly. If the queue is full, the sink rejects the chunk. The credit system upstream means this should be a rare event; if it isn't, your queue is sized wrong or your downstream pipeline can't keep up and you should be observing it on a dashboard, not absorbing it as backpressure on the IO loop.
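A sink that honours this contract can be very small. The sketch below uses a bounded `ArrayBlockingQueue` as a stand-in for the ring buffer (it is not lock-free, but `offer` never waits for space, which is the property that matters here). `QueueSinkSketch` is an illustrative name, the verdict types are redeclared so the example compiles on its own, and it assumes Java 17 for sealed interfaces.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class QueueSinkSketch {
    sealed interface SinkVerdict {}
    record Accepted() implements SinkVerdict {}
    record Failed(String reason, boolean retryable) implements SinkVerdict {}

    private final BlockingQueue<byte[]> queue;

    QueueSinkSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Safe to call from the event loop: returns immediately whether or not there is room. */
    SinkVerdict accept(byte[] payload) {
        // offer() returns false at once when the queue is full; put() would park the thread.
        return queue.offer(payload)
                ? new Accepted()
                : new Failed("sink queue full", true);
    }

    /** Worker threads drain from here, off the event loop. Returns null when empty. */
    byte[] poll() {
        return queue.poll();
    }
}
```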

The pluggability matters operationally too. When you're testing the transport in isolation, you wire in a sink that just buffers everything in memory. When you're benchmarking the protocol, you wire in a sink that drops the chunks on the floor. When you're running in production, you wire in the real ingestion pipeline. The transport never changes.

A quick note on thread counts. The boss event loop accepts new connections; for UDS this is single-threaded by nature, so one thread is plenty. The worker event loops handle the actual channels; 2–4 threads on a moderately-sized host will saturate UDS throughput well before they saturate the CPU. There's no value in throwing more threads at this. More threads means more context switching, not more throughput. The expensive work happens elsewhere.

The socket file problem nobody talks about

The socket is a file. Files persist across process restarts. This sounds like a footnote until it bites you.

Two scenarios that have to be handled:

Scenario one: the server died ungracefully and left a stale socket file behind. When the new server starts up, bind() on that path will fail with "address already in use." You have to clean up the stale file. But: how do you know it's stale? What if another instance of the server is actually running on that path because of a bug in your orchestration?

The wrong fix is to blindly unlink() the file before binding. That wins you a five-second outage when you accidentally remove the live server's socket and silently strand every producer trying to talk to it.

The right fix is to probe before deleting. Try to connect() to the socket as a client. If the connect succeeds, somebody is actually listening — refuse to boot and let the operator figure out what went wrong. If the connect fails with the expected "no listener" error, the file is stale and safe to delete.

public void prepareForBind() throws IOException {
    if (!Files.exists(socketPath)) return;
    if (probeForLiveListener())
        throw new IOException("Refusing to bind: another process is listening on " + socketPath);
    Files.delete(socketPath);
}

private boolean probeForLiveListener() {
    try (SocketChannel probe = SocketChannel.open(StandardProtocolFamily.UNIX)) {
        return probe.connect(UnixDomainSocketAddress.of(socketPath));
    } catch (IOException refused) {
        return false;  // expected on stale file
    }
}

Scenario two: a rolling deployment. During a deploy, the new server might bind a new socket at the same path while the old server's shutdown cleanup is still in flight. If the old server then does Files.delete(socketPath), it just removed the new server's socket. Now every connected client gets a silent disconnect and no new connections can be established.

The fix is to record the inode (file key) of the socket file at bind time and only delete on shutdown if the current inode still matches.

public void recordBoundInode() {
    boundFileKey = readFileKey(socketPath).orElse(null);
}

public void removeAfterShutdown() throws IOException {
    if (boundFileKey == null) return;
    Optional<Object> currentKey = readFileKey(socketPath);
    if (currentKey.isEmpty()) return;
    if (!Objects.equals(currentKey.get(), boundFileKey)) {
        log.info("Socket file no longer matches our inode; skipping delete");
        return;
    }
    Files.deleteIfExists(socketPath);
}
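The readFileKey helper is not shown above, so here is one plausible way to write it with the JDK. On Linux, `BasicFileAttributes.fileKey()` wraps the device and inode numbers, and it can be null on platforms that have no such concept. `SocketFileIdentity` and `stillOurs` are names I made up for the sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Optional;

final class SocketFileIdentity {
    /** Returns the file key (device + inode on Linux), or empty if the file is gone. */
    static Optional<Object> readFileKey(Path path) {
        try {
            return Optional.ofNullable(
                    Files.readAttributes(path, BasicFileAttributes.class).fileKey());
        } catch (NoSuchFileException missing) {
            return Optional.empty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True only if the path still points at the file we bound. */
    static boolean stillOurs(Path path, Object boundFileKey) {
        return boundFileKey != null
                && readFileKey(path).map(boundFileKey::equals).orElse(false);
    }
}
```

One caveat: some filesystems hand a freed inode number straight to the next file created, so a delete-then-recreate can in principle produce a matching key. Treat the check as a strong guard, not a proof.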

Both of these are about ten lines of code. Neither of them is in any tutorial I've ever read. Both will eventually save you an outage.

There's also the question of where to put the socket file. The answer is "on a path that's a host bind-mount visible to every container that needs it, and is on a real local filesystem." /var/run/<service>/<service>.sock is the conventional choice. Do not put it on a network filesystem. Do not put it on tmpfs shared between containers without thinking about it. Do not put it in a Docker volume that uses overlay2 semantics. A boring local-filesystem path under /var/run will not surprise you. Anything else might.

Permissions: directory mode 0770 and socket file mode 0660, both owned by a UID/GID shared between the producer and server containers. The directory needs its execute bit, because without it nobody can traverse into the directory to reach the socket at all. The kernel enforces filesystem permissions on the socket file: if the producer's user can't open it, connect() fails with a permission error. That's your access control, and it's stronger than any token-based scheme because it's enforced by the kernel before your code runs.
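If the server sets these modes itself at startup, a sketch along these lines would do it, assuming a POSIX filesystem, a 0770 directory and a 0660 socket file. `SocketPermissions` is a hypothetical helper. Ownership (the shared UID/GID) is left to the container and host setup, because changing it usually needs privileges the server shouldn't have.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;

final class SocketPermissions {
    /** Directory gets rwxrwx--- (0770); the socket file gets rw-rw---- (0660). */
    static void lockDown(Path socketDir, Path socketFile) {
        apply(socketDir, "rwxrwx---");
        apply(socketFile, "rw-rw----");
    }

    static void apply(Path path, String symbolicMode) {
        try {
            Files.setPosixFilePermissions(path, PosixFilePermissions.fromString(symbolicMode));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the mode back in symbolic form, handy for a startup log line. */
    static String modeOf(Path path) {
        try {
            return PosixFilePermissions.toString(Files.getPosixFilePermissions(path));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Setting the mode explicitly after bind avoids depending on whatever umask the container happens to start with.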

Draining, reconnecting, and not losing data when the server goes away

Failure handling deserves its own section because the easy story is wrong.

The easy story: "the server crashed and came back; the client reconnects and keeps going." That's true, but it elides the question of what happens to the chunks the client had in flight when the server died.

The model we commit to is at-least-once delivery with idempotent retry. The contract has three pieces.

First, every chunk has a stable identity. The header carries (producer_id, producer_epoch, producer_sequence). producer_id is the client's identity. producer_epoch is a monotonically-increasing counter that the client bumps every time it restarts. producer_sequence is monotonically-increasing within an epoch. The triple is unique for the lifetime of the producer.

Second, on disconnect, every in-flight chunk's future fails with LOST_ON_DISCONNECT and its credit footprint is released. The application sees the failure and chooses what to do.

public void failAll(String reason, Consumer<PendingEntry> onEachEntry) {
    for (var entry : snapshot()) {
        var removed = pending.remove(entry.getKey());
        if (removed == null) continue;
        onEachEntry.accept(removed);  // restores credit
        removed.future().complete(PublishResult.failure(LOST_ON_DISCONNECT, reason));
    }
}
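For completeness, here is a self-contained model of the pending table that failAll walks. The record shapes, the `track` method and the string result code are assumptions made so the example compiles on its own. Only the remove-then-complete order is taken from the snippet above.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

final class PendingChunks {
    record PublishResult(boolean accepted, String code, String reason) {}
    record PendingEntry(long sequence, long payloadBytes, CompletableFuture<PublishResult> future) {}

    private final Map<Long, PendingEntry> pending = new ConcurrentHashMap<>();

    /** Registers an in-flight chunk and returns the future the caller waits on. */
    CompletableFuture<PublishResult> track(long sequence, long payloadBytes) {
        PendingEntry entry = new PendingEntry(sequence, payloadBytes, new CompletableFuture<>());
        pending.put(sequence, entry);
        return entry.future();
    }

    /** On disconnect: fail every in-flight chunk exactly once and hand back its credit. */
    void failAll(String reason, Consumer<PendingEntry> onEachEntry) {
        for (Long sequence : List.copyOf(pending.keySet())) {
            PendingEntry removed = pending.remove(sequence);
            if (removed == null) continue;      // an ack got there first
            onEachEntry.accept(removed);        // caller restores the credit footprint
            removed.future().complete(new PublishResult(false, "LOST_ON_DISCONNECT", reason));
        }
    }

    int size() {
        return pending.size();
    }
}
```

Removing from the map before completing the future is what makes this safe against a late ack: whichever path wins the remove owns the entry, and the loser sees null and skips it.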

Third, the application's retry policy resubmits the chunk with the same logical content but a new (producer_epoch, producer_sequence). The server, on the other side, might write the same content twice — once before the crash, once after the retry. The downstream consumer dedups by a stable application-level ID (the request ID or event ID carried in the payload).

This is not exactly-once. Exactly-once is a research topic and the people who claim to have shipped it usually haven't, or have shipped it inside a single transactional boundary that isn't your boundary. At-least-once plus idempotent consumers is the boring, working answer. It's what every well-behaved event pipeline does. It's what we do here.
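The consumer side of that bargain, reduced to its core: remember which application-level IDs you have applied and skip repeats. This is a deliberately naive sketch. `IdempotentConsumer` is an illustrative name, and an in-memory set that grows forever is an assumption made for brevity, where a real consumer would bound it by time window or persist it.

```java
import java.util.HashSet;
import java.util.Set;

final class IdempotentConsumer {
    private final Set<String> seenEventIds = new HashSet<>();
    private int appliedCount = 0;

    /** Applies the event once per ID. Returns false for a duplicate delivery. */
    boolean apply(String eventId) {
        if (!seenEventIds.add(eventId)) return false;   // already applied, drop the retry
        appliedCount++;                                 // the real side effect would go here
        return true;
    }

    int appliedCount() {
        return appliedCount;
    }
}
```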

Graceful shutdown is its own dance. When the server gets a SIGTERM, it:

Stops returning healthy from its container health check, so the orchestrator stops sending it new traffic.
Broadcasts a DRAIN frame to every connected client. The DRAIN carries a deadline (server's stated time to finish draining).
Stops accepting new connections.
Waits for the drain grace period to let connected clients finish what they were doing.
Closes the remaining channels.
Removes the socket file (inode-checked).
Exits.

The client, on receiving DRAIN, transitions to a DRAINING state. New publish() calls fail fast with REJECTED_DRAINING. In-flight chunks are allowed to ack out. A local timer also force-closes the channel at the smaller of (server-stated deadline, client local shutdown budget), so a server that crashes mid-drain can't leave the client sitting forever.
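A small model of the client-side drain transition, to pin down the two rules in that paragraph: publishes fail fast once draining, and the force-close timer uses the smaller of the two deadlines. The class, enum and method names are mine, and the timer itself is left out.

```java
import java.time.Duration;

final class DrainAwareClient {
    enum State { ACTIVE, DRAINING }
    enum Outcome { SENT, REJECTED_DRAINING }

    private final Duration localShutdownBudget;
    private State state = State.ACTIVE;
    private Duration forceCloseAfter;   // null until a DRAIN arrives

    DrainAwareClient(Duration localShutdownBudget) {
        this.localShutdownBudget = localShutdownBudget;
    }

    /** Handles a DRAIN frame carrying the server's stated deadline. */
    synchronized void onDrain(Duration serverDeadline) {
        state = State.DRAINING;
        // Never wait longer than our own budget, even if the server asks for more.
        forceCloseAfter = serverDeadline.compareTo(localShutdownBudget) <= 0
                ? serverDeadline
                : localShutdownBudget;
    }

    /** New publishes fail fast while draining; in-flight chunks are unaffected. */
    synchronized Outcome publish() {
        return state == State.DRAINING ? Outcome.REJECTED_DRAINING : Outcome.SENT;
    }

    synchronized Duration forceCloseAfter() {
        return forceCloseAfter;
    }
}
```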

Reconnect uses bounded exponential backoff with full jitter. The first attempt happens after a small initial delay (50 ms is reasonable), each subsequent attempt doubles up to a ceiling (5 seconds), and the actual delay is uniformly distributed between 0 and the current ceiling. Jitter is the difference between "everyone retries at the same instant and DOSes the recovering server" and "retries are spread smoothly across the recovery window." It's also one line of code, so there is no excuse not to do it.

public synchronized Optional<Duration> nextDelay() {
    if (maxAttempts >= 0 && attemptCount >= maxAttempts) return Optional.empty();
    long unjittered = exponentialDelayMillis(attemptCount++);
    long jittered = jitterSupplier.applyAsLong(Math.max(1L, unjittered));
    return Optional.of(Duration.ofMillis(Math.max(1L, jittered)));
}
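The snippet above leans on two helpers it doesn't show. Here is one way they could look, as a self-contained sketch: the class name FullJitterBackoff, the constructor shape, and the doubling loop are my assumptions, not the implementation's actual code.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.LongUnaryOperator;

/** Hypothetical bounded exponential backoff with full jitter. */
final class FullJitterBackoff {
    private final long initialMillis;
    private final long maxMillis;
    private final int maxAttempts;           // negative means "retry forever"
    private final LongUnaryOperator jitter;  // maps the current ceiling to the actual delay
    private int attemptCount;

    FullJitterBackoff(long initialMillis, long maxMillis, int maxAttempts, LongUnaryOperator jitter) {
        this.initialMillis = initialMillis;
        this.maxMillis = maxMillis;
        this.maxAttempts = maxAttempts;
        this.jitter = jitter;
    }

    /** Full jitter: the delay is uniform between 0 and the current ceiling, inclusive. */
    static FullJitterBackoff withRandomJitter(long initialMillis, long maxMillis, int maxAttempts) {
        return new FullJitterBackoff(initialMillis, maxMillis, maxAttempts,
                ceiling -> ThreadLocalRandom.current().nextLong(ceiling + 1));
    }

    /** Doubles the initial delay once per attempt, stopping at the ceiling. */
    long exponentialDelayMillis(int attempt) {
        long delay = initialMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    synchronized Optional<Duration> nextDelay() {
        if (maxAttempts >= 0 && attemptCount >= maxAttempts) {
            return Optional.empty();
        }
        long ceiling = Math.max(1L, exponentialDelayMillis(attemptCount++));
        return Optional.of(Duration.ofMillis(Math.max(1L, jitter.applyAsLong(ceiling))));
    }
}
```

Passing the jitter in as a function is a deliberate choice: tests can inject an identity function and assert on the exact unjittered schedule, while production uses the random one.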

PING/PONG keeps the channel warm and lets each side detect a dead peer. The Netty IdleStateHandler is configured with a reader-idle timeout (no inbound activity for X seconds → close the channel) and a writer-idle timeout (no outbound activity for Y seconds → emit a PING). Reader-idle closes are caught by the reconnect loop. Writer-idle PINGs solve the "TCP connection is silently dead because a middlebox dropped state" problem, which UDS doesn't have, but the same code path also gives you a periodic liveness signal for free, so we keep it.

Deploying this thing on ECS

The deployment topology has a few pieces that have to fit together correctly.

The server runs as an ECS daemon service. Set schedulingStrategy=DAEMON. The cluster will place exactly one task on every container instance. This is what gives you the "one IPC server per host" invariant. Without daemon scheduling, you'd have to coordinate placement constraints by hand, and you'd eventually end up with hosts that have zero servers or hosts that have two servers competing for the same socket file.

Producers run as a regular ECS service. Spread placement is fine. The producer doesn't need to know which host it's on; it just connects to a fixed socket path. Whatever host it lands on, there'll be a server listening at that path because daemon scheduling guarantees it.

Both run in awsvpc networking mode. The hot path is UDS, so the loopback namespace separation that bites TCP localhost doesn't matter to us at all. You can pick whatever networking mode is most convenient for the other traffic the containers carry.

The socket directory is a host bind-mount. Both the producer task definition and the server task definition mount /var/run/<service> from the host into the container. Bind mount, not emptyDir, not tmpfs. The reason is durability across container restarts: an emptyDir is a fresh volume each time the task starts, so a producer restart would lose access to the running server's socket file. A host bind mount lives on the EC2 instance's local filesystem and persists for the life of the host.

The server's container health check returns healthy only after the socket is bound. The health check should do a real connect() to the socket and immediately close. Don't use a TCP health check; the server isn't listening on TCP. Don't use a process health check; the process can be running but not yet bound. The check needs to verify the actual integration point.
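A connect-and-close probe can be written with nothing but the JDK, assuming Java 16 or later for Unix-domain socket support. The class and method names here are hypothetical; how you wire the result into the container health check command is up to you.

```java
import java.io.IOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;

/** Hypothetical health probe: healthy only if something accepts connections on the socket path. */
final class SocketHealthProbe {
    static boolean isAccepting(Path socketPath) {
        // try-with-resources closes the channel immediately after the connect attempt.
        try (SocketChannel channel = SocketChannel.open(StandardProtocolFamily.UNIX)) {
            return channel.connect(UnixDomainSocketAddress.of(socketPath));
        } catch (IOException | UnsupportedOperationException e) {
            // Missing socket file, stale file with no listener, or no UDS support on this JVM.
            return false;
        }
    }
}
```

A stale socket file left behind by a crashed server fails this probe, which is the point: the file existing is not the same as the server listening.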

The producer does not depend on any ECS attribute or service discovery. It does an exponential-backoff connect loop against the socket path. If the server isn't ready yet, the producer retries. If the server is restarting, the producer retries. If the server gets replaced during a rolling deploy, the producer's reconnect loop handles it. This is much simpler than wiring up cross-task readiness signals through PutAttributes or service discovery, and it works no matter what the orchestrator is doing.

Per-host resources matter. The server is a long-lived JVM with a lot of pooled direct memory. Plan for:

Native epoll transport (io.netty.transport.noNative=false), which on Linux is automatic but worth verifying you're picking up the right native library for your architecture (linux-x86_64 vs linux-aarch64).
-XX:MaxDirectMemorySize sized to cover your pooled allocator high-water mark plus a comfortable safety margin. Direct memory is allocated outside the heap and will OOM your container if you under-size it.
memlock=unlimited and the IPC_LOCK capability if you're planning to lock pages, which you probably aren't in v1 but might in a future phase.
A stopTimeout long enough to cover your drain budget plus a buffer. If the orchestrator SIGKILLs you mid-drain, your producers will get hard disconnects instead of graceful drains.

A walk through one publish

Let's trace one chunk end-to-end so the pieces fit together.

The application calls client.publish(chunk) from any thread. The client checks its state (must be STARTED, not DRAINING or CLOSING), checks the handshake is complete, and tries to acquire 3-D credit for this chunk's footprint. If credit fails, the call returns a completed future with REJECTED_NO_CREDIT. If state fails, similar.
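The credit check in that first step has one property worth spelling out: it is all-or-nothing across the three dimensions. A minimal sketch, assuming the dimensions are chunks, events, and payload bytes (inferred from the register call's arguments) and using a hypothetical CreditGate class:

```java
/** Hypothetical three-dimensional credit gate: chunks, events, and payload bytes. */
final class CreditGate {
    private long chunks;
    private long events;
    private long bytes;

    CreditGate(long chunks, long events, long bytes) {
        this.chunks = chunks;
        this.events = events;
        this.bytes = bytes;
    }

    /** Takes credit in every dimension or in none; a failure maps to REJECTED_NO_CREDIT. */
    synchronized boolean tryAcquire(long chunkCount, long eventCount, long payloadBytes) {
        if (chunkCount > chunks || eventCount > events || payloadBytes > bytes) {
            return false;
        }
        chunks -= chunkCount;
        events -= eventCount;
        bytes -= payloadBytes;
        return true;
    }

    /** Restores a footprint on ack, write failure, timeout, or disconnect. */
    synchronized void release(long chunkCount, long eventCount, long payloadBytes) {
        chunks += chunkCount;
        events += eventCount;
        bytes += payloadBytes;
    }
}
```

The single lock keeps the check and the decrement atomic, so two threads cannot both squeeze through on the last remaining credit.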

Assuming everything's good, the client assigns a producer_sequence, registers the chunk in its PendingChunkRegistry keyed by (producer_epoch, producer_sequence), gets back a CompletableFuture<PublishResult>, schedules a per-chunk publish timeout, and calls channel.writeAndFlush on the Netty channel.

long sequence = producerSequence.getAndIncrement();
var key = new ChunkKey(producerEpoch, sequence);
var future = pendingChunks.register(key, 1, eventCount, payloadBytes);
schedulePublishTimeout(key);

var header = FrameHeader.newBuilder()
    .setFrameType(FRAME_TYPE_DATA)
    .setProducerId(producerId).setProducerEpoch(producerEpoch)
    .setProducerSequence(sequence)
    .setEventCount(eventCount).setPayloadBytes(payloadBytes)
    .build();

channel.writeAndFlush(Frame.of(header, request.payload()))
       .addListener(onWriteComplete);   // releases credit on write failure

return future;

Netty's encoder serializes the frame using the wire format, writes it into a pooled direct buffer, and hands it to the kernel via writev. The kernel copies it into the receiving UDS socket's buffer.

On the server side, the worker event loop wakes up on the epoll readiness notification, reads the bytes, and runs them through the pipeline. The decoder pulls one frame out of the buffer, the protocol handler routes it by frame_type to handleData, the credit gate checks that the chunk fits, the chunk sink processes it. Assuming Accepted, the chunk's identity goes onto the pending ACCEPTED list.

Other frames in the same read batch are processed the same way. At the end of the batch, channelReadComplete fires and we flush the pending list as a single ACCEPTED frame containing every chunk identity from this batch.

The ACCEPTED frame goes back through the same wire format, back through the kernel, back to the client. The client's decoder pulls it out, hands it to the protocol handler, which looks up each chunk identity in its registry, completes each future with PublishResult.accepted(), and restores the credit footprint.

The application's await() on the future returns. The whole round-trip, on a moderate-sized r6i instance under reasonable load, is in the low tens of microseconds. The protocol contributes effectively nothing to that — most of the time is the syscall to write the bytes, the syscall to read them, and a tiny amount of Netty bookkeeping. The 50+ Gbit/s per-channel headroom that UDS gives you means we can do this thousands of times per millisecond without breaking a sweat.

When this is the wrong answer

Honesty: not every problem wants this transport. Here's when I'd reach for something else.

Producer and consumer aren't on the same host. Then the whole premise is gone. Use a real network protocol. gRPC is fine. Plain HTTP is fine. Whatever you'd normally reach for is fine.

You need cross-language interop. Netty is JVM. The pattern translates to Go or Rust or Python without much trouble — the wire format is language-neutral by design — but if you want a single off-the-shelf library that already works in five languages, gRPC over UDS gets you part of the way there with less custom code.

You need exactly-once semantics. You don't. Nobody does. Build idempotency into your consumers and stop fighting the universe.

Your throughput is genuinely tiny. If you're shipping ten requests a second, the operational cost of any custom protocol is going to outweigh the savings versus just calling an HTTP endpoint. Use HTTP, save your weekends.

You need sub-microsecond latency. Now we're in shared-memory territory. Aeron, LMAX Disruptor patterns, kernel-bypass networking. UDS is fast but it's still doing two kernel copies; shared memory does zero. If you need that floor, build it.

You can't get a host bind-mount. Some platforms (managed Kubernetes flavours with strict security profiles, certain serverless container runtimes) don't let you bind-mount host paths. UDS is mostly off the table in those environments. You're back to network protocols.

For the broad middle — same host, container-to-container, throughput in the gigabytes-per-second range, latency budget in the milliseconds, willing to write a small amount of careful protocol code — this design is hard to beat.

Closing thoughts

The cloud-native consensus, for very good reasons, is "treat the network as the abstraction." A service has a name, you talk to it over a name-resolved endpoint, you don't care where it is. That model has earned its place.

But it has a cost, and the cost is paid every time two services that are on the same host are forced to pretend they aren't. The default tooling will route their bytes through ENIs and load balancers and TLS handshakes, and you'll see the bill at the end of the month and the tail latency on your dashboards and the strange flamegraphs in your profilers.

UDS is the escape hatch. It's been sitting there in the kernel the whole time. The amount of code needed to use it well is not large — the implementation that backs this article is something like 2,000 lines of Java including tests — but the design has to be careful about a handful of specific things: the fixed prelude trick, the three-dimensional credit model, the ack coalescing on channelReadComplete, the inode-checked socket cleanup, the discipline of never blocking on the IO event loop, the boring choice to commit to at-least-once delivery and idempotent consumers.

If you do those things right, you get a transport that is fast, cheap, observable, and doesn't lie to you. That's a good list.

I'd encourage anyone running a service-to-service workflow on the same host to at least measure what the current implementation is costing them. The number is usually larger than people expect, and the fix is much smaller than people expect.

The pattern works. Steal it.

Stop Paying a Streaming Bus to Carry Bytes That Live for Ninety Seconds

Samar Prakash — Sat, 30 May 2026 09:48:20 +0000

How a shared filesystem became the cheapest, fastest outbox I've ever built — and why FSx for OpenZFS is the version of that idea that finally scales

I was staring at an AWS bill last quarter where a single Kinesis Data Streams line item was costing more than the entire S3 footprint sitting behind it. The events on that stream had a useful lifetime of about ninety seconds. They were written by one service, read by another, processed, and dropped. We were paying full streaming-bus price for bytes that barely outlived a TCP timeout.

That bill is what got me thinking about transitional data as a category that deserves its own architecture, and about why every "use the right tool" instinct I had — Kinesis, Kafka, MSK — was the wrong tool for this particular shape of work. The right tool, it turns out, is a filesystem. Specifically, AWS FSx for OpenZFS, used as an outbox between producers and consumers, with only a tiny pointer message traveling through whatever messaging bus you already have.

This article is the case for that pattern. It's also the design, the failure modes, the code, the cost math, and the honest list of when not to do it. I'll walk you through the architecture from first principles, show you the safe-write protocol that makes it correct under crashes and concurrent retries, compare the cost against Kinesis, MSK and EFS at a realistic petabyte-class workload, and explain why the recent addition of FSx Intelligent-Tiering changes the cost story in a way that makes the pattern attractive even for teams that don't ingest petabytes.

If you've ever felt the queasy sensation of paying twice for the same bytes — once to land on a stream, again to land in storage — this is for you.

What "transitional data" actually means

Most data falls into one of two cleanly shaped buckets. Durable data is the stuff you keep — user records, orders, financial events, audit trails. It needs to live for years; you pay storage costs for those years and you get value over those years. Streaming data is data you process in motion — clickstreams, telemetry, alerts — where the value is in real-time consumption.

Transitional data is the awkward middle child. It's data that:

Is produced in continuous high volume.
Is consumed shortly after production — usually within seconds or minutes, almost never more than an hour or two.
After consumption, is either archived for compliance/audit or deleted entirely.
Has no value sitting on a transport between producer and consumer beyond getting from one side to the other.

The classic examples are event-pipeline workloads: a producer ingests external events, enriches them, then hands them to one or more downstream consumers for further processing. The "stream" in the middle is conceptually a pipe, not a database. The events have ordering constraints within partitions and at-least-once delivery requirements, but nobody is querying the stream itself — it's purely a carrier.

The instinct, drilled into us by ten years of cloud-native architecture talks, is to put transitional data on a streaming bus. Kinesis. MSK. Pub/Sub. Event Hubs. It's the default. It's what the conference slides recommend. It's what every reference architecture diagram shows.

And it works fine — right up until your volume gets serious. At that point, the streaming bus stops being a transport and starts being the largest line item on your bill.

The Kinesis math at scale, written out plainly

Let me walk through the actual numbers. A Kinesis Data Streams shard in provisioned mode gives you, per the AWS service limits documentation, 1 MB/s of write throughput OR 1,000 records per second, whichever you hit first. Read throughput is 2 MB/s per shard (5 transactions per second). Whichever cap you smash into first is the cap you actually have.

Suppose you're ingesting at a steady 700 MB/s, which works out to roughly 60 TB/day (a full 100 TB/day averages closer to 1.2 GB/s, so the shard counts below are, if anything, conservative for that volume). You need 700 shards just to hold the write rate, before any consumer fan-out, before any headroom for spikes, before any consideration of hot keys.

Hot keys make the picture worse. The shard you land on is determined by an MD5 hash of the partition key, mapped onto the hash-key range that each shard owns. If 5% of your traffic comes from one customer or one stream of telemetry, that 5% all goes to one shard. One shard, 1 MB/s. The other 699 shards sit there underutilized while your hot shard throttles. You can solve this with key spreading, but key spreading breaks per-key ordering, which is often the entire reason you picked a partitioned stream in the first place. So you scale up, to 1,000 shards, to 1,500 shards, to give the hot key room.

Shards are billed per shard-hour. At list price the per-shard cost is small, but at 700–1,500 shards it adds up to tens of thousands of dollars per month before you've ingested a single byte. Add PUT payload unit charges (each record is rounded up to whole 25 KB units, so 700 MB/s is at least 28,000 units per second, and considerably more if your events are small), enhanced fan-out per consumer, extended retention, and you're well into the high five figures per month for a transport that exists solely to move bytes that won't be relevant in five minutes.
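If you want to plug in your own traffic, the sizing arithmetic fits in a few lines. This sketch uses only the limits quoted above (1 MB/s or 1,000 records/s per shard, 25 KB per PUT payload unit); the class and method names are made up, and prices are deliberately left out because they vary by region and date.

```java
/** Hypothetical sizing helper built from the per-shard limits quoted in the text. */
final class KinesisSizing {
    /** Shards needed for writes: whichever of the byte cap or record cap binds first. */
    static long shardsForWrites(double writeMBps, double recordsPerSecond) {
        long byBytes = (long) Math.ceil(writeMBps / 1.0);             // 1 MB/s per shard
        long byRecords = (long) Math.ceil(recordsPerSecond / 1000.0); // 1,000 records/s per shard
        return Math.max(byBytes, byRecords);
    }

    /** PUT payload units per second: each record is rounded up to whole 25 KB units. */
    static long putPayloadUnitsPerSecond(double recordsPerSecond, double avgRecordKB) {
        long unitsPerRecord = Math.max(1L, (long) Math.ceil(avgRecordKB / 25.0));
        return (long) Math.ceil(recordsPerSecond * unitsPerRecord);
    }
}
```

For example, 700 MB/s as 28,000 records of 25 KB each needs 700 shards and 28,000 units per second, while the same 700 MB/s as 700,000 records of 1 KB each needs 700 shards for bytes but burns 700,000 units per second.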

On-demand mode looks nicer until you read the small print. It's billed per GB ingested, per GB retrieved, plus a baseline stream-hour fee. At 100 TB/day, the per-GB charges alone clear $30K/month before you count consumer reads.

You can argue with the exact dollar figures (they vary by region, by reserved-capacity discounts, by your actual fan-out pattern), but the shape is the same regardless. The cost grows linearly with the bytes you push through, and the bytes you push through are exactly the same bytes you have to store somewhere durable anyway. You are paying twice.

MSK isn't the answer either

The instinct after "Kinesis is expensive" is "let's use Kafka instead." Kafka is open-source, it's mature, the ecosystem is enormous. AWS gives us MSK so we don't have to operate the brokers ourselves. Surely that's cheaper.

It's not, for this shape of workload. Let me show the math.

MSK pricing, as of 2026, looks roughly like this for provisioned clusters: broker instances at around $0.20/hour for the small kafka.m7g.large class and proportionally more for larger ones, EBS storage at $0.10 per GB-month for the broker volumes, plus inter-AZ data transfer for replication. MSK Express adds another $0.01/GB ingested. MSK Serverless charges $0.75 per cluster-hour, $0.0015 per partition-hour, $0.10/GB in and $0.05/GB out.

For a 100 TB/day workload you'll want at least 6–9 brokers of a non-trivial instance class to handle the throughput with replication-factor-3 durability. That's $3K–$6K/month in broker hours alone. Then you need EBS sized to hold a few hours of buffer per broker — say 2 TB per broker times 9 brokers times $0.10/GB-month, another $1,800/month. Then cross-AZ replication of every byte ingested: at 100 TB/day across three AZs you're shifting roughly 200 TB/day cross-AZ for replication, and AWS charges around $0.01/GB for that — call it $60K/month if you're unlucky on routing.

Add the operational burden: MSK frees you from broker installs but not from partition rebalancing decisions, broker right-sizing, version upgrades, ZooKeeper-to-KRaft migration, ACL management, monitoring, and the eternal "why is my consumer lag spiking" investigations. That's not a billed line item but it's a real cost.

Rough total for the same 100 TB/day shape: $70K–$90K/month, give or take. Comparable to Kinesis, more operational headache, no architectural advantage for transitional data because — and this is the key point — you are still paying a transport service to carry every single byte through brokers and across AZs, even though those bytes are going to be discarded within minutes.

The rule of thumb in the industry has settled at MSK being roughly 3–5× more expensive than Kinesis once you count operational overhead. For specifically transitional data, where the value of the byte-time on the wire is approximately zero, either choice is the wrong economic shape.

There has to be a different model.

The mental shift: payload on a filesystem, pointer on the bus

Here's the model that actually fits. It's the outbox pattern, but the outbox is a shared filesystem instead of an in-process table.

The producer doesn't push bytes through a transport. Instead it writes a file to a shared filesystem: a real file, on a real path, with normal POSIX semantics. Then it publishes a tiny pointer message through whatever bus you have lying around (Kinesis, SNS, SQS, even a small Kafka topic). The message is a couple of hundred bytes: file path, batch id, size, checksum, maybe a couple of routing keys.

The consumer reads the pointer message, mounts the same filesystem, opens the file at the given path, streams the bytes, processes them, and acknowledges the message. The bytes never traverse the transport. The transport only carries metadata about where to find the bytes.

That diagram is the entire pattern at the architectural level. Three boxes. The interesting part is everything you don't see in it: backpressure, atomic writes, idempotent retry, partitioning, batching, compression, file-system tuning. We'll get to all of that.

But first, why this change of frame is so cost-effective:

The bus carries small messages, so the bus cost collapses. Pointer messages are roughly two hundred bytes each. A bus that was strained to carry 700 MB/s of payload is now carrying maybe 200 KB/s of pointers. Shard counts can drop by an order of magnitude or more. One real architecture I've seen reduced a Kinesis shard count from 700 to 32 — a 95% reduction in transport spend, with no change to the actual byte volume being moved.
The filesystem is billed for what it stores, not for what passes through it. You pay for the bytes that exist, not the bytes that have existed and have already been consumed and deleted. If your retention is one hour, you pay for one hour's worth of bytes resident at any given moment.
The filesystem gives you primitives the bus doesn't. Snapshots. Random access. Concurrent multi-reader semantics. Standard POSIX tools to inspect, validate, and debug. Your data-engineering team can ls your transitional data. They cannot ls a Kinesis stream.
The compute path is the same speed or faster. A modern NFS mount on the same VPC moves data faster than any HTTP-based streaming bus. Once you understand the math, the streaming-bus model is the slow option, not the fast one.

So why didn't everyone do this ten years ago? Because the filesystem options didn't scale. EFS exists, but at the throughput required for petabyte-class workloads EFS becomes the most expensive option in the room. We'll get to that comparison. The reason this pattern is suddenly viable is that FSx for OpenZFS, particularly with Intelligent-Tiering, gives you the performance and the multi-AZ durability of a real production filesystem at a cost that beats every transport on the market.

Why FSx for OpenZFS, specifically

There are at least six AWS storage services you could plausibly use here. Let's eliminate them one by one.

S3 direct from producers. Tempting because S3 is cheap, well-understood, and infinite. But: producers write small files (a few hundred KB to a few MB per micro-batch), and S3 has a small-files problem. Per-PUT overhead, eventual-consistency caveats on listings (resolved since S3 moved to strong read-after-write consistency in December 2020, but historically a footgun), and a multipart-upload model that's clumsy for the size of files you actually produce. Consumers also have to do a LIST or rely on S3 event notifications, which adds latency and complexity. S3 is the destination tier, not the working tier.

EFS Standard. Native NFS, multi-AZ, easy to mount everywhere. At $0.30/GB-month for storage and elastic throughput costing $0.03/GB read and $0.06/GB written, a 100 TB/day workload runs about $300K/month in throughput charges alone in elastic mode. EFS Provisioned trades that for a fixed throughput fee — at the throughput levels we need, around $90K/month. Still expensive, and EFS latency is consistently in the single-digit-millisecond range rather than the sub-millisecond range you'd want for a hot working tier.

FSx for Lustre. Genuinely fast — sub-millisecond latency, hundreds of GB/s aggregate throughput in larger deployments, the workhorse of HPC. But it's single-AZ only, and the failure model for transitional data really wants multi-AZ. You can stitch together cross-AZ Lustre with replication, but the cost balloons and you lose simplicity. Lustre also requires a kernel module on clients, which is operationally awkward in mixed container environments.

FSx for Windows / FSx for ONTAP. Both work, both support multi-AZ, both add complexity (SMB licensing, ONTAP feature surface). Neither is wrong, neither is the obvious right.

FSx for OpenZFS. Multi-AZ with synchronous replication to a standby in another AZ. NFS protocol (v3 / v4.0 / v4.1 / v4.2) — clients are the standard Linux kernel NFS client, no special drivers. SSD-backed. Sub-millisecond latency. Native LZ4 compression at the file-system level. POSIX semantics, including the all-important atomic-rename guarantee we need for safe writes. Snapshots, encryption at rest, KMS integration. And — this is the new part — an Intelligent-Tiering storage class that prices in at roughly 85% less than the SSD class.

The recent Intelligent-Tiering announcement is what tipped this from "viable" to "obvious." Let's look at what it actually gives you.

Intelligent-Tiering, and why it matches transitional-data access patterns perfectly

FSx Intelligent-Tiering is a separate storage class for FSx for OpenZFS that introduces three tiers within a single namespace. Per the AWS announcement:

Frequent Access — data touched within the last 30 days. Baseline tier, sub-millisecond reads from cache, full performance.
Infrequent Access — data not touched for 30 to 90 days. Roughly 44% cheaper than Frequent Access.
Archive Instant Access — data not touched for 90+ days. Roughly 65% cheaper than Infrequent Access. Still online, no restore needed; first-byte latency in the tens of milliseconds.

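Multiply the discounts and the Archive tier costs about 0.56 × 0.35 ≈ 0.20 of the Frequent Access price, which is roughly 80% lower than that baseline. The marketing line is "up to 85% lower than the existing SSD storage class," which is measured against the SSD class rather than the Frequent Access tier, so the two numbers aren't directly comparable, but they land in the same neighbourhood.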

A file system using Intelligent-Tiering supports up to 400K IOPS and 20 GB/s of throughput, with a minimum provisioned throughput floor of 160 MBps. You can optionally provision an SSD read cache on top of the tiered storage to keep hot files at sub-millisecond latency even after they've technically migrated to a colder tier.

Here's why this matches transitional data so perfectly: the access pattern for an outbox is "hot for minutes, cold forever after." A batch file is written, read once or twice by consumers within the first few minutes, and then never touched again unless someone is doing forensic analysis. That's exactly the pattern Intelligent-Tiering is optimized for. The hot files stay on the fast tier and serve consumers at sub-ms latency; the post-consumption files quietly slide down to the Archive Instant Access tier, where they cost almost nothing but are still readable on demand if you need to audit.

You don't have to design this. You don't have to set up lifecycle rules. The file system does it. From the application's perspective, every file is at the same path in the same namespace. The pricing optimization happens underneath.

The architecture in two passes

Let's walk through the producer side and the consumer side in turn. This is the actual shape of a working implementation, lightly cleaned up from a real codebase.

Producer pass

A producer's job is to take a batch of events, compress them, write the result durably to FSx, and publish a notification message to the bus. The full sequence is:

1.  Build payload                  → ZSTD-compressed protobuf in a direct ByteBuffer
2.  Resolve partition folder       → sanitize id, validate against FSx root
3.  Create batch directory          → /fsx/<part>/year=Y/month=M/day=D/hour=H/batch-<uuid>/
4.  Reserve in-flight bytes         → byte-based backpressure (more on this below)
5.  Write to a unique temp file     → data.bin.tmp.<uuid>.1
6.  fsync the temp file             → optional, controlled by config
7.  Atomic rename to data.bin       → Files.move(tmp, final, ATOMIC_MOVE)
8.  Publish pointer message         → { filePath, batchId, sizeBytes, crc32c }
9.  Release in-flight bytes         → permit auto-closed

Step 5 onwards is the safe-write protocol, and it's the heart of correctness. We'll spend the next section on it. Steps 1–4 are application logic and bookkeeping; they're conceptually simple but the byte-based backpressure deserves its own discussion.

Consumer pass

Consumers do the inverse:

1.  Pull notification message from bus
2.  Validate pointer (path within allowed root, checksum optional)
3.  Open the file via the local NFS mount
4.  Stream the bytes — gather them, decompress, deserialize
5.  Process the events
6.  Acknowledge / delete the bus message

There's no LIST, no scan, no polling for new files. Consumers are driven by the bus. The bus tells them what to read; the filesystem holds what they read.

This separation matters operationally. A slow consumer doesn't back up the producer's writes — the producer's writes are already complete on the filesystem. A bus outage doesn't stall the producer beyond the publish step; the file is already durable. A consumer crash doesn't lose data; the next consumer pulls the same notification (because we hadn't acked) and reads the same file.
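The "path within allowed root" check in step 2 is the one piece of the consumer pass that is easy to get subtly wrong, so here is a minimal sketch. The class and method names are hypothetical, and this version only normalizes the path lexically; it does not resolve symlinks, which a stricter check could do with toRealPath once the file exists.

```java
import java.nio.file.Path;

/** Hypothetical pointer validation: refuse any file path that escapes the mounted root. */
final class PointerValidator {
    static Path resolveWithinRoot(Path allowedRoot, String filePath) {
        Path root = allowedRoot.toAbsolutePath().normalize();
        // normalize() collapses "." and ".." so traversal attempts cannot hide in the middle.
        Path candidate = Path.of(filePath).toAbsolutePath().normalize();
        if (!candidate.startsWith(root)) {
            throw new IllegalArgumentException("pointer escapes allowed root: " + filePath);
        }
        return candidate;
    }
}
```

Path.startsWith compares whole path elements, so /fsx-other/data.bin does not pass as being under /fsx the way a naive string prefix check would allow.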

The safe-write protocol, in detail

The single most important piece of code in this entire pattern is how a producer commits a batch to the filesystem. Get this wrong and you have torn writes, partial reads, racing retries deleting each other's work. Get it right and you have an outbox that survives crashes without complicated coordination.

The protocol has six steps. Each one is doing real work.

Step 1 — create the batch directory. Files.createDirectories(batchDirectory). POSIX mkdir -p. This is idempotent and cheap. The batch directory is computed deterministically from the partition id and a UUID-based batch id, with time-based folder buckets above it to avoid one directory eventually holding millions of entries.
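As a sketch of how that deterministic layout could be computed: the BatchPaths name is hypothetical, and bucketing by UTC is my assumption (the article does not say which time zone the buckets use). The partition id is assumed to be already sanitized.

```java
import java.nio.file.Path;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

/** Hypothetical path builder for the time-bucketed batch directory layout. */
final class BatchPaths {
    static Path batchDirectory(Path fsxRoot, String partitionId, Instant when, String batchId) {
        // Assumption: buckets use UTC so every producer agrees regardless of host time zone.
        ZonedDateTime t = when.atZone(ZoneOffset.UTC);
        return fsxRoot.resolve(partitionId)
                .resolve(String.format("year=%04d", t.getYear()))
                .resolve(String.format("month=%02d", t.getMonthValue()))
                .resolve(String.format("day=%02d", t.getDayOfMonth()))
                .resolve(String.format("hour=%02d", t.getHour()))
                .resolve("batch-" + batchId);
    }
}
```

The caller would then pass the result to Files.createDirectories, which succeeds quietly if the directory already exists.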

Step 2 — open a unique temp file. Critically, the temp filename includes a per-attempt unique token. The template I use is something like data.bin.tmp.<uuid>.<attempt>. The reason for the uniqueness is that retries must not collide with each other or with a previous crashed attempt. If attempt 1 crashed mid-write and left a data.bin.tmp.abc123.1 orphan, attempt 2 must open data.bin.tmp.def456.2 — a different name — so neither attempt steps on the other. The StandardOpenOption.CREATE_NEW flag enforces this: the open fails if the file already exists, which is exactly what we want.

Step 3 — vectored write. This is where the implementation rewards careful engineering. A naïve writer would loop over composite buffer components and issue one write() per component. A composite payload of a hundred small buffers becomes a hundred syscalls.

Instead we use FileChannel.write(ByteBuffer[]), which is the JDK's bridge to the kernel's writev syscall. The kernel takes a whole array of buffer descriptors and does the gather in one call. Many small components, one syscall.

private PayloadWriteOutcome writeGatheredBuffers(...) {
    final CRC32C checksumCrc32c = checksumEnabled ? new CRC32C() : null;
    long totalBytesWritten = 0L;
    int nextBufferIndex = nextReadableBufferIndex(payloadBuffers, 0);

    while (nextBufferIndex < payloadBuffers.size()) {
        ByteBuffer[] gatherSources = nextGatherSources(payloadBuffers, nextBufferIndex);
        updateChecksum(checksumCrc32c, gatherSources);
        while (hasRemaining(gatherSources)) {
            long bytesWritten = fileChannel.write(gatherSources);  // writev under the hood
            totalBytesWritten += bytesWritten;
        }
        nextBufferIndex = nextReadableBufferIndex(payloadBuffers,
                                                  nextBufferIndex + gatherSources.length);
    }
    return new PayloadWriteOutcome(totalBytesWritten,
            checksumCrc32c == null ? null : checksumCrc32c.getValue());
}

There's a chunking detail: most kernels cap the gather size somewhere around IOV_MAX (1024 on Linux). The code respects that with GATHER_WRITE_BUFFER_LIMIT = 1024. A payload with more components is gathered in 1024-buffer slices.

The CRC32C is computed during the same pass — one walk over the bytes, two outputs. Computing the checksum as a separate post-pass would double the byte traffic for nothing.
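The listing above depends on helpers that aren't shown, so here is a compact, self-contained variant of the same idea. It is a sketch under my own assumptions (the GatherWriter name, a List of buffers as input, a blocking channel), not the implementation's code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.GatheringByteChannel;
import java.util.List;
import java.util.zip.CRC32C;

/** Hypothetical gathered writer: slices of at most 1024 buffers, checksum in the same pass. */
final class GatherWriter {
    static final int GATHER_LIMIT = 1024;   // mirrors the IOV_MAX-sized slices described above

    /** Writes every buffer to a blocking channel and returns the CRC32C of the bytes written. */
    static long writeAll(GatheringByteChannel channel, List<ByteBuffer> buffers) throws IOException {
        CRC32C crc = new CRC32C();
        for (int start = 0; start < buffers.size(); start += GATHER_LIMIT) {
            int end = Math.min(buffers.size(), start + GATHER_LIMIT);
            ByteBuffer[] slice = buffers.subList(start, end).toArray(new ByteBuffer[0]);
            long remaining = 0;
            for (ByteBuffer buffer : slice) {
                // duplicate() gives the checksum its own position, leaving the source readable.
                crc.update(buffer.duplicate());
                remaining += buffer.remaining();
            }
            while (remaining > 0) {
                // A gathering write may be partial; the buffers' positions track progress.
                remaining -= channel.write(slice);
            }
        }
        return crc.getValue();
    }
}
```

FileChannel implements GatheringByteChannel, so a channel opened on the temp file can be passed in directly.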

Step 4 — fsync. Optional. fileChannel.force(forceMetadata) issues fdatasync (or fsync if metadata is included) to push the bytes to persistent storage before returning. On a synchronously-replicated multi-AZ FSx, this also blocks until the standby AZ has acknowledged the write. It's expensive (in latency, not bytes) but it gives you durability before the rename. Most workloads can skip it if they tolerate the small window between rename and standby-AZ sync; high-durability workloads should turn it on.

Step 5 — atomic rename. Files.move(tempFile, finalFile, StandardCopyOption.ATOMIC_MOVE). POSIX guarantees that a rename(2) is atomic on the same filesystem: an observer either sees the old name or the new name, never both, never neither, never a partial state. Translated to our case: a consumer either sees data.bin (and can read the full payload) or doesn't see it at all (and will retry later). It never sees a partial data.bin.

This guarantee is what makes the whole pattern correct. There is no other coordination, no manifest file, no two-phase commit. The single atomic rename is the publish moment.

There's a subtle case: the Files.move documentation leaves it implementation-specific whether ATOMIC_MOVE replaces an existing target or fails with an IOException, and on POSIX filesystems rename(2) replaces silently. Before issuing the move we therefore probe for an existing data.bin. If it exists, the producer treats this as an idempotent retry (more on idempotency in a moment).
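Steps 4 and 5 plus the existence probe can be sketched in a few lines. This is an illustration under assumptions, not the framework's code: AtomicPublishSketch and publish are hypothetical names, the temp suffix uses a random UUID in place of the article's token-and-attempt scheme, and the probe-then-move sequence is only safe when a single producer owns a given batch id.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.UUID;

final class AtomicPublishSketch {
    /** Returns true if this call published the file, false if finalFile already existed. */
    static boolean publish(Path finalFile, ByteBuffer payload, boolean fsync) throws IOException {
        if (Files.exists(finalFile)) {
            return false; // idempotent retry; the caller is expected to check the size
        }
        Files.createDirectories(finalFile.getParent());
        // Keep the temp file in the same directory so the rename stays on one filesystem.
        Path tempFile = finalFile.resolveSibling(
                finalFile.getFileName() + ".tmp." + UUID.randomUUID());
        try (FileChannel channel = FileChannel.open(tempFile,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            while (payload.hasRemaining()) {
                channel.write(payload);
            }
            if (fsync) {
                channel.force(false); // flush file data, not metadata
            }
        }
        // The publish moment: readers see either no data.bin or the complete one.
        Files.move(tempFile, finalFile, StandardCopyOption.ATOMIC_MOVE);
        return true;
    }
}
```

If the process dies anywhere before the move, only the uniquely named temp file is left behind, which is what makes the orphan sweep safe.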

Step 6 — publish pointer. Only after the rename succeeds do we publish the notification message. The order matters: if we published first and then the rename failed, consumers would chase a phantom file.

The pointer message is intentionally tiny:

{
  "filePath": "/fsx/<partition>/year=2026/month=05/day=24/hour=14/batch-abc123/data.bin",
  "batchId": "abc123",
  "partitionId": "<partition>",
  "sizeBytes": 524288,
  "crc32c": "f3c19a2b"
}

Two hundred bytes, give or take. The bus carries the pointer. The filesystem holds the payload. The transport bill goes from "scale with bytes" to "scale with message count" — and message count for a typical batching workload is in the hundreds-per-second range, well within any bus's comfort zone.

Idempotent retry without re-reading the bytes

Retries are unavoidable. Producers crash, networks blip, NFS metadata operations stall. The framework has to handle all of these without producing duplicate files or losing the in-flight write.

The idempotency rule I use is intentionally simple: if the final data.bin already exists at the expected path, treat it as a successful prior write — without re-reading the bytes.

if (this.fileSystemOperations.exists(finalFilePath)) {
    return existingResultAsync(request, finalFilePath, payload, expectedBytes, attempt);
}

// in toExistingResult:
long existingBytes = this.fileSystemOperations.size(finalFilePath);
if (existingBytes != expectedBytes) {
    throw new FsxWriteException("Existing FSx file does not match expected payload size: " + finalFilePath);
}
return toWriteResult(request, finalFilePath, existingBytes, expectedChecksumCrc32c, attempt, true);

The size-only check is deliberate. Re-reading a multi-megabyte compressed payload over NFS just to verify a checksum would double the IO cost of every retry, and the atomic-rename protocol already ensures that if the final file exists, it contains a complete, validated payload. The size check is a cheap sanity guardrail against bizarre cases where another process wrote a different file with the same name.

If the size matches, we return success with an idempotentExistingWrite=true flag. The caller sees a normal success result and continues. The notification publish step is then idempotent on its own end — most buses dedup on a message id you can derive deterministically from the batch id.

This is at-least-once delivery, not exactly-once. The consumer side has to dedup by batchId. That's the standard contract for event pipelines and it's fine; building exactly-once on top of at-least-once with idempotent consumers is a solved problem.
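For the consumer side of that contract, a minimal in-memory sketch of dedup by batchId might look like the following. BatchDeduplicator and markIfFirst are hypothetical names, and the bounded in-process map is an assumption made for brevity; a real deployment would more likely keep this state somewhere durable and shared between consumers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BatchDeduplicator {
    private final Map<String, Boolean> seen;

    BatchDeduplicator(int capacity) {
        // Insertion-ordered map that evicts the oldest batch id once full.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true the first time a batchId is seen, false for a redelivery. */
    synchronized boolean markIfFirst(String batchId) {
        return seen.putIfAbsent(batchId, Boolean.TRUE) == null;
    }
}
```

A consumer would call markIfFirst(pointer.batchId) before processing and simply ack the message when it returns false.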

Byte-based backpressure, not request-based

Naive backpressure limits the number of concurrent operations. "No more than 16 writes at a time." That works when every write is roughly the same size. It breaks the moment one write is 10 KB and the next is 100 MB. The 16-operation limit lets a single bad batch consume gigabytes of in-flight memory while the limiter thinks it's still doing the right thing.

Instead, the limiter I use bounds bytes in flight, not operation count:

final class FsxInFlightByteLimiter {
    private final long maxInFlightBytes;
    private final int maxInFlightOperations;
    private long inFlightBytes;
    private int inFlightOperations;
    private final Queue<PendingAcquire> pendingAcquires = new ArrayDeque<>();
    private final Object monitor = new Object();  // guards the counters and the queue

    CompletableFuture<Permit> acquire(long requestedBytes, Duration timeout,
                                      ScheduledExecutorService scheduler) {
        PendingAcquire pending = new PendingAcquire(toReservedBytes(requestedBytes));
        synchronized (monitor) {
            if (pendingAcquires.isEmpty() && canAcquire(pending)) {
                acquireNow(pending);
                completeNow(pending);
            } else {
                pendingAcquires.add(pending);
            }
        }
        scheduler.schedule(() -> timeout(pending), timeout.toMillis(), TimeUnit.MILLISECONDS);
        return pending.future();
    }

    private boolean canAcquire(PendingAcquire pending) {
        return inFlightOperations < maxInFlightOperations
                && inFlightBytes + pending.reservedBytes() <= maxInFlightBytes;
    }
}

Two dimensions:

maxInFlightOperations — caps the number of concurrent writes (still useful to prevent thundering herds on the file I/O thread pool).
maxInFlightBytes — caps the aggregate payload size of concurrent writes (the real protection against memory blow-up).

Both must be satisfied before a write is admitted. When either is not, the request is queued; when capacity becomes available (a running write completes), the queue is drained in FIFO order. New requests also queue behind anything already waiting, which is what the pendingAcquires.isEmpty() check enforces. The whole thing is non-blocking: callers get a CompletableFuture<Permit> and can compose it into their own async chain.

The detail that matters in production: oversized requests (those larger than maxInFlightBytes on their own) need to be handled. The implementation clamps reservedBytes to the limit, so an oversized write can run, but it runs alone. The caller is responsible for not handing in 10 GB requests; the framework's protection is against the runaway accumulation of "merely large" requests, not against a single pathologically-large one.

Why is this fancy enough to need its own class? Because the difference between request-based and byte-based limiting is the difference between "the limiter does what I think it's doing" and "memory exploded at 3 AM because one batch was unusually large." Subtle bugs there are expensive to debug.
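The excerpt above shows admission but not the release path, so here is a compact, self-contained sketch of the whole cycle: clamp, admit, release, FIFO drain. It is an illustration rather than the framework's class. ByteLimiterSketch and its members are hypothetical names, and timeout handling is deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

final class ByteLimiterSketch {
    /** Handle given to an admitted writer; release() returns its capacity. */
    final class Permit {
        private final long reservedBytes;
        private boolean released;
        Permit(long reservedBytes) { this.reservedBytes = reservedBytes; }
        void release() { ByteLimiterSketch.this.release(this); }
    }

    private record Pending(long reservedBytes, CompletableFuture<Permit> future) {}

    private final Object monitor = new Object();
    private final Queue<Pending> pending = new ArrayDeque<>();
    private final long maxInFlightBytes;
    private final int maxInFlightOperations;
    private long inFlightBytes;
    private int inFlightOperations;

    ByteLimiterSketch(long maxInFlightBytes, int maxInFlightOperations) {
        this.maxInFlightBytes = maxInFlightBytes;
        this.maxInFlightOperations = maxInFlightOperations;
    }

    CompletableFuture<Permit> acquire(long requestedBytes) {
        // Clamp so an oversized request can still run, but only on its own.
        long reserved = Math.min(Math.max(requestedBytes, 0L), maxInFlightBytes);
        Pending request = new Pending(reserved, new CompletableFuture<>());
        List<Pending> admitted;
        synchronized (monitor) {
            pending.add(request);
            admitted = drainLocked();
        }
        complete(admitted);
        return request.future();
    }

    private void release(Permit permit) {
        List<Pending> admitted;
        synchronized (monitor) {
            if (permit.released) {
                return; // double release is a no-op
            }
            permit.released = true;
            inFlightBytes -= permit.reservedBytes;
            inFlightOperations--;
            admitted = drainLocked();
        }
        complete(admitted);
    }

    // Admits from the head only, so a large request is never starved by small ones behind it.
    private List<Pending> drainLocked() {
        List<Pending> admitted = new ArrayList<>();
        while (!pending.isEmpty()) {
            Pending head = pending.peek();
            boolean fits = inFlightOperations < maxInFlightOperations
                    && inFlightBytes + head.reservedBytes() <= maxInFlightBytes;
            if (!fits) {
                break;
            }
            pending.remove();
            inFlightBytes += head.reservedBytes();
            inFlightOperations++;
            admitted.add(head);
        }
        return admitted;
    }

    // Futures complete outside the lock so caller callbacks never run while it is held.
    private void complete(List<Pending> admitted) {
        for (Pending p : admitted) {
            p.future().complete(new Permit(p.reservedBytes()));
        }
    }
}
```

Strict head-of-queue admission trades a little throughput for fairness: a small request that would fit still waits behind a larger one at the head.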

Naming, layout, and the partition directory tree

The filesystem layout matters more than it might seem. Here's the convention I use:

/fsx-root/
  <partitionId>/
    year=2026/
      month=05/
        day=24/
          hour=14/
            batch-<uuid>/
              data.bin
              [optional metadata files]

A few rationales worth calling out:

Partition as the top-level grouping. Within a single partition, writes are serialized by the producer's own logic. Across partitions, writes are independent. Putting partition first in the path lets consumers scan or process one partition without walking the whole tree, and lets retention policies operate at partition granularity.

Hive-style time-bucket folders. year=Y/month=M/day=D/hour=H/ is the convention Hive and Spark and a hundred analytics tools recognize. If you ever want to plug a query engine over the FSx tree (DuckDB, Presto, anything), the partitioning is already in a shape it understands. More importantly, time-bucketing prevents any single directory from accumulating millions of entries — a real performance issue for NFS metadata operations.

Per-batch directory. Each batch gets its own folder, not just its own file. This gives you room to add sidecar files later (index files, per-batch metadata, manifest JSON) without breaking the existing readers. The batch id is a UUID, so collisions are not a concern.

data.bin as the final filename. Boring, predictable, descriptive. The temp file uses the same final name with a .tmp.<token>.<attempt> suffix so the atomic rename target is unambiguous.

There's also a critical security check: the partition id is sanitized and validated to stay under the configured FSx root. A producer cannot pass a partition id like ../../../../etc/passwd and write outside the intended tree. The sanitizer rejects path-traversal characters and the framework asserts the resolved path is a descendant of the root before doing anything with it.
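The layout and the containment check fit in one small function. The sketch below is an assumption-laden illustration: BatchPathSketch, batchDirectory and the SAFE_ID allow-list are hypothetical (the framework's actual sanitizer rules are not shown in this article), and it buckets by UTC, which is a choice the article does not state.

```java
import java.nio.file.Path;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.regex.Pattern;

final class BatchPathSketch {
    // Hypothetical allow-list: letters, digits, underscore, hyphen. No dots, no slashes.
    private static final Pattern SAFE_ID = Pattern.compile("[A-Za-z0-9_-]{1,128}");

    static Path batchDirectory(Path fsxRoot, String partitionId, Instant when, String batchId) {
        if (!SAFE_ID.matcher(partitionId).matches() || !SAFE_ID.matcher(batchId).matches()) {
            throw new IllegalArgumentException("Unsafe partition or batch id");
        }
        ZonedDateTime utc = when.atZone(ZoneOffset.UTC); // assumption: buckets are in UTC
        Path root = fsxRoot.toAbsolutePath().normalize();
        Path resolved = root.resolve(partitionId)
                .resolve(String.format("year=%04d", utc.getYear()))
                .resolve(String.format("month=%02d", utc.getMonthValue()))
                .resolve(String.format("day=%02d", utc.getDayOfMonth()))
                .resolve(String.format("hour=%02d", utc.getHour()))
                .resolve("batch-" + batchId)
                .normalize();
        // Second line of defence: even if the allow-list is loosened later,
        // the resolved path must still sit under the root.
        if (!resolved.startsWith(root)) {
            throw new IllegalArgumentException("Resolved path escapes the FSx root");
        }
        return resolved;
    }
}
```

Path.startsWith compares whole path elements, so /fsx-root-other does not count as being under /fsx-root the way a string prefix check would allow.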

The Java library question, and why "kernel NFS + plain NIO" wins

When you start looking for a Java library to talk to FSx OpenZFS, you find a mess. There's no official AWS SDK for FSx data-plane operations — the AWS SDK only handles the management plane (create/destroy file systems). For the actual file I/O you're on your own.

I went down the rabbit hole:

aws-sdk-java-v2 + S3AsyncClient + S3 Access Points for FSx. Works if you expose your FSx file system through an S3 access point. Mature, async, multipart, retries, integrates with the AWS SDK ecosystem. The downside is the S3-style access has tens-of-milliseconds latency rather than the sub-millisecond NFS-mount latency. For batch jobs that's fine; for hot transitional data you give up most of the perf advantage.
dCache/nfs4j. A pure-Java NFSv3 and NFSv4 implementation. The most serious Java NFS library available. It's actively maintained, has a JMH benchmarks module, runs on Java 17. If you absolutely need to write your own NFS client (or extend one), this is where you'd start. But it's not a turnkey FSx client — you'd be building protocol code, and AWS's own performance documentation is opinionated about mount-time options that are easier to apply with the kernel client.
EMCECS/nfs-client-java. NFSv3 only, dependent on a years-old version of Netty. Workable for legacy use, not a foundation for a petabyte-scale system in 2026.
SMBJ, jcifs. SMB protocol clients. Wrong protocol family — FSx for OpenZFS doesn't speak SMB.
Hadoop / Spark NFS connectors. Useful for ideas about request pipelining, not a foundation.

The conclusion I came to is the boring one: mount the FSx file system with the kernel NFS client, and use the standard Java NIO FileChannel / AsynchronousFileChannel against the mount point. AWS's tuning guidance is much more opinionated about client-side mount options than about language libraries. Specifically:

Use nconnect=16 (or up to your kernel's supported maximum) to parallelize NFS over 16 TCP connections.
Set rsize=1048576 and wsize=1048576 for 1 MiB read/write chunks.
Use NFSv4.1 for the locking and the cleaner failover semantics.
Place producers and consumers in the same AZ as the file system primary for sub-ms latency and to avoid cross-AZ data transfer charges.

Once you do those things, the kernel does the heavy lifting. Java NIO becomes the thinnest possible wrapper over the syscalls. A userspace NFS client would be competing with the kernel for performance, and the kernel wins.

So the implementation pattern is:

producer/consumer container
        │
        ▼
   /fsx-root/  (Linux NFS mount, nconnect=16, NFSv4.1)
        │
        ▼
   AsynchronousFileChannel, FileChannel.write(ByteBuffer[])

That's it. No custom protocol code. No special drivers. No language-specific clients. Just the things the kernel is already tuned to do well.

Failure semantics, in painful detail

The honest list of what happens when things go wrong.

Producer crashes after writing the temp file but before the atomic rename. The temp file is an orphan. No consumer ever sees it (consumers only look for data.bin). The producer's next attempt uses a new unique temp filename, so it doesn't collide. The orphan is cleaned up by a TTL job or a periodic sweep — the framework explicitly does not handle this, because the cleanup cadence is a deployment decision.

Producer crashes after the rename but before publishing the notification. This is the dangerous one. The file is on FSx, but no consumer knows it exists. On the next retry, the producer attempts to write again with the same batch id, sees the existing data.bin, recognizes it as an idempotent retry, and proceeds to publish the notification. The consumer dedups by batch id so it doesn't matter that the producer might publish twice. The net result is at-least-once delivery, with effectively-once processing preserved by the consumer's dedup.

FSx write succeeds, notification publish fails. Same picture as above. The file stays on FSx. The framework returns the publish failure to the caller, who is expected to retry. Cleanup is deferred to the retention policy. The pattern is explicitly non-transactional: FSx and the bus are composed, not atomically coupled.

Notification publish succeeds, FSx file is later corrupted or deleted. This shouldn't happen — FSx is durable storage — but if it did, the consumer would get a read error and fail the message. With a dead-letter queue it would surface as an actionable alert. The CRC32C in the pointer message lets the consumer detect a corrupted file before deserializing.
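A consumer-side check against the pointer could look like this sketch. PointerVerifySketch and matchesPointer are hypothetical names, and it assumes the crc32c field is the unsigned 32-bit value rendered as hex, which is what the example pointer message suggests but the article does not spell out.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32C;

final class PointerVerifySketch {
    /** True when the file's size and CRC32C both match the pointer message. */
    static boolean matchesPointer(Path file, long expectedSizeBytes, String expectedCrc32cHex)
            throws IOException {
        if (Files.size(file) != expectedSizeBytes) {
            return false; // cheap check first, no bytes read
        }
        CRC32C crc = new CRC32C();
        ByteBuffer buffer = ByteBuffer.allocate(1 << 20); // 1 MiB read chunks
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            while (channel.read(buffer) != -1) {
                buffer.flip();
                crc.update(buffer);
                buffer.clear();
            }
        }
        return crc.getValue() == Long.parseLong(expectedCrc32cHex, 16);
    }
}
```

In a streaming consumer the checksum would normally be folded into the same read that feeds the deserializer, so the file is only read once.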

Consumer crashes mid-read. Bus message isn't acked. After visibility timeout, another consumer picks it up and reads the same file. Same file, same bytes, same processing. Dedup by batch id at the consumer side keeps the processing semantics correct.

FSx becomes unavailable mid-write. Writes fail with retryable exceptions. The framework retries with exponential backoff and jitter (the standard schedule: base delay, doubled on each failure, with a random jitter component bounded by the current ceiling). After the max attempts, the failure is propagated to the caller, who is expected to translate it into a rate-limit response or a circuit-break upstream. Critically, the framework does not sleep on the I/O executor thread. Retries are scheduled on a dedicated scheduler so the I/O threads stay free to handle other writes.
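One reading of that backoff schedule, sketched with hypothetical names (BackoffSketch, ceilingMillis, jitteredDelayMillis): the ceiling doubles per attempt up to a cap, and the actual delay is drawn uniformly between zero and the ceiling. The framework's exact jitter formula is not shown in this article, so treat the "full jitter" choice here as an assumption.

```java
import java.util.concurrent.ThreadLocalRandom;

final class BackoffSketch {
    /** Ceiling for a 1-based attempt: base, 2*base, 4*base, ... capped at maxMillis. */
    static long ceilingMillis(long baseMillis, long maxMillis, int attempt) {
        long ceiling = Math.min(baseMillis, maxMillis);
        for (int i = 1; i < attempt && ceiling < maxMillis; i++) {
            // Saturate instead of overflowing when doubling would pass the cap.
            ceiling = (ceiling > maxMillis / 2) ? maxMillis : ceiling * 2;
        }
        return ceiling;
    }

    /** Delay drawn uniformly from [0, ceiling], so concurrent retries spread out. */
    static long jitteredDelayMillis(long baseMillis, long maxMillis, int attempt) {
        long ceiling = ceilingMillis(baseMillis, maxMillis, attempt);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

The returned delay would be handed to ScheduledExecutorService.schedule on the retry scheduler, never to Thread.sleep on an I/O thread.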

FSx is healthy but slow (a metadata operation stalls). The outer write timeout protects callers. The framework wraps the write future in a withTimeout that completes the caller-visible future with a TimeoutException after a configured deadline, while the actual write is allowed to complete in the background. The framework holds the pooled direct buffer reference until the real write finishes, not until the timeout fires, so we never release memory that NIO is still using. This subtle bookkeeping difference is, in production, the difference between "occasional timeouts" and "occasional segfaults."

That last point is worth dwelling on. Caller-visible timeouts must not free resources that the underlying I/O still owns. The pattern I use:

CompletableFuture<FsxFileWriteResult> writeFuture = inFlightByteLimiter.acquire(...)
        .thenCompose(permit -> writeWithPermit(request, payload, payloadBytes, permit));

writeFuture.whenComplete((result, throwable) -> payload.close());

return withTimeout(writeFuture, configuration.getWriteTimeout())
        .whenComplete((result, throwable) -> recordWriteMetrics(...));

The payload.close() only runs when the inner writeFuture completes for real, regardless of whether the outer timeout-wrapped future has already returned. The buffer reference outlives the caller-visible future, by design. This is the kind of detail that doesn't show up in any architecture diagram but determines whether your system stays up under load.
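The withTimeout helper itself is not shown above, so here is a minimal sketch of how such a wrapper can be written. TimeoutSketch and the explicit scheduler parameter are assumptions for illustration. The key property is that the timer completes only the outer future and never touches the inner one. By contrast, calling CompletableFuture.orTimeout directly on writeFuture would complete that same future and fire its close callback early.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutSketch {
    static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> inner,
                                                Duration timeout,
                                                ScheduledExecutorService scheduler) {
        CompletableFuture<T> outer = new CompletableFuture<>();
        // The timer fails only the caller-visible future; inner keeps running.
        ScheduledFuture<?> timer = scheduler.schedule(() -> {
            outer.completeExceptionally(
                    new TimeoutException("write timed out after " + timeout));
        }, timeout.toMillis(), TimeUnit.MILLISECONDS);

        inner.whenComplete((result, error) -> {
            timer.cancel(false);
            // These are no-ops if the timer already completed outer.
            if (error != null) {
                outer.completeExceptionally(error);
            } else {
                outer.complete(result);
            }
        });
        return outer;
    }
}
```

Any cleanup attached to the inner future, such as payload.close(), therefore still runs exactly when the real write finishes.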

Cost worked example at 100 TB/day

Let's do the actual math. The workload: 100 TB/day of compressed transitional data, 24-hour retention, multi-AZ durability required. List prices in us-east-1.

Option	Approximate monthly cost	Why
EFS Elastic Throughput	~$300K	$0.30/GB-month storage + $0.03/GB read + $0.06/GB written kills you at scale
EFS Provisioned Throughput	~$90K	Storage + ~5 GB/s provisioned throughput
MSK provisioned cluster	~$70–90K	Brokers + EBS + cross-AZ replication
Kinesis Data Streams (700 shards)	~$70K	Shards + PUT units, before consumer fan-out
S3 direct from producers	~$67K	$0.023/GB storage cheap, but the small-files PUT overhead and operational complexity push effective cost up
FSx for OpenZFS SSD	~$27K	$0.18/GB-month storage + ~5 GB/s provisioned throughput
FSx for OpenZFS Intelligent-Tiering	~$5K	Same throughput, but storage tier-shifts to Archive within days

Two clarifications. First, these are illustrative ballpark figures for the same shape of workload, not benchmarked actuals. Your numbers will differ. Second, the FSx Intelligent-Tiering number assumes a typical transitional-data access pattern — files are read within the first few minutes and then never touched again, so they migrate to the Archive tier quickly. If your access pattern is heavier (consumers re-reading historical data frequently), the savings shrink because more data stays warm.

The headline numbers are real, though. Moving from a streaming-bus model to a filesystem-pointer model on SSD storage cuts the bill by roughly 2.5x at this scale (~$70K to ~$27K). Adding Intelligent-Tiering knocks it down by another factor of five-ish, for a bit more than an order of magnitude overall. For a 100 TB/day workload you're looking at moving from ~$70K/month to ~$5–27K/month. That's not a marginal optimization. That's the kind of saving that funds the project on its own.
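As a sanity check on the table, the EFS Elastic Throughput row can be roughly reproduced from the per-GB prices quoted in it. The sketch below rests on assumptions the table does not state: decimal units (1 TB = 1,000 GB), a 30-day month, about one day's worth of data resident at any time (24-hour retention), and every byte written once and read once. EfsCostSketch is a hypothetical name.

```java
final class EfsCostSketch {
    static double monthlyUsd(double tbPerDay, int daysPerMonth) {
        double gbPerDay = tbPerDay * 1_000;             // decimal TB to GB
        double residentGb = gbPerDay;                   // 24-hour retention: ~one day on disk
        double storage = residentGb * 0.30;             // $0.30 per GB-month
        double written = gbPerDay * daysPerMonth * 0.06; // $0.06 per GB written
        double read = gbPerDay * daysPerMonth * 0.03;    // $0.03 per GB read, each byte read once
        return storage + written + read;
    }

    public static void main(String[] args) {
        // 30,000 storage + 180,000 written + 90,000 read, about 300,000 in total
        System.out.printf("%.0f%n", monthlyUsd(100, 30));
    }
}
```

Under those assumptions the per-byte transfer charges make up about 90% of the total, which is why the EFS number barely moves if you shorten retention.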

Observability and what to watch

The framework I work with emits a small set of metrics, all tagged by partition id, that have proven repeatedly useful in production:

fsx.write.success / fsx.write.failed — counters of completed writes
fsx.write.latency — write latency in milliseconds
fsx.write.bytes — total bytes written
fsx.write.retry — counter of retry attempts
fsx.write.inflight.bytes — current bytes reserved by running writes
fsx.write.inflight.operations — current running write count
fsx.write.backpressure.pending — current queue depth on the limiter

Watching inflight.bytes against maxInFlightBytes tells you immediately whether you're sized correctly. Watching backpressure.pending tells you whether producers are being throttled by the limiter (a good sign that your downstream is saturated) versus by FSx itself (which would show up as slow write.latency).

On the FSx side, watch the published CloudWatch metrics: DataReadBytes, DataWriteBytes, ClientConnections, NetworkThroughputUtilization, FileServerDiskIopsUtilization, FileServerCacheHitRatio. The cache hit ratio in particular is the early-warning signal for "I should provision an SSD read cache" if your access pattern starts re-touching aged files.

When this pattern is the wrong answer

The honest list, because the worst thing you can do with an architecture article is sell the pattern as universal.

Your traffic is small. If you're moving gigabytes per day, not terabytes per day, the cost gap closes and the operational overhead of a shared filesystem outweighs the savings. Use the streaming bus. It's fine.

Your producers and consumers are in different VPCs / regions. Cross-VPC NFS works but it's awkward. Cross-region defeats the whole point — at that point you're back to needing a real transport. If your topology is multi-region active-active, this pattern fits poorly.

You need true streaming, with downstream subscribers wanting to react to each event individually. This pattern is fundamentally batched. The minimum unit of work is a batch file, not an event. If your downstream wants per-event push semantics within single-digit-millisecond latency, you want a real streaming bus.

Your consumers are serverless functions that can't mount NFS. Lambda can't mount FSx for OpenZFS (it can mount EFS, but not FSx OpenZFS). If your consumer side is Lambda, your options are (a) use EFS instead, (b) front FSx with S3 access points and have Lambda read via S3 (sacrificing latency), or (c) use a small ECS task as an intermediary. None of those are great. Pick a different pattern.

You need the bus to provide ordering, replay, or stream-processing semantics. A pointer-on-bus model gives you ordering only if the bus already provides it (Kinesis with strict shard partitioning, Kafka with partition keys). It does not give you stream replay or windowed aggregations or any of the operations that real stream processors do. If your processing needs those, you need a real stream.

Your team's operational maturity doesn't include filesystem operations. Shared filesystems have failure modes. Stale NFS handles, mount drift, permission issues, capacity-planning surprises. If your team has historically only operated stateless services, they will be surprised by the first NFS-related incident. That's a fixable gap, but it's a gap.

For the broad middle — large volumes, in-VPC consumers, batched processing, cost-sensitive ingest — this pattern is the right answer. The decisive factor is usually the cost math. Once you see your own version of the chart above, the choice tends to make itself.

Closing

There's a recurring blind spot in cloud architecture that I've watched cost teams six- and seven-figure sums over the past few years: treating every data-in-motion problem as a streaming-bus problem. The streaming bus is a wonderful tool when its semantics — per-event delivery, low-latency push, multi-subscriber fan-out — actually match your workload. It is a remarkably expensive tool when your workload is "land bytes here, pick them up over there, then forget them."

Transitional data is that second shape. It deserves a different tool. A shared filesystem with sub-millisecond latency, multi-AZ durability, native compression, atomic rename semantics, and now an intelligent-tiering storage class that automatically migrates cooling data to cheap storage — that's the tool. FSx for OpenZFS is the specific implementation that happens to package all of those properties together at AWS today, but the pattern works against any filesystem with the same properties (in another decade, on another cloud, it'll be something else).

The architecture is small. Producers do an atomic write to FSx and publish a tiny pointer message. Consumers read the pointer and stream the bytes from a mount. The bus carries metadata, not payload. The cost shifts from "scale with byte volume" to "scale with stored volume," and the storage class itself takes care of the cooling story.

The code that implements this well — backpressure by bytes not requests, safe write protocol with temp+rename, idempotent retry without re-reads, careful timeout/buffer-lifetime bookkeeping — is small enough to fit in one engineer's head and robust enough to run for years untouched. The hard part was never the protocol. The hard part was unlearning the reflex that says "data in motion = put it on a stream."

If you take one thing from this, take this: put a calendar reminder for next week to look at your own AWS bill, find the line item with the highest cost-per-byte-of-useful-life, and ask yourself whether the bytes are actually getting their money's worth from the transport they're on. The answer might surprise you. The fix is usually simpler than the architecture diagram makes it look.

Steal the pattern. It works.