DEV Community: Anapeksha Mukherjee

What Designing a Binary Protocol Actually Taught Me

Anapeksha Mukherjee — Thu, 11 Jun 2026 06:10:07 +0000

Most developers never have to design a network protocol from scratch. You use HTTP, gRPC, WebSockets, or something else that already exists and has been debugged by thousands of people over many years. That is the right call for most situations.

I did not take that path when building Vaylix, a key-value database engine. I designed a custom binary protocol called VTP2, and the process taught me things about networking that I would not have picked up any other way.

This is not an argument that you should also build a custom protocol. For most things, you should not. This is an honest account of what I ran into.

Why not HTTP

The first question anyone reasonably asks is: why not just use HTTP?

HTTP is everywhere. The tooling is excellent. Every language has a client. Debugging with curl is trivial. If I had used HTTP, I would have had working client libraries in a dozen languages before writing a single line of server code.

The problem is that HTTP is stateless by design. Every request is independent. Every request carries headers. Every response carries headers. The model assumes that each round trip is a fresh conversation with no memory of what came before.

A database session is the opposite of that. A client connects, authenticates, and then issues many commands over the same connection. The authentication should happen once. The session should carry state. Pipelining requests without waiting for each response to return should be natural, not something you fight the protocol to achieve.

HTTP/2 closes some of this gap. But using HTTP/2 correctly for a stateful session model involves working against the grain of what HTTP was designed for. I would have been spending a lot of time on infrastructure that exists to make HTTP behave less like HTTP.

The other issue is overhead. HTTP headers are verbose. For small key-value operations, the headers can easily exceed the payload. That felt wrong for something designed to be a tight operational data store.

So I went with TCP directly, with a custom framing layer on top.

The first thing TCP teaches you

TCP is a stream. Not a sequence of messages. A stream.

When a client sends two requests back to back, the server cannot assume they arrive as two separate chunks of bytes. They might arrive together. They might arrive in three pieces. One might arrive whole while the other is split across two read calls.

The first real problem a custom protocol has to solve is: where does one message end and the next begin?

The standard answer is a length-prefixed frame. Every message starts with a fixed-size header that includes the length of the payload that follows. The receiver reads the header, learns how long the body is, reads exactly that many bytes, and now has one complete message.

+--------+-------+---------+------ ... ------+
| magic  |  ver  |  flags  |    payload      |
| 4 bytes| 1 byte| 2 bytes | length bytes    |
+--------+-------+---------+------ ... ------+
         |                 |
         header (fixed)    body (variable)

Simple in theory. The implementation detail that catches you is the partial read. If you ask for 16 bytes and only 10 arrive, you wait. If you ask for 10,000 bytes and the connection closes after 9,000, you have a truncated frame. Both of these are normal TCP behavior and your parser needs to handle them without panicking or blocking forever.

In Rust with Tokio this is manageable, but it requires explicit handling. You cannot just call read() and assume the full frame arrives.
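A minimal sketch of partial-read handling, using only the standard library rather than Tokio. The header layout here is an assumption: the diagram above does not show where the length field sits, so this sketch uses a 4-byte magic, 1-byte version, 2-byte flags, and a 4-byte big-endian payload length. `FrameDecoder`, `MAGIC`, and `MAX_PAYLOAD` are hypothetical names, not VTP2's actual types.

```rust
// Hypothetical header layout (an assumption, not the real VTP2 layout):
// magic (4) | version (1) | flags (2) | payload length (4, big-endian)
pub const MAGIC: [u8; 4] = *b"VTP2";
pub const HEADER_LEN: usize = 11;
pub const MAX_PAYLOAD: usize = 1 << 20; // refuse absurd lengths instead of allocating

#[derive(Debug, PartialEq)]
pub struct Frame {
    pub version: u8,
    pub flags: u16,
    pub payload: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub enum FrameError {
    BadMagic,
    TooLarge(usize),
}

/// Accumulates bytes from however many reads it takes and yields whole frames.
#[derive(Default)]
pub struct FrameDecoder {
    buf: Vec<u8>,
}

impl FrameDecoder {
    /// Feed whatever the last read returned, even if it is only a fragment.
    pub fn push(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
    }

    /// Ok(None) means "not enough bytes yet, read more".
    pub fn next_frame(&mut self) -> Result<Option<Frame>, FrameError> {
        if self.buf.len() < HEADER_LEN {
            return Ok(None);
        }
        if self.buf[..4] != MAGIC {
            return Err(FrameError::BadMagic);
        }
        let version = self.buf[4];
        let flags = u16::from_be_bytes([self.buf[5], self.buf[6]]);
        let len =
            u32::from_be_bytes([self.buf[7], self.buf[8], self.buf[9], self.buf[10]]) as usize;
        if len > MAX_PAYLOAD {
            return Err(FrameError::TooLarge(len));
        }
        if self.buf.len() < HEADER_LEN + len {
            return Ok(None); // header is complete, body is still arriving
        }
        let payload = self.buf[HEADER_LEN..HEADER_LEN + len].to_vec();
        self.buf.drain(..HEADER_LEN + len);
        Ok(Some(Frame { version, flags, payload }))
    }

    /// Bytes left over when the connection closes indicate a truncated frame.
    pub fn has_partial_frame(&self) -> bool {
        !self.buf.is_empty()
    }
}
```

The read loop calls `push` after every read and drains `next_frame` until it returns `Ok(None)`. If the peer closes while `has_partial_frame` is true, the last frame was truncated and should be reported as an error rather than waited on.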

Versioning is a commitment you make to every future client

Once you have framing, the next thing you want is versioning. Not because you plan to break anything, but because you will, and you want a way to handle it gracefully when you do.

VTP2 includes the protocol version in every frame header. This sounds straightforward until you think about what compatibility actually means across versions.

A client built against protocol version 2 connects to a server running version 3. What should happen? There are two reasonable answers:

The server accepts the connection and negotiates down to the common version.
The server rejects the connection with a structured error explaining what versions it supports.

VTP2 uses a startup negotiation step. Before any command frames are exchanged, the client sends a hello frame with its protocol version, its client name and version, and the capabilities it wants to use. The server responds with what it accepts.

This means adding a new capability in a future version is safe because older clients simply do not request it. They get a connection without the new capability and everything works as before.

What you cannot do without breaking things is change the meaning of an existing opcode or restructure an existing response format. Those are wire-breaking changes. The version number is the signal that something fundamentally changed.

I learned this the hard way. In an early version of VTP2, I changed the response format for EXEC to return structured typed results instead of a string list. That was a correctness improvement. It was also a silent breaking change for any client that had already parsed the old response format. Now that is a protocol version boundary: 0.2.x clients are not transaction-wire-compatible with 0.3.0 servers, and the changelog says so explicitly.

Request IDs are not optional when you pipeline

Early in the design, requests used a local counter for identification. It was simple. It was wrong.

When you pipeline requests over a single connection, you might have dozens of in-flight requests at the same time with responses arriving in any order depending on how long each operation takes. If two connections both generate request IDs from a local counter, they can collide. If one connection's counter resets, it can collide with itself.

VTP2 switched to UUIDs for request IDs. Every request carries a UUID. Every response echoes back the same UUID. The client correlates responses to requests using the UUID, not position.

This removes the ordering assumption entirely. Responses can arrive in any order. The client matches them correctly regardless.
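Client-side correlation can be sketched as a map from request ID to whatever is waiting on the response. This is an illustration under assumptions: `Pending` and `RequestId` are hypothetical names, the ID is a raw 16-byte array standing in for a UUID, and ID generation is left out because it would need a UUID library.

```rust
use std::collections::HashMap;

/// 16-byte request ID, the same size as a UUID. Generation is out of scope
/// here; a real client would use a UUID library.
pub type RequestId = [u8; 16];

/// Tracks in-flight requests so responses can be matched in any order.
#[derive(Default)]
pub struct Pending {
    in_flight: HashMap<RequestId, String>, // value: the command that was sent
}

impl Pending {
    /// Record a request before it goes on the wire.
    pub fn sent(&mut self, id: RequestId, command: &str) {
        self.in_flight.insert(id, command.to_string());
    }

    /// Match a response by its echoed ID. None means the ID is unknown,
    /// which a client should treat as a protocol error rather than guess.
    pub fn completed(&mut self, id: &RequestId) -> Option<String> {
        self.in_flight.remove(id)
    }

    /// Number of requests still waiting for a response.
    pub fn outstanding(&self) -> usize {
        self.in_flight.len()
    }
}
```

Because lookup is by ID, the second request's response can be handled before the first's without any special casing.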

The cost is 16 bytes per request and per response for the UUID. For the workloads Vaylix targets, that is irrelevant. For a high-throughput system doing millions of tiny operations per second, it might be worth revisiting. For coordination state, it is the right tradeoff.

Checksums are the difference between silent corruption and a caught error

A frame travels from the client through the OS, the network stack, maybe some middleware, and into the server. Bytes can be flipped. Not often, and not in a way you can reproduce on demand. But it happens.

Without a checksum, a corrupted frame is processed as if it were valid. The server executes a command with wrong arguments, or writes garbage to the store, or produces a result nobody asked for. The error is silent and the consequences are unpredictable.

VTP2 includes a checksum in the frame header that covers the payload. If the checksum does not match, the frame is rejected before any processing happens. The client gets a structured error with the expected and actual checksum values. The server logs it. Nothing gets executed.
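A sketch of the reject-before-processing step. The article does not say which checksum algorithm VTP2 uses, so CRC-32 (IEEE) is an assumption chosen because it needs no dependencies. `verify_payload` and `ChecksumMismatch` are hypothetical names.

```rust
/// CRC-32 (IEEE 802.3), bitwise implementation. An assumption for
/// illustration; the actual VTP2 checksum algorithm is not stated.
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all ones if the low bit is set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Carries both values so the error can be reported and logged.
#[derive(Debug, PartialEq)]
pub struct ChecksumMismatch {
    pub expected: u32,
    pub actual: u32,
}

/// Validates the payload bytes exactly as received, before any decoding.
pub fn verify_payload(payload: &[u8], expected: u32) -> Result<(), ChecksumMismatch> {
    let actual = crc32(payload);
    if actual == expected {
        Ok(())
    } else {
        Err(ChecksumMismatch { expected, actual })
    }
}
```

The handler calls `verify_payload` first and returns the structured error on `Err`, so a corrupted frame never reaches command parsing.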

One subtlety: Vaylix uses zstd compression on outbound frames above a size threshold. The checksum validates the compressed payload, not the decompressed payload. This means corruption of the compressed bytes in transit is caught before decompression is attempted, but a compressor or decompressor bug that produces the wrong bytes is not, because the checksum never sees the uncompressed data. That asymmetry is deliberate and documented, but it is the kind of thing that is easy to get backwards if you do not think it through.

Error codes need to be stable forever

Error handling is the part of protocol design that is easiest to under-invest in early and hardest to fix later.

The naive approach is to return error strings. A server returns "key not found" and the client parses the string. This works until you change the error message for any reason, which you will, and every client that pattern-matched the string silently breaks.

VTP2 uses structured errors with three fields: a stable numeric code, a stable string name, and a human-readable message that can change freely.

{
  code: 4001,
  name: "KEY_NOT_FOUND",
  message: "the key 'config:env' does not exist"
}

Client code matches on the numeric code or the name. The message is for humans debugging the problem. You can change the message text without breaking anything. You cannot change the code or the name without a versioned protocol change.

The error codes are now documented in ERROR_CODES.md and treated as a stability contract. Any code that shipped in a release will not be reused for a different failure class. Adding new codes is fine. Changing old ones is a breaking change.
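On the client side, matching on the stable code might look like the sketch below. Only code 4001 and the name KEY_NOT_FOUND come from the example above; `WireError`, `ErrorKind`, and `classify` are hypothetical names, and the `u16` width of the code is an assumption.

```rust
/// Wire-level error as described above: stable code, stable name,
/// free-form message.
#[derive(Debug, Clone)]
pub struct WireError {
    pub code: u16,
    pub name: String,
    pub message: String,
}

/// Client-side classification. Unknown codes fall into Other so that codes
/// added by newer servers do not break older clients.
#[derive(Debug, PartialEq)]
pub enum ErrorKind {
    KeyNotFound,
    Other(u16),
}

/// Decides on the numeric code only; the message text is never inspected.
pub fn classify(err: &WireError) -> ErrorKind {
    match err.code {
        4001 => ErrorKind::KeyNotFound,
        other => ErrorKind::Other(other),
    }
}
```

Two errors with the same code and different message text classify identically, which is the property that lets the server reword messages freely.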

This seems like a lot of discipline for a small project. It is. But the alternative is telling users that their error handling broke because I rewrote a string.

Capability negotiation solves the feature drift problem

As a protocol evolves, new features get added. Compression. Request deadlines. Trace context propagation. Metrics. Each of these is useful in some contexts and irrelevant or undesirable in others.

Hard-coding every feature into every connection creates two problems. Clients that do not need compression still pay the negotiation cost. Servers that add a new feature have no way to know which connected clients support it.

VTP2 uses capability negotiation in the startup hello. The client lists the capabilities it wants. The server lists what it accepts. The intersection is what the connection uses.

Current capabilities:

zstd — frame-level compression
request_deadline — per-request timeout propagation
server_metrics — server-side metric events
pipelining — explicit pipeline mode
trace_context — distributed trace ID propagation

Adding a new capability in a future release is safe because existing clients just do not request it. The server enables it only for clients that ask. There is no flag day where all clients must be updated simultaneously.
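The intersection rule is small enough to sketch directly. `negotiate` is a hypothetical helper, and capabilities are modeled as plain strings; how VTP2 encodes them on the wire is not covered here.

```rust
use std::collections::BTreeSet;

/// Returns the capabilities active on a connection: those the client
/// requested that the server also accepts.
pub fn negotiate(requested: &[&str], accepted_by_server: &[&str]) -> BTreeSet<String> {
    let server: BTreeSet<&str> = accepted_by_server.iter().copied().collect();
    requested
        .iter()
        .copied()
        .filter(|cap| server.contains(cap))
        .map(String::from)
        .collect()
}
```

A client that asks for a capability the server has never heard of simply does not get it, and a client that never asks for a newer capability is unaffected by its existence.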

What I would do differently

The startup negotiation adds latency to connection establishment. One extra round trip before any commands can be sent. For long-lived connections this is a one-time cost and irrelevant. For workloads with short-lived connections, it adds up.

If I were starting over, I would think harder about whether the hello/server-hello round trip could be combined with the first command frame or at least pipelined without waiting for the server hello response before sending the first request.

I also underestimated how large the per-language SDK burden would be. Every language binding starts from scratch with VTP2. There is no existing tooling, no existing parser, no existing test suite. A first-class TypeScript SDK exists now and a Go SDK is in progress, but each one is weeks of work that would have been avoided with RESP or gRPC.

Whether that tradeoff was worth it depends on what the protocol enables. For a system where the protocol needs to carry replication metadata, structured error codes, versioned CAS operations, and request deadlines on the same transport, building something that was designed for all of that from the start made the implementation cleaner than it would have been if those features had been retrofitted onto RESP.

But that is a judgment call that only makes sense in hindsight.

The short version

TCP gives you a stream, not messages. Framing is your problem.

Versioning is a commitment you make to every client that ever connects. Get it wrong early and you pay later.

Request IDs need to be globally unique if you pipeline.

Checksums catch corruption before it becomes a silent bug.

Error codes are forever. Treat them that way from the start.

Capability negotiation is the only sane way to evolve a protocol without breaking existing clients.

None of these are surprising in retrospect. Building VTP2 was the only way I was going to understand them properly.

VTP2 is the transport protocol powering Vaylix, an open source key-value database engine built for operational state that must survive crashes.

vaylix / vaylix

A key-value database engine in Rust. Custom binary protocol, RBAC, encrypted WAL persistence, Raft-style replication, TLS/mTLS, and binary-safe values with versioned compare-and-set.

Vaylix

Vaylix is a Rust key/value database built around a strict transport boundary:

client -> transport -> TCP/TLS -> transport -> server -> engine

The current server stores UTF-8 keys with opaque byte values using segmented WAL plus encrypted snapshot persistence. It includes a shared framed binary transport, a Tokio multi-client server, authentication with RBAC, optional TLS/mTLS, default-on frame compression, logical backup/restore commands, offline PITR-oriented storage subcommands, maintenance mode, hash-chained audit logging, and Raft-style HA replication with automatic leader election and quorum-backed writes.

Detailed architecture context lives in LLM.md. Benchmark guidance lives in BENCHMARKING.md. Stability and compatibility contracts live in STABILITY.md, COMPATIBILITY_1_0.md, ERROR_CODES.md, NON_GOALS.md, and DEPLOYMENT.md.

Downloads

Release binaries are published from tagged releases:

Server and client archives: https://github.com/vaylix/vaylix/releases
Server image: ghcr.io/vaylix/vaylix:latest
Versioned server image example: ghcr.io/vaylix/vaylix:0.9.0

Release builds also publish SBOMs and keyless Sigstore/cosign attestations.

Run with Docker

docker pull ghcr.io/vaylix/vaylix:latest
docker

…

Write-ahead logs: what fsync actually means and why it matters

Anapeksha Mukherjee — Tue, 09 Jun 2026 08:15:32 +0000

write() returned OK. Your data did not make it to disk.

There is a line of code that almost every developer has written and trusted completely.

file.write(data)

It returns. No error. You move on.

What actually happened is that the operating system accepted your data into a buffer in memory, marked some pages as dirty, and returned control to your program. The data has not touched the disk yet. The kernel will flush it eventually, in batches, when it decides the time is right.

If the process crashes, or the machine loses power, or the kernel panics between your write and that eventual flush, your data is gone. The write call succeeded. The data did not survive.

This is not a bug. It is how operating systems work. Buffered writes are one of the most significant performance optimizations in the entire I/O stack. The kernel batches small writes into larger sequential flushes, coalesces writes to the same blocks, and avoids saturating the disk with every individual write call. For most workloads, this is exactly what you want.

For a database, it is a disaster waiting to happen.

The lie at the heart of write()

When you issue a write on a file descriptor, the data is copied from user space into the kernel's page cache. The kernel does not write the data directly to storage. It marks the pages as dirty and returns success to the caller. The kernel periodically finds dirty pages and writes them back lazily in batches to optimize write throughput.

So when your code does this:

use std::fs::File;
use std::io::Write;

let mut file = File::create("data.bin")?;
file.write_all(&payload)?;
// payload is in the kernel's page cache, not on disk

The write succeeded in the kernel's accounting. It has not reached persistent storage.

The gap between "write returned OK" and "data is on disk" is the window where a crash causes data loss. For an application that just saved a user's document, this might mean a few seconds of lost work. For a database that just told a client their write succeeded, it means a durability lie.

What fsync actually does

fsync() transfers all modified in-core data of the file referred to by the file descriptor to the disk device so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.

That blocking is the important part. fsync does not return until the hardware confirms the data is on non-volatile storage. Your program waits. The disk works. When fsync returns, you have a guarantee.

use std::fs::File;
use std::io::Write;

let mut file = File::create("data.bin")?;
file.write_all(&payload)?;
file.sync_all()?; // blocks until the device reports the data is durable
// now you can tell the client OK

The cost is real. Compared to a buffered write, fsync is slow. A modern NVMe SSD can handle tens of thousands of fsync operations per second under ideal conditions, but random synchronous writes at scale add up fast. Every database that takes durability seriously has to decide where to put this cost and how to amortize it.

There is also one subtlety worth knowing: fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that, an explicit fsync() on a file descriptor for the directory is also needed. This matters for atomic file replacement. If you write to a temp file and rename it over the old one, you need to fsync the directory too, or the rename may not survive a crash even if the file contents did.

Enter the write-ahead log

The naive solution to the durability problem is to fsync every write. Write the data, fsync, return OK to the client. That is correct but painful because every client write now pays the full disk latency cost synchronously.

The write-ahead log is the solution that most serious databases have converged on.

The name Write-Ahead Log says it all. It is a log that gets written before any risky change actually touches the database files.

Instead of writing changed data directly to its final location on disk, the database first appends a record of the change to a sequential log file, fsyncs that log entry, and only then acknowledges the write to the client. The actual data structures are updated separately, in the background, without being on the critical path for the client's write.

The write path looks like this:

client: SET config:env production

1. append entry to WAL:
   [seq:001][term:1][SET config:env production][checksum]

2. fsync WAL segment

3. update in-memory state:
   map.insert("config:env", "production")

4. return OK to client

The disk write is to the WAL, not to the primary data structure. And WAL writes are sequential appends, which are significantly faster than random writes to arbitrary locations in a data file.

If we follow this procedure, we do not need to flush data pages to disk on every transaction commit, because we know that in the event of a crash we will be able to recover the database using the log: any changes that have not been applied to the data pages can be redone from the WAL records.
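The four steps of the write path can be sketched with the standard library alone. This is an illustration, not Vaylix's implementation: `Store` is a hypothetical type, the record format is simplified to a sequence number, a length, and the command text, and the term and checksum fields shown earlier are omitted for brevity. Replaying an existing log on open is also left out.

```rust
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

pub struct Store {
    wal: File,
    next_seq: u64,
    map: HashMap<String, String>,
}

impl Store {
    /// Opens (or creates) the log in append mode. Recovery of existing
    /// entries is out of scope for this sketch.
    pub fn open(wal_path: &Path) -> io::Result<Self> {
        let wal = OpenOptions::new().create(true).append(true).open(wal_path)?;
        Ok(Store { wal, next_seq: 1, map: HashMap::new() })
    }

    /// Returns the sequence number only after the entry is durable.
    pub fn set(&mut self, key: &str, value: &str) -> io::Result<u64> {
        let seq = self.next_seq;
        let body = format!("SET {} {}", key, value);

        // 1. append entry: [seq: u64][len: u32][body]
        let mut record = Vec::with_capacity(12 + body.len());
        record.extend_from_slice(&seq.to_be_bytes());
        record.extend_from_slice(&(body.len() as u32).to_be_bytes());
        record.extend_from_slice(body.as_bytes());
        self.wal.write_all(&record)?;

        // 2. fsync the log before the write becomes visible to anyone
        self.wal.sync_data()?;

        // 3. update in-memory state
        self.map.insert(key.to_string(), value.to_string());
        self.next_seq += 1;

        // 4. the caller may now acknowledge the client
        Ok(seq)
    }

    pub fn get(&self, key: &str) -> Option<&str> {
        self.map.get(key).map(String::as_str)
    }
}
```

If `write_all` or `sync_data` fails, `set` returns early and the in-memory map is untouched, so the store never reports a value that the log does not contain.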

What crash recovery actually looks like

When the process starts after a crash, it cannot trust its in-memory state because that state is gone. It cannot trust its primary data structures entirely because they may reflect partial writes. What it can trust is the WAL, because every entry in the WAL was fsynced before the client was told OK.

Recovery is replay:

startup:
  1. load last known good snapshot (if any)
  2. find WAL segments that postdate the snapshot
  3. replay each entry in strict sequence order
  4. skip entries past the last committed sequence
  5. in-memory state is now consistent with last committed write

This is deterministic. Given the same WAL, recovery always produces the same state. There is no ambiguity about what happened before the crash.

The sequence numbers matter. If a WAL segment has a gap, the entries after the gap are suspect. If entries are out of order, replay cannot be trusted. If a checksum fails on a WAL entry, that entry is corrupt and recovery should fail closed rather than apply a corrupted change.

In Vaylix, startup recovery fails closed on any of these conditions:

gap in sequence numbers → startup error, actionable message
out of order entries    → startup error
checksum mismatch       → startup error
unsupported format      → startup error

Failing closed is the correct choice. A database that recovers silently from a corrupt WAL and presents a subtly wrong state is more dangerous than one that refuses to start and tells you exactly what is wrong.
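A sketch of fail-closed replay, assuming entries have already been decoded and their checksums verified into a boolean. `WalEntry`, `RecoveryError`, and `replay` are hypothetical names; the unsupported-format case is left out because it belongs to decoding.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct WalEntry {
    pub seq: u64,
    pub key: String,
    pub value: String,
    pub checksum_ok: bool, // result of verifying the stored checksum
}

#[derive(Debug, PartialEq)]
pub enum RecoveryError {
    Gap { expected: u64, found: u64 },
    OutOfOrder { expected: u64, found: u64 },
    ChecksumMismatch { seq: u64 },
}

/// Replays entries on top of snapshot state. `snapshot_seq` is the last
/// sequence number the snapshot already contains. Any anomaly aborts the
/// whole recovery instead of applying a partial or suspect history.
pub fn replay(
    mut state: HashMap<String, String>,
    snapshot_seq: u64,
    entries: &[WalEntry],
) -> Result<HashMap<String, String>, RecoveryError> {
    let mut expected = snapshot_seq + 1;
    for e in entries {
        if e.seq < expected {
            return Err(RecoveryError::OutOfOrder { expected, found: e.seq });
        }
        if e.seq > expected {
            return Err(RecoveryError::Gap { expected, found: e.seq });
        }
        if !e.checksum_ok {
            return Err(RecoveryError::ChecksumMismatch { seq: e.seq });
        }
        state.insert(e.key.clone(), e.value.clone());
        expected += 1;
    }
    Ok(state)
}
```

Each error variant carries the sequence numbers involved, which is what makes the startup message actionable rather than a bare "WAL corrupt".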

The snapshot problem

WAL segments cannot grow forever. If a process has run for months and handled millions of writes, replaying the entire WAL history on every restart is not practical.

The solution is snapshots. Periodically, the database serializes its entire current state to disk as a point-in-time snapshot. After a successful snapshot, WAL segments that predate it can be discarded. Recovery becomes: load the snapshot, then replay only the WAL entries that came after it.

But snapshots introduce their own failure modes.

What if the process crashes while writing the snapshot? A half-written snapshot file is worse than no snapshot at all, because you might load it and think you have valid state when you have garbage.

The solution is the atomic rename pattern:

1. serialize current state
2. write to a temporary file: snapshot.tmp
3. fsync the temporary file
4. fsync the parent directory
5. atomically rename snapshot.tmp → snapshot
6. fsync the parent directory again
7. write manifest pointing to new snapshot
8. fsync manifest
9. prune old WAL segments

The rename is atomic on all major filesystems. Either the old snapshot is there or the new one is. There is no state where a half-written file is the active snapshot.

The directory fsyncs around the rename are easy to forget and important. Without them, the rename itself may not survive a crash even if the file contents are fine.
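Steps 2 through 6 of the list above can be sketched with the standard library. This assumes a Unix-like system, where a directory can be opened as a file and fsynced; it is not expected to work on Windows. `write_snapshot` and `sync_dir` are hypothetical helpers, and the manifest update and WAL pruning steps are omitted.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// fsync a directory so that entries created or renamed inside it are
/// durable. Assumes a Unix-like system.
fn sync_dir(dir: &Path) -> io::Result<()> {
    File::open(dir)?.sync_all()
}

/// Writes `bytes` as the new snapshot in `dir` using the atomic rename
/// pattern. Readers only ever see the old or the new complete file.
pub fn write_snapshot(dir: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = dir.join("snapshot.tmp");
    let dst = dir.join("snapshot");

    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?; // file contents are durable
    sync_dir(dir)?; // the temp file's directory entry is durable

    fs::rename(&tmp, &dst)?; // atomic switch to the new snapshot
    sync_dir(dir) // the rename itself is durable
}
```

A crash at any point leaves either the previous `snapshot` or the new one in place; at worst a stale `snapshot.tmp` remains and can be deleted on startup.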

WAL segments and retention

A single WAL file that grows indefinitely would be inefficient to manage. Most implementations split the WAL into fixed-size segments.

In Vaylix, the active segment is named to reflect its starting sequence:

wal/
  active-001001.wal       ← current segment, still being written
  000001-000500.wal       ← sealed segment
  000501-001000.wal       ← sealed segment

When the active segment reaches the configured size limit, it is sealed: renamed to include its sequence range, and a new active segment is opened. Segments older than the configured retention window are pruned after a successful snapshot.

The sealed segment names encode their sequence range deliberately. During recovery, the system can determine the correct replay order from the filenames alone, without needing to open every segment to find its position.
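Deriving replay order from names alone might look like this. The name pattern is taken from the listing above; `parse_sealed` and `replay_order` are hypothetical helpers, and the contiguity check is an extra safeguard assumed for this sketch.

```rust
/// Parses a sealed segment name such as "000001-000500.wal" into its
/// inclusive sequence range. Names that do not match, including the
/// active segment, return None.
pub fn parse_sealed(name: &str) -> Option<(u64, u64)> {
    let stem = name.strip_suffix(".wal")?;
    let (first, last) = stem.split_once('-')?;
    let first: u64 = first.parse().ok()?;
    let last: u64 = last.parse().ok()?;
    (first <= last).then_some((first, last))
}

/// Orders sealed segments for replay and checks that consecutive ranges
/// are contiguous, without opening any file.
pub fn replay_order(names: &[&str]) -> Result<Vec<String>, String> {
    let mut segs: Vec<(u64, u64, &str)> = names
        .iter()
        .filter_map(|n| parse_sealed(n).map(|(a, b)| (a, b, *n)))
        .collect();
    segs.sort();
    for pair in segs.windows(2) {
        if pair[1].0 != pair[0].1 + 1 {
            return Err(format!("gap between {} and {}", pair[0].2, pair[1].2));
        }
    }
    Ok(segs.into_iter().map(|(_, _, n)| n.to_string()).collect())
}
```

The active segment is skipped by `parse_sealed` because its prefix is not a number, so it can be handled separately as the last segment to replay.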

Point-in-time recovery

A WAL that records every change is also a time machine.

If you have a snapshot from Monday night and all WAL segments since then, you can replay the WAL to any point in time: to just before a bad deploy at 2pm Tuesday, to exactly the state the database was in when a bug was first reported, or to any sequence number you choose.

This is what PITR (point-in-time recovery) means in practice. It is not magic. It is just WAL replay stopped at a specific boundary.

vaylix pitr restore \
  --source-dir /var/lib/vaylix \
  --target-dir /var/lib/vaylix-restored \
  --to-timestamp-ms 1749200000000

The restore writes to a new target directory and never touches the source. If the restore produces the wrong state, you still have the original data intact to try again.

What this cost in practice

Building the WAL in Vaylix surfaced a few things that are easy to get wrong.

The WAL I/O worker runs on a dedicated thread, separate from the engine coordinator that assigns sequence numbers. Sequence assignment is fast and in-memory. The disk I/O is pushed off the hot path. But those two things have to stay in sync: the engine coordinator must not acknowledge a write to a client until the WAL I/O worker confirms the corresponding entry has been fsynced.

Getting that handoff wrong in either direction is a bug. Too early and you have the durability problem again. Too late and you serialize unnecessarily and hurt throughput.

The checksum on each WAL entry is also not optional. Without it, a partial write to the WAL is indistinguishable from a complete one. The checksum is what lets recovery distinguish a truncated entry, which happens when the process is killed mid-append, from a valid entry that happens to contain unusual bytes.

Directory fsyncs, as mentioned above, are easy to forget. They are not obviously required by the code. They show up as flaky data loss that only reproduces under specific crash conditions during specific filesystem operations. They are the kind of thing that only gets discovered through careful testing or production incidents.

The short version

write() buffers data in the kernel. The data is not on disk until something forces it there.

fsync() forces it there and blocks until the hardware confirms.

A write-ahead log puts fsync on the path of every acknowledged write, but makes that cost manageable by writing to a sequential append-only log rather than to arbitrary locations in the primary data structure.

On crash, replay the WAL to reconstruct state. On growth, take snapshots and prune old segments.

The details are subtle: sequence gaps, checksums, directory fsyncs, the atomic rename pattern, the ordering of snapshot and WAL pruning steps. Each one is a failure mode that shows up in the wrong conditions at the wrong time.

But the core idea is simple. Write first, confirm later, replay if needed.

Vaylix is an open source key-value database engine built around these principles. The WAL implementation, crash recovery path, and snapshot logic are all in the public repository.

Raft, Quorum, and High Availability

Anapeksha Mukherjee — Sun, 07 Jun 2026 13:01:59 +0000

If you run one database node, life is simple. There is one copy of the data, one process accepting writes, and one place to recover from.

The moment you run three nodes, you get a harder question:

If these machines disagree, which one is right?

That is the problem Raft is built to solve.

Raft is a consensus algorithm. More specifically, it is a way for a group of machines to agree on a single ordered log of operations, even when some machines crash, restart, lag behind, or lose network connectivity.

It shows up in databases, metadata stores, schedulers, service discovery systems, and control planes because it gives those systems a practical answer to a dangerous question: how do we stay available without letting different nodes invent different versions of reality?

The Core Idea

Raft turns a cluster of nodes into one logical state machine.

Instead of letting every node mutate state independently, Raft makes nodes agree on a log:

1. SET user:1 "Ada"
2. SET user:2 "Grace"
3. DELETE session:abc

Each node applies the same committed log entries in the same order. If the state machine is deterministic, the nodes end up with the same state.

That is the whole trick.

same log + same order + deterministic apply = same state

Raft is not really about key-value stores or SQL or config files. Those are just things you can build on top. Raft is about agreeing on the order of changes.

The Three Roles

Each Raft node is in one of three roles:

Follower
Candidate
Leader

A follower is passive. It accepts messages from the leader and votes in elections.

A candidate is trying to become leader.

The leader is the node currently responsible for handling writes and replicating log entries to the rest of the cluster.

In a healthy cluster, there is one leader for the current term.

node-a: leader
node-b: follower
node-c: follower

Clients usually send writes to the leader. Some systems let clients connect to any node, but under the hood the write still has to reach the leader or go through an equivalent consensus path.

Terms: Raft's Logical Clock

Raft uses numbered terms:

term 1
term 2
term 3

A term is not wall-clock time. It is a logical epoch.

Every election happens in a term. If a candidate wins, it becomes leader for that term.

Terms help nodes detect stale information. If a node receives a message from an older term, it knows the sender is behind. If it sees a newer term, it updates itself and steps down if needed.

Example:

node-a thinks it is leader in term 4
node-b has already seen term 5

If node-a sends a leader message with term 4, node-b rejects it. That prevents old leaders from continuing to act as the authority after the cluster has moved on.
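The term rule can be sketched as a single check that every incoming message passes through. `Node`, `Role`, and `TermCheck` are hypothetical names, and the sketch covers only the term comparison, not the rest of message handling.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Role {
    Follower,
    Candidate,
    Leader,
}

#[derive(Debug)]
pub struct Node {
    pub term: u64,
    pub role: Role,
}

#[derive(Debug, PartialEq)]
pub enum TermCheck {
    RejectStale,
    Accept,
}

impl Node {
    /// Applies the term rule to the term carried by an incoming message.
    pub fn observe_term(&mut self, msg_term: u64) -> TermCheck {
        if msg_term < self.term {
            return TermCheck::RejectStale; // the sender is behind
        }
        if msg_term > self.term {
            self.term = msg_term; // catch up to the newer term
            self.role = Role::Follower; // step down if leader or candidate
        }
        TermCheck::Accept
    }
}
```

In the example above, node-b at term 5 returns `RejectStale` for node-a's term-4 message, and node-a becomes a follower at term 5 as soon as it sees any message from node-b.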

What Quorum Means

A quorum is the minimum number of voting nodes needed to make a decision.

In Raft, quorum usually means a majority:

quorum = floor(N / 2) + 1

Voting nodes   Quorum
1              1
2              2
3              2
5              3
7              4

The important part is not the formula. The important part is overlap.

Any two majorities must share at least one node.

In a 3-node cluster:

majority 1: node-a + node-b
majority 2: node-b + node-c

Both include node-b.

In a 5-node cluster:

majority 1: node-a + node-b + node-c
majority 2: node-c + node-d + node-e

Both include node-c.

That overlap is what stops two isolated groups from making conflicting committed decisions.
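The formula and the overlap property are easy to check numerically. The function names below are hypothetical, and all three assume at least one voting node.

```rust
/// Majority quorum for a cluster of `voting_nodes` (assumed to be >= 1).
pub fn quorum(voting_nodes: usize) -> usize {
    voting_nodes / 2 + 1
}

/// Largest number of nodes that can be down while a quorum remains.
pub fn tolerated_failures(voting_nodes: usize) -> usize {
    voting_nodes - quorum(voting_nodes)
}

/// Smallest possible overlap between two quorums: two groups of size q
/// drawn from n nodes must share at least 2q - n members.
pub fn min_overlap(voting_nodes: usize) -> usize {
    2 * quorum(voting_nodes) - voting_nodes
}
```

`min_overlap` is at least 1 for every cluster size, which is the overlap guarantee in arithmetic form. `tolerated_failures` also shows why even sizes buy nothing: 3 and 4 nodes both tolerate one failure.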

Why Quorum Matters

Suppose we have three nodes:

node-a
node-b
node-c

The leader receives:

SET color "blue"

If the leader writes locally and immediately returns success, the write is fragile. The leader could crash before anyone else receives the entry.

Raft waits for the entry to be replicated to a quorum.

For three nodes, quorum is two:

node-a: SET color "blue"
node-b: SET color "blue"
node-c: not yet replicated

Once two nodes have the entry, the leader can mark it committed.

Why is that safe?

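Because any future leader also needs a quorum to win an election. Since quorums overlap, a future leader election must involve at least one node that knows about the committed entry. Raft's voting rule does the rest: a node refuses to vote for a candidate whose log is less up to date than its own, so a candidate missing the committed entry cannot collect a majority.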

That is the safety property. A committed entry cannot simply vanish because one machine died.

Leader Election

Followers expect regular heartbeats from the leader.

If a follower does not hear from the leader before its election timeout, it assumes the leader may be gone and starts an election:

It becomes a candidate.
It increments its term.
It votes for itself.
It asks the other nodes for votes.
If it receives votes from a quorum, it becomes leader.

For a 3-node cluster:

node-a: candidate, votes for node-a
node-b: votes for node-a
node-c: no response

node-a has two votes. That is a majority, so it becomes leader.

Raft uses randomized election timeouts to reduce split votes. If every follower started an election at exactly the same time, each might vote for itself and nobody would win. Randomized timeouts make it likely that one node starts first and collects votes before the others become candidates.
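The candidate's side of the election steps can be sketched as a vote tally. `Election` is a hypothetical type; timeouts, the RPCs themselves, and the voter's side of the decision are left out.

```rust
use std::collections::HashSet;

/// Tracks the votes a candidate has collected in one term.
pub struct Election {
    term: u64,
    cluster_size: usize,
    votes: HashSet<String>,
}

impl Election {
    /// Starting an election: move to a new term and vote for yourself.
    pub fn start(node_id: &str, previous_term: u64, cluster_size: usize) -> Self {
        let mut votes = HashSet::new();
        votes.insert(node_id.to_string());
        Election { term: previous_term + 1, cluster_size, votes }
    }

    /// Records a granted vote and returns true once a majority is reached.
    /// Votes for another term are ignored, and a set keeps a repeated vote
    /// from the same node from counting twice.
    pub fn record_vote(&mut self, from: &str, vote_term: u64) -> bool {
        if vote_term == self.term {
            self.votes.insert(from.to_string());
        }
        self.won()
    }

    pub fn won(&self) -> bool {
        self.votes.len() >= self.cluster_size / 2 + 1
    }

    pub fn term(&self) -> u64 {
        self.term
    }
}
```

In the 3-node example, node-a's own vote plus node-b's reaches the quorum of two, so node-c's silence does not block the election.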

Log Replication

Once there is a leader, writes go through the leader.

The flow looks like this:

client -> leader: SET user:1 "Ada"
leader: append entry to local log
leader -> followers: replicate entry
followers -> leader: acknowledged
leader: commit after quorum
leader -> client: success

The leader does not need every follower to acknowledge the write. It needs a quorum.

In a 5-node cluster, quorum is 3. The leader plus two followers is enough to commit.

That is why a Raft cluster can keep working even when some nodes are down.
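One way a leader can work out what is safe to commit is to look at the highest log index each node is known to hold and take the highest index that a quorum has reached. The sketch below assumes that bookkeeping exists; the names are illustrative, and it omits the additional Raft rule that a leader only commits by counting replicas for entries from its own term.

```python
def quorum_commit_index(leader_last_index: int, follower_match: dict[str, int]) -> int:
    """Highest log index stored on a majority of the cluster.

    follower_match maps follower name -> highest index known to be
    replicated on that follower. The leader counts as one voter.
    """
    indexes = sorted([leader_last_index, *follower_match.values()], reverse=True)
    quorum = len(indexes) // 2 + 1
    # The quorum-th highest index is held by at least `quorum` nodes.
    return indexes[quorum - 1]


# 5-node cluster: the leader plus two followers have index 10, two lag behind.
print(quorum_commit_index(10, {"b": 10, "c": 10, "d": 7, "e": 3}))  # 10
```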

Committed vs Applied

Two words matter a lot in Raft:

committed
applied

An entry is committed once the leader knows it is stored on a quorum of nodes, which is the point at which Raft guarantees it will survive a leader change.

An entry is applied when a node has actually executed it against its local state machine.

A follower can know that entry 100 is committed but only have applied through entry 98.

That distinction matters for reads. A node that has not applied the latest committed entries may return stale data if it serves reads directly from local state.
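A small sketch makes the gap between the two indexes concrete. The Replica class below is hypothetical and heavily simplified: the log holds (key, value) commands and the state machine is a dictionary.

```python
class Replica:
    def __init__(self):
        self.log = []            # log index i lives at self.log[i - 1]
        self.commit_index = 0    # highest index known to be committed
        self.last_applied = 0    # highest index executed against self.state
        self.state = {}

    def apply_committed(self, max_entries=None) -> int:
        """Apply committed-but-unapplied entries in log order."""
        applied = 0
        while self.last_applied < self.commit_index:
            if max_entries is not None and applied >= max_entries:
                break
            key, value = self.log[self.last_applied]
            self.state[key] = value
            self.last_applied += 1
            applied += 1
        return applied


replica = Replica()
replica.log = [("color", "red"), ("color", "green"), ("color", "blue")]
replica.commit_index = 3

replica.apply_committed(max_entries=1)
print(replica.state)   # {'color': 'red'}  (committed is 3, applied is 1: a local read is stale)

replica.apply_committed()
print(replica.state)   # {'color': 'blue'}
```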

Reads Are Subtle

Writes naturally go through the log. Reads are easier to get wrong.

Imagine an old leader gets separated from the rest of the cluster. It still has data. It still has a process running. It may even still believe it is leader for a short time.

Meanwhile, the majority side elects a new leader and commits newer writes.

If the old leader serves reads without proving it still has authority, it can return stale answers.

There are a few common ways to handle this.

Route Reads To The Leader

This is simple and common.

The read goes to the leader, and the leader confirms it still has quorum authority before serving the read.

This is correct, but it concentrates all linearizable read traffic on the leader, which can become the bottleneck.

ReadIndex

ReadIndex is an optimization that serves linearizable reads without appending anything to the log.

The leader establishes a safe read point, usually by confirming authority with a quorum. A follower can then wait until it has applied through that index and serve the read locally.

The important part is the wait:

safe read index = 120
follower applied index = 118
wait
follower applied index = 120
serve read

Without that wait, the follower may be stale.
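The wait itself can be sketched with a condition variable. This is an illustration only: FollowerReadGate is a made-up name, and obtaining the safe read index from the leader is assumed to have happened already.

```python
import threading


class FollowerReadGate:
    """Blocks a local read until the follower has applied through read_index."""

    def __init__(self):
        self._applied = 0
        self._cond = threading.Condition()

    def advance_applied(self, index: int) -> None:
        """Called by the apply loop after executing entries up to `index`."""
        with self._cond:
            self._applied = max(self._applied, index)
            self._cond.notify_all()

    def wait_until_applied(self, read_index: int, timeout=None) -> bool:
        """Return True once applied >= read_index, False if the timeout expires."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._applied >= read_index, timeout=timeout
            )


gate = FollowerReadGate()
gate.advance_applied(118)
print(gate.wait_until_applied(120, timeout=0.01))  # False: still behind, do not serve
gate.advance_applied(120)
print(gate.wait_until_applied(120, timeout=0.01))  # True: safe to serve the read
```

A timeout matters in practice. A follower that cannot catch up should fail or redirect the read instead of blocking forever.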

Lease Reads

Lease reads rely on timing assumptions. The leader assumes it remains valid for a lease window after contacting quorum.

They can be fast, but they require careful timeout and clock assumptions. Getting them wrong can break consistency.

Stale Reads

Some systems intentionally allow local follower reads because they are fast and distribute load.

That is fine when the API is honest about it:

linearizable read: latest committed state
stale read: local replica state

The trouble starts when stale reads are presented as if they are strongly consistent.

High Availability: What Raft Actually Gives You

High availability does not mean every node can always accept writes.

In a strongly consistent Raft system, availability means:

The cluster can continue accepting safe writes as long as a quorum is alive and connected.

For three nodes:

quorum = 2

The cluster can tolerate one failed node:

node-a: alive
node-b: alive
node-c: down

For five nodes:

quorum = 3

The cluster can tolerate two failed nodes:

node-a: alive
node-b: alive
node-c: alive
node-d: down
node-e: down

In general, a cluster of 2f + 1 voting nodes can tolerate f failures.

Nodes	Quorum	Failures Tolerated
1	1	0
3	2	1
5	3	2
7	4	3

This is why odd-sized clusters are common.
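The table follows directly from the majority rule, and computing it for even sizes shows why they are rarely worth it. A short sketch, with function names of my own choosing:

```python
def quorum(nodes: int) -> int:
    """Majority of `nodes` voting members."""
    return nodes // 2 + 1


def failures_tolerated(nodes: int) -> int:
    """How many voters can fail while a quorum is still reachable."""
    return nodes - quorum(nodes)


for n in (1, 2, 3, 4, 5, 6, 7):
    print(n, quorum(n), failures_tolerated(n))
```

Going from 3 to 4 nodes raises quorum from 2 to 3 but still tolerates only one failure. The extra node adds replication cost without adding fault tolerance.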

Why Two Nodes Are Usually Not Enough

A 2-node cluster has quorum 2.

node-a
node-b

If either node fails, only one node remains. One node is not a quorum, so the cluster cannot safely elect a leader or commit writes.

That means a 2-node Raft cluster often gives you redundancy without useful write availability.

Three nodes is usually the smallest practical production setup.

Network Partitions

Network partitions are where quorum really earns its keep.

Take a 5-node cluster:

node-a
node-b
node-c
node-d
node-e

Now split the network:

partition 1: node-a, node-b
partition 2: node-c, node-d, node-e

Quorum is 3.

Partition 1 has only 2 nodes, so it cannot elect a leader or commit writes.

Partition 2 has 3 nodes, so it can continue.

This prevents split brain. The minority side may be alive, but it cannot make authoritative decisions.

That is the tradeoff: Raft sacrifices availability on the minority side to protect consistency.

What Happens During Failover?

Suppose the leader dies:

node-a: leader, down
node-b: follower
node-c: follower

The followers stop receiving heartbeats. After an election timeout, one follower becomes candidate.

node-b: candidate

It requests votes:

node-b votes for node-b
node-c votes for node-b

Now node-b has quorum and becomes leader.

There is usually a short window where writes are unavailable. Once a new leader is elected, the cluster can accept writes again.

This is high availability with a consistency boundary. The system pauses rather than allowing two leaders to accept conflicting writes.

Repairing Divergent Logs

Followers can fall behind. They can also contain entries that were written by an old leader but never committed.

Raft repairs this through log matching.

Append requests include information about the entry immediately before the new entries:

previous log index
previous log term
new entries

If the follower does not have the expected previous entry, it rejects the append. The leader backs up and tries again from an earlier point.

Once the leader finds a matching prefix, it overwrites the follower's conflicting suffix.

That sounds aggressive, but it is correct. Uncommitted entries are not guaranteed to survive. Committed entries are.
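On the follower side, the consistency check and the suffix overwrite can be sketched as follows. This is a simplified illustration of the idea, with names of my own choosing. It ignores term checks on the request itself, commit index updates, and persistence.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    term: int
    command: str


def handle_append(log: list[Entry], prev_index: int, prev_term: int,
                  entries: list[Entry]) -> bool:
    """Apply an append request to `log` in place. Log indexes are 1-based.

    Returns False (reject) if the follower lacks the expected previous entry,
    which tells the leader to back up and retry from an earlier point.
    """
    if prev_index > len(log):
        return False                                  # follower is too far behind
    if prev_index > 0 and log[prev_index - 1].term != prev_term:
        return False                                  # histories diverge here

    for offset, entry in enumerate(entries):
        pos = prev_index + offset                     # 0-based slot for this entry
        if pos < len(log):
            if log[pos].term != entry.term:
                del log[pos:]                         # drop the conflicting suffix
                log.append(entry)
            # same index and term: already present, nothing to do
        else:
            log.append(entry)
    return True


follower = [Entry(1, "SET a 1"), Entry(1, "SET b 2"), Entry(2, "SET x 9")]
print(handle_append(follower, 3, 3, [Entry(3, "SET c 3")]))  # False: term mismatch at index 3
print(handle_append(follower, 2, 1, [Entry(3, "SET c 3")]))  # True: suffix replaced
print([e.command for e in follower])  # ['SET a 1', 'SET b 2', 'SET c 3']
```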

Snapshots And Compaction

Logs cannot grow forever.

If a system has applied millions of entries, it should not need to keep every old entry around just to recover.

A snapshot captures the state machine at a specific log index:

snapshot at index 1,000,000

After that, older log entries can be compacted.

If a follower is far behind, the leader may send a snapshot instead of trying to stream a huge log history.

Snapshots are not an optional polish feature in real systems. They are part of making Raft operationally practical.
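A minimal in-memory sketch of the idea, assuming a dictionary state machine and (key, value) log entries. CompactingLog is a hypothetical name, and a real implementation would write the snapshot to disk before discarding any log entries.

```python
from dataclasses import dataclass, field


@dataclass
class CompactingLog:
    snapshot_index: int = 0                              # last index folded into the snapshot
    snapshot_state: dict = field(default_factory=dict)
    entries: list = field(default_factory=list)          # entries after snapshot_index

    def append(self, key, value) -> None:
        self.entries.append((key, value))

    def last_index(self) -> int:
        return self.snapshot_index + len(self.entries)

    def compact(self, applied_index: int) -> None:
        """Fold entries up to applied_index into the snapshot and drop them."""
        if applied_index > self.last_index():
            raise ValueError("cannot snapshot past the end of the log")
        count = applied_index - self.snapshot_index
        if count <= 0:
            return
        for key, value in self.entries[:count]:
            self.snapshot_state[key] = value
        del self.entries[:count]
        self.snapshot_index = applied_index

    def recover(self) -> dict:
        """Rebuild state from the snapshot plus the remaining log tail."""
        state = dict(self.snapshot_state)
        for key, value in self.entries:
            state[key] = value
        return state


log = CompactingLog()
log.append("a", 1)
log.append("b", 2)
log.append("a", 3)
log.compact(2)
print(log.snapshot_index, len(log.entries))  # 2 1
print(log.recover())                         # {'a': 3, 'b': 2}
```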

Membership Changes

Adding or removing voting nodes is harder than it looks because membership changes alter quorum.

If different parts of the cluster disagree about who can vote, you can accidentally create two groups that both think they have authority.

That is why membership changes must go through the replicated log too.

The cluster has to agree on configuration changes with the same care it uses for data changes.

Different Raft implementations handle this in different ways, often with joint consensus or carefully sequenced configuration transitions.

What Raft Guarantees

Raft is designed around a few key safety properties:

Election safety: at most one leader is elected in a term.
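Leader append-only: a leader never overwrites or deletes entries in its own log; it only appends.
Log matching: if two logs hold an entry with the same index and term, the logs are identical in every entry up through that index.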
Leader completeness: committed entries appear in future leaders' logs.
State machine safety: two nodes do not apply different commands at the same log index.

These properties let a group of machines behave like one ordered state machine.

What Raft Does Not Solve

Raft is not a complete database.

It does not automatically fix:

bad disk durability settings
overloaded nodes
slow replication links
unsafe read paths
poor timeout tuning
multi-shard transactions
application-level conflicts
bad operational procedures

Raft gives you a way to agree on ordered changes. The system around it still has to use that agreement correctly.

Common Mistakes

Acknowledging Writes Too Early

If a leader returns success before quorum replication, a successful write can disappear during failover.

The safe flow is:

append locally
replicate to quorum
mark committed
respond success

Treating Follower Reads As Strong Reads

Follower reads are not automatically linearizable.

They are fine if they are documented as stale reads. They are dangerous if callers expect read-after-write consistency.

Letting Old Leaders Serve Reads

A partitioned leader can still be alive.

Before serving a strong read, a leader needs proof that it still has authority. Otherwise it may return stale state after a new leader has already been elected.

Making Timeouts Too Aggressive

Short election timeouts can make a healthy cluster flap during latency spikes.

Long election timeouts make failover slow.

There is no universal perfect value. Timeouts have to match the environment.

A Good Mental Model

Raft is not "replication" in the casual sense of copying bytes to other machines.

It is stricter than that.

Raft is agreement over history.

The leader proposes the next entries. A quorum decides which entries are durable enough to become part of the cluster's history. Every node applies that history in order.

The shortest version is:

Raft = one agreed log, replicated by majority decisions

That is why quorum matters. It makes sure every committed decision intersects with future decisions.

Closing Thoughts

High availability is not just keeping processes alive. A system that stays online but returns conflicting answers is not healthy.

Raft gives distributed systems a way to remain available on the majority side of failures while protecting the integrity of committed state.

That is the tradeoff:

keep serving when a quorum is available
stop serving authoritative writes when a quorum is not available
never let two disconnected groups commit conflicting histories

Once that clicks, Raft becomes less mysterious. It is a disciplined way to decide which changes become real.

Why I Stopped Using Redis for Coordination State and Built Something Else

Anapeksha Mukherjee — Fri, 05 Jun 2026 20:42:46 +0000

I have been building AuthSafe, a developer auth platform, for three years. Auth infrastructure is unforgiving about correctness. A stale session state, a dropped rate limit counter, a coordination write that silently vanished: these are not edge cases you can wave away. They are production bugs with real consequences.

For most of those three years, I used Redis for coordination state. Rate limiting counters. Session metadata. Configuration that had to survive a restart. It worked well enough that I did not question it seriously until I started digging into what "well enough" actually meant on the durability side.

What I found made me uncomfortable enough to build something else. That project is Vaylix.

The Default Redis Durability Story Is Worse Than Most People Think

Here is the part that surprised me: AOF persistence, the mechanism that gives Redis its best durability story, is disabled by default in open source Redis. What runs out of the box is RDB snapshotting, which takes periodic point-in-time snapshots of the dataset. The default RDB configuration in current releases (save 3600 1 300 100 60 10000) triggers a snapshot after 3600 seconds if at least 1 change was made, after 300 seconds if at least 100 changes were made, and after 60 seconds if at least 10,000 changes were made.

That means that under the default configuration, a Redis crash loses every write since the last snapshot: up to about a minute of writes under heavy write volume, and up to an hour when writes are sparse. The client got OK back. The data is gone.

If you enable AOF and set appendfsync everysec, that window shrinks to approximately one second. That is the configuration most production guides recommend, and Redis's own documentation describes it as both fast and relatively safe. But one second of acknowledged writes disappearing is still a meaningful data loss window for coordination state. The stricter option, appendfsync always, fsyncs on every write and cuts throughput sharply compared to the default: from tens of thousands of operations per second down to a few thousand at best on SSDs, roughly an order of magnitude.

None of this is a criticism of Redis. These are deliberate tradeoffs, documented openly, that make total sense for a system designed primarily as a cache. Redis is fast because it does not pay the full durability cost on every write. That is the right design for the workload it was built for.

For coordination state in an auth platform, it was the wrong design for my workload.

Why I Did Not Just Switch to etcd

etcd is the standard answer for strongly consistent key-value storage. It is battle-tested, runs at significant scale inside Kubernetes clusters worldwide, and its durability guarantees are genuine. Writes are not acknowledged until they are committed through Raft consensus and fsynced.

The problem is not etcd's correctness. The problem is that etcd's entire operational identity is Kubernetes infrastructure. Its documentation, its deployment patterns, its client API, its watch semantics — all shaped by that context. I spent two days trying to set up a simple three-node etcd cluster for a non-Kubernetes workload and kept hitting documentation that assumed I was configuring cluster state for container orchestration.

The operational weight was not justified for what I needed. I did not need etcd. I needed what etcd guarantees, without etcd's context.

What I Actually Needed

Stripped down to specifics:

Every acknowledged write had to survive a process crash. Not probably. Not within one second. Every write.

Reads had to be consistent with the latest committed write on the same connection. No stale replica reads for session-critical paths.

The security model had to be granular enough that different internal services could access different key namespaces without sharing a single global credential.

The deployment had to be operationally simple for a two or three node setup without a dedicated infrastructure team.

That is a narrow requirements list. No complex queries, no document storage, no pub/sub. Just a key-value store with correct durability semantics and a sensible auth model.

So I Built It

I had been learning Rust seriously and wanted a project with real systems constraints rather than toy complexity. The requirements above were specific enough to guide every architectural decision.

The durability foundation is a write-ahead log with fsync. Every write goes to the WAL and is fsynced before the client receives acknowledgement. On restart, the WAL replays to reconstruct state. No acknowledged write is lost.
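To make the ordering concrete, here is a minimal Python sketch of the append, fsync, then acknowledge pattern with replay on restart. It is not Vaylix's actual code (the engine is written in Rust). The JSON-lines record format and the WriteAheadLog class are assumptions made for illustration, and it skips details a real WAL needs, such as checksums and fsyncing the parent directory when the file is first created.

```python
import json
import os
import tempfile


class WriteAheadLog:
    def __init__(self, path: str):
        self.path = path
        self.state = {}
        self._replay()
        self._file = open(path, "ab")

    def _replay(self) -> None:
        """Rebuild state from the log and cut off any torn final record."""
        if not os.path.exists(self.path):
            return
        good_bytes = 0
        with open(self.path, "rb") as f:
            for line in f:
                if not line.endswith(b"\n"):
                    break                  # partial write: it was never acknowledged
                try:
                    record = json.loads(line)
                except ValueError:
                    break                  # corrupt tail: stop replaying here
                self.state[record["key"]] = record["value"]
                good_bytes += len(line)
        with open(self.path, "r+b") as f:
            f.truncate(good_bytes)         # new appends start on a clean boundary

    def set(self, key, value) -> None:
        record = json.dumps({"key": key, "value": value}).encode() + b"\n"
        self._file.write(record)
        self._file.flush()                 # push Python's buffer to the OS
        os.fsync(self._file.fileno())      # ask the OS to push it to stable storage
        self.state[key] = value            # only now is it safe to acknowledge

    def close(self) -> None:
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.wal")
    wal = WriteAheadLog(path)
    wal.set("color", "blue")
    wal.close()                            # stand-in for a process restart
    recovered = WriteAheadLog(path)
    print(recovered.state)                 # {'color': 'blue'}
    recovered.close()
```

The important line is the os.fsync call before the in-memory update. flush alone only moves data into the operating system's cache, which does not survive a power loss.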

On top of that is Raft-style replication. Writes are not acknowledged until a quorum of nodes confirms receipt, so even a leader crash immediately after acknowledgement leaves the data on a majority of nodes.

The wire protocol is a custom framed binary format rather than HTTP. Persistent connections with capability negotiation at startup, low per-request overhead, pipelined requests with UUID-based correlation. More work to build than wrapping HTTP, but the right design for stateful sessions.
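To show what framing with UUID correlation can look like, here is a hypothetical sketch. The layout (a 4-byte big-endian payload length, a 16-byte request ID, then the payload) is an assumption for illustration and is not the actual Vaylix wire format.

```python
import struct
import uuid

# Hypothetical header: payload length (uint32, big-endian) + 16-byte request ID.
HEADER = struct.Struct(">I16s")


def encode_frame(request_id: uuid.UUID, payload: bytes) -> bytes:
    return HEADER.pack(len(payload), request_id.bytes) + payload


def decode_frames(buffer: bytes):
    """Split a receive buffer into complete frames.

    Returns (frames, remainder). The remainder holds the bytes of a frame
    that has not fully arrived yet and should be kept for the next read.
    """
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        length, raw_id = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break                          # payload incomplete, wait for more bytes
        payload = buffer[offset + HEADER.size:end]
        frames.append((uuid.UUID(bytes=raw_id), payload))
        offset = end
    return frames, buffer[offset:]


first, second = uuid.uuid4(), uuid.uuid4()
stream = encode_frame(first, b"GET user:1") + encode_frame(second, b"GET user:2")
frames, rest = decode_frames(stream)
print([payload for _, payload in frames])  # [b'GET user:1', b'GET user:2']
```

Because every response carries the ID of the request it answers, a client can send several requests back to back and match replies as they arrive, which is what makes pipelining straightforward.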

Authentication and RBAC are on by default. Permissions are pattern-scoped at the key level, so a rate limiting service can read and write ratelimit:* without touching config:* or session:*.
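A pattern-scoped check of this kind can be sketched in a few lines. The grant table, principal names, and action names below are invented for illustration and do not reflect how Vaylix stores or evaluates permissions. Glob matching comes from Python's fnmatch module.

```python
from fnmatch import fnmatchcase

# Hypothetical grants: principal -> list of (key pattern, allowed actions).
GRANTS = {
    "ratelimiter": [("ratelimit:*", {"read", "write"})],
    "config-reader": [("config:*", {"read"})],
}


def is_allowed(grants: dict, principal: str, action: str, key: str) -> bool:
    """True if any grant for `principal` covers `action` on `key`."""
    return any(
        action in actions and fnmatchcase(key, pattern)
        for pattern, actions in grants.get(principal, [])
    )


print(is_allowed(GRANTS, "ratelimiter", "write", "ratelimit:ip:10.0.0.1"))  # True
print(is_allowed(GRANTS, "ratelimiter", "read", "config:flags"))            # False
print(is_allowed(GRANTS, "config-reader", "write", "config:flags"))         # False
```

Unknown principals fall through to an empty grant list, so the default answer is deny.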

The Honest Tradeoffs

Vaylix is slower than Redis on raw throughput, and significantly slower even when Redis is configured with appendfsync always. That gap is structural, not an optimization problem. Vaylix fsyncs every write, replicates to a quorum before acknowledging, and runs a serialized engine worker. Redis with default or everysec configuration skips most of that work. The latency difference reflects different guarantees, not different levels of engineering effort.

If you need a cache, a leaderboard, a job queue where occasional loss is tolerable, or a pub/sub bus, Redis is the right tool. Vaylix is not.

If you need acknowledged writes to survive crashes without configuration gymnastics, and you want a security model that does not require every internal service to share a root credential, that is the gap Vaylix was built to fill.

Where It Sits Now

Vaylix is three months old and running inside AuthSafe in production for rate limiting and coordination state. It is the first real test of whether the design holds under actual operational conditions. So far the failure modes have been explicit rather than silent, which is what you want from infrastructure you are trusting with correctness.

It is pre-1.0. The roadmap has richer transaction semantics, better cluster tooling, and more client SDKs. Sharding and MVCC are explicitly deferred until the core model is proven by real usage.

The project is open source under MIT.

If you have been running Redis for coordination workloads and have been quietly uncomfortable about the durability configuration, or if you have looked at etcd and decided the operational overhead is not worth it for a non-Kubernetes context, I would genuinely value your feedback on whether Vaylix fits the gap you have been working around.

Engine: https://github.com/vaylix/vaylix

TypeScript SDK: https://github.com/vaylix/vaylix-ts

Docs: https://vaylix.github.io