Anapeksha Mukherjee

Posted on Jun 5

Why I Stopped Using Redis for Coordination State and Built Something Else

#rust #distributedsystems #database #opensource

I have been building AuthSafe, a developer auth platform, for three years. Auth infrastructure is unforgiving about correctness. Stale session state, a dropped rate limit counter, a coordination write that silently vanished: these are not edge cases you can wave away. They are production bugs with real consequences.

For most of those three years, I used Redis for coordination state: rate limiting counters, session metadata, configuration that had to survive a restart. It worked well enough that I did not question it seriously until I started digging into what "well enough" actually meant on the durability side.

What I found made me uncomfortable enough to build something else. That project is Vaylix.

The Default Redis Durability Story Is Worse Than Most People Think

Here is the part that surprised me: AOF persistence, the mechanism that gives Redis its best durability story, is disabled by default in open source Redis. What runs out of the box is RDB snapshotting, which takes periodic point-in-time snapshots of the dataset. The default RDB configuration triggers a snapshot after 3600 seconds if at least one key changed, after 300 seconds if at least 100 keys changed, and after 60 seconds if at least 10,000 keys changed.

Which means under default configuration, the worst-case loss window after a Redis crash ranges from about one minute to one hour of writes, depending on write volume. The client got OK back. The data is gone.
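
To make that window concrete, here is a rough sketch of the arithmetic. It assumes the save points shipped in recent open source redis.conf files (3600/1, 300/100, 60/10000) and a perfectly constant write rate, which real traffic never has. The names SavePoint and seconds_until_snapshot are made up for this illustration and are not part of Redis.

```rust
/// One RDB save point: snapshot once `seconds` have passed since the last
/// save AND at least `changes` keys have changed.
#[derive(Clone, Copy)]
struct SavePoint {
    seconds: u64,
    changes: u64,
}

/// Assumed defaults; check the `save` lines in your own redis.conf.
const DEFAULT_SAVE_POINTS: [SavePoint; 3] = [
    SavePoint { seconds: 3600, changes: 1 },
    SavePoint { seconds: 300, changes: 100 },
    SavePoint { seconds: 60, changes: 10_000 },
];

/// Seconds until the first save point fires, assuming a constant rate of
/// `changes_per_hour` key changes starting right after the previous snapshot.
/// This is also the worst-case amount of acknowledged writes a crash can lose.
/// Returns None when nothing is being written.
fn seconds_until_snapshot(points: &[SavePoint], changes_per_hour: u64) -> Option<u64> {
    if changes_per_hour == 0 {
        return None;
    }
    points
        .iter()
        .map(|p| {
            // Time needed to accumulate `changes` writes, rounded up.
            let fill = (p.changes * 3600 + changes_per_hour - 1) / changes_per_hour;
            // The rule also needs its own timer to have elapsed.
            fill.max(p.seconds)
        })
        .min()
}
```

Under these assumptions, one change per hour gives a 3600 second window, one change per second gives 300 seconds, and 1000 changes per second gives 60 seconds. That is the one-minute-to-one-hour range described above.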

If you enable AOF and set appendfsync everysec, that window shrinks to approximately one second. That is the configuration most production guides recommend and what Redis's own documentation describes as "fast enough and relatively safe." But one second of acknowledged writes disappearing is still a meaningful data loss window for coordination state. The alternative, appendfsync always, fsyncs on every write and cuts throughput sharply compared to the default: from tens of thousands of operations per second down to a few thousand at best on SSDs.

None of this is a criticism of Redis. These are deliberate tradeoffs, documented openly, that make total sense for a system designed primarily as a cache. Redis is fast because it does not pay the full durability cost on every write. That is the right design for the workload it was built for.

For coordination state in an auth platform, it was the wrong design for my workload.

Why I Did Not Just Switch to etcd

etcd is the standard answer for strongly consistent key-value storage. It is battle-tested, runs at significant scale inside Kubernetes clusters worldwide, and its durability guarantees are genuine. Writes are not acknowledged until they are committed through Raft consensus and fsynced.

The problem is not etcd's correctness. The problem is that etcd's entire operational identity is Kubernetes infrastructure. Its documentation, its deployment patterns, its client API, and its watch semantics are all shaped by that context. I spent two days trying to set up a simple three-node etcd cluster for a non-Kubernetes workload and kept hitting documentation that assumed I was configuring cluster state for container orchestration.

The operational weight was not justified for what I needed. I did not need etcd. I needed what etcd guarantees, without etcd's context.

What I Actually Needed

Stripped down to specifics:

Every acknowledged write had to survive a process crash. Not probably. Not within one second. Every write.

Reads had to be consistent with the latest committed write on the same connection. No stale replica reads for session-critical paths.

The security model had to be granular enough that different internal services could access different key namespaces without sharing a single global credential.

The deployment had to be operationally simple for a two or three node setup without a dedicated infrastructure team.

That is a narrow requirements list. No complex queries, no document storage, no pub/sub. Just a key-value store with correct durability semantics and a sensible auth model.

So I Built It

I had been learning Rust seriously and wanted a project with real systems constraints rather than toy complexity. The requirements above were specific enough to guide every architectural decision.

The durability foundation is a write-ahead log with fsync. Every write goes to the WAL and is fsynced before the client receives acknowledgement. On restart, the WAL replays to reconstruct state. No acknowledged write is lost.
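
As an illustration of the pattern, not Vaylix's actual code, here is a minimal write-ahead log sketch using only the Rust standard library. The record layout (a 4-byte little-endian length followed by the payload) and the names Wal and replay are assumptions made for this example.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Write};
use std::path::Path;

/// Append-only log. Each record is a 4-byte little-endian length plus payload.
struct Wal {
    file: File,
}

impl Wal {
    fn open(path: &Path) -> io::Result<Wal> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Wal { file })
    }

    /// Returns Ok only after the record has been flushed to stable storage.
    /// The caller acknowledges the client after this returns, never before.
    fn append(&mut self, payload: &[u8]) -> io::Result<()> {
        if payload.len() > u32::MAX as usize {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "record too large"));
        }
        let mut record = Vec::with_capacity(4 + payload.len());
        record.extend_from_slice(&(payload.len() as u32).to_le_bytes());
        record.extend_from_slice(payload);
        self.file.write_all(&record)?;
        // sync_data flushes file contents to disk (fdatasync on Linux).
        self.file.sync_data()
    }
}

/// Reads back every complete record in order. A torn tail (a final record
/// that was only partially written before a crash) is ignored.
fn replay(path: &Path) -> io::Result<Vec<Vec<u8>>> {
    let mut bytes = Vec::new();
    File::open(path)?.read_to_end(&mut bytes)?;
    let mut records = Vec::new();
    let mut pos = 0;
    while pos + 4 <= bytes.len() {
        let len = u32::from_le_bytes([bytes[pos], bytes[pos + 1], bytes[pos + 2], bytes[pos + 3]])
            as usize;
        let start = pos + 4;
        if start + len > bytes.len() {
            break; // incomplete record: it was never acknowledged
        }
        records.push(bytes[start..start + len].to_vec());
        pos = start + len;
    }
    Ok(records)
}
```

The important property is ordering: append returns only after sync_data succeeds, so anything the client saw acknowledged is on disk. A production log would also need per-record checksums and a directory fsync when the file is first created. Both are left out here to keep the sketch short.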

On top of that is Raft-style replication. Writes are not acknowledged until a quorum of nodes confirms receipt. Which means even a leader crash immediately after acknowledgement leaves the data on a majority of nodes.
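
The quorum rule can be sketched in a few lines. This is the textbook Raft-style calculation, written for illustration and not taken from the Vaylix source. The function names are hypothetical.

```rust
/// Smallest number of nodes that forms a majority of `cluster_size`.
fn quorum(cluster_size: usize) -> usize {
    cluster_size / 2 + 1
}

/// Highest log index that is stored on at least a quorum of nodes.
/// `match_index` holds one entry per node, the leader included: the last
/// log index each node is known to have persisted.
fn quorum_commit_index(match_index: &[u64]) -> Option<u64> {
    if match_index.is_empty() {
        return None;
    }
    let mut sorted = match_index.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // descending
    // The quorum-th highest value is held by at least `quorum` nodes.
    Some(sorted[quorum(sorted.len()) - 1])
}
```

With three nodes at indexes 7, 5, and 4, the commit index is 5: entries up to 5 are on two of three nodes, so they can be acknowledged. Note that quorum(2) is 2, so a two-node cluster cannot acknowledge writes while either node is down. Full Raft adds a further condition, that a leader only commits entries from its own term, which this sketch omits.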

The wire protocol is a custom framed binary format rather than HTTP. Persistent connections with capability negotiation at startup, low per-request overhead, pipelined requests with UUID-based correlation. More work to build than wrapping HTTP, but the right design for stateful sessions.
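
To show what length-prefixed framing with correlation IDs looks like in general, here is a small sketch. The layout below (a 4-byte big-endian body length, a 16-byte correlation ID, then the payload) and the 1 MiB cap are assumptions for this example only. The real Vaylix frame format is not described in this post and may differ.

```rust
const ID_LEN: usize = 16; // size of a UUID in bytes
const MAX_BODY_LEN: usize = 1 << 20; // assumed 1 MiB cap, hypothetical

/// Hypothetical frame: [u32 big-endian body length][16-byte id][payload].
#[derive(Debug, PartialEq)]
struct Frame {
    correlation_id: [u8; ID_LEN],
    payload: Vec<u8>,
}

fn encode(frame: &Frame) -> Result<Vec<u8>, &'static str> {
    let body_len = ID_LEN + frame.payload.len();
    if body_len > MAX_BODY_LEN {
        return Err("frame too large");
    }
    let mut out = Vec::with_capacity(4 + body_len);
    out.extend_from_slice(&(body_len as u32).to_be_bytes());
    out.extend_from_slice(&frame.correlation_id);
    out.extend_from_slice(&frame.payload);
    Ok(out)
}

/// Tries to take one frame off the front of `buf`.
/// Ok(Some((frame, consumed))) on success, Ok(None) if more bytes are
/// needed, Err if the length field is invalid.
fn decode(buf: &[u8]) -> Result<Option<(Frame, usize)>, &'static str> {
    if buf.len() < 4 {
        return Ok(None);
    }
    let body_len = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    if body_len < ID_LEN || body_len > MAX_BODY_LEN {
        return Err("invalid frame length");
    }
    let total = 4 + body_len;
    if buf.len() < total {
        return Ok(None);
    }
    let mut correlation_id = [0u8; ID_LEN];
    correlation_id.copy_from_slice(&buf[4..4 + ID_LEN]);
    let payload = buf[4 + ID_LEN..total].to_vec();
    Ok(Some((Frame { correlation_id, payload }, total)))
}
```

Because every frame carries its own ID, a client can write several requests back to back on one connection and match responses to requests as they arrive, in whatever order the server sends them. Rejecting oversized length fields before allocating keeps a corrupt or hostile peer from forcing a huge buffer.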

Authentication and RBAC are on by default. Permissions are pattern-scoped at the key level, so a rate limiting service can read and write ratelimit:* without touching config:* or session:*.
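
A minimal sketch of pattern-scoped permission checks, assuming the simplest possible pattern language: a trailing * means "any key with this prefix" and anything else is an exact match. The Grant type and function names are hypothetical, and Vaylix's real pattern syntax may be richer.

```rust
/// Hypothetical grant: a key pattern plus the operations allowed on it.
struct Grant {
    pattern: &'static str,
    can_read: bool,
    can_write: bool,
}

/// A pattern ending in '*' matches any key with that prefix.
/// Any other pattern must match the key exactly.
fn pattern_matches(pattern: &str, key: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => key.starts_with(prefix),
        None => pattern == key,
    }
}

/// Deny by default: access is allowed only if some grant covers the key.
fn may_read(grants: &[Grant], key: &str) -> bool {
    grants.iter().any(|g| g.can_read && pattern_matches(g.pattern, key))
}

fn may_write(grants: &[Grant], key: &str) -> bool {
    grants.iter().any(|g| g.can_write && pattern_matches(g.pattern, key))
}
```

A rate limiting service holding a single read/write grant on ratelimit:* can touch ratelimit:login:42 but gets a denial for config:ttl or session:abc, because no grant matches those keys.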

The Honest Tradeoffs

Vaylix is slower than Redis on raw throughput, and the gap stays significant even in a comparison against appendfsync always. That gap is structural, not an optimization problem. Vaylix fsyncs every write, replicates to a quorum before acknowledging, and runs a serialized engine worker. Redis with default or everysec configuration skips most of that work. The latency difference reflects different guarantees, not different levels of engineering effort.

If you need a cache, a leaderboard, a job queue where occasional loss is tolerable, or a pub/sub bus, Redis is the right tool. Vaylix is not.

If you need acknowledged writes to survive crashes without configuration gymnastics, and you want a security model that does not require every internal service to share a root credential, that is the gap Vaylix was built to fill.

Where It Sits Now

Vaylix is three months old and running inside AuthSafe in production for rate limiting and coordination state. It is the first real test of whether the design holds under actual operational conditions. So far the failure modes have been explicit rather than silent, which is what you want from infrastructure you are trusting with correctness.

It is pre-1.0. The roadmap has richer transaction semantics, better cluster tooling, and more client SDKs. Sharding and MVCC are explicitly deferred until the core model is proven by real usage.

The project is open source under MIT.

If you have been running Redis for coordination workloads and have been quietly uncomfortable about the durability configuration, or if you have looked at etcd and decided the operational overhead is not worth it for a non-Kubernetes context, I would genuinely value your feedback on whether Vaylix fits the gap you have been working around.

Engine: https://github.com/vaylix/vaylix

TypeScript SDK: https://github.com/vaylix/vaylix-ts

Docs: https://vaylix.github.io

Top comments (2)

mote • Jun 10

The WAL + fsync per-write approach vs AOF appendfsync always comparison is the right call. People miss that Redis AOF rewrite during high throughput can spike latency 10x — and if your coordination state is tiny (a few MB), writing the full log on every write is totally fine.

One question: your custom binary frame protocol — any reason you didn't go with RESP2/3? Redis clients already speak it, and you could piggyback on existing tooling (redis-cli for debugging, existing SDKs for non-Rust languages). Rolling your own protocol means every new language binding starts from scratch.

On the embedded angle: I've been working on moteDB (Rust embedded DB for AI agents) and found that for coordination state under 100MB, embedding the store directly in-process removes the network hop entirely. Your AuthSafe use case (single service + coordination) might actually benefit from that model — no separate process to manage, no connection pool, crash-recovery is just reopening a file. Curious if you considered that vs a standalone server.

Anapeksha Mukherjee • Jun 11

The AOF rewrite point is something I glossed over in the post and you are right to flag it. Under sustained write load the rewrite forks, copies the dataset, and the latency spike during that window is not subtle. For small coordination state the full-log-per-write approach works fine, but that assumption breaks quietly as the dataset grows.

On RESP2/3: I thought about it and decided against it for one specific reason. RESP is built around Redis's data model: lists, hashes, sorted sets. Vaylix has none of those, so I would have been carrying protocol baggage for concepts that do not exist in the system. Versioned CAS, structured error codes with stable numeric identifiers, startup capability negotiation: none of that maps cleanly onto RESP without bending it into something it was not designed for. The free tooling argument is real, but redis-cli pointed at a server that speaks RESP yet behaves nothing like Redis would have been more confusing than helpful. Starting from scratch with VTP2 meant every language client starts from zero, which is a real cost, and one I am actively paying right now building the Go client. But the protocol is shaped around the actual workload rather than borrowed from a different one.

On embedded versus standalone: the no-network-hop model is genuinely appealing and for a single process workload I would have reached for something like sled or redb and skipped all of this. The reason I went standalone is that AuthSafe runs as multiple instances behind a load balancer. They all need to see the same coordination state. An embedded store does not help when the problem is multiple writers on different machines.
moteDB sounds interesting for the single-agent case. What is your consistency model across agent restarts?

P.S. - VTP = Vaylix Transport Protocol