I have been building AuthSafe, a developer auth platform, for three years. Auth infrastructure is unforgiving about correctness. A stale session state, a dropped rate limit counter, a coordination write that silently vanished: these are not edge cases you can wave away. They are production bugs with real consequences.
For most of those three years, I used Redis for coordination state. Rate limiting counters. Session metadata. Configuration that had to survive a restart. It worked well enough that I did not question it seriously until I started digging into what "well enough" actually meant on the durability side.
What I found made me uncomfortable enough to build something else. That project is Vaylix.
The Default Redis Durability Story Is Worse Than Most People Think
Here is the part that surprised me: AOF persistence, the mechanism that gives Redis its best durability story, is disabled by default in open source Redis. What runs out of the box is RDB snapshotting, which takes periodic point-in-time snapshots of the dataset. The default RDB configuration triggers a snapshot after 3600 seconds if at least one key changed, after 300 seconds if at least 100 keys changed, and after 60 seconds if at least 10,000 keys changed.
Under the default configuration, that means a Redis crash can lose anywhere from about a minute of writes under heavy write volume to a full hour under light volume. The client got OK back. The data is gone.
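To make that window concrete, here is a rough back-of-the-envelope model in Python. It assumes a constant write rate and treats each save point as firing once enough time has passed and enough keys have changed since the last snapshot. The save points are an assumption taken from a recent redis.conf (`save 3600 1 300 100 60 10000`); defaults have changed between Redis versions, so check your own configuration. The function name is made up for illustration.

```python
# Simplified model: a save point (seconds, changes) fires once at least
# `seconds` have elapsed since the last snapshot AND at least `changes`
# keys have changed. Values are assumed from a recent redis.conf.
SAVE_POINTS = [(3600, 1), (300, 100), (60, 10000)]


def worst_case_loss_seconds(writes_per_second, save_points=SAVE_POINTS):
    """Seconds of acknowledged writes at risk just before the next snapshot."""
    if writes_per_second <= 0:
        return 0.0  # nothing is written, so nothing can be lost
    # Each rule fires at max(time threshold, time to accumulate the changes);
    # the earliest rule to fire bounds the loss window.
    return min(
        max(seconds, changes / writes_per_second)
        for seconds, changes in save_points
    )


for rate in (0.01, 1, 1000):
    print(f"{rate} writes/s -> up to {worst_case_loss_seconds(rate)} s at risk")
```

Under this model a busy instance at 1000 writes per second risks about 60 seconds of writes, while a quiet one at one write every 100 seconds risks the full hour.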
If you enable AOF and set appendfsync everysec, that window shrinks to approximately one second. That is the configuration most production guides recommend and what Redis's own documentation describes as "fast enough and relatively safe." But one second of acknowledged writes disappearing is still a meaningful data loss window for coordination state, and appendfsync always, which fsyncs on every write, drops throughput by an order of magnitude or more compared to the default: from tens of thousands of operations per second down to a few thousand at best on SSDs.
None of this is a criticism of Redis. These are deliberate tradeoffs, documented openly, that make total sense for a system designed primarily as a cache. Redis is fast because it does not pay the full durability cost on every write. That is the right design for the workload it was built for.
For coordination state in an auth platform, it was the wrong design for my workload.
Why I Did Not Just Switch to etcd
etcd is the standard answer for strongly consistent key-value storage. It is battle-tested, runs at significant scale inside Kubernetes clusters worldwide, and its durability guarantees are genuine. Writes are not acknowledged until they are committed through Raft consensus and fsynced.
The problem is not etcd's correctness. The problem is that etcd's entire operational identity is Kubernetes infrastructure. Its documentation, its deployment patterns, its client API, its watch semantics — all shaped by that context. I spent two days trying to set up a simple three-node etcd cluster for a non-Kubernetes workload and kept hitting documentation that assumed I was configuring cluster state for container orchestration.
The operational weight was not justified for what I needed. I did not need etcd. I needed what etcd guarantees, without etcd's context.
What I Actually Needed
Stripped down to specifics:
Every acknowledged write had to survive a process crash. Not probably. Not within one second. Every write.
Reads had to be consistent with the latest committed write on the same connection. No stale replica reads for session-critical paths.
The security model had to be granular enough that different internal services could access different key namespaces without sharing a single global credential.
The deployment had to be operationally simple for a two or three node setup without a dedicated infrastructure team.
That is a narrow requirements list. No complex queries, no document storage, no pub/sub. Just a key-value store with correct durability semantics and a sensible auth model.
So I Built It
I had been learning Rust seriously and wanted a project with real systems constraints rather than toy complexity. The requirements above were specific enough to guide every architectural decision.
The durability foundation is a write-ahead log with fsync. Every write goes to the WAL and is fsynced before the client receives acknowledgement. On restart, the WAL replays to reconstruct state. No acknowledged write is lost.
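The write path described above can be sketched in a few lines of Python. This is not Vaylix's code or its on-disk format: the `TinyWal` class, the `replay` function, and the length-prefixed record layout are all hypothetical, chosen only to show the order of operations (append, fsync, then acknowledge) and how replay skips a torn tail.

```python
import os
import struct
import tempfile


class TinyWal:
    """Hypothetical append-only log: 4-byte length prefix, then the payload."""

    def __init__(self, path):
        self.file = open(path, "ab")

    def append(self, payload: bytes) -> None:
        self.file.write(struct.pack(">I", len(payload)) + payload)
        self.file.flush()             # move Python's buffer into the OS
        os.fsync(self.file.fileno())  # ask the OS to persist it to disk
        # Only after fsync returns would a server acknowledge the client.

    def close(self) -> None:
        self.file.close()


def replay(path):
    """Rebuild the list of records, stopping at the first incomplete one."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break  # clean end of log, or a torn header
            (length,) = struct.unpack(">I", header)
            payload = f.read(length)
            if len(payload) < length:
                break  # torn tail: this write was never acknowledged
            records.append(payload)
    return records


path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = TinyWal(path)
wal.append(b"SET ratelimit:alice 3")
wal.append(b"SET session:42 active")
wal.close()
print(replay(path))
```

A production log would also need per-record checksums and a directory fsync when the file is first created; the sketch leaves those out to keep the acknowledgement ordering visible.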
On top of that is Raft-style replication. Writes are not acknowledged until a quorum of nodes confirms receipt. As a result, even a leader crash immediately after acknowledgement leaves the data on a majority of nodes.
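The quorum arithmetic is worth spelling out for small clusters. The helper names below are hypothetical, and the sketch assumes the usual majority rule, with the leader counted among the nodes that have persisted the entry.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def can_acknowledge(acks: int, cluster_size: int) -> bool:
    """acks counts nodes (leader included) that have persisted the entry."""
    return acks >= quorum_size(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """Nodes that can be down while writes can still reach a quorum."""
    return cluster_size - quorum_size(cluster_size)


for n in (2, 3, 5):
    print(n, "nodes: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```

Under majority rule a two-node cluster needs both nodes to acknowledge and tolerates no failures, while three nodes need two acknowledgements and tolerate one failure.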
The wire protocol is a custom framed binary format rather than HTTP. Persistent connections with capability negotiation at startup, low per-request overhead, pipelined requests with UUID-based correlation. More work to build than wrapping HTTP, but the right design for stateful sessions.
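As an illustration of what framing plus UUID correlation buys you, here is a small Python sketch. The frame layout (a 4-byte big-endian body length, a 16-byte request id, then the body) and the function names are assumptions made for this example, not Vaylix's actual wire format.

```python
import struct
import uuid

# Assumed header layout: body length (uint32, big-endian) + 16-byte request id.
HEADER = struct.Struct(">I16s")


def encode_frame(request_id: uuid.UUID, body: bytes) -> bytes:
    return HEADER.pack(len(body), request_id.bytes) + body


def decode_frames(buffer: bytes):
    """Return (frames, leftover); each frame is a (request_id, body) pair."""
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        length, raw_id = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # body not fully received yet; wait for more bytes
        frames.append((uuid.UUID(bytes=raw_id), buffer[offset + HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]


# Two pipelined requests whose responses arrive in reverse order.
pending = {uuid.uuid4(): "GET a", uuid.uuid4(): "GET b"}
first, second = list(pending)
stream = encode_frame(second, b"value-b") + encode_frame(first, b"value-a")
frames, leftover = decode_frames(stream + b"\x00\x00")  # plus a partial header
for request_id, body in frames:
    print(pending[request_id], "->", body)
```

Because each response carries the id of its request, the client can match replies to pipelined requests without relying on arrival order, and a partial frame simply stays in the buffer until the rest arrives.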
Authentication and RBAC are on by default. Permissions are pattern-scoped at the key level, so a rate limiting service can read and write ratelimit:* without touching config:* or session:*.
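A pattern-scoped permission check of this kind can be sketched as follows. The grant table, role names, and `is_allowed` function are hypothetical, and the sketch assumes glob-style matching via Python's `fnmatch`; the pattern syntax Vaylix actually uses may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical grants: role -> list of (key pattern, allowed operations).
GRANTS = {
    "ratelimiter": [("ratelimit:*", {"read", "write"})],
    "config-reader": [("config:*", {"read"})],
}


def is_allowed(role, operation, key, grants=GRANTS):
    """Deny by default; allow only if a grant covers both key and operation."""
    return any(
        operation in operations and fnmatchcase(key, pattern)
        for pattern, operations in grants.get(role, [])
    )


print(is_allowed("ratelimiter", "write", "ratelimit:alice"))  # allowed
print(is_allowed("ratelimiter", "read", "config:db"))         # denied
print(is_allowed("config-reader", "write", "config:db"))      # denied
```

The important property is the default: a role with no matching grant gets nothing, so adding a new namespace never silently widens an existing service's access.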
The Honest Tradeoffs
Vaylix is slower than Redis on raw throughput, and significantly slower even in a comparison against appendfsync always. That gap is structural, not an optimization problem. Vaylix fsyncs every write, replicates to a quorum before acknowledging, and runs a serialized engine worker. Redis with default or everysec configuration skips most of that work. The latency difference reflects different guarantees, not different levels of engineering effort.
If you need a cache, a leaderboard, a job queue where occasional loss is tolerable, or a pub/sub bus, Redis is the right tool. Vaylix is not.
If you need acknowledged writes to survive crashes without configuration gymnastics, and you want a security model that does not require every internal service to share a root credential, that is the gap Vaylix was built to fill.
Where It Sits Now
Vaylix is three months old and running inside AuthSafe in production for rate limiting and coordination state. It is the first real test of whether the design holds under actual operational conditions. So far the failure modes have been explicit rather than silent, which is what you want from infrastructure you are trusting with correctness.
It is pre-1.0. The roadmap has richer transaction semantics, better cluster tooling, and more client SDKs. Sharding and MVCC are explicitly deferred until the core model is proven by real usage.
The project is open source under MIT.
If you have been running Redis for coordination workloads and have been quietly uncomfortable about the durability configuration, or if you have looked at etcd and decided the operational overhead is not worth it for a non-Kubernetes context, I would genuinely value your feedback on whether Vaylix fits the gap you have been working around.
Engine: https://github.com/vaylix/vaylix
TypeScript SDK: https://github.com/vaylix/vaylix-ts
Docs: https://vaylix.github.io