Iwan Setiawan

Redis + AOF + Distributed Storage: A Cautionary Benchmark

We put AOF persistence through seven configurations across local SSD SAS and Longhorn. The results are definitive.


When designing a caching layer for a production migration to bare metal Kubernetes, we faced a question that sounds simple but turned out to have an expensive answer: should Redis AOF persistence live on Longhorn distributed storage?

The Redis documentation hints at the answer. But intuition and documentation are not the same as production data. So we ran redis-benchmark across seven configurations — varying storage backend, persistence settings, and dataset size — and measured the impact empirically.

The results are unambiguous, and one number in particular should give any architect pause.


Test Configuration

All tests used the same parameters throughout:

requests:    50,000
clients:     20 parallel
payload:     180,000 bytes (~180 KB)
keepalive:   1 (connections reused; no pipelining)
threads:     single-threaded (no --threads flag)

The 180 KB payload is intentional — it reflects realistic cache object sizes for the production workload being benchmarked, not the micro-payload tests commonly seen in vendor benchmarks.
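These parameters map to the following redis-benchmark invocation, shown for reference (it requires a running Redis instance; the `-t set,get,ping` selection is an assumption based on the tests reported below):

```shell
redis-benchmark -t set,get,ping -n 50000 -c 20 -d 180000 --keepalive 1
```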

Seven environments tested:

| Label | Storage | AOF | RDB | Dataset |
|---|---|---|---|---|
| Local · AOF off | Local SSD SAS | No | Thresholds | Empty |
| Local · AOF on (baseline) | Local SSD SAS | Yes | Thresholds | Empty |
| Local · AOF on (tuning 1) | Local SSD SAS | Yes | Thresholds | Empty |
| Local · AOF on (tuning 2) | Local SSD SAS | Yes | Thresholds | Empty |
| Local · AOF on (t2 + data) | Local SSD SAS | Yes | Thresholds | 375,795 keys |
| Longhorn · AOF on (empty) | Longhorn | Yes | Thresholds | Empty |
| Longhorn · AOF on (data) | Longhorn | Yes | Thresholds | 375,795 keys |

SET Throughput: The Core Finding

The most important metric for a write-capable cache is SET throughput under load. Here are the results:

| Configuration | SET RPS | SET avg latency | SET p99 latency |
|---|---|---|---|
| Local · AOF off | 7,696 | 1.47 ms | 5.12 ms |
| Local · AOF on (baseline) | 1,275 | 14.39 ms | 102.53 ms |
| Local · AOF on (tuning 1) | 1,251 | 15.03 ms | 105.92 ms |
| Local · AOF on (tuning 2) | 1,248 | 15.03 ms | 112.38 ms |
| Local · AOF on (t2 + 375K keys) | 1,212 | 15.85 ms | 121.15 ms |
| Longhorn · AOF on (empty) | 577 | 33.56 ms | 225.66 ms |
| Longhorn · AOF on (375K keys) | 537 | 36.17 ms | 201.86 ms |

Let that sink in. Local SSD SAS with AOF disabled: 7,696 SET RPS, p99 = 5 ms. Longhorn with AOF enabled: 537 SET RPS, p99 = 202 ms.

That is a 14.3x throughput difference and a 39x p99 latency difference — on the same application code, same Redis version, same client.
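The headline ratios fall straight out of the table rows above; a quick awk sanity check:

```shell
# Best local result (AOF off) vs. worst Longhorn result (AOF on, 375K keys)
awk 'BEGIN {
  printf "SET throughput ratio: %.1fx\n", 7696 / 537      # RPS
  printf "SET p99 latency ratio: %.1fx\n", 201.86 / 5.12  # ms
}'
```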

The worst-case single SET operation on Longhorn reached 903 ms. For a cache layer.


The AOF Wall on Local Storage

Before we get to Longhorn, it's worth understanding what AOF persistence costs even on fast local SSD SAS.

Disabling AOF (keeping only RDB snapshot thresholds) delivers:

  • SET p99: 3.8–5.1 ms
  • Average SET latency: ~1.5 ms

Enabling AOF on the same local storage:

  • SET p99: 102–121 ms
  • Average SET latency: 14–16 ms

That's roughly a 20x p99 latency penalty just from AOF on local SSD SAS. And critically — tuning doesn't help. Across three tuning iterations (different appendfsync settings, no-appendfsync-on-rewrite toggles, and RDB threshold adjustments), the p99 numbers barely moved:

Baseline:  102.5 ms p99
Tuning 1:  105.9 ms p99
Tuning 2:  112.4 ms p99  ← actually got worse

The reason is fundamental: AOF with appendfsync everysec calls fsync() once per second from a background thread, and when that fsync runs slow, the main thread stalls on its own AOF buffer writes. On a busy single-threaded Redis instance pushing 180 KB payloads, this stall dominates tail latency. You cannot tune your way past it.
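For context, the knobs varied across the tuning runs are standard redis.conf settings. A representative configuration might look like this (illustrative values — the exact per-run settings aren't listed in this post):

```
# AOF persistence (settings varied across the tuning runs)
appendonly yes
appendfsync everysec            # fsync once per second from a background thread
no-appendfsync-on-rewrite yes   # skip fsync while an AOF rewrite is in progress

# RDB snapshot thresholds (also adjusted between runs)
save 900 1
```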


Why Longhorn Makes AOF Catastrophic

Longhorn is a distributed block storage system for Kubernetes. It replicates data across nodes for durability. This is excellent for stateful workloads like databases with controlled write patterns.

Redis AOF is not that.

AOF appends to a log file on every write operation (or at least every second with everysec). The write pattern is continuous, small, and latency-sensitive. When this write pattern hits Longhorn:

  1. Each AOF append crosses the network to the Longhorn controller
  2. The controller replicates to N replicas before acknowledging
  3. Only then does Redis get its fsync confirmation
  4. Redis is single-threaded — it waits

The result: every SET operation pays the cost of network round-trip + multi-replica write confirmation. At 180 KB payload size, this stacks badly.
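The two loaded runs let us roughly isolate that cost: subtracting the local AOF-on average SET latency from the Longhorn average gives an estimate of what the network replication path adds per operation (a rough figure — it ignores queueing effects):

```shell
# Avg SET latency, 375K-key runs: Longhorn 36.17 ms vs. local AOF-on 15.85 ms
awk 'BEGIN {
  printf "Estimated replication overhead per SET: ~%.1f ms\n", 36.17 - 15.85
}'
```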

Redis's own documentation says:

"Avoid storing AOF/RDB files on storage that has network latency in the I/O path, such as NFS mounts."

Longhorn is effectively that — a network-replicated volume. The documentation warning is correct. Our benchmark puts a number on it: 903 ms max latency, 202 ms p99.


GET Performance

One important nuance: GET performance is much less affected by persistence settings.

| Configuration | GET RPS | GET avg latency |
|---|---|---|
| Local · AOF off | 8,027 | 1.47 ms |
| Local · AOF on (baseline) | 2,537 | 4.29 ms |
| Longhorn · AOF on (375K keys) | 2,522 | 4.21 ms |

Longhorn doesn't significantly degrade GET performance compared to AOF-on local storage. This makes sense — reads don't write to the AOF log. The Longhorn penalty only appears when Redis needs to persist.


PING Latency: The Baseline

PING throughput gives a sense of the overhead without persistence in the picture:

| Configuration | PING RPS | PING avg latency |
|---|---|---|
| Local · AOF off | ~37,000 | 0.32 ms |
| Local · AOF on (baseline) | ~11,000–18,000 | 0.84–1.50 ms |
| Longhorn · AOF on | ~19,000–21,000 | 0.74–0.83 ms |

Interestingly, PING performance on Longhorn is better than AOF-on-local at baseline. The Longhorn overhead only materializes when Redis actually needs to write to the AOF log — confirming that the bottleneck is specifically the persistence write path, not general Longhorn I/O overhead.


The Recommended Architecture

Based on these results, the right architecture for this workload is a split-persistence design:

Hot path (primary):

  • Redis with AOF disabled
  • RDB snapshots only, with generous thresholds (e.g., save 3600 1)
  • Local-path storage on SSD SAS
  • Result: 7,600+ SET RPS, sub-5 ms p99

Recovery path (replica):

  • Redis replica of the primary
  • RDB-only snapshots to persistent storage (Longhorn acceptable here — snapshot writes are infrequent and bursty)
  • Not in the hot write path
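A minimal redis.conf sketch of the split (the hostname and the replica's snapshot cadence are placeholders, not values from the benchmark):

```
# --- primary (hot path): no AOF, generous RDB threshold ---
appendonly no
save 3600 1                    # snapshot at most hourly if at least 1 key changed

# --- replica (recovery path): RDB-only snapshots to durable storage ---
replicaof redis-primary 6379   # placeholder hostname
appendonly no
save 900 1                     # placeholder cadence; tune to your loss tolerance
dir /data                      # mount on the persistent (e.g. Longhorn) volume
```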

This gives you sub-5 ms p99 at full throughput on the write path, while maintaining durability guarantees through the replica's periodic snapshots. If the primary fails, you lose at most one RDB snapshot interval of data — which for most cache workloads is acceptable.

If true durability for every write is required (it rarely is for a cache), the right answer is a different tool — not Redis with AOF on distributed storage.


Summary

| Question | Answer |
|---|---|
| Can AOF on local SSD SAS achieve good SET latency? | No. p99 stays above 100 ms regardless of tuning. |
| Can AOF on Longhorn achieve acceptable SET latency? | No. p99 reaches 202 ms, max 903 ms. |
| Does Longhorn affect GET performance with AOF? | Minimally — GETs don't write to AOF. |
| What's the right architecture for high-throughput caching? | AOF disabled on hot path, RDB replica for recovery. |
| Is the Redis documentation warning about network storage accurate? | Definitively yes. Our data confirms it. |

The 14x throughput gap between AOF-on-Longhorn and AOF-off-local is not a configuration problem. It is an architectural mismatch. Building a fast cache on slow persistence is a contradiction — and these numbers prove it.


Environment Details

  • Redis version: 7.x
  • Storage backends: Local-path provisioner (SSD SAS) and Longhorn 1.6 on Kubernetes 1.31
  • redis-benchmark parameters: -n 50000 -c 20 -d 180000 --keepalive 1
  • Single-threaded mode throughout (no --threads flag)
  • Dataset: Empty at baseline; 375,795 keys for loaded tests

Questions about Redis architecture on Kubernetes? Leave a comment below.

— Iwan Setiawan, Hybrid Cloud & Platform Architect · portfolio.kangservice.cloud
