We put AOF persistence through 9 configurations across local SSD SAS and Longhorn. The results are definitive.
When designing a caching layer for a production migration to bare metal Kubernetes, we faced a question that sounds simple but turned out to have an expensive answer: should Redis AOF persistence live on Longhorn distributed storage?
The Redis documentation hints at the answer. But intuition and documentation are not the same as production data. So we ran redis-benchmark across nine configurations — varying storage backend, persistence settings, and dataset size — and measured the impact empirically.
The results are unambiguous, and one number in particular should give any architect pause.
Test Configuration
All tests used the same parameters throughout:
requests: 50,000
clients: 20 parallel
payload: 180,000 bytes (~180 KB)
keep-alive: enabled (`--keepalive 1`); no pipelining
threads: single-threaded (no `--threads` flag)
The 180 KB payload is intentional — it reflects realistic cache object sizes for the production workload being benchmarked, not the micro-payload tests commonly seen in vendor benchmarks.
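Put together, those parameters correspond to an invocation along these lines (host and port are placeholders, and the `-t` test selection is inferred from the SET/GET/PING metrics reported rather than stated in the post):

```shell
# redis-benchmark invocation matching the parameters above.
# Host/port are placeholders; the -t selection is inferred from
# the SET/GET/PING metrics reported, not stated explicitly.
redis-benchmark -h redis.cache.svc -p 6379 \
  -n 50000 -c 20 -d 180000 --keepalive 1 -t set,get,ping
```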
Nine environments tested:
| Label | Storage | AOF | RDB | Dataset |
|---|---|---|---|---|
| Local · AOF off | Local SSD SAS | No | Thresholds | Empty |
| Local · AOF on (baseline) | Local SSD SAS | Yes | Thresholds | Empty |
| Local · AOF on (tuning 1) | Local SSD SAS | Yes | Thresholds | Empty |
| Local · AOF on (tuning 2) | Local SSD SAS | Yes | Thresholds | Empty |
| Local · AOF on (t2 + data) | Local SSD SAS | Yes | Thresholds | 375,795 keys |
| Longhorn · AOF on (empty) | Longhorn | Yes | Thresholds | Empty |
| Longhorn · AOF on (data) | Longhorn | Yes | Thresholds | 375,795 keys |
SET Throughput: The Core Finding
The most important metric for a write-capable cache is SET throughput under load. Here are the results:
| Configuration | SET RPS | SET avg latency | SET p99 latency |
|---|---|---|---|
| Local · AOF off | 7,696 | 1.47 ms | 5.12 ms |
| Local · AOF on (baseline) | 1,275 | 14.39 ms | 102.53 ms |
| Local · AOF on (tuning 1) | 1,251 | 15.03 ms | 105.92 ms |
| Local · AOF on (tuning 2) | 1,248 | 15.03 ms | 112.38 ms |
| Local · AOF on (t2 + 375K keys) | 1,212 | 15.85 ms | 121.15 ms |
| Longhorn · AOF on (empty) | 577 | 33.56 ms | 225.66 ms |
| Longhorn · AOF on (375K keys) | 537 | 36.17 ms | 201.86 ms |
Let that sink in. Local SSD SAS with AOF disabled: 7,696 SET RPS, p99 = 5 ms. Longhorn with AOF enabled: 537 SET RPS, p99 = 202 ms.
That is a 14.3x throughput difference and a 39x p99 latency difference — on the same application code, same Redis version, same client.
The worst-case single SET operation on Longhorn reached 903 ms. For a cache layer.
The AOF Wall on Local Storage
Before we get to Longhorn, it's worth understanding what AOF persistence costs even on fast local SSD SAS.
Disabling AOF (keeping only RDB snapshot thresholds) delivers:
- SET p99: 3.8–5.1 ms
- Average SET latency: ~1.5 ms
Enabling AOF on the same local storage:
- SET p99: 102–121 ms
- Average SET latency: 14–16 ms
That's roughly a 20x p99 latency penalty just from AOF on local SSD SAS. And critically — tuning doesn't help. Across three tuning iterations (different appendfsync settings, no-appendfsync-on-rewrite toggles, and RDB threshold adjustments), the p99 numbers barely moved:
- Baseline: 102.5 ms p99
- Tuning 1: 105.9 ms p99
- Tuning 2: 112.4 ms p99 ← actually got worse
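For reference, these are the kinds of knobs the tuning iterations exercised, expressed as redis.conf directives. The exact per-iteration values are not published in this post; the values below are illustrative, not the author's settings:

```conf
# Illustrative AOF/RDB knobs of the kind iterated in the tuning runs;
# exact per-iteration values are not published in this post.

# fsync the AOF once per second (alternatives: always, no)
appendfsync everysec
# skip fsync while a BGREWRITEAOF is in progress
no-appendfsync-on-rewrite yes
# RDB snapshot thresholds
save 3600 1 300 100
```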
The reason is fundamental: AOF with appendfsync everysec must call fsync() at least once per second. On an otherwise busy single-threaded Redis instance processing 180 KB payloads, this fsync stall dominates. You cannot tune your way past it.
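A quick back-of-envelope on write volume makes the stall plausible: even at the throttled AOF-on rate of ~1,250 SET/s, 180 KB payloads mean the AOF must absorb roughly 225 MB/s of sustained appends before fsync even enters the picture. (A rough estimate; the actual AOF volume is slightly higher due to RESP protocol framing.)

```shell
# Rough sustained AOF append volume at the measured AOF-on SET rate:
# 1,248 requests/s * 180,000 bytes/request, ignoring RESP overhead.
awk 'BEGIN { printf "%.0f MB/s\n", 1248 * 180000 / 1e6 }'
# -> 225 MB/s
```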
Why Longhorn Makes AOF Catastrophic
Longhorn is a distributed block storage system for Kubernetes. It replicates data across nodes for durability. This is excellent for stateful workloads like databases with controlled write patterns.
Redis AOF is not that.
AOF appends to a log file on every write operation (or at least every second with everysec). The write pattern is continuous, small, and latency-sensitive. When this write pattern hits Longhorn:
- Each AOF append crosses the network to the Longhorn controller
- The controller replicates to N replicas before acknowledging
- Only then does Redis get its fsync confirmation
- Redis is single-threaded — it waits
The result: every SET operation pays the cost of network round-trip + multi-replica write confirmation. At 180 KB payload size, this stacks badly.
Redis's own documentation says:
"Avoid storing AOF/RDB files on storage that has network latency in the I/O path, such as NFS mounts."
Longhorn is effectively that — a network-replicated volume. The documentation warning is correct. Our benchmark puts a number on it: 903 ms max latency, 202 ms p99.
GET Performance
One important nuance: GET performance is much less affected by persistence settings.
| Configuration | GET RPS | GET avg latency |
|---|---|---|
| Local · AOF off | 8,027 | 1.47 ms |
| Local · AOF on (baseline) | 2,537 | 4.29 ms |
| Longhorn · AOF on (375K keys) | 2,522 | 4.21 ms |
Longhorn doesn't significantly degrade GET performance compared to AOF-on local storage. This makes sense — reads don't write to the AOF log. The Longhorn penalty only appears when Redis needs to persist.
PING Latency: The Baseline
PING throughput gives a sense of the overhead without persistence in the picture:
| Configuration | PING RPS | PING avg latency |
|---|---|---|
| Local · AOF off | ~37,000 | 0.32 ms |
| Local · AOF on (baseline) | ~11,000–18,000 | 0.84–1.50 ms |
| Longhorn · AOF on | ~19,000–21,000 | 0.74–0.83 ms |
Interestingly, PING performance on Longhorn is better than AOF-on-local at baseline. The Longhorn overhead only materializes when Redis actually needs to write to the AOF log — confirming that the bottleneck is specifically the persistence write path, not general Longhorn I/O overhead.
The Recommended Architecture
Based on these results, the right architecture for this workload is a split-persistence design:
Hot path (primary):
- Redis with AOF disabled
- RDB snapshots only, with generous thresholds (e.g., `save 3600 1`)
- Local-path storage on SSD SAS
- Result: 7,600+ SET RPS, sub-5 ms p99
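As a redis.conf sketch, the hot path above looks roughly like this (a minimal illustration, not the exact production config; the `dir` path is a placeholder):

```conf
# Hot-path sketch: AOF disabled, generous RDB threshold.
# A minimal illustration, not the exact production configuration.

# no append-only file: SETs never wait on fsync
appendonly no
# RDB snapshot if at least 1 key changed within 1 hour
save 3600 1
# local-path PVC mount on SSD SAS; path is illustrative
dir /data
```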
Recovery path (replica):
- Redis replica of the primary
- RDB-only snapshots to persistent storage (Longhorn acceptable here — snapshot writes are infrequent and bursty)
- Not in the hot write path
This gives you sub-5 ms p99 at full throughput on the write path, while maintaining durability guarantees through the replica's periodic snapshots. If the primary fails, you lose at most one RDB snapshot interval of data — which for most cache workloads is acceptable.
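The recovery path can be sketched the same way (hostname, port, snapshot cadence, and mount path are all placeholders, not the post's settings):

```conf
# Recovery-path replica sketch; hostname, port, and path are placeholders.

# follow the hot-path primary
replicaof redis-primary.cache.svc 6379
# RDB-only persistence on the replica as well
appendonly no
# snapshot cadence here is illustrative
save 900 1
# Longhorn PVC mount: acceptable for infrequent, bursty RDB writes
dir /longhorn-data
```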
If true durability for every write is required (it rarely is for a cache), the right answer is a different tool — not Redis with AOF on distributed storage.
Summary
| Question | Answer |
|---|---|
| Can AOF on local SSD SAS achieve good SET latency? | No. p99 stays above 100 ms regardless of tuning. |
| Can AOF on Longhorn achieve acceptable SET latency? | No. p99 reaches 202 ms, max 903 ms. |
| Does Longhorn affect GET performance with AOF? | Minimally — GETs don't write to AOF. |
| What's the right architecture for high-throughput caching? | AOF disabled on hot path, RDB replica for recovery. |
| Is the Redis documentation warning about network storage accurate? | Definitively yes. Our data confirms it. |
The 14x throughput gap between AOF-on-Longhorn and AOF-off-local is not a configuration problem. It is an architectural mismatch. Building a fast cache on slow persistence is a contradiction — and these numbers prove it.
Environment Details
- Redis version: 7.x
- Storage backends: Local-path provisioner (SSD SAS) and Longhorn 1.6 on Kubernetes 1.31
- redis-benchmark parameters: `-n 50000 -c 20 -d 180000 --keepalive 1`
- Single-threaded mode throughout (no `--threads` flag)
- Dataset: Empty at baseline; 375,795 keys for loaded tests
Questions about Redis architecture on Kubernetes? Leave a comment below.
— Iwan Setiawan, Hybrid Cloud & Platform Architect · portfolio.kangservice.cloud