<devtips/>
I Replaced Redis with PostgreSQL (And It's Faster) Part 2

The metrics, tracing, and observability I should’ve shown the first time

The first post was a story about surprise.

Redis didn’t break. Postgres didn’t magically level up overnight. I just finally looked closely at what the system was actually doing instead of what I assumed it was doing. That story resonated with a lot of people, which honestly surprised me almost as much as the result itself.

But stories only get you so far in infrastructure.

If you say something like “Postgres was faster than Redis,” you’re no longer in anecdote territory. You’re making a performance claim. And performance claims don’t live or die on vibes; they live or die on latency percentiles, traces, and dashboards.

Part 1 explained what changed and why I questioned Redis.
Part 2 exists to answer the harder question:

How do we know?

This time, everything is measured properly. Same workload. Same infrastructure. Same code paths. But now with tracing enabled end-to-end, latency broken down into p50/p95/p99, and observability metrics that make the tradeoffs impossible to hand-wave away.

This isn’t about proving Redis is “bad” or Postgres is secretly magical. It’s about showing what actually happens when you look at the full request path: network hops, serialization, queueing, and all the boring details we usually skip.

TL;DR

Yes, there are real latency numbers this time

Yes, there are traces you can inspect

Yes, observability changes the conclusion

And no, this still isn’t a Redis hit piece

What part 1 didn’t show

Part 1 was honest, but it was incomplete.

It focused on the experience of removing Redis and watching things get better: lower latency, fewer weird edge cases, a calmer system overall. That’s a real outcome, and it mattered. But it stopped short of the part engineers actually use to make decisions.

There were no percentiles. No traces. No dashboards you could zoom into and argue with. And without those, the story depended a little too much on trust.

That’s the gap this article closes.

In Part 1, I talked about Redis and Postgres as components. In Part 2, the focus shifts to the request path: where time is spent, how often it spikes, and what actually shows up in p95 and p99 when traffic is steady. Instead of “it felt faster,” this version asks a stricter question:

where did the milliseconds go?

So the goal here isn’t to relitigate the decision. It’s to make it inspectable.

Same workload. Same code. Same infrastructure. The only difference is that this time every request is measured, traced, and graphed. If the original conclusion holds up under that level of visibility, great. If it doesn’t, that’s useful too.

Either way, this turns a narrative into something you can verify.

How this was measured (no hand-waving)

The rules were simple and boring on purpose.

Nothing about the application logic changed. Same endpoints, same data shape, same connection pools, same traffic pattern. The infrastructure stayed the same too: same region, same instance types, same network. The only variable was the backend handling the lookup: Redis versus PostgreSQL.

Traffic was steady, not a stress-test circus. I wanted to see how each setup behaved under normal pressure, not how fast it could fail. Each run was repeated multiple times, and anything that couldn’t be reproduced was thrown out.

This time, visibility was non-negotiable.

Latency was captured as percentiles (p50, p95, p99), not just averages. Requests were traced end-to-end so I could see where time was actually spent. Metrics lived in dashboards, not log lines. If a claim survived, it had to survive graphs.
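To make the averages-versus-percentiles point concrete, here’s a tiny sketch (my own illustration, not the post’s actual pipeline) of nearest-rank percentiles over a batch of latency samples:

```javascript
// Hypothetical nearest-rank percentile helper, to show why averages
// hide tail behavior. Not the real measurement code from this post.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  // nearest-rank: the smallest value with at least p% of samples at or below it
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// 98 fast requests at 2 ms plus two slow outliers
const samples = [...Array(98).fill(2), 200, 250];
const avg = samples.reduce((s, v) => s + v, 0) / samples.length;

console.log("avg:", avg);                     // 6.46 — looks harmless
console.log("p50:", percentile(samples, 50)); // 2
console.log("p99:", percentile(samples, 99)); // 200 — the tail tells the truth
```

Two requests out of a hundred are enough to make p99 a hundred times worse than the median, while the average barely twitches.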

No tuning to “win.” No cherry-picking best runs. Just the same system, observed properly.

With that in place, the results stopped being a feeling and started being something you could inspect.

Now we can talk about latency: the part everyone argues about, and the part most posts get wrong.

The setup (what was actually tested)

This wasn’t a synthetic benchmark or a “hello world” cache demo.

The workload was a real production path: a read-heavy lookup that previously hit Redis on almost every request. Same keys, same access patterns, same cache hit behavior. Nothing exotic. Exactly the kind of thing Redis usually gets added for without much debate.

On the Redis side, it was a straightforward GET. No Lua scripts. No clever batching. Just the default, boring usage most teams run in production.

On the PostgreSQL side, the data lived in a simple cache table with an index and an expiration check. The working set fit in memory, so reads were hitting the buffer cache, not disk. Again, nothing fancy.

Both paths were exercised under the same traffic profile for the same duration. Each run produced:

  • Latency percentiles (p50, p95, p99)
  • Backend-specific timing (Redis vs Postgres calls)
  • Error and timeout rates
  • Resource usage (CPU, memory, network)

The important part isn’t that the setup was clever. It’s that it was boring and repeatable. If a result showed up once but not again, it didn’t count. If it couldn’t be explained by traces or metrics, it wasn’t trusted.

With the setup locked in, the numbers finally had a chance to say something useful.

Let’s start with latency and why the median is the least interesting line on the chart.

Latency (the numbers that actually matter)

This is where Part 1 should’ve slowed down.

If you only look at averages, both Redis and Postgres look fast enough that the discussion feels pointless. Single-digit milliseconds, nothing alarming, everyone goes home happy. That’s also how you end up shipping systems that feel fine… until they don’t.

So this time I ignored averages completely and went straight to percentiles.

Here’s the summary from steady-state traffic:

Latency (ms) under steady load:

Redis: p50 [ ] p95 [ ] p99 [ ]

Postgres: p50 [ ] p95 [ ] p99 [ ]

The p50 numbers were close, which wasn’t surprising. Redis is excellent at best-case performance, and Postgres isn’t slow when the working set fits in memory. If you stop here, you can argue either side and feel correct.

The difference shows up in the tail.

Latency percentiles over time: averages hide this, tails don’t.

Redis had lower or comparable p50s, but p95 and p99 were noisier. Small spikes appeared under otherwise normal load. Nothing dramatic, just enough variance to stretch the long end of the curve. Postgres, by contrast, was boring. Slightly higher p50s in some runs, but tighter p95 and p99, with fewer sudden jumps.

That stability is the real signal.

Users don’t experience your median request. They experience the one that randomly takes longer than the rest. On-call engineers experience it too, usually as a vague “something feels off” before an alert ever fires. Tail latency is where systems earn or lose trust.

Once you look at the percentile charts over time, the pattern becomes hard to ignore: Redis wasn’t slow, but it was less predictable. Postgres wasn’t magically faster, but it was more consistent where it mattered.

This was the first concrete point where the original conclusion held up under scrutiny. Not because Postgres “won” on raw speed, but because the latency curve stayed tighter when traffic was steady.

Latency tells you that something is happening. To understand why, you need to look at traces, and that’s where things finally clicked.

Tracing (where the milliseconds actually went)

Latency percentiles told me that something was different. Tracing explained why.

Once I put end-to-end traces side by side, the request paths stopped being abstract ideas and turned into timelines you could reason about. The contrast wasn’t dramatic in any single span; it was cumulative.

On the Redis path, every request had a familiar shape:

  • Application handler
  • Serialization
  • Network hop
  • Redis command
  • Network hop back
  • Deserialization

Each step was fast. None of them were broken. But every step added a little variance. Under steady load, that variance showed up as small gaps between spans: brief waits on the socket, short pauses before a connection was reused, tiny delays that didn’t exist on every request — just often enough to stretch the tail.

The PostgreSQL traces looked simpler.

  • Application handler
  • Connection pool acquisition
  • Indexed query
  • Response

Because the database connection was already in the critical path, there was no extra service boundary to cross. When the working set stayed in memory, the query span was short and consistent. Fewer hops meant fewer places for jitter to sneak in.
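The shape of that Postgres timeline can be mimicked with a tiny hand-rolled timer. This is an illustrative sketch, not the actual tracing setup from the post, and the span names are my assumptions:

```javascript
// Record one { name, ms } entry per step, building a per-request timeline
// you can compare run to run — a toy stand-in for real distributed tracing.
async function traced(name, spans, fn) {
  const start = process.hrtime.bigint();
  try {
    return await fn();
  } finally {
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    spans.push({ name, ms });
  }
}

async function handleLookup() {
  const spans = [];
  // the Postgres path only has two interesting steps to time
  await traced("pool.acquire", spans, async () => { /* grab a pooled connection */ });
  await traced("indexed.query", spans, async () => { /* run the cache SELECT */ });
  return spans;
}

handleLookup().then((spans) => console.log(spans.map((s) => s.name)));
```

With only two spans per request, a slow request has exactly two places to hide, which is the whole point of the comparison.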

What surprised me most wasn’t the raw timing of individual spans; it was how predictable the Postgres traces were. The Redis traces had more variation from request to request, even when overall latency stayed low. The Postgres traces repeated the same shape over and over again, with less spread between “normal” and “slow.”

That predictability matters more than it sounds. When something does go wrong, traces are how you find it. With Redis in the middle, you’re stitching together two systems and trying to guess which one blinked first. With everything inside Postgres, the slow path is easier to see because there are fewer moving parts to inspect.

This is where observability stopped feeling like an add-on and started feeling like part of performance itself. The faster system wasn’t just the one with lower numbers; it was the one that made those numbers easier to explain.

Next, let’s look at observability metrics beyond latency, because tail spikes don’t exist in isolation. They leave fingerprints in hit rates, error counts, and resource usage too.

Observability (what I actually measured, in code)

In Part 1, observability was implied.
In Part 2, it’s explicit and measurable.

I didn’t rely on logs or “it feels smoother.” Every backend call was wrapped and timed, and the results were pushed into real metrics that produced p95 and p99 charts.

Here’s the exact shape of what I measured.

First, a latency histogram, split by backend:

import client from "prom-client";

const cacheLatency = new client.Histogram({
  name: "cache_backend_latency_ms",
  help: "Latency of cache backend calls",
  labelNames: ["backend"],
  buckets: [0.5, 1, 2, 5, 10, 20, 50, 100, 200]
});

Then a tiny wrapper to make sure every call was measured the same way:

async function timedMetric(backend, fn) {
  const end = cacheLatency.startTimer({ backend });
  try {
    return await fn();
  } finally {
    end();
  }
}

Now the important part: Redis and Postgres go through the exact same wrapper.

// Redis lookup
const redisValue = await timedMetric("redis", () => redis.get(key));

// Postgres lookup
const pgValue = await timedMetric("postgres", async () => {
  const result = await pool.query(
    "SELECT value FROM cache WHERE key = $1 AND expires_at > NOW()",
    [key]
  );
  return result.rows[0]?.value ?? null;
});

Because both paths emit into the same histogram, Grafana can compute real percentiles using histogram_quantile(): no guesswork, no cherry-picking.
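For reference, the p99 panel comes from a query shaped roughly like this (the 5m window is my choice; prom-client exposes the histogram buckets as `cache_backend_latency_ms_bucket` with an `le` label):

```promql
histogram_quantile(
  0.99,
  sum by (backend, le) (
    rate(cache_backend_latency_ms_bucket[5m])
  )
)
```

Swapping 0.99 for 0.5 or 0.95 gives the other two lines on the chart.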

That’s where this table comes from:

Latency (ms) under steady load:

Redis: p50 [ ] p95 [ ] p99 [ ]
Postgres: p50 [ ] p95 [ ] p99 [ ]

But latency alone isn’t enough. I also tracked timeouts and errors, because tail latency usually shows up there first:

const backendErrors = new client.Counter({
  name: "cache_backend_errors_total",
  help: "Errors from cache backend calls",
  labelNames: ["backend"]
});

try {
  return await timedMetric("redis", () => redis.get(key));
} catch (err) {
  backendErrors.inc({ backend: "redis" });
  throw err;
}

Same pattern for Postgres.

Finally, I exposed everything on /metrics:

app.get("/metrics", async (_, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});

That’s it. No magic.

From these few counters and histograms, I got:

  • p50 / p95 / p99 latency per backend
  • Error rates per backend
  • Time-series graphs that showed jitter over time

And here’s the key point: the charts told the same story the traces hinted at.
Redis wasn’t slow, but it had more variance. Postgres wasn’t flashy, but its metrics were flatter and easier to reason about.

Observability didn’t just confirm the result; it explained it.

Next, let’s look at the actual code changes that made this possible, because removing Redis wasn’t magic either.

The actual code change (what replaced Redis)

Up to this point, everything has been about measurement. This is the part people usually want first: what actually changed in the code.

Spoiler: nothing dramatic.

This wasn’t a rewrite or a clever abstraction. It was just removing an extra hop and letting Postgres do work it was already good at.

Here’s what the Redis path looked like before.

Redis cache lookup

// read
const cached = await redis.get(`user:${id}`);
if (cached) {
  return JSON.parse(cached);
}

// fallback
const user = await db.query(
  "SELECT * FROM users WHERE id = $1",
  [id]
);

// write
await redis.set(
  `user:${id}`,
  JSON.stringify(user),
  "EX",
  3600
);

return user;

Nothing wrong with this. This is how most apps use Redis. But it creates two independent systems that need to stay in sync, and every request pays for a network round trip even when the data is hot.

Now compare that to the Postgres version.

Postgres cache table

CREATE UNLOGGED TABLE cache (
  key TEXT PRIMARY KEY,
  value JSONB NOT NULL,
  expires_at TIMESTAMPTZ NOT NULL
);

CREATE INDEX idx_cache_expires
  ON cache (expires_at);

Postgres cache lookup

// read
const result = await pool.query(
  "SELECT value FROM cache WHERE key = $1 AND expires_at > NOW()",
  [`user:${id}`]
);
if (result.rows.length > 0) {
  return result.rows[0].value;
}

// fallback
const user = await pool.query(
  "SELECT * FROM users WHERE id = $1",
  [id]
);

// write
await pool.query(
  `INSERT INTO cache (key, value, expires_at)
   VALUES ($1, $2, NOW() + INTERVAL '1 hour')
   ON CONFLICT (key) DO UPDATE
     SET value = EXCLUDED.value,
         expires_at = EXCLUDED.expires_at`,
  [`user:${id}`, user.rows[0]]
);

return user.rows[0];

That’s it.

No Lua scripts. No fancy extensions. Just an indexed table, an expiration check, and an upsert.

Because the table is UNLOGGED, writes skip WAL and stay fast. Because the working set fits in memory, reads hit the buffer cache. And because everything happens over the same connection, there’s no extra hop to pay for.
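One detail the idx_cache_expires index supports, though it isn’t shown above: expired rows are filtered out by the expires_at > NOW() predicate, but they still have to be deleted eventually. A periodic job (my addition, a plausible sketch rather than part of the original setup) keeps the table from growing:

```sql
-- Run periodically, e.g. from cron or pg_cron.
-- The index on expires_at keeps this scan cheap.
DELETE FROM cache WHERE expires_at <= NOW();
```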

The important part isn’t that this code is clever. It’s that it’s boring.

And boring code is easy to measure, easy to trace, and easy to reason about when something goes wrong.

Once this was in place, the observability from the previous section lined up perfectly with what the code suggested:

  • Fewer round trips
  • Fewer failure modes
  • Fewer places for latency to wobble

At this point, Redis wasn’t “replaced” by a new system. It was replaced by less system.
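One consequence of “less system” worth sketching (my illustration, not code from the post): because the cache lives in the same database as the source of truth, invalidation can happen in the same transaction as the write that makes a cached row stale, something the two-system Redis setup could only approximate:

```sql
BEGIN;
UPDATE users SET email = 'new@example.com' WHERE id = 42;
-- the stale cache entry disappears atomically with the update
DELETE FROM cache WHERE key = 'user:42';
COMMIT;
```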

When Redis still wins

This setup worked because the workload was boring.

Read-heavy. Predictable. Hot data. Already going through Postgres. In that shape, Redis wasn’t adding much beyond another network hop.

Outside of that, Redis still absolutely earns its place.

If you need very high throughput, ultra-low latency, or Redis-specific data structures like sorted sets, streams, or HyperLogLog, Postgres isn’t a drop-in replacement. Redis will win on raw speed and specialized features, especially when tuned aggressively or paired with Lua scripts.

And if your architecture requires a shared or cross-service cache layer, Redis makes sense structurally, not just performance-wise.

The takeaway isn’t “Postgres replaces Redis.”
It’s “Redis should be intentional.”

In my case, Redis wasn’t wrong; it just wasn’t necessary.

What actually changed long-term

The biggest change wasn’t the latency numbers. It was how the system felt to work on.

With Redis gone, there was one less service to reason about. One less set of dashboards. One less place where state could quietly drift out of sync. When something felt off, I didn’t have to ask “is it the cache or the database?” The answer was just “the database,” and the traces backed that up.

Incidents got simpler. Debugging got faster. New code paths were easier to reason about because everything lived inside the same transactional boundary. That kind of clarity compounds over time in a way small millisecond wins never do.

The system didn’t just get faster. It got calmer.

And that’s the part I didn’t expect going in: removing a tool didn’t reduce capability; it reduced ambiguity. Once that clicked, the metrics almost felt secondary.

Which brings us to the real takeaway not about Redis or Postgres, but about how we make these decisions in the first place.

The real takeaway

This wasn’t about Redis being bad or Postgres being a hidden cheat code.

It was about assumptions.

Redis was added because it felt like the responsible choice. It stayed because nothing was obviously broken. And it only got questioned once the graphs stopped lining up with the story I believed.

Postgres didn’t “win” because it’s faster in isolation. It won because it was already in the path, already warm, and didn’t add another hop or failure domain. Fewer moving parts meant fewer surprises, and the metrics made that obvious.

The real lesson isn’t “replace Redis.”
It’s “measure before you trust patterns.”

If this pushed you to look at your own dashboards instead of your instincts, it did its job.

Helpful resources (the stuff I actually used)

If you want to dig into this yourself or sanity-check your own setup, these are the resources that mattered most:
