<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ovaise Qayoom</title>
    <description>The latest articles on DEV Community by Ovaise Qayoom (@ovaiseq).</description>
    <link>https://dev.to/ovaiseq</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847171%2Ff72d805e-a2d8-4bc0-8fc4-736dbfb330ed.png</url>
      <title>DEV Community: Ovaise Qayoom</title>
      <link>https://dev.to/ovaiseq</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ovaiseq"/>
    <language>en</language>
    <item>
      <title>From 0 to 10M Requests/Day, Architecting a Boring but Bulletproof Backend</title>
      <dc:creator>Ovaise Qayoom</dc:creator>
      <pubDate>Thu, 02 Apr 2026 16:01:35 +0000</pubDate>
      <link>https://dev.to/ovaiseq/from-0-to-10m-requestsday-architecting-a-boring-but-bulletproof-backend-poh</link>
      <guid>https://dev.to/ovaiseq/from-0-to-10m-requestsday-architecting-a-boring-but-bulletproof-backend-poh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — You do not need Kafka, Kubernetes, or 17 microservices to handle serious traffic. A well-structured monolith, PostgreSQL, Redis, PgBouncer, a CDN, and horizontal application scaling can take you to 10 million requests per day with less risk, less cost, and far less operational pain.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Most backend scaling guides start at the wrong point.&lt;br&gt;
They show you the architecture &lt;em&gt;after&lt;/em&gt; a company has 200 engineers, multiple platform teams, and years of accumulated infrastructure. Then they present that architecture as if it were the natural starting point.&lt;/p&gt;

&lt;p&gt;It was not.&lt;/p&gt;

&lt;p&gt;Most high-traffic systems begin the same way: one application, one database, one deployment path, one team trying to ship product fast enough to matter. The systems that survive are not the ones that adopt the most technology. They are the ones that adopt the right technology at the right time.&lt;/p&gt;

&lt;p&gt;This is a guide to that path.&lt;br&gt;
Not the hype version.&lt;br&gt;
Not the conference-talk version.&lt;br&gt;
The production version.&lt;/p&gt;


&lt;h2&gt;
  
  
  What This Guide Covers
&lt;/h2&gt;

&lt;p&gt;This post is an opinionated case study plus guide for building a &lt;strong&gt;high-traffic backend&lt;/strong&gt; using a &lt;strong&gt;scalable monolith&lt;/strong&gt; and a small number of carefully chosen supporting components.&lt;/p&gt;

&lt;p&gt;It is optimized for teams that care about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shipping fast without building infrastructure theater&lt;/li&gt;
&lt;li&gt;Scaling predictably under real production traffic&lt;/li&gt;
&lt;li&gt;Keeping debugging and deployment simple&lt;/li&gt;
&lt;li&gt;Avoiding premature microservices&lt;/li&gt;
&lt;li&gt;Reaching 10M requests/day without rewriting the whole backend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The recurring themes: a &lt;strong&gt;scalable monolith&lt;/strong&gt;, pragmatic architecture, predictable production scaling, and deliberately &lt;strong&gt;boring technology&lt;/strong&gt;: Node.js, PostgreSQL, Redis, and PgBouncer.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Boring Technology Wins
&lt;/h2&gt;

&lt;p&gt;Boring technology is not boring because it lacks power.&lt;/p&gt;

&lt;p&gt;It is boring because it is &lt;strong&gt;understood&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That matters more than people admit.&lt;/p&gt;

&lt;p&gt;A system that uses PostgreSQL, Redis, Nginx, and a monolith is easier to reason about, easier to observe, easier to debug at 2am, and easier to hire for than a system built from six trendy abstractions nobody fully understands.&lt;/p&gt;

&lt;p&gt;The highest-leverage backend architecture principle is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prefer the simplest system that survives your current load with margin.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That single rule eliminates most bad architectural decisions.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Traffic Stages
&lt;/h2&gt;

&lt;p&gt;Every backend goes through roughly four traffic stages. The architecture that is reasonable at one stage is often unnecessary or actively harmful at another.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmq32t4xgdgd63nsli98l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmq32t4xgdgd63nsli98l.png" alt="Diagram 1" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The mistake most teams make is solving Stage 4 problems while still in Stage 1.&lt;/p&gt;

&lt;p&gt;That is how you end up with architecture that looks impressive and ships slowly.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Boring Stack
&lt;/h2&gt;

&lt;p&gt;Here is the baseline stack for a &lt;strong&gt;pragmatic high-traffic backend&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Why it stays&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;Node.js + Fastify&lt;/td&gt;
&lt;td&gt;Fast enough, mature, simple developer experience&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Primary database&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Transactional, reliable, great indexing, JSON support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache&lt;/td&gt;
&lt;td&gt;Redis&lt;/td&gt;
&lt;td&gt;Perfect for read-heavy data, sessions, rate limits, queues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queue&lt;/td&gt;
&lt;td&gt;BullMQ&lt;/td&gt;
&lt;td&gt;Uses Redis, avoids introducing another broker too early&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Connection pooler&lt;/td&gt;
&lt;td&gt;PgBouncer&lt;/td&gt;
&lt;td&gt;Protects Postgres from connection explosion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reverse proxy&lt;/td&gt;
&lt;td&gt;Nginx&lt;/td&gt;
&lt;td&gt;Stable, battle-tested, high-performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CDN&lt;/td&gt;
&lt;td&gt;Cloudflare&lt;/td&gt;
&lt;td&gt;Offloads traffic before it reaches origin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;Prometheus + Grafana&lt;/td&gt;
&lt;td&gt;Standard, simple, effective&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs&lt;/td&gt;
&lt;td&gt;Structured JSON logs&lt;/td&gt;
&lt;td&gt;Searchable, aggregatable, production-friendly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What is intentionally &lt;em&gt;not&lt;/em&gt; here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka by default&lt;/li&gt;
&lt;li&gt;Microservices by default&lt;/li&gt;
&lt;li&gt;Kubernetes by default&lt;/li&gt;
&lt;li&gt;Event-driven everything&lt;/li&gt;
&lt;li&gt;Service mesh&lt;/li&gt;
&lt;li&gt;Distributed transactions&lt;/li&gt;
&lt;li&gt;Premature CQRS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of those are inherently bad. They are just not your starting point.&lt;/p&gt;


&lt;h2&gt;
  
  
  Stage 1: Build a Monolith That Is Easy to Scale
&lt;/h2&gt;

&lt;p&gt;At low traffic, the correct architecture is usually one codebase, one app process group, one database, one deployment pipeline.&lt;/p&gt;

&lt;p&gt;That is not a temporary embarrassment. That is good architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pcs7jxi0tmkdebnfyq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pcs7jxi0tmkdebnfyq3.png" alt="Diagram" width="800" height="108"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Why a Monolith Is the Right Default
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;scalable monolith&lt;/strong&gt; gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One deployable unit&lt;/li&gt;
&lt;li&gt;One place to debug business logic&lt;/li&gt;
&lt;li&gt;One database transaction boundary&lt;/li&gt;
&lt;li&gt;No network hops between internal features&lt;/li&gt;
&lt;li&gt;No distributed systems complexity&lt;/li&gt;
&lt;li&gt;No cross-service schema drift&lt;/li&gt;
&lt;li&gt;No internal API versioning burden&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this stage, your job is not to create elegant infrastructure. Your job is to create a stable product with clean boundaries inside a single codebase.&lt;/p&gt;

&lt;p&gt;That means modularizing &lt;em&gt;inside&lt;/em&gt; the monolith.&lt;/p&gt;
&lt;h3&gt;
  
  
  Monolith Structure That Ages Well
&lt;/h3&gt;

&lt;p&gt;A good monolith is not a folder full of chaos. It has domain boundaries even though it deploys as one service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/
  modules/
    auth/
      auth.routes.js
      auth.service.js
      auth.repo.js
    billing/
      billing.routes.js
      billing.service.js
      billing.repo.js
    users/
      users.routes.js
      users.service.js
      users.repo.js
  lib/
    db.js
    cache.js
    queue.js
    logger.js
  app.js
  server.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you the operational simplicity of a monolith and the code organization of a more mature system.&lt;/p&gt;
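<p>To make that structure concrete, here is a minimal, self-contained sketch of the wiring (file names and helper names are illustrative, and an in-memory <code>Map</code> stands in for the real database):</p>

```javascript
// users/users.repo.js — data access only
const makeUsersRepo = (db) => ({
  findById: async (id) => db.get(id) ?? null,
});

// users/users.service.js — business logic only; knows nothing about storage
const makeUsersService = (repo) => ({
  async profile(id) {
    const user = await repo.findById(id);
    if (!user) return null;
    const { passwordHash, ...publicFields } = user; // never leak internals
    return publicFields;
  },
});

// app.js — the composition root wires modules together in one place
const db = new Map([['u1', { id: 'u1', email: 'a@b.co', passwordHash: 'x' }]]);
const users = makeUsersService(makeUsersRepo(db));
```

<p>Because composition happens in one place, extracting a module into its own service later is mechanical rather than painful.</p>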

&lt;h3&gt;
  
  
  Start with Pooling Immediately
&lt;/h3&gt;

&lt;p&gt;Even at low traffic, do not connect to PostgreSQL casually from every request path. Use pooling from the start.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import Fastify from 'fastify';
import postgres from '@fastify/postgres';

const app = Fastify({ logger: true });

// @fastify/postgres forwards these options straight to pg's Pool;
// they are top-level options, not nested under a `pool` key.
app.register(postgres, {
  connectionString: process.env.DATABASE_URL,
  max: 10,                       // cap physical connections per process
  idleTimeoutMillis: 30000,      // recycle idle connections
  connectionTimeoutMillis: 2000, // fail fast if the pool is exhausted
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The point is not that you need massive pooling on day one.&lt;/p&gt;

&lt;p&gt;The point is that you build the habit before traffic arrives.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 1 Database Design: Decisions That Matter Later
&lt;/h2&gt;

&lt;p&gt;Most scaling pain is not caused by traffic alone. It is caused by data model decisions that looked harmless when traffic was small.&lt;/p&gt;

&lt;h3&gt;
  
  
  Index Foreign Keys Immediately
&lt;/h3&gt;

&lt;p&gt;This is one of the most common missing pieces in production systems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;CASCADE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;slug&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'draft'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_posts_user_id&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_posts_created_at&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_posts_published&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'published'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That partial index on &lt;code&gt;status = 'published'&lt;/code&gt; matters because it reflects the access pattern you are likely to have in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Design for Query Shapes, Not Just Entities
&lt;/h3&gt;

&lt;p&gt;A schema is not just a representation of objects. It is a representation of future queries.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What will be filtered often?&lt;/li&gt;
&lt;li&gt;What will be sorted often?&lt;/li&gt;
&lt;li&gt;What joins will appear on critical paths?&lt;/li&gt;
&lt;li&gt;What should be cached?&lt;/li&gt;
&lt;li&gt;What can remain eventually consistent?&lt;/li&gt;
&lt;li&gt;What must stay transactional?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That level of thinking matters more than fashionable architecture diagrams.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 2: Remove the Easy Bottlenecks
&lt;/h2&gt;

&lt;p&gt;Once traffic starts becoming meaningful, most problems are still boring.&lt;/p&gt;

&lt;p&gt;They usually fall into one of these buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repeated identical reads hitting the database&lt;/li&gt;
&lt;li&gt;Too many application connections reaching Postgres&lt;/li&gt;
&lt;li&gt;Missing indexes&lt;/li&gt;
&lt;li&gt;Static assets reaching the app server unnecessarily&lt;/li&gt;
&lt;li&gt;Expensive synchronous operations done inside requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where good &lt;strong&gt;production scaling&lt;/strong&gt; begins.&lt;/p&gt;
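<p>The last bucket, expensive synchronous work inside requests, has a simple cure: enqueue and respond. Here is a runnable sketch of the shape; the queue below is an in-memory stand-in with a BullMQ-like <code>add()</code> signature (names are illustrative), where production would enqueue to BullMQ backed by Redis and process jobs in a separate worker.</p>

```javascript
// In-memory queue stand-in exposing a BullMQ-like add(name, data) shape.
const makeQueue = () => {
  const jobs = [];
  return {
    async add(name, data) { jobs.push({ name, data }); return jobs.length; },
    drain: () => jobs.splice(0), // a worker would consume these
  };
};

const emailQueue = makeQueue();

// Request handler: enqueue the slow work and respond immediately,
// instead of blocking the request on an SMTP round trip.
async function signupHandler(body) {
  // ...create the user row transactionally...
  await emailQueue.add('welcome-email', { to: body.email });
  return { ok: true }; // the request never waits on email delivery
}
```

<p>The request path now does only the transactional minimum; everything retryable moves behind the queue.</p>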




&lt;h2&gt;
  
  
  Add Redis the Right Way
&lt;/h2&gt;

&lt;p&gt;Redis should enter the architecture as a targeted optimization, not as a random dependency.&lt;/p&gt;

&lt;p&gt;Three strong uses at this stage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Caching&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rate limiting&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Session storage / temporary state&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
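<p>Rate limiting is the easiest of the three to show end to end. This is a fixed-window limiter in the shape you would build on Redis <code>INCR</code> plus <code>EXPIRE</code>; the store below is an in-memory stand-in implementing just those two commands so the sketch runs anywhere (with ioredis you would pass the client instead, and the helper names are illustrative).</p>

```javascript
// Fixed-window rate limiter: one counter key per client per time window.
const makeLimiter = (store, { limit = 100, windowSec = 60 } = {}) => ({
  async allow(clientId) {
    const window = Math.floor(Date.now() / 1000 / windowSec);
    const key = `rl:${clientId}:${window}`;
    const count = await store.incr(key);            // first INCR creates the key
    if (count === 1) await store.expire(key, windowSec); // TTL bounds memory
    return count <= limit;
  },
});

// In-memory stand-in for the two Redis commands the limiter needs.
const memoryStore = () => {
  const m = new Map();
  return {
    async incr(k) { const v = (m.get(k) ?? 0) + 1; m.set(k, v); return v; },
    async expire() {}, // no-op here; Redis would evict the key at the TTL
  };
};
```

<p>A fixed window allows brief bursts at window boundaries; that is usually an acceptable trade for two commands and zero moving parts.</p>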

&lt;h3&gt;
  
  
  Cache-Aside Pattern
&lt;/h3&gt;

&lt;p&gt;The most reliable caching pattern for application data is still cache-aside.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Cache&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;del&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;del&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;getOrFetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fetchFn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fresh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetchFn&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fresh&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;fresh&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fresh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;fresh&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/posts/:slug&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`post:slug:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrFetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT * FROM posts WHERE slug = $1 AND status = $2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;published&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="mi"&gt;600&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Not found&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cache What Is Expensive and Stable
&lt;/h3&gt;

&lt;p&gt;Good cache candidates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Product details&lt;/li&gt;
&lt;li&gt;Blog posts&lt;/li&gt;
&lt;li&gt;User profile summaries&lt;/li&gt;
&lt;li&gt;Public pricing data&lt;/li&gt;
&lt;li&gt;Aggregated dashboard numbers&lt;/li&gt;
&lt;li&gt;Settings that rarely change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bad cache candidates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly volatile counters without clear invalidation&lt;/li&gt;
&lt;li&gt;Permission-sensitive data unless keyed safely&lt;/li&gt;
&lt;li&gt;Write-heavy rows with constant mutation&lt;/li&gt;
&lt;li&gt;Anything you cannot invalidate confidently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Caching is not magic. It is a trade: memory and invalidation complexity in exchange for lower latency and lower database load.&lt;/p&gt;
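<p>The invalidation side of that trade deserves its own few lines. A runnable sketch of the write path under cache-aside, with an in-memory cache stand-in and illustrative names:</p>

```javascript
// In-memory cache stand-in with the get/set/del shape of the Cache class.
const makeCache = () => {
  const m = new Map();
  return {
    async get(k) { return m.get(k) ?? null; },
    async set(k, v) { m.set(k, v); },
    async del(k) { m.delete(k); },
  };
};

const cache = makeCache();

async function updatePost(db, post) {
  await db.save(post);                       // 1. write to the source of truth
  await cache.del(`post:slug:${post.slug}`); // 2. then drop the stale copy
  // Deliberately do NOT write the new value into the cache here; letting the
  // next read repopulate it avoids racing with concurrent writers.
}
```

<p>Delete-on-write keeps the invalidation rule trivial to reason about: the cache only ever holds what a read put there.</p>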




&lt;h2&gt;
  
  
  Protect PostgreSQL with PgBouncer
&lt;/h2&gt;

&lt;p&gt;At moderate traffic, PostgreSQL often stops being limited by raw query power and starts being limited by connection handling.&lt;/p&gt;

&lt;p&gt;That is what PgBouncer solves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxh3fw64mvwh87o54fu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxh3fw64mvwh87o54fu3.png" alt="Diagram-" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why PgBouncer Matters
&lt;/h3&gt;

&lt;p&gt;Your app may create many logical connections.&lt;/p&gt;

&lt;p&gt;Postgres should not have to maintain that many physical ones.&lt;/p&gt;

&lt;p&gt;PgBouncer lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smooth connection spikes&lt;/li&gt;
&lt;li&gt;Keep Postgres stable&lt;/li&gt;
&lt;li&gt;Scale app instances without linearly scaling DB connections&lt;/li&gt;
&lt;li&gt;Reduce memory pressure at the database layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Critical config principle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;pool_mode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;transaction&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That one setting is often the difference between a useful PgBouncer deployment and a misleading one.&lt;/p&gt;
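
&lt;p&gt;For context, a minimal &lt;code&gt;pgbouncer.ini&lt;/code&gt; built around that principle might look like this (the values are illustrative starting points, not tuned recommendations):&lt;/p&gt;

```ini
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction   ; the setting that matters most
default_pool_size = 20    ; physical Postgres connections per database/user pair
max_client_conn = 1000    ; logical connections PgBouncer will accept
```

&lt;p&gt;One caveat: transaction pooling is incompatible with session-level state, such as session-scoped prepared statements or SET values expected to persist across transactions.&lt;/p&gt;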




&lt;h2&gt;
  
  
  Stage 3: Horizontal Scale Without Losing Simplicity
&lt;/h2&gt;

&lt;p&gt;Now the backend starts looking more serious.&lt;/p&gt;

&lt;p&gt;Not because you adopted microservices.&lt;br&gt;
Because you removed single points of failure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3irwv3tnh5l74msvsa3w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3irwv3tnh5l74msvsa3w.png" alt="Diagram 2" width="800" height="623"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is still a boring architecture.&lt;/p&gt;

&lt;p&gt;It is also enough for a &lt;strong&gt;high-traffic backend&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Add Read Replicas When Read Load Dominates
&lt;/h2&gt;

&lt;p&gt;The right time to add Postgres read replicas is when your write volume is manageable but reads are crowding the primary.&lt;/p&gt;

&lt;p&gt;That is a common pattern for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SaaS dashboards&lt;/li&gt;
&lt;li&gt;CMS-backed sites&lt;/li&gt;
&lt;li&gt;APIs with heavy lookup traffic&lt;/li&gt;
&lt;li&gt;B2B platforms with read-heavy admin views&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Split Reads and Writes Explicitly
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;pg&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;writePool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DATABASE_PRIMARY_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;readPool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DATABASE_REPLICA_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;writePool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;

  &lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;readPool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is deliberately explicit.&lt;/p&gt;

&lt;p&gt;You want engineers to &lt;em&gt;know&lt;/em&gt; when they are reading from a replica versus writing to a primary.&lt;/p&gt;
&lt;h3&gt;
  
  
  Understand Replica Lag
&lt;/h3&gt;

&lt;p&gt;Read replicas are not free throughput.&lt;/p&gt;

&lt;p&gt;They introduce a real tradeoff: &lt;strong&gt;replica lag&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That means a request can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write to primary&lt;/li&gt;
&lt;li&gt;Immediately read from replica&lt;/li&gt;
&lt;li&gt;Not see its own write yet&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So do not route consistency-sensitive reads blindly to replicas.&lt;/p&gt;

&lt;p&gt;Examples of reads that should still hit primary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediately after creating a resource&lt;/li&gt;
&lt;li&gt;Checkout confirmation flows&lt;/li&gt;
&lt;li&gt;Billing updates&lt;/li&gt;
&lt;li&gt;Permission changes&lt;/li&gt;
&lt;li&gt;Authentication-adjacent state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where many scaling guides get too hand-wavy. Replica lag is not theoretical. You must design for it.&lt;/p&gt;
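
&lt;p&gt;One common way to design for it is read-your-writes pinning: after a user writes, route that user's reads to the primary for a short window assumed to exceed typical replication lag. A minimal in-memory sketch (the helper names and the window length are illustrative):&lt;/p&gt;

```javascript
// Sketch: read-your-writes routing under replica lag. After a user
// writes, pin that user's reads to the primary for a short window.
const PIN_WINDOW_MS = 5000;          // assumed to exceed typical lag
const lastWriteAt = new Map();       // userId -> timestamp of last write

function recordWrite(userId, now = Date.now()) {
  lastWriteAt.set(userId, now);
}

// Returns 'primary' while the user is inside the pin window,
// 'replica' otherwise.
function chooseTarget(userId, now = Date.now()) {
  const t = lastWriteAt.get(userId);
  return t !== undefined && now - t < PIN_WINDOW_MS ? 'primary' : 'replica';
}
```

&lt;p&gt;In production the pin usually travels in a cookie or session token rather than process memory, so it survives load balancing across app instances.&lt;/p&gt;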


&lt;h2&gt;
  
  
  Add a CDN Earlier Than Most Teams Do
&lt;/h2&gt;

&lt;p&gt;A CDN is not just for images.&lt;/p&gt;

&lt;p&gt;It is a traffic absorber.&lt;/p&gt;

&lt;p&gt;A CDN should sit in front of your system as early as possible because it gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower latency globally&lt;/li&gt;
&lt;li&gt;Reduced origin load&lt;/li&gt;
&lt;li&gt;Edge caching for static assets&lt;/li&gt;
&lt;li&gt;Basic DDoS mitigation&lt;/li&gt;
&lt;li&gt;TLS termination&lt;/li&gt;
&lt;li&gt;Better burst handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloudflare alone can eliminate a surprising amount of backend work before the request ever reaches your application.&lt;/p&gt;
&lt;h3&gt;
  
  
  What Should Be Cached at the Edge
&lt;/h3&gt;

&lt;p&gt;Good CDN candidates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JS, CSS, fonts, images&lt;/li&gt;
&lt;li&gt;Static marketing pages&lt;/li&gt;
&lt;li&gt;Public docs pages&lt;/li&gt;
&lt;li&gt;Public blog content with short revalidation windows&lt;/li&gt;
&lt;li&gt;Some anonymous API responses if safe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bad CDN candidates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Authenticated user dashboards&lt;/li&gt;
&lt;li&gt;Personalized responses&lt;/li&gt;
&lt;li&gt;Permission-sensitive resources&lt;/li&gt;
&lt;li&gt;Anything with ambiguous cache headers&lt;/li&gt;
&lt;/ul&gt;
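
&lt;p&gt;The cleanest way to enforce that split is to make the edge policy explicit per response class instead of relying on defaults. A sketch (the header values are illustrative, not recommendations):&lt;/p&gt;

```javascript
// Sketch: explicit Cache-Control policy per response class, so the
// CDN never has to guess. Values are illustrative starting points.
const CACHE_POLICY = {
  static: 'public, max-age=31536000, immutable',               // hashed JS/CSS/fonts
  publicPage: 'public, max-age=60, stale-while-revalidate=300', // blog/docs pages
  private: 'private, no-store',                                 // dashboards, personalized
};

function cacheControlFor(kind) {
  return CACHE_POLICY[kind] ?? CACHE_POLICY.private; // fail closed
}
```

&lt;p&gt;Unknown response classes fall through to no-store, which fails closed: a response is never edge-cached by accident.&lt;/p&gt;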


&lt;h2&gt;
  
  
  Add Rate Limiting Before You Think You Need It
&lt;/h2&gt;

&lt;p&gt;A production system gets stressed not only by success but by abuse, bugs, retries, crawlers, scripts, and bursty clients.&lt;/p&gt;

&lt;p&gt;Rate limiting is not just security. It is stability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;RateLimiterRedis&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rate-limiter-flexible&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;limiter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RateLimiterRedis&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;storeClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;keyPrefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rate_limit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;blockDuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;rateLimit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;limiter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;consume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Too many requests&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The crucial point here is that the limiter state lives in Redis, not memory.&lt;/p&gt;

&lt;p&gt;If the limiter is in memory per instance, it becomes inconsistent behind a load balancer.&lt;/p&gt;
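
&lt;p&gt;A tiny illustration of that failure mode, with two independent in-memory limiters standing in for two app instances behind a round-robin load balancer:&lt;/p&gt;

```javascript
// Sketch: why per-instance limiter state is inconsistent behind a
// load balancer. Each in-memory counter allows `points` requests,
// so a client spread across both instances gets double the budget.
function makeLocalLimiter(points) {
  const used = new Map();
  return (key) => {
    const n = (used.get(key) ?? 0) + 1;
    used.set(key, n);
    return n <= points; // true = request allowed
  };
}

const instanceA = makeLocalLimiter(100);
const instanceB = makeLocalLimiter(100);

let allowed = 0;
for (let i = 0; i < 200; i++) {
  // round-robin between the two app instances
  const limiter = i % 2 === 0 ? instanceA : instanceB;
  if (limiter('client-1')) allowed++;
}
// allowed is 200: twice the intended 100-request budget
```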




&lt;h2&gt;
  
  
  Stage 4: Make the System Degrade Gracefully
&lt;/h2&gt;

&lt;p&gt;At 1M to 10M requests/day, you stop thinking only about scale and start thinking about &lt;strong&gt;failure shape&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The question changes from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Can the system handle this?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How does the system behave when it cannot?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a more mature question.&lt;/p&gt;




&lt;h2&gt;
  
  
  Move Non-Critical Work Out of the Request Path
&lt;/h2&gt;

&lt;p&gt;Anything not required to complete the user-visible response should leave the request path.&lt;/p&gt;

&lt;p&gt;That includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Emails&lt;/li&gt;
&lt;li&gt;Webhook delivery&lt;/li&gt;
&lt;li&gt;Image processing&lt;/li&gt;
&lt;li&gt;Video transcoding&lt;/li&gt;
&lt;li&gt;Search indexing&lt;/li&gt;
&lt;li&gt;Report generation&lt;/li&gt;
&lt;li&gt;Analytics fanout&lt;/li&gt;
&lt;li&gt;Notification dispatch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where a Redis-backed queue like BullMQ is exactly right.&lt;/p&gt;

&lt;h3&gt;
  
  
  Background Jobs with BullMQ
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Worker&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bullmq&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;emailQueue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;defaultJobOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;attempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;backoff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;exponential&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;removeOnComplete&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;removeOnFail&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;emailWorker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;template&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sendEmail&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;template&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;concurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/users/:id/welcome-email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT id, email, name FROM users WHERE id = $1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;User not found&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;emailQueue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;welcome&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Welcome aboard&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;welcome&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;202&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Email queued&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a major step in &lt;strong&gt;backend architecture&lt;/strong&gt; maturity.&lt;/p&gt;

&lt;p&gt;Not because queues are fashionable. Because they protect request latency and isolate failure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Build a Caching Hierarchy
&lt;/h2&gt;

&lt;p&gt;At 10M requests/day, caching must exist at more than one layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhumecymgttrd6wgd2vj.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhumecymgttrd6wgd2vj.webp" alt="Diagram 3" width="800" height="35"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each layer exists for a different reason:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CDN&lt;/strong&gt; handles global traffic and static/public content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nginx&lt;/strong&gt; can absorb repeated identical upstream requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis&lt;/strong&gt; handles application-level object caching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read replicas&lt;/strong&gt; absorb remaining read pressure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary&lt;/strong&gt; is reserved for writes and consistency-sensitive reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If everything reaches the primary database, the rest of your scaling story is mostly fiction.&lt;/p&gt;
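
&lt;p&gt;The Nginx layer is the one teams most often skip. A sketch of micro-caching for repeated identical requests (the paths, zone sizes, and TTLs are illustrative):&lt;/p&gt;

```nginx
# Sketch: Nginx micro-caching for identical upstream requests.
# Paths, zone sizes, and TTLs are illustrative starting points.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:10m max_size=1g inactive=10m;

server {
  location /api/public/ {
    proxy_cache api_cache;
    proxy_cache_valid 200 10s;           # short TTL: absorbs bursts, stays fresh
    proxy_cache_use_stale error timeout; # serve stale if upstream is struggling
    proxy_cache_lock on;                 # collapse concurrent misses into one upstream hit
    proxy_pass http://app_upstream;
  }
}
```

&lt;p&gt;&lt;code&gt;proxy_cache_lock&lt;/code&gt; is the quiet hero here: concurrent misses for the same URL collapse into a single upstream request instead of a stampede.&lt;/p&gt;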




&lt;h2&gt;
  
  
  Observability: The Part Everyone Delays Too Long
&lt;/h2&gt;

&lt;p&gt;A backend is not scalable because it can survive load once in a benchmark.&lt;/p&gt;

&lt;p&gt;It is scalable when you can understand what is happening under production load quickly enough to act.&lt;/p&gt;

&lt;p&gt;That means metrics, logs, health checks, and dashboards.&lt;/p&gt;

&lt;h3&gt;
  
  
  Minimum Metrics You Need
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Request rate&lt;/td&gt;
&lt;td&gt;Traffic shape and load changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p95 latency&lt;/td&gt;
&lt;td&gt;Real user experience under load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error rate&lt;/td&gt;
&lt;td&gt;Detects systemic failure quickly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DB query duration&lt;/td&gt;
&lt;td&gt;Tells you when DB is the bottleneck&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache hit rate&lt;/td&gt;
&lt;td&gt;Shows whether Redis is doing useful work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queue depth&lt;/td&gt;
&lt;td&gt;Reveals background work backlog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU / memory&lt;/td&gt;
&lt;td&gt;Capacity planning and saturation signals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replica lag&lt;/td&gt;
&lt;td&gt;Prevents stale-read surprises&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Example Prometheus Metrics
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;prom-client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collectDefaultMetrics&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;httpDuration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http_request_duration_seconds&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;help&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HTTP request duration in seconds&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;labelNames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;method&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;route&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;status_code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.005&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dbQueryDuration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;db_query_duration_seconds&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;help&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Database query duration in seconds&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;labelNames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;operation&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;table&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.005&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Practical Thresholds
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Warning&lt;/th&gt;
&lt;th&gt;Critical&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;p95 request latency&lt;/td&gt;
&lt;td&gt;&amp;gt; 200ms&lt;/td&gt;
&lt;td&gt;&amp;gt; 500ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p95 DB query latency&lt;/td&gt;
&lt;td&gt;&amp;gt; 50ms&lt;/td&gt;
&lt;td&gt;&amp;gt; 200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache hit rate&lt;/td&gt;
&lt;td&gt;&amp;lt; 85%&lt;/td&gt;
&lt;td&gt;&amp;lt; 70%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5xx error rate&lt;/td&gt;
&lt;td&gt;&amp;gt; 0.5%&lt;/td&gt;
&lt;td&gt;&amp;gt; 2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queue backlog growth&lt;/td&gt;
&lt;td&gt;Sustained 5 min&lt;/td&gt;
&lt;td&gt;Sustained 15 min&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you do not know these numbers, you do not yet know whether your backend is healthy.&lt;/p&gt;
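One way to keep these thresholds from living only in a wiki is to encode them directly in alerting logic. A minimal sketch in Go (the numbers are the table's; the function and its signature are illustrative, not part of any particular alerting framework):

```go
package main

import "fmt"

// severity classifies a p95 request-latency reading (in ms) against
// the warning/critical thresholds from the table above.
func severity(p95LatencyMs float64) string {
	switch {
	case p95LatencyMs > 500:
		return "critical"
	case p95LatencyMs > 200:
		return "warning"
	default:
		return "ok"
	}
}

func main() {
	for _, ms := range []float64{120, 250, 700} {
		fmt.Printf("p95=%vms -> %s\n", ms, severity(ms)) // ok, warning, critical
	}
}
```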




&lt;h2&gt;
  
  
  When Not to Use Microservices
&lt;/h2&gt;

&lt;p&gt;This section matters because too many teams still treat microservices as a rite of passage.&lt;/p&gt;

&lt;p&gt;They are not.&lt;/p&gt;

&lt;p&gt;Microservices are a tax. Sometimes a necessary one, often an unnecessary one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do Not Break the Monolith If:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;One team can still understand the codebase&lt;/li&gt;
&lt;li&gt;Deployments are still manageable&lt;/li&gt;
&lt;li&gt;The database is not your deployment bottleneck&lt;/li&gt;
&lt;li&gt;Most modules do not need independent scaling&lt;/li&gt;
&lt;li&gt;Internal network calls would replace in-process calls for no clear gain&lt;/li&gt;
&lt;li&gt;Your operational maturity is still limited&lt;/li&gt;
&lt;li&gt;Your debugging pipeline is still weak&lt;/li&gt;
&lt;li&gt;You do not have platform engineers&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Consider Extracting a Service Only If:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;One subsystem has drastically different scaling needs&lt;/li&gt;
&lt;li&gt;One subsystem needs a different runtime or storage model&lt;/li&gt;
&lt;li&gt;Deployment coupling is causing continuous friction&lt;/li&gt;
&lt;li&gt;Team boundaries are stable and long-lived&lt;/li&gt;
&lt;li&gt;You already have strong observability and operational discipline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffa7c3dy5l0gs7lm6qamf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffa7c3dy5l0gs7lm6qamf.png" alt="Diagram 4" width="800" height="966"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal is not ideological purity.&lt;/p&gt;

&lt;p&gt;The goal is operational sanity.&lt;/p&gt;




&lt;h2&gt;
  
  
  What 10M Requests/Day Actually Means
&lt;/h2&gt;

&lt;p&gt;This number sounds dramatic, but it helps to break it down.&lt;/p&gt;

&lt;p&gt;10M requests/day is about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;115 requests per second on average&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Often &lt;strong&gt;300–600 req/s at peak&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Higher during bursts, launches, crawls, or regional concentration&lt;/li&gt;
&lt;/ul&gt;
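The division behind those numbers is worth writing out once (the 3–5× peak multiplier is an assumption about traffic shape, not a law):

```go
package main

import "fmt"

// avgRPS converts a daily request count into an average per-second rate.
func avgRPS(perDay float64) float64 {
	return perDay / 86_400 // seconds in a day
}

func main() {
	avg := avgRPS(10_000_000)
	fmt.Printf("average: %.1f req/s\n", avg)                 // average: 115.7 req/s
	fmt.Printf("3-5x peak: %.0f-%.0f req/s\n", avg*3, avg*5) // 3-5x peak: 347-579 req/s
}
```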

&lt;p&gt;That is large enough to be real.&lt;/p&gt;

&lt;p&gt;It is also absolutely within reach of a well-built monolith plus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;horizontal app scaling&lt;/li&gt;
&lt;li&gt;Redis caching&lt;/li&gt;
&lt;li&gt;PgBouncer&lt;/li&gt;
&lt;li&gt;read replicas&lt;/li&gt;
&lt;li&gt;CDN offload&lt;/li&gt;
&lt;li&gt;async job processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scary part is not the request count.&lt;/p&gt;

&lt;p&gt;The scary part is waste.&lt;/p&gt;

&lt;p&gt;A bad query, a missing index, an unbounded N+1 pattern, or a synchronous email send can collapse a backend long before raw traffic becomes the real problem.&lt;/p&gt;
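To make the N+1 point concrete, here is a minimal sketch in Go, used here only for illustration (the `store` type is an in-memory stand-in for a real database; its `queries` counter tallies round trips):

```go
package main

import "fmt"

// store stands in for a real database and counts round trips.
type store struct{ queries int }

func (s *store) userIDs() []int {
	s.queries++
	return []int{1, 2, 3, 4, 5}
}

// emailFor fetches one user's email; called in a loop, this is the N+1 pattern.
func (s *store) emailFor(id int) string {
	s.queries++
	return fmt.Sprintf("u%d@example.com", id)
}

// emailsFor fetches all emails in one batched query: the fix.
func (s *store) emailsFor(ids []int) []string {
	s.queries++
	out := make([]string, 0, len(ids))
	for _, id := range ids {
		out = append(out, fmt.Sprintf("u%d@example.com", id))
	}
	return out
}

func main() {
	a := &store{}
	for _, id := range a.userIDs() {
		a.emailFor(id) // one query per user
	}
	fmt.Println("N+1 round trips:", a.queries) // N+1 round trips: 6

	b := &store{}
	b.emailsFor(b.userIDs()) // one query for all users
	fmt.Println("batched round trips:", b.queries) // batched round trips: 2
}
```

The N+1 version scales its round trips with the result set; the batched version stays constant no matter how many users there are.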




&lt;h2&gt;
  
  
  Production Checklist
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Database
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] PostgreSQL primary configured correctly&lt;/li&gt;
&lt;li&gt;[ ] Read replicas added when read traffic justifies them&lt;/li&gt;
&lt;li&gt;[ ] PgBouncer deployed in transaction mode&lt;/li&gt;
&lt;li&gt;[ ] All foreign keys indexed&lt;/li&gt;
&lt;li&gt;[ ] Slow query logging enabled&lt;/li&gt;
&lt;li&gt;[ ] Backup and restore tested, not just configured&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;pg_stat_statements&lt;/code&gt; enabled&lt;/li&gt;
&lt;/ul&gt;
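With `pg_stat_statements` enabled, finding the worst offenders is one query away. A sketch (the column names are from the standard extension as of PostgreSQL 13+; in a real service the string would be passed to `database/sql` against the primary):

```go
package main

import "fmt"

// topSlowQueriesSQL ranks statements by total execution time using
// the pg_stat_statements extension (PostgreSQL 13+ column names).
const topSlowQueriesSQL = `
SELECT query,
       calls,
       total_exec_time,
       mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;`

func main() {
	fmt.Println(topSlowQueriesSQL)
}
```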

&lt;h3&gt;
  
  
  Caching
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Redis in place for cache, sessions, or rate limiting&lt;/li&gt;
&lt;li&gt;[ ] Cache key naming convention documented&lt;/li&gt;
&lt;li&gt;[ ] Invalidation exists on all write paths&lt;/li&gt;
&lt;li&gt;[ ] Cache hit ratio monitored&lt;/li&gt;
&lt;li&gt;[ ] No sensitive cross-user cache leakage&lt;/li&gt;
&lt;/ul&gt;
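A minimal cache-aside sketch with write-path invalidation, in Go for illustration (a map stands in for Redis; the `entity:id` key convention is an example, not a rule):

```go
package main

import "fmt"

// userCache does cache-aside reads and invalidates on writes.
type userCache struct {
	cache map[string]string // Redis stand-in
	db    map[int]string    // source of truth
}

// key follows a documented naming convention: "user:<id>".
func key(id int) string { return fmt.Sprintf("user:%d", id) }

// Get reads through the cache, falling back to the database on a miss.
func (c *userCache) Get(id int) string {
	if v, ok := c.cache[key(id)]; ok {
		return v // cache hit
	}
	v := c.db[id]
	c.cache[key(id)] = v
	return v
}

// Update writes to the database and invalidates (never updates) the cache.
func (c *userCache) Update(id int, name string) {
	c.db[id] = name
	delete(c.cache, key(id)) // invalidation on the write path
}

func main() {
	c := &userCache{cache: map[string]string{}, db: map[int]string{1: "Ada"}}
	fmt.Println(c.Get(1)) // miss, then cached: Ada
	c.Update(1, "Grace")  // write invalidates user:1
	fmt.Println(c.Get(1)) // fresh read: Grace
}
```

Invalidate-on-write is chosen over update-on-write here because it cannot leave a stale value behind if the write path and cache path race.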

&lt;h3&gt;
  
  
  App Layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Health checks at &lt;code&gt;/health&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;[ ] Graceful shutdown implemented&lt;/li&gt;
&lt;li&gt;[ ] Request timeouts enforced&lt;/li&gt;
&lt;li&gt;[ ] Rate limiting in Redis, not memory&lt;/li&gt;
&lt;li&gt;[ ] Background tasks moved off request path&lt;/li&gt;
&lt;li&gt;[ ] Structured logs everywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Infrastructure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Nginx or equivalent reverse proxy configured&lt;/li&gt;
&lt;li&gt;[ ] CDN in front of public traffic&lt;/li&gt;
&lt;li&gt;[ ] HTTPS enforced&lt;/li&gt;
&lt;li&gt;[ ] Compression enabled&lt;/li&gt;
&lt;li&gt;[ ] Static assets cached aggressively&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Prometheus metrics exposed&lt;/li&gt;
&lt;li&gt;[ ] Grafana dashboards for latency, errors, cache hit rate, queue depth&lt;/li&gt;
&lt;li&gt;[ ] Alerts configured before incidents happen&lt;/li&gt;
&lt;li&gt;[ ] Logs searchable centrally&lt;/li&gt;
&lt;li&gt;[ ] Replica lag visible if using replicas&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Real Scaling Mindset
&lt;/h2&gt;

&lt;p&gt;The most useful scaling principle is not “build for hyperscale.”&lt;/p&gt;

&lt;p&gt;It is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Remove the current bottleneck without introducing unnecessary permanent complexity.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is how serious systems are actually built.&lt;/p&gt;

&lt;p&gt;A good backend architecture is not the one with the most boxes in the diagram.&lt;/p&gt;

&lt;p&gt;It is the one that stays understandable while traffic grows.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;pragmatic architecture&lt;/strong&gt; wins because it keeps your team fast, your failure modes legible, and your operating costs reasonable. A &lt;strong&gt;scalable monolith&lt;/strong&gt; wins because most systems need better boundaries, better caching, and better query discipline long before they need service decomposition.&lt;/p&gt;

&lt;p&gt;If you want to reach 10M requests/day, do not start by asking whether you need Kafka.&lt;/p&gt;

&lt;p&gt;Start by asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are my queries indexed?&lt;/li&gt;
&lt;li&gt;Is my cache hit rate healthy?&lt;/li&gt;
&lt;li&gt;Am I blocking requests on background work?&lt;/li&gt;
&lt;li&gt;Can Postgres survive connection spikes?&lt;/li&gt;
&lt;li&gt;Can my system degrade gracefully under load?&lt;/li&gt;
&lt;li&gt;Can I explain every major component in one sentence?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer to those questions is yes, you are much closer than you think.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Real scale rarely demands flashy architecture first. It demands disciplined engineering first.&lt;/em&gt;&lt;/p&gt;







&lt;p&gt;If this post gave you even one useful insight, that’s a win.&lt;/p&gt;

&lt;p&gt;I focus on writing practical, no-BS content that actually helps you build better — not just consume more.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://www.kallis.in/blog" rel="noopener noreferrer"&gt;Read more blogs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to reach out or discuss anything:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://www.kallis.in/#contact" rel="noopener noreferrer"&gt;Contact me&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're stuck in any project or need help, feel free to connect — I actually respond.&lt;/p&gt;




&lt;p&gt;Follow for more real-world dev + design content 🚀&lt;/p&gt;

</description>
      <category>backend</category>
      <category>architecture</category>
      <category>node</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why One Extra Network Hop Silently Breaks Your Latency Budget in Production</title>
      <dc:creator>Ovaise Qayoom</dc:creator>
      <pubDate>Mon, 30 Mar 2026 21:45:55 +0000</pubDate>
      <link>https://dev.to/ovaiseq/why-one-extra-network-hop-silently-breaks-your-latency-budget-in-production-19ck</link>
      <guid>https://dev.to/ovaiseq/why-one-extra-network-hop-silently-breaks-your-latency-budget-in-production-19ck</guid>
      <description>

&lt;h2&gt;
  
  
  Your Latency Budget Is Lying: The Real Cost of a Single Extra Network Hop
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;That one "harmless" extra service call is quietly burning your p99. Here's the math, the failure modes, and how to fix it.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You shipped a feature. Everything looked fine in staging. The integration tests passed. The average response time in production is &lt;strong&gt;120ms&lt;/strong&gt; — well within the &lt;strong&gt;200ms&lt;/strong&gt; target your team agreed on six months ago.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Then someone checks the &lt;strong&gt;p99&lt;/strong&gt;.&lt;br&gt;
It's &lt;strong&gt;780ms&lt;/strong&gt;.&lt;br&gt;
The dashboards look fine at a glance, users aren't screaming yet, but something is clearly wrong. You start digging. You find that three weeks ago, someone added a call to a new internal service — a feature flag resolver, a permission check, a logging sidecar flush — and nobody thought much of it. "It only adds about 5ms," they said.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And they were right, at the median. But at the tail? It quietly murdered your latency budget.&lt;/p&gt;

&lt;p&gt;This is the story of how that happens, why it's almost always invisible until it isn't, and what you can actually do about it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;A single network hop looks trivial in isolation. In a distributed system, it's never just one hop.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;First, What Even Is a Latency Budget?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A latency budget is a constraint. It's the total time you have available to fulfill a request end-to-end — from the client sending the first byte to the client receiving the last byte — before the experience degrades.&lt;/p&gt;

&lt;p&gt;Your product team says "the page must load in under 200ms." That 200ms is the budget. Now you have to allocate it across every layer of your stack.&lt;/p&gt;

&lt;p&gt;A typical allocation for a server-rendered web request might look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Allocated Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DNS resolution (cached)&lt;/td&gt;
&lt;td&gt;~1ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TCP + TLS handshake (cached)&lt;/td&gt;
&lt;td&gt;~5ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network transit (round trip)&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load balancer + reverse proxy&lt;/td&gt;
&lt;td&gt;~3ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Application logic&lt;/td&gt;
&lt;td&gt;~80ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database query&lt;/td&gt;
&lt;td&gt;~40ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response serialization&lt;/td&gt;
&lt;td&gt;~10ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network return&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~179ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That gives you roughly 21ms of buffer. Sounds reasonable. But notice that this model assumes &lt;strong&gt;one&lt;/strong&gt; path through your system. In reality, modern distributed systems don't have one path. They have a graph of paths, and each path has its own tail behavior.&lt;/p&gt;

&lt;p&gt;The moment you add one more synchronous network hop — another service call, a proxy that wasn't there before, a new sidecar — you don't just add the median latency of that hop. You add its entire latency distribution. Including its p99. Including its occasional 2-second timeout spike. And those distributions don't add linearly.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Math They Don't Put in Your Architecture Diagram&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's be precise about this, because it's the core of everything.&lt;/p&gt;

&lt;p&gt;If you make a single call to a service with the following latency distribution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p50: 5ms&lt;/li&gt;
&lt;li&gt;p95: 20ms&lt;/li&gt;
&lt;li&gt;p99: 80ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...then at the &lt;strong&gt;50th percentile&lt;/strong&gt;, your caller sees 5ms. Fine.&lt;/p&gt;

&lt;p&gt;But now suppose you're calling &lt;strong&gt;five&lt;/strong&gt; services in series. Even if every one of them has the same "5ms median" profile:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The compound tail problem:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If each service independently has a 1% chance of hitting 80ms, then the probability that &lt;em&gt;at least one of them&lt;/em&gt; hits 80ms in a single request is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P(at least one slow) = 1 - P(all fast)
                     = 1 - (0.99)^5
                     = 1 - 0.951
                     = 4.9%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So your compound p95 is now being shaped by the &lt;em&gt;slowest&lt;/em&gt; of five services, not the average. What was a 1-in-100 event for each service individually becomes a nearly 1-in-20 event for the composite request.&lt;/p&gt;

&lt;p&gt;Add ten services and the math gets grimmer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P(at least one slow) = 1 - (0.99)^10 = 9.6%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your p99 just became your p90. In production, at scale, that's thousands of requests per minute hitting the tail.&lt;/p&gt;
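Both calculations generalize to a one-liner, sketched here in Go:

```go
package main

import (
	"fmt"
	"math"
)

// compoundSlowProb is the probability that at least one of n
// independent hops hits its slow path, given per-hop probability p.
func compoundSlowProb(n int, p float64) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	for _, n := range []int{1, 5, 10} {
		fmt.Printf("%2d hops: %.1f%% of requests hit at least one slow hop\n",
			n, 100*compoundSlowProb(n, 0.01))
	}
	// 1 hops: 1.0%, 5 hops: 4.9%, 10 hops: 9.6%
}
```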

&lt;p&gt;This is the phenomenon described in the classic Google paper "The Tail at Scale" — and it's been reproduced in real systems countless times since.  &lt;a href="https://research.google/pubs/the-tail-at-scale/" rel="noopener noreferrer"&gt;research&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What Actually Happens Inside a Single Extra Hop&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When you add a synchronous call to another service, here's what actually happens on the wire — most of which is invisible in your flame graphs if you're not looking:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. TCP Connection Overhead&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If the connection isn't kept alive (common in naive HTTP/1.1 setups or misconfigured HTTP/2), every call involves a TCP handshake: ~1 RTT. At a typical inter-datacenter latency of 1–5ms, that's 1–5ms before you've sent a single byte of your request.&lt;/p&gt;

&lt;p&gt;Connection pooling eliminates most of this, but only if you've set it up correctly and your pool isn't exhausted under load.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. TLS Negotiation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If the service-to-service call is over HTTPS (as it should be in a zero-trust setup), TLS adds latency. A TLS 1.3 handshake with session resumption costs roughly 0.5–2ms; without resumption, a full handshake costs 1–2 RTTs.&lt;/p&gt;

&lt;p&gt;In a service mesh like Istio with mutual TLS (mTLS), every single pod-to-pod call goes through TLS — it's automatic and transparent, which is great for security and brutal for people who thought "service mesh is free."  &lt;a href="https://foci.uw.edu/papers/socc23-meshinsight.pdf" rel="noopener noreferrer"&gt;foci.uw&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Benchmarks of Istio with Envoy sidecars have shown consistent per-hop overhead of &lt;strong&gt;1–5ms added latency&lt;/strong&gt; at the median, with p99 overheads stretching into tens of milliseconds under load, depending on payload size and connection concurrency.  &lt;a href="https://oneuptime.com/blog/post/2026-01-24-service-mesh-overhead/view" rel="noopener noreferrer"&gt;oneuptime&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Serialization and Deserialization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Your service sends a request body. JSON, Protobuf, MessagePack — doesn't matter, it costs something. JSON serialization of a medium-complexity object (10–20 fields, some nested) in Node.js or Go costs roughly 0.05–0.5ms. Across many hops at high concurrency, this adds up. More importantly, large payloads increase memory allocation, which can trigger GC pauses — and GC pauses are essentially uncapped.&lt;/p&gt;
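Measuring this for your own payloads is cheap. A rough Go sketch (the `order` shape is a made-up medium-complexity payload; absolute timings depend on your machine, so no number is promised):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// order is a medium-complexity payload: nested, a dozen-plus fields.
type order struct {
	ID       int               `json:"id"`
	UserID   int               `json:"user_id"`
	Items    []item            `json:"items"`
	Metadata map[string]string `json:"metadata"`
	Total    float64           `json:"total"`
	Currency string            `json:"currency"`
}

type item struct {
	SKU   string  `json:"sku"`
	Qty   int     `json:"qty"`
	Price float64 `json:"price"`
}

// marshalCostPerOp serializes o n times and returns the average duration.
func marshalCostPerOp(o order, n int) (time.Duration, error) {
	start := time.Now()
	for i := 0; i < n; i++ {
		if _, err := json.Marshal(o); err != nil {
			return 0, err
		}
	}
	return time.Since(start) / time.Duration(n), nil
}

func main() {
	o := order{
		ID: 1, UserID: 42, Total: 99.5, Currency: "USD",
		Items:    []item{{"A-1", 2, 19.9}, {"B-2", 1, 59.7}},
		Metadata: map[string]string{"source": "web", "region": "eu"},
	}
	perOp, _ := marshalCostPerOp(o, 10_000)
	fmt.Println("avg marshal cost:", perOp)
}
```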

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Queueing at the Receiving End&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Even if the downstream service is fast on average, under real traffic it's doing other things. Goroutines are scheduled. Thread pools have limits. Connection queues fill up. The incoming request waits.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;queueing&lt;/strong&gt; component of latency — often the largest and most volatile contributor to tail latency — and it's completely invisible to the caller. Your request could sit in a queue for 0ms at 10 RPS and 200ms at 1000 RPS, and your p50 will look fine the whole time while your p99 is on fire.&lt;/p&gt;
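A crude way to build intuition here is the M/M/1 queueing model. It is an idealization (real services are neither Poisson nor single-server), but it shows why waits explode near saturation. Assuming a service that can sustain about 1,005 req/s:

```go
package main

import "fmt"

// meanQueueWaitMs returns the M/M/1 mean time spent waiting in queue,
// in milliseconds, for arrival rate lambda and service rate mu (req/s):
// Wq = rho / (mu * (1 - rho)), where rho = lambda / mu.
func meanQueueWaitMs(lambda, mu float64) float64 {
	rho := lambda / mu // utilization
	return rho / (mu * (1 - rho)) * 1000
}

func main() {
	const mu = 1005.0 // assumed capacity: ~1005 req/s
	for _, lambda := range []float64{10, 500, 900, 1000} {
		fmt.Printf("%4.0f req/s -> mean queue wait %.2fms\n",
			lambda, meanQueueWaitMs(lambda, mu))
	}
	// at 10 req/s the wait is ~0.01ms; at 1000 req/s it is ~199ms
}
```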

&lt;h3&gt;
  
  
  &lt;strong&gt;5. The Return Trip&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;All of the above applies symmetrically on the way back: serialization of the response, TCP acknowledgment, return network latency. A "fast" synchronous RPC call to an internal service that "only" takes 3ms median has already consumed 3ms of your budget before your code has done anything with the result.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Visualizing the Compounding Effect&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's walk through a concrete example.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The scenario: an e-commerce checkout endpoint&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Your &lt;code&gt;/checkout&lt;/code&gt; endpoint has a 200ms latency budget. Here's the architecture three months ago vs. today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm1g5k9dbv196al0ra9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm1g5k9dbv196al0ra9q.png" alt="Diagram 1 — Before" width="800" height="65"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Measured latency breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network + gateway: 5ms&lt;/li&gt;
&lt;li&gt;Checkout service logic: 30ms&lt;/li&gt;
&lt;li&gt;DB query (indexed): 25ms&lt;/li&gt;
&lt;li&gt;Response serialization + return: 10ms&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total p50: ~70ms. p99: ~130ms. Budget remaining: ~70ms.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After (four new hops added over three months):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faks1segrd3ad8cjztd8s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faks1segrd3ad8cjztd8s.png" alt="Diagram 2 — After " width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's reconstruct the budget:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hop&lt;/th&gt;
&lt;th&gt;p50&lt;/th&gt;
&lt;th&gt;p99&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Network + gateway&lt;/td&gt;
&lt;td&gt;5ms&lt;/td&gt;
&lt;td&gt;10ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth service call&lt;/td&gt;
&lt;td&gt;8ms&lt;/td&gt;
&lt;td&gt;60ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature flag service&lt;/td&gt;
&lt;td&gt;4ms&lt;/td&gt;
&lt;td&gt;40ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Checkout logic&lt;/td&gt;
&lt;td&gt;30ms&lt;/td&gt;
&lt;td&gt;55ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DB query&lt;/td&gt;
&lt;td&gt;25ms&lt;/td&gt;
&lt;td&gt;70ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inventory service call&lt;/td&gt;
&lt;td&gt;10ms&lt;/td&gt;
&lt;td&gt;90ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pricing service call&lt;/td&gt;
&lt;td&gt;12ms&lt;/td&gt;
&lt;td&gt;85ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Return + serialization&lt;/td&gt;
&lt;td&gt;10ms&lt;/td&gt;
&lt;td&gt;20ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~104ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~430ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The p50 looks fine. Still well under 200ms. But the p99 has blown past the budget more than twice over — and the team didn't notice because their alerting was on average response time.&lt;/p&gt;

&lt;p&gt;This is an extremely common pattern. It's how systems that "feel fast" break under scrutiny.  &lt;a href="https://www.systemoverflow.com/learn/design-fundamentals/communication-patterns/latency-budgets-and-tail-amplification-in-multi-hop-synchronous-chains" rel="noopener noreferrer"&gt;systemoverflow&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Every hop through your data center carries overhead that compounds across the request chain.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Tail Latency: The Number That Actually Matters for Users&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Most teams instrument p50. Some instrument p95. Very few actually act on p99. This is a mistake.&lt;/p&gt;

&lt;p&gt;The p99 is the latency that 1 in 100 requests experiences. At 100 requests per second, that's one degraded request every second. At 10,000 requests per second, it's 100 every second.&lt;/p&gt;

&lt;p&gt;More critically: the p99 of your &lt;em&gt;composite&lt;/em&gt; service is almost always &lt;strong&gt;dominated by the worst single component&lt;/strong&gt; in your call chain. If you have ten services and one of them has an occasionally misbehaving garbage collector, that GC pause becomes &lt;em&gt;your&lt;/em&gt; p99 — even if the other nine services are perfectly tuned.&lt;/p&gt;

&lt;p&gt;Here's a simulation in Go that demonstrates the compound distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"math/rand"&lt;/span&gt;
    &lt;span class="s"&gt;"sort"&lt;/span&gt;
    &lt;span class="s"&gt;"time"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// simulateHopLatency returns a latency in ms for a single service hop.&lt;/span&gt;
&lt;span class="c"&gt;// Models a bimodal distribution: usually fast, occasionally slow.&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;simulateHopLatency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Rand&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0.99&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Fast path: normally distributed around 5ms&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="m"&gt;5.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NormFloat64&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;1.5&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c"&gt;// Slow path: GC pause, queue buildup, etc.&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="m"&gt;5.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;60.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NormFloat64&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;10.0&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sorted&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;100.0&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnixNano&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="n"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;numHops&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;numHops&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;numHops&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0.0&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;numHops&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;simulateHopLatency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Float64s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Hops: %d | p50: %.1fms | p95: %.1fms | p99: %.1fms&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;numHops&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;95&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;99&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this produces roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hops: 1 | p50: 5.0ms  | p95: 8.1ms  | p99: 64.3ms
Hops: 2 | p50: 10.0ms | p95: 69.2ms | p99: 124.8ms
Hops: 3 | p50: 15.0ms | p95: 128.4ms| p99: 185.0ms
Hops: 4 | p50: 20.0ms | p95: 134.1ms| p99: 246.1ms
Hops: 5 | p50: 25.1ms | p95: 193.8ms| p99: 317.2ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what happened: from 1 hop to 2 hops, the p95 jumped from 8ms to 69ms. Not because the services got slower, but because the &lt;em&gt;probability of hitting at least one slow response&lt;/em&gt; roughly doubled: if each hop has a 5% chance of landing in its tail, a two-hop chain hits at least one slow hop with probability 1 − 0.95² ≈ 9.75%, which drags the slow case inside the 95th percentile. This is tail amplification, and it's the reason p50 monitoring is effectively useless for latency budget tracking.  &lt;a href="https://aerospike.com/blog/what-is-p99-latency/" rel="noopener noreferrer"&gt;aerospike&lt;/a&gt;&lt;/p&gt;
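&lt;p&gt;The same point falls out of basic probability, without any simulation. A minimal sketch (the helper name &lt;code&gt;pAnySlow&lt;/code&gt; is introduced here for illustration):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// pAnySlow returns the probability that at least one of n independent
// hops lands in its per-hop tail, where p is the per-hop tail probability.
func pAnySlow(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	// The region beyond each hop's p95 is a 5% tail per hop.
	for n := 1; n <= 5; n++ {
		fmt.Printf("hops=%d  P(at least one slow hop)=%.1f%%\n", n, 100*pAnySlow(0.05, n))
	}
}
```

&lt;p&gt;At two hops the chance of at least one slow response is already about 9.8%, which is exactly why the slow case shows up inside the p95.&lt;/p&gt;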




&lt;h2&gt;
  
  
  &lt;strong&gt;The Invisible Hops You Forget to Count&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's the thing: engineers are usually aware of the obvious hops — the service calls they wrote. What they miss are the silent ones:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service mesh sidecars.&lt;/strong&gt; In Istio or Linkerd, every outbound and inbound request passes through an Envoy/Linkerd proxy. That's two extra network hops per RPC call. The proxy has its own CPU overhead, memory allocation, and queue. At high RPS, this isn't free. Benchmarks show Istio adding 1–5ms to median latency, with meaningfully worse tail behavior under load.  &lt;a href="https://foci.uw.edu/papers/socc23-meshinsight.pdf" rel="noopener noreferrer"&gt;foci.uw&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature flag SDKs calling home.&lt;/strong&gt; Some feature flag systems are backed by an SDK that does a remote HTTP call to resolve flags per request. If your flag SDK is calling out to a remote service on every checkout request, that's a hop you probably forgot to count. It's especially painful because flag evaluation &lt;em&gt;feels&lt;/em&gt; like it should be pure local logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auth middleware calling an external service.&lt;/strong&gt; JWT validation is local and fast. But if your auth middleware is calling a user service or an OAuth introspection endpoint to validate tokens &lt;em&gt;per request&lt;/em&gt;, you've added a hop that's invisible in your app code but very visible in your latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Centralized rate limiters.&lt;/strong&gt; Redis-backed rate limiters are common and reasonable. But a call to Redis over the network on every request adds 0.5–3ms depending on co-location, even when it's just an &lt;code&gt;INCR&lt;/code&gt;. At high traffic, Redis also becomes a hot node, and its tail latency degrades.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distributed tracing agents.&lt;/strong&gt; Most tracing SDKs are async and non-blocking. Some aren't, or have internal queues that fill up under load and start blocking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load balancers in front of load balancers.&lt;/strong&gt; Cloud-managed load balancers in front of ingress controllers in front of service mesh proxies in front of your app. That's three layers before your code runs.&lt;/p&gt;

&lt;p&gt;None of these hops appear in your architecture diagram. All of them show up in your flame graphs.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Queueing Theory, Very Briefly&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You don't need a PhD in queueing theory to understand why adding hops is dangerous. You just need one intuition from &lt;strong&gt;Little's Law&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;L = λW&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;L&lt;/em&gt; = average number of requests in the system&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;λ&lt;/em&gt; = arrival rate&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;W&lt;/em&gt; = average time a request spends in the system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As &lt;em&gt;W&lt;/em&gt; (the latency per request) increases due to extra hops, &lt;em&gt;L&lt;/em&gt; (the number of requests in flight) grows proportionally. Little's Law alone doesn't cause trouble; finite capacity does. Servers have a fixed number of workers, connections, and cores, so a larger &lt;em&gt;L&lt;/em&gt; means more contention, more contention means queueing delay, and queueing delay increases &lt;em&gt;W&lt;/em&gt;, which increases &lt;em&gt;L&lt;/em&gt; again. This feedback loop is what turns a "5ms extra hop" into "500ms occasional spikes": the system tips past its natural equilibrium.&lt;/p&gt;

&lt;p&gt;The practical implication: &lt;strong&gt;every hop you add reduces your headroom before the system becomes queue-bound under load.&lt;/strong&gt;&lt;/p&gt;
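&lt;p&gt;To make that concrete with assumed numbers (1,000 requests per second, and a new 30ms synchronous hop), the law is one multiplication:&lt;/p&gt;

```go
package main

import "fmt"

// inFlight applies Little's Law, L = λW: average concurrent requests
// equals arrival rate (req/sec) times time spent in the system (seconds).
func inFlight(lambda, w float64) float64 {
	return lambda * w
}

func main() {
	lambda := 1000.0 // requests per second
	fmt.Printf("W=50ms: L=%.0f requests in flight\n", inFlight(lambda, 0.050))
	// The same traffic after adding one extra 30ms synchronous hop:
	fmt.Printf("W=80ms: L=%.0f requests in flight\n", inFlight(lambda, 0.080))
}
```

&lt;p&gt;At 1,000 req/s, that one extra hop means 30 more requests in flight at every moment. If your worker pool or connection pool was sized near the old steady state, those 30 requests are the headroom the hop just consumed.&lt;/p&gt;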




&lt;h2&gt;
  
  
  &lt;strong&gt;How to Actually Measure Your Latency Budget&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Knowing the theory is one thing. Measuring it in production is where most teams fail. Here's how to do it properly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Trace every request end-to-end with OpenTelemetry&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Distributed tracing is the single most important tool for latency budget tracking. If you're not already using OpenTelemetry, this is the baseline.&lt;/p&gt;

&lt;p&gt;A basic setup in Node.js:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// tracing.js — initialize before anything else&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NodeSDK&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/sdk-node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OTLPTraceExporter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/exporter-trace-otlp-http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;HttpInstrumentation&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/instrumentation-http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ExpressInstrumentation&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/instrumentation-express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeSDK&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;traceExporter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OTLPTraceExporter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OTEL_EXPORTER_ENDPOINT&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:4318/v1/traces&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;instrumentations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpInstrumentation&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ExpressInstrumentation&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have traces flowing into Jaeger, Tempo, or Honeycomb, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;See the &lt;strong&gt;waterfall diagram&lt;/strong&gt; for every request&lt;/li&gt;
&lt;li&gt;Identify which span is consuming the most time&lt;/li&gt;
&lt;li&gt;Filter by p99 requests specifically (filter by &lt;code&gt;duration &amp;gt; 400ms&lt;/code&gt;) and see what's different about them&lt;/li&gt;
&lt;li&gt;Compare span durations across percentiles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key metric to extract from your traces: &lt;strong&gt;span duration by percentile, per service.&lt;/strong&gt; Not aggregate. Per service. That's how you find the outlier.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Calculate your budget utilization per span&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Most teams look at total response time. What you want is a &lt;strong&gt;budget utilization view&lt;/strong&gt; — a percentage of the budget consumed at each hop.&lt;/p&gt;

&lt;p&gt;This is trivially expressible as a Prometheus query if you're using span metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fraction of total budget consumed by each service
histogram_quantile(0.99,
  sum(rate(http_server_duration_bucket{service_name="inventory-service"}[5m])) by (le)
)
/ 0.200  # divided by your 200ms budget
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this query returns 0.45 for inventory-service, that single service is consuming 45% of your budget at p99. You now have a number to act on.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Measure, don't estimate, the overhead of infrastructure layers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before profiling your application code, measure the bare overhead of your infrastructure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add a &lt;code&gt;/health&lt;/code&gt; endpoint to your service that does nothing except return 200&lt;/li&gt;
&lt;li&gt;Measure its latency from another pod in the same cluster&lt;/li&gt;
&lt;li&gt;That number is your infrastructure floor: it includes DNS, proxy overhead, TLS, and serialization&lt;/li&gt;
&lt;li&gt;That floor is the fixed cost every request pays before your code runs; any latency above it comes from your application and needs a reason&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a well-tuned Kubernetes cluster without a service mesh, this baseline is typically 0.5–2ms. With Istio mTLS, it's typically 2–8ms, sometimes higher.  &lt;a href="https://oneuptime.com/blog/post/2026-01-24-service-mesh-overhead/view" rel="noopener noreferrer"&gt;oneuptime&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Mistakes Teams Make (That Kill Their Latency Budget)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These are the patterns I see repeatedly in real systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Alerting on p50 instead of p99&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Average and median latency look good right up until your on-call engineer gets paged by an angry stakeholder. Alert on p95 &lt;em&gt;and&lt;/em&gt; p99. The p50 is almost useless for user-facing latency SLOs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Adding hops without counting them&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Every architectural decision that adds a synchronous network call should be an explicit tradeoff discussion: "This adds approximately Xms to our median latency and introduces Y% tail risk." That conversation almost never happens because teams think about correctness, not latency topology.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Treating timeouts as a safety net, not a budget item&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A timeout of 500ms on a downstream call is not "safe." If that downstream service is called on every request and occasionally hits its 500ms timeout, your &lt;em&gt;caller&lt;/em&gt; will block for 500ms before getting an error and returning a degraded response. Timeouts are not a performance feature. They're a correctness feature. Tune them aggressively.&lt;/p&gt;

&lt;p&gt;The right mental model: &lt;strong&gt;your timeout is the maximum you're willing to spend on that hop.&lt;/strong&gt; It should be a fraction of your total budget, not a failsafe.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Ignoring retry amplification&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Retries with no budget awareness are latency multipliers. If service A times out calling service B and retries twice, a single user request has now made three calls to service B. Under load, this turns transient slowness into a cascading failure. Always budget for retries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;effective_timeout = (retry_count + 1) * per_attempt_timeout + (retry_count * retry_delay)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have 3 retries, a 100ms per-attempt timeout, and a 50ms retry delay, a single user request can block for up to (3 + 1) × 100ms + 3 × 50ms = 550ms on that one hop. That's more than your entire budget, gone, on error handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Not accounting for fan-out in parallel calls&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Parallel service calls look free on a timeline diagram. They're not. The total latency of N parallel calls is &lt;code&gt;max(L1, L2, ..., LN)&lt;/code&gt;: the slowest one. And as N grows, the probability that at least one of them hits its p99 grows fast. For 8 independent services it's 1 − 0.99⁸ ≈ 7.7% per request, which means the fan-out's p92 is already as bad as a single service's p99.  &lt;a href="https://www.systemoverflow.com/learn/design-fundamentals/communication-patterns/latency-budgets-and-tail-amplification-in-multi-hop-synchronous-chains" rel="noopener noreferrer"&gt;systemoverflow&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Trusting that the service mesh is zero-cost&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Istio and Linkerd are excellent tools. They are not zero-cost. Benchmark them. Measure the overhead in your specific workload. The overhead depends heavily on payload size, connection concurrency, and CPU availability on the sidecar. At high RPS with large payloads, the overhead is significant.  &lt;a href="https://foci.uw.edu/papers/socc23-meshinsight.pdf" rel="noopener noreferrer"&gt;foci.uw&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Latency compounds invisibly across layers. Observability is the only way to see the full picture.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;How to Reduce Latency and Reclaim Your Budget&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once you've measured the problem, here's how to actually fix it.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Eliminate unnecessary synchronous hops entirely&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is the most impactful change and the hardest to get approved. Ask for every synchronous service call in your hot path: "Does this &lt;em&gt;need&lt;/em&gt; to happen before I return a response?"&lt;/p&gt;

&lt;p&gt;Feature flag resolution: cache flags locally and refresh asynchronously. Don't call a remote service on every request.&lt;/p&gt;

&lt;p&gt;Auth token validation: validate JWTs locally with a public key. Don't introspect them via HTTP.&lt;/p&gt;

&lt;p&gt;Audit logging: write to a local queue and flush asynchronously. The audit log doesn't need to be consistent before the user gets their response.&lt;/p&gt;

&lt;p&gt;Each hop you remove doesn't just save its own latency. It removes its entire tail distribution from your compound calculation.&lt;/p&gt;
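&lt;p&gt;The audit-logging pattern above can be sketched with a bounded in-memory queue (the names are illustrative, and the flush target here is just stdout standing in for a real audit store):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// AuditLogger buffers events in memory and flushes them off the request path.
type AuditLogger struct {
	ch chan string
	wg sync.WaitGroup
}

func NewAuditLogger(buf int) *AuditLogger {
	l := &AuditLogger{ch: make(chan string, buf)}
	l.wg.Add(1)
	go func() {
		defer l.wg.Done()
		for ev := range l.ch {
			// In production: batch these and write to your audit store.
			fmt.Println("audit:", ev)
		}
	}()
	return l
}

// Log never blocks the request path: if the buffer is full, the event is
// dropped (or you could fall back to a synchronous write, depending on
// your durability requirements).
func (l *AuditLogger) Log(event string) bool {
	select {
	case l.ch <- event:
		return true
	default:
		return false
	}
}

// Close drains the buffer and stops the flusher.
func (l *AuditLogger) Close() {
	close(l.ch)
	l.wg.Wait()
}

func main() {
	audit := NewAuditLogger(1024)
	audit.Log("user 42 viewed order 7") // returns immediately; no network hop
	audit.Close()
}
```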

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Move from HTTP to faster transports where it matters&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;HTTP/1.1 → HTTP/2 for multiplexing. HTTP/2 → gRPC with connection reuse for internal service calls. gRPC with Protobuf serialization typically cuts serialization overhead by 3–10x compared to JSON, and connection multiplexing eliminates most connection-establishment overhead. This won't save you from architectural problems, but in a path where every millisecond counts, it's worth it.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Parallelize what can be parallelized, but with a real fan-out budget&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If your request genuinely needs data from multiple services, call them in parallel. But bound the fan-out. If you're calling 8 services in parallel, set a hedged timeout — not "wait for all of them," but "wait until 95% respond and use degraded data for the rest." This is called &lt;strong&gt;partial responses&lt;/strong&gt; or &lt;strong&gt;timeout hedging&lt;/strong&gt;, and it's a powerful pattern for high-availability systems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Example: parallel fetch with timeout and partial result tolerance&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;fetchWithTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;ServiceClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budgetMs&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;budgetMs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Millisecond&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;svc&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="n"&gt;ServiceClient&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Degraded&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="c"&gt;// Use fallback&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;
        &lt;span class="p"&gt;}(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;4. Cache aggressively and correctly&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Not "add Redis in front of everything," but cache at the right layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In-process cache&lt;/strong&gt; for data that rarely changes: feature flags, configuration, rate limit thresholds. This eliminates the hop entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed cache&lt;/strong&gt; (Redis, Memcached) for data that changes moderately and is expensive to recompute. But remember: a Redis call is still a network hop. Measure it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDN or edge caching&lt;/strong&gt; for responses that are fully cacheable. The fastest hop is the one that never reaches your origin.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Tune your connection pools aggressively&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Connection pool exhaustion is one of the most common causes of sudden latency spikes in production. When a pool is exhausted, new requests queue waiting for a connection — and that queueing can spike your p99 into seconds even when the underlying service is healthy.&lt;/p&gt;

&lt;p&gt;For every downstream HTTP client in your system, explicitly configure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum connections&lt;/li&gt;
&lt;li&gt;Connection timeout (how long to wait for a connection from the pool)&lt;/li&gt;
&lt;li&gt;Request timeout (how long to wait for a response)&lt;/li&gt;
&lt;li&gt;Idle timeout (how long to keep an unused connection alive)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most HTTP client libraries default to conservative settings that are badly mismatched for high-throughput internal service calls.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;6. Profile your serialization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Particularly in JVM-based and Node.js services, JSON serialization of large objects is surprisingly expensive. If you're serializing the same data structure on every request, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-computing and caching the serialized form&lt;/li&gt;
&lt;li&gt;Switching to Protobuf or MessagePack for internal APIs&lt;/li&gt;
&lt;li&gt;Trimming your response payloads — only send what the caller actually uses&lt;/li&gt;
&lt;/ul&gt;
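&lt;p&gt;The first option, pre-computing the serialized form, can be as simple as caching the marshaled bytes (a sketch; &lt;code&gt;serializedConfig&lt;/code&gt; is a name introduced here):&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// serializedConfig caches the JSON form of a rarely-changing payload, so
// per-request handlers write pre-marshaled bytes instead of re-encoding.
type serializedConfig struct {
	mu  sync.RWMutex
	raw []byte
}

// Update re-marshals once, on change, off the hot path.
func (s *serializedConfig) Update(v any) error {
	b, err := json.Marshal(v)
	if err != nil {
		return err
	}
	s.mu.Lock()
	s.raw = b
	s.mu.Unlock()
	return nil
}

// Bytes returns the cached serialized form for handlers to write directly.
func (s *serializedConfig) Bytes() []byte {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.raw
}

func main() {
	var cfg serializedConfig
	if err := cfg.Update(map[string]any{"region": "us-east-1", "maxItems": 50}); err != nil {
		panic(err)
	}
	fmt.Println(string(cfg.Bytes())) // handlers just copy these bytes
}
```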




&lt;h2&gt;
  
  
  &lt;strong&gt;The Architectural Checklist&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before you ship any change that adds a new service call to a latency-sensitive path, run this checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Measured the baseline latency&lt;/strong&gt; of the new dependency (p50, p95, p99) in production or under realistic load&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Calculated the new compound p99&lt;/strong&gt; for the full request chain after adding this hop&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Verified the new p99 is within the latency budget&lt;/strong&gt; with margin for growth&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Considered async alternatives&lt;/strong&gt;: can this happen outside the request path?&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Set an explicit timeout&lt;/strong&gt; on the call — not a default, a deliberate number based on the budget allocation for this hop&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Defined a fallback&lt;/strong&gt; for when this call fails or times out — degraded response, cached result, default value&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Added tracing instrumentation&lt;/strong&gt; so this hop appears in distributed traces&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Added latency alerting&lt;/strong&gt; on this specific service-to-service call at p99&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Reviewed retry policy&lt;/strong&gt; — retries are multiplied against the timeout; have you budgeted for them?&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Checked connection pool settings&lt;/strong&gt; — are they tuned for the expected concurrency of this call?&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Reviewed if TLS/mTLS overhead&lt;/strong&gt; has been measured and accounted for in the budget&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of these items can't be answered confidently, the PR should not merge into a latency-sensitive path without an explicit team discussion.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;A Real Latency Budget Calculation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's close the loop with a worked example you can adapt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System:&lt;/strong&gt; A mobile app backend. The product requirement is 150ms end-to-end response at p95 for the home feed endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget allocation:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Budget (p95)&lt;/th&gt;
&lt;th&gt;Owner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DNS + TCP + TLS (mobile to CDN edge)&lt;/td&gt;
&lt;td&gt;10ms&lt;/td&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CDN to origin gateway&lt;/td&gt;
&lt;td&gt;5ms&lt;/td&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gateway + auth (JWT local validation)&lt;/td&gt;
&lt;td&gt;5ms&lt;/td&gt;
&lt;td&gt;Platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature flag resolution (local cache)&lt;/td&gt;
&lt;td&gt;1ms&lt;/td&gt;
&lt;td&gt;Platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feed service business logic&lt;/td&gt;
&lt;td&gt;30ms&lt;/td&gt;
&lt;td&gt;App team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Primary DB query (indexed read)&lt;/td&gt;
&lt;td&gt;25ms&lt;/td&gt;
&lt;td&gt;App team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recommendations service call&lt;/td&gt;
&lt;td&gt;35ms&lt;/td&gt;
&lt;td&gt;ML team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response serialization + compression&lt;/td&gt;
&lt;td&gt;8ms&lt;/td&gt;
&lt;td&gt;App team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Return path network&lt;/td&gt;
&lt;td&gt;10ms&lt;/td&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total allocated&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;129ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Remaining headroom&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;21ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This leaves 21ms of headroom before hitting the 150ms SLO. Now someone proposes adding a "personalization boost" service call. Its measured p95 is 18ms.&lt;/p&gt;

&lt;p&gt;If you add it synchronously to the hot path, your headroom drops to 3ms. Any slight increase in traffic, any GC event in any service, any network hiccup — and you're over budget. The right conversation is: "Can this call happen asynchronously? Can we pre-compute and cache the result? Does it need to be in the hot path?" Often the answer is no, it doesn't.&lt;/p&gt;
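&lt;p&gt;The arithmetic behind that conversation is simple enough to keep in a script (a sketch; the ledger mirrors the table above):&lt;/p&gt;

```go
package main

import "fmt"

// headroom sums a budget ledger (ms per component) and reports
// what remains of the end-to-end SLO.
func headroom(sloMs float64, allocations map[string]float64) float64 {
	total := 0.0
	for _, ms := range allocations {
		total += ms
	}
	return sloMs - total
}

func main() {
	ledger := map[string]float64{
		"dns+tcp+tls": 10, "cdn-to-origin": 5, "gateway+auth": 5,
		"feature flags": 1, "feed logic": 30, "db query": 25,
		"recommendations": 35, "serialization": 8, "return path": 10,
	}
	left := headroom(150, ledger)
	fmt.Printf("headroom: %.0fms\n", left)                       // prints 21ms
	fmt.Printf("after adding the 18ms boost call: %.0fms\n", left-18) // prints 3ms
}
```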

&lt;p&gt;This is how you defend your latency budget: with numbers, not intuition.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;More Engineering Reads&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If this kind of production-depth engineering writing is useful to you, more of it lives at &lt;strong&gt;&lt;a href="https://www.kallis.in" rel="noopener noreferrer"&gt;kallis.in&lt;/a&gt;&lt;/strong&gt; — a growing collection of engineering content covering system design, architecture, observability, and real-world development patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Takeaway&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The problem with latency budgets isn't that engineers don't care about them. It's that the damage is cumulative, invisible at the median, and always attributed to "the system getting more complex" rather than the specific architectural decisions that caused it.&lt;/p&gt;

&lt;p&gt;One extra hop is never just 5ms. It's 5ms at p50, and it's the entire tail distribution of that service — including its worst-day behavior — injected into every request that goes through it. Multiply that across five services added over six months, and you've turned a snappy product into something that users feel is "kinda slow sometimes."&lt;/p&gt;

&lt;p&gt;The tools to fight this aren't exotic. Distributed tracing, explicit budget allocation, p99 alerting, aggressive timeout tuning, and a cultural habit of treating every new synchronous hop as a cost that needs justification. That's it.&lt;/p&gt;

&lt;p&gt;Your architecture diagram shows boxes and arrows. Your users experience latency distributions. Make sure someone on your team is closing the gap between those two views — before your p99 starts closing it for you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;#performance #distributedsystems #systemdesign #backend #architecture #microservices&lt;/em&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>backend</category>
      <category>distributedsystems</category>
      <category>performance</category>
    </item>
    <item>
<title>Building Reliable Agents with the Transactional Outbox Pattern and Redis Streams</title>
      <dc:creator>Ovaise Qayoom</dc:creator>
      <pubDate>Sun, 29 Mar 2026 21:46:23 +0000</pubDate>
      <link>https://dev.to/ovaiseq/building-reliable-agents-with-the-transactional-outbox-pattern-and-redis-streams-5c80</link>
      <guid>https://dev.to/ovaiseq/building-reliable-agents-with-the-transactional-outbox-pattern-and-redis-streams-5c80</guid>
      <description>&lt;h2&gt;
  
  
  When Your AI Agent Makes the Right Call But Your System Doesn't Follow Through
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;You built an agent that works. The model is sharp, the decisions are correct — and yet, customers are still getting burned. Here's why, and how to fix it for good.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;There's a moment every developer building AI agents eventually hits. The demo goes well. The model makes the right call. Everyone's impressed. Then you ship it, and three days later, a customer is furious because the refund the agent approved never actually happened.&lt;/p&gt;

&lt;p&gt;The agent didn't fail. The &lt;em&gt;system&lt;/em&gt; did.&lt;/p&gt;

&lt;p&gt;This is the problem nobody talks about when they're showing off agentic workflows: the gap between a decision being made and that decision being &lt;em&gt;trusted&lt;/em&gt; by the rest of your platform. That gap — the handoff — is where reliability goes to die.&lt;/p&gt;

&lt;p&gt;In this post, I want to talk about a pattern that solves this elegantly: the &lt;strong&gt;Transactional Outbox Pattern&lt;/strong&gt;, paired with &lt;strong&gt;Redis Streams&lt;/strong&gt;. It's not new. It's been a quiet workhorse in microservices architecture for years. But it's exactly the kind of infrastructure thinking that agentic systems desperately need right now.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I also write about patterns like this on my engineering blog at &lt;a href="https://www.kallis.in" rel="noopener noreferrer"&gt;kallis.in&lt;/a&gt; — feel free to check it out if this kind of systems design content interests you.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Real Problem: The Handoff
&lt;/h2&gt;

&lt;p&gt;Picture this: your customer support agent reads a conversation, applies the refund policy, and returns &lt;code&gt;"approve the refund"&lt;/code&gt;. Your service then does two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Updates the support case record to &lt;code&gt;refund_approved&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Publishes an event to billing so the money actually moves&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Seems simple. Until your process crashes between step 1 and step 2.&lt;/p&gt;

&lt;p&gt;Now your case record says "refund approved." Billing never got the event. The customer waits, gets nothing, calls support, and that conversation becomes a manual investigation. The worst part? &lt;strong&gt;If you check the database, everything looks fine.&lt;/strong&gt; The bug is invisible until someone notices the money never moved.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Agent Decision] → [Update Case ✅] → 💥 crash → [Publish Event ❌]

![Diagram 1](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i5r1dh918ol15s5t41aq.webp)
Result: Case says "approved". Customer gets nothing.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't an AI problem. It's a &lt;strong&gt;handoff problem&lt;/strong&gt;. And it was around long before LLMs were in the picture.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why "Just Retry" Doesn't Save You
&lt;/h2&gt;

&lt;p&gt;The instinctive fix most developers reach for is retries. "If the publish fails, retry it." Reasonable — until you think about &lt;em&gt;where&lt;/em&gt; the failure happens.&lt;/p&gt;

&lt;p&gt;Retries only help if your application &lt;strong&gt;still knows it has something to retry&lt;/strong&gt;. If the process crashes after the state update but before the event is written anywhere, there's nothing left to retry. The knowledge of "I need to publish this event" died with the process.&lt;/p&gt;

&lt;p&gt;This is the core distinction:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;What it solves&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retries&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Delivering an event that &lt;em&gt;already exists&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transactional Outbox&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ensuring the event &lt;em&gt;exists in the first place&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Once you see it that way, the Outbox pattern stops feeling like ceremony and starts feeling like basic correctness.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Transactional Outbox Pattern Actually Does
&lt;/h2&gt;

&lt;p&gt;The idea is straightforward: &lt;strong&gt;when your business state changes, write an event record in the same atomic operation.&lt;/strong&gt; Don't publish the event directly to a message broker. Write it to an outbox table (or stream) that lives alongside your data, in the same commit.&lt;/p&gt;

&lt;p&gt;A dedicated relay process then reads from the outbox and delivers events to downstream systems. If delivery fails, the event is still in the outbox. Nothing is lost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before (fragile):
  1. UPDATE case SET status = 'refund_approved'
  2. PUBLISH RefundApproved to billing  ← can fail silently

After (outbox pattern):
  1. ATOMICALLY:
       UPDATE case SET status = 'refund_approved'
       INSERT INTO outbox (event_type, payload) VALUES ('RefundApproved', {...})
  2. Relay picks up outbox row → delivers to billing (retryable, durable)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The request path now only has &lt;strong&gt;one job&lt;/strong&gt;: commit the decision and the event together. Everything after that is recoverable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Redis Streams Fit This Beautifully
&lt;/h2&gt;

&lt;p&gt;The Transactional Outbox pattern is most commonly associated with Kafka + Debezium. That stack is powerful, but it's also heavy. You're managing Kafka brokers, ZooKeeper (or KRaft), Debezium connectors, schema registries — and you haven't even started on your actual application yet.&lt;/p&gt;

&lt;p&gt;Redis Streams offer a much lighter path that preserves the semantics you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Append-only log&lt;/strong&gt; — events are written in order and stay there&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer groups&lt;/strong&gt; — billing, notifications, and CRM sync can each consume independently, at their own pace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pending entry tracking&lt;/strong&gt; — Redis knows which messages have been delivered but not yet acknowledged&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in recovery&lt;/strong&gt; — unacknowledged messages stay in the pending list and can be reclaimed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the biggest win isn't any of those features individually. It's this: &lt;strong&gt;if your application state also lives in Redis, the case update and the outbox append can share a single &lt;code&gt;MULTI/EXEC&lt;/code&gt; transaction.&lt;/strong&gt; One atomic commit. No dual-write problem.&lt;/p&gt;

&lt;p&gt;With Kafka, you're coordinating two separate distributed systems. With Redis Streams, it's one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Here's how the pieces fit together for a customer support agent scenario:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State lives in a Redis Hash:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;support:&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;tenant-acme&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;:case:case&lt;/span&gt;&lt;span class="mi"&gt;-123&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;status:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"refund_approved"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outbox lives in a Redis Stream:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;support:&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;tenant-acme&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;:outbox&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RefundApproved&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hash tag &lt;code&gt;{tenant-acme}&lt;/code&gt; is doing important work here. In a clustered Redis setup, keys with the same hash tag are guaranteed to land in the same slot — which is what makes them eligible for the same transaction. Miss this and your &lt;code&gt;MULTI/EXEC&lt;/code&gt; will fail in production in ways that are maddening to debug.&lt;/p&gt;

&lt;p&gt;From there, downstream consumer groups each process the stream independently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                          ┌─► billing-cg      → issues refund
outbox stream ────────────┼─► notifications-cg → emails customer  
                          └─► crm-sync-cg     → updates CRM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each group moves at its own pace. If billing is slow, notifications aren't blocked. If CRM sync has a bug, billing keeps working.&lt;/p&gt;
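&lt;p&gt;Setting this up is one command per group. A sketch with redis-cli, using the stream and group names from the diagram (&lt;code&gt;MKSTREAM&lt;/code&gt; creates the stream if it doesn't exist yet, and the &lt;code&gt;0&lt;/code&gt; start ID gives each group the full history rather than only new entries):&lt;/p&gt;

```
XGROUP CREATE support:{tenant-acme}:outbox billing-cg       0 MKSTREAM
XGROUP CREATE support:{tenant-acme}:outbox notifications-cg 0 MKSTREAM
XGROUP CREATE support:{tenant-acme}:outbox crm-sync-cg      0 MKSTREAM
```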

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuoijw6imx7x139yu2l3o.jpg" alt="Diagram 2" width="800" height="450"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs Worth Thinking Through
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Source of truth colocation
&lt;/h3&gt;

&lt;p&gt;The pattern is strongest when your business state and outbox live in the same datastore. If your case data is in Postgres and your outbox is in Redis, you're back in dual-write territory with extra steps. Colocate them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-tenant streams over global streams
&lt;/h3&gt;

&lt;p&gt;A single global outbox stream sounds tempting but becomes a pain in clustered Redis. Per-tenant streams keep related events together, enable better ordering guarantees, and make incident investigation dramatically easier.&lt;/p&gt;

&lt;h3&gt;
  
  
  Idempotency is non-negotiable
&lt;/h3&gt;

&lt;p&gt;The outbox makes the handoff durable, but it doesn't make effects exactly-once. If a worker crashes after processing but before acknowledging, another worker will retry the same event. Your downstream handlers &lt;em&gt;must&lt;/em&gt; be safe to run more than once. Treat stream entries as immutable facts, not mutable instructions.&lt;/p&gt;
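&lt;p&gt;One concrete way to get there is to key every side effect on the event's &lt;code&gt;event_id&lt;/code&gt; and record the IDs you've already acted on. A minimal sketch, using an in-memory set as a stand-in for something durable (in production this check would live in Redis via &lt;code&gt;SET ... NX&lt;/code&gt; or in a unique constraint in the billing database); &lt;code&gt;issueRefund&lt;/code&gt; here is a placeholder, not the real gateway call:&lt;/p&gt;

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Idempotent handler sketch: the same stream entry can be delivered
// twice (crash after processing, before XACK), but the refund is
// issued at most once per event_id.
public class IdempotentRefundHandler {
    // Stand-in for a durable dedup store. In-memory only works for a
    // single process; real systems need Redis SET ... NX or a DB
    // unique constraint so the check survives restarts.
    private final Set<String> processedEventIds = new HashSet<>();
    private int refundsIssued = 0;

    public void handle(Map<String, String> event) {
        String eventId = event.get("event_id");
        if (!processedEventIds.add(eventId)) {
            return; // duplicate delivery: already handled, safe to ack again
        }
        issueRefund(event.get("refund_id"));
    }

    private void issueRefund(String refundId) {
        refundsIssued++; // placeholder for the real billing call
    }

    public int refundsIssued() {
        return refundsIssued;
    }

    public static void main(String[] args) {
        IdempotentRefundHandler handler = new IdempotentRefundHandler();
        Map<String, String> event = Map.of("event_id", "evt-1", "refund_id", "r-42");
        handler.handle(event);
        handler.handle(event); // redelivery of the same entry
        System.out.println("refunds issued: " + handler.refundsIssued()); // 1
    }
}
```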

&lt;h3&gt;
  
  
  Retention needs a policy before go-live
&lt;/h3&gt;

&lt;p&gt;An outbox is a log, and logs grow. Trim too aggressively and you lose your replay window. Never trim and you have a slow-growing operational problem. Set a &lt;code&gt;MAXLEN&lt;/code&gt; policy before you ship and revisit it regularly.&lt;/p&gt;
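&lt;p&gt;Capping the stream is a one-line change at write time, plus an optional periodic trim. With redis-cli (the &lt;code&gt;~&lt;/code&gt; makes trimming approximate, which is far cheaper; the one-million-entry cap and the field values are illustrative):&lt;/p&gt;

```
XADD support:{tenant-acme}:outbox MAXLEN ~ 1000000 * event_type RefundApproved event_id evt-123
XTRIM support:{tenant-acme}:outbox MAXLEN ~ 1000000
```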

&lt;h3&gt;
  
  
  Redis is now part of your correctness model
&lt;/h3&gt;

&lt;p&gt;If the outbox carries refunds and escalations, Redis isn't a cache anymore. It's part of your durability story. That means thinking seriously about replication, AOF persistence, failover, and what happens during a Redis primary failure.&lt;/p&gt;
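&lt;p&gt;Concretely, that means the relevant &lt;code&gt;redis.conf&lt;/code&gt; settings stop being optional. A starting point, not a tuned recommendation:&lt;/p&gt;

```
# Persist the outbox: append-only file, fsynced every second
# (bounds data loss to roughly one second of events on a crash)
appendonly yes
appendfsync everysec

# Refuse writes unless at least one replica is connected
# and lagging by no more than 10 seconds
min-replicas-to-write 1
min-replicas-max-lag 10
```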




&lt;h2&gt;
  
  
  Let's Look at the Code
&lt;/h2&gt;

&lt;p&gt;Here's how this looks in practice using Java and Jedis. The same concepts translate cleanly to any Redis client.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;SupportKeys&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;caseKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;outboxKey&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;SupportKeys&lt;/span&gt; &lt;span class="nf"&gt;forCase&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;tenantId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;caseId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;hashTag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"{"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tenantId&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"}"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;SupportKeys&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"support:"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;hashTag&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;":case:"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;caseId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"support:"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;hashTag&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;":outbox"&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hash tag ensures both keys land in the same Redis cluster slot — making them transactable together.&lt;/p&gt;

&lt;h3&gt;
  
  
  The atomic write — the heart of the pattern
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;RefundCommitted&lt;/span&gt; &lt;span class="nf"&gt;approveRefund&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RefundDecision&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;SupportKeys&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SupportKeys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;forCase&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;tenantId&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;caseId&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

    &lt;span class="c1"&gt;// Case state update&lt;/span&gt;
    &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;caseFields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LinkedHashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
    &lt;span class="n"&gt;caseFields&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"refund_approved"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;caseFields&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"updated_at"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;decidedAt&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="c1"&gt;// ... other fields&lt;/span&gt;

    &lt;span class="c1"&gt;// Outbox event&lt;/span&gt;
    &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;outboxFields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LinkedHashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
    &lt;span class="n"&gt;outboxFields&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"event_type"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"RefundApproved"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;outboxFields&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"event_id"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;eventId&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;outboxFields&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"refund_id"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;refundId&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="c1"&gt;// ... other fields&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AbstractTransaction&lt;/span&gt; &lt;span class="n"&gt;redisTx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jedis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;multi&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;redisTx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;hset&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;caseKey&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;caseFields&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;           &lt;span class="c1"&gt;// update state&lt;/span&gt;
        &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;StreamEntryID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;streamId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
            &lt;span class="n"&gt;redisTx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xadd&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;outboxKey&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;                  &lt;span class="c1"&gt;// write outbox event&lt;/span&gt;
                         &lt;span class="nc"&gt;StreamEntryID&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NEW_ENTRY&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outboxFields&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redisTx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exec&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;IllegalStateException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Transaction aborted"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RefundCommitted&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;caseId&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;eventId&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                                   &lt;span class="n"&gt;streamId&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the entire correctness guarantee in one block. Either both writes happen or neither does. The downstream world never sees a case that changed without a corresponding event.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxi0rh3oigwn498eel5s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxi0rh3oigwn498eel5s.jpg" alt="Diagram 3" width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The consumer — processing with recovery
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;tenantId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;InterruptedException&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;outboxKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SupportKeys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;forCase&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenantId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"unused"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;outboxKey&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;createConsumerGroup&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outboxKey&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentThread&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;isInterrupted&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// First: drain anything pending from a previous crashed worker&lt;/span&gt;
        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;StreamMessage&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;readGroup&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outboxKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;PENDING_ID&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;pending&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isEmpty&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;processEntries&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outboxKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Then: pick up new entries&lt;/span&gt;
        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;StreamMessage&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fresh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;readGroup&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outboxKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;NEW_ENTRY_ID&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;fresh&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isEmpty&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;processEntries&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outboxKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fresh&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sleep&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200L&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key detail here is the two-pass approach: &lt;strong&gt;always drain pending entries first&lt;/strong&gt;. If a worker crashed mid-processing, those entries are still sitting in the pending list with the previous consumer's name attached. This loop ensures they get picked up and retried — which is exactly the recovery behavior the pattern is designed to provide.&lt;/p&gt;

&lt;p&gt;Processing looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;processEntries&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;outboxKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;StreamMessage&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;StreamMessage&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;eventType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"event_type"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"RefundApproved"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eventType&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;billingGateway&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;issueRefund&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"refund_id"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"customer_id"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="c1"&gt;// Acknowledge only after successful processing&lt;/span&gt;
            &lt;span class="n"&gt;jedis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xack&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outboxKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;BILLING_GROUP_NAME&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Log and move on — message stays pending for retry&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to process {}: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMessage&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Acknowledge only after you've successfully processed. Never before. That's what keeps the pending list accurate and recovery reliable.&lt;/p&gt;
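&lt;p&gt;One gap worth noting: with the catch block above, a message that fails on every attempt stays in the pending list forever. A common guard (my addition, not something this article prescribes) is to read the per-message delivery counter that &lt;code&gt;XPENDING&lt;/code&gt; reports and dead-letter the entry after a bounded number of attempts. A minimal pure-Java sketch of that decision, with &lt;code&gt;MAX_DELIVERIES&lt;/code&gt; as an illustrative threshold:&lt;/p&gt;

```java
// Sketch of a poison-message guard. MAX_DELIVERIES and the string return
// values are illustrative assumptions, not part of the article's code.
public class RetryPolicy {
    static final long MAX_DELIVERIES = 5;

    // deliveryCount would come from XPENDING's per-message delivery counter.
    public static String decide(long deliveryCount) {
        if (deliveryCount > MAX_DELIVERIES) {
            // Park the entry (e.g. XADD it to a dead-letter stream), then
            // XACK the original so it stops blocking the pending list.
            return "DEAD_LETTER";
        }
        return "RETRY";
    }

    public static void main(String[] args) {
        System.out.println(decide(1));  // RETRY
        System.out.println(decide(10)); // DEAD_LETTER
    }
}
```

&lt;p&gt;Dead-lettered entries can then be inspected and replayed manually instead of being retried forever.&lt;/p&gt;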




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;If you're building AI agents that trigger real-world actions, here's what to walk away with:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent isn't the reliability problem.&lt;/strong&gt; The handoff between the agent's decision and the downstream work is where things break.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Save state, then publish" is fragile by design.&lt;/strong&gt; Any crash in the gap creates invisible inconsistency that only surfaces later, during customer complaints or manual audits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Transactional Outbox pattern removes the worst failure mode&lt;/strong&gt; by making the decision and the event a single atomic commit. If the commit succeeds, the event exists and delivery becomes a recoverable problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redis Streams are a lightweight fit&lt;/strong&gt;, especially when your application state already lives in Redis. The hash tag design for cluster slot colocation is the detail that makes the pattern work in production.&lt;/p&gt;
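&lt;p&gt;As a quick illustration of why hash tags work (the key names below are my own, not the article's): Redis Cluster hashes only the substring between the first &lt;code&gt;{&lt;/code&gt; and the next &lt;code&gt;}&lt;/code&gt;, so any keys sharing that tag map to the same slot and can participate in a single &lt;code&gt;MULTI&lt;/code&gt;/&lt;code&gt;EXEC&lt;/code&gt;:&lt;/p&gt;

```java
// Demonstrates the Redis Cluster hash-tag rule: if a key contains a
// non-empty {...} section, only that section is hashed for slot routing.
public class HashTagDemo {

    // Mirrors the cluster rule: hash the first non-empty {...} if present,
    // otherwise hash the whole key.
    public static String hashTag(String key) {
        int open = key.indexOf('{');
        if (open == -1) return key;
        int close = key.indexOf('}', open + 1);
        if (close == -1 || close == open + 1) return key;
        return key.substring(open + 1, close);
    }

    public static void main(String[] args) {
        String stateKey = "agent:{cust42}:state";
        String outboxKey = "agent:{cust42}:outbox";
        // Same tag, same slot: a transaction touching both keys is
        // legal in cluster mode.
        System.out.println(hashTag(stateKey).equals(hashTag(outboxKey))); // true
    }
}
```

&lt;p&gt;Without the shared tag, the state key and the outbox stream could land on different cluster nodes and the atomic commit would be impossible.&lt;/p&gt;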

&lt;p&gt;&lt;strong&gt;Idempotency and retention aren't optional.&lt;/strong&gt; Design for them before you ship, not after you hit the problem.&lt;/p&gt;
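&lt;p&gt;For idempotency, the consumer needs to recognize redeliveries of an already-processed event. A self-contained sketch of that guard; in production the &lt;code&gt;HashSet&lt;/code&gt; would be a Redis key written with &lt;code&gt;SET NX&lt;/code&gt; and a TTL, and retention would be handled by capping the stream with &lt;code&gt;XTRIM&lt;/code&gt;:&lt;/p&gt;

```java
import java.util.HashSet;

// Sketch of consumer-side idempotency: remember each processed event id
// and skip duplicates. The in-memory HashSet is a stand-in so the example
// is self-contained; a real deployment would use Redis SET NX with a TTL.
public class IdempotencyGuard {
    private final HashSet seen = new HashSet();

    // Returns true only the first time an event id is observed.
    public boolean firstDelivery(String eventId) {
        return seen.add(eventId);
    }

    public static void main(String[] args) {
        IdempotencyGuard guard = new IdempotencyGuard();
        System.out.println(guard.firstDelivery("evt-1")); // true: process it
        System.out.println(guard.firstDelivery("evt-1")); // false: redelivery, skip
    }
}
```

&lt;p&gt;The guard belongs before the side effect (the refund call), so a crash between processing and &lt;code&gt;XACK&lt;/code&gt; cannot cause a double charge on redelivery.&lt;/p&gt;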

&lt;p&gt;Agentic systems are gaining capability quickly. But capability without reliability is just a more impressive way to fail. Patterns like this one are what turn flashy demos into trustworthy production systems.&lt;/p&gt;




&lt;h1&gt;
  
  
  &lt;a href="https://www.kallis.in/blog/building-reliable-agents-with-the-transactional-outbox-pattern-and-redis-streams" rel="noopener noreferrer"&gt;Read More&lt;/a&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;If you found this useful, I write about systems design, distributed patterns, and engineering craft at &lt;a href="https://www.kallis.in" rel="noopener noreferrer"&gt;kallis.in&lt;/a&gt;. Come say hi.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;#ai&lt;/code&gt; &lt;code&gt;#redis&lt;/code&gt; &lt;code&gt;#distributedsystems&lt;/code&gt; &lt;code&gt;#architecture&lt;/code&gt; &lt;code&gt;#agents&lt;/code&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>redis</category>
      <category>ovaiseqayoom</category>
    </item>
    <item>
      <title>I Just Built a Production-Ready Authentication System with Supabase</title>
      <dc:creator>Ovaise Qayoom</dc:creator>
      <pubDate>Sat, 28 Mar 2026 07:15:08 +0000</pubDate>
      <link>https://dev.to/ovaiseq/i-just-built-a-production-ready-authentication-system-with-supabase-275</link>
      <guid>https://dev.to/ovaiseq/i-just-built-a-production-ready-authentication-system-with-supabase-275</guid>
      <description>&lt;h1&gt;
  
  
  Full Guide (Next.js App Router)
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Hey Dev Community!&lt;/strong&gt; 👋&lt;/p&gt;

&lt;p&gt;If you've ever struggled with &lt;strong&gt;real-world authentication&lt;/strong&gt; in a Next.js project — dealing with session expiry, middleware headaches, email verification, protected routes, or proper &lt;strong&gt;Row Level Security (RLS)&lt;/strong&gt; — this one's for you.&lt;/p&gt;

&lt;p&gt;I just published a &lt;strong&gt;complete, battle-tested guide&lt;/strong&gt; on my blog that walks you through building a &lt;strong&gt;full authentication system&lt;/strong&gt; using &lt;strong&gt;Supabase&lt;/strong&gt; and &lt;strong&gt;Next.js App Router&lt;/strong&gt;. No half-baked examples. No missing pieces. Everything you need to ship a secure, production-grade auth flow today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Guide is Different
&lt;/h2&gt;

&lt;p&gt;Most Supabase tutorials stop at &lt;code&gt;signInWithPassword&lt;/code&gt; and call it a day.&lt;br&gt;&lt;br&gt;
This one goes all the way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Middleware + HTTP-only cookies&lt;/strong&gt; with &lt;code&gt;@supabase/ssr&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Email verification&lt;/strong&gt; + &lt;strong&gt;password reset&lt;/strong&gt; flows&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Protected routes&lt;/strong&gt; using &lt;code&gt;getUser()&lt;/code&gt; on the server&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Role-Based Access Control (RBAC)&lt;/strong&gt; with Postgres RLS + profiles table&lt;/li&gt;
&lt;li&gt;✅ Proper client vs server Supabase clients&lt;/li&gt;
&lt;li&gt;✅ Security best practices &amp;amp; common pitfalls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Read the full step-by-step tutorial here:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://www.kallis.in/blog/build-full-authentication-system-supabase" rel="noopener noreferrer"&gt;How to Build a Full Authentication System with Supabase (Real Project Setup)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What You'll Learn
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Setting up &lt;strong&gt;Supabase Auth&lt;/strong&gt; correctly in Next.js 15/16&lt;/li&gt;
&lt;li&gt;Configuring &lt;strong&gt;middleware&lt;/strong&gt; for seamless session management&lt;/li&gt;
&lt;li&gt;Building a &lt;strong&gt;profiles table&lt;/strong&gt; with RLS policies&lt;/li&gt;
&lt;li&gt;Handling &lt;strong&gt;auth redirects&lt;/strong&gt; and edge cases&lt;/li&gt;
&lt;li&gt;The right mental model for scalable auth architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The guide is written from a &lt;strong&gt;real project perspective&lt;/strong&gt; — exactly how I implement auth for client work and SaaS products.&lt;/p&gt;

&lt;h2&gt;
  
  
  More from Kallis Blog
&lt;/h2&gt;

&lt;p&gt;Love deep-dive technical content? Check out these recent posts too:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://www.kallis.in/blog/future-of-react-server-components" rel="noopener noreferrer"&gt;React Server Components in 2026: The Complete Guide to Faster, Smarter React Apps&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://www.kallis.in/blog/mastering-tailwind-css" rel="noopener noreferrer"&gt;Mastering Tailwind CSS in 2026: The Complete Guide to Scalable, Production-Grade UI Systems&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Browse the full blog archive:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://www.kallis.in/blog" rel="noopener noreferrer"&gt;All Articles on Kallis Blog&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main website:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://www.kallis.in" rel="noopener noreferrer"&gt;www.kallis.in&lt;/a&gt;&lt;/strong&gt; – Web Development, SaaS, Backend APIs, Mobile Apps &amp;amp; SEO Services&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Would love your feedback!&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Drop a comment below if you’ve tried Supabase auth before. What was the biggest pain point for you?  &lt;/p&gt;

&lt;p&gt;If this guide helped (or if you want me to cover OAuth, MFA, or anything else next), give it a ❤️ and share it with your network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Follow me on X&lt;/strong&gt; for more Next.js + Supabase + modern web dev content: &lt;a href="https://twitter.com/kallisx1" rel="noopener noreferrer"&gt;@ovaiseqayoom&lt;/a&gt; (or just search Ovaise Qayoom).&lt;/p&gt;

&lt;p&gt;Happy coding! 🚀&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Published by Ovaise Qayoom | Kallis Blog&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>authjs</category>
      <category>ai</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
