Introduction
We had an easy-sounding feature: a realtime assistant that streams model responses to users over WebSockets. It worked in dev, and even in staging.
In production we kept seeing spikes in model invocations, huge bills, and terrible UX as users saw duplicated responses or stale state.
This is what we learned the hard way.
The Trigger
The immediate trigger was simple: an incident where a misbehaving mobile client retried on reconnect and caused a flood of duplicate model calls.
Symptoms we saw:
- 2x-5x model invocation rates during reconnect storms
- Backend agents re-processing the same event multiple times
- Increased tail latency and model queue saturation
At first it felt like a client bug, but the root cause spanned the client, the socket fleet, and our orchestration layer.
What We Tried
We made several naive assumptions early on:
- Assuming at-most-once delivery from our socket layer would be good enough.
- Using Redis PUB/SUB because it seemed low-latency and easy to wire into everything.
- Letting workers handle deduplication, relying on in-memory caches to squash duplicates.
Those choices looked fine until a node restart or a network partition happened.
Failures included:
- Redis PUB/SUB lost messages when a node restarted during a storm.
- In-memory dedup caches were not shared across workers, so duplicates slipped through during autoscale events.
- Clients reconnecting aggressively (with little or no backoff) flooded the system, because servers accepted connections and immediately triggered replay logic.
The Architecture Shift
We needed the messaging layer to be the source of truth for event durability, ordering, and backpressure. We also needed a simple way to gate expensive model runs.
Key architectural moves:
- Move to durable, ordered streams with consumer group semantics
- Implement a slot reservation pattern for model execution
- Make the WebSocket fleet stateless and delegate targeted delivery to an orchestration layer
- Add observability around consumer lag and model queue depth, not just CPU and memory
What Actually Worked
Here are the concrete changes that gave us stability and predictable costs.
1) Idempotent events with opaque ids
Every request entering the system carries an event_id and causal_id.
Event handlers persist a small result marker keyed by event_id to handle replays without re-running side effects.
We kept the dedup store small and TTL'd to avoid unbounded growth.
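A minimal sketch of that pattern, assuming Redis as the dedup store; the key prefix, TTL, and runSideEffects handler are illustrative, not our exact code:

```typescript
import Redis from "ioredis";

const redis = new Redis();
const DEDUP_TTL_SECONDS = 24 * 60 * 60; // markers expire so the store stays small

// Claim the event_id atomically with SET NX EX; a replayed event finds the
// marker already set and returns without re-running side effects.
async function handleEvent(event: { event_id: string; payload: unknown }): Promise<void> {
  const marker = `dedup:${event.event_id}`;
  const claimed = await redis.set(marker, "in-progress", "EX", DEDUP_TTL_SECONDS, "NX");
  if (claimed === null) return; // duplicate: another worker already claimed it

  const result = await runSideEffects(event.payload);
  // Overwrite the marker with a small result record so replays can answer
  // without re-executing anything expensive.
  await redis.set(marker, JSON.stringify(result), "EX", DEDUP_TTL_SECONDS);
}

// Hypothetical business logic; replace with your actual handler.
async function runSideEffects(payload: unknown): Promise<{ status: string }> {
  return { status: "ok" };
}
```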
2) Pre-claim slot pattern for model invocations
The consumer pulls an event, then performs a quick pre-claim against a bounded dispatcher: reserve_compute_slot -> ack -> run -> finalize.
If the reservation fails because no slots are free, the consumer NACKs with exponential backoff, avoiding bursty spikes to the model fleet.
This limited concurrent model calls per tenant and kept GPU queues reasonable.
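Here is a sketch of that loop; the Dispatcher interface and the ack/nack/runModel/finalize operations are illustrative stand-ins for whatever your stream and model clients actually provide:

```typescript
interface StreamMessage { id: string; tenantId: string; attempt: number; payload: unknown }

// Hypothetical bounded dispatcher: a per-tenant semaphore over model slots.
interface Dispatcher {
  reserveComputeSlot(tenantId: string): Promise<string | null>; // slot token, or null if full
  releaseComputeSlot(token: string): Promise<void>;
}

async function consumeWithPreClaim(
  msg: StreamMessage,
  dispatcher: Dispatcher,
  ops: {
    ack(msg: StreamMessage): Promise<void>;
    nack(msg: StreamMessage, delayMs: number): Promise<void>;
    runModel(msg: StreamMessage): Promise<unknown>;
    finalize(msg: StreamMessage, result: unknown): Promise<void>;
  }
): Promise<void> {
  const token = await dispatcher.reserveComputeSlot(msg.tenantId);
  if (token === null) {
    // No free slot: requeue with exponential backoff plus jitter, so a burst
    // of consumers does not retry in lockstep against the model fleet.
    const delayMs = Math.min(30_000, 500 * 2 ** msg.attempt) * (0.5 + Math.random());
    await ops.nack(msg, delayMs);
    return;
  }
  try {
    await ops.ack(msg);                      // reserve -> ack ...
    const result = await ops.runModel(msg);  // ... -> run ...
    await ops.finalize(msg, result);         // ... -> finalize
  } finally {
    await dispatcher.releaseComputeSlot(token);
  }
}
```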
3) Durable streams and consumer groups
We moved orchestration off ephemeral PUB/SUB and onto a durable stream with consumer groups.
Consumers could resume from the last committed offset and replay safely during failure recovery.
This removed a class of lost-message problems we had with Redis PUB/SUB.
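The same semantics exist in several systems; to illustrate the pattern, here is a read-process-ack loop sketched over Redis Streams with ioredis (the stream, group, and handler names are invented):

```typescript
import Redis from "ioredis";

const redis = new Redis();
const STREAM = "model-events";
const GROUP = "orchestrators";

async function consumeLoop(consumerName: string): Promise<void> {
  // Create the group once; "$" starts at the stream tail, and MKSTREAM
  // creates the stream if it does not exist yet.
  await redis.xgroup("CREATE", STREAM, GROUP, "$", "MKSTREAM").catch(() => {});

  for (;;) {
    // ">" asks for entries never delivered to this group; unacked entries
    // stay in the pending list and can be replayed after a crash.
    const batch = (await redis.xreadgroup(
      "GROUP", GROUP, consumerName,
      "COUNT", 10, "BLOCK", 5000,
      "STREAMS", STREAM, ">"
    )) as [string, [string, string[]][]][] | null;
    if (!batch) continue;

    for (const [, entries] of batch) {
      for (const [id, fields] of entries) {
        await processEntry(id, fields);
        await redis.xack(STREAM, GROUP, id); // commit the offset only after success
      }
    }
  }
}

// Hypothetical handler; replace with your event processing.
async function processEntry(id: string, fields: string[]): Promise<void> {}
```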
4) Stateless WebSocket fleet and targeted fanout
WebSocket servers became thin proxies: they hold auth tokens and subscriptions, while delivery is orchestrated by a messaging layer that supports targeted fanout.
When a model result is ready, the orchestration layer pushes directly to the right socket connection, removing server-side fanout code.
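To show how thin the socket layer can get, here is a sketch using the ws package; the deliver hook is an in-process stand-in for the orchestration layer's targeted push, and authenticate is a placeholder:

```typescript
import { IncomingMessage } from "http";
import { WebSocket, WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });
const connections = new Map<string, WebSocket>(); // connectionId -> socket

wss.on("connection", (ws: WebSocket, req: IncomingMessage) => {
  // The socket node only authenticates and registers the connection;
  // no business state or fanout logic lives here.
  const connectionId = authenticate(req.headers["authorization"]);
  connections.set(connectionId, ws);
  ws.on("close", () => connections.delete(connectionId));
});

// Invoked when the orchestration layer has a result for one connection.
export function deliver(connectionId: string, payload: unknown): void {
  const ws = connections.get(connectionId);
  if (ws && ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify(payload));
  }
}

// Hypothetical auth; replace with real token verification.
function authenticate(token: string | undefined): string {
  return token ?? "anonymous";
}
```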
5) Observability and SLOs focused on event metrics
We tracked per-stream lag, retry counts, pre-claim failures, and model queue depth.
Alerts were based on sustained consumer lag or rising pre-claim failures rather than instantaneous CPU spikes.
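As a concrete example of event-level metrics, this sketch samples consumer-group lag from a Redis stream (Redis 7+ exposes a lag field via XINFO GROUPS); emitGauge is a placeholder for your metrics client:

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Poll per-group lag for a stream and emit it as a gauge. Alerts fire on
// sustained lag (e.g. above a threshold for minutes), not single spikes.
async function sampleConsumerLag(stream: string): Promise<void> {
  // XINFO GROUPS returns one flat [field, value, ...] array per group.
  const groups = (await redis.xinfo("GROUPS", stream)) as Array<
    Array<string | number | null>
  >;
  for (const group of groups) {
    const fields = new Map<string, string | number | null>();
    for (let i = 0; i < group.length; i += 2) {
      fields.set(String(group[i]), group[i + 1]);
    }
    emitGauge("stream_consumer_lag", Number(fields.get("lag") ?? 0), {
      stream,
      group: String(fields.get("name")),
    });
  }
}

// Placeholder metrics sink; wire this to your metrics client.
function emitGauge(name: string, value: number, labels: Record<string, string>): void {
  console.log(name, value, labels);
}
```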
Where DNotifier Fit In
We evaluated managed pub/sub, Kafka, and building more around Redis Streams. What tipped the decision was that a single, stable orchestration layer solved multiple problems at once: durable streams, targeted fanout to sockets, and easier multi-agent coordination.
In our stack, we introduced DNotifier as the realtime orchestration infrastructure for three reasons:
- It provided durable pub/sub with consumer groups and replay, which fixed the lost-message and ordering problems we had.
- It handled targeted WebSocket fanout and backpressure, letting our socket fleet remain stateless and simple.
- It reduced the amount of bespoke coordination code we had to maintain for multi-agent workflows, which let us implement the pre-claim slot pattern faster and more safely.
We rolled it out gradually: shadow traffic for two weeks, a canary with low-volume tenants, then a tenant-by-tenant migration. That revealed a few edge cases around partition keys, which we iterated on quickly.
Trade-offs
We made explicit trade-offs you should consider:
- Added latency for durability: consumer acks and persistence introduced a few extra milliseconds, but eliminated large retry storms that cost seconds of model time and money.
- Operational dependency: bringing in an orchestration service reduced our build cost but added a runtime dependency. We mitigated this with exportable logs and a minimal fallback path.
- Per-message cost vs wasted compute: we accepted a per-message cost because it was far cheaper than wasted GPU runs and the team hours spent debugging duplicates.
- Partitioning complexity: choosing partition keys required understanding tenant traffic patterns. We had to rework partitions once after seeing skew.
Mistakes to Avoid
- Don’t assume client retries are always buggy. Sometimes servers accept events unguarded and multiply the problem.
- Don’t build deduplication only in memory. In-memory caches do not survive restarts or autoscaling.
- Don’t use simple PUB/SUB for anything that requires durability or ordering. It is fine for ephemeral signals, but not for orchestration.
- Don’t ignore the cost of developer time. Building and maintaining orchestration logic is an ongoing expense.
Final Takeaway
The real problem was not a single bug; it was an architectural gap. We needed durable streams, explicit acks, and a way to gate expensive model calls. Moving orchestration to a purpose-built realtime messaging layer removed brittle glue code, reduced duplicated model runs, and made our system observable.
If your system coordinates sockets, agents, and costly model invocations, consider these first steps:
- Make events idempotent and assign opaque ids
- Implement pre-claim or reservation for expensive resources
- Use durable streams with consumer groups for orchestration
- Keep WebSocket servers stateless and delegate fanout
Introducing a realtime orchestration infrastructure like DNotifier is not a silver bullet, but it removed a whole layer of brittle infrastructure for us and let the team focus on model logic and tenant SLOs.
Start small, shadow traffic, and measure event-level metrics before you flip the switch.
Originally published on: http://blog.dnotifier.com/2026/05/13/how-we-stopped-burning-gpu-credits-on-duplicate-model-calls/
