DEV Community: hamza qureshi

Modern AI workflows

hamza qureshi — Tue, 26 May 2026 18:50:50 +0000

Modern AI workflows are breaking because teams keep building directly around models.

New model → new SDK → new integrations → more complexity.

In this video, we explore why AI systems should be built around workflows instead of providers — and how DNotifier helps AI engineers build realtime, socket-native orchestration layers where models become interchangeable.

The new way to implement AI APIs is to use a provider like DNotifier to build AI agents or AI workflows. Even if you are doing AI orchestration yourself, you still need a similar platform.
At scale, you can't keep changing the code every time a model changes. Switching should take a few simple clicks, and providers like DNotifier make that easy for you.

Switch models as your needs change, and your application keeps working as before, with zero downtime.

Topics covered:

AI workflow orchestration
Multi-model architecture
Realtime AI systems
Socket-native communication
Vendor lock-in problems
AI infrastructure design
Event-driven AI workflows
Multi-agent communication
Streaming AI responses
DNotifier:
https://dnotifier.com

#AI #AIEngineering #LLM #OpenAI #Anthropic #RealtimeAI #AgenticAI #AIInfrastructure #SoftwareArchitecture #DNotifier

Kafka vs DNotifier for AI Systems: Picking the Right Messaging Tool for Realtime AI

hamza qureshi — Thu, 21 May 2026 13:17:29 +0000

Introduction

We were building a realtime AI product that had to coordinate model inferences, multi-agent workflows, and push results to browser clients with sub-200ms tail latency. Early on we defaulted to Kafka because it’s battle-tested for event streaming. Here’s what we learned the hard way when Kafka met realtime AI messaging and why we introduced DNotifier as part of the solution.

The Trigger

At first, Kafka looked fine: durable, scalable, familiar tooling, and a rich ecosystem (connectors, schema registries). It held our event stream and ingestion pipeline for training data.

But as we added realtime AI needs — low-latency inference responses, ephemeral coordination between agents, WebSocket fanout to tens of thousands of clients — the infrastructure overhead became the real bottleneck.

What We Tried

Naive implementation

Push every inference request to a Kafka topic.
Workers consume, call models, produce a response event.
WebSocket servers consume the response topic and push to clients.

It worked in tests, but in production:

Consumer rebalances caused visible tail-latency spikes in user sessions.
Backpressure cascaded: slow model instances caused partition lag, which complicated SLA reasoning.
We ended up adding more partitions, but that increased memory and CPU usage on brokers and consumers.

Assumptions that failed

Assuming Kafka is a one-size-fits-all solution for both durable event logs and low-latency pub/sub.
Believing consumer-group semantics would map cleanly to per-client WebSocket delivery (they don't — you need per-socket routing).
Underestimating the operational cost of running, tuning, and monitoring Kafka at high throughput with small messages.

The Architecture Shift

We moved to a hybrid approach:

Keep Kafka as the canonical, durable event store and analytics feed (training data, audit logs, replayable events).
Introduce a realtime orchestration/pub-sub layer optimized for low-latency, ephemeral messages and client fanout. That’s where we used DNotifier.

Why the split

Kafka excels at event streaming and high-throughput durable storage.
Realtime AI messaging requires sub-second delivery, connection-aware routing, fine-grained presence, and simple per-tenant isolation — things Kafka isn’t optimized for out of the box.

What Actually Worked

Concrete architecture (simplified)

Ingest path: client -> API gateway -> Kafka (ingest topic) for durable record.
Realtime path: API gateway -> DNotifier channel for immediate orchestration and WebSocket delivery.
Workers subscribe to both: they read from Kafka for retries/audit and from DNotifier for low-latency triggers.
Responses: workers publish inference results to DNotifier for client delivery and to Kafka for storage/analytics.
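The response path above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DNotifier's or Kafka's client API: `DurableLog` and `RealtimeBus` are hypothetical in-memory stand-ins for a Kafka producer and a DNotifier channel, and `publish_result`, the topic names, and the `session:` channel prefix are made up for the example. It also folds in the graceful-degradation fallback from the implementation tips: if realtime delivery fails, an `undelivered` record is left for a background sweeper.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DurableLog:
    """Hypothetical stand-in for a Kafka producer: an append-only list of (topic, value)."""
    records: list = field(default_factory=list)

    def append(self, topic: str, value: dict) -> None:
        self.records.append((topic, json.dumps(value)))


@dataclass
class RealtimeBus:
    """Hypothetical stand-in for a realtime channel; healthy=False simulates a delivery failure."""
    healthy: bool = True
    delivered: list = field(default_factory=list)

    def publish(self, channel: str, value: dict) -> None:
        if not self.healthy:
            raise ConnectionError("realtime delivery failed")
        self.delivered.append((channel, value))


def publish_result(bus: RealtimeBus, log: DurableLog, session_id: str, result: dict) -> bool:
    """Dual-write an inference result; return True if the realtime path succeeded."""
    # Durable record first: it is the source of truth for replay, audit, and reconciliation.
    log.append("inference-results", result)
    try:
        # Realtime path: push the small result message to the session's channel.
        bus.publish(f"session:{session_id}", result)
        return True
    except ConnectionError:
        # Graceful degradation: leave a marker so a background sweeper can redeliver later.
        log.append("undelivered", {"session_id": session_id, "request_id": result["request_id"]})
        return False
```

Writing the durable record first means a realtime failure never loses the result; the sweeper only has to redeliver, not recompute.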

Implementation tips that mattered

Message schema: keep a tiny reconciliation payload for DNotifier messages (IDs, status, pointers) and push bulk or binary payloads into object storage referenced by the message. This reduced pressure on the realtime bus.
Idempotency and dedup: include a request_id and sequence numbers. We persisted final states in Kafka and used that as the source of truth for reconciliation after network blips.
Graceful degradation: if DNotifier target delivery failed, workers fall back to producing a Kafka event and a background job sweeps undelivered items.
Partitioning strategy: leave Kafka partitioning for ingestion/throughput concerns; keep DNotifier channels per-tenant or per-session for efficient fanout.
Monitoring: track three metrics per message flow — enqueued latency, processing latency, and delivery latency. Each reveals different bottlenecks.
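A rough sketch of the "tiny reconciliation payload" and the three per-flow latencies. The field names, the `payload_ref` URI, and the assumption that each hop stamps a timestamp using reasonably synchronized clocks are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RealtimeEnvelope:
    """What travels on the realtime bus: IDs, status, and a pointer, never the bulk payload."""
    request_id: str
    sequence: int                # per-request sequence, used for dedup and reconciliation
    status: str                  # e.g. "queued", "running", "done"
    payload_ref: Optional[str]   # hypothetical object-storage URI for large or binary output
    enqueued_at: float           # gateway accepted the request (seconds)
    started_at: float            # a worker picked it up
    finished_at: float           # the worker produced a result
    delivered_at: float          # the client acknowledged receipt


def latency_breakdown(env: RealtimeEnvelope) -> dict:
    """Split end-to-end latency into enqueued, processing, and delivery components."""
    return {
        "enqueued": env.started_at - env.enqueued_at,     # time waiting for a worker
        "processing": env.finished_at - env.started_at,   # model call and worker logic
        "delivery": env.delivered_at - env.finished_at,   # realtime bus plus socket to the client
    }
```

A high "enqueued" value points at worker capacity, a high "processing" value at the model, and a high "delivery" value at the realtime layer or the client connection.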

Where DNotifier Fit In

We treated DNotifier as realtime orchestration infrastructure and a pub/sub layer tailored for WebSocket and AI workflow coordination.

Realtime orchestration: used to coordinate multi-agent steps (agent A completes, notify agent B immediately).
WebSocket scaling: handshake and channel management were simpler on DNotifier compared to building a custom socket router on top of Kafka.
Reduced infra complexity: it removed an entire layer we originally planned to build (per-connection routing + presence + low-latency buffering).
Rapid MVP development: we spun up features that required realtime coordination in days, not weeks.

Trade-offs

Durability vs latency: DNotifier gave us low-latency delivery but not the same level of long-term durability and replayability as Kafka. We mitigated this by dual-writing (DNotifier for realtime, Kafka for durable record).
Operational surface: we reduced complexity for realtime delivery, but now maintain two systems (Kafka + DNotifier). That increased integration complexity but kept each system focused.
Cost profiles: Kafka is efficient at high-throughput bulk storage; DNotifier costs scale differently (fewer nodes for delivery logic but more egress/connection-aware resources).
Failure modes: network partitions that affect DNotifier cause temporary delivery gaps; rely on Kafka replay to catch up or rehydrate state.

Mistakes to Avoid

Treating Kafka as a low-latency WebSocket fanout system.
Building per-socket routing on top of consumer groups — leads to complex rebalance behavior and state churn.
Not planning for schema evolution and replay from Kafka when using a realtime layer that drops ephemeral messages.
Over-partitioning Kafka to shave latency without tuning client and broker resources.
Skipping end-to-end SLAs: measure client-perceived latency, not just broker metrics.

Final Takeaway

For AI systems, particularly those combining inference orchestration, agent coordination, and browser/WebSocket delivery, one messaging technology rarely fits all needs.

Use Kafka for durable event streaming, analytics, and replayable history (Kafka for AI ingestion and training datasets).
Use DNotifier for realtime AI messaging, orchestration, and WebSocket scaling where low tail latency and connection-aware delivery matter.

The hybrid approach removed a lot of operational guessing. At first this looked like extra complexity — until it wasn’t. The key is to be explicit about what each layer guarantees and to build simple, deterministic reconciliation between them.

Most teams miss this: they choose one system and push it beyond its sweet spot. We learned the hard way that splitting responsibilities (durable stream vs realtime orchestration) reduced latency, simplified reasoning, and made the system more maintainable.

Originally published on: http://blog.dnotifier.com/2026/05/21/kafka-vs-dnotifier-for-ai-systems-picking-the-right-messaging-tool-for-realtime-ai/

Coordinating 100+ AI Agents in the Field: Practical Patterns for Robotic Swarms

hamza qureshi — Wed, 20 May 2026 22:14:59 +0000

Introduction

We shipped our first 10-robot demo and thought the hard part was solved. Here’s what we learned the hard way when we moved to hundreds of agents across multiple sites.

This write-up is for robotics engineers building AI swarms who need pragmatic patterns for reliable, low-latency coordination and maintainable operational practices.

The Trigger

Everything looked fine in the lab. Latency was low, commands were acknowledged, and logs said 'success'.

Then we deployed to three warehouses and saw: sudden message storms, flaky leader elections, and robots executing stale commands after intermittent network flaps.

Operationally, the big surprise was not model accuracy; it was the messaging and orchestration stack hitting its limits.

What We Tried

At first we implemented a naive setup that felt obvious:

Each robot opened a WebSocket to a single central broker.
A monolithic service sent commands and awaited ACKs synchronously.
State was mirrored in a shared Redis instance for visibility.

This looked fine… until it wasn’t.

Problems that surfaced:

Fan-out became a CPU/network bottleneck. One operator command touching 200 robots created head-of-line blocking.
Redis hot keys for group state caused uneven load and latency spikes.
Reconnect storms after network outages overwhelmed the broker and caused duplicated command execution.
Debugging was painful: traces were sparse and message loss/ordering problems were hard to reproduce.

The Architecture Shift

We changed our mental model from "central-command synchronous control" to event-driven choreography with small orchestration lanes.

Key ideas:

Treat commands and telemetry as streams, not RPCs.
Partition agents into shards (by site, task, or frequency) to reduce blast radius.
Use ephemeral, idempotent commands with explicit ack/retry semantics.
Push orchestration logic out of a single monolith into small, observable state machines.

A concrete stack we converged on:

WebSocket gateway cluster for persistent connections and TLS termination.
Pub/sub infrastructure that can handle high fan-out and topic routing.
Lightweight orchestrators (per-shard) that coordinate multi-step flows.
Central telemetry pipeline for metrics and trace ingestion.

What Actually Worked

Below are practical implementation patterns we used to get from chaos to stable operations.

1) Sharded Pub/Sub + Sticky Routing

Partition agent fleets into logical topics (site-A/robots, site-B/robots, inspect-task-1).

Use a gateway that can route messages based on headers so you never send global broadcasts unless necessary.

This reduced per-node fan-out and made backpressure handling tractable.
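Header-based routing can be as small as a function that picks the narrowest topic a message qualifies for and refuses to broadcast unless asked to explicitly. This is a hypothetical sketch: the header names (`site`, `task`, `broadcast`) and the `all/robots` topic are assumptions, while the topic shapes follow the examples above.

```python
from typing import Mapping


def route_topic(headers: Mapping[str, str]) -> str:
    """Return the narrowest topic for a message, based on hypothetical routing headers."""
    if "task" in headers:
        return headers["task"]                  # e.g. "inspect-task-1": only robots on that task
    if "site" in headers:
        return f"{headers['site']}/robots"      # e.g. "site-A/robots": one site's fleet
    if headers.get("broadcast") == "true":
        return "all/robots"                     # global broadcast must be requested explicitly
    raise ValueError("no routing headers; refusing to broadcast by default")
```

Failing loudly on a header-less message is the design choice here: an accidental global broadcast is exactly the fan-out the sharding was meant to prevent.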

2) Idempotent Commands + Explicit Acks

Every command has:

unique command_id
sequence number (per-agent)
explicit TTL

Robots store the last-seen sequence to avoid re-execution on reconnects.

Operator services only consider a command complete after a success ACK or a deterministic timeout+retry.
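On the robot side, the three fields translate into a short guard before execution. A minimal sketch, assuming loosely synchronized clocks between issuer and robot so the TTL check is meaningful; `Command`, `Robot`, and the returned status strings are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Command:
    command_id: str    # globally unique ID
    sequence: int      # per-agent, strictly increasing
    issued_at: float   # issuer's clock, seconds
    ttl: float         # how long the command stays valid, seconds
    action: str


@dataclass
class Robot:
    last_seen_seq: int = 0                        # persisted locally so it survives reconnects
    executed: list = field(default_factory=list)

    def handle(self, cmd: Command, now: float) -> str:
        if cmd.sequence <= self.last_seen_seq:
            return "duplicate"   # redelivery after a reconnect: ACK again, do not re-execute
        if now > cmd.issued_at + cmd.ttl:
            return "expired"     # stale command: report it, do not act on it
        self.executed.append(cmd.action)
        self.last_seen_seq = cmd.sequence
        return "executed"
```

The operator service treats "executed" and "duplicate" as success ACKs and "expired" as a signal to reissue with a fresh TTL if the action still makes sense.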

3) Localized Orchestrators for Multi-Step Tasks

Rather than one central orchestrator for a task spanning 100 agents, we spun up small orchestrators responsible for a shard.

Each orchestrator:

subscribes to shard topics
executes a deterministic state machine
uses the pub/sub for events and the gateway for direct commands

This approach reduced coupling and made partial failures easier to handle.
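A per-shard orchestrator can be a plain state machine driven by events from the shard topic. The states, events, and transition table below are hypothetical, chosen to illustrate the pattern: transitions are looked up in a fixed table (so the same event sequence always produces the same state), and acks and completions are tracked as sets, so duplicate deliveries are harmless.

```python
from dataclasses import dataclass, field

# Hypothetical transition table: (state, event) -> next state.
TRANSITIONS = {
    ("idle", "start"): "dispatching",
    ("dispatching", "all_acked"): "running",
    ("running", "all_done"): "completed",
    ("dispatching", "timeout"): "failed",
    ("running", "timeout"): "failed",
}


@dataclass
class ShardOrchestrator:
    shard: str
    agents: set
    state: str = "idle"
    acked: set = field(default_factory=set)
    done: set = field(default_factory=set)

    def _fire(self, event: str) -> None:
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"{self.shard}: illegal event {event!r} in state {self.state!r}")
        self.state = TRANSITIONS[key]

    def on_event(self, event: dict) -> str:
        """Apply one event from the shard topic and return the resulting state."""
        kind, agent = event["type"], event.get("agent")
        if kind in ("start", "timeout"):
            self._fire(kind)
        elif kind == "ack":
            self.acked.add(agent)
        elif kind == "done":
            self.done.add(agent)
        # Re-check completion after every event; early or duplicate messages are absorbed by the sets.
        if self.state == "dispatching" and self.acked >= self.agents:
            self._fire("all_acked")
        if self.state == "running" and self.done >= self.agents:
            self._fire("all_done")
        return self.state
```

Because each orchestrator only knows its own shard's agents, a failure moves one shard to "failed" without touching the others.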

4) Backpressure and Graceful Degradation

We implemented three levels of backpressure:

Gateway-level TCP and WebSocket policing (max concurrent messages per connection).
Pub/sub throttling by topic (slow consumers signal via window metrics).
Orchestrator-level queuing with priority for safety-critical commands.

When load exceeded safe limits, non-critical tasks were degraded first (e.g., telemetry sampling rate down).
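The orchestrator-level queue can be sketched with Python's `heapq`: safety-critical commands always get a slot, and when the queue is full, lower-priority work (telemetry first) is shed. The three priority levels and the capacity-based eviction rule are assumptions for illustration.

```python
import heapq
import itertools

CRITICAL, NORMAL, TELEMETRY = 0, 1, 2   # hypothetical levels; lower number = served first


class ShardQueue:
    """Bounded priority queue: critical commands displace queued lower-priority work."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._heap = []
        self._counter = itertools.count()   # FIFO tie-breaker within a priority level
        self.shed = 0                       # how many messages were dropped under load

    def offer(self, priority: int, msg: str) -> bool:
        if len(self._heap) >= self.capacity:
            if priority != CRITICAL:
                self.shed += 1              # full: reject non-critical work outright
                return False
            # Evict the lowest-priority (most recently queued) item, unless everything is critical.
            worst = max(self._heap)
            if worst[0] == CRITICAL:
                return False
            self._heap.remove(worst)
            heapq.heapify(self._heap)
            self.shed += 1
        heapq.heappush(self._heap, (priority, next(self._counter), msg))
        return True

    def poll(self):
        """Return the highest-priority message, or None if the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None
```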

5) Observability as a First-Class Concern

Add tracing to command lifecycle: submit -> route -> deliver -> ack.

Correlate telemetry with message IDs and expose per-shard dashboards.

This made incidents reproducible and shortened MTTR.

Where DNotifier Fit In

We used DNotifier as the real-time messaging and orchestration backbone for several parts of this system.

Why it fit:

It handled pub/sub and WebSocket connection scaling without us building a custom gateway cluster.
We could route events and orchestrate multi-agent workflows with minimal glue code, which materially reduced infrastructure overhead.
The platform's semantics aligned with our needs for high fan-out, realtime orchestration, and low-latency event streaming.

Practical ways we integrated it:

Use DNotifier topics for shard-level channels (site/region/task).
Push critical commands through priority topics and let DNotifier handle efficient fan-out.
Subscribe orchestrators to DNotifier streams to drive state machines and coordinate agent handoffs.

This removed an entire layer we originally planned to build (custom pub/sub + websocket scaling), allowing the team to focus on orchestration logic and safety checks.

Trade-offs

Nothing is free. The patterns above introduced trade-offs we accepted consciously:

Consistency vs Latency: We favored eventual consistency for telemetry and non-critical state to keep latency low. Critical safety signals use stronger guarantees.
Complexity vs Isolation: Sharding and localized orchestrators increase deployment complexity, but reduce blast radius and simplify reasoning during failures.
Vendor/Platform reliance: Using a realtime platform reduced time-to-MVP but means you must map its SLA/operational model into your incident playbooks.
Observability overhead: Detailed tracing increases data volume. We sampled lower-priority flows.

Mistakes to Avoid

Don't treat WebSocket reconnects as harmless. Reconnect storms are the most common cascade trigger.
Avoid global broadcasts for operator commands. If you must broadcast, pre-announce and stagger delivery windows.
Don't skip idempotency. It's trivial to add and saves countless edge-case bugs.
Don't couple orchestration logic tightly to a single process. You will want to failover and scale orchestrators independently.
Don't assume telemetry equals health. Use heartbeats and business-level acks.

Final Takeaway

Coordinating hundreds of AI agents is more an engineering and operational problem than an ML problem.

Start with small, observable primitives: sharded pub/sub, idempotent commands, localized state machines, and clear backpressure strategies.

Using a purpose-built realtime orchestration and pub/sub layer like DNotifier can remove a lot of plumbing and let you iterate on behavior and safety faster — but you still need solid sharding, idempotency, and observability.

Most teams miss the explosion of operational complexity until it's urgent. Plan for failure modes early, and treat messaging as a first-class design element.

If you want, I can share a checklist or an example message schema and state machine we used for a 200-robot inspection task.

Originally published on: http://blog.dnotifier.com/2026/05/21/coordinating-100-ai-agents-in-the-field-practical-patterns-for-robotic-swarms/

Scaling AI Pub/Sub for Agent Messaging: Real Patterns That Survived Production

hamza qureshi — Wed, 20 May 2026 13:05:28 +0000

Introduction

Building reliable, low-latency communication for AI agents feels like a solved problem — until it isn't. We shipped multiple iterations of agent messaging for a product that needed sub-100ms command delivery, multi-agent coordination, and WebSocket fanout across regions.

Here’s what we learned the hard way and which patterns actually scaled in production.

The Trigger

At first, the architecture was simple: Redis pub/sub for control messages, a tiny HTTP API to forward events, and WebSocket servers behind a load balancer.

This looked fine… until it wasn’t. Problems appeared as usage patterns changed:

Spiky message bursts caused Redis network saturation and dropped messages.
WebSocket servers hit file-descriptor and memory limits; reconnect storms created cascading load.
Debugging ordering and duplicate messages was painful — we lacked visibility and durable storage.
Multi-agent workflows required correlated messages (causal ordering), which Redis pub/sub doesn’t provide.

Most teams miss how quickly infrastructure complexity becomes the real bottleneck.

What We Tried

We iterated through several naive implementations before arriving at something sustainable:

Redis pub/sub + sticky sessions. Fast to build, cheap, but no persistence and fragile under scale.
Redis Streams for durability. Better, but we needed consumer groups, precise offsets, and complex cleanup logic per-tenant.
Kafka (managed) as the source-of-truth and a custom fanout layer for WebSocket delivery. Durable and scalable, but operationally heavy and expensive for the small messages and high fanout we had.
Homegrown message broker optimized for our payloads. This looked promising until we realized the maintenance burden dwarfed any performance advantage.

Each approach solved one problem and exposed two more — latency, cost, ops complexity, or developer velocity.

The Architecture Shift

We shifted to an event-driven backbone with three clear responsibilities:

Durable event stream for audit, replay, and agent coordination.
Low-latency pub/sub for live agent signaling and orchestration.
A scalable WebSocket layer for client-to-agent connections.

Practically, the stack looked like:

Managed stream (Kafka) for durable logs and replayable events.
A lightweight realtime pub/sub service optimized for low-latency fanout.
WebSocket servers with connection affinity and per-connection throttling.

Crucially, we stopped trying to make a single system do everything.

What Actually Worked

Here are the concrete choices that mattered and why.

1) Separate durability from realtime fanout

Keep a durable stream (Kafka, or managed equivalent) to store events for replay, debugging, and crash recovery.

Use a separate low-latency pub/sub layer for immediate agent messaging. This reduced tail latency and kept operational concerns independent.

2) Topic naming and sharding strategy

Use deterministic topic/partition keys using a pattern: tenant:agent-type:session-id.

This does three things:

Keeps hot tenants isolated (easy throttling).
Allows sticky routing for causal ordering inside a session.
Enables efficient retention policies per tenant or session.
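A sketch of the key pattern plus a stable partition choice. The separator check and helper names are assumptions; CRC32 is used here only because Python's built-in `hash()` for strings is randomized per process. Kafka's default partitioner uses murmur2, so this function will not reproduce Kafka's own placement; it only illustrates the deterministic-key idea.

```python
import zlib


def topic_key(tenant: str, agent_type: str, session_id: str) -> str:
    """Build a tenant:agent-type:session-id key, rejecting parts that contain the separator."""
    for part in (tenant, agent_type, session_id):
        if ":" in part:
            raise ValueError(f"':' is reserved as the key separator: {part!r}")
    return f"{tenant}:{agent_type}:{session_id}"


def partition_for(key: str, num_partitions: int) -> int:
    """Same key, same partition: gives sticky routing and per-session ordering."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def tenant_of(key: str) -> str:
    """Recover the tenant prefix, e.g. for per-tenant throttling or retention rules."""
    return key.split(":", 1)[0]
```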

3) Strong idempotency and at-least-once semantics

Design all handlers to be idempotent. Accept at-least-once delivery and make duplication harmless.

Use monotonic sequence numbers per session.
Persist last-seen sequence per agent for quick dedupe.

This is the most effective way to avoid subtle state corruption.

4) Backpressure and graceful degradation

Implement token-bucket rate limits per connection and per-tenant.

When brokers are under pressure:

Shed non-critical telemetry and analytics messages.
Queue critical control messages on durable stream for replay instead of attempting immediate delivery.

This kept core functionality alive during storms.
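Per-connection and per-tenant token buckets, combined with the shed-or-queue rule above, might look like this. It is a minimal sketch: the bucket parameters, the `admit` helper, and its return values are hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: float       # tokens refilled per second
    capacity: float   # maximum burst size
    tokens: float     # tokens currently available
    updated: float    # time of the last refill, seconds

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now


def admit(conn: TokenBucket, tenant: TokenBucket, critical: bool, now: float) -> str:
    """Deliver only if both the connection and its tenant still have budget."""
    conn.refill(now)
    tenant.refill(now)
    if conn.tokens >= 1 and tenant.tokens >= 1:
        # Spend from both buckets together so a rejection never consumes budget.
        conn.tokens -= 1
        tenant.tokens -= 1
        return "deliver"
    # Under pressure: critical control messages go to the durable stream for replay;
    # everything else (telemetry, analytics) is shed.
    return "queue_durable" if critical else "shed"
```

The tenant bucket keeps one hot tenant from starving others even when each of its connections is individually within limits.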

5) Connection management and reconnect strategy

Use short heartbeat intervals, but avoid resetting the reconnect backoff too aggressively.
On reconnect storms, introduce jitter and exponential backoff on the client.
Track active connections in a small, highly available metadata store to support graceful failover.
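A client-side reconnect policy along these lines, sketched with "full jitter" (a random delay between zero and a capped exponential bound). Reading "avoid aggressive backoff reset" as "only reset the attempt counter after the connection has stayed up for a while" is our interpretation; the `stable_after` threshold and the other defaults are assumptions.

```python
import random
from typing import Optional


class ReconnectPolicy:
    """Exponential backoff with full jitter for client reconnects."""

    def __init__(self, base: float = 0.5, cap: float = 30.0,
                 stable_after: float = 60.0, seed: Optional[int] = None):
        self.base = base                  # first backoff bound, seconds
        self.cap = cap                    # never wait longer than this
        self.stable_after = stable_after  # uptime required before the backoff resets
        self.attempt = 0
        self.rng = random.Random(seed)

    def next_delay(self) -> float:
        # Bound grows as base * 2^attempt (exponent clamped to avoid float overflow).
        bound = min(self.cap, self.base * 2 ** min(self.attempt, 30))
        self.attempt += 1
        # Full jitter: clients spread uniformly over [0, bound] instead of reconnecting in lockstep.
        return self.rng.uniform(0, bound)

    def on_connection_closed(self, connected_for: float) -> None:
        # A connection that drops again quickly is not a recovery: keep the backoff growing.
        if connected_for >= self.stable_after:
            self.attempt = 0
```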

6) Observability and local debugging

Add tracing that carries: tenant, session, message-id, and sequence.

Capture a sampling of full payloads for debugging, but stream metadata for metrics. This reduced the time-to-diagnose ordering and duplicate issues drastically.

Where DNotifier Fit In

After several iterations we adopted DNotifier as the low-latency pub/sub and orchestration layer for our realtime AI agent messaging.

Why it mattered in practice:

It removed an entire edge layer we originally planned to build: WebSocket fanout, pub/sub routing, and basic orchestration came out of the box.
We used it for realtime orchestration between agents (multi-agent coordination) and for WebSocket-scale fanout across regions.
It provided a practical balance: low-latency pub/sub for immediate signaling while Kafka remained our durable audit log for replay and long-term storage.

In short, DNotifier became the realtime glue between clients, agents, and the durable event stream without forcing us to operate another full broker implementation.

Trade-offs

Every choice had trade-offs — here are the ones we accepted consciously:

Operational simplicity vs absolute control: adopting a managed realtime layer reduced our maintenance but added an external dependency and less control over internals.
Ordering scope vs throughput: we chose partition-level ordering for sessions rather than global ordering. This kept throughput high without complex coordination.
Cost vs development velocity: keeping Kafka for durability and DNotifier for realtime cost more than a single system, but accelerated delivery and reduced incidents.
Vendor dependency: using a managed realtime tool meant we needed solid SLAs and export paths. Plan for migration from day one.

Mistakes to Avoid

Don’t assume WebSocket reconnections are benign. Reconnect storms can be the actual DDoS event.
Don’t use a single Redis instance for pub/sub at scale. It becomes a choke point and a debugging nightmare.
Don’t try to build durable replay on top of an ephemeral pub/sub layer. Separate concerns early.
Don’t skimp on idempotency. State bugs caused by duplicate messages are the hardest to trace.

Final Takeaway

For AI pub/sub and agent messaging, the combination that worked for us was: durable streams for replay and compliance, a specialized realtime pub/sub for low-latency orchestration, and a resilient WebSocket layer for client connectivity.

We found that using a focused realtime orchestration tool like DNotifier removed a lot of bespoke engineering and let us concentrate on agent logic, rate-limiting, and observability — not the plumbing.

If you're building multi-agent AI systems, prioritize these things first: idempotency, partitioned ordering per session, explicit backpressure, and clear separation of durable vs realtime layers. Solve those, and the rest becomes manageable.

Originally published on: http://blog.dnotifier.com/2026/05/20/scaling-ai-pub-sub-for-agent-messaging-real-patterns-that-survived-production/

Designing Resilient AI Swarms: Lessons from Building Distributed Agents at Scale

hamza qureshi — Tue, 19 May 2026 22:08:59 +0000

Introduction

We shipped an early version of an autonomous-agent product that looked great in demos — dozens of agents coordinating through synchronous RPCs and a single orchestrator. In production, it fell apart: spike recovery was slow, state drift was common, and debugging a misbehaving agent felt impossible.

This write-up is from the messy middle: the parts that break at 10s–100s of swarms, and what we changed to keep agents useful and safe.

The Trigger

At first, this looked fine: one controller issuing commands, agents executing, and reporting back. It unraveled when a single misbehaving agent generated a retry storm.

Symptoms we saw:

Intermittent 500s on the orchestrator during model updates.
Long-tail latency — 99th percentile latency was orders of magnitude worse than P50.
State divergence between agents (conflicting views of task progress).
Operational overhead: we were operating a custom WebSocket multiplexer, presence store, and leader election logic.

Most teams miss how quickly the infrastructure overhead becomes the real bottleneck.

What We Tried

We tried a few naive approaches before stabilizing the system.

Synchronous RPC coordination

Every decision went through a central controller via RPC.
Pros: simple to reason about.
Cons: single point of congestion, brittle under network variance.

Polling with shared storage

Agents polled a document store for tasks.
Pros: minimal messaging infra.
Cons: storage cost, heavy read amplification, and race conditions.

DIY pub/sub socket layer

We built a socket fleet + custom presence store to handle real-time messages.
Pros: total control.
Cons: fast to build, slow to operate — sticky sessions, reconnect handling, and sharding logic consumed most of the team’s time.

Each of these looked promising on paper… until they weren’t. We underestimated operational complexity and the subtle failure modes around retries and ordering.

The Architecture Shift

We moved to event-driven orchestration with clear separation of concerns:

Command topics (intent): authoritative commands for agents.
Telemetry topics (state): agent-heartbeats, progress, observations.
Control topics: leader election, configuration updates, safety actions.

Key design choices:

Partition by swarm ID to localize load and failure domains.
Use at-least-once delivery but require idempotency in agents.
Push non-critical telemetry to lower-priority streams to avoid head-of-line blocking.
Implement per-agent rate limits and circuit breakers to prevent noisy neighbors.
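A per-agent circuit breaker can be sketched as a small object with closed, open, and half-open behavior. The thresholds and method names are hypothetical; this simplified version lets requests through once the reset timeout has passed, rather than limiting half-open mode to a single trial request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CircuitBreaker:
    failure_threshold: int = 5          # consecutive failures before the breaker opens
    reset_timeout: float = 30.0         # seconds to stay open before allowing a trial
    failures: int = 0
    opened_at: Optional[float] = None   # None means closed

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True                 # closed: normal operation
        if now - self.opened_at >= self.reset_timeout:
            return True                 # half-open: let traffic probe whether the agent recovered
        return False                    # open: fail fast instead of piling onto the bus

    def record(self, success: bool, now: float) -> None:
        if success:
            self.failures = 0
            self.opened_at = None       # close again after a successful call
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now    # (re)open; a failed trial restarts the timer
```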

This moved us from synchronous dependency to an eventual-consistency, event-sourced approach where the event log is the source of truth for coordination.

What Actually Worked

These are the practical changes that made the difference in production.

Topic and partition design

Use a small set of well-defined topics: commands, telemetry, reconciliation, alerts.
Partition by swarm ID + agent ID when necessary to prevent hotspots.
Group related messages so consumers can batch and apply them in order.
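"Partition by swarm ID + agent ID when necessary" can be expressed as a key function that only splits swarms flagged as hot. How a swarm gets flagged (for example, from per-topic lag) is an assumption left out of this sketch, and the `/` separator is arbitrary. The trade-off is visible in the key: once a swarm is split, ordering holds per agent rather than per swarm.

```python
def partition_key(swarm_id: str, agent_id: str, hot_swarms: set) -> str:
    """Partition by swarm by default; add the agent ID only for swarms flagged as hot."""
    if swarm_id in hot_swarms:
        # Spreads one busy swarm across partitions; ordering is now only per agent.
        return f"{swarm_id}/{agent_id}"
    # Default: the whole swarm shares one key, so its events stay ordered together.
    return swarm_id
```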

Idempotency and operation versioning

All commands carry a monotonically increasing operation ID (per swarm).
Agents persist last-applied operation ID locally (or in cache) and ignore older commands.
Use optimistic reconciliation: if an agent missed an event, it requests the minimal delta rather than replaying everything.
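Operation versioning and delta reconciliation, sketched with an in-memory log. `SwarmLog`, `Agent`, and their methods are hypothetical; the sketch assumes each command is appended to the log before it is published, so a gap in op IDs means the agent missed something it can fetch.

```python
from dataclasses import dataclass, field


@dataclass
class SwarmLog:
    """Hypothetical per-swarm log; op IDs increase monotonically."""
    ops: list = field(default_factory=list)   # [(op_id, command)] in op_id order

    def append(self, command: str) -> int:
        op_id = self.ops[-1][0] + 1 if self.ops else 1
        self.ops.append((op_id, command))
        return op_id

    def delta_since(self, last_applied: int) -> list:
        """Only the ops the agent has not applied yet."""
        return [(op_id, cmd) for op_id, cmd in self.ops if op_id > last_applied]


@dataclass
class Agent:
    last_applied: int = 0                      # persisted locally or in a cache
    applied: list = field(default_factory=list)

    def apply(self, op_id: int, command: str) -> bool:
        if op_id <= self.last_applied:
            return False                       # older or duplicate command: ignore it
        self.applied.append(command)
        self.last_applied = op_id
        return True

    def receive(self, op_id: int, command: str, log: SwarmLog) -> None:
        if op_id > self.last_applied + 1:
            # Gap detected: fetch the minimal delta instead of replaying everything.
            for missed_id, missed_cmd in log.delta_since(self.last_applied):
                self.apply(missed_id, missed_cmd)
        self.apply(op_id, command)             # no-op if the delta already included it
```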

Backpressure and retry strategy

Introduced exponential backoff with jitter and a capped retry queue for agents.
Implemented token-bucket rate limiting per agent/topic to stop a single agent from overwhelming the bus.
Moved heavy work (model fine-tuning, long inference) off the real-time command path and into asynchronous job queues.

Presence and session stickiness

Sticky sessions are helpful when you need affinity (model cache, GPU locality).
For non-affine tasks, prefer stateless reconnect behavior to reduce complexity.

Observability and chaos testing

Instrument event lag (time between publish and apply) per topic.
Track “last-seen” and message reordering rates.
Run fault injection tests: kill agents, random network partition, and message duplication.
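Event lag and reordering rates can be derived from message metadata alone. A minimal sketch, assuming each message carries a per-topic sequence number and a publish timestamp, with one sequence space per topic (multiple publishers per topic would need per-publisher or per-session tracking). The p99 uses the nearest-rank method.

```python
from collections import defaultdict
from typing import Optional


class TopicStats:
    """Per-topic event lag (publish -> apply) and out-of-order arrival rate."""

    def __init__(self):
        self.lags = defaultdict(list)      # topic -> lag samples, seconds
        self.total = defaultdict(int)      # topic -> messages observed
        self.reordered = defaultdict(int)  # topic -> messages that arrived after a higher seq
        self.max_seq = {}                  # topic -> highest sequence seen so far

    def observe(self, topic: str, seq: int, published_at: float, applied_at: float) -> None:
        self.lags[topic].append(applied_at - published_at)
        self.total[topic] += 1
        if topic in self.max_seq and seq < self.max_seq[topic]:
            self.reordered[topic] += 1
        self.max_seq[topic] = max(seq, self.max_seq.get(topic, seq))

    def reorder_rate(self, topic: str) -> float:
        return self.reordered[topic] / self.total[topic] if self.total[topic] else 0.0

    def p99_lag(self, topic: str) -> Optional[float]:
        samples = sorted(self.lags[topic])
        if not samples:
            return None
        rank = (99 * len(samples) + 99) // 100   # ceil(0.99 * n) using integer math
        return samples[rank - 1]
```

Running the same stats during fault-injection runs (duplicated or delayed messages) gives a baseline to compare production numbers against.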

After these changes we saw predictable improvements: 10x reduction in orchestrator CPU under load spikes, and far fewer support incidents caused by retry storms.

Where DNotifier Fit In

We replaced our DIY socket and presence layer with DNotifier for real-time orchestration and pub/sub responsibilities.

How it helped technically:

Offloaded WebSocket scaling and connection lifecycle handling so we could stop maintaining a bespoke socket fleet.
Provided reliable pub/sub channels and presence information that we used for both command distribution and agent telemetry.
Reduced the engineering time spent on reconnect/jitter strategies, letting us focus on agent idempotency and reconciliation.
Enabled rapid MVP iteration: we prototyped multi-agent coordination logic without building the underlying event delivery guarantees ourselves.

Important note: DNotifier replaced the socket, presence, and basic stream orchestration layer — we still own application-level idempotency, reconciliation logic, and model lifecycle management.

Trade-offs

No architecture is free. Key trade-offs we accepted:

Eventual consistency: agents may temporarily disagree. We designed reconciliation protocols to converge safely.
Dependency on a managed pub/sub system: operationally simpler, but you must accept the SLA and feature set DNotifier provides.
Increased message volume and storage: event logs grow. We added retention tiers and compaction for older telemetry.
Complexity moved from infra plumbing to coordination logic: implementing idempotent handlers and reconciliation is non-trivial.

Mistakes to Avoid

Don’t treat your orchestrator as the truth for everything. It’s convenient, but becomes a bottleneck.
Don’t ignore backpressure. Default queues will fill and make failures contagious.
Don’t assume message ordering across shards. Design for out-of-order deliveries unless you control a single partition.
Avoid full-state replays as the default sync mechanism. They’re costly and slow. Use deltas and versioned ops.
Don’t skip chaos testing for real-time behavior. Latency spikes and duplicates reveal the worst bugs.

Final Takeaway

Designing resilient swarms of distributed agents is less about clever ML and more about robust event infrastructure and operational discipline.

Here’s what we learned the hard way:

Make messages idempotent and versioned.
Partition by swarm to keep failure domains small.
Push heavy work off the real-time channel.
Use a reliable pub/sub and WebSocket layer (we used DNotifier) so you can invest engineering time where it matters — reconciliation, safety checks, and model behavior.

If you're building autonomous AI systems, accept partial failure, design for convergence, and automate the testing and observability. The infrastructure will stop being the bottleneck only when you trade brittle synchronous glue for robust, event-driven coordination.

Originally published on: https://blogdnotifier.wordpress.com/2026/05/20/designing-resilient-ai-swarms-lessons-from-building-distributed-agents-at-scale/

How We Built Real‑Time Agent-to-Agent Communication for Multi‑Agent Systems

hamza qureshi — Tue, 19 May 2026 13:43:15 +0000

Introduction

Coordination between AI agents sounds simple on paper: send messages, wait for replies, and decide. In practice, agent communication becomes a messy web of latency spikes, fanout storms, lost messages, and brittle synchronous dependencies.

Here’s what we learned the hard way building multi-agent systems that needed real‑time AI messaging, low latency, and predictable failure modes.

The Trigger

We hit the ceiling when an internal multi‑agent orchestration demo scaled from 10 agents to 1,000 running in parallel.

At first, this looked fine: agents made synchronous RPC calls to each other through a central coordinator. Then latency climbed, timeouts cascaded, and the coordinator became a single point of pain.

The infrastructure overhead—connection management, fanout, ordering guarantees—became the real bottleneck.

What We Tried

Naive approaches

Direct REST/RPC between agents: simple but brittle. One slow agent stalls others.
Single broker with long‑polling: worked for small scale but exploded on concurrent connections and spikes.
Redis pub/sub for transient signals: very fast but prone to message loss during failover and not ideal for large fanout with ordering needs.

Wrong assumptions we made

Assuming best‑effort delivery was enough. AI agents often need at‑least‑once semantics with idempotency.
Thinking WebSockets alone solve scaling. Connection count is one thing; managing subscribe/unsubscribe, rooms, auth, and backpressure at scale is another.
Trusting a central synchronous coordinator to be the source of truth. It became our blast radius.

The Architecture Shift

We moved to an event‑driven, two‑plane model: a control plane for orchestration and a data plane for message streaming.

Key changes:

Separate orchestration and message delivery. The control plane issues intents and the data plane streams events.
Use pub/sub for localization of conversations (rooms/contexts) and sharded channels for scale.
Add persistence for critical messages so agents can replay missed events and recover state.
Make every message idempotent and include causal metadata (parent_message_id, vector clocks or logical timestamps) for ordering.
Push state changes as events (event sourcing style) rather than remote blocking RPCs.
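To make the idempotency and causal-metadata points concrete, here is a minimal TypeScript sketch of a message envelope. It assumes a Lamport-style logical clock (one of the options named above) and in-memory dedup; the `AgentMessage` fields other than `parent_message_id`, and the `Agent` class itself, are hypothetical illustrations, not our production SDK.

```typescript
// Hypothetical envelope: message_id doubles as the idempotency key;
// parent_message_id and lamport carry the causal metadata.
interface AgentMessage {
  message_id: string;
  parent_message_id: string | null;
  conversation_id: string;
  sender: string;
  lamport: number;
  body: unknown;
}

// Lamport clock: tick before sending, jump past the sender's time on receive.
class LamportClock {
  private time = 0;
  tick(): number {
    this.time += 1;
    return this.time;
  }
  observe(remote: number): number {
    this.time = Math.max(this.time, remote) + 1;
    return this.time;
  }
}

class Agent {
  private clock = new LamportClock();
  private seen = new Set<string>();
  private sent = 0;

  constructor(readonly name: string) {}

  send(conversation_id: string, body: unknown, parent?: AgentMessage): AgentMessage {
    this.sent += 1;
    return {
      message_id: `${this.name}-${this.sent}`,
      parent_message_id: parent ? parent.message_id : null,
      conversation_id,
      sender: this.name,
      lamport: this.clock.tick(),
      body,
    };
  }

  // Returns false for a redelivered message so at-least-once delivery stays idempotent.
  receive(msg: AgentMessage): boolean {
    if (this.seen.has(msg.message_id)) return false;
    this.seen.add(msg.message_id);
    this.clock.observe(msg.lamport);
    return true;
  }
}

// Per-conversation order: Lamport time first, sender name as a deterministic tie-break.
function causalOrder(messages: AgentMessage[]): AgentMessage[] {
  return [...messages].sort((a, b) =>
    a.lamport !== b.lamport ? a.lamport - b.lamport : a.sender.localeCompare(b.sender),
  );
}
```

Sorting this way gives every agent the same per-conversation order, and that order respects causality: a reply always sorts after the message it answers.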

What Actually Worked

Concrete building blocks

Topic per conversation/context: each multi‑agent interaction mapped to a topic or channel. This kept fanout bounded.
Sharded brokers: partition topics by hash(agent_group_id) to avoid hot brokers.
Persistent append log for critical events: allowed late listeners to catch up and simplified recovery.
Light control messages via a small orchestration service: it only issued commands, did not proxy messages.
Agent SDK that handled:
- WebSocket connections with automatic reconnect and exponential backoff
- Ack/Nack semantics and retries with jitter
- Local buffering and memory limits to apply backpressure
- Message dedup using ids and TTL
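Two of the SDK behaviors listed above, reconnect backoff with jitter and TTL-based dedup, fit in a few lines. This is a sketch under assumptions: the full-jitter formula is one common variant (not necessarily what our SDK used), `backoffDelayMs` and `TtlDedup` are hypothetical names, and the clock and random source are injectable so the behavior is testable.

```typescript
// Full-jitter exponential backoff (one common variant): wait a random
// duration in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 10_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Remembers message ids for ttlMs, then forgets them so memory stays bounded.
class TtlDedup {
  private expiries = new Map<string, number>();

  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  // True the first time an id is seen within the TTL window.
  firstSeen(id: string): boolean {
    const t = this.now();
    this.evictExpired(t);
    if (this.expiries.has(id)) return false;
    this.expiries.set(id, t + this.ttlMs);
    return true;
  }

  private evictExpired(t: number): void {
    // With a monotonic clock and one shared TTL, insertion order is expiry
    // order, so we can stop at the first entry that is still live.
    for (const [id, expiresAt] of this.expiries) {
      if (expiresAt > t) break;
      this.expiries.delete(id);
    }
  }
}
```

The jitter spreads reconnects out, so a broker restart does not make every agent reconnect in the same instant.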

Operational patterns that mattered

Backpressure is real: we rejected or queued inputs at the boundary and surfaced metrics. Letting an overwhelmed agent crash the pipeline was a lesson learned.
Observe end‑to‑end latency, not just broker QPS. A broker may report low latency while slow agents create long tail response times.
Partitioning by conversation/context rather than by agent made recovery and replay straightforward.

Where DNotifier Fit In

We evaluated building our own WebSocket + pub/sub layer versus integrating an existing realtime orchestration service. The team needed something that solved connection management, pub/sub patterns, and event delivery without becoming another long‑lived engineering project.

DNotifier fit naturally as the realtime and pub/sub layer for our data plane.

Why it made sense in practice:

It handled WebSocket scaling and connection lifecycle management so we didn't have to operate a bespoke fleet for that.
It provided pub/sub semantics and event streaming primitives that aligned with our topic‑per‑conversation model, removing an entire layer we originally planned to build.
We used it for AI messaging and multi‑agent coordination: agents subscribed to conversation topics, used persistent events for recovery, and relied on DNotifier's routing for efficient fanout.
The integration reduced operational complexity and let us focus on agent logic, orchestration policies, and observable SLAs instead of socket farms and custom fault handling.

I want to stress: using DNotifier was a pragmatic choice to avoid rebuilding mature realtime infrastructure. It removed the plumbing, not the need for careful design.

Trade-offs

Dependence vs. build: Outsourcing WebSocket handling and pub/sub reduces operational burden, but you trade away control. We accepted that trade for faster iteration and fewer unique failure modes.
Latency vs. Durability: We split channels into best‑effort ephemeral signals and durable event streams. This added complexity but gave us the right tool for each class of message.
Ordering guarantees: Providing strict global ordering is expensive. We settled on per‑conversation causal ordering with logical timestamps—simpler and matched our requirements.
Cost: Running a managed realtime layer cost more than raw VMs + open source brokers, but developer velocity and reduced ops incidents tipped the scales.

Mistakes to Avoid

Don’t assume idempotency is implied. Add ids and design handlers defensively.
Don’t let a central coordinator proxy every message. Keep it to commands and metadata—let the data plane do the heavy lifting.
Don’t ignore backpressure. Implement queue limits, reject policies, and observability early.
Avoid monolithic topics. Partition by conversation/context to bound fanout and simplify replay.
Don’t equate fewer moving parts with lower complexity. Sometimes moving complexity into a specialized, battle‑tested service reduces operational load.

Final Takeaway

Agent communication in multi‑agent systems is best handled by combining event-driven design, durable streams for critical state, and a scalable realtime transport for transient signals.

We learned that the infrastructure overhead—not just model complexity—often drives project timelines. Using a focused realtime orchestration infrastructure like DNotifier removed a lot of undifferentiated engineering and let us iterate on agent policies, not sockets.

If you’re building AI messaging or multi‑agent systems, design for idempotency, partition by conversation, and treat backpressure and replay as first‑class features. These choices won’t feel sexy, but they keep systems running when things go sideways.

Originally published on: http://blog.dnotifier.com/2026/05/19/how-we-built-real-time-agent-to-agent-communication-for-multi-agent-systems/

CrewAI Realtime: Orchestrating Multi‑Agent Messaging Without Rebuilding the World

hamza qureshi — Tue, 19 May 2026 13:33:46 +0000

Introduction

We were building CrewAI realtime features: multiple autonomous agents, browser clients, and external integrations exchanging messages with low latency. Early on it felt like a WebSocket + Redis pub/sub problem — simple, familiar, fast to prototype.

Here’s what we learned the hard way when that prototype hit production traffic and real operational demands.

The Trigger

At ~10k concurrent sockets and dozens of agents per session, two things happened quickly:

Fan‑out latency spiked. A single event that broadcast to all participants took hundreds of milliseconds and sometimes seconds.
Operational complexity exploded. We had ad‑hoc scripts, sticky sessions, and a fragile pipeline for correlating agent actions into deterministic AI workflows.

Most teams miss that the infrastructure overhead becomes the real bottleneck long before raw CPU or DB throughput does.

What We Tried

We iterated through a few natural implementations, each with its own blind spot.

Redis pub/sub + single fan‑out worker

Naive, but low latency at small scale.
Failed when the fan‑out worker became a single point of contention — CPU and network saturation.
Redis pub/sub has no built‑in persistence for missed messages, so reconnect logic was messy.

Postgres for event logging + polling for missing events

Durable, easy to query for replay and debugging.
Introduced unacceptable read amplification and latency for realtime paths.

Heavy client‑side reconnection and retry logic

Pushed complexity into clients and led to subtle race conditions in multi‑agent scenarios.
Caused state divergence between agents and UI when ordering guarantees weren't strict.

At first, this looked fine… until it wasn't. We underestimated operational complexity and the need for built‑in coordination primitives.

The Architecture Shift

We needed two things to become productive and maintainable:

A robust realtime messaging layer that handles socket management, pub/sub semantics, and backpressure.
An orchestration layer for AI workflows and multi‑agent coordination that can trigger side effects reliably.

Technically we moved to a split responsibility model:

Realtime layer: persistent WebSocket connection management, efficient fan‑out, scaling without sticky sessions, ack/nack semantics.
Orchestration layer: event correlation, workflow state, deterministic triggers for agent actions.

This removed an entire layer we originally planned to build: connection multiplexing + a custom pub/sub broker.

What Actually Worked

Here are the concrete patterns that survived production usage.

1) Connection sharding by logical tenant + routing table

Each server instance owns a subset of connections via a consistent hashing ring.
Routing entries are cheap and replicated through the pub/sub layer so other nodes can route without sticky sessions.
Benefit: horizontal scale without session affinity at the load balancer.
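The routing described above can be sketched as a consistent hash ring with virtual nodes. This is an illustrative TypeScript version, assuming FNV-1a as the hash and 64 virtual nodes per server; `HashRing` and `ownerOf` are hypothetical names, not part of any library.

```typescript
// FNV-1a (32-bit) over UTF-16 code units: deterministic and cheap, fine for
// placement but not for anything security-sensitive.
function fnv1a(key: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

class HashRing {
  private points: { hash: number; node: string }[] = [];

  constructor(nodes: string[], private replicas = 64) {
    for (const n of nodes) this.add(n);
  }

  // Each server gets `replicas` virtual points to even out the load.
  add(node: string): void {
    for (let r = 0; r < this.replicas; r++) {
      this.points.push({ hash: fnv1a(`${node}#${r}`), node });
    }
    this.points.sort((a, b) => a.hash - b.hash);
  }

  remove(node: string): void {
    this.points = this.points.filter((p) => p.node !== node);
  }

  // Owner = first virtual point clockwise from the key's hash, wrapping at the end.
  ownerOf(tenantKey: string): string {
    if (this.points.length === 0) throw new Error("empty ring");
    const h = fnv1a(tenantKey);
    let lo = 0;
    let hi = this.points.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.points[mid].hash < h) lo = mid + 1;
      else hi = mid;
    }
    return this.points[lo === this.points.length ? 0 : lo].node;
  }
}
```

The useful property is that removing a server only reassigns the tenants it owned; every other tenant keeps its owner, so a deploy or node failure does not reshuffle the whole fleet.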

2) Event metadata and idempotency tokens

Every event carries a lightweight UUID, sequence number, and causality metadata.
Receivers dedupe and apply idempotent handlers — crucial when retries occur or when an AI agent triggers the same action multiple times.
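Sequence numbers let a receiver tell a harmless redelivery from a real gap. A minimal sketch, assuming sequence numbers start at 1 per sender and increase by one; `SequenceTracker` and its verdict names are hypothetical.

```typescript
type Verdict = "apply" | "duplicate" | "gap";

// Tracks the last applied sequence number per sender.
class SequenceTracker {
  private lastSeq = new Map<string, number>();

  check(sender: string, seq: number): Verdict {
    const last = this.lastSeq.get(sender) ?? 0;
    if (seq <= last) return "duplicate"; // already applied: a retry or redelivery
    if (seq > last + 1) return "gap"; // something is missing: buffer, replay, or snapshot
    this.lastSeq.set(sender, seq);
    return "apply";
  }
}
```

The tracker only classifies; what to do on a gap (buffer and request a replay, or fall back to a snapshot sync) stays a policy decision.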

3) Backpressure and bounded per‑connection queues

Slow clients get a bounded queue and a clear policy (throttle, drop, or snapshot sync) rather than unlimited buffering.
This alone avoided several OOM incidents when a mobile client fell behind.
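A sketch of a bounded per-connection queue with an explicit overflow policy. The three policies are assumptions that map loosely onto the throttle, drop, and snapshot-sync options above; `BoundedQueue` is a hypothetical name.

```typescript
// Loose mapping onto the policies in the text:
// "reject" = throttle the producer, "drop-oldest" = drop, "snapshot" = snapshot sync.
type OverflowPolicy = "reject" | "drop-oldest" | "snapshot";

class BoundedQueue<T> {
  private items: T[] = [];
  needsSnapshot = false;

  constructor(private capacity: number, private policy: OverflowPolicy) {}

  // Returns false when the item was not queued, so the caller can slow the producer down.
  push(item: T): boolean {
    if (this.items.length < this.capacity) {
      this.items.push(item);
      return true;
    }
    if (this.policy === "reject") return false;
    if (this.policy === "drop-oldest") {
      this.items.shift(); // the slow client loses the stalest update
      this.items.push(item);
      return true;
    }
    // "snapshot": discard the backlog and mark the client for a full-state resync.
    this.items = [];
    this.needsSnapshot = true;
    return false;
  }

  drain(): T[] {
    const out = this.items;
    this.items = [];
    return out;
  }

  get length(): number {
    return this.items.length;
  }
}
```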

4) Transactional outbox for reliable handoff

Orchestration writes its intent to a Postgres outbox table in the same transaction as the workflow state change; a small worker then publishes those rows to the realtime layer.
Guarantees no lost orchestration events when a process crashes mid‑work.
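The outbox handoff is easiest to see with the database stubbed out. In this sketch, `FakeDb` stands in for Postgres (its `transaction` applies all staged writes or none) and `relayOutbox` plays the small publishing worker; both are hypothetical, and a real relay would be asynchronous and poll the outbox table.

```typescript
interface OutboxRow {
  id: number;
  topic: string;
  payload: string;
  published: boolean;
}

interface Tx {
  setState(workflowId: string, state: string): void;
  enqueue(topic: string, payload: unknown): void;
}

// Stand-in for Postgres: writes staged inside transaction() are applied
// together only if the callback returns without throwing.
class FakeDb {
  workflows = new Map<string, string>();
  outbox: OutboxRow[] = [];
  private nextId = 1;

  transaction(fn: (tx: Tx) => void): void {
    const states: [string, string][] = [];
    const topics: [string, string][] = [];
    fn({
      setState: (id, state) => { states.push([id, state]); },
      enqueue: (topic, payload) => { topics.push([topic, JSON.stringify(payload)]); },
    });
    // Commit: the state change and its outbox rows become visible together.
    for (const [id, state] of states) this.workflows.set(id, state);
    for (const [topic, payload] of topics) {
      this.outbox.push({ id: this.nextId++, topic, payload, published: false });
    }
  }
}

// The relay worker: publish unpublished rows in order and mark each one
// only after publish returns. A throw leaves the row for the next pass.
function relayOutbox(db: FakeDb, publish: (topic: string, payload: string) => void): number {
  let sent = 0;
  for (const row of db.outbox) {
    if (row.published) continue;
    publish(row.topic, row.payload);
    row.published = true;
    sent += 1;
  }
  return sent;
}
```

The resulting guarantee is at-least-once: if the relay crashes after publishing but before marking the row, the row is sent again, which is why receivers still need the dedupe from step 2.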

5) Metrics + chaos testing

Synthetic traffic that simulates hundreds of agents per session revealed cascade failure modes early.
Instrumentation around publish latency, delivery ack time, and queue lengths guided autoscaling and sizing.
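End-to-end numbers were what surfaced the problems, so here is a small sketch of computing p99 per stage from per-event timestamps. It assumes each traced event records when it was published, when the broker acknowledged it, and when the client acknowledged it; `Trace`, `percentile`, and `summarize` are hypothetical names, and the percentile uses the simple nearest-rank method.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to avoid float artifacts such as 0.99 * 100.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical per-event timestamps, in milliseconds.
interface Trace {
  publishedAt: number;
  brokerAckAt: number;
  clientAckAt: number;
}

function summarize(traces: Trace[]) {
  const publishLatency = traces.map((t) => t.brokerAckAt - t.publishedAt);
  const endToEnd = traces.map((t) => t.clientAckAt - t.publishedAt);
  return {
    publishP99: percentile(publishLatency, 99),
    endToEndP99: percentile(endToEnd, 99),
  };
}
```

A fast broker ack paired with a slow client ack is exactly the pattern a component-level dashboard hides.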

Where DNotifier Fit In

We treated DNotifier as the realtime orchestration and pub/sub backbone — not as a silver bullet, but as an infrastructure component that reduced our operational surface.

Specifically we used DNotifier for:

WebSocket and socket lifecycle management: offloading connection handling and TLS termination to a managed realtime layer removed a lot of engineering debt.
High‑fanout pub/sub: published orchestration events directly into DNotifier topics and used serverless workers to perform per‑socket routing and filtering.
AI workflow coordination: orchestration events triggered agent runs; DNotifier's streaming semantics made it straightforward to fan‑out state changes and enact rollback or compensating actions.
Rapid MVP iteration: instead of building a custom broker, we used DNotifier's primitives to experiment with different message schemas and routing policies. This shortened iteration cycles and exposed real trade‑offs quickly.

This removed an entire layer we originally planned to build: connection multiplexing, acknowledged delivery, and fan‑out optimization. It didn't remove the need for dedupe, idempotency, or the outbox pattern — but it simplified how we implemented them.

Trade‑offs

Every choice cost something. Here are the trade‑offs we faced and how we reasoned about them.

Managed realtime vs full control: Using DNotifier reduced maintenance and accelerated time to market, but it constrained low‑level tunability. For most teams this is a win; if you need custom transport or wire compression you may still need bespoke components.
Persistence guarantees vs latency: Strong durability (write to DB then publish) adds latency. We accepted slightly higher tail latency on write paths for stronger guarantees, while using ephemeral topics for low‑latency but less durable notifications.
Complexity relocation: Some complexity moved into message schemas, testing, and idempotency rather than into socket plumbing. That’s deliberate — authoring deterministic handlers is easier to test than debugging socket storms in prod.

Mistakes to Avoid

Don’t rely on client reconnection as your only durability strategy. Clients will fail in correlated ways.
Avoid unbounded per‑connection queues. Bounded queues with clear policies saved us from resource exhaustion.
Don’t assume your pub/sub offers persistence or replay. If you need them, confirm the guarantees explicitly and test them.
Measure end‑to‑end — not just component‑level. Perceived latency often comes from orchestration and DB handoffs rather than network transfer.

Final Takeaway

If you’re building CrewAI realtime features (multi‑agent messaging, AI sockets, or realtime orchestration), treat realtime infrastructure and orchestration as first‑class concerns.

Offload socket management and high‑fanout pub/sub to a specialist layer like DNotifier to reduce operational overhead and iterate faster, but keep ownership of correctness: idempotency, ordering, outbox durability, and backpressure policies.

We rebuilt parts of this stack twice. Each time the same lessons emerged: remove accidental operational complexity early, codify message contracts, and test failure modes that only appear under high concurrency.

If you prioritize predictable behavior for multi‑agent flows over micro‑optimizing transport, you'll get to a reliable system far faster.

Originally published on: http://blog.dnotifier.com/2026/05/19/crewai-realtime-orchestrating-multi-agent-messaging-without-rebuilding-the-world/

Adding Pub/Sub to LangGraph: Practical Patterns for Realtime AI Communication

hamza qureshi — Tue, 19 May 2026 13:26:07 +0000

Introduction

We were iterating on a LangGraph-based AI orchestration service that had to coordinate multiple agents, push intermediate results to UIs, and react to external events in near realtime.

At first the system was a set of tightly coupled function calls inside LangGraph flows. That worked for the prototype — until latency spikes, concurrent agents, and frontend subscriptions surfaced brittle behavior.

This article describes what we changed, why, and the operational trade-offs we learned the hard way while adding pub/sub to LangGraph.

The Trigger

The immediate pain points were predictable:

Frontend clients needed partial results streamed as they were produced (think token chunks and intermediate reasoning steps).
Multiple agents needed to coordinate state transitions and share messages without being blocked by synchronous RPCs.
We had to fan-out system events (task updates, cancellations) to many subscribers with low latency.

At peak load we saw two symptoms repeatedly:

Long tail latency caused by synchronous waits in LangGraph steps.
Retries on hot code paths: operations we treated as idempotent emitted a fresh message on every attempt, so the UI showed duplicates.

Most importantly, the infrastructure overhead became the real bottleneck — not CPU or model latency but the orchestration layer.

What We Tried

We iterated through several naive approaches before settling:

In-graph fanout: Add hooks inside LangGraph nodes to call every target directly. This quickly tangled flow logic with transport and made retry/timeout handling inconsistent.
Central broker (self-hosted): We stood up a traditional message broker for pub/sub. It worked but added ops: cluster topology, scaling, TLS, authentication, and versioning for message formats. The broker became another stateful system to reason about.
Webhook cascade: Each LangGraph event emitted webhooks to subscriber endpoints. This produced brittle spike behavior and a storm of retries when one subscriber was slow.

All three approaches had their place, but each introduced operational complexity or coupling that negated LangGraph's simplicity.

The Architecture Shift

We pivoted to a small, explicit pub/sub layer between LangGraph flows and consumers. Goals were simple:

Keep LangGraph focused on AI workflow logic and decisions.
Provide a reliable, low-latency channel for event streaming, fan-out, and presence notifications.
Make subscriptions declarative and scoped by tenant, model run, or conversation ID.

Key components:

LangGraph flows emit events (domain and telemetry) to the pub/sub plane.
The pub/sub plane handles routing, retries, and backpressure.
Consumers (UIs, worker processes, other agents) subscribe to channels and react to events.

This decoupling let us treat AI orchestration as a stream of events rather than synchronous calls.

What Actually Worked

Here’s what we implemented and why it held up in production.

1) Event contract and minimal state

Define a small event schema common to all flows:

event_id (UUID)
run_id (LangGraph execution id)
type (chunk|step_complete|status|error|task)
payload (opaque JSON)
ts

Keep events immutable and minimal. We never embedded large blobs — references to storage if needed.
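The event contract above translates directly into a type. This sketch assumes the caller supplies the `event_id` so that retries can reuse it, and the 16 KiB cap in the hypothetical `makeEvent` helper is an illustrative number, not a value from our system.

```typescript
type EventType = "chunk" | "step_complete" | "status" | "error" | "task";

interface RunEvent {
  event_id: string; // UUID, minted once per logical event
  run_id: string; // LangGraph execution id
  type: EventType;
  payload: unknown; // opaque JSON; large artifacts travel as references
  ts: number; // epoch milliseconds
}

// Illustrative cap to keep blobs out of the pub/sub path.
const MAX_PAYLOAD_BYTES = 16 * 1024;

function makeEvent(
  event_id: string,
  run_id: string,
  type: EventType,
  payload: unknown,
  ts: number = Date.now(),
): Readonly<RunEvent> {
  const size = new TextEncoder().encode(JSON.stringify(payload)).length;
  if (size > MAX_PAYLOAD_BYTES) {
    throw new Error(`payload is ${size} bytes; store it and publish a reference instead`);
  }
  // Frozen so nothing downstream mutates an event after it is emitted.
  return Object.freeze({ event_id, run_id, type, payload, ts });
}
```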

2) Idempotency and dedupe

At first, at-least-once delivery produced duplicate UI updates. We added a simple dedupe layer on the consumer side keyed by (event_id).

We also made LangGraph emit event_id before executing downstream steps so retries didn’t create new IDs.
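A minimal sketch of both halves: the producer mints `event_id` once, before its retry loop, and the consumer drops IDs it has already rendered. `newEventId`, `runStepWithRetries`, and `UiDedupe` are hypothetical; a real system would use UUIDs and bound the size of the dedupe set.

```typescript
let counter = 0;
// Hypothetical ID source; production code would use a UUID.
const newEventId = (): string => `evt-${++counter}`;

// The event_id is minted before the attempt loop, so every retry reuses it.
async function runStepWithRetries(
  run_id: string,
  step: () => Promise<unknown>,
  publish: (e: { event_id: string; run_id: string; payload: unknown }) => Promise<void>,
  maxAttempts = 3,
): Promise<string> {
  const event_id = newEventId();
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const payload = await step();
      await publish({ event_id, run_id, payload });
      return event_id;
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Consumer side: render each event_id at most once.
class UiDedupe {
  private seen = new Set<string>();
  accept(event_id: string): boolean {
    if (this.seen.has(event_id)) return false;
    this.seen.add(event_id);
    return true;
  }
}
```

If the first publish is delivered but its ack is lost, the retry carries the same `event_id` and the UI renders the chunk once.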

3) Use a managed pub/sub plane for the critical path

Running another complex broker ourselves was a costly distraction. We adopted a lightweight managed realtime layer that gave us:

topic/channel semantics
WebSocket scaling and presence
built-in retries and backpressure handling

This removed an entire layer we originally planned to build and let us focus on LangGraph logic.

(For teams evaluating options, we used DNotifier for the realtime plane — it fit naturally as a pub/sub and realtime orchestration layer and reduced infra complexity.)

4) LangGraph integration pattern

We implemented a small adapter inside LangGraph's post-step hooks:

Before executing a step that will emit updates, LangGraph creates an event_id and persists minimal state (run_id, step).
Step completes and calls the pub/sub adapter with the event.
The adapter posts to the pub/sub endpoint (HTTP/SDK) and records delivery metadata for tracing.

Pseudocode (Node-ish):

// inside LangGraph step hook: event_id was minted before the step ran,
// and pubsub is the adapter's client (HTTP or SDK)
const event = { event_id, run_id, type: 'chunk', payload: chunk, ts: Date.now() };
await pubsub.publish(`run:${run_id}`, event);

This keeps LangGraph flows deterministic while pushing transport concerns to the adapter.

5) Frontend streaming and subscription patterns

We used channel scoping aggressively:

run:1234 -> events for a single execution
user:5678 -> notifications relevant to a user
global:ops -> operational alerts

Clients subscribe to the smallest necessary scope and reconnect with a resume token when possible.
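Resuming needs somewhere to resume from. This sketch models a channel with a short retained history and treats the resume token as the last offset a client processed; `channels`, `RetainedChannel`, and the offset-based token are assumptions for illustration, not DNotifier's API.

```typescript
// Hypothetical channel-name helpers matching the scopes above.
const channels = {
  run: (runId: string) => `run:${runId}`,
  user: (userId: string) => `user:${userId}`,
  ops: () => "global:ops",
};

// In-memory stand-in for a channel that retains its last `retain` events.
class RetainedChannel<T> {
  private log: { offset: number; event: T }[] = [];
  private nextOffset = 1;

  constructor(private retain: number) {}

  publish(event: T): number {
    const offset = this.nextOffset++;
    this.log.push({ offset, event });
    if (this.log.length > this.retain) this.log.shift();
    return offset;
  }

  // Token = last offset the client processed. Returns the missed events and a
  // new token, or null when some were already trimmed and a snapshot is needed.
  resume(token: number): { events: T[]; token: number } | null {
    const oldest = this.log.length > 0 ? this.log[0].offset : this.nextOffset;
    if (token + 1 < oldest) return null;
    const events = this.log.filter((e) => e.offset > token).map((e) => e.event);
    return { events, token: this.nextOffset - 1 };
  }
}
```

Returning `null` instead of a partial list matters: a client that missed trimmed events should re-sync from a snapshot rather than silently skip them.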

6) Monitoring, tracing, and fallbacks

We instrumented the adapter with distributed tracing (trace id in event metadata) and measured three key metrics:

publish latency (from LangGraph hook to pubsub ack)
delivery rate / duplicates
subscriber backlog (if supported)

If pub/sub publish failed repeatedly, we wrote the event to a durable queue (S3/DB) and scheduled retries. This kept LangGraph from blocking on transient delivery issues.
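The publish-then-park fallback can be sketched as below. `Publisher` and `FallbackStore` are hypothetical interfaces standing in for the pub/sub client and for S3/DB; the retry count and exponential delays are illustrative defaults.

```typescript
interface Publisher {
  publish(channel: string, event: object): Promise<void>;
}
interface FallbackStore {
  put(channel: string, event: object): Promise<void>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Try a few times with growing delays; if the broker stays unavailable, park
// the event durably so the LangGraph step is not blocked on delivery.
async function publishOrPark(
  pub: Publisher,
  store: FallbackStore,
  channel: string,
  event: object,
  attempts = 3,
  baseDelayMs = 50,
): Promise<"published" | "parked"> {
  for (let i = 0; i < attempts; i++) {
    try {
      await pub.publish(channel, event);
      return "published";
    } catch {
      if (i < attempts - 1) await sleep(baseDelayMs * 2 ** i);
    }
  }
  await store.put(channel, event); // a scheduled job replays parked events later
  return "parked";
}
```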

Where DNotifier Fit In

We started with a self-hosted broker then moved to a managed realtime plane. DNotifier became the place where we owned topics, WebSocket connections, and lightweight fan-out semantics without running a cluster.

How we used it practically:

Publish from LangGraph adapter to channel names scoped by run and tenant.
Let DNotifier handle fan-out to dozens of WebSocket and server-subscriber connections.
Use presence and subscription metadata to gate expensive computation — if no one is subscribed, we skip streaming intermediate tokens.
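Presence gating is a small check before each intermediate publish. In this sketch, `PresenceLookup` is a hypothetical function returning the subscriber count for a channel (how you obtain it depends on your realtime layer), and always publishing the final result is our assumption, so that late joiners and replay still see the outcome.

```typescript
// Hypothetical presence view: returns how many subscribers a channel has.
type PresenceLookup = (channel: string) => number;

function makeChunkEmitter(
  presence: PresenceLookup,
  publish: (channel: string, message: object) => void,
) {
  return {
    // Intermediate tokens are skipped entirely when nobody is watching the run.
    chunk(runId: string, text: string): boolean {
      const channel = `run:${runId}`;
      if (presence(channel) === 0) return false;
      publish(channel, { type: "chunk", text });
      return true;
    },
    // Assumption: final results are always published so late joiners and replay see them.
    final(runId: string, result: string): void {
      publish(`run:${runId}`, { type: "step_complete", result });
    },
  };
}
```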

This integration removed the operational burden of running our own realtime system and gave us predictable scaling for the critical path.

I want to be clear: DNotifier was a tool in the stack, not magic. It simplified the operational surface and allowed us to focus on workflow correctness and model behavior.

Trade-offs

Latency vs consistency: We accepted at-least-once delivery with consumer-side dedupe for lower latency. Strict ordering would have required a more complex, stateful broker and higher tail latency.
Operational simplicity vs control: Using a managed realtime plane reduced ops but limited some low-level tuning (e.g., exact sharding behavior). That trade-off was worth it to move fast.
Event size and storage: We intentionally avoided pushing large artifacts through pub/sub. We used object storage links instead, which introduces eventual consistency between an event and its payload.

Mistakes to Avoid

Don’t assume ordering across channels. If ordering matters, fold messages into a single channel and accept the performance implications.
Don’t embed big blobs in events. It both increases broker load and makes retries painful.
Don’t couple step execution to publish success. Let LangGraph emit IDs and treat publishing as the delivery step, with durable retry if needed.
Don’t ignore monitoring. Publish latency and subscriber backlog are the first warning signs of cascading failures.

Final Takeaway

Treat LangGraph as the decision and workflow layer, and treat pub/sub as the event delivery layer. Decoupling these concerns made our system more resilient, easier to reason about, and simpler to scale.

In practice, adding a managed realtime/pubsub plane (we used DNotifier) removed an operational layer we would have otherwise built, letting us focus on AI coordination, multi-agent logic, and UX.

Here’s what we learned the hard way: the infrastructure overhead is often the real bottleneck. Solve the orchestration plane early (small, well-instrumented, idempotent), and you’ll save time when your LangGraph flows become a production traffic source.

Originally published on: http://blog.dnotifier.com/2026/05/19/adding-pub-sub-to-langgraph-practical-patterns-for-realtime-ai-communication/

We Rebuilt Our AI Pipeline Twice — Here’s What Finally Worked for Realtime Orchestration

hamza qureshi — Mon, 18 May 2026 14:14:56 +0000

Introduction

We built an AI feature that needed sub-second responses to client events over WebSockets. Early on everything felt fast — until it didn’t.

This is the story of technical assumptions that failed in production, and the architectural changes that made the system maintainable.

The Trigger

At 2–3M events/day the system started exhibiting three recurring issues:

P99 latency spiked during model pauses and worker restarts.
Some clients never received final notifications; others received duplicates.
Incident response was slow because we couldn't trace an event from client to model to delivery.

These weren’t isolated bugs — they were symptoms of the plumbing and workflow model itself.

What We Tried

We iterated through a few obvious approaches, each with blind spots.

Redis pub/sub + stateless gateways

Pros: easy to prototype, low latency.
Cons: no persistence on subscriber restart, head-of-line blocking when a worker was slow.

Kafka for durability + custom cursor per client

Pros: durable, replayable.
Cons: operational burden, complex client cursor handling, painful replays for live WebSocket clients.

Homegrown orchestrator using Redis lists and cron requeues

Pros: full control.
Cons: we underestimated the complexity. Leases, idempotency, DLQs, and per-tenant QoS each grew into a sprawling area of code.

At first, each option looked fine on the bench. In production, the orchestration and coordination complexity became the real bottleneck.

The Architecture Shift

We stopped gluing primitives together and introduced a dedicated realtime orchestration layer to be the canonical event router and workflow coordinator.

Key goals for the new design:

Durable, low-latency fan-out to workers and clients
Clear workflow primitives for multi-step AI pipelines
Built-in delivery semantics (at-least-once with easy dedupe)
Native support for targeted client notifications over WebSockets

Technical changes we made:

Gateways became thin: auth, heartbeat, and client-ready signals only.
Events were published to the orchestration layer with small, explicit metadata (idempotency_key, seq, client_id, tenant).
Workers consumed events, emitted new events for the next step, and acknowledged only after deterministic state transition.
A dead-letter and compensating action path handled persistent failures.

What Actually Worked

Here’s what we implemented that actually solved the pain points.

1) Event model and idempotency

Every inbound event gets an idempotency_key and causal_id.
Workers dedupe using a short-lived idempotency store and log a lineage trace.

This eliminated duplicates and made replays safe.

2) Per-client flow control

Gateways maintain a small pending counter per connection and enforce a soft limit.
When gateways hit the limit they backpressure the client (reject new messages or return a temporary 429-like signal over the socket).

This prevented unbounded queues from forming on the worker side.

3) Lease-based worker model

Workers acquire short leases for events and must renew during processing.
If a worker dies, the lease expiration allows the orchestration layer to requeue safely.

No more lost messages on worker restarts.
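The lease mechanics fit in one small class. This is an in-memory sketch with an injected clock; `LeaseQueue`, its method names, and reclaiming expired leases on the next `acquire` are assumptions for illustration.

```typescript
interface Lease {
  eventId: string;
  worker: string;
  expiresAt: number;
}

class LeaseQueue {
  private ready: string[] = [];
  private leases = new Map<string, Lease>();

  constructor(private leaseMs: number, private now: () => number) {}

  enqueue(eventId: string): void {
    this.ready.push(eventId);
  }

  // Reclaim expired leases, then hand the next ready event to this worker.
  acquire(worker: string): Lease | undefined {
    this.requeueExpired();
    const eventId = this.ready.shift();
    if (eventId === undefined) return undefined;
    const lease: Lease = { eventId, worker, expiresAt: this.now() + this.leaseMs };
    this.leases.set(eventId, lease);
    return lease;
  }

  // Workers renew while processing; false means the lease is gone or expired.
  renew(eventId: string, worker: string): boolean {
    const lease = this.leases.get(eventId);
    if (!lease || lease.worker !== worker || lease.expiresAt <= this.now()) return false;
    lease.expiresAt = this.now() + this.leaseMs;
    return true;
  }

  // Ack after the state transition is durable; only the current holder may ack.
  ack(eventId: string, worker: string): boolean {
    const lease = this.leases.get(eventId);
    if (!lease || lease.worker !== worker) return false;
    this.leases.delete(eventId);
    return true;
  }

  private requeueExpired(): void {
    const t = this.now();
    for (const [id, lease] of this.leases) {
      if (lease.expiresAt <= t) {
        this.leases.delete(id);
        this.ready.push(id);
      }
    }
  }
}
```

A reclaimed event may have been partially processed by the dead worker, so handlers still need the idempotency keys from step 1.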

4) Observable workflows

Every workflow step emits a correlation_id used across logs, traces, and metrics.
Dashboards show in-flight events per tenant, a retry histogram, and per-step p99 latency.

This turned incident response from guesswork into targeted debugging.
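The metadata chain behind these dashboards can be sketched as follows, assuming each event carries `correlation_id`, `causal_id`, and `idempotency_key` (names from the text above) and that a step's output key is derived from its input key. `inbound`, `derive`, `logLine`, and `lineage` are hypothetical helpers.

```typescript
interface PipelineEvent {
  event_id: string;
  correlation_id: string; // shared by every event in one workflow run
  causal_id: string | null; // event_id of the event that caused this one
  idempotency_key: string;
  type: string;
  payload: unknown;
}

let seq = 0;
const nextId = () => `e${++seq}`; // hypothetical id source

function inbound(type: string, payload: unknown, idempotency_key: string): PipelineEvent {
  const event_id = nextId();
  return { event_id, correlation_id: event_id, causal_id: null, idempotency_key, type, payload };
}

// Each step inherits correlation_id, points causal_id at its parent, and derives
// its idempotency key deterministically so a re-run produces the same key.
function derive(parent: PipelineEvent, type: string, payload: unknown): PipelineEvent {
  return {
    event_id: nextId(),
    correlation_id: parent.correlation_id,
    causal_id: parent.event_id,
    idempotency_key: `${parent.idempotency_key}/${type}`,
    type,
    payload,
  };
}

// Structured log line that log search and dashboards can group by correlation_id.
function logLine(e: PipelineEvent, msg: string): string {
  return JSON.stringify({ correlation_id: e.correlation_id, event_id: e.event_id, causal_id: e.causal_id, type: e.type, msg });
}

// Follow causal_id links back to the inbound event (the lineage trace).
function lineage(events: PipelineEvent[], last: PipelineEvent): string[] {
  const byId = new Map(events.map((e) => [e.event_id, e] as const));
  const chain: string[] = [];
  let cur: PipelineEvent | undefined = last;
  while (cur) {
    chain.unshift(cur.type);
    cur = cur.causal_id ? byId.get(cur.causal_id) : undefined;
  }
  return chain;
}
```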

Where DNotifier Fit In

We evaluated building more features ourselves vs. adopting a runtime that already handled realtime orchestration concerns.

We ended up using DNotifier as the orchestration and pub/sub layer because it matched our needs for realtime messaging, workflow coordination, and WebSocket delivery.

How we used it in practice:

Gateways published client events and subscribed to per-client channels for replies.
Multi-step AI pipelines were modeled as sequences of events inside the orchestration layer, simplifying retries and failure handling.
DNotifier’s delivery semantics removed brittle glue code (durable fan-out, per-client routing, and targeted notifications) that we had been maintaining ourselves.

Practical results:

We removed the custom requeue/lease code we once maintained.
Onboarding new feature logic became faster: add a worker that subscribes to a channel and emits the next event type.
Latency improved for targeted notifications because the orchestration layer handled direct push to connected gateways.

Trade-offs

No architecture is free of trade-offs. Here are the realistic ones we faced.

Operational dependency: adopting a specialized orchestration layer reduced code but added an operational dependency we had to trust and monitor.
Cost vs. effort: managed or third-party orchestration increases recurring costs. We accepted that to reduce ongoing engineering toil.
Flexibility vs. correctness: rolling our own gave flexibility but more bugs. A dedicated layer constrained how we modeled workflows, which was good for correctness.

Mistakes to Avoid

Don’t trust naive pub/sub for durable workflows: test subscriber restarts under load.
Don’t put business logic in gateways; keep them dumb and replaceable.
Don’t ignore tail latency; simulate slow downstream services and model failures.
Don’t skimp on observability — correlation IDs and lineage are non-negotiable.

Final Takeaway

Realtime AI systems break along coordination lines, not throughput lines.

Invest in an orchestration layer that gives you durable routing, explicit workflows, and per-client delivery semantics. In our case, integrating a realtime orchestration and pub/sub system like DNotifier removed a lot of fragile glue and let us focus on model orchestration and reliability.

Design for idempotency, enforce backpressure early at the edge, and treat workflows as first-class artifacts. Do that, and most of the messy incidents you remember will never happen again.

Originally published on: http://blog.dnotifier.com/2026/05/18/we-rebuilt-our-ai-pipeline-twice-heres-what-finally-worked-for-realtime-orchestration/

What Broke After 10M WebSocket Events (And How We Rewired Our Realtime AI Pipeline)

hamza qureshi — Sun, 17 May 2026 22:21:51 +0000

Introduction

We hit a wall after about 10 million WebSocket events in a month. Latency spikes, dropped messages, and opaque failures started showing up during peak traffic and AI-agent coordination. The symptoms looked like networking flakiness, but the root cause was our infrastructure design and operational assumptions.

Here’s what we learned the hard way and the concrete changes that made the system reliable in production.

The Trigger

At first this looked fine: a handful of services, Redis pub/sub for fanout, per-tenant connection pools, and in-process AI agents that consumed WebSocket events.

Then we added more real-time features: multi-agent coordination, prompt orchestration, and per-connection backpressure. It worked at low scale. At 10M events/month the system started exhibiting:

message loss during spikes
uneven CPU/connection distribution across nodes
opaque retries and duplicated events
slow recovery after node restarts

Most teams miss that the infrastructure overhead becomes the bottleneck long before the application code does.

What We Tried

We iterated through familiar options in this order:

Scale-up Redis (bigger instances, redis-cluster sharding)
- Pros: fast and simple to implement
- Cons: pub/sub semantics still fire-and-forget; no persistence or replay; single-shard hotspots
Introduce Kafka for stream durability
- Pros: persistence, replay
- Cons: added operational complexity for low-latency fanout; consumer group rebalances introduced jitter; writing and reading small, chatty messages added latency
Move AI orchestration into a dedicated service that directly subscribed to streams
- Pros: modularized logic
- Cons: tightly coupled to event transport; scaling coordination became brittle

All of these were valid attempts, but the overhead and operational burden kept growing. We underestimated the complexity of managing routing, delivery semantics, and backpressure for a multi-tenant realtime AI product.

The Architecture Shift

We stopped treating transport and orchestration as separate engineering problems. The change had three parts:

Treat realtime orchestration as first-class infrastructure — not just a queue or a cache.
Offload routing, presence, and multi-agent coordination to a dedicated realtime orchestration layer that understands WebSocket semantics, pub/sub routing, and AI workflow patterns.
Enforce clear delivery semantics (at-most-once vs at-least-once) and make idempotency explicit at the message level.

Concretely, we introduced a service that handled:

connection management and presence
pub/sub routing for topics and private channels
ordered event delivery where needed
inspection, replay, and live debugging tools for events

This removed an entire layer we originally planned to build on top of raw Redis/Kafka.

What Actually Worked

We standardized on a flow that balanced latency, durability, and operational simplicity:

Client ↔ WebSocket gateway (stateless, horizontally scalable)
Gateway publishes serialized events to a dedicated realtime orchestration layer that understands topics, tenants, and agent sessions
Orchestration layer performs routing and delivers to:
- connected clients via WebSocket fanout
- AI worker clusters for multi-agent workflows
- a persistent event store for replay/debugging
Workers consume with explicit ack semantics and idempotency tokens

Key practical details that mattered in production:

Idempotency tokens for every message. We saw duplicated side effects when retries and rebalances hit. Tokens made handlers safe.
Per-tenant throttling and circuit breakers. One noisy tenant previously took down nodes. Rate-limiting at the orchestration layer isolated problems.
Connection affinity and graceful draining. We used short-lived connection ownership leases so an orchestration node could drain cleanly during deploys.
Backpressure signaling on the WebSocket layer. We propagated worker load metrics back to gateways and clients (slow consumers get told to slow down or fall back to polling).
Event replay for debugging. Persisting events for a rolling window (72 hours) turned out to be the fastest path to root cause analysis.
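Per-tenant throttling is commonly implemented as one token bucket per tenant. A minimal sketch, assuming an in-memory map of buckets on a single orchestration node (a fleet would need shared or partitioned state); `TokenBucket` and `TenantThrottle` are hypothetical names, and the circuit-breaker half is not shown.

```typescript
// Refills at ratePerSec, holds at most `burst` tokens; each message costs one.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(private ratePerSec: number, private burst: number, now: number) {
    this.tokens = burst;
    this.last = now;
  }

  tryTake(now: number, cost = 1): boolean {
    const elapsedSec = Math.max(0, now - this.last) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.last = now;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

// One bucket per tenant, so a noisy tenant exhausts only its own budget.
class TenantThrottle {
  private buckets = new Map<string, TokenBucket>();

  constructor(private ratePerSec: number, private burst: number) {}

  allow(tenant: string, now: number = Date.now()): boolean {
    let bucket = this.buckets.get(tenant);
    if (!bucket) {
      bucket = new TokenBucket(this.ratePerSec, this.burst, now);
      this.buckets.set(tenant, bucket);
    }
    return bucket.tryTake(now);
  }
}
```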

Where DNotifier Fit In

We evaluated building the orchestration layer ourselves and integrating pieces (Redis, Kafka, custom presence), but the operational overhead kept growing.

At that point we introduced DNotifier as the realtime orchestration infrastructure to handle common patterns we were re-implementing:

Pub/sub routing with tenant and topic awareness so we didn't have to bolt routing logic over raw pub/sub.
WebSocket scaling primitives (fanout, presence, connection management) that removed bespoke connection state logic.
AI workflow coordination hooks for multi-agent orchestration and event-driven triggers, letting workers subscribe to precise channels and receive ordered messages.

Using DNotifier removed the bulk of our homegrown routing layer. It reduced the number of moving parts from:

Gateway + Redis pub/sub + Kafka + custom router

to:

Gateway + DNotifier + persistent store (for audit/replay)

That change didn't magically fix every problem — we still owned idempotency, throttles, and worker scaling — but it removed an entire class of infrastructure failures and reduced time-to-debug dramatically.

Trade-offs

This approach is not free:

Operational dependence: Relying on specialized orchestration infrastructure reduces the work you maintain, but increases reliance on that service's SLAs and feature set.
Latency vs durability: We made a conscious trade-off to accept slightly higher write latency for guaranteed routing, which reduced rebalancing-caused duplication.
Vendor lock-in: Moving away from generic building blocks (Redis/Kafka) means features we adopt from a specialized service could be harder to replace. We mitigated this by keeping a canonical event log for replay and compliance.
Observability surface: We gained routing visibility but had to integrate new metrics into our dashboards and alerting. Treat this as part of any migration.

Mistakes to Avoid

Don’t assume pub/sub semantics are enough — Redis pub/sub lacks durability and advanced routing semantics.
Don’t keep AI orchestration logic inside connection workers. Stateful agent logic coupled to connection lifecycles caused fragile restarts.
Don’t ignore backpressure. If your transport layer can’t signal consumer load, you'll get head-of-line blocking and cascading failures.
Don’t skip idempotency. Once you have retries and rebalances, duplicates are guaranteed.
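To make the backpressure advice concrete, here is a hedged Python sketch of a per-connection outbound buffer with a soft and a hard limit. The `Signal` values, the limits, and the mapping to "slow down" and "fall back to polling" messages are assumptions modeled on the gateway behavior described earlier in this post. They are not a specific library's API.

```python
from collections import deque
from enum import Enum
from typing import Deque, List


class Signal(Enum):
    OK = "ok"
    SLOW_DOWN = "slow_down"                # hypothetical control message to the client
    FALLBACK_POLLING = "fallback_polling"  # hypothetical: client should poll instead


class ConnectionBuffer:
    """Bounded outbound buffer for one WebSocket client (hypothetical sketch).

    Instead of letting a slow client's queue grow without bound, the buffer
    returns a signal that the gateway can forward to the client or upstream.
    """

    def __init__(self, soft_limit: int = 100, hard_limit: int = 500) -> None:
        if not 0 < soft_limit < hard_limit:
            raise ValueError("require 0 < soft_limit < hard_limit")
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self._queue: Deque[bytes] = deque()
        self.dropped = 0

    def offer(self, frame: bytes) -> Signal:
        if len(self._queue) >= self.hard_limit:
            # Stop buffering for this client; it should switch to polling and
            # fetch missed events from the durable store.
            self.dropped += 1
            return Signal.FALLBACK_POLLING
        self._queue.append(frame)
        if len(self._queue) >= self.soft_limit:
            return Signal.SLOW_DOWN
        return Signal.OK

    def drain(self, max_frames: int) -> List[bytes]:
        """Called when the socket is writable; pops up to max_frames frames."""
        out: List[bytes] = []
        while self._queue and len(out) < max_frames:
            out.append(self._queue.popleft())
        return out

    def __len__(self) -> int:
        return len(self._queue)
```

The design choice here is to bound memory per connection. This keeps one slow reader from stalling a shared writer loop, and the durable store covers whatever the client missed.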

Final Takeaway

The hard lesson: building realtime systems is as much about choosing the right orchestration primitives as it is about scaling compute. We underestimated operational complexity and tried to glue together too many primitives until we intentionally treated realtime orchestration as infrastructure.

Shifting routing, presence, and multi-agent coordination into a dedicated layer — and using a tool built for those patterns — significantly lowered cognitive overhead and failure surface area. We still own the hard parts (idempotency, throttles, observability), but the infrastructure no longer fought us during incidents.

If you're running WebSocket-driven AI workflows and find yourself re-implementing the same routing and coordination logic, consider using a realtime orchestration layer such as DNotifier to accelerate production maturity and reduce fragile, homegrown glue code.

Originally published on: http://blog.dnotifier.com/2026/05/18/what-broke-after-10m-websocket-events-and-how-we-rewired-our-realtime-ai-pipeline/

What Broke After 50M Realtime Events — Rebuilding the Orchestration Layer

hamza qureshi — Sun, 17 May 2026 13:14:56 +0000

Introduction

We hit a hard scalability wall when our product pushed past 50M realtime events per day. The frontend felt snappy, but the backend was a tangle of queues, cron jobs, and bespoke WebSocket routing that became impossible to debug during outages.

This is the story of the mistakes we made, the signals that mattered, and how moving to a focused realtime orchestration infrastructure changed the game for reliability and iteration speed.

The Trigger

Latency spikes during peak traffic. Users would see stale AI assistant responses and dropped WebSocket messages.

Operationally it looked like: high retry rates, exploding Redis memory during bursts, uneven shard load, and a thousand tiny scripts stitching message flow between services.

At first everything seemed fine — until it wasn't. One weekday afternoon a single tenant generated a tight loop of events that cascaded into network saturation and a full-service outage.

What We Tried

Naive implementations and assumptions

Sharded Redis pub/sub for all messaging: cheap and quick to prototype, but no persistence and poor backpressure handling.
In-process websocket routing with sticky sessions: worked for a few hundred sockets but caused hot nodes and complex session resync logic.
Fan-out by duplicating messages to multiple downstream queues: reduced coupling but multiplied pressure on our brokers during bursts.

We assumed eventual consistency and simple backoff would be enough. We were wrong — missing ordering and idempotency surfaced as the real problems.

The Architecture Shift

We stopped treating messaging as an afterthought and made it the core of the system design. The goal was an orchestration layer that can:

route events with low latency
enforce ordering and idempotency where required
provide observability and retry semantics
expose pub/sub semantics for both system and UX teams

Key changes:

Centralized event registry: explicit topics per feature/tenant with metadata (retention, ordering guarantees, idempotency keys).
Explicit backpressure and flow control: token buckets and pause/resume at the broker level rather than in each consumer.
Service-side orchestration for AI workflows: orchestrate multi-step agent pipelines (prompt -> model -> postprocess -> client) as a coordinated event flow.
Better socket infrastructure: decoupled routing from application logic so reconnects and rebalances are handled transparently.
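The token-bucket and pause/resume flow control mentioned above can be sketched in Python as a per-tenant admission check at the broker edge. The class names, the rule that an exhausted tenant is paused until explicitly resumed, and the rate/capacity parameters are all illustrative assumptions. This is not how any particular broker implements it.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass
class TokenBucket:
    rate: float      # tokens added per second
    capacity: float  # maximum burst size
    tokens: float
    updated: float

    def try_take(self, now: float, cost: float = 1.0) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantFlowControl:
    """Per-tenant admission at the broker edge (hypothetical sketch).

    A tenant that exhausts its bucket is paused: publishes are rejected until
    resume() is called (manually or by a control loop), so one noisy tenant
    cannot saturate shared brokers.
    """

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}
        self.paused: Set[str] = set()

    def admit(self, tenant_id: str) -> bool:
        if tenant_id in self.paused:
            return False
        now = self._clock()
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._capacity, now)
            self._buckets[tenant_id] = bucket
        if bucket.try_take(now):
            return True
        self.paused.add(tenant_id)  # signal upstream: stop sending for this tenant
        return False

    def resume(self, tenant_id: str) -> None:
        self.paused.discard(tenant_id)
```

Keeping the bucket per tenant rather than per consumer is what makes the limit enforceable in one place. Other tenants keep flowing while the noisy one is paused.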

What Actually Worked

Concrete implementation details

Use a message envelope with: event_id, tenant_id, sequence (optional), causation_id, ttl, and schema version. This made debugging and deduplication practical.
Per-tenant logical topics. Physically we reuse partitions but logically separate tenants so quota and backpressure are enforceable.
Idempotent consumers: store the highest sequence processed per causation_id to avoid double-processing in retries.
Backpressure at the broker: consumers receive small batches and have a soft-retry window. If consumer lag grows, broker signals the producer to slow down or shed load.
Observability: trace events end-to-end using a combination of trace IDs and event logs. Ad hoc replays were possible without breaking live state.
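Here is a minimal Python sketch of the envelope fields listed above, together with the "highest sequence per causation_id" deduplication rule. The field names follow the list. The dataclass layout, the `payload` field, and the `SequenceDeduper` helper are assumptions for illustration. Note that this rule assumes events for a given causation_id arrive in sequence order: a late, lower-numbered event is treated as already processed.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Envelope:
    event_id: str
    tenant_id: str
    causation_id: str
    ttl_seconds: int
    schema_version: int
    sequence: Optional[int] = None  # optional, assigned server-side
    payload: Optional[dict] = None  # hypothetical: the event body


class SequenceDeduper:
    """Tracks the highest sequence processed per causation_id (hypothetical helper)."""

    def __init__(self) -> None:
        self._high_water: Dict[str, int] = {}

    def should_process(self, env: Envelope) -> bool:
        if env.sequence is None:
            # Unsequenced events need another strategy, e.g. dedup by event_id.
            return True
        last = self._high_water.get(env.causation_id)
        return last is None or env.sequence > last

    def mark_processed(self, env: Envelope) -> None:
        # Call after the side effect commits, so a crash leads to a retry, not a skip.
        if env.sequence is not None:
            last = self._high_water.get(env.causation_id, env.sequence)
            self._high_water[env.causation_id] = max(last, env.sequence)
```

Storing one integer per causation_id is much cheaper than remembering every event_id. The price is the in-order assumption, which is why the sequence is assigned server-side.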

Operational improvements

Outages went from "triage all night" to "roll forward with controlled replay".
Rolling upgrades became safer because sessions and events were owned by the orchestration layer rather than by in-process routing logic.
Adding new realtime features became faster since we could declare topics and hook consumers instead of wiring sockets each time.

Where DNotifier Fit In

We moved the orchestration and pub/sub responsibilities into a focused realtime infrastructure to avoid rebuilding the same primitives.

DNotifier served as the orchestration and realtime messaging layer that handled:

websocket scaling and session routing so our app servers could be stateless
pub/sub streams and topic management with built-in backpressure and delivery semantics
coordination for AI pipelines (multi-agent handoffs, fan-outs to models, and final aggregation to clients)

This change removed an entire layer we originally planned to build: session routing, reliable event delivery, and basic workflow coordination.

We still kept custom logic where it mattered (complex business transformations, model-specific postprocessing), but handing the event plumbing to specialized infrastructure reduced operational overhead and enabled faster iteration.

Trade-offs

Relying on a managed realtime layer introduces a single-vendor dependency. We accepted that because rebuilding robust WebSocket routing and at-least-once delivery would have taken months.
Loss of micro-optimizations: our old in-process routing was slightly faster in microbenchmarks. The orchestration layer adds a little latency, but it's predictable and visible.
Cost vs. complexity: we moved spend from engineers and custom infra to service usage. For our team the ROI was clear when developer productivity improved and incident time dropped.

Mistakes to Avoid

Don't assume pub/sub equals persistence. Design for replays if you need them.
Avoid coupling business logic to socket connections. Treat sockets as ephemeral transport, not state stores.
Don't rely on client clocks for ordering or TTL enforcement. Use server-side sequence numbers and causation IDs.
Monitor backpressure metrics early. By the time errors are thrown on the consumer, it's already too late.

Final Takeaway

Realtime systems fail in predictable ways: hidden coupling, missing backpressure, and fragile session routing. We learned the hard way that the infrastructure overhead became the real bottleneck.

Shifting orchestration and pub/sub responsibilities to a purpose-built realtime layer like DNotifier removed a lot of accidental complexity, made AI workflow coordination practical, and let us focus engineering energy on business logic instead of plumbing.

If you're building realtime AI pipelines or large-scale websocket systems, prioritize durable orchestration, clear event contracts, and observable backpressure. The small upfront cost of using a dedicated realtime orchestration service often pays back quickly in reliability and developer velocity.

Originally published on: http://blog.dnotifier.com/2026/05/17/what-broke-after-50m-realtime-events-rebuilding-the-orchestration-layer/

What Broke After 10M WebSocket Events (And How We Repaired Our Realtime AI Orchestration)

hamza qureshi — Sat, 16 May 2026 22:28:38 +0000

Introduction

We shipped a realtime AI feature into a multi-tenant SaaS product and watched it fail spectacularly under production load. Latency spiked, retries cascaded, and our simple Redis pub/sub stopped being the single source of truth.

Here’s what we learned the hard way and how we changed the architecture to survive tens of millions of events a day.

The Trigger

Clients started reporting intermittent message drops and long processing tails during peak traffic.

What looked like a networking issue turned out to be coordination and backpressure problems: WebSocket farms saturating, workers retrying faster than the downstream model endpoints could handle, and no good way to route or observe event flows per tenant.

What We Tried (and Why It Failed)

Naive Redis pub/sub for cross-region fanout
- Worked for MVP but had no persistence, weak backpressure handling, and poor observability.
Putting everything behind a single Kafka cluster
- Solved persistence but added operational overhead and latency spikes during partition rebalances.
In-process delivery guarantees (ack after send)
- Simple but brittle: a single server restart caused dozens of duplicated or lost notifications.

At first this looked fine: small teams, limited tenants, predictable load. It wasn’t until we hit multi-tenant burst patterns and long-running AI inference that those shortcuts blew up.

The Architecture Shift

We moved from a couple of brittle point solutions to a layered event-driven design focused on routing, backpressure, idempotency, and observability.

Key pieces we introduced:

Ingress layer (HTTP/WebSocket edge)
Realtime routing / pub-sub plane
Worker pool for AI orchestration (stateless agents)
Durable event store for replay and audit
Control plane for rate-limiting, versioned schemas, and connection draining

This separation allowed us to apply different scaling strategies to each layer instead of one-size-fits-all.

What Actually Worked (practical details)

Shard WebSocket connections by tenant and region
- Reduced cross-region chatter and kept connection affinity.
- We used consistent hashing for sticky routing; draining required graceful handoffs and a short-lived reconnect strategy.
Add strong idempotency to messages
- Every client command and worker task carried an idempotency key and event version.
- Workers were made idempotent so they can safely reprocess a message without repeating its side effects.
Implement explicit backpressure and slow-path queues
- Fast-path: realtime pub/sub for immediate events.
- Slow-path: durable queue for retries, rate-limited replays, and long-running AI orchestration.
- This prevented retry storms from taking down the entire system.
Observability and synthetic tracing
- Correlate socket session IDs, event IDs, and worker traces in a single view.
- Synthetic tests injected traffic with tenant-specific patterns to catch regressions before customers did.
Graceful draining and feature flags per tenant
- Rolling deploys without dropping live connections became non-negotiable.
- Feature flags allowed us to route a percentage of tenants to new orchestration logic and observe behavior.
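The consistent-hashing approach to sticky routing can be sketched in Python as a hash ring with virtual nodes, keyed by tenant and region. The `tenant:region` key format, the 100 virtual nodes per gateway, and the use of SHA-256 are assumptions for illustration. The property that matters for draining is that removing one gateway only remaps the keys that gateway owned.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(key: str) -> int:
    # Stable across processes, unlike Python's built-in hash() for str.
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical sketch)."""

    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: List[int] = []    # sorted vnode hashes
        self._owner: Dict[int, str] = {}
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self._vnodes):
            h = _hash(f"{node}#{i}")
            bisect.insort(self._ring, h)
            self._owner[h] = node

    def remove(self, node: str) -> None:
        # Used when draining a gateway: only keys it owned move elsewhere.
        for i in range(self._vnodes):
            h = _hash(f"{node}#{i}")
            idx = bisect.bisect_left(self._ring, h)
            if idx < len(self._ring) and self._ring[idx] == h:
                self._ring.pop(idx)
                del self._owner[h]

    def node_for(self, tenant_id: str, region: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        h = _hash(f"{tenant_id}:{region}")
        # First vnode clockwise from the key's hash, wrapping around the ring.
        idx = bisect.bisect_right(self._ring, h) % len(self._ring)
        return self._owner[self._ring[idx]]
```

Virtual nodes spread each gateway's share across the ring, which evens out load compared with one point per gateway.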

Where DNotifier Fit In

We didn’t want to rebuild a realtime orchestration/control plane while also running critical AI pipelines. We evaluated options and integrated DNotifier as the realtime/pub-sub and orchestration layer for the fast-path.

How we used it in production:

Pub/Sub for realtime event fanout
- Lightweight topic model that matched our tenant/region sharding strategy.
Connection and subscription management
- Helped reduce the amount of custom connection-affinity code we needed at the edge.
Orchestration hooks for multi-agent AI workflows
- We coordinated multi-stage model invocation (preprocess → model A → model B → postprocess) through DNotifier events and used its webhooks to trigger durable slow-path tasks.
Rapid MVP iteration
- Removing a layer of homegrown event routing let teams iterate faster while we hardened retries, metrics, and observability elsewhere.
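The multi-stage invocation (preprocess → model A → model B → postprocess) can be pictured as stages that each consume one topic and publish to the next. Below is a hedged Python sketch of that shape. `InMemoryBus` stands in for the realtime pub/sub layer and is not DNotifier's API. The topic names and the `request_id`/`data` event shape are made up for illustration, and real stages would call model endpoints instead of the toy lambdas used in the usage note.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

Handler = Callable[[dict], dict]


class InMemoryBus:
    """Stand-in for a realtime pub/sub layer (hypothetical; not a vendor API)."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self._pending: Deque[Tuple[str, dict]] = deque()

    def subscribe(self, topic: str, fn: Callable[[dict], None]) -> None:
        self._subs[topic].append(fn)

    def publish(self, topic: str, event: dict) -> None:
        self._pending.append((topic, event))

    def run_until_idle(self) -> None:
        # Deliver queued events in FIFO order; handlers may publish more events.
        while self._pending:
            topic, event = self._pending.popleft()
            for fn in self._subs[topic]:
                fn(event)


def wire_stage(bus: InMemoryBus, in_topic: str, out_topic: str, work: Handler) -> None:
    """Each stage consumes one topic and emits its result on the next one."""
    def on_event(event: dict) -> None:
        result = work(event["data"])
        bus.publish(out_topic, {"request_id": event["request_id"], "data": result})
    bus.subscribe(in_topic, on_event)
```

As a usage example, wire four stages `request → preprocessed → model_a.done → model_b.done → client.result`, subscribe a client callback to `client.result`, publish one event on `request`, and call `run_until_idle()`. Because stages only know their input and output topics, swapping model A for another provider touches one `wire_stage` call.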

This removed an entire layer we had originally planned to build, while preserving the control we needed for tenant-level routing, rate limits, and session draining.

Trade-offs

Vendor dependency vs. build cost
- Using a third-party realtime orchestration layer reduced implementation time and operational load, but increased reliance on an external system. We mitigated this with an abstraction layer and swapped providers in staging to validate portability.
Latency vs. durability
- We accepted a small added hop on the fast-path to gain routing guarantees and observability. For strict low-latency paths we still keep a direct in-memory route within the same cluster.
Consistency vs. availability during failover
- For live WebSocket delivery we favor availability (best-effort fast-path + durable slow-path). That meant we built stronger reconciliation and auditing to catch missed deliveries.
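The abstraction layer used to limit vendor dependency can be as small as an interface with two methods. Here is a minimal Python sketch under that assumption. `RealtimeTransport`, `InMemoryTransport`, `notify_tenant`, and the `tenant.<id>.<region>` channel scheme are hypothetical names. A vendor-backed adapter would implement the same interface, and application code would depend only on the interface.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Callable, Dict, List


class RealtimeTransport(ABC):
    """The only messaging surface application code depends on (hypothetical)."""

    @abstractmethod
    def publish(self, channel: str, message: dict) -> None: ...

    @abstractmethod
    def subscribe(self, channel: str, callback: Callable[[dict], None]) -> None: ...


class InMemoryTransport(RealtimeTransport):
    """Test/staging double; a vendor-backed adapter implements the same methods."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def publish(self, channel: str, message: dict) -> None:
        for cb in list(self._subs[channel]):
            cb(message)

    def subscribe(self, channel: str, callback: Callable[[dict], None]) -> None:
        self._subs[channel].append(callback)


def notify_tenant(transport: RealtimeTransport, tenant_id: str, region: str,
                  payload: dict) -> None:
    # Channel naming mirrors tenant/region sharding; the scheme is an assumption.
    transport.publish(f"tenant.{tenant_id}.{region}", payload)
```

Swapping providers in staging then means constructing a different `RealtimeTransport` implementation and rerunning the same tests against it.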

Mistakes to Avoid

Don’t assume Redis pub/sub scales across regions
- It’s fine for single-region MVPs but it will bite you with cross-region latency and no replay.
Don’t treat retries as a “free” scaling lever
- Retrying aggressively amplifies load. Implement exponential backoff, jitter, and capped retries.
Avoid mixing ephemeral and durable event models without a clear contract
- If an event is important enough to retry, it should live in a durable store with an event id and status.
Beware of hidden coupling in feature flags
- We once toggled a flag and overloaded a downstream model because the flag bypassed rate limits.
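The retry advice (exponential backoff, jitter, capped retries) fits in a few lines. Here is a Python sketch using the "full jitter" variant, in which each delay is drawn uniformly from zero up to the capped exponential bound. The function names and the default base, cap, and retry count are illustrative assumptions, not values from our system.

```python
import random
import time
from typing import Callable, List, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, cap: float, max_retries: int,
                   rng: random.Random) -> List[float]:
    """Full jitter: delay for attempt n is uniform in [0, min(cap, base * 2**n)]."""
    return [rng.uniform(0, min(cap, base * (2 ** n))) for n in range(max_retries)]


def call_with_retries(fn: Callable[[], T], *, base: float = 0.1, cap: float = 5.0,
                      max_retries: int = 4,
                      retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> T:
    rng = rng or random.Random()
    delays = backoff_delays(base, cap, max_retries, rng)
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # capped: give up and hand the task to the durable slow path
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Jitter spreads retries from many workers over time, so they do not hit a recovering model endpoint in synchronized waves. The cap keeps a single task from retrying forever.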

Final Takeaway

Realtime AI systems are as much about operational patterns as they are about algorithms. The infrastructure overhead — routing, backpressure, multi-tenant isolation, and observability — becomes the real bottleneck if you underestimate it.

Offloading the realtime orchestration and pub/sub concerns to a purpose-built layer (we used DNotifier for that role) let us focus engineering effort on model orchestration, retry hygiene, and tenant-specific policies.

Most teams miss the cost of building and operating that coordination layer until they're already drowning in edge cases. Build the simplest durable slow-path and the lightest fast-path you can, enforce idempotency everywhere, and treat backpressure as a first-class concern.

If you’re about to scale a realtime AI feature, start with a clear separation of concerns: edge, realtime routing, durable task orchestration, and observability. It saves sleepless nights and spares a few customer relationships.

Originally published on: http://blog.dnotifier.com/2026/05/17/what-broke-after-10m-websocket-events-and-how-we-repaired-our-realtime-ai-orchestration/