Introduction
This is a follow-up to my previous post, "User Connectivity: A Real-time Web Solution for Online and Offline User Status." It has been almost two years since I first deployed my user connectivity solution to production. What started as an experiment has become a cornerstone of my architecture.
Could I make it simpler? Yes. However, an upgrade would not be worth the time and effort given other priorities. The system has been rock solid, and I have expanded its use to solve other challenges in my application.
Expanding the Pattern: Solving Database Deadlocks
Recently, I encountered complex database deadlock issues caused by several API processes. After analyzing the problem, I recognized an opportunity to apply my event-driven architecture. By separating out specific tasks and processing them asynchronously, I eliminated the deadlocks entirely.
This led me to create a centralized service I call the Event Distributor: an Azure Function App responsible solely for listening to Redis expiration events and forwarding them to an Azure Event Grid queue. Multiple Azure Function App workers then process events from this queue.
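The distributor's core job is a simple translation: an expired Redis key becomes a message on the queue. A minimal sketch of that translation is below, using the standard Azure Event Grid event-schema field names; the key convention ("entity:id") and the event type naming are my illustrative assumptions, not the post's actual schema.

```python
import json
import uuid
from datetime import datetime, timezone

def expired_key_to_event(key: str) -> dict:
    """Translate an expired Redis key into an Event Grid-style event envelope.

    Assumes keys follow an "<entity>:<id>" convention, e.g. "user:42".
    The key schema and event type names are illustrative, not the
    author's actual ones.
    """
    entity, _, entity_id = key.partition(":")
    return {
        "id": str(uuid.uuid4()),
        "eventType": f"{entity}.expired",      # e.g. "user.expired"
        "subject": f"/redis/keys/{key}",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "dataVersion": "1.0",
        "data": {"key": key, "entity": entity, "entityId": entity_id},
    }

print(json.dumps(expired_key_to_event("user:42"), indent=2))
```

Because the envelope is a plain dictionary, the same payload can be logged to Cosmos DB and enqueued for the workers without reshaping.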
Event Distributor Architecture
The goal of the Event Distributor is simple: one central service that listens to Redis expiration events and nothing else.
This separation provides several benefits:
- Traceability: All events pass through a single point, making it easier to trace issues.
- Logging: Every received event is logged to Azure Cosmos DB with its status.
- Scalability: Worker functions can scale independently based on queue depth.
- Reliability: A single, well-monitored listener reduces the risk of missed events.
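Stripped of the Azure Functions binding, the listener at the heart of this design can be sketched with redis-py. This is a standalone illustration only; in the real system the RedisPubSubTrigger plays this role, and the connection URL and forwarding step are placeholders.

```python
EXPIRED_PATTERN = "__keyevent@*__:expired"  # keyspace-notification channel pattern

def db_index_from_channel(channel: str) -> int:
    """Extract the database index from a keyevent channel name,
    e.g. "__keyevent@0__:expired" -> 0."""
    return int(channel.split("@", 1)[1].split("__", 1)[0])

def listen_for_expirations(redis_url: str) -> None:
    """Hypothetical standalone equivalent of the Function App listener."""
    import redis  # lazy import so the helper above works without redis-py

    client = redis.Redis.from_url(redis_url)
    # Redis only publishes expiration events when keyspace notifications
    # are enabled ("E" = keyevent events, "x" = expired events).
    client.config_set("notify-keyspace-events", "Ex")
    pubsub = client.pubsub()
    pubsub.psubscribe(EXPIRED_PATTERN)
    for message in pubsub.listen():
        if message["type"] != "pmessage":
            continue
        key = message["data"].decode()
        db = db_index_from_channel(message["channel"].decode())
        # Forward to the Event Grid queue here (omitted in this sketch).
        print(f"expired: db={db} key={key}")
```

Keeping this loop in one dedicated process is what makes the traceability and logging benefits above possible: every expiration passes through a single, observable point.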
Key Observations: Azure Services
After two years of running this architecture, I have two critical observations for anyone implementing a similar solution with Azure services.
1. Use Premium Plan for Redis Listeners
If your Azure Function App uses the RedisPubSubTrigger to listen for Redis expiration events, you must use the Premium plan. The Consumption plan is not reliable for this use case.
Why? The Consumption plan has cold start delays and limited connection lifetimes. When the function app is idle, it can go to sleep and miss events from Redis. With the Premium plan, your function app is always on with long-lived connections — no cold starts, no missed events. I have not seen a single lost event since moving to Premium.
You may ask: how did the original user connectivity solution work on the Consumption plan?
The answer is traffic volume. Even in idle mode, my application generates 170–200 heartbeat events per second. During peak usage, this increases 6–10 times. The function app never goes to sleep because of this constant traffic. Additionally, my Heartbeat function app includes both the RedisPubSubTrigger and HTTP triggers, keeping it active.
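For context, the heartbeat side of the original solution can be sketched as a short-TTL key refresh: each heartbeat resets the TTL, and when heartbeats stop, the key expires and Redis emits the expiration event the listener picks up. The TTL value and key naming below are my assumptions, not the original post's exact values.

```python
HEARTBEAT_TTL_SECONDS = 30  # assumed TTL; the real value may differ

def heartbeat_command(user_id: str, ttl: int = HEARTBEAT_TTL_SECONDS) -> tuple:
    """Build the Redis command that refreshes a user's presence key.

    SET with EX both creates the key and resets its TTL, so a steady
    stream of heartbeats keeps the key alive indefinitely; silence
    lets it expire. Key naming ("user:<id>") is illustrative.
    """
    return ("SET", f"user:{user_id}", "online", "EX", str(ttl))

print(heartbeat_command("42"))
```

At 170–200 of these per second, the pub/sub connection never sits idle, which is why the Consumption plan held up for the original heartbeat workload but not for a quiet, listen-only distributor.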
For a dedicated Event Distributor that only listens to Redis expiration events, the Premium plan is essential.
2. Azure Managed Redis Limitation
Azure has introduced a new Redis service called Azure Managed Redis (formerly Redis Enterprise). This service runs on a clustered architecture with multiple Redis servers under the hood.
However, there is a significant limitation: these servers cannot be fully configured to send expiration events reliably. When the underlying servers restart or cycle, they lose their configuration and stop sending expiration events.
I have reported this issue to Microsoft. They confirmed it is on their backlog, but as of today, there is no solution. If your architecture depends on Redis key expiration events, be aware of this limitation before migrating to Azure Managed Redis.
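For Redis deployments that do permit CONFIG SET, one possible mitigation (not an official fix, and not applicable where the managed service blocks the command) is to periodically re-apply the notification setting, for example from a timer-triggered function, so a restart that wipes it causes only a short gap. A sketch, with an in-memory fake client standing in for redis-py:

```python
def ensure_expiration_events(client) -> bool:
    """Re-apply the keyspace-notification config if the server lost it.

    "E" enables keyevent events, "x" expired events; "A" implies all
    event classes. Returns True if the setting had to be restored.
    Expects a redis-py-style client (config_get / config_set).
    """
    current = client.config_get("notify-keyspace-events").get(
        "notify-keyspace-events", "")
    if "E" in current and ("x" in current or "A" in current):
        return False
    client.config_set("notify-keyspace-events", "Ex")
    return True

class FakeClient:
    """Minimal stand-in for a real Redis client, for illustration."""
    def __init__(self, value=""):
        self.value = value
    def config_get(self, name):
        return {name: self.value}
    def config_set(self, name, value):
        self.value = value

client = FakeClient("")                       # simulates a restarted server
print(ensure_expiration_events(client))       # setting was restored
print(ensure_expiration_events(client))       # already configured
```

This only papers over the window between a restart and the next check; until Microsoft addresses the underlying behavior, an architecture that depends on expiration events should not assume Azure Managed Redis delivers them reliably.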
Conclusion
My event-driven architecture has proven itself over two years of production use. It solved my original user connectivity requirements and has since helped me address complex database deadlock issues. The key takeaways:
- Use the Premium plan for Redis listeners
- Be cautious with Azure Managed Redis if you rely on expiration events
I hope this helps others take advantage of this architecture.