Mustafa ERBAY

Originally published at mustafaerbay.com.tr

3 Architectural Mistakes That Undermine Reliability in Mobile Push

Push notifications are a critical tool for keeping mobile users engaged and delivering real-time information. Reliability, however, means much more than a "sent" status: the user must actually receive the notification, receive it at the right time, and find that its content serves its purpose. Based on my field experience, this article examines three fundamental architectural mistakes that many applications overlook in their push notification systems, how each one undermines reliability, and how it can be fixed.

These mistakes might not be apparent at first glance, but over time they degrade the user experience, cause missed notifications, and ultimately lead users to question the application's trustworthiness. They become even more pronounced in large-scale systems and complex deployment models. Let's take a closer look at these common pitfalls and how to avoid them.

1. Lack of Connection Management and Status Tracking: ConnectivityManager Pitfalls in Android

In the Android ecosystem, an application's ability to receive notifications largely depends on the device's network connection. However, simply having an internet connection is not enough; the application must manage this state correctly and behave accordingly. While the ConnectivityManager API helps us with this, it can lead to serious issues if not used properly.

For instance, an application stops receiving notifications when the network connection is lost. This is a normal situation. However, it becomes a major problem if notifications are not automatically synchronized or delayed notifications are not fetched from the server when the connection is re-established. Once, in a production tracking application, we observed that push notifications used for shipment status updates were frequently lost during transitions between mobile carrier networks or when switching from Wi-Fi to cellular data. Users were not receiving critical updates, causing disruptions in production planning.

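The root of the problem was twofold: the application was not correctly listening for network status changes in the background using ConnectivityManager.registerNetworkCallback, and it was not initiating a synchronization with the server once the connection was re-established. Simply listening for "CONNECTIVITY_ACTION" with a BroadcastReceiver was not sufficient. That broadcast has been deprecated since API level 28, manifest-declared receivers no longer receive it in apps targeting Android 7.0 (API level 24) or later, and in any case knowing that the network changed does nothing by itself to recover notifications missed in the meantime.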

Recommended Solution: Smart Reconnection and Delayed Message Handling

Several strategies can be employed to overcome such issues. Firstly, background services (using Foreground Service or WorkManager) should continuously monitor the network connection status and initiate an immediate heartbeat or synchronization request with the server upon connection establishment. This ensures that delayed messages are received.

Furthermore, it helps to maintain a "last seen" or "last synchronized" timestamp for each device. When the application reconnects, it sends this timestamp and the server returns only the updates created since then. This prevents unnecessary data transfer while ensuring that no message is skipped.

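// Example Kotlin code (simplified)
import android.content.Context
import android.net.ConnectivityManager
import android.net.NetworkCapabilities
import android.util.Log
import androidx.work.Worker
import androidx.work.WorkerParameters

class NotificationSyncWorker(appContext: Context, workerParams: WorkerParameters) : Worker(appContext, workerParams) {

    override fun doWork(): Result {
        val connectivityManager = applicationContext.getSystemService(Context.CONNECTIVITY_SERVICE) as ConnectivityManager
        // activeNetworkInfo is deprecated; inspect the active network's capabilities instead (API 23+)
        val capabilities = connectivityManager.getNetworkCapabilities(connectivityManager.activeNetwork)
        val isOnline = capabilities?.hasCapability(NetworkCapabilities.NET_CAPABILITY_VALIDATED) == true

        return if (isOnline) {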
            syncPendingNotifications()
            Result.success()
        } else {
            Result.retry() // Retry if no connection
        }
    }

    private fun syncPendingNotifications() {
        // Synchronization logic with the server comes here
        // Get the last synchronized timestamp, send it to the server, process new notifications
        Log.d("NotificationSyncWorker", "Notifications are being synchronized...")
        // ... API calls and notification processing ...
    }
}

Because this worker runs under WorkManager, it can also be scheduled with constraints (for example, requiring a connected network), so that synchronization respects the device's battery status and network conditions. This recovers notifications that would otherwise be lost when the connection is interrupted and gives users a more reliable experience.
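
The "last synchronized" timestamp idea described in this section can be sketched in plain Kotlin. This is a minimal illustration rather than a production design: NotificationStore, StoredNotification, updatesSince, and nextSyncPoint are hypothetical names invented for this example, and the in-memory list stands in for whatever database the server really uses.

```kotlin
// Hypothetical server-side model; every name here is an illustrative assumption.
data class StoredNotification(val id: String, val createdAtMillis: Long, val body: String)

class NotificationStore {
    private val items = mutableListOf<StoredNotification>()

    fun add(notification: StoredNotification) {
        items.add(notification)
    }

    // Returns only notifications created strictly after the client's last sync point,
    // oldest first, so the client can apply them in order.
    fun updatesSince(lastSyncedAtMillis: Long): List<StoredNotification> =
        items.filter { it.createdAtMillis > lastSyncedAtMillis }
            .sortedBy { it.createdAtMillis }
}

// Client side: the new sync point is the newest timestamp received,
// or the previous one if nothing new arrived.
fun nextSyncPoint(previous: Long, received: List<StoredNotification>): Long =
    received.maxOfOrNull { it.createdAtMillis } ?: previous
```

The sketch assumes that timestamps are assigned by the server, so device clock skew does not matter. If two notifications can share a timestamp, a monotonically increasing sequence number would be a safer sync point than a timestamp.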

2. Over-reliance on Platform Services: Apple Push Notification Service (APNS) and Latency Issues

On the iOS side, the situation is slightly different, but architectural mistakes still exist. Apple Push Notification Service (APNS) plays a central role in delivering notifications to devices. APNS itself is quite reliable, but developers not fully understanding its working mechanism and over-relying on it can lead to problems.

One of the most common issues is overlooking how connections to APNS are managed. Each device keeps its own persistent connection to APNS, and your server is expected to keep its connections to APNS open across many notifications rather than opening and closing one per request. If your server does not manage these connections well, for example by failing to detect and replace broken connections, notifications can be delayed or never delivered. In an e-commerce platform, delayed delivery of promotional product notifications to the target audience led to missed sales opportunities. The problem stemmed from the inefficient management of the APNS connection pool on the server side.

Another critical point is that APNS does not provide delivery receipts. When you send a notification, a successful response means only that APNS has accepted it; it does not guarantee that the device has actually received it. If the device is offline or the notification is never shown, your server is not told. This situation is unacceptable, especially for critical alerts (e.g., emergency notifications).

APNS Advanced Features and Troubleshooting

The HTTP/2 APNS API helps with part of this problem. It returns a status code (success, failure, etc.) for each notification, and each notification is identified by an apns-id header that you can set yourself or let APNS generate. When a device token can no longer receive notifications, the API answers the send request with status 410. This takes over the role of the separate feedback service from the legacy binary protocol, which reported such tokens out of band.

On the server side, it's important to use reliable libraries like apns-client to manage APNS connections and to correctly configure the connection pool mechanisms these libraries provide. Additionally, assigning an apns-id to each notification and recording the result APNS returns for that ID in your own system is critical for tracing what happened to every notification you sent.
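
As a rough sketch of tracking results by apns-id, the following plain Kotlin keeps one state per notification and classifies the HTTP status returned by APNS. ApnsSendTracker and SendState are hypothetical names for this example, not part of any APNS library, and the grouping of 429, 500, and 503 as "retryable" is an assumption about how you might want to treat those responses.

```kotlin
// Hypothetical tracker; the class, enum, and status grouping are illustrative assumptions.
enum class SendState { PENDING, ACCEPTED, TOKEN_INVALID, RETRYABLE, FAILED }

class ApnsSendTracker {
    private val states = mutableMapOf<String, SendState>()

    // Create the apns-id before sending, so a record exists even if the response is lost.
    fun register(): String {
        val apnsId = java.util.UUID.randomUUID().toString()
        states[apnsId] = SendState.PENDING
        return apnsId
    }

    // Map the HTTP status of the APNS response onto a state for this notification.
    fun onResponse(apnsId: String, httpStatus: Int): SendState {
        val state = when (httpStatus) {
            200 -> SendState.ACCEPTED            // accepted by APNS, not proof of delivery
            410 -> SendState.TOKEN_INVALID       // token no longer active: stop sending to it
            429, 500, 503 -> SendState.RETRYABLE // throttled or server-side trouble: retry later
            else -> SendState.FAILED             // e.g. malformed request: retrying will not help
        }
        states[apnsId] = state
        return state
    }

    fun stateOf(apnsId: String): SendState? = states[apnsId]
}
```

In a real system the map would be a database table, and a TOKEN_INVALID result would also trigger removal of the device token so it is not used again.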

💡 Important Note: APNS V2 HTTP/2 API

The HTTP/2-based APNS API has replaced the old binary protocol, which Apple stopped supporting in 2021. The HTTP/2 API is faster, consumes fewer resources, and makes it easier to track the status of each send, especially with apns-id. If any part of your system or its libraries still assumes the old binary protocol, migrating to HTTP/2 is no longer optional.

For delivery receipts, while APNS itself doesn't offer a direct mechanism, you can maintain a "pending" status on the server for sent notifications and update this status if confirmation (e.g., an analytics event sent when the notification is opened) is received from the device. However, this indicates user interaction rather than the notification reaching the device. More complex solutions might be needed for actual delivery tracking.
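
The "pending until confirmed" idea can be sketched as a small ledger. This is an illustration under assumptions: DeliveryLedger and its methods are hypothetical names, the confirmation is whatever event your app reports back, and timestamps are passed in explicitly so the logic stays easy to test.

```kotlin
// Hypothetical ledger of notifications awaiting confirmation from the device.
class DeliveryLedger {
    // notificationId -> time the notification was handed to the push provider
    private val pendingSince = mutableMapOf<String, Long>()

    fun markSent(notificationId: String, sentAtMillis: Long) {
        pendingSince[notificationId] = sentAtMillis
    }

    // Called when the device reports back (for example via an analytics event).
    // Returns false for unknown or already confirmed IDs.
    fun confirm(notificationId: String): Boolean =
        pendingSince.remove(notificationId) != null

    // IDs still unconfirmed after the timeout: candidates for a re-sync or a fallback channel.
    fun overdue(nowMillis: Long, timeoutMillis: Long): List<String> =
        pendingSince.filter { nowMillis - it.value >= timeoutMillis }.keys.sorted()
}
```

An overdue entry does not prove the notification was lost; it only means no confirmation arrived in time, which is why it is best treated as a trigger for compensation rather than as an error.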

3. Error Handling and State Synchronization: The "Eventual Consistency" Fallacy

The third major architectural mistake in push notification systems is misapplying the "eventual consistency" principle to error handling and state synchronization. Eventual consistency means that data in a system will become consistent over time. In the context of push notifications, this can mean that a notification takes a little time to reach the device. It does not mean that notifications may never arrive or may arrive in the wrong state.

In one project, we needed to send a special discount code to a group of users. Notifications were sent, but while some users received the code, others did not receive it at all. Later investigation revealed that errors occurring during server-side message queue processing were not managed well enough. When an error occurred, the message was pulled from the queue but could not be processed. The system attempted to resolve this error with a retry mechanism, but the number of retries was limited, and after a certain period, the messages were lost forever.

This situation leads to serious problems, especially when we expect the user to take an action (e.g., a notification to complete an order). If the user doesn't receive the notification, they won't complete the action, causing a break in the workflow.

Robust Error Handling and State Synchronization

The key to solving this problem is to establish a robust error handling and state synchronization mechanism.

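  1. Reliable Message Queue: Using a durable message queue such as RabbitMQ, Kafka, or AWS SQS greatly reduces the risk of losing messages. With these systems a message remains available for redelivery until the consumer acknowledges it (or, with Kafka, commits its offset), so a crash in the middle of processing does not silently drop it.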
  2. Retry and Dead Letter Queue (DLQ): When a message cannot be processed, it should be retried with a specific strategy. If all retry attempts fail, a "Dead Letter Queue" (DLQ) should be used to move the message to a separate location so it is not lost. These DLQs can then be manually inspected or automatically reprocessed.
  3. State Tracking and Observability: A status record should be maintained on the server side for every notification sent. This record should include information such as whether the notification was sent, if it was forwarded to APNS/FCM, and ultimately, if it reached the device. Logging, metrics, and tracing tools play a critical role in this process.
  4. User-Based State Synchronization: When a user's device is offline or does not receive notifications, there should be a mechanism to compensate the next time the connection is established. This requires the server to know which notifications the user's device has already received, so that it can send any missing ones.
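
The retry and dead letter queue steps in the list above can be sketched with in-memory queues. This is a simplified illustration, not how RabbitMQ, Kafka, or SQS implement it: RetryingProcessor, PushMessage, and the handler callback are hypothetical names, and a real broker would persist both queues.

```kotlin
// Hypothetical message and processor; names and structure are illustrative assumptions.
data class PushMessage(val id: String, val payload: String, val attempts: Int = 0)

class RetryingProcessor(
    private val maxAttempts: Int,
    private val handler: (PushMessage) -> Unit
) {
    private val mainQueue = ArrayDeque<PushMessage>()
    val deadLetterQueue = mutableListOf<PushMessage>()

    fun enqueue(message: PushMessage) = mainQueue.addLast(message)

    fun pendingCount(): Int = mainQueue.size

    // Process until the main queue is empty. A failed message is requeued with its
    // attempt count increased; once it reaches maxAttempts it moves to the DLQ
    // instead of being dropped.
    fun drain() {
        while (mainQueue.isNotEmpty()) {
            val message = mainQueue.removeFirst()
            try {
                handler(message)
            } catch (e: Exception) {
                val failed = message.copy(attempts = message.attempts + 1)
                if (failed.attempts >= maxAttempts) deadLetterQueue.add(failed)
                else mainQueue.addLast(failed)
            }
        }
    }
}
```

The important property is that every message ends in exactly one place: processed, waiting for another attempt, or parked in the DLQ for inspection. For brevity the sketch retries immediately; a real consumer would wait between attempts.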

⚠️ Caution: Overly Simple Retry Mechanisms

A simple "try 3 times, give up if it fails" logic can lead to lost messages, especially in high-volume systems. Retry strategies should include smarter approaches like exponential backoff and must be supported by a DLQ.
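
To make exponential backoff concrete, here is a minimal sketch of how the wait before each retry could be computed. The function names, the one-second base, and the one-minute cap are assumptions chosen for illustration; the random variant adds "full jitter" so that many failing clients do not all retry at the same instant.

```kotlin
// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped at maxDelayMillis.
// Defaults are illustrative assumptions, not recommended production values.
fun backoffDelayMillis(attempt: Int, baseMillis: Long = 1_000L, maxDelayMillis: Long = 60_000L): Long {
    require(attempt >= 1) { "attempt is 1-based" }
    // Limit the shift so large attempt numbers cannot overflow the Long.
    val shift = minOf(attempt - 1, 30)
    return minOf(baseMillis shl shift, maxDelayMillis)
}

// Full jitter: pick a random delay between 0 and the exponential delay (inclusive).
fun jitteredDelayMillis(
    attempt: Int,
    random: kotlin.random.Random = kotlin.random.Random.Default
): Long = random.nextLong(backoffDelayMillis(attempt) + 1)
```

With these defaults the un-jittered delays grow as 1 s, 2 s, 4 s, 8 s and so on until they reach the 60 s cap. After the last attempt the message should go to the DLQ rather than be discarded.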

These approaches help ensure that push notifications are not just sent but are actually received and processed. Such details, which directly impact user experience, are vital for the overall reliability of your application.

Conclusion and Next Steps

The reliability of mobile push notifications is not just a "feature" but a cornerstone of modern mobile applications. Connection management, proper understanding of platform services, and robust error handling are the fundamental elements for ensuring this reliability. The three architectural mistakes I've discussed above – missing connection tracking, over-reliance on APNS, and incorrectly implemented eventual consistency – are common issues faced by many applications.

The steps we take to resolve these issues not only ensure that notifications are delivered but also give users a smoother and more reliable interaction with the application. This directly increases user engagement and satisfaction, which matters especially for revenue-generating applications.

The next step is to integrate these principles into your application's architecture and review them regularly. Especially after major updates or platform changes, checking the status of your notification system will help you detect potential issues early on. Paying attention to such details is critical for the long-term success of your application.
