Push Notification Delivery Guarantees with Rails: A Spiral Through the Gray Hours

#webdev #programming #productivity #rails

I still remember the 3 a.m. Slack message that made my stomach drop.

“CEO just asked why 40% of our users didn’t get the flash sale alert. Said their Android phones show nothing. We’re losing revenue.”

We had everything right. Rpush gem configured. Firebase Cloud Messaging (FCM) credentials rotated. APNS certificates valid. Background jobs retrying on failure. And still—notifications vanished like whispers in a hurricane.

That night, I stopped believing in “delivery guarantees.” I started understanding push notifications as a probabilistic art—where your Rails backend can do everything perfectly, and the universe (read: carriers, battery optimizers, OS quirks) can still say no.

This is the journey of building trust from chaos. Senior full-stack folks, pull up a chair.

The Lie We Tell Ourselves

We write:

NotificationSenderJob.perform_later(user_id, "Your order shipped!")

And we think: it’ll get there. But between perform_later and a screen lighting up, there are more circles of hell than anyone warns you about. A few of them:

FCM/APNS rate limits
Device tokens that expired yesterday
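Doze mode and App Standby on Android 6+, plus OEM battery optimizers layered on top
Carrier networks and NATs silently dropping the device’s long-lived connection to FCM or APNs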
The user swiped away your app and background fetch is dead

Push notifications are not TCP. They are UDP with extra sadness.

The First Realization: Idempotency Is Not Enough

We all know the drill: make the job idempotent, then retry it 5 times with a growing backoff. Great for API calls. Useless when the provider returns 200 OK but the phone never shows the notification.

Because here’s the dirty secret: FCM’s 200 means “we accepted the message into our queue.” It does not mean “the user saw it.” I’ve had messages accepted at 2:01 PM and delivered at 3:47 AM the next day. Or never.

So we need a different mental model: at-least-once attempt, not delivery. You can’t guarantee delivery. You can guarantee you tried honestly and can measure the gap.
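
Measuring that gap can be as small as comparing what the provider accepted against what the device later acknowledged. Here is a minimal sketch in plain Ruby. The Attempt struct and its accepted_at and acked_at fields are hypothetical names used only for illustration, not part of the pipeline described below:

```ruby
# Hypothetical record of one send attempt. accepted_at is set when the
# provider returns success; acked_at is set only if the device confirms.
Attempt = Struct.new(:id, :accepted_at, :acked_at, keyword_init: true)

# Share of provider-accepted messages that were never acknowledged.
def delivery_gap(attempts)
  accepted = attempts.select(&:accepted_at)
  return 0.0 if accepted.empty?

  unacked = accepted.count { |a| a.acked_at.nil? }
  unacked.fdiv(accepted.size)
end

now = Time.now
attempts = [
  Attempt.new(id: 1, accepted_at: now, acked_at: now + 2),
  Attempt.new(id: 2, accepted_at: now, acked_at: nil),
  Attempt.new(id: 3, accepted_at: now, acked_at: now + 90),
  Attempt.new(id: 4, accepted_at: nil, acked_at: nil) # provider rejected it
]

puts delivery_gap(attempts).round(2)
```

Rejected sends are left out of the denominator on purpose: they are failures you already know about, while the gap is meant to capture the silent ones.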

The Architecture of Honest Attempts (What Actually Works)

After that 3 a.m. incident, I rebuilt our notification pipeline into something I call the “spiral log”—because it twists back on itself, checking, reconciling, never trusting.

Here’s the Rails core that survived production:

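# app/models/notification.rb
class Notification < ApplicationRecord
  belongs_to :user
  # On Rails 8+ this must be written `enum :state, { ... }`
  enum state: { pending: 0, sent_to_provider: 1, delivered_to_device: 2, failed: 3 }

  # Columns used by the jobs below:
  #   payload, collapse_key              what to send
  #   provider_message_id                ID returned by FCM/APNs on acceptance
  #   sent_at, delivered_at              timestamps for latency and reconciliation
  #   error                              failure reason, e.g. "invalid_token"
  #   delivery_attempts, last_attempt_at retry bookkeeping for backoff
end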

# app/jobs/send_notification_job.rb
class SendNotificationJob < ApplicationJob
  # :polynomially_longer is the Rails 7.1+ name; older versions call it :exponentially_longer
  retry_on ProviderTimeout, wait: :polynomially_longer, attempts: 5

  def perform(notification_id)
    notification = Notification.find(notification_id)
    return if notification.delivered_to_device?

    provider = PushProvider.for(notification.user.device_platform)
    # Named `deliver` rather than `send` so the provider does not shadow Object#send
    response = provider.deliver(
      token: notification.user.push_token,
      payload: notification.payload,
      collapse_key: notification.collapse_key
    )

    notification.update!(
      state: :sent_to_provider,
      provider_message_id: response.message_id,
      sent_at: Time.current
    )

    # Schedule a delivery receipt check (more on this)
    CheckDeliveryReceiptJob.set(wait: 30.seconds).perform_later(notification.id)
  rescue ProviderInvalidToken
    # The token is permanently invalid: retrying cannot help, so fail fast and revoke it
    notification.update!(state: :failed, error: "invalid_token")
    UserTokenRevocationService.call(notification.user)
  end
end
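
It helps to know what the retry_on line above actually costs in wall-clock time. Active Job's built-in growing backoff (named :exponentially_longer in older Rails, :polynomially_longer since 7.1) waits roughly executions**4 + 2 seconds, plus random jitter. The sketch below reproduces only the base delay; base_retry_delay is an illustrative helper name, not a Rails method:

```ruby
# Base delay Active Job uses for its built-in growing backoff:
# executions**4 + 2 seconds. Rails also adds random jitter, omitted here.
def base_retry_delay(executions)
  (executions**4) + 2
end

# attempts: 5 means one initial run plus up to four retries.
delays = (1..4).map { |n| base_retry_delay(n) }
puts delays.inspect
puts "total time spent waiting: #{delays.sum}s"
```

The four waits come to 3, 18, 83 and 258 seconds, about six minutes in total before jitter. For an alert that is only useful for five minutes, the last retry would land after the message has stopped mattering, which is worth keeping in mind when choosing attempts per notification type.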

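The game-changer was delivery receipts, though neither provider hands you one per message. APNs does not offer them: a success response carries an apns-id and means only that Apple accepted the message, while the apns-push-type and apns-collapse-id headers classify and coalesce notifications rather than confirm them. FCM’s delivery_receipt_requested flag belonged to the legacy XMPP protocol, not the HTTP v1 API, which reports delivery only in aggregate, for example through the BigQuery export. A per-message receipt therefore has to come from the app itself, for example by having the client report the message ID back to your API when a notification arrives.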

We started storing every provider message ID and polling for delivery confirmation. When a receipt never arrived after 24 hours, we’d mark it as “suspected lost” and trigger a fallback channel (email or SMS).
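
The 24-hour rule and the fallback choice can be sketched in a few lines of plain Ruby. Everything here is illustrative: receipt_status, fallback_channel, the :suspected_lost symbol and the "SMS first, then email" preference are assumptions, since the post does not say how the fallback channel is picked:

```ruby
SUSPECTED_LOST_AFTER = 24 * 60 * 60 # seconds, matching the 24-hour window above

# Hypothetical classification of one notification from its timestamps.
def receipt_status(sent_at:, acked_at:, now: Time.now)
  return :delivered if acked_at
  return :suspected_lost if now - sent_at >= SUSPECTED_LOST_AFTER

  :awaiting_receipt
end

# Hypothetical fallback choice: SMS if we have a phone number, else email.
def fallback_channel(user)
  return :sms if user[:phone]
  return :email if user[:email]

  nil
end

sent = Time.at(1_700_000_000)
status = receipt_status(sent_at: sent, acked_at: nil, now: sent + 25 * 3600)
puts status
puts fallback_channel({ email: "a@example.com" }) if status == :suspected_lost
```

Keeping "suspected lost" separate from "failed" matters for the metrics: one is a provider telling you no, the other is silence.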

The Art of the Receipt Reconciliation Loop

Imagine a background worker that runs every hour:

# app/jobs/reconcile_notifications_job.rb
class ReconcileNotificationsJob < ApplicationJob
  def perform
    Notification.sent_to_provider
      .where("sent_at < ?", 1.hour.ago)
      .find_each do |notification|

      # PushProvider.status is our own wrapper; FCM and APNs expose no per-message status endpoint
      status = PushProvider.status(notification.provider_message_id)

      case status
      when "delivered"
        notification.update!(state: :delivered_to_device, delivered_at: Time.current)
      when "failed", "expired"
        notification.update!(state: :failed, error: status)
      when "pending"
        # keep waiting, but log a metric
        Metrics.push_delivery_latency.observe(Time.current - notification.sent_at)
      end
    end
  end
end

This loop is the spiral. It doesn’t assume success. It asks the provider, repeatedly, like a worried parent texting “did you get my last text?”
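
What the status lookup reads from is not shown here. One plausible backing store, assuming the app reports its own acknowledgements, is a ledger keyed by provider message ID. The sketch below is an in-memory stand-in; ReceiptLedger and its method names are hypothetical, and treating an unknown ID as "failed" is a design choice rather than anything the providers dictate:

```ruby
# Hypothetical in-memory stand-in for the table a status lookup could read from.
class ReceiptLedger
  def initialize
    @entries = {}
  end

  # Called when the provider accepts a message.
  def record_accepted(message_id, at:, ttl:)
    @entries[message_id] ||= { accepted_at: at, ttl: ttl, acked_at: nil }
  end

  # Called when the app reports that the notification arrived.
  # Duplicate acks keep the first timestamp, so the call is idempotent.
  def record_ack(message_id, at:)
    entry = @entries[message_id]
    return false unless entry

    entry[:acked_at] ||= at
    true
  end

  # Returns the same strings the reconciliation job switches on.
  def status(message_id, now:)
    entry = @entries[message_id]
    return "failed" unless entry
    return "delivered" if entry[:acked_at]
    return "expired" if now - entry[:accepted_at] > entry[:ttl]

    "pending"
  end
end

ledger = ReceiptLedger.new
t0 = Time.at(1_700_000_000)
ledger.record_accepted("msg-1", at: t0, ttl: 300)
ledger.record_accepted("msg-2", at: t0, ttl: 300)
ledger.record_ack("msg-1", at: t0 + 4)

puts ledger.status("msg-1", now: t0 + 60)
puts ledger.status("msg-2", now: t0 + 60)
puts ledger.status("msg-2", now: t0 + 600)
```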

The Human Layer: What Users Actually Experience

Here’s the part that separates senior devs from juniors. Delivery guarantees aren’t just bytes—they’re emotions.

A push notification that arrives 6 hours late for a “your food is ready” alert? That’s not a notification. That’s a cold dinner and a one-star review.

So we added time-to-live (TTL) for every message:

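# TTLs are in seconds. The provider wrapper must translate them:
# APNs expects an absolute UNIX timestamp in the apns-expiration header,
# FCM HTTP v1 expects android.ttl as a duration string such as "300s".

# For time-sensitive alerts
urgent_payload = {
  apns: { expiry: 300 }, # 5 minutes
  fcm: { time_to_live: 300 }
}

# For marketing (who cares if it's late)
marketing_payload = {
  apns: { expiry: 86400 }, # 1 day
  fcm: { time_to_live: 86400 }
}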

And we taught product managers the phrase: “If the message isn’t relevant after X minutes, don’t send it at all.”
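
Two details are easy to get wrong here: the providers do not take the TTL in the same shape, and the backend should apply the same rule before it even sends. A small sketch, assuming a relative TTL in seconds as input; provider_expiry and still_relevant? are illustrative helper names, not part of any gem:

```ruby
# apns-expiration is an absolute UNIX timestamp in seconds;
# FCM HTTP v1 takes android.ttl as a duration string such as "300s".
def provider_expiry(ttl_seconds, now: Time.now)
  {
    apns_expiration: now.to_i + ttl_seconds,
    fcm_android_ttl: "#{ttl_seconds}s"
  }
end

# True while the message is still worth sending.
def still_relevant?(created_at, ttl_seconds, now: Time.now)
  now - created_at < ttl_seconds
end

created = Time.at(1_700_000_000)
p provider_expiry(300, now: created)
p still_relevant?(created, 300, now: created + 120)
p still_relevant?(created, 300, now: created + 600)
```

Calling still_relevant? at the top of the send job covers the case where the message sat in your own queue longer than its TTL, which no provider setting can catch.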

We also built a dashboard (just a Rails view with charts) showing:

Sent to provider rate
Delivery receipt rate (actual device ack)
Median latency per provider
Token invalidation rate per OS version
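
The first three numbers on that dashboard reduce to a group-by and a median. The sketch below uses made-up rows in plain Ruby; the row shape and the provider_stats helper are assumptions for illustration, and in a Rails app the same figures would come from queries on the notifications table:

```ruby
# Hypothetical rows, one per notification; latency is seconds from send to ack.
ROWS = [
  { provider: :fcm,  sent: true,  acked: true,  latency: 2.0 },
  { provider: :fcm,  sent: true,  acked: false, latency: nil },
  { provider: :fcm,  sent: true,  acked: true,  latency: 40.0 },
  { provider: :apns, sent: true,  acked: true,  latency: 1.0 },
  { provider: :apns, sent: false, acked: false, latency: nil }
].freeze

def median(values)
  return nil if values.empty?

  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Per-provider sent rate, receipt rate and median ack latency.
def provider_stats(rows)
  rows.group_by { |r| r[:provider] }.transform_values do |group|
    sent = group.count { |r| r[:sent] }
    acked = group.count { |r| r[:acked] }
    {
      sent_rate: sent.fdiv(group.size),
      receipt_rate: sent.zero? ? 0.0 : acked.fdiv(sent),
      median_latency: median(group.map { |r| r[:latency] }.compact)
    }
  end
end

provider_stats(ROWS).each { |provider, stats| puts "#{provider}: #{stats}" }
```

Median rather than mean is deliberate: one message delivered the next morning would otherwise dominate the latency figure.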

When we showed that to the CEO, he stopped asking why users missed messages. He started asking why Android 13 had a 12% higher drop rate than iOS 17. (Spoiler: battery optimizations.)

The One Thing That Still Hurts

Even with all this, push notifications are not guaranteed. A phone in a Faraday cage (elevator, basement, airplane) gets nothing until it reconnects, and nothing at all if the TTL runs out first. A user who disabled notifications at the OS level: we can’t fix that. A carrier that drops the connection between FCM and the device: we can’t even detect it.

What we can guarantee is observability and fallback.

For every push notification we send, we also create an in-app inbox message. When the user opens the app, they see everything they missed. The push becomes a hint, not the source of truth.
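
The ordering is the important part: write the inbox entry first, then attempt the push, so a failed push can never lose the message. A minimal in-memory sketch follows. The Inbox class and its methods are hypothetical, and the block stands in for whatever enqueues the real push:

```ruby
# Hypothetical in-memory inbox; in a Rails app this would be a table.
class Inbox
  Message = Struct.new(:id, :body, :read, keyword_init: true)

  def initialize
    @messages = Hash.new { |hash, user_id| hash[user_id] = [] }
    @next_id = 0
  end

  # Persist first, then attempt the push. The block stands in for the push
  # call; if it raises, the inbox entry is already stored.
  def notify(user_id, body)
    message = Message.new(id: (@next_id += 1), body: body, read: false)
    @messages[user_id] << message
    begin
      yield message if block_given?
    rescue StandardError => e
      warn "push failed, inbox entry kept: #{e.message}"
    end
    message
  end

  def unread(user_id)
    @messages[user_id].reject(&:read)
  end

  def mark_read(user_id, message_id)
    message = @messages[user_id].find { |m| m.id == message_id }
    message.read = true if message
  end
end

inbox = Inbox.new
inbox.notify(42, "Your order shipped!") { |_message| raise "provider timeout" }
inbox.notify(42, "Flash sale starts now")
puts inbox.unread(42).map(&:body).inspect
```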

And we stopped apologizing for the platform’s limits. We started explaining them. In the app’s settings: “Push notifications are best-effort. Check your in-app inbox for everything.”

The Masterpiece Isn’t Perfect Delivery—It’s Honest Failure

That 3 a.m. incident taught me: delivery guarantees are a myth. But delivery transparency is achievable. And users will forgive a lost notification if your app gives them another way to find the information.

So build the spiral. Poll for receipts. Log the latency. Have a fallback. And when someone asks “can you guarantee 100% delivery?”, smile and say: “No. But I can tell you exactly when and why each one failed, and I can try again smarter.”

That’s the art. That’s the Rails way.