Andreas Hatlem
Building Notification Infrastructure at Scale Is a Trap: Why Your Team Will Regret Rolling Their Own

It starts innocently. A product manager asks for email notifications when a user signs up. A backend engineer adds a sendEmail() call after the registration handler. It works. It ships. Everyone moves on.

Three months later, the feature request list looks like this: send order confirmations via SMS, push a notification when a delivery is nearby, alert admins via Slack when a payment fails, batch a daily digest for inactive users, and let marketing send a promotional SMS to 200,000 opted-in customers on Black Friday.

The engineer who wrote sendEmail() is now staring at a queue system, a channel router, a retry engine, a preference center, an opt-out compliance layer, and a rate limiter — none of which exist. The original ten lines of code have metastasized into an infrastructure project that will consume the next six months of a team's roadmap.

This is not a contrived scenario. This is the default trajectory of notification systems in growing SaaS companies. And the reason it catches teams off guard is that sending a single message is trivially easy, while sending millions of messages reliably across multiple channels is one of the hardest distributed systems problems in application development.

The Iceberg Under "Just Send a Notification"

When engineers estimate notification work, they tend to scope the visible part: formatting a message, calling an API, maybe storing a record. The invisible part — the part that consumes 90% of the engineering effort — includes all of the following.

Message Queuing and Backpressure

Synchronous notification sending breaks the moment you need to send more than a handful of messages. A user action that triggers notifications to 10,000 followers cannot block the HTTP response for 45 seconds while those messages are dispatched.

You need a queue. But not just any queue — you need a queue that handles:

  • Priority levels. A password reset SMS must go out in under 5 seconds. A weekly digest can wait. If both are in the same FIFO queue, the digest backlog will delay the password reset.
  • Backpressure management. When a marketing campaign drops 500,000 messages into the queue at once, the system needs to throttle processing to avoid overwhelming downstream providers and blowing through rate limits.
  • Dead letter handling. Messages that fail after N retries need to go somewhere observable, not vanish silently.
  • Ordering guarantees. Some notification sequences are order-dependent. A "your order shipped" notification arriving before "your order was confirmed" confuses users.
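The priority-lane requirement above can be sketched with a minimal in-memory queue — higher-priority lanes always drain first, so a digest backlog can never delay a password reset. This is an illustrative sketch, not a production queue: the lane names and `NotificationQueue` class are invented for the example, and a real system would back this with Redis or SQS.

```typescript
type Priority = "critical" | "normal" | "bulk";

interface QueuedMessage {
  id: string;
  priority: Priority;
  payload: string;
}

class NotificationQueue {
  private lanes: Record<Priority, QueuedMessage[]> = {
    critical: [],
    normal: [],
    bulk: [],
  };

  enqueue(msg: QueuedMessage): void {
    this.lanes[msg.priority].push(msg);
  }

  // Always drain the highest-priority non-empty lane first.
  dequeue(): QueuedMessage | undefined {
    for (const lane of ["critical", "normal", "bulk"] as Priority[]) {
      const msg = this.lanes[lane].shift();
      if (msg) return msg;
    }
    return undefined;
  }
}
```

Even this toy version makes the FIFO problem visible: with a single shared lane, the bulk message enqueued first would block the critical one.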

Most teams start with Redis + Bull or a basic SQS setup. Within a year, they are maintaining a custom priority queue system with multiple consumer groups, retry policies per channel, and a monitoring dashboard they built from scratch because the default metrics were insufficient.

Delivery Guarantees: At-Least-Once Is Harder Than You Think

"Fire and forget" works for logging. It does not work for notifications. When a user expects an SMS with a two-factor code, that message must arrive. When it does not, the user cannot log in, and your support queue fills up.

Achieving reliable delivery means handling:

  • Provider outages. Your SMS provider will go down. Not if, when. Do you have automatic failover to a secondary provider? How do you avoid sending the same message twice during the failover window?
  • Idempotency. If a queue worker crashes after sending the message but before acknowledging it, the queue will redeliver. Without idempotency keys, the user gets duplicate messages. Duplicate marketing SMSes annoy users. Duplicate transactional messages (especially those with one-time codes) break workflows.
  • Delivery receipts. SMS providers accept your API call, but that does not mean the message was delivered. Carriers can reject messages, phone numbers can be invalid, devices can be unreachable. Tracking actual delivery requires webhook ingestion, status reconciliation, and retry logic that accounts for "soft bounce" vs "hard bounce" distinctions.
  • Cross-channel fallback. If a push notification is not acknowledged within 60 seconds, should the system fall back to SMS? If the SMS bounces, should it try email? This kind of cascading delivery logic sounds simple in a product spec and is genuinely difficult to implement without race conditions.
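The idempotency point above can be made concrete with a small sketch: each message carries a stable key, and the worker skips keys it has already processed, so a queue redelivery after a crash does not produce a duplicate send. The `dispatch` function and key format are illustrative. A real implementation needs an atomic check-and-set in a persistent store (e.g. Redis with a TTL), and must decide whether to record the key before or after the provider call — record-before risks a lost message on crash, record-after risks a duplicate.

```typescript
// In-memory stand-in for a persistent idempotency-key store.
const sentKeys = new Set<string>();

function dispatch(
  idempotencyKey: string,
  send: () => void
): "sent" | "duplicate" {
  if (sentKeys.has(idempotencyKey)) return "duplicate"; // redelivery detected
  sentKeys.add(idempotencyKey);
  send(); // call the SMS/push/email provider here
  return "sent";
}
```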

Rate Limiting at Multiple Levels

Rate limiting notifications is not a single problem. It is at least four distinct problems:

Provider rate limits. Every SMS gateway, push notification service, and email provider has throughput limits. Twilio, for example, has per-number sending rates. Exceed them and messages get queued on their end (with unpredictable latency) or rejected outright. Your system needs to know these limits and pace itself accordingly.

Carrier rate limits. Even if your SMS provider accepts messages at 100/second, individual carriers throttle traffic. US carriers in particular have implemented 10DLC registration requirements and throughput caps per campaign. Sending above the allowed rate results in messages being silently filtered — not rejected, filtered. You will not even get an error.

User-level rate limits. No user should receive 15 push notifications in an hour, regardless of how many events triggered them. This requires per-user, per-channel rate tracking — typically a sliding window counter in Redis. But what happens when the rate limit is hit? Do you drop the notification? Batch it into a digest? Queue it for later? Each choice has implementation cost.
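The sliding-window counter mentioned above can be sketched as follows. This in-memory version keeps a timestamp list per user/channel pair; the Redis-backed equivalent would typically use a sorted set per key. The class name and limits are illustrative.

```typescript
class UserRateLimiter {
  private events = new Map<string, number[]>();

  constructor(private limit: number, private windowMs: number) {}

  // Returns true if the notification may be sent now; the caller
  // decides what to do on false (drop, batch into a digest, delay).
  allow(userId: string, channel: string, now: number): boolean {
    const key = `${userId}:${channel}`;
    const cutoff = now - this.windowMs;
    // Keep only events still inside the sliding window.
    const recent = (this.events.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.events.set(key, recent);
      return false;
    }
    recent.push(now);
    this.events.set(key, recent);
    return true;
  }
}
```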

Cost rate limits. SMS is not free. An international SMS can cost $0.05-0.15 per segment. A bug that sends 1 million unintended messages can generate a five-figure bill in hours. You need circuit breakers on spend, not just throughput.
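A spend circuit breaker of the kind described can be as simple as a running total with a hard cap — the sketch below uses integer cents to avoid floating-point drift. The class and cap are illustrative; a real system would track spend per campaign and per day and alert before tripping.

```typescript
class SpendBreaker {
  private spentCents = 0;

  constructor(private capCents: number) {}

  // Returns false (and sends nothing) once the cap would be exceeded.
  trySend(costCents: number): boolean {
    if (this.spentCents + costCents > this.capCents) return false;
    this.spentCents += costCents;
    return true;
  }
}
```

Unlike a throughput limiter, this breaker never resets on its own: a runaway bug hits the cap and stops, instead of generating a five-figure bill overnight.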

Multi-Channel Orchestration

Once you support more than one channel — and you will, because users expect it — you need an orchestration layer that decides:

  • Which channel(s) to use for a given notification type
  • Whether to send to multiple channels simultaneously or cascade with fallbacks
  • How to deduplicate across channels (if a user read the push notification, skip the email)
  • How to respect per-channel user preferences (user wants SMS for urgent alerts but email for marketing)
  • How to handle channel-specific formatting (SMS has 160-character segments, push has title/body/image, email has HTML templates)
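A first cut of this orchestration decision might look like the sketch below: a default fallback cascade per notification type, filtered by the user's per-channel preferences. The notification types, cascade orders, and `planDelivery` function are invented for illustration — and as the next paragraph notes, real versions of this logic rarely stay this small.

```typescript
type Channel = "push" | "sms" | "email";

interface UserPrefs {
  enabled: Channel[]; // channels the user has opted into
}

// Default fallback order per notification type (illustrative).
const cascades: Record<string, Channel[]> = {
  "2fa-code": ["sms"],
  "order-update": ["push", "sms", "email"],
  "marketing": ["email"],
};

// Returns the ordered list of channels to attempt, honoring prefs.
function planDelivery(type: string, prefs: UserPrefs): Channel[] {
  const cascade = cascades[type] ?? ["email"];
  return cascade.filter((c) => prefs.enabled.includes(c));
}
```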

This is where many homegrown systems become unmaintainable. The orchestration logic starts as a switch statement, grows into a configuration file, then becomes a rules engine that nobody fully understands. Each new channel or notification type adds combinatorial complexity.

Opt-Out Compliance Is Not Optional

Regulatory compliance around messaging is strict and the penalties are real.

SMS in the US: The Telephone Consumer Protection Act (TCPA) allows statutory damages of $500-$1,500 per unsolicited text message. Class action lawsuits under TCPA routinely result in settlements exceeding $10 million. You must maintain opt-out lists, honor STOP keywords in real time, and include opt-out instructions in every marketing message.

SMS in the EU: GDPR applies to SMS just as it applies to email. You need documented consent, purpose limitation, and the ability to delete all notification records for a given user on request.

Push notifications: Apple and Google can revoke your push certificate if abuse is detected. Uninstalled apps generate invalid tokens that must be pruned.

Email: CAN-SPAM, CASL, GDPR. Unsubscribe links must work. Suppression lists must be honored across all sending systems, not just the one where the user clicked unsubscribe.

Building a compliance layer means maintaining a real-time suppression system that every outbound path checks before sending. It means audit logs. It means consent records with timestamps. It means an unsubscribe handler that works across channels. Miss any of this and you are exposed to legal risk that dwarfs the engineering cost of getting it right.
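The core of that suppression system can be sketched in a few lines: STOP replies add a timestamped entry, and every outbound path checks the list before sending. This is a minimal illustration of the shape, not a full compliance implementation — names and structure are invented, and a real system needs persistent storage, audit export, and cross-channel unsubscribe handling.

```typescript
interface SuppressionEntry {
  channel: string;
  reason: string;
  optedOutAt: number; // epoch ms, retained as an audit record
}

const suppressions = new Map<string, SuppressionEntry>();

// Inbound SMS handler: honor the STOP keyword in real time.
function handleInboundSms(from: string, body: string, now: number): void {
  if (body.trim().toUpperCase() === "STOP") {
    suppressions.set(`${from}:sms`, {
      channel: "sms",
      reason: "STOP keyword",
      optedOutAt: now,
    });
  }
}

// Every outbound path must call this before sending.
function maySend(recipient: string, channel: string): boolean {
  return !suppressions.has(`${recipient}:${channel}`);
}
```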

Template Management and Personalization

At scale, notification content is not hardcoded. You need:

  • Template storage with versioning (so rolling back a bad template change does not require a deployment)
  • Variable interpolation that handles missing data gracefully (a notification that says "Hello {{name}}" when name is null looks broken)
  • Localization across languages and regions
  • Channel-specific rendering (the same notification looks different as an SMS vs. an email vs. a push)
  • Preview and testing so non-engineers can draft messages without deploying code

Many teams start by embedding templates in the codebase. Every change requires a PR, a review, a merge, and a deployment. Marketing hates this. The team builds a template editor. Then they need variable validation, a preview system, approval workflows, and A/B testing support. Another quarter gone.

Observability and Debugging

When a user says "I never got the notification," you need to answer:

  • Was the notification triggered?
  • Did it enter the queue?
  • Which channel was selected?
  • Was the user rate-limited or suppressed?
  • Did the provider accept the request?
  • What was the delivery status?
  • If it failed, what was the error?

This requires end-to-end tracing from the triggering event through every decision point to the final delivery status. Most teams bolt on logging after the fact, resulting in fragmented logs across services that require manual correlation. Building proper notification observability from the start is a project in itself.
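One way to build that tracing in from the start is a lifecycle event trail keyed by notification id, where every stage appends a timestamped record — so "I never got it" becomes a single lookup instead of cross-service log correlation. The stage names below mirror the checklist above; the data shapes are illustrative.

```typescript
type Stage =
  | "triggered"
  | "enqueued"
  | "channel_selected"
  | "suppressed"
  | "provider_accepted"
  | "delivered"
  | "failed";

interface LifecycleEvent {
  stage: Stage;
  at: number; // epoch ms
  detail?: string; // e.g. failure reason from the provider
}

const trail = new Map<string, LifecycleEvent[]>();

function record(id: string, stage: Stage, at: number, detail?: string): void {
  const events = trail.get(id) ?? [];
  events.push({ stage, at, detail });
  trail.set(id, events);
}

function lastStatus(id: string): LifecycleEvent | undefined {
  const events = trail.get(id);
  return events?.[events.length - 1];
}
```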

The Real Cost: It Is Not Just Engineering Hours

The direct engineering cost of building notification infrastructure is significant — typically 6-12 months of a senior backend engineer's time to build something production-grade, plus ongoing maintenance. But the indirect costs are larger:

Opportunity cost. Every sprint spent on notification plumbing is a sprint not spent on your actual product. If your business is e-commerce, the notification system does not differentiate you. Your recommendation engine does. If your business is SaaS, users do not choose you because your SMS delivery is slightly faster. They choose you because your core product solves their problem.

Incident cost. Notification systems fail in ways that directly impact users. A delayed password reset SMS means a locked-out user. A missing order confirmation means a support ticket. A duplicate marketing blast means angry customers and potential TCPA exposure. Every outage in your homegrown system is an incident your on-call engineer handles instead of sleeping.

Scaling cost. The system that works for 10,000 users does not work for 1,000,000 users. Queue depths change. Provider contracts change. Compliance requirements change. Each order-of-magnitude growth forces a partial rewrite.

What a Purpose-Built Platform Handles for You

A dedicated mass messaging and notification platform eliminates the entire infrastructure layer described above. Instead of building queue systems, provider failover, rate limiters, compliance engines, and observability pipelines, your team integrates with a single API and configures behavior through a dashboard.

Specifically, a platform like Sendriot provides:

  • Multi-channel dispatch (SMS, push, email) through a unified API, with channel selection logic and fallback cascades handled by the platform
  • Built-in message queuing with priority levels, backpressure management, and dead letter visibility
  • Provider-level rate limiting that automatically paces messages to stay within carrier and gateway limits
  • User-level throttling to prevent notification fatigue, with configurable windows per channel
  • Delivery tracking with per-message status, provider receipts, and failure reasons accessible via API and dashboard
  • Opt-out management with real-time suppression, STOP keyword handling, and compliance-ready audit logs
  • Template management with variable interpolation, localization, and channel-specific rendering
  • Bulk sending capable of dispatching to hundreds of thousands of recipients with managed throughput

The integration surface for your application shrinks from "build and maintain a distributed notification system" to "call an API endpoint with the recipient, channel, and content."
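To make that contrast concrete, here is roughly what such a call might look like against a generic unified notification API. Everything here — the field names, the `order-shipped` template, the request shape — is hypothetical for illustration; it is not Sendriot's documented API.

```typescript
interface NotificationRequest {
  recipient: { userId: string; phone?: string; email?: string };
  channels: ("sms" | "push" | "email")[]; // attempted in fallback order
  template: string;
  variables: Record<string, string>;
  idempotencyKey: string; // lets the platform deduplicate retries
}

// Hypothetical helper: everything the app needs to provide fits in
// one small request object; queuing, pacing, fallback, and
// compliance checks happen on the platform side.
function buildOrderShippedRequest(
  userId: string,
  orderId: string
): NotificationRequest {
  return {
    recipient: { userId },
    channels: ["push", "sms"],
    template: "order-shipped",
    variables: { orderId },
    idempotencyKey: `order-shipped:${orderId}`,
  };
}
```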

When Building In-House Makes Sense (It Rarely Does)

There are legitimate cases for building notification infrastructure internally:

  • Messaging is your core product. If you are building a communications platform, notifications are not infrastructure — they are the product. You need full control.
  • Extreme latency requirements. Sub-100ms delivery guarantees for specific use cases (financial trading alerts, for example) may require custom infrastructure optimized for that specific path.
  • Regulatory constraints. Some industries (healthcare, government) have data residency or vendor restrictions that limit third-party options.

For the vast majority of SaaS applications, e-commerce platforms, and mobile apps, the notification system is a supporting function. It needs to be reliable, compliant, and observable, but it does not need to be custom-built.

The Decision Framework

Ask your team three questions:

  1. Is notification delivery a core differentiator for our product? If the answer is no, do not build it.
  2. Do we have the team capacity to build and maintain a distributed messaging system indefinitely? Not just build — maintain. Provider APIs change. Carrier regulations evolve. Compliance requirements tighten. This is not a build-once project.
  3. What is the cost of getting it wrong? A bug in your notification system can send duplicate messages to your entire user base, violate TCPA, or silently drop critical transactional messages. The blast radius is large.

If the answer to question one is no, and the honest answer to question two is "not without significant tradeoffs," then using a platform is the correct engineering decision. It is not a shortcut — it is the same reasoning that leads teams to use managed databases instead of running Postgres on bare metal.

Notifications are infrastructure. Treat them that way.

If your team is building notification infrastructure and realizing the scope is larger than expected, Sendriot handles mass SMS, push notifications, and multi-channel messaging at scale — so your engineers can focus on your actual product instead of building queue systems and compliance engines.
