How hard can it be?
I work at a property management SaaS company. Landlords use the platform to manage their properties, collect rent, track maintenance requests, handle lease contracts. Four developers, one product.
Small team means you don't get tickets scoped down to a single component, you get features, end to end. I joined as a backend developer but quickly ended up touching everything: APIs, frontend, infra, and eventually mobile. That's just how it works when there are four of you.
One day, a senior developer pinged me. The product needed a notification center. Users were asking for it: when a settlement gets transferred, when a maintenance request comes in, when a lease is about to expire. These are things landlords need to act on. Notifications made sense. The ask was simple. The notification system design turned out not to be.
I didn't just start coding. I went to the whiteboard, mapped out the flow, discussed edge cases with the senior developer, made sure I understood what we actually needed. I designed for the problem I was given. What I didn't do was pressure-test whether that problem was the complete one.
The design I was proud of
The first question I had to answer before writing a single line of code: how does the client get new notifications?
The standard answer is WebSockets. Open a persistent connection, server pushes events in real time. One of my teammates suggested exactly that. But here's the problem.
We run on Cloud Run — Google's serverless platform. Serverless means instances spin up and down. Persistent connections and abrupt disconnections become your problem to manage. On top of that, did our users actually need real-time notifications? A landlord getting notified about a rent settlement doesn't need to know in under a second. Five to ten seconds is fine. Polling was cheaper, easier to scale, and good enough for what we needed. I had a clear answer for why, and that mattered: going against the standard approach means you'd better be able to defend it.
Here's how the full architecture looked:

Any service that wants to send a notification drops a job into BullMQ. A worker picks it up, writes to PostgreSQL, then pushes the notification ID into Redis. These two operations are sequential, not bundled. If the Redis push fails, it retries only that step; the DB write already happened and doesn't get touched again. One queue, one worker, two distinct responsibilities in sequence. The DB-generated notification ID becomes the Redis key. That ordering matters: you need the ID before you can cache it.

On the read side, the client sends its last delivered notification ID. Server checks Redis — anything with an ID greater than that for this user? Return it. The client owns its cursor. No read/unread state to manage server-side. Stateless on the server, clean on the client.
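To make the two-step write and the cursor read concrete, here's a minimal sketch of the worker and the poll handler. It assumes BullMQ, ioredis, and pg; the table name, queue name, and the per-user sorted set (notifs:<userId>) are illustrative, not our exact schema. A sorted set scored by notification ID is one way to answer "anything newer than the client's cursor?" cheaply.

import { Worker } from "bullmq";
import Redis from "ioredis";
import { Pool } from "pg";

const redis = new Redis();
const db = new Pool();

// Step 1: persist to Postgres. Step 2: cache the new ID for the polling endpoint.
// In the real system a failed Redis push is retried without re-running the insert;
// that idempotency handling is omitted in this sketch.
const worker = new Worker(
  "notifications",
  async (job) => {
    const { userId, title, body } = job.data;
    const { rows } = await db.query(
      "INSERT INTO notifications (user_id, title, body) VALUES ($1, $2, $3) RETURNING id",
      [userId, title, body]
    );
    const id: number = rows[0].id;

    await redis.zadd(`notifs:${userId}`, id, String(id));
    await redis.expire(`notifs:${userId}`, 60 * 60 * 24); // 24-hour TTL
  },
  { connection: { host: "localhost", port: 6379 } }
);

// Read side: the client owns its cursor (the last notification ID it has seen).
async function poll(userId: string, lastNotificationId: number): Promise<string[]> {
  return redis.zrangebyscore(
    `notifs:${userId}`,
    `(${lastNotificationId}`, // exclusive: strictly newer than the cursor
    "+inf"
  );
}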
Two fallbacks handle edge cases. Every 30 minutes, the client forces a direct DB fetch in case Redis missed something. When the browser tab comes back from being suspended, the Page Visibility API triggers another direct DB call. Redis has a 24-hour TTL, but the ID cursor filters stale entries anyway.

There's a subtle attack surface in the fallback design worth thinking through. If a malicious actor always passes db=true, every request bypasses Redis and hits the database directly — constant load, every poll cycle. The fix is server-controlled rate limiting:
// On a direct DB fetch, set a rate-limit key (60-second window)
await redis.set(`last_db_fetch:${userId}`, "1", "EX", 60);

// On every incoming request, check before honoring db=true
const recentFetch = await redis.get(`last_db_fetch:${userId}`);
if (recentFetch) {
  // Ignore client-supplied db=true, serve from Redis
  return serveFromRedis(userId, lastNotificationId);
}
The client no longer controls when the DB gets called — the server does. Simple key, solves the problem cleanly.

The architecture worked. But it had a blind spot I couldn't see from inside it.
The Redis TTL had to stay manually synced with the frontend polling interval — two things in two places, easy to drift. And the biggest problem: every new delivery channel meant building a separate integration from scratch. Email — build it. Mobile push — build it. Slack — build it. I owned all of it indefinitely.

At the time, the brief was web only. So this felt fine. And then it wasn't.
The comment that broke the design
We have weekly meetings. Product updates, priorities, what's coming next. I was in one of those when our CEO mentioned, almost in passing, that we'd be launching a mobile app in a few months.
That was it. That was the comment.
I hadn't thought about it before that moment. But something clicked immediately. Mobile app means push notifications. Push notifications mean FCM integration. Email was also coming — the brief had been expanding quietly in the background. And I was sitting on an architecture I'd spent two days designing and defending — one that handled web delivery and nothing else.
I went to the senior developer after the meeting. Laid it out: if we want email, I build that delivery layer. If we want mobile push, I build that too. If we want Slack someday, same story. Every new channel is a new integration I own, maintain, and debug when it breaks at 2am.

That's not a notification system. That's a delivery platform. Different scope, different problem.

I talked it through with them and made the call. Go third-party. Two days of architecture work had to go: the polling logic, the Redis TTL design, the fallback math. I scrapped it. Not because the thinking was wrong. The technical decisions were sound. But I had designed for a problem that was smaller than the actual one, and the right move was to own that and correct it before building further on a foundation that wouldn't hold.
Two days gone. The alternative was spending the next year maintaining delivery across three channels myself. That math wasn't hard.
The mistake wasn't the architecture. It was not pressure-testing the scope before I started. But catching it at two days instead of six months into a mobile launch — that's the part that mattered.
What I actually evaluated
Going third-party meant finding the right tool. I evaluated against a short list of non-negotiables.
We needed a React Native SDK with a prebuilt inbox component — we weren't going to build our own notification UI on mobile. We needed i18n support without enterprise pricing — we serve an Arabic-speaking market and that's table stakes, not a premium feature. And we needed a single API call that handled all delivery channels, so adding email or push later didn't mean a new integration.
I evaluated a few options. One required building the mobile UI ourselves. One priced i18n behind an enterprise tier. Courier hit every requirement: prebuilt React Native inbox, i18n on all plans, one API call for web, mobile, and email.
One practical note if you're on a similar stack: Courier has no Angular SDK. Our web client is Angular, so I had to inject their web component directly into the DOM. It works, but it's not the same developer experience as a native SDK. Worth factoring in before you commit.
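For what it's worth, the injection looked roughly like the sketch below. This is a hedged illustration, not Courier's documented embed: the script URL and the custom element's tag name are placeholders you'd replace with whatever Courier's web embed actually ships.

import { Component, ElementRef, OnInit } from "@angular/core";

@Component({
  selector: "app-notification-inbox",
  template: "", // the vendor's element is attached imperatively below
})
export class NotificationInboxComponent implements OnInit {
  constructor(private host: ElementRef<HTMLElement>) {}

  ngOnInit(): void {
    // Load the embed script once per page (placeholder URL).
    if (!document.getElementById("courier-embed")) {
      const script = document.createElement("script");
      script.id = "courier-embed";
      script.src = "https://example.com/courier-inbox-embed.js"; // placeholder
      script.async = true;
      document.head.appendChild(script);
    }

    // Mount the web component into a DOM node we control.
    // "courier-inbox" stands in for whatever tag the embed registers.
    const inbox = document.createElement("courier-inbox");
    inbox.setAttribute("user-id", "current-user-id"); // wire to your auth/session service
    this.host.nativeElement.appendChild(inbox);
  }
}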
The new notification system architecture
With Courier handling delivery, the system got significantly simpler.
The flow is straightforward. Any service calls NotificationHubService.create(). The service resolves recipients, checks preferences, saves to PostgreSQL, and enqueues a job to BullMQ. The worker picks it up and sends a single API call to Courier — user ID, title, body, action URL. Courier routes to web inbox, mobile push, or email depending on configuration. The worker then updates the notification record with Courier's request ID for traceability.
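Here's a rough sketch of that worker step, assuming Courier's Node SDK (@trycourier/courier). The queue name, job payload, and the markNotificationSent helper are illustrative; the exact message shape should be checked against Courier's docs.

import { Worker } from "bullmq";
import { CourierClient } from "@trycourier/courier";

const courier = new CourierClient({
  authorizationToken: process.env.COURIER_AUTH_TOKEN,
});

// Hypothetical helper: UPDATE notifications SET courier_request_id = $2 WHERE id = $1
async function markNotificationSent(notificationId: string, requestId: string): Promise<void> {}

const worker = new Worker(
  "notification-hub",
  async (job) => {
    const { notificationId, userId, title, body, actionUrl } = job.data;

    // One call; Courier routes it to web inbox, push, and/or email based on configuration.
    const { requestId } = await courier.send({
      message: {
        to: { user_id: userId },
        content: { title, body },
        data: { actionUrl },
      },
    });

    // Keep Courier's request ID on the notification row for traceability.
    await markNotificationSent(notificationId, requestId);
  },
  { connection: { host: "localhost", port: 6379 } }
);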
That's it. No Redis live queue. No polling. No fallbacks. No TTL drift.

The diagram looks simple because the architecture is simple. The interesting complexity isn't in the flow — it's in the recipient resolution logic, which lives at the code layer, not the architecture layer. And that was a deliberate choice.
The interface supports three modes: explicit user IDs, resource-based access (pass a lease or settlement and the service resolves who has access using the existing RBAC layer), or broadcast to all active account members. The caller doesn't need to know the access rules. Any service can call the same interface with a category and a message. Adding a new notification type is one function call. The architecture is deliberately simple so the logic can be where it needs to be — in the code, not the infrastructure.
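For illustration, the interface could be typed roughly like this in TypeScript. The names are made up, but the three recipient modes mirror what's described above.

// Three ways to say "who receives this": explicit IDs, a resource, or everyone on the account.
type RecipientSpec =
  | { kind: "users"; userIds: string[] }
  | { kind: "resource"; resourceType: "lease" | "settlement" | "maintenance_request"; resourceId: string }
  | { kind: "broadcast"; accountId: string };

interface CreateNotificationInput {
  category: string; // e.g. "lease.expiring"
  title: string;
  body: string;
  actionUrl?: string;
  recipients: RecipientSpec;
}

interface NotificationHub {
  // Resolves recipients (RBAC for the resource mode), checks preferences,
  // persists to Postgres, and enqueues the delivery job. Callers never see the access rules.
  create(input: CreateNotificationInput): Promise<void>;
}

// Any service, one call:
// hub.create({
//   category: "settlement.transferred",
//   title: "Settlement transferred",
//   body: "Your settlement for Unit 4B has been transferred.",
//   recipients: { kind: "resource", resourceType: "settlement", resourceId: settlementId },
// });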
The tradeoff is vendor dependency. If Courier has an outage, notifications go down. That's a real risk we accepted — but for a four-person team, owning delivery reliability ourselves across every channel wasn't a trade we could win.
The question I should have asked first
I spent two days designing a solid architecture. BullMQ queue, Redis cache, polling logic, fallbacks for every edge case. I had reasoned through each decision and could defend them.
What I hadn't done was ask one question: where else do you want to send these notifications?
Business briefs describe what people want today. They don't always describe what the system needs to support tomorrow — not because anyone is hiding it, but because a CEO who mentions a mobile app in a standup isn't thinking about your queue architecture. That's their product, not their system.
The system has three distinct responsibilities: generating notifications, resolving who receives them, and delivering them. The first architecture conflated all three. The new one separates them — generation stays in each service, resolution lives in NotificationHubService, delivery belongs to Courier. Each part is independently replaceable. That separation is also what made the build-vs-buy call clear: delivery is a commodity problem. Recipient resolution, tied to your RBAC model and your business rules, is not. Own what's specific to your domain. Buy what isn't.
If I'd asked that question in the first meeting, none of this would have needed unwinding. I caught it at two days instead of six months into a mobile launch. Next time I start a system design, I know what the first question is.
If you've had a similar "comment that broke the design" moment — when did you catch it? Drop it in the comments. Always curious how other teams pressure-test scope before they commit.
