<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Karan Mali</title>
    <description>The latest articles on DEV Community by Karan Mali (@karan5599).</description>
    <link>https://dev.to/karan5599</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3932996%2F1c0f0b32-572a-4a88-bf48-bf5b4327179b.jpg</url>
      <title>DEV Community: Karan Mali</title>
      <link>https://dev.to/karan5599</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/karan5599"/>
    <language>en</language>
    <item>
      <title>The const enum that took down our payments</title>
      <dc:creator>Karan Mali</dc:creator>
      <pubDate>Thu, 28 May 2026 13:01:29 +0000</pubDate>
      <link>https://dev.to/karan5599/the-const-enum-that-took-down-our-payments-pi8</link>
      <guid>https://dev.to/karan5599/the-const-enum-that-took-down-our-payments-pi8</guid>
      <description>&lt;h2&gt;
  
  
  Three minutes
&lt;/h2&gt;

&lt;p&gt;That's how long I'd sit there every time I changed a file on our server. Three to four minutes for the dev server to rebuild and come back up. Long enough to check Slack, scroll Twitter, and then forget what I was about to do when it finally came back. I was the only one on Linux. The rest of the team was on Mac, and on their machines the dev server came back in twenty seconds. Nobody else felt the pain, so nobody else was looking for a fix. On my machine it was the bottleneck of my whole day. I started batching changes across multiple files so I'd only pay the rebuild cost once, which sounds clever but mostly meant I'd lose track of what I was actually testing. You don't really notice an hour disappearing into that. You just notice you're tired by 4pm and you can't figure out why.&lt;/p&gt;

&lt;p&gt;For a while, I lived with it.&lt;/p&gt;




&lt;h2&gt;
  
  
  A small experiment
&lt;/h2&gt;

&lt;p&gt;I'd used esbuild in some side projects before. It built things in milliseconds. After enough days of staring at a slow rebuild, I started wondering whether it would work for our server too. I tried it locally first. Wrote a small config, pointed it at our entry files, ran it. The build went from minutes to under a second. I ran the dev server. It came up. Things worked.&lt;/p&gt;

&lt;p&gt;I pinged a senior developer. Told him I had esbuild working locally and asked if we should give it a real shot. He said yes, we had fifteen to twenty days before the next production release, plenty of room to catch problems. So I opened a PR. The PR was small: a config file, a couple of package.json scripts, and two enum changes that I'll come back to in a minute. That was it. We talked about it again the next day. He said let's not merge it into dev right now because dev was being stabilised for the upcoming production release. "Use it locally for now, we'll merge it after the release." I said fine. Went back to my work and kept using esbuild on my machine. The PR sat there. A few days later the production release happened. Everything looked fine. I didn't think about the PR. I was busy with my own tickets, and honestly the esbuild thing had already paid off for me personally. I was the only one using it, my local was fast, life was good.&lt;/p&gt;

&lt;p&gt;That weekend, my phone buzzed.&lt;/p&gt;




&lt;h2&gt;
  
  
  One landlord, then five
&lt;/h2&gt;

&lt;p&gt;The first message was from on-call. A landlord couldn't access their payment link. The link was returning something weird, undefined or null, the kind of thing that usually smells like a bad record more than a bad code path. We looked into it. Their data on our end seemed off. We wrote a small script, patched the row, told the landlord we were sorry, and moved on. One landlord with broken data is the kind of thing you can rationalise away. Migrations are messy, maybe a job didn't run, whatever. We've all seen worse.&lt;/p&gt;

&lt;p&gt;The next day, another landlord reported the same thing. Then another. Then another. By the time we had five or six of these, the rationalisation stopped working. Data corruption does not politely affect five different accounts in the same exact way, in the same flow, on the same weekend. Something else was going on. And it was Saturday, and landlords were trying to collect rent, and the payment link, the one piece of the product that absolutely cannot be broken, was the thing that was broken.&lt;/p&gt;

&lt;p&gt;I pulled the flow up on my own machine. It worked. Pulled it up on dev. Reproduced it on the second try. Logs from every affected account hit the same callback handler before things fell over. So this wasn't bad data. This was code. Code that had shipped, code that had passed review, code that none of our tests had caught. That's when I started going through git history.&lt;/p&gt;




&lt;h2&gt;
  
  
  The PR I forgot about
&lt;/h2&gt;

&lt;p&gt;Our tickets that release had nothing to do with payments. Nobody had touched the payment files. Nobody except me, and only in one place: those two enum changes in the PR I thought we hadn't merged yet. I pulled up the git log. There it was. Merged into dev, shipped to prod, sitting there in the release as if it had always been planned. Apparently somewhere between "let's wait for the release" and the release itself, the PR had quietly gone in. I genuinely don't remember how. I don't remember being told. I had stopped tracking it because in my head it wasn't going out yet.&lt;/p&gt;

&lt;p&gt;I opened the commit. Two small lines stood out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- export declare const enum PaymentStatus { ... }
&lt;/span&gt;&lt;span class="gi"&gt;+ export enum PaymentStatus { ... }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I had a feeling, but I wasn't sure. I dropped the diff into Claude and asked what esbuild does with &lt;code&gt;declare const enum&lt;/code&gt;. It told me everything I needed to know in about two paragraphs. I read it twice, then went looking through the actual codebase to confirm, because at this point I didn't really trust my own memory of what I had touched.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the build tool was the bug
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;tsc&lt;/code&gt; and esbuild look like they do the same job. You give them TypeScript, they give you JavaScript back. Except they really don't. &lt;code&gt;tsc&lt;/code&gt; understands types across files. It reads the &lt;code&gt;.d.ts&lt;/code&gt; files for every package you import from. When it sees a &lt;code&gt;declare const enum&lt;/code&gt;, it doesn't generate a runtime object at all; it just inlines the value directly into the call site. So &lt;code&gt;payment.status === PStatus.CAPTURED&lt;/code&gt; compiles down to &lt;code&gt;payment.status === "Captured"&lt;/code&gt;. The enum doesn't exist at runtime. It doesn't need to. esbuild, on the other hand, compiles each file on its own and doesn't read &lt;code&gt;.d.ts&lt;/code&gt; files. That's the whole story, really, but it took me a while to fully appreciate what it meant.&lt;/p&gt;

&lt;p&gt;We had an internal npm package that the server depended on heavily. Its types looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// internal-db-package/.../Payments.types.d.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;declare&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="kr"&gt;enum&lt;/span&gt; &lt;span class="nx"&gt;PStatus&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;PENDING&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Pending&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;CAPTURED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Captured&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And its compiled &lt;code&gt;.js&lt;/code&gt; file looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// internal-db-package/.../Payments.types.js&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;use strict&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// (that's the whole file; declare const enum compiles to nothing)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;tsc&lt;/code&gt; would read the &lt;code&gt;.d.ts&lt;/code&gt;, find the enum, and inline the value into the consumer code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Captured&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;esbuild read the &lt;code&gt;.js&lt;/code&gt;, found nothing, and emitted this instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;CAPTURED&lt;/span&gt;
&lt;span class="c1"&gt;// TypeError: Cannot read properties of undefined (reading 'CAPTURED')&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The build printed a warning. Not an error, a warning. The server still booted. The dev environment behaved normally because that specific code path didn't run in any of our health checks. The only time it actually ran was when a real payment confirmation came back from the provider and the code tried to look up the status against the enum. That's when it crashed. The two enum flips in my PR (&lt;code&gt;PaymentStatus&lt;/code&gt; and &lt;code&gt;InvoiceTrackingEvent&lt;/code&gt;) were just there to get past esbuild's local build errors. I thought I had fixed the problem. What I had actually done was patch the two enums I happened to notice, while leaving every other enum import from that internal package silently broken in production.&lt;/p&gt;

&lt;p&gt;The hotfix itself was tiny. Revert the build script back to &lt;code&gt;tsc&lt;/code&gt;, restore the two &lt;code&gt;declare const enum&lt;/code&gt; keywords, push. Five lines. I had it in within the hour. Total production impact came out to roughly five to six hours, spread across a weekend, affecting somewhere between ten and fifteen landlords depending on how you count the ones who retried successfully on their own. We patched the affected payment records by hand the next morning. None of the actual money was lost, just the link state. Stripe held the charges fine. That was the only mercy in the whole thing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The part I didn't expect
&lt;/h2&gt;

&lt;p&gt;I waited for someone to be angry. Nobody was. A senior developer pinged me after the hotfix went out. No "why did you do this", no "why didn't we test it harder". Just: &lt;em&gt;"You know the fix, just get the PR up for production, we're good."&lt;/em&gt; That was the whole conversation. The story made the rounds internally as a joke, not an indictment. Nobody points at me with it.&lt;/p&gt;

&lt;p&gt;Two months later, with the dust settled, I came back to esbuild. By then I had switched to a Mac and the slow-build pain wasn't personal anymore, but the numbers were still real and the curiosity hadn't gone away. This time I did it differently. I started by actually mapping where the internal package was imported. Dozens of places, some I had no idea existed. The codebase had been growing for years and a few of the original authors were no longer around. That alone was a useful exercise, separate from anything to do with build tools. I now had a list of every file the change could touch. The rule I set for myself was simple. &lt;code&gt;tsc&lt;/code&gt; stays in the production Docker build, untouched. esbuild is only allowed near dev. Production keeps the boring, slow, correct thing. Local gets the fast, fragile thing, with guardrails.&lt;/p&gt;

&lt;p&gt;Then I wrote two small esbuild plugins. The first one rewrites &lt;code&gt;declare const enum&lt;/code&gt; into plain &lt;code&gt;enum&lt;/code&gt; on the way through esbuild. It walks the source, strips the &lt;code&gt;declare&lt;/code&gt; keyword, and lets the enum compile to a real runtime object. The local files now generate the JavaScript that &lt;code&gt;tsc&lt;/code&gt; would have inlined for free. The second plugin was the one that actually mattered. It virtualises the internal package's &lt;code&gt;.types&lt;/code&gt; modules. The internal package's compiled &lt;code&gt;.js&lt;/code&gt; files are empty (because, again, &lt;code&gt;declare const enum&lt;/code&gt; compiles to nothing). The plugin intercepts every import resolving into those &lt;code&gt;.types&lt;/code&gt; paths and substitutes a hand-written module with the same shape and the same values, but as a plain object literal that exists at runtime. The empty &lt;code&gt;.js&lt;/code&gt; files never get loaded again. Whatever esbuild can't infer from &lt;code&gt;.d.ts&lt;/code&gt;, the plugin supplies directly. The dev server rebuild went from 6.6 seconds with &lt;code&gt;tsc&lt;/code&gt; down to 119 milliseconds with esbuild. Roughly fifty-five times faster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tsc      ████████████████████████████████  6.6s
esbuild  ▏                                 119ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To be clear, the 6.6 seconds is the Mac number, not the Linux one. On Linux it had been minutes; on the Mac &lt;code&gt;tsc&lt;/code&gt; was bearable but esbuild was still in a different league. That setup is still running today, three months later, and nobody has noticed it exists, which is the highest compliment a dev tool can get. I also wrote a short doc explaining what the plugins do and why they exist, and pinned it in our engineering channel. Partly so the next person who touches the build doesn't wander into the same trap. Partly so future-me, in six months, doesn't either.&lt;/p&gt;
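
&lt;p&gt;For illustration, here is a minimal sketch of what the two plugins described above could look like. It is not the code from the real repo: the regex-based rewrite (the real plugin walks the source), the package path &lt;code&gt;my-internal-package/Payments.types&lt;/code&gt;, and the &lt;code&gt;virtualTypes&lt;/code&gt; table are assumptions made up for the example. The &lt;code&gt;build&lt;/code&gt; parameter is typed with a small local interface that mirrors the shape of esbuild's &lt;code&gt;onResolve&lt;/code&gt; and &lt;code&gt;onLoad&lt;/code&gt; hooks, so the snippet stands on its own.&lt;/p&gt;

```typescript
import { readFile } from "node:fs/promises";

// Local stand-in for the two esbuild plugin hooks used below
// (assumption: same shape as build.onResolve / build.onLoad in esbuild).
interface BuildLike {
  onResolve(
    options: { filter: RegExp },
    callback: (args: { path: string }) => unknown,
  ): void;
  onLoad(
    options: { filter: RegExp; namespace?: string },
    callback: (args: { path: string }) => unknown,
  ): void;
}

// Plugin 1 helper: turn `declare const enum` into a plain `enum`
// so the enum compiles to a real runtime object.
export function rewriteConstEnums(source: string): string {
  return source.replace(/\bdeclare\s+const\s+enum\b/g, "enum");
}

export const constEnumPlugin = {
  name: "const-enum-to-enum",
  setup(build: BuildLike): void {
    build.onLoad({ filter: /\.ts$/ }, async (args) => {
      const source = await readFile(args.path, "utf8");
      return { contents: rewriteConstEnums(source), loader: "ts" };
    });
  },
};

// Plugin 2: hand-written replacements for the empty `.types` modules.
// The import path and module body here are hypothetical.
export const virtualTypes: { [importPath: string]: string } = {
  "my-internal-package/Payments.types":
    'export const PStatus = { PENDING: "Pending", CAPTURED: "Captured" };',
};

export const virtualTypesPlugin = {
  name: "virtual-types",
  setup(build: BuildLike): void {
    // Redirect known `.types` imports into a virtual namespace...
    build.onResolve({ filter: /\.types$/ }, (args) =>
      args.path in virtualTypes
        ? { path: args.path, namespace: "virtual-types" }
        : undefined,
    );
    // ...and serve the hand-written module instead of the empty .js file.
    build.onLoad({ filter: /.*/, namespace: "virtual-types" }, (args) => ({
      contents: virtualTypes[args.path],
      loader: "js",
    }));
  },
};
```

&lt;p&gt;Both objects would go into the &lt;code&gt;plugins&lt;/code&gt; array of the dev-only esbuild config. The production build never sees them.&lt;/p&gt;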




&lt;h2&gt;
  
  
  In closing
&lt;/h2&gt;

&lt;p&gt;Build tools are replaceable. Good teams aren't.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>performance</category>
      <category>productivity</category>
      <category>typescript</category>
    </item>
    <item>
      <title>From custom polling architecture to one API call: rethinking notification delivery</title>
      <dc:creator>Karan Mali</dc:creator>
      <pubDate>Fri, 15 May 2026 12:23:43 +0000</pubDate>
      <link>https://dev.to/karan5599/notification-system-design-the-question-i-almost-missed-a1f</link>
      <guid>https://dev.to/karan5599/notification-system-design-the-question-i-almost-missed-a1f</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;How hard can it be?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I work at a property management SaaS company. Landlords use the platform to manage their properties, collect rent, track maintenance requests, handle lease contracts. Four developers, one product.&lt;/p&gt;

&lt;p&gt;Small team means you don't get tickets scoped down to a single component, you get features, end to end. I joined as a backend developer but quickly ended up touching everything: APIs, frontend, infra, and eventually mobile. That's just how it works when there are four of you.&lt;/p&gt;

&lt;p&gt;One day, a senior developer pinged me. The product needed a notification center. Users were asking for it, when a settlement gets transferred, when a maintenance request comes in, when a lease is about to expire. These are things landlords need to act on. Notifications made sense. The ask was simple. The notification system design turned out not to be.&lt;/p&gt;

&lt;p&gt;I didn't just start coding. I went to the whiteboard, mapped out the flow, discussed edge cases with the senior developer, made sure I understood what we actually needed. I designed for the problem I was given. What I didn't do was pressure-test whether that problem was the complete one.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The design I was proud of&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The first question I had to answer before writing a single line of code: how does the client get new notifications?&lt;/p&gt;

&lt;p&gt;The standard answer is &lt;strong&gt;WebSockets&lt;/strong&gt;. Open a persistent connection, server pushes events in real time. One of my teammates suggested exactly that. But here is the problem.&lt;/p&gt;

&lt;p&gt;We run on &lt;a href="https://cloud.google.com/run/docs" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;, Google's serverless platform. Serverless means instances spin up and down. Persistent connections and abrupt disconnections become your problem to manage. On top of that, did our users actually need real-time notifications? A landlord getting notified about a rent settlement doesn't need to know in under a second. Five to ten seconds is fine. Polling was cheaper, easier to scale, and good enough for what we needed. I had a clear answer for why, and that mattered, because going against the standard approach means you'd better be able to defend it.&lt;/p&gt;

&lt;p&gt;Here's how the full architecture looked:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2s2qf8nwlc784tb3rwk4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2s2qf8nwlc784tb3rwk4.png" alt="The ‘Impress the Recruiter’ Architecture" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Any service that wants to send a notification drops a job into &lt;a href="https://docs.bullmq.io/" rel="noopener noreferrer"&gt;BullMQ&lt;/a&gt;. A worker picks it up, writes to PostgreSQL, then pushes the notification ID into Redis. These two operations are sequential, not bundled. If the Redis push fails, it retries only that step; the DB write already happened and doesn't get touched again. One queue, one worker, two distinct responsibilities in sequence.&lt;/p&gt;

&lt;p&gt;The DB-generated notification ID becomes the Redis key. That ordering matters: you need the ID before you can cache it.&lt;/p&gt;

&lt;p&gt;On the read side, the client sends its last delivered notification ID. The server checks Redis: is there anything with an ID greater than that for this user? If so, return it. The client owns its cursor. No read/unread state to manage server-side. Stateless on the server, clean on the client.&lt;/p&gt;
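
&lt;p&gt;A minimal in-memory sketch of that cursor-based read path, assuming numeric notification IDs that increase with every insert. The plain object stands in for Redis, and &lt;code&gt;pushNotification&lt;/code&gt; and &lt;code&gt;fetchSince&lt;/code&gt; are hypothetical names, not functions from the real codebase.&lt;/p&gt;

```typescript
// In-memory stand-in for the per-user Redis cache (assumption: IDs are
// numeric and monotonically increasing, as with a serial primary key).
type CachedNotification = { id: number; title: string };

const cache: { [userId: string]: CachedNotification[] } = {};

// Write side: called by the worker after the DB insert has returned the new ID.
function pushNotification(userId: string, notification: CachedNotification): void {
  const list = cache[userId] ?? [];
  list.push(notification);
  cache[userId] = list;
}

// Read side: the client sends the last ID it received; return anything newer.
function fetchSince(userId: string, lastDeliveredId: number): CachedNotification[] {
  return (cache[userId] ?? []).filter((n) => n.id > lastDeliveredId);
}

pushNotification("user-1", { id: 41, title: "Settlement transferred" });
pushNotification("user-1", { id: 42, title: "Maintenance request received" });

// The client already has 41, so only 42 comes back.
const fresh = fetchSince("user-1", 41);
console.log(fresh.map((n) => n.id)); // [ 42 ]
```

&lt;p&gt;Because the cursor lives on the client, the server never has to track which notifications a given user has already seen.&lt;/p&gt;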

&lt;p&gt;Two fallbacks handle edge cases. Every 30 minutes, the client forces a direct DB fetch in case Redis missed something. When the browser tab comes back from being suspended, the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Page_Visibility_API" rel="noopener noreferrer"&gt;Page Visibility API&lt;/a&gt; triggers another direct DB call. Redis has a 24-hour TTL but the ID cursor filters stale entries anyway.&lt;/p&gt;

&lt;p&gt;There's a subtle attack surface in the fallback design worth thinking through. The client asks for a direct DB fetch by passing &lt;code&gt;db=true&lt;/code&gt;. If a malicious actor always passes it, every request bypasses Redis and hits the database directly: constant load, every poll cycle. The fix is server-controlled rate limiting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// On direct DB fetch, set a rate-limit key&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`last_db_fetch:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;EX&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// On every incoming request, check before honoring db=true&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recentFetch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`last_db_fetch:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;recentFetch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Ignore client-supplied db=true, serve from Redis&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;serveFromRedis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;lastNotificationId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client no longer controls when the DB gets called; the server does. Simple key, solves the problem cleanly.&lt;/p&gt;

&lt;p&gt;The architecture worked. But it had a blind spot I couldn't see from inside it.&lt;/p&gt;

&lt;p&gt;The Redis TTL had to stay manually synced with the frontend polling interval, two things in two places, easy to drift. And the biggest problem: every new delivery channel meant building a separate integration from scratch. Email? Build it. Mobile push? Build it. Slack? Build it. I owned all of it indefinitely.&lt;/p&gt;

&lt;p&gt;At the time, the brief was web only. So this felt fine. And then it wasn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The comment that broke the design&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We have weekly meetings. Product updates, priorities, what's coming next. I was in one of those when our CEO mentioned, almost in passing, that we'd be launching a mobile app in a few months.&lt;/p&gt;

&lt;p&gt;That was it. That was the comment.&lt;/p&gt;

&lt;p&gt;I hadn't thought about it before that moment. But something clicked immediately. Mobile app means push notifications. Push notifications mean FCM integration. Email was also coming; the brief had been expanding quietly in the background. And I was sitting on an architecture I'd spent two days designing and defending, one that handled web delivery and nothing else.&lt;/p&gt;

&lt;p&gt;I went to the senior developer after the meeting. Laid it out: if we want email, I build that delivery layer. If we want mobile push, I build that too. If we want Slack someday, same story. Every new channel is a new integration I own, maintain, and debug when it breaks at 2am.&lt;/p&gt;

&lt;p&gt;That's not a notification system. That's a delivery platform. Different scope, different problem.&lt;/p&gt;

&lt;p&gt;I talked it through with them and made the call. Go third-party. Two days of architecture work had to go: the polling logic, the Redis TTL design, the fallback math. I scrapped it. Not because the thinking was wrong. The technical decisions were sound. But I had designed for a problem that was smaller than the actual one, and the right move was to own that and correct it before building further on a foundation that wouldn't hold.&lt;/p&gt;

&lt;p&gt;Two days gone. The alternative was spending the next year maintaining delivery across three channels myself. That math wasn't hard.&lt;br&gt;
The mistake wasn't the architecture. It was not pressure-testing the scope before I started. But catching it at two days instead of six months into a mobile launch — that's the part that mattered.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What I actually evaluated&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Going third-party meant finding the right tool. I evaluated against a short list of non-negotiables.&lt;/p&gt;

&lt;p&gt;We needed a React Native SDK with a prebuilt inbox component — we weren't going to build our own notification UI on mobile. We needed i18n support without enterprise pricing — we serve an Arabic-speaking market and that's table stakes, not a premium feature. And we needed a single API call that handled all delivery channels, so adding email or push later didn't mean a new integration.&lt;/p&gt;

&lt;p&gt;I evaluated a few options. One required building the mobile UI ourselves. One priced i18n behind an enterprise tier. &lt;a href="https://www.courier.com/docs" rel="noopener noreferrer"&gt;Courier&lt;/a&gt; hit every requirement: prebuilt React Native inbox, i18n on all plans, one API call for web, mobile, and email.&lt;/p&gt;

&lt;p&gt;One practical note if you're on a similar stack: Courier has no Angular SDK. Our web client is Angular, so I had to inject their web component directly into the DOM. It works, but it's not the same developer experience as a native SDK. Worth factoring in before you commit.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The new notification system architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;With Courier handling delivery, the system got significantly simpler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6q5s1dd4ie9g65m3y7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6q5s1dd4ie9g65m3y7w.png" alt="When Simplicity Meets the Use Case" width="799" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flow is straightforward. Any service calls &lt;code&gt;NotificationHubService.create()&lt;/code&gt;. The service resolves recipients, checks preferences, saves to PostgreSQL, and enqueues a job to BullMQ. The worker picks it up and sends a single API call to Courier — user ID, title, body, action URL. Courier routes to web inbox, mobile push, or email depending on configuration. The worker then updates the notification record with Courier's request ID for traceability.&lt;/p&gt;

&lt;p&gt;That's it. No Redis live queue. No polling. No fallbacks. No TTL drift.&lt;/p&gt;

&lt;p&gt;The diagram looks simple because the architecture is simple. The interesting complexity isn't in the flow; it's in the recipient resolution logic, which lives at the code layer, not the architecture layer. And that was a deliberate choice.&lt;/p&gt;

&lt;p&gt;The interface supports three modes: explicit user IDs, resource-based access (pass a lease or settlement and the service resolves who has access using the existing RBAC layer), or broadcast to all active account members. The caller doesn't need to know the access rules. Any service can call the same interface with a category and a message. Adding a new notification type is one function call. The architecture is deliberately simple so the logic can be where it needs to be — in the code, not the infrastructure.&lt;/p&gt;
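
&lt;p&gt;The three recipient modes map naturally onto a discriminated union. A minimal sketch follows; every name in it (&lt;code&gt;RecipientSpec&lt;/code&gt;, &lt;code&gt;resolveRecipients&lt;/code&gt;, the lookup callbacks) is hypothetical rather than taken from the real &lt;code&gt;NotificationHubService&lt;/code&gt;, and the lookups are synchronous stubs where a real service would query the RBAC layer asynchronously.&lt;/p&gt;

```typescript
// Hypothetical shape for the three recipient modes described above.
type RecipientSpec =
  | { mode: "explicit"; userIds: string[] }
  | { mode: "resource"; resourceType: "lease" | "settlement"; resourceId: string }
  | { mode: "broadcast"; accountId: string };

// Lookups the resolver depends on. In a real service these would hit the
// RBAC layer and the account membership data (assumed, not shown in the post).
type Lookups = {
  usersWithAccess: (resourceType: string, resourceId: string) => string[];
  activeMembers: (accountId: string) => string[];
};

// The caller only picks a mode; the access rules stay inside the resolver.
function resolveRecipients(spec: RecipientSpec, lookups: Lookups): string[] {
  switch (spec.mode) {
    case "explicit":
      return spec.userIds;
    case "resource":
      return lookups.usersWithAccess(spec.resourceType, spec.resourceId);
    case "broadcast":
      return lookups.activeMembers(spec.accountId);
  }
}

// Usage with stubbed lookups:
const stubLookups: Lookups = {
  usersWithAccess: () => ["landlord-1", "manager-7"],
  activeMembers: () => ["landlord-1", "manager-7", "accountant-3"],
};

const recipients = resolveRecipients(
  { mode: "resource", resourceType: "lease", resourceId: "lease-9" },
  stubLookups,
);
console.log(recipients.length); // 2
```

&lt;p&gt;Adding a fourth mode would mean one new union member and one new &lt;code&gt;case&lt;/code&gt;, and the compiler flags any resolver that forgets to handle it.&lt;/p&gt;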

&lt;p&gt;The tradeoff is vendor dependency. If Courier has an outage, notifications go down. That's a real risk we accepted — but for a four-person team, owning delivery reliability ourselves across every channel wasn't a trade we could win.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The question I should have asked first&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I spent two days designing a solid architecture. BullMQ queue, Redis cache, polling logic, fallbacks for every edge case. I had reasoned through each decision and could defend them.&lt;/p&gt;

&lt;p&gt;What I hadn't done was ask one question: where else do you want to send these notifications?&lt;/p&gt;

&lt;p&gt;Business briefs describe what people want today. They don't always describe what the system needs to support tomorrow. Not because anyone is hiding it, but because a CEO mentioning a mobile app in a weekly meeting isn't thinking about your queue architecture. That's their product, not their system.&lt;/p&gt;

&lt;p&gt;The system has three distinct responsibilities: generating notifications, resolving who receives them, and delivering them. The first architecture conflated all three. The new one separates them — generation stays in each service, resolution lives in &lt;code&gt;NotificationHubService&lt;/code&gt;, delivery belongs to Courier. Each part is independently replaceable. That separation is also what made the build-vs-buy call clear: delivery is a commodity problem. Recipient resolution, tied to your RBAC model and your business rules, is not. Own what's specific to your domain. Buy what isn't.&lt;/p&gt;

&lt;p&gt;If I'd asked that question in the first meeting, none of this would have needed unwinding. I caught it at two days instead of six months into a mobile launch. Next time I start a system design, I know what the first question is.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've had a similar "comment that broke the design" moment — when did you catch it? Drop it in the comments. Always curious how other teams pressure-test scope before they commit.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>systemdesign</category>
      <category>backend</category>
      <category>fullstack</category>
    </item>
  </channel>
</rss>
