<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Karan Mali</title>
    <description>The latest articles on DEV Community by Karan Mali (@karan5599).</description>
    <link>https://dev.to/karan5599</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3932996%2F1c0f0b32-572a-4a88-bf48-bf5b4327179b.jpg</url>
      <title>DEV Community: Karan Mali</title>
      <link>https://dev.to/karan5599</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/karan5599"/>
    <language>en</language>
    <item>
      <title>I Didn't Need a Smarter Model. I Needed to Onboard It</title>
      <dc:creator>Karan Mali</dc:creator>
      <pubDate>Sat, 27 Jun 2026 20:18:49 +0000</pubDate>
      <link>https://dev.to/karan5599/i-didnt-need-a-smarter-model-i-needed-to-onboard-it-27dc</link>
      <guid>https://dev.to/karan5599/i-didnt-need-a-smarter-model-i-needed-to-onboard-it-27dc</guid>
      <description>&lt;p&gt;&lt;em&gt;Why the bottleneck for AI on real tickets isn't intelligence. It's that your brain starts warm and the model starts cold.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Two people get the same ticket. One starts from zero.
&lt;/h2&gt;

&lt;p&gt;When a ticket lands on me, my brain lights up before I've finished reading the title. This service owns that logic. The route is probably over here. That controller needs a new method, and there's a DAO at the end of the chain I'll have to touch. I'm not being clever. I've just been in this codebase long enough that the map is already in my head.&lt;/p&gt;

&lt;p&gt;Hand the same ticket to an AI and it's staring at a blank wall. It has the code. What it doesn't have is the map. It doesn't know our layers, our naming, or that we shipped something almost identical last week. Same ticket, same repo, and one of us starts warm while the other starts cold.&lt;/p&gt;

&lt;p&gt;I'd love to tell you I understood that from the start. I didn't. When I first wired AI into our workflow I assumed what most people assume: the model is smart enough to work the codebase out on its own. Point it at the repo, give it the ticket, get something useful back. That assumption is the exact thing that blew up in my face, and cleaning up the mess is what taught me the whole game is context, not intelligence.&lt;/p&gt;

&lt;p&gt;Here's how naive I was about it at first. For weeks, every ticket started the same way. I'd screenshot the Jira ticket, paste it into Claude, and wait while it dug through our code trying to work out where things lived. Then I'd start coding. I was the integration layer. Copy from Jira, paste into Claude, sit there, repeat. At some point it stopped feeling like using AI and started feeling like a chore I was doing for the AI. The loop was dumb enough that something else should be running it. So I built that something.&lt;/p&gt;

&lt;h2&gt;
  
  
  So I automated it, and it got confidently wrong
&lt;/h2&gt;

&lt;p&gt;The flow itself is simple. A ticket gets assigned, a branch gets cut for it on its own, and an agent writes an implementation plan into a folder we keep just for that. We've got dedicated folders for plans, for feature notes, for the stuff the team keeps coming back to. The agent reads the ticket, searches the code, and drops a plan: what the task is, which files probably change, how to go about it, what's still unclear. Then it links that plan back to the ticket so it's waiting for whoever picks the work up.&lt;/p&gt;

&lt;p&gt;On paper, exactly what I'd been doing by hand. In practice the plans were bad.&lt;/p&gt;

&lt;p&gt;Not bad like broken English. Bad like confidently wrong. The agent had read access across the whole repo and a search tool, so it would poke around, land on some file that looked relevant, and build a plan on top of it. The plans put validation logic in the controller. They put database queries straight in the service layer. Both of those are backwards from how we build. Validation belongs in middleware for us, and every database call lives in the DAO layer, never the service.&lt;/p&gt;

&lt;p&gt;So the plan read like it knew what it was talking about, and it would march a developer right into a rejected PR. Picture a junior trusting one of these. They write the code the plan describes, open the PR, get torn apart in review, then sit there confused about why the "AI plan" sent them the wrong way.&lt;/p&gt;

&lt;p&gt;It was slow on top of that. Every run, the agent rebuilt its picture of our architecture from nothing, because it had nothing to start from. That's the cold start again. I'd automated it without fixing the actual problem. I had handed off my busywork and thrown away the one thing that made the manual version worth doing. I already knew the shape of the code. The agent didn't, and I hadn't given it any way to find out.&lt;/p&gt;

&lt;h2&gt;
  
  
  It was never about how smart the model was
&lt;/h2&gt;

&lt;p&gt;My first instinct was the obvious wrong one. Maybe a better model fixes this. It doesn't. A better model still has no clue that in our repo the DAO is the only place allowed to touch the database. You can't buy that off a pricing page. It isn't intelligence.&lt;/p&gt;

&lt;p&gt;That's when it actually clicked. The problem was never how smart the model was. It started cold every single time, and I started warm. I'd been trying to upgrade the brain when the thing missing was the map in my head, and nobody had ever written that map down. A new hire would hit the exact same wall on day one. The difference is we don't expect a new hire to be brilliant. We expect to onboard them.&lt;/p&gt;

&lt;p&gt;So I stopped trying to make the agent smarter and started onboarding it instead. New hires don't get a bigger brain. They get docs, they get the conventions, and they get someone pointing at the last person's work saying "do it like that." I built the same thing, in layers instead of one giant file.&lt;/p&gt;

&lt;p&gt;There's a thin entry point at the top. Project basics, plus pointers that say "for this kind of work, go read that." The real knowledge isn't in there. Under it sits the core conventions doc, where the architecture is spelled out flat. Controllers call services, services call DAOs, DAOs own the database. Validation in middleware. The naming rules, and the patterns we don't break. Then there's a set of specialized files the agent only opens when it needs them. One for on-call database work, one for the data model, one for translations. It doesn't read those every time. It reaches for the right one when the ticket actually touches that area.&lt;/p&gt;

&lt;p&gt;Then I rewrote the agent's marching orders. Before it searches for anything, it reads the entry point and the core doc. It gets its bearings on the overall structure first. Only then does it go looking in the area the ticket is about. And the instruction that pulled the most weight was the simplest one. Find the closest existing implementation and copy its pattern. That is word for word what I'd tell a junior on day one. Don't invent an approach. Go find something we already built that's close, and follow it.&lt;/p&gt;

&lt;p&gt;The plans turned around almost immediately. Validation showed up in middleware. Queries showed up in the DAO. The model hadn't gotten smarter between Tuesday and Wednesday. I had just written down the rules I'd been carrying in my head and forced it to read them before it touched anything.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhjduuzlkplobffxw5mee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhjduuzlkplobffxw5mee.png" alt="Same ticket, same repo. The only thing that changed is what the agent reads first." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The part most people skip is what comes after that. A memory layer you write once and forget about rots fast. The code moves, conventions shift, and those docs quietly go stale until the agent is following rules that stopped being true months ago. So we baked the upkeep into the workflow itself. When the agent learns something new, or when we ship a feature that changes how an area works, updating the memory is part of the job and not a thing someone remembers to do later. The repo keeps its own docs current. A person's mental model updates as they work. The agent's has to update the same way, or it slides right back to confidently wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Then the plan started arguing back
&lt;/h2&gt;

&lt;p&gt;Once the plans were trustworthy, the thing I didn't plan for turned out to be the best part. The plan stopped being the valuable bit.&lt;/p&gt;

&lt;p&gt;I added a step where the agent grills its own plan after writing it. It walks back through what it wrote the way a senior would in a design review. Where's the ambiguity. What edge case is missing. Which assumption needs a real answer before anyone writes a line. It writes those questions down, with its own recommended answer next to each one.&lt;/p&gt;

&lt;p&gt;And then it holds the line. When I come back to actually build the ticket, the agent won't start writing code until those questions are answered, even when I tell it to just get on with it. It drags the design conversation to the front, the one I'd normally skip and then pay for later in a half-built feature and an ugly PR.&lt;/p&gt;

&lt;p&gt;So the output grew up. It went from "a plan" to "a forced design conversation before a single line exists." The value moved off the answer and onto the questions. That's the piece I'd fight hardest to keep now, and it's the one I never set out to build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Telling QA what actually shipped
&lt;/h2&gt;

&lt;p&gt;The other place this quietly pays off is after the code is done, at deploy time.&lt;/p&gt;

&lt;p&gt;A ticket almost never ships clean on the first try. It goes out, QA finds something, it comes back, gets patched, goes out again. A single ticket can rack up several of those rounds. The messy reality is that a few deploys in, nobody is totally sure what just went out the door. QA opens a ticket and can't tell which points got handled in this release and which are still sitting open. So they re-test the whole thing, or they re-test the wrong thing and miss the part that actually changed.&lt;/p&gt;

&lt;p&gt;Now when a deploy goes out, an agent reads what was in it and leaves a summary on the tickets it touched. Here's what shipped. These items from the ticket are done. These aren't, yet. QA opens it and knows what to check instead of guessing. Small thing. It kills a real, recurring source of confusion, and it's exactly the sort of glue nobody wants to write out by hand on every single release.&lt;/p&gt;

&lt;p&gt;That's the actual throughline. None of this is "AI writes my code." It's AI sitting in the gaps between the tools I already use, the ticket and the codebase and the deploy and the QA pass, carrying the context I used to carry myself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed, and what I skip
&lt;/h2&gt;

&lt;p&gt;This isn't a demo I built once and screenshotted for a standup. We actually use it.&lt;/p&gt;

&lt;p&gt;The real change is in how I work, not in what the AI spits out. I keep three or four tickets moving at the same time now. Each one shows up with its branch already cut and its context already loaded, so I'm not paying the cold-start tax on every one of them. That tax, the "what is this, where does it live, how do we do it here" tax, is exactly what used to pin me to one ticket at a time.&lt;/p&gt;

&lt;p&gt;I'll be straight about where it doesn't pull its weight. For small tickets I ignore the plan and just do the thing. Faster that way. The pipeline isn't sacred, and shoving every tiny ticket through it would be its own kind of waste.&lt;/p&gt;

&lt;p&gt;There was dumber pain along the way too. We started on a plan that handed out an access token that expired, so for a stretch the whole thing kept dying until someone manually pasted a fresh secret back in. Very cutting-edge stuff. We switched to a proper key and it stopped fighting us. A couple of rough edges are still here. The agent finds files mostly by searching for names, which is great for "where's addInvoice" and useless for "where do we handle offline payments," so a smarter search step is the obvious next move. And I still don't measure whether the plans are any good, even though the signal is right there for the taking. Compare the files a merged PR actually changed against the files the plan said it would. Right now "it helps" is a feeling, not a number. That one bugs me.&lt;/p&gt;
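&lt;p&gt;A minimal sketch of that comparison, assuming the plan's predicted files and the merged PR's changed files are both available as lists of paths. The function name and the example paths are hypothetical and not part of the actual pipeline.&lt;/p&gt;

```typescript
// Hypothetical helper: scores a plan's file predictions against a merged PR.
// precision = share of planned files that really changed
// recall    = share of changed files the plan had predicted
function planAccuracy(planned: string[], changed: string[]) {
  // De-duplicate both lists so repeated paths do not skew the ratios.
  const plannedFiles = planned.filter((file, i) => planned.indexOf(file) === i);
  const changedFiles = changed.filter((file, i) => changed.indexOf(file) === i);
  const hits = plannedFiles.filter((file) => changedFiles.indexOf(file) !== -1).length;
  return {
    precision: plannedFiles.length === 0 ? 0 : hits / plannedFiles.length,
    recall: changedFiles.length === 0 ? 0 : hits / changedFiles.length,
  };
}

// Example paths are made up for illustration.
const planScore = planAccuracy(
  ["routes/invoice.ts", "controllers/invoice.ts", "services/invoice.ts", "dao/invoice.ts"],
  ["routes/invoice.ts", "dao/invoice.ts", "middleware/validateInvoice.ts"]
);
console.log(planScore.precision, planScore.recall);
```

&lt;p&gt;Low precision would mean the plan names files nobody touched. Low recall would mean the real change landed somewhere the plan never looked.&lt;/p&gt;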

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;The model was never the bottleneck. Context was.&lt;/p&gt;

&lt;p&gt;I didn't make any of this better by reaching for a bigger model. I made it better by writing down the things I'd tell a new hire. Where validation goes. Where the database calls live. Go copy the nearest thing we've already built. Then forcing the agent to read all of it before it touched the code, and keeping those notes alive as the code kept moving. The boring fix beat the exciting one, and it kept on beating it.&lt;/p&gt;

&lt;p&gt;Once the AI started warm instead of cold, the plan stopped being the point. Every ticket lands with its context already loaded, and that's the only reason I can keep four of them in the air instead of grinding through them one at a time.&lt;/p&gt;

&lt;p&gt;So if you're about to do this where you work, don't start with the model. Start with the context your own brain loads without being asked, the stuff you never had to write down because you already knew it. That is the part the AI is missing. The rest is plumbing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>claude</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The const enum that took down our payments</title>
      <dc:creator>Karan Mali</dc:creator>
      <pubDate>Thu, 28 May 2026 13:01:29 +0000</pubDate>
      <link>https://dev.to/karan5599/the-const-enum-that-took-down-our-payments-pi8</link>
      <guid>https://dev.to/karan5599/the-const-enum-that-took-down-our-payments-pi8</guid>
      <description>&lt;h2&gt;
  
  
  Three minutes
&lt;/h2&gt;

&lt;p&gt;That's how long I'd sit there every time I changed a file on our server. Three to four minutes for the dev server to rebuild and come back up. Long enough to check Slack, scroll Twitter, and then forget what I was about to do when it finally came back. I was the only one on Linux. The rest of the team was on Mac, and on their machines the dev server came back in twenty seconds. Nobody else felt the pain, so nobody else was looking for a fix. On my machine it was the bottleneck of my whole day. I started batching changes across multiple files so I'd only pay the rebuild cost once, which sounds clever but mostly meant I'd lose track of what I was actually testing. You don't really notice an hour disappearing into that. You just notice you're tired by 4pm and you can't figure out why.&lt;/p&gt;

&lt;p&gt;For a while, I lived with it.&lt;/p&gt;




&lt;h2&gt;
  
  
  A small experiment
&lt;/h2&gt;

&lt;p&gt;I'd used esbuild in some side projects before. It built things in milliseconds. After enough days of staring at a slow rebuild, I started wondering whether it would work for our server too. I tried it locally first. Wrote a small config, pointed it at our entry files, ran it. The build went from minutes to under a second. I ran the dev server. It came up. Things worked.&lt;/p&gt;

&lt;p&gt;I pinged a senior developer. Told him I had esbuild working locally and asked if we should give it a real shot. He said yes, we had fifteen to twenty days before the next production release, plenty of room to catch problems. So I opened a PR. The PR was small: a config file, a couple of package.json scripts, and two enum changes that I'll come back to in a minute. That was it. We talked about it again the next day. He said let's not merge it into dev right now because dev was being stabilised for the upcoming production release. "Use it locally for now, we'll merge it after the release." I said fine. Went back to my work and kept using esbuild on my machine. The PR sat there. A few days later the production release happened. Everything looked fine. I didn't think about the PR. I was busy with my own tickets, and honestly the esbuild thing had already paid off for me personally. I was the only one using it, my local was fast, life was good.&lt;/p&gt;

&lt;p&gt;That weekend, my phone buzzed.&lt;/p&gt;




&lt;h2&gt;
  
  
  One landlord, then five
&lt;/h2&gt;

&lt;p&gt;The first message was from on-call. A landlord couldn't access their payment link. The link was returning something weird, undefined or null, the kind of thing that usually smells like a bad record more than a bad code path. We looked into it. Their data on our end seemed off. We wrote a small script, patched the row, told the landlord we were sorry, and moved on. One landlord with broken data is the kind of thing you can rationalise away. Migrations are messy, maybe a job didn't run, whatever. We've all seen worse.&lt;/p&gt;

&lt;p&gt;The next day, another landlord reported the same thing. Then another. Then another. By the time we had five or six of these, the rationalisation stopped working. Data corruption does not politely affect five different accounts in the same exact way, in the same flow, on the same weekend. Something else was going on. And it was Saturday, and landlords were trying to collect rent, and the payment link, the one piece of the product that absolutely cannot be broken, was the thing that was broken.&lt;/p&gt;

&lt;p&gt;I pulled the flow up on my own machine. It worked. Pulled it up on dev. Reproduced it on the second try. Logs from every affected account hit the same callback handler before things fell over. So this wasn't bad data. This was code. Code that had shipped, code that had passed review, code that none of our tests had caught. That's when I started going through git history.&lt;/p&gt;




&lt;h2&gt;
  
  
  The PR I forgot about
&lt;/h2&gt;

&lt;p&gt;Our tickets that release had nothing to do with payments. Nobody had touched the payment files. Nobody except me, and only in one place: those two enum changes in the PR I thought we hadn't merged yet. I pulled up the git log. There it was. Merged into dev, shipped to prod, sitting there in the release as if it had always been planned. Apparently somewhere between "let's wait for the release" and the release itself, the PR had quietly gone in. I genuinely don't remember how. I don't remember being told. I had stopped tracking it because in my head it wasn't going out yet.&lt;/p&gt;

&lt;p&gt;I opened the commit. Two small lines stood out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- export declare const enum PaymentStatus { ... }
&lt;/span&gt;&lt;span class="gi"&gt;+ export enum PaymentStatus { ... }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I had a feeling, but I wasn't sure. I dropped the diff into Claude and asked what esbuild does with &lt;code&gt;declare const enum&lt;/code&gt;. It told me everything I needed to know in about two paragraphs. I read it twice, then went looking through the actual codebase to confirm, because at this point I didn't really trust my own memory of what I had touched.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the build tool was the bug
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;tsc&lt;/code&gt; and esbuild look like they do the same job. You give them TypeScript, they give you JavaScript back. Except they really don't. &lt;code&gt;tsc&lt;/code&gt; understands types across files. It reads the &lt;code&gt;.d.ts&lt;/code&gt; files for every package you import from. When it sees a &lt;code&gt;declare const enum&lt;/code&gt;, it doesn't generate a runtime object at all, it just inlines the value directly into the call site. So &lt;code&gt;payment.status === PStatus.CAPTURED&lt;/code&gt; compiles down to &lt;code&gt;payment.status === "Captured"&lt;/code&gt;. The enum doesn't exist at runtime. It doesn't need to. esbuild doesn't read &lt;code&gt;.d.ts&lt;/code&gt; files. That's the whole story, really, but it took me a while to fully appreciate what it meant.&lt;/p&gt;

&lt;p&gt;We had an internal npm package that the server depended on heavily. Its types looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// internal-db-package/.../Payments.types.d.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;declare&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="kr"&gt;enum&lt;/span&gt; &lt;span class="nx"&gt;PStatus&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;PENDING&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Pending&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;CAPTURED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Captured&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And its compiled &lt;code&gt;.js&lt;/code&gt; file looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// internal-db-package/.../Payments.types.js&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;use strict&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// (that's the whole file; declare const enum compiles to nothing)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;tsc&lt;/code&gt; would read the &lt;code&gt;.d.ts&lt;/code&gt;, find the enum, and inline the value into the consumer code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Captured&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;esbuild read the &lt;code&gt;.js&lt;/code&gt;, found nothing, and emitted this instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;CAPTURED&lt;/span&gt;
&lt;span class="c1"&gt;// TypeError: Cannot read properties of undefined (reading 'CAPTURED')&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The build printed a warning. Not an error, a warning. The server still booted. The dev environment behaved normally because that specific code path didn't run in any of our health checks. The only time it actually ran was when a real payment confirmation came back from the provider and the code tried to look up the status against the enum. That's when it crashed. The two enum flips in my PR (&lt;code&gt;PaymentStatus&lt;/code&gt; and &lt;code&gt;InvoiceTrackingEvent&lt;/code&gt;) were just there to get past esbuild's local build errors. I thought I had fixed the problem. What I had actually done was patch the two enums I happened to notice, while leaving every other enum import from that internal package silently broken in production.&lt;/p&gt;

&lt;p&gt;The hotfix itself was tiny. Revert the build script back to &lt;code&gt;tsc&lt;/code&gt;, restore the two &lt;code&gt;declare const enum&lt;/code&gt; keywords, push. Five lines. I had it in within the hour. Total production impact came out to roughly five to six hours, spread across a weekend, affecting somewhere between ten and fifteen landlords depending on how you count the ones who retried successfully on their own. We patched the affected payment records by hand the next morning. None of the actual money was lost, just the link state. Stripe held the charges fine. That was the only mercy in the whole thing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The part I didn't expect
&lt;/h2&gt;

&lt;p&gt;I waited for someone to be angry. Nobody was. A senior developer pinged me after the hotfix went out. No "why did you do this", no "why didn't we test it harder". Just: &lt;em&gt;"You know the fix, just get the PR up for production, we're good."&lt;/em&gt; That was the whole conversation. The story made the rounds internally as a joke, not an indictment. Nobody points at me with it.&lt;/p&gt;

&lt;p&gt;Two months later, with the dust settled, I came back to esbuild. By then I had switched to a Mac and the slow-build pain wasn't personal anymore, but the numbers were still real and the curiosity hadn't gone away. This time I did it differently. I started by actually mapping where the internal package was imported. Dozens of places, some I had no idea existed. The codebase had been growing for years and a few of the original authors were no longer around. That alone was a useful exercise, separate from anything to do with build tools. I now had a list of every file the change could touch. The rule I set for myself was simple. &lt;code&gt;tsc&lt;/code&gt; stays in the production Docker build, untouched. esbuild is only allowed near dev. Production keeps the boring, slow, correct thing. Local gets the fast, fragile thing, with guardrails.&lt;/p&gt;

&lt;p&gt;Then I wrote two small esbuild plugins. The first one rewrites &lt;code&gt;declare const enum&lt;/code&gt; into plain &lt;code&gt;enum&lt;/code&gt; on the way through esbuild. It walks the source, strips the &lt;code&gt;declare&lt;/code&gt; keyword, and lets the enum compile to a real runtime object. The local files now generate the JavaScript that &lt;code&gt;tsc&lt;/code&gt; would have inlined for free. The second plugin was the one that actually mattered. It virtualises the internal package's &lt;code&gt;.types&lt;/code&gt; modules. The internal package's compiled &lt;code&gt;.js&lt;/code&gt; files are empty (because, again, &lt;code&gt;declare const enum&lt;/code&gt; compiles to nothing). The plugin intercepts every import resolving into those &lt;code&gt;.types&lt;/code&gt; paths and substitutes a hand-written module with the same shape and the same values, but as a plain object literal that exists at runtime. The empty &lt;code&gt;.js&lt;/code&gt; files never get loaded again. Whatever esbuild can't infer from &lt;code&gt;.d.ts&lt;/code&gt;, the plugin supplies directly. The dev server rebuild went from 6.6 seconds with &lt;code&gt;tsc&lt;/code&gt; down to 119 milliseconds with esbuild. Roughly fifty-five times faster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tsc      ████████████████████████████████  6.6s
esbuild  ▏                                 119ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To be clear, the 6.6 seconds is the Mac number, not the Linux one. On Linux it had been minutes; on the Mac &lt;code&gt;tsc&lt;/code&gt; was bearable but esbuild was still in a different league. That setup is still running today, three months later, and nobody has noticed it exists, which is the highest compliment a dev tool can get. I also wrote a short doc explaining what the plugins do and why they exist, and pinned it in our engineering channel. Partly so the next person who touches the build doesn't wander into the same trap. Partly so future-me, in six months, doesn't either.&lt;/p&gt;
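&lt;p&gt;The core of the two plugins described above can be sketched as two pure functions. This is an illustration under assumptions, not the code from the post: the function names are hypothetical, the rewrite is a plain regex rather than a real parser, and in an esbuild plugin they would presumably be called from &lt;code&gt;onLoad&lt;/code&gt; and &lt;code&gt;onResolve&lt;/code&gt; callbacks.&lt;/p&gt;

```typescript
// Plugin 1, core step (assumed): turn an ambient const enum into a regular enum
// so the bundler emits a real runtime object for it.
function rewriteDeclareConstEnum(source: string): string {
  return source.replace(/\bdeclare\s+const\s+enum\b/g, "enum");
}

// Plugin 2, core step (assumed): build the source text of a stand-in module
// that exports each enum as a plain object literal with the same values.
type EnumMembers = { [member: string]: string };

function virtualEnumModule(enums: { [enumName: string]: EnumMembers }): string {
  return Object.keys(enums)
    .map((enumName) => "export const " + enumName + " = " + JSON.stringify(enums[enumName]) + ";")
    .join("\n");
}

// The PStatus members are the ones shown in the .d.ts excerpt earlier.
const rewritten = rewriteDeclareConstEnum(
  "export declare const enum PStatus { PENDING = 'Pending', CAPTURED = 'Captured' }"
);
const standIn = virtualEnumModule({ PStatus: { PENDING: "Pending", CAPTURED: "Captured" } });

console.log(rewritten);
console.log(standIn);
```

&lt;p&gt;A regex is the simplest thing that works for a sketch. It would also match the phrase inside a comment or a string, so a stricter version would need to be more careful about where it rewrites.&lt;/p&gt;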




&lt;h2&gt;
  
  
  In closing
&lt;/h2&gt;

&lt;p&gt;Build tools are replaceable. Good teams aren't.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>performance</category>
      <category>productivity</category>
      <category>typescript</category>
    </item>
    <item>
      <title>From custom polling architecture to one API call: rethinking notification delivery</title>
      <dc:creator>Karan Mali</dc:creator>
      <pubDate>Fri, 15 May 2026 12:23:43 +0000</pubDate>
      <link>https://dev.to/karan5599/notification-system-design-the-question-i-almost-missed-a1f</link>
      <guid>https://dev.to/karan5599/notification-system-design-the-question-i-almost-missed-a1f</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;How hard can it be?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I work at a property management SaaS company. Landlords use the platform to manage their properties, collect rent, track maintenance requests, handle lease contracts. Four developers, one product.&lt;/p&gt;

&lt;p&gt;Small team means you don't get tickets scoped down to a single component, you get features, end to end. I joined as a backend developer but quickly ended up touching everything: APIs, frontend, infra, and eventually mobile. That's just how it works when there are four of you.&lt;/p&gt;

&lt;p&gt;One day, a senior developer pinged me. The product needed a notification center. Users were asking for it: when a settlement gets transferred, when a maintenance request comes in, when a lease is about to expire. These are things landlords need to act on. Notifications made sense. The ask was simple. The notification system design turned out not to be.&lt;/p&gt;

&lt;p&gt;I didn't just start coding. I went to the whiteboard, mapped out the flow, discussed edge cases with the senior developer, made sure I understood what we actually needed. I designed for the problem I was given. What I didn't do was pressure-test whether that problem was the complete one.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The design I was proud of&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The first question I had to answer before writing a single line of code: how does the client get new notifications?&lt;/p&gt;

&lt;p&gt;The standard answer is &lt;strong&gt;WebSockets&lt;/strong&gt;. Open a persistent connection, server pushes events in real time. One of my teammates suggested exactly that. But here is the problem.&lt;/p&gt;

&lt;p&gt;We run on &lt;a href="https://cloud.google.com/run/docs" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;, Google's serverless platform. Serverless means instances spin up and down. Persistent connections and abrupt disconnections become your problem to manage. On top of that, did our users actually need real-time notifications? A landlord getting notified about a rent settlement doesn't need to know in under a second. Five to ten seconds is fine. Polling was cheaper, easier to scale, and good enough for what we needed. I had a clear answer for why, and that mattered, because going against the standard approach means you'd better be able to defend it.&lt;/p&gt;

&lt;p&gt;Here's how the full architecture looked:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2s2qf8nwlc784tb3rwk4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2s2qf8nwlc784tb3rwk4.png" alt="The ‘Impress the Recruiter’ Architecture" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Any service that wants to send a notification drops a job into &lt;a href="https://docs.bullmq.io/" rel="noopener noreferrer"&gt;BullMQ&lt;/a&gt;. A worker picks it up, writes to PostgreSQL, then pushes the notification ID into Redis. These two operations are sequential, not bundled. If the Redis push fails, it retries only that step; the DB write already happened and doesn't get touched again. One queue, one worker, two distinct responsibilities in sequence.&lt;/p&gt;

&lt;p&gt;The DB-generated notification ID becomes the Redis key. That ordering matters: you need the ID before you can cache it.&lt;/p&gt;

&lt;p&gt;On the read side, the client sends its last delivered notification ID. The server checks Redis: anything with an ID greater than that for this user? Return it. The client owns its cursor. No read/unread state to manage server-side. Stateless on the server, clean on the client.&lt;/p&gt;
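
&lt;p&gt;A minimal sketch of that cursor-based read path, using an in-memory object as a stand-in for Redis so it runs on its own. The names (&lt;code&gt;CachedNotification&lt;/code&gt;, &lt;code&gt;pushToCache&lt;/code&gt;, &lt;code&gt;fetchSince&lt;/code&gt;) are illustrative, not the production code, and it assumes the database hands out numeric, increasing IDs:&lt;/p&gt;

```typescript
// Hypothetical shape of a cached notification; field names are assumptions.
type CachedNotification = { id: number; userId: string; title: string };

// In-memory stand-in for the per-user Redis structure (assumption for this sketch).
const cache: { [userId: string]: CachedNotification[] } = {};

// Worker side: called only after the DB insert has returned the generated ID.
function pushToCache(n: CachedNotification): void {
  const list = cache[n.userId] ?? [];
  list.push(n);
  cache[n.userId] = list;
}

// Read side: the client sends the last ID it has seen,
// and the server returns only entries with a greater ID.
function fetchSince(userId: string, lastDeliveredId: number): CachedNotification[] {
  const list = cache[userId] ?? [];
  return list.filter((n) => n.id > lastDeliveredId);
}

pushToCache({ id: 101, userId: "landlord-1", title: "Rent settled" });
pushToCache({ id: 102, userId: "landlord-1", title: "Lease renewed" });
```

&lt;p&gt;With this data, a client whose cursor is 101 gets back only notification 102, and the server keeps no per-client read state.&lt;/p&gt;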

&lt;p&gt;Two fallbacks handle edge cases. Every 30 minutes, the client forces a direct DB fetch in case Redis missed something. When the browser tab comes back from being suspended, the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Page_Visibility_API" rel="noopener noreferrer"&gt;Page Visibility API&lt;/a&gt; triggers another direct DB call. Redis has a 24-hour TTL, but the ID cursor filters stale entries anyway.&lt;/p&gt;

&lt;p&gt;There's a subtle attack surface in the fallback design worth thinking through. If a malicious actor always passes &lt;code&gt;db=true&lt;/code&gt;, every request bypasses Redis and hits the database directly: constant load, every poll cycle. The fix is server-controlled rate limiting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// On direct DB fetch, set a rate-limit key&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`last_db_fetch:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;EX&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// On every incoming request, check before honoring db=true&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recentFetch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`last_db_fetch:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;recentFetch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Ignore client-supplied db=true, serve from Redis&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;serveFromRedis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;lastNotificationId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client no longer controls when the DB gets called; the server does. Simple key, solves the problem cleanly.&lt;/p&gt;

&lt;p&gt;The architecture worked. But it had a blind spot I couldn't see from inside it.&lt;/p&gt;

&lt;p&gt;The Redis TTL had to stay manually synced with the frontend polling interval: two things in two places, easy to drift. And the biggest problem: every new delivery channel meant building a separate integration from scratch. Email? Build it. Mobile push? Build it. Slack? Build it. I owned all of it indefinitely.&lt;/p&gt;

&lt;p&gt;At the time, the brief was web only. So this felt fine. And then it wasn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The comment that broke the design&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We have weekly meetings. Product updates, priorities, what's coming next. I was in one of those when our CEO mentioned, almost in passing, that we'd be launching a mobile app in a few months.&lt;/p&gt;

&lt;p&gt;That was it. That was the comment.&lt;/p&gt;

&lt;p&gt;I hadn't thought about it before that moment. But something clicked immediately. Mobile app means push notifications. Push notifications mean FCM integration. Email was also coming; the brief had been expanding quietly in the background. And I was sitting on an architecture I'd spent two days designing and defending, one that handled web delivery and nothing else.&lt;/p&gt;

&lt;p&gt;I went to the senior developer after the meeting. Laid it out: if we want email, I build that delivery layer. If we want mobile push, I build that too. If we want Slack someday, same story. Every new channel is a new integration I own, maintain, and debug when it breaks at 2am.&lt;/p&gt;

&lt;p&gt;That's not a notification system. That's a delivery platform. Different scope, different problem.&lt;/p&gt;

&lt;p&gt;I talked it through with them and made the call. Go third-party. Two days of architecture work had to go: the polling logic, the Redis TTL design, the fallback math. I scrapped it. Not because the thinking was wrong. The technical decisions were sound. But I had designed for a problem that was smaller than the actual one, and the right move was to own that and correct it before building further on a foundation that wouldn't hold.&lt;/p&gt;

&lt;p&gt;Two days gone. The alternative was spending the next year maintaining delivery across three channels myself. That math wasn't hard.&lt;br&gt;
The mistake wasn't the architecture. It was not pressure-testing the scope before I started. But catching it at two days instead of six months into a mobile launch — that's the part that mattered.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What I actually evaluated&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Going third-party meant finding the right tool. I evaluated against a short list of non-negotiables.&lt;/p&gt;

&lt;p&gt;We needed a React Native SDK with a prebuilt inbox component — we weren't going to build our own notification UI on mobile. We needed i18n support without enterprise pricing — we serve an Arabic-speaking market and that's table stakes, not a premium feature. And we needed a single API call that handled all delivery channels, so adding email or push later didn't mean a new integration.&lt;/p&gt;

&lt;p&gt;I evaluated a few options. One required building the mobile UI ourselves. One priced i18n behind an enterprise tier. &lt;a href="https://www.courier.com/docs" rel="noopener noreferrer"&gt;Courier&lt;/a&gt; hit every requirement: prebuilt React Native inbox, i18n on all plans, one API call for web, mobile, and email.&lt;/p&gt;

&lt;p&gt;One practical note if you're on a similar stack: Courier has no Angular SDK. Our web client is Angular, so I had to inject their web component directly into the DOM. It works, but it's not the same developer experience as a native SDK. Worth factoring in before you commit.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The new notification system architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;With Courier handling delivery, the system got significantly simpler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6q5s1dd4ie9g65m3y7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6q5s1dd4ie9g65m3y7w.png" alt="When Simplicity Meets the Use Case" width="799" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flow is straightforward. Any service calls &lt;code&gt;NotificationHubService.create()&lt;/code&gt;. The service resolves recipients, checks preferences, saves to PostgreSQL, and enqueues a job to BullMQ. The worker picks it up and sends a single API call to Courier — user ID, title, body, action URL. Courier routes to web inbox, mobile push, or email depending on configuration. The worker then updates the notification record with Courier's request ID for traceability.&lt;/p&gt;
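
&lt;p&gt;A rough sketch of the persist, enqueue, send, and record steps, under loud assumptions: an array stands in for the PostgreSQL table, another array stands in for the BullMQ queue, and &lt;code&gt;sendViaProvider&lt;/code&gt; is a hypothetical stub, not Courier's SDK. Recipient resolution and preference checks are left out to keep the ordering visible:&lt;/p&gt;

```typescript
// All names here are hypothetical; this is not Courier's SDK or the real service.
type NotificationRecord = {
  id: number;
  userId: string;
  title: string;
  body: string;
  actionUrl: string;
  providerRequestId: string | null;
};

// Stand-ins for the notifications table and the job queue.
const table: NotificationRecord[] = [];
const queue: number[] = [];

// Stub provider: one call per notification, returns a request ID.
const sentTo: string[] = [];
async function sendViaProvider(message: {
  userId: string;
  title: string;
  body: string;
  actionUrl: string;
}) {
  sentTo.push(message.userId);
  return { requestId: `req-${sentTo.length}` };
}

// create(): persist first, then enqueue the job by notification ID.
function create(userId: string, title: string, body: string, actionUrl: string): number {
  const id = table.length + 1;
  table.push({ id, userId, title, body, actionUrl, providerRequestId: null });
  queue.push(id);
  return id;
}

// Worker: take one job, make a single provider call,
// then store the provider's request ID on the record for traceability.
async function processNext() {
  const id = queue.shift();
  if (id === undefined) return false;
  const record = table.find((r) => r.id === id);
  if (record === undefined) return false;
  const { requestId } = await sendViaProvider(record);
  record.providerRequestId = requestId;
  return true;
}
```

&lt;p&gt;The point of the ordering is that the record exists before anything is sent, so a failed send can be retried from the queue without losing the notification.&lt;/p&gt;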

&lt;p&gt;That's it. No Redis live queue. No polling. No fallbacks. No TTL drift.&lt;/p&gt;

&lt;p&gt;The diagram looks simple because the architecture is simple. The interesting complexity isn't in the flow. It's in the recipient resolution logic, which lives at the code layer, not the architecture layer. And that was a deliberate choice.&lt;/p&gt;

&lt;p&gt;The interface supports three modes: explicit user IDs, resource-based access (pass a lease or settlement and the service resolves who has access using the existing RBAC layer), or broadcast to all active account members. The caller doesn't need to know the access rules. Any service can call the same interface with a category and a message. Adding a new notification type is one function call. The architecture is deliberately simple so the logic can be where it needs to be — in the code, not the infrastructure.&lt;/p&gt;
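
&lt;p&gt;One way to model those three modes is a discriminated union, sketched below. The type and function names are made up for illustration, and the two lookup objects are toy data standing in for the RBAC layer and account membership:&lt;/p&gt;

```typescript
// Hypothetical types; the real service resolves access through the existing RBAC layer.
type RecipientTarget =
  | { mode: "users"; userIds: string[] }
  | { mode: "resource"; resourceType: "lease" | "settlement"; resourceId: string }
  | { mode: "broadcast"; accountId: string };

// Toy data standing in for RBAC grants and active account members.
const resourceAccess: { [resourceKey: string]: string[] } = {
  "lease:42": ["landlord-1", "manager-7"],
};
const activeMembers: { [accountId: string]: string[] } = {
  "acct-1": ["landlord-1", "manager-7", "accountant-3"],
};

// The caller picks a mode; the access rules stay inside this function.
function resolveRecipients(target: RecipientTarget): string[] {
  switch (target.mode) {
    case "users":
      return target.userIds;
    case "resource":
      return resourceAccess[`${target.resourceType}:${target.resourceId}`] ?? [];
    case "broadcast":
      return activeMembers[target.accountId] ?? [];
  }
}
```

&lt;p&gt;A caller passing &lt;code&gt;{ mode: "resource", resourceType: "lease", resourceId: "42" }&lt;/code&gt; gets back whoever has access to that lease without knowing how access is decided.&lt;/p&gt;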

&lt;p&gt;The tradeoff is vendor dependency. If Courier has an outage, notifications go down. That's a real risk we accepted — but for a four-person team, owning delivery reliability ourselves across every channel wasn't a trade we could win.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The question I should have asked first&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I spent two days designing a solid architecture. BullMQ queue, Redis cache, polling logic, fallbacks for every edge case. I had reasoned through each decision and could defend them.&lt;/p&gt;

&lt;p&gt;What I hadn't done was ask one question: where else do you want to send these notifications?&lt;/p&gt;

&lt;p&gt;Business briefs describe what people want today. They don't always describe what the system needs to support tomorrow. That's not because anyone is hiding it, but because a CEO mentioning a mobile app in a weekly meeting isn't thinking about your queue architecture. That's their product, not their system.&lt;/p&gt;

&lt;p&gt;The system has three distinct responsibilities: generating notifications, resolving who receives them, and delivering them. The first architecture conflated all three. The new one separates them — generation stays in each service, resolution lives in &lt;code&gt;NotificationHubService&lt;/code&gt;, delivery belongs to Courier. Each part is independently replaceable. That separation is also what made the build-vs-buy call clear: delivery is a commodity problem. Recipient resolution, tied to your RBAC model and your business rules, is not. Own what's specific to your domain. Buy what isn't.&lt;/p&gt;

&lt;p&gt;If I'd asked that question in the first meeting, none of this would have needed unwinding. I caught it at two days instead of six months into a mobile launch. Next time I start a system design, I know what the first question is.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've had a similar "comment that broke the design" moment — when did you catch it? Drop it in the comments. Always curious how other teams pressure-test scope before they commit.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>systemdesign</category>
      <category>backend</category>
      <category>fullstack</category>
    </item>
  </channel>
</rss>
