DEV Community: Karan Mali

Transifex now pays for 75% of our Claude Code subscription

Karan Mali — Sat, 11 Jul 2026 18:42:14 +0000

The $150 problem that was really a time problem

We were paying Transifex $150 a month, and the money wasn't even the part that hurt.

The real cost was time. Here's how a translation actually shipped. A developer adds a new string to the website, a button or a notification or an error message, in English. It goes to the dev branch. From there it shows up on the Transifex dashboard, sitting in a list of strings marked as "needs translation." Then someone from the business side, an Arabic speaker, goes in and translates it. The Transifex bot picks up those translations and injects them back into our repo through a GitHub PR. Only then does the Arabic version ship.

Count the handoffs. Developer → dashboard → business person → bot → repo. And here's the part that made it unavoidable: the product ships in both English and Arabic, but the engineering team speaks English. None of us could write or verify the Arabic ourselves, so every Arabic string was blocked on someone outside the team. If you shipped a big batch of strings, that person needed time to work through all of them.

In practice this took 3 to 4 days. For a one-word button label. Not because translating one word is hard, but because the word had to travel through a paid SaaS, a dashboard, a person, and a bot before it landed.

The thing about Transifex is that it worked. The human-in-the-loop part was genuinely good. You don't want to ship Arabic to real users without someone who speaks it signing off. But we were renting a $150/month round-trip to get a checkpoint we could have run ourselves, and paying for it in days of latency on every change.

Why we never fixed it before

The obvious question is: if the round-trip was that painful, why did we pay for it for years?

Because building your own translation pipeline used to be a bad trade. Think about what it would have taken pre-AI. You'd write code to parse the XLF files, figure out the placeholder syntax, handle the dynamic values, and then you'd still need a human who reads Arabic to do the translating. You haven't removed the slow part. You've just added a pile of parsing code to maintain on top of it. For a 4-person team with exactly two languages, English and Arabic, that math never closes. You shrug and pay the $150.

Quick context on those XLF files, because they matter later. XLF (short for XLIFF) is just an XML format for translations: a long list of units, each one pairing an English source with its Arabic target.

<trans-unit id="save_btn">
  <source>Save</source>
  <target>حفظ</target>
</trans-unit>

Our frontend build tool generates the source side automatically from the app, and something fills in the target, a human or, now, the agent. Clean and boring. That held right up until the strings stopped being plain words.
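To make the shape concrete, here is a minimal sketch of reading those units and listing the ones that still need Arabic. It is regex-based and assumes flat units like the one above, with no inline markup inside the source or target. The `cancel_btn` unit and the function names are made up for illustration; a real pipeline would lean on a proper XLIFF parser.

```typescript
// One <trans-unit>: an id, the English source, and the Arabic target (may be empty).
interface TransUnit {
  id: string;
  source: string;
  target: string;
}

// Regex-based reader, good enough for flat units like the sample above.
// Assumption: no nested markup inside <source> or <target>.
function readUnits(xlf: string): TransUnit[] {
  const units: TransUnit[] = [];
  const unitRe = /<trans-unit\b[^>]*\bid="([^"]*)"[^>]*>([\s\S]*?)<\/trans-unit>/g;
  let m: RegExpExecArray | null;
  while ((m = unitRe.exec(xlf)) !== null) {
    const body = m[2];
    const source = /<source>([\s\S]*?)<\/source>/.exec(body);
    const target = /<target[^>]*>([\s\S]*?)<\/target>/.exec(body);
    units.push({
      id: m[1],
      source: source ? source[1].trim() : "",
      target: target ? target[1].trim() : "",
    });
  }
  return units;
}

// Units that still need Arabic: target missing or blank.
function needsTranslation(units: TransUnit[]): TransUnit[] {
  return units.filter((u) => u.target === "");
}

// "cancel_btn" is a hypothetical second unit with no target yet.
const sample = `
<trans-unit id="save_btn">
  <source>Save</source>
  <target>حفظ</target>
</trans-unit>
<trans-unit id="cancel_btn">
  <source>Cancel</source>
</trans-unit>`;

const pending = needsTranslation(readUnits(sample)).map((u) => u.id);
```

Here `pending` ends up holding only the unit without a target, which is the list an agent or a translator would work from.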

This was never a hard problem, and that's the point. There aren't a hundred people doing our translations. It isn't some massive localization operation with thirty locales and a compliance process. It's en and ar. The effort to fix it just never dropped below the effort to live with it.

AI changed that calculation. Not because it can translate. Transifex's human translators could already do that fine. It changed it because an agent could read the XLF, understand the placeholder syntax, learn what each dynamic value meant, and draft the Arabic itself. The part that used to need a parser and a person now needed an agent and a person to check it. The effort to own the pipeline collapsed.

So I built it on a Saturday, as a side project. It didn't touch the roadmap or block anyone's work, and if it had flopped I'd have lost an afternoon. That's the whole reason it was worth doing. The payoff wasn't huge. AI just made the cost of trying almost nothing.

First cut, teaching an agent to read the files

My first version was naive, and I knew it would be. Point an agent at the Arabic translation file, tell it "translate the missing strings into Kuwaiti Arabic," let it run. The goal of that first pass wasn't to get it right. It was to see where it would fall over.

Simple strings were fine. A button that says "Save" becomes "حفظ" and we're done. The agent handled those immediately.

The trouble started with strings that have values baked into them. A notification doesn't say "Lease expires soon." It says "Lease A-102 (Marina Tower) expires in 14 days." The lease ID, the property name, the number of days: none of that is part of the translation. It gets dropped in later, when the notification goes out. In the file, all you see is numbered blanks. No names, no hints, just placeholders waiting to be filled.

That's where it broke. The agent had no way to know what each blank was for. Is the first one the lease ID, the tenant's name, or the date? You can't tell from the file, and if you don't know what a blank holds, you can't place it correctly in the sentence.

In Arabic that matters more than it sounds. Arabic reads right-to-left, and on top of that its grammar orders a sentence differently, so you don't mirror the English word order, you rebuild the sentence. Something that sat at the end in English often moves near the front in Arabic. So the agent couldn't swap words in place. It had to follow the sentence well enough to know where each piece lands once everything is rearranged.

The first cut got this wrong in all the ways you'd expect. It lost blanks, put them in the wrong spot, and once tried to "translate" one that was code, not words. A missing blank isn't a harmless typo. The notification ships with an empty gap where the lease ID should be: it looks perfectly translated and is quietly broken.

So the real problem was never "translate English to Arabic." The agent could do that in its sleep. It was "rebuild the sentence in Arabic without losing the moving parts inside it."

Give it the context, then make it prove the work

The fix came in two parts: give the agent the missing context, then stop trusting it to be careful.

The context part was simple once I understood the real problem. The reason the agent couldn't place those values was that the translation file didn't tell it what they were. But the original code did. Somewhere in the source there's a line that builds that notification, something like "Lease {leaseId} ({name}) expires in {days} days", with real, named variables. So the rule became: before you translate a string with blanks in it, go open the source file and read the line. Learn that the first blank is the lease ID and the second is the property name. Then translate. Once the agent was reading the source instead of guessing from the file, it started putting values where they belonged.
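The lookup in that rule boils down to something like this sketch: take the line from the source code, pull out the variable names in order, and hand the agent a note saying what each blank holds. The `{name}` syntax is taken from the example sentence above; the real source syntax may differ, and both function names are hypothetical.

```typescript
// Pull the variable names out of a source template, in order of appearance.
// Assumption: variables are written as {name}, as in the example sentence.
function placeholderNames(template: string): string[] {
  const names: string[] = [];
  const re = /\{(\w+)\}/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(template)) !== null) {
    names.push(m[1]);
  }
  return names;
}

// Build the note handed to the agent: blank N holds <name>.
function describeBlanks(template: string): string[] {
  return placeholderNames(template).map((name, i) => `blank ${i + 1} = ${name}`);
}

const notes = describeBlanks("Lease {leaseId} ({name}) expires in {days} days");
// notes: ["blank 1 = leaseId", "blank 2 = name", "blank 3 = days"]
```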

The second part was trusting nothing. I'd seen the agent ship strings that looked translated and were quietly broken, so "looks done" couldn't be the bar. I wrote a small validation script that checks every translated string mechanically: does it still have every value the English version had, none added, none dropped, none mangled? The agent isn't done when it thinks it's done. It's done when that check passes. If a value went missing, the script fails, and the agent has to go fix it and run again.
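A check like that can be very small. The sketch below is not the actual script, just one way to write it: count every placeholder in the English and in the Arabic, and report anything missing or extra. It assumes placeholders look like `{0}`, `{1}`, and so on; swap the regex for whatever the files really use.

```typescript
// Assumption: placeholders look like {0}, {1}, ... in both languages.
const PLACEHOLDER = /\{\d+\}/g;

// Count how many times each placeholder token appears in a string.
function countPlaceholders(text: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const token of text.match(PLACEHOLDER) || []) {
    counts[token] = (counts[token] || 0) + 1;
  }
  return counts;
}

// Return a list of problems; an empty list means the translation passes.
function validatePlaceholders(source: string, target: string): string[] {
  const want = countPlaceholders(source);
  const got = countPlaceholders(target);
  const tokens = Object.keys(want).concat(
    Object.keys(got).filter((t) => !(t in want))
  );
  const problems: string[] = [];
  for (const token of tokens) {
    const w = want[token] || 0;
    const g = got[token] || 0;
    if (g < w) problems.push(`missing ${token} (expected ${w}, found ${g})`);
    if (g > w) problems.push(`extra ${token} (expected ${w}, found ${g})`);
  }
  return problems;
}

const english = "Lease {0} ({1}) expires in {2} days";
const arabicOk = "ينتهي عقد الإيجار {0} ({1}) خلال {2} يومًا";
const arabicBroken = "ينتهي عقد الإيجار {0} خلال {2} يومًا"; // {1} was dropped

const okProblems = validatePlaceholders(english, arabicOk); // []
const brokenProblems = validatePlaceholders(english, arabicBroken);
// ["missing {1} (expected 1, found 0)"]
```

A mangled placeholder such as `{ 1}` no longer matches the pattern, so it is reported as missing too. The agent loop then reduces to "run the validator, fix, repeat until the list is empty."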

On top of that I wrote down the rules it kept getting wrong, once, in a spec file: restructure the sentence for Arabic instead of mirroring English, never touch the placeholders, keep numbers in the format the rest of the app uses. Instead of correcting the same mistakes by hand every time, the agent reads the rules and the checker enforces them.

That's the part I'd underline for anyone building with agents. The model doing the creative work, the translation itself, was never the hard part. The hard part was everything around it: feeding it the context it was missing, and bolting on a dumb, strict check at the end that it can't talk its way past. The clever translation was the easy 80%. The boring check is the 20% that made it something I'd ship.

You still can't trust it blindly

Even with the context and the checker, I wasn't going to point this at real users and walk away. The validator proves the placeholders are intact. It can't tell you whether the Arabic reads well, or whether the agent picked a slightly-off word for something. For that you need a person who speaks the language.

That was the one genuinely good thing about Transifex: the human checkpoint. A native Arabic speaker on the business side looked at every translation before it shipped. I didn't want to lose that. I wanted to keep the exact same checkpoint and just stop paying $150 a month for the privilege.

The problem is where the agent's translations live. They're in files, in a git repo. A business reviewer isn't going to clone a repo and edit XML, and honestly I don't want them anywhere near it. One stray character and the build breaks. So the agent produces the Arabic, but a non-developer has no safe way to review or fix it.

So I built a small web app for that, a translation portal. It's just a table. English on one side, the agent's Arabic on the other, grouped into sections like Invoices or Auth so it's not one giant wall of strings. The reviewer reads down the list, fixes anything that's off, and saves. No git, no XML, no way to break anything. Just the part of the job they're good at.

There's a subtle problem hiding in that table. The agent writes Arabic into the same files a human later edits, and both commit to the same repo. So once a string is in there, you can't tell by looking whether a person signed off on it or whether it's raw machine output nobody has checked yet. Unreviewed AI text and human-approved text look identical. If the human check was going to mean anything, I needed to know which strings a person had signed off on and which were still the agent's raw guess.

The simple move was to keep that approval state completely out of the translation files. The files stay pure: the agent commits them, the apps build from them, nothing else lives in them. I looked at marking the files themselves, but XLIFF has a built-in "reviewed" flag that our frontend build tool wipes every time the agent regenerates the file, and the mobile JSON has nowhere to put per-string metadata without breaking the app that loads it. So the approval record lives somewhere separate, in blob storage the portal owns. It's just a ledger. For each string: the exact Arabic a human approved, who approved it, and when.

The part I like is that I don't store a status anywhere. I derive it. When the portal loads, it diffs the live translation against the ledger. If a string isn't in the ledger, nobody has approved it, so it's in the review queue. If it's in the ledger but the current text doesn't match what was approved, it changed since, so it goes back in the queue. Only an exact match counts as approved. Which means when the agent rewrites a string, it stops matching the approved snapshot and falls back into needs-review on its own. The agent doesn't know the review system exists, and it doesn't need to.
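As a rough sketch, the derivation is a few lines. The ledger shape follows the description above (approved text, who, when), but the field names, the ids other than `save_btn`, and the reviewer address are placeholders I made up for the example.

```typescript
type ReviewStatus = "approved" | "needs-review";

// One ledger entry: the exact Arabic a human approved, who approved it, and when.
interface Approval {
  text: string;
  approvedBy: string;
  approvedAt: string; // ISO timestamp
}

// Ledger keyed by string id. Field names are illustrative.
type Ledger = Record<string, Approval>;

// Status is computed from the live text and the ledger, never stored.
function statusOf(id: string, liveText: string, ledger: Ledger): ReviewStatus {
  // Not in the ledger: nobody has approved it.
  if (!Object.prototype.hasOwnProperty.call(ledger, id)) return "needs-review";
  // Only an exact match with the approved snapshot counts.
  return ledger[id].text === liveText ? "approved" : "needs-review";
}

// Everything a reviewer still has to look at.
function reviewQueue(live: Record<string, string>, ledger: Ledger): string[] {
  return Object.keys(live).filter(
    (id) => statusOf(id, live[id], ledger) === "needs-review"
  );
}

const live: Record<string, string> = {
  save_btn: "حفظ",
  cancel_btn: "إلغاء", // the agent rewrote this after it was approved
  delete_btn: "حذف", // never reviewed
};

const ledger: Ledger = {
  save_btn: { text: "حفظ", approvedBy: "reviewer@example.com", approvedAt: "2026-07-01T10:00:00Z" },
  cancel_btn: { text: "الغاء", approvedBy: "reviewer@example.com", approvedAt: "2026-07-01T10:00:00Z" },
};

const queue = reviewQueue(live, ledger); // ["cancel_btn", "delete_btn"]
```

Because nothing writes a status, there is nothing to keep in sync: a rewrite by the agent changes the live text, the comparison fails, and the string is back in the queue.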

Translations stay in the repo. Approval state lives on its own. Status is computed, never stored in the files.

That's the whole shape of it. The agent drafts fast. The validator catches anything structurally broken. Then a human who speaks the language signs off, in a screen built for exactly that and nothing else. Same human-in-the-loop Transifex gave us. We just own all three pieces now, and pay for none of them.

Closing the loop

The loop is closed, and it runs without me in it. When a reviewer approves an edit in the portal, the portal commits that change straight back to the app repo through the GitHub API. Our existing deploy pipelines see the commit and ship it exactly like any other change. No dashboard, no third-party bot, no exporting files and pasting them in by hand. The reviewer hits approve, and the Arabic is on its way to production.

Two things were fiddlier than I expected. First, the obvious way to write a file through GitHub's API, the simple "update this file" endpoint, quietly falls over on big files, and our web translation file is over a megabyte. So I had to drop down to the lower-level API that commits the way git does under the hood: build the file, build the commit, move the branch pointer. More steps, but it doesn't choke on size.
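Those steps map onto GitHub's Git database endpoints roughly as follows. This is a sketch, not the portal's code: `GitHubRequest` is a hypothetical transport you would implement with an authenticated `fetch` or a client library, and `commitFile` and its input shape are names invented here. The paths and body fields are the REST API's own.

```typescript
// Hypothetical transport: performs an authenticated call and returns parsed JSON.
type GitHubRequest = (method: string, path: string, body?: unknown) => Promise<any>;

interface CommitInput {
  owner: string;
  repo: string;
  branch: string;
  path: string; // file path inside the repo
  content: string; // full new file content, UTF-8
  message: string;
}

async function commitFile(request: GitHubRequest, input: CommitInput): Promise<string> {
  const base = `/repos/${input.owner}/${input.repo}/git`;

  // 1. Find the commit the branch points at, and that commit's tree.
  const ref = await request("GET", `${base}/ref/heads/${input.branch}`);
  const parentSha: string = ref.object.sha;
  const parentCommit = await request("GET", `${base}/commits/${parentSha}`);

  // 2. Build the file: upload the content as a blob.
  const blob = await request("POST", `${base}/blobs`, {
    content: input.content,
    encoding: "utf-8",
  });

  // 3. Build a tree: the parent's tree with this one path swapped out.
  const tree = await request("POST", `${base}/trees`, {
    base_tree: parentCommit.tree.sha,
    tree: [{ path: input.path, mode: "100644", type: "blob", sha: blob.sha }],
  });

  // 4. Build the commit on top of the current branch head.
  const commit = await request("POST", `${base}/commits`, {
    message: input.message,
    tree: tree.sha,
    parents: [parentSha],
  });

  // 5. Move the branch pointer. Without force, a non-fast-forward is rejected.
  await request("PATCH", `${base}/refs/heads/${input.branch}`, { sha: commit.sha });
  return commit.sha;
}
```

Leaving `force` off in the last step is a deliberate choice in this sketch: if someone else pushed in the meantime, the update fails instead of silently discarding their commit, and the caller can re-read and retry.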

The second one bit me harder, and it's the mistake I'd warn anyone else about. My first version, whenever a reviewer saved, just rewrote the entire Arabic file from whatever the editor was holding in memory. That seems completely fine: the editor has every string loaded, so writing them all back out should be lossless. It wasn't. A one-word fix came out as a commit that touched the whole file, so every line showed up as changed and you couldn't see what had changed. Worse, if someone had edited a different string in the meantime, my whole-file write would quietly stomp it flat. The portal only knew about the version it had loaded, and it wrote that version over the top of everything.

It also quietly lost things. The editor built its list from the English file, so anything that existed only in Arabic was invisible to it and got erased on save. The clearest example: a label as ordinary as "Mr." would break the save. My code used dots to work out the file's nesting, so it read the dot inside "Mr." as if it meant "go one level deeper" and mangled the structure. Arabic plural forms that English doesn't have vanished the same way, because nothing in the English-built list knew they existed.
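The dot bug is easy to reproduce in a few lines. In this sketch the key `titles.Mr.` and the helper `setAt` are invented for the demonstration; the point is only what happens when a dotted string is split blindly versus when the path is kept as separate segments.

```typescript
type Tree = { [key: string]: Tree | string };

// Write a value at a path of already-separated segments, creating levels as needed.
function setAt(root: Tree, segments: string[], value: string): void {
  let node = root;
  for (let i = 0; i < segments.length - 1; i++) {
    const next = node[segments[i]];
    if (typeof next === "object") {
      node = next;
    } else {
      const created: Tree = {};
      node[segments[i]] = created;
      node = created;
    }
  }
  node[segments[segments.length - 1]] = value;
}

// Naive: every dot means "go one level deeper".
const naive: Tree = {};
setAt(naive, "titles.Mr.".split("."), "السيد");
// naive is { titles: { Mr: { "": "السيد" } } }: the key "Mr." no longer exists

// Safer: carry the path as an array, so a dot inside a key stays literal.
const safe: Tree = {};
setAt(safe, ["titles", "Mr."], "السيد");
// safe is { titles: { "Mr.": "السيد" } }
```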

The fix was to stop rewriting the file at all. Now the portal reads the real Arabic file, changes only the exact values the reviewer touched, leaves every other byte alone, and commits that. A one-word fix is a one-line commit. Nothing else moves, nothing gets dropped, and two people editing different strings don't overwrite each other.
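For the XLIFF file, a surgical edit like that can be done on the raw text rather than on a parsed-and-reserialized tree. The sketch below shows the idea under simplifying assumptions: units look like the `save_btn` example earlier, the target is plain text with no inline elements, and `patchTarget` and the `cancel_btn` unit are hypothetical.

```typescript
// Escape text for use inside an XML element.
function escapeXml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Replace the <target> text of one trans-unit and leave every other byte as-is.
// Returns null when the unit or its <target> cannot be found.
function patchTarget(xlf: string, id: string, newText: string): string | null {
  const unitStart = xlf.indexOf(`<trans-unit id="${id}"`);
  if (unitStart === -1) return null;
  const unitEnd = xlf.indexOf("</trans-unit>", unitStart);
  if (unitEnd === -1) return null;

  const openTag = "<target>";
  const targetStart = xlf.indexOf(openTag, unitStart);
  if (targetStart === -1 || targetStart > unitEnd) return null;

  const bodyStart = targetStart + openTag.length;
  const bodyEnd = xlf.indexOf("</target>", bodyStart);
  if (bodyEnd === -1 || bodyEnd > unitEnd) return null;

  return xlf.slice(0, bodyStart) + escapeXml(newText) + xlf.slice(bodyEnd);
}

const beforeEdit = [
  '<trans-unit id="save_btn">',
  "  <source>Save</source>",
  "  <target>حفظ</target>",
  "</trans-unit>",
  '<trans-unit id="cancel_btn">',
  "  <source>Cancel</source>",
  "  <target>الغاء</target>",
  "</trans-unit>",
].join("\n");

const afterEdit = patchTarget(beforeEdit, "cancel_btn", "إلغاء");

// Count the lines that differ between the two versions.
const oldLines = beforeEdit.split("\n");
const changedLines =
  afterEdit === null
    ? -1
    : afterEdit.split("\n").filter((line, i) => line !== oldLines[i]).length;
// changedLines is 1
```

Since the function only splices one span, the resulting commit diff is the one line the reviewer touched, and strings the editor never loaded are carried through untouched.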

The other surprise was how far the same table stretched. It started as one screen for the web app and the two mobile apps. It now feeds six surfaces across two separate repos and three different translation formats: the web app's XLIFF, the mobile apps' JSON, our listing site and its admin panel in another flavor of JSON, and a React Native app whose strings live as plain objects inside a TypeScript file. The reviewer sees the same table every time. All the format-specific mess stays behind the API, where each surface knows how to read and write its own files.

Where it sits right now: this is all running on a branch, not merged to main yet. But the loop is closed. And the difference is the whole point of the project. Under Transifex a new string took 3 to 4 days to ship in Arabic: dashboard, person, bot, repo. Now the agent drafts it in one pass, the reviewer checks it in the portal, they hit save, and it's a commit on the repo. Days down to hours.

Same checkpoint, minus the dashboard and the bot. The human sign-off stayed; the waiting didn't.

And it costs nothing to host. The portal is a small Next.js app on Vercel, the approval ledger lives in Vercel's blob storage, and both sit inside the free tier. So the $150 a month didn't move somewhere else, it went to zero. If anything, the money we stopped sending Transifex now covers about three-quarters of the Claude Code subscription the agent runs on. The tool we cancelled is paying for the tool that replaced it.

When this is the right call, and when it isn't

This isn't a case for everyone firing Transifex and building their own. For us it was right. For a lot of teams it would be a mistake, and the line between the two matters.

It worked for us because the dimension that decides this stayed small. The portal feeds six app surfaces now, but it's still two languages. One dialect of Arabic. One product domain the agent could learn: tenants, leases, listings, invoices. A handful of people, none of whom needed a fancy translation-management UI with roles and workflows and glossaries. When the whole problem fits in your head, owning it is cheap, and a tool built for a hundred-language operation is mostly paying for problems you don't have.

Flip any of those and I'd tell you to keep paying. Thirty languages, and now you're maintaining an agent and a review flow per language and you've built a localization company by accident. Legal or medical copy where a bad translation is a liability, and you want a vendor with a real process and someone to blame. A real localization team who live in translation-memory and termbases all day, and taking their tools away just makes their job worse to save a subscription.

The real thing I took from this isn't about translation. It's that AI quietly moved a line I'd stopped checking. "Just pay for it" is the right default for any problem that's annoying but not worth a week of engineering, and that default is built on what building it used to cost. That cost dropped. Some of the things you're renting out of habit are now a Saturday. Not most of them. But more than there used to be, and the only way to find out which is to try one instead of assuming the old answer still holds.

Links & tools

Transifex: the localization SaaS we were replacing
ICU message format: how translatable strings carry placeholders and plural forms, the part that makes machine translation tricky
XLIFF + the xliff npm package: the translation file format and the parser the portal reads it with
GitHub Git Data API: the low-level commit API the portal writes through (handles files past the Contents API 1MB limit)
Unicode CLDR plural rules: why Arabic has plural categories English doesn't
Claude Code subagents: what the translator agent runs on

I Didn't Need a Smarter Model. I Needed to Onboard It

Karan Mali — Sat, 27 Jun 2026 20:18:49 +0000

Why the bottleneck for AI on real tickets isn't intelligence. It's that your brain starts warm and the model starts cold.

Two people get the same ticket. One starts from zero.

When a ticket lands on me, my brain lights up before I've finished reading the title. This service owns that logic. The route is probably over here. That controller needs a new method, and there's a DAO at the end of the chain I'll have to touch. I'm not being clever. I've just been in this codebase long enough that the map is already in my head.

Hand the same ticket to an AI and it's staring at a blank wall. It has the code. What it doesn't have is the map. It doesn't know our layers, our naming, or that we shipped something almost identical last week. Same ticket, same repo, and one of us starts warm while the other starts cold.

I'd love to tell you I understood that from the start. I didn't. When I first wired AI into our workflow I assumed what most people assume: the model is smart enough to work the codebase out on its own. Point it at the repo, give it the ticket, get something useful back. That assumption is the exact thing that blew up in my face, and cleaning up the mess is what taught me the whole game is context, not intelligence.

Here's how naive I was about it at first. For weeks, every ticket started the same way. I'd screenshot the Jira ticket, paste it into Claude, and wait while it dug through our code trying to work out where things lived. Then I'd start coding. I was the integration layer. Copy from Jira, paste into Claude, sit there, repeat. At some point it stopped feeling like using AI and started feeling like a chore I was doing for the AI. The loop was dumb enough that something else should be running it. So I built that something.

So I automated it, and it got confidently wrong

The flow itself is simple. A ticket gets assigned, a branch gets cut for it on its own, and an agent writes an implementation plan into a folder we keep just for that. We've got dedicated folders for plans, for feature notes, for the stuff the team keeps coming back to. The agent reads the ticket, searches the code, and drops a plan: what the task is, which files probably change, how to go about it, what's still unclear. Then it links that plan back on the ticket so it's waiting for whoever picks the work up.

On paper, exactly what I'd been doing by hand. In practice the plans were bad.

Not bad like broken English. Bad like confidently wrong. The agent had read access across the whole repo and a search tool, so it would poke around, land on some file that looked relevant, and build a plan on top of it. The plans put validation logic in the controller. They put database queries straight in the service layer. Both of those are backwards from how we build. Validation belongs in middleware for us, and every database call lives in the DAO layer, never the service.

So the plan read like it knew what it was talking about, and it would march a developer right into a rejected PR. Picture a junior trusting one of these. They write the code the plan describes, open the PR, get torn apart in review, then sit there confused about why the "AI plan" sent them the wrong way.

It was slow on top of that. Every run, the agent rebuilt its picture of our architecture from nothing, because it had nothing to start from. That's the cold start again. I'd automated it without fixing the actual problem. I had handed off my busywork and thrown away the one thing that made the manual version worth doing. I already knew the shape of the code. The agent didn't, and I hadn't given it any way to find out.

It was never about how smart the model was

My first instinct was the obvious wrong one. Maybe a better model fixes this. It doesn't. A better model still has no clue that in our repo the DAO is the only place allowed to touch the database. You can't buy that off a pricing page. It isn't intelligence.

That's when it actually clicked. The problem was never how smart the model was. It started cold every single time, and I started warm. I'd been trying to upgrade the brain when the thing missing was the map in my head, and nobody had ever written that map down. A new hire would hit the exact same wall on day one. The difference is we don't expect a new hire to be brilliant. We expect to onboard them.

So I stopped trying to make the agent smarter and started onboarding it instead. New hires don't get a bigger brain. They get docs, they get the conventions, and they get someone pointing at the last person's work saying "do it like that." I built the same thing, in layers instead of one giant file.

There's a thin entry point at the top. Project basics, plus pointers that say "for this kind of work, go read that." The real knowledge isn't in there. Under it sits the core conventions doc, where the architecture is spelled out flat. Controllers call services, services call DAOs, DAOs own the database. Validation in middleware. The naming rules, and the patterns we don't break. Then there's a set of specialized files the agent only opens when it needs them. One for on-call database work, one for the data model, one for translations. It doesn't read those every time. It reaches for the right one when the ticket actually touches that area.

Then I rewrote the agent's marching orders. Before it searches for anything, it reads the entry point and the core doc. It gets its bearings on the overall structure first. Only then does it go looking in the area the ticket is about. And the instruction that pulled the most weight was the simplest one. Find the closest existing implementation and copy its pattern. That is word for word what I'd tell a junior on day one. Don't invent an approach. Go find something we already built that's close, and follow it.

The plans turned around almost immediately. Validation showed up in middleware. Queries showed up in the DAO. The model hadn't gotten smarter between Tuesday and Wednesday. I had just written down the rules I'd been carrying in my head and forced it to read them before it touched anything.

The part most people skip is what comes after that. A memory layer you write once and forget about rots fast. The code moves, conventions shift, and those docs quietly go stale until the agent is following rules that stopped being true months ago. So we baked the upkeep into the workflow itself. When the agent learns something new, or when we ship a feature that changes how an area works, updating the memory is part of the job and not a thing someone remembers to do later. The repo keeps its own docs current. A person's mental model updates as they work. The agent's has to update the same way, or it slides right back to confidently wrong.

Then the plan started arguing back

Once the plans were trustworthy, the thing I didn't plan for turned out to be the best part. The plan stopped being the valuable bit.

I added a step where the agent grills its own plan after writing it. It walks back through what it wrote the way a senior would in a design review. Where's the ambiguity. What edge case is missing. Which assumption needs a real answer before anyone writes a line. It writes those questions down, with its own recommended answer next to each one.

And then it holds the line. When I come back to actually build the ticket, the agent won't start writing code until those questions are answered, even when I tell it to just get on with it. It drags the design conversation to the front, the one I'd normally skip and then pay for later in a half-built feature and an ugly PR.

So the output grew up. It went from "a plan" to "a forced design conversation before a single line exists." The value moved off the answer and onto the questions. That's the piece I'd fight hardest to keep now, and it's the one I never set out to build.

Telling QA what actually shipped

The other place this quietly pays off is after the code is done, at deploy time.

A ticket almost never ships clean on the first try. It goes out, QA finds something, it comes back, gets patched, goes out again. A single ticket can rack up several of those rounds. The messy reality is that a few deploys in, nobody is totally sure what just went out the door. QA opens a ticket and can't tell which points got handled in this release and which are still sitting open. So they re-test the whole thing, or they re-test the wrong thing and miss the part that actually changed.

Now when a deploy goes out, an agent reads what was in it and leaves a summary on the tickets it touched. Here's what shipped. These items from the ticket are done. These aren't, yet. QA opens it and knows what to check instead of guessing. Small thing. It kills a real, recurring source of confusion, and it's exactly the sort of glue nobody wants to write out by hand on every single release.

That's the actual throughline. None of this is "AI writes my code." It's AI sitting in the gaps between the tools I already use, the ticket and the codebase and the deploy and the QA pass, carrying the context I used to carry myself.

What changed, and what I skip

This isn't a demo I built once and screenshotted for a standup. We actually use it.

The real change is in how I work, not in what the AI spits out. I keep three or four tickets moving at the same time now. Each one shows up with its branch already cut and its context already loaded, so I'm not paying the cold-start tax on every one of them. That tax, the "what is this, where does it live, how do we do it here" tax, is exactly what used to pin me to one ticket at a time.

I'll be straight about where it doesn't pull its weight. For small tickets I ignore the plan and just do the thing. Faster that way. The pipeline isn't sacred, and shoving every tiny ticket through it would be its own kind of waste.

There was dumber pain along the way too. We started on a plan that handed out an access token that expired, so for a stretch the whole thing kept dying until someone manually pasted a fresh secret back in. Very cutting-edge stuff. We switched to a proper key and it stopped fighting us. A couple of rough edges are still here. The agent finds files mostly by searching for names, which is great for "where's addInvoice" and useless for "where do we handle offline payments," so a smarter search step is the obvious next move. And I still don't measure whether the plans are any good, even though the signal is right there for the taking. Compare the files a merged PR actually changed against the files the plan said it would. Right now "it helps" is a feeling, not a number. That one bugs me.
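That comparison could start as simple set arithmetic. A sketch, assuming both lists are already deduplicated repo-relative paths; the file names and `scorePlan` are hypothetical. Precision answers "how much of what the plan named was really touched," recall answers "how much of what was touched did the plan see coming."

```typescript
interface PlanScore {
  precision: number; // share of planned files that really changed
  recall: number; // share of changed files the plan predicted
  missed: string[]; // changed, but not in the plan
  extra: string[]; // in the plan, but not changed
}

// Assumption: both lists hold unique, repo-relative paths.
function scorePlan(planned: string[], changed: string[]): PlanScore {
  const hits = planned.filter((f) => changed.indexOf(f) !== -1);
  return {
    precision: planned.length === 0 ? 0 : hits.length / planned.length,
    recall: changed.length === 0 ? 0 : hits.length / changed.length,
    missed: changed.filter((f) => planned.indexOf(f) === -1),
    extra: planned.filter((f) => changed.indexOf(f) === -1),
  };
}

// Hypothetical ticket: the plan named three files, the merged PR touched four.
const score = scorePlan(
  ["src/routes/invoice.ts", "src/services/invoice.ts", "src/dao/invoice.ts"],
  [
    "src/routes/invoice.ts",
    "src/services/invoice.ts",
    "src/middleware/validateInvoice.ts",
    "src/dao/payment.ts",
  ]
);
// score.precision is 2/3, score.recall is 0.5
```

Tracked per ticket over a few weeks, those two numbers would turn "it helps" into a trend line, and the `missed` list points at the conventions the memory docs are not teaching yet.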

The takeaway

The model was never the bottleneck. Context was.

I didn't make any of this better by reaching for a bigger model. I made it better by writing down the things I'd tell a new hire. Where validation goes. Where the database calls live. Go copy the nearest thing we've already built. Then forcing the agent to read all of it before it touched the code, and keeping those notes alive as the code kept moving. The boring fix beat the exciting one, and it kept on beating it.

Once the AI started warm instead of cold, the plan stopped being the point. Every ticket lands with its context already loaded, and that's the only reason I can keep four of them in the air instead of grinding through them one at a time.

So if you're about to do this where you work, don't start with the model. Start with the context your own brain loads without being asked, the stuff you never had to write down because you already knew it. That is the part the AI is missing. The rest is plumbing.

The const enum that took down our payments

Karan Mali — Thu, 28 May 2026 13:01:29 +0000

Three minutes

That's how long I'd sit there every time I changed a file on our server. Three to four minutes for the dev server to rebuild and come back up. Long enough to check Slack, scroll Twitter, and then forget what I was about to do when it finally came back. I was the only one on Linux. The rest of the team was on Mac, and on their machines the dev server came back in twenty seconds. Nobody else felt the pain, so nobody else was looking for a fix. On my machine it was the bottleneck of my whole day. I started batching changes across multiple files so I'd only pay the rebuild cost once, which sounds clever but mostly meant I'd lose track of what I was actually testing. You don't really notice an hour disappearing into that. You just notice you're tired by 4pm and you can't figure out why.

For a while, I lived with it.

A small experiment

I'd used esbuild in some side projects before. It built things in milliseconds. After enough days of staring at a slow rebuild, I started wondering whether it would work for our server too. I tried it locally first. Wrote a small config, pointed it at our entry files, ran it. The build went from minutes to under a second. I ran the dev server. It came up. Things worked.

I pinged a senior developer. Told him I had esbuild working locally and asked if we should give it a real shot. He said yes, we had fifteen to twenty days before the next production release, plenty of room to catch problems. So I opened a PR. The PR was small: a config file, a couple of package.json scripts, and two enum changes that I'll come back to in a minute. That was it. We talked about it again the next day. He said let's not merge it into dev right now because dev was being stabilised for the upcoming production release. "Use it locally for now, we'll merge it after the release." I said fine. Went back to my work and kept using esbuild on my machine. The PR sat there. A few days later the production release happened. Everything looked fine. I didn't think about the PR. I was busy with my own tickets, and honestly the esbuild thing had already paid off for me personally. I was the only one using it, my local was fast, life was good.

That weekend, my phone buzzed.

One landlord, then five

The first message was from on-call. A landlord couldn't access their payment link. The link was returning something weird, undefined or null, the kind of thing that usually smells like a bad record more than a bad code path. We looked into it. Their data on our end seemed off. We wrote a small script, patched the row, told the landlord we were sorry, and moved on. One landlord with broken data is the kind of thing you can rationalise away. Migrations are messy, maybe a job didn't run, whatever. We've all seen worse.

The next day, another landlord reported the same thing. Then another. Then another. By the time we had five or six of these, the rationalisation stopped working. Data corruption does not politely affect five different accounts in the same exact way, in the same flow, on the same weekend. Something else was going on. And it was Saturday, and landlords were trying to collect rent, and the payment link, the one piece of the product that absolutely cannot be broken, was the thing that was broken.

I pulled the flow up on my own machine. It worked. Pulled it up on dev. Reproduced it on the second try. Logs from every affected account hit the same callback handler before things fell over. So this wasn't bad data. This was code. Code that had shipped, code that had passed review, code that none of our tests had caught. That's when I started going through git history.

The PR I forgot about

Our tickets that release had nothing to do with payments. Nobody had touched the payment files. Nobody except me, and only in one place: those two enum changes in the PR I thought we hadn't merged yet. I pulled up the git log. There it was. Merged into dev, shipped to prod, sitting there in the release as if it had always been planned. Apparently somewhere between "let's wait for the release" and the release itself, the PR had quietly gone in. I genuinely don't remember how. I don't remember being told. I had stopped tracking it because in my head it wasn't going out yet.

I opened the commit. Two small lines stood out:

- export declare const enum PaymentStatus { ... }
+ export enum PaymentStatus { ... }

I had a feeling, but I wasn't sure. I dropped the diff into Claude and asked what esbuild does with declare const enum. It told me everything I needed to know in about two paragraphs. I read it twice, then went looking through the actual codebase to confirm, because at this point I didn't really trust my own memory of what I had touched.

Why the build tool was the bug

tsc and esbuild look like they do the same job. You give them TypeScript, they give you JavaScript back. Except they really don't. tsc understands types across files. It reads the .d.ts files for every package you import from. When it sees a declare const enum, it doesn't generate a runtime object at all; it inlines the value directly at every use site. So payment.status === PStatus.CAPTURED compiles down to payment.status === "Captured". The enum doesn't exist at runtime. It doesn't need to.

esbuild doesn't read .d.ts files. That's the whole story, really, but it took me a while to fully appreciate what it meant.

We had an internal npm package that the server depended on heavily. Its types looked like this:

// internal-db-package/.../Payments.types.d.ts
export declare const enum PStatus {
  PENDING = 'Pending',
  CAPTURED = 'Captured',
}

And its compiled .js file looked like this:

// internal-db-package/.../Payments.types.js
"use strict";
// (that's the whole file; declare const enum compiles to nothing)

tsc would read the .d.ts, find the enum, and inline the value into the consumer code:

status === "Captured"

esbuild read the .js, found nothing, and emitted this instead:

status === (void 0).CAPTURED
// TypeError: Cannot read properties of undefined (reading 'CAPTURED')

The build printed a warning. Not an error, a warning. The server still booted. The dev environment behaved normally because that specific code path didn't run in any of our health checks. The only time it actually ran was when a real payment confirmation came back from the provider and the code tried to look up the status against the enum. That's when it crashed. The two enum flips in my PR (PaymentStatus and InvoiceTrackingEvent) were just there to get past esbuild's local build errors. I thought I had fixed the problem. What I had actually done was patch the two enums I happened to notice, while leaving every other enum import from that internal package silently broken in production.

The hotfix itself was tiny. Revert the build script back to tsc, restore the two declare const enum keywords, push. Five lines. I had it in within the hour. Total production impact came out to roughly five to six hours, spread across a weekend, affecting somewhere between ten and fifteen landlords depending on how you count the ones who retried successfully on their own. We patched the affected payment records by hand the next morning. None of the actual money was lost, just the link state. Stripe held the charges fine. That was the only mercy in the whole thing.

The part I didn't expect

I waited for someone to be angry. Nobody was. A senior developer pinged me after the hotfix went out. No "why did you do this", no "why didn't we test it harder". Just: "You know the fix, just get the PR up for production, we're good." That was the whole conversation. The story made the rounds internally as a joke, not an indictment. Nobody points at me with it.

Two months later, with the dust settled, I came back to esbuild. By then I had switched to a Mac and the slow-build pain wasn't personal anymore, but the numbers were still real and the curiosity hadn't gone away. This time I did it differently. I started by actually mapping where the internal package was imported. Dozens of places, some I had no idea existed. The codebase had been growing for years and a few of the original authors were no longer around. That alone was a useful exercise, separate from anything to do with build tools. I now had a list of every file the change could touch. The rule I set for myself was simple. tsc stays in the production Docker build, untouched. esbuild is only allowed near dev. Production keeps the boring, slow, correct thing. Local gets the fast, fragile thing, with guardrails.

Then I wrote two small esbuild plugins. The first one rewrites declare const enum into plain enum on the way through esbuild. It walks the source, strips the declare keyword, and lets the enum compile to a real runtime object. The local files now generate the JavaScript that tsc would have inlined for free.

The second plugin was the one that actually mattered. It virtualises the internal package's .types modules. The internal package's compiled .js files are empty (because, again, declare const enum compiles to nothing). The plugin intercepts every import resolving into those .types paths and substitutes a hand-written module with the same shape and the same values, but as a plain object literal that exists at runtime. The empty .js files never get loaded again. Whatever esbuild can't see because it never reads the .d.ts files, the plugin supplies directly.

The dev server rebuild went from 6.6 seconds with tsc down to 119 milliseconds with esbuild. Roughly fifty-five times faster.

tsc      ████████████████████████████████  6.6s
esbuild  ▏                                 119ms

To be clear, the 6.6 seconds is the Mac number, not the Linux one. On Linux it had been minutes; on the Mac tsc was bearable but esbuild was still in a different league. That setup is still running today, three months later, and nobody has noticed it exists, which is the highest compliment a dev tool can get. I also wrote a short doc explaining what the plugins do and why they exist, and pinned it in our engineering channel. Partly so the next person who touches the build doesn't wander into the same trap. Partly so future-me, in six months, doesn't either.
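
For anyone who wants a picture of the two plugins described above, here is a minimal sketch, not the code we actually run. It declares small local types that mirror the onResolve and onLoad hooks of esbuild's plugin API so it runs without esbuild installed. The regex, the .types filter, the "virtual-types" namespace, the injected readFile helper, and the "Payments.types" key are all assumptions made for illustration; only the PStatus values come from the snippets earlier in this post.

```typescript
// Local stand-ins for the slice of esbuild's plugin API used here
// (assumption: they mirror the onResolve/onLoad hooks; not imported from esbuild).
type ResolveResult = { path: string; namespace?: string };
type LoadResult = { contents: string; loader?: "ts" | "js" };
type HookArgs = { path: string };
interface BuildLike {
  onResolve(opts: { filter: RegExp }, cb: (args: HookArgs) => ResolveResult | undefined): void;
  onLoad(opts: { filter: RegExp; namespace?: string }, cb: (args: HookArgs) => LoadResult | undefined): void;
}
interface PluginLike {
  name: string;
  setup(build: BuildLike): void;
}

// Plugin 1 core: turn `declare const enum` into a plain `enum` so the compiled
// output contains a real runtime object. Naive regex: it would also rewrite
// matching text inside comments or strings.
function rewriteDeclareConstEnum(source: string): string {
  return source.replace(/\bdeclare\s+const\s+enum\b/g, "enum");
}

// `readFile` is injected (hypothetical helper) so the sketch has no Node-specific imports.
function declareConstEnumPlugin(readFile: (path: string) => string): PluginLike {
  return {
    name: "declare-const-enum",
    setup(build) {
      build.onLoad({ filter: /\.ts$/ }, (args) => ({
        contents: rewriteDeclareConstEnum(readFile(args.path)),
        loader: "ts",
      }));
    },
  };
}

// Plugin 2: hand-written runtime values for enums that exist only in .d.ts files.
// PStatus values come from the article; the "Payments.types" key is illustrative.
const VIRTUAL_TYPES: Record<string, Record<string, Record<string, string>>> = {
  "Payments.types": { PStatus: { PENDING: "Pending", CAPTURED: "Captured" } },
};

function virtualTypesPlugin(): PluginLike {
  return {
    name: "virtual-types",
    setup(build) {
      // Claim imports that end in a known `.types` module and move them to a virtual namespace.
      build.onResolve({ filter: /\.types(\.js)?$/ }, (args) => {
        const bare = args.path.replace(/\.js$/, "");
        const key = Object.keys(VIRTUAL_TYPES).find((k) => bare.endsWith(k));
        return key ? { path: key, namespace: "virtual-types" } : undefined;
      });
      // Serve a generated module instead of the empty compiled .js file.
      build.onLoad({ filter: /.*/, namespace: "virtual-types" }, (args) => {
        const enums = VIRTUAL_TYPES[args.path];
        if (!enums) return undefined;
        const contents = Object.keys(enums)
          .map((name) => `export const ${name} = ${JSON.stringify(enums[name])};`)
          .join("\n");
        return { contents, loader: "js" };
      });
    },
  };
}
```

In a real setup, objects shaped like these would go into the plugins array of the dev-only esbuild config, while the production Docker build keeps using tsc. The hand-written values table is the part that needs care: it has to be kept in step with the internal package's .d.ts files by hand.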

In closing

Build tools are replaceable. Good teams aren't.

From custom polling architecture to one API call: rethinking notification delivery

Karan Mali — Fri, 15 May 2026 12:23:43 +0000

How hard can it be?

I work at a property management SaaS company. Landlords use the platform to manage their properties, collect rent, track maintenance requests, handle lease contracts. Four developers, one product.

Small team means you don't get tickets scoped down to a single component, you get features, end to end. I joined as a backend developer but quickly ended up touching everything: APIs, frontend, infra, and eventually mobile. That's just how it works when there are four of you.

One day, a senior developer pinged me. The product needed a notification center. Users were asking for it, when a settlement gets transferred, when a maintenance request comes in, when a lease is about to expire. These are things landlords need to act on. Notifications made sense. The ask was simple. The notification system design turned out not to be.

I didn't just start coding. I went to the whiteboard, mapped out the flow, discussed edge cases with the senior developer, made sure I understood what we actually needed. I designed for the problem I was given. What I didn't do was pressure-test whether that problem was the complete one.

The design I was proud of

The first question I had to answer before writing a single line of code: how does the client get new notifications?

The standard answer is WebSockets. Open a persistent connection, server pushes events in real time. One of my teammates suggested exactly that. But here is the problem.

We run on Cloud Run, Google's serverless platform. Serverless means instances spin up and down. Persistent connections and abrupt disconnections become your problem to manage. On top of that, did our users actually need real-time notifications? A landlord getting notified about a rent settlement doesn't need to know in under a second. Five to ten seconds is fine. Polling was cheaper, easier to scale, and good enough for what we needed. I had a clear answer for why, and that mattered, because going against the standard approach means you'd better be able to defend it.

Here's how the full architecture looked:

Any service that wants to send a notification drops a job into BullMQ. A worker picks it up, writes to PostgreSQL, then pushes the notification ID into Redis. These two operations are sequential, not bundled. If the Redis push fails, it retries only that step; the DB write already happened and doesn't get touched again. One queue, one worker, two distinct responsibilities in sequence.

The DB-generated notification ID becomes the Redis key. That ordering matters: you need the ID before you can cache it.

On the read side, the client sends its last delivered notification ID. Server checks Redis: anything with an ID greater than that for this user? Return it. The client owns its cursor. No read/unread state to manage server-side. Stateless on the server, clean on the client.
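
The write and read paths above can be sketched in a few lines. This is a minimal in-memory model, not the real implementation: an array stands in for PostgreSQL, a Map stands in for Redis, and the function names (insertIntoDb, pushToCache, handleJob, poll) and the retry count are made up for illustration. It assumes IDs are generated by the DB and only ever increase.

```typescript
type StoredNotification = { id: number; userId: string; body: string };

// In-memory stand-ins: `db` plays PostgreSQL, `cache` plays Redis.
const db: StoredNotification[] = [];
const cache = new Map<string, StoredNotification[]>();

// Step 1: the DB write happens once and produces the ID.
function insertIntoDb(userId: string, body: string): StoredNotification {
  const row: StoredNotification = { id: db.length + 1, userId, body };
  db.push(row);
  return row;
}

// Step 2: the cache push, keyed by user and carrying the DB-generated ID.
function pushToCache(row: StoredNotification): void {
  const list = cache.get(row.userId) ?? [];
  list.push(row);
  cache.set(row.userId, list);
}

// Worker: sequential steps. A failed cache push retries only the push,
// never the DB insert.
function handleJob(
  userId: string,
  body: string,
  push: (row: StoredNotification) => void = pushToCache,
  maxAttempts = 3
): StoredNotification {
  const row = insertIntoDb(userId, body);
  for (let attempt = 1; ; attempt++) {
    try {
      push(row);
      return row;
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
    }
  }
}

// Read side: the client sends its cursor, the server returns anything newer.
// No read/unread state is stored on the server.
function poll(userId: string, lastDeliveredId: number): StoredNotification[] {
  return (cache.get(userId) ?? []).filter((n) => n.id > lastDeliveredId);
}
```

After a single handleJob call for a user, poll with a cursor of 0 returns that one notification, and poll with the returned ID as the cursor returns an empty list. That is the whole contract the client depends on.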

Two fallbacks handle edge cases. Every 30 minutes, the client forces a direct DB fetch in case Redis missed something. When the browser tab comes back from being suspended, the Page Visibility API triggers another direct DB call. Redis has a 24-hour TTL but the ID cursor filters stale entries anyway.

There's a subtle attack surface in the fallback design worth thinking through. If a malicious actor always passes db=true, every request bypasses Redis and hits the database directly: constant load, every poll cycle. The fix is server-controlled rate limiting:

// On every incoming request, check the rate-limit key before honoring db=true
const recentFetch = await redis.get(`last_db_fetch:${userId}`);
if (recentFetch) {
  // A direct DB fetch already happened in the last 60 seconds:
  // ignore client-supplied db=true, serve from Redis
  return serveFromRedis(userId, lastNotificationId);
}

// Otherwise allow the direct DB fetch and set the rate-limit key (expires after 60 seconds)
await redis.set(`last_db_fetch:${userId}`, "1", "EX", 60);

The client no longer controls when the DB gets called; the server does. Simple key, solves the problem cleanly.

The architecture worked. But it had a blind spot I couldn't see from inside it.

The Redis TTL had to stay manually synced with the frontend polling interval: two things in two places, easy to drift. And the biggest problem: every new delivery channel meant building a separate integration from scratch. Email? Build it. Mobile push? Build it. Slack? Build it. I owned all of it indefinitely.

At the time, the brief was web only. So this felt fine. And then it wasn't.

The comment that broke the design

We have weekly meetings. Product updates, priorities, what's coming next. I was in one of those when our CEO mentioned, almost in passing, that we'd be launching a mobile app in a few months.

That was it. That was the comment.

I hadn't thought about it before that moment. But something clicked immediately. Mobile app means push notifications. Push notifications mean FCM integration. Email was also coming; the brief had been expanding quietly in the background. And I was sitting on an architecture I'd spent two days designing and defending, one that handled web delivery and nothing else.

I went to the senior developer after the meeting. Laid it out: if we want email, I build that delivery layer. If we want mobile push, I build that too. If we want Slack someday, same story. Every new channel is a new integration I own, maintain, and debug when it breaks at 2am.

That's not a notification system. That's a delivery platform. Different scope, different problem.

I talked it through with them and made the call. Go third-party. Two days of architecture work had to go: the polling logic, the Redis TTL design, the fallback math. I scrapped it. Not because the thinking was wrong. The technical decisions were sound. But I had designed for a problem that was smaller than the actual one, and the right move was to own that and correct it before building further on a foundation that wouldn't hold.

Two days gone. The alternative was spending the next year maintaining delivery across three channels myself. That math wasn't hard.

The mistake wasn't the architecture. It was not pressure-testing the scope before I started. But catching it at two days instead of six months into a mobile launch, that's the part that mattered.

What I actually evaluated

Going third-party meant finding the right tool. I evaluated against a short list of non-negotiables.

We needed a React Native SDK with a prebuilt inbox component — we weren't going to build our own notification UI on mobile. We needed i18n support without enterprise pricing — we serve an Arabic-speaking market and that's table stakes, not a premium feature. And we needed a single API call that handled all delivery channels, so adding email or push later didn't mean a new integration.

I evaluated a few options. One required building the mobile UI ourselves. One priced i18n behind an enterprise tier. Courier hit every requirement: prebuilt React Native inbox, i18n on all plans, one API call for web, mobile, and email.

One practical note if you're on a similar stack: Courier has no Angular SDK. Our web client is Angular, so I had to inject their web component directly into the DOM. It works, but it's not the same developer experience as a native SDK. Worth factoring in before you commit.

The new notification system architecture

With Courier handling delivery, the system got significantly simpler.

The flow is straightforward. Any service calls NotificationHubService.create(). The service resolves recipients, checks preferences, saves to PostgreSQL, and enqueues a job to BullMQ. The worker picks it up and sends a single API call to Courier — user ID, title, body, action URL. Courier routes to web inbox, mobile push, or email depending on configuration. The worker then updates the notification record with Courier's request ID for traceability.
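
A rough sketch of that flow, with the infrastructure swapped for in-memory stand-ins. Only NotificationHubService.create() is a name from our codebase; the types, the processNext method, and the injected send function are hypothetical, and the send function deliberately does not model Courier's actual API. Recipient resolution and preference checks are left out to keep the shape of the flow visible.

```typescript
// Everything here except the NotificationHubService.create name is hypothetical.
type OutgoingMessage = { userId: string; title: string; body: string; actionUrl: string };
type SavedNotification = OutgoingMessage & { id: number; providerRequestId?: string };

// Stand-in for the single delivery call; resolves to the provider's request ID.
type SendFn = (message: OutgoingMessage) => Promise<string>;

class NotificationHubService {
  private rows: SavedNotification[] = []; // stands in for PostgreSQL
  private queue: number[] = [];           // stands in for the BullMQ queue (row IDs)
  private send: SendFn;

  constructor(send: SendFn) {
    this.send = send;
  }

  // Called by any service: persist first, then enqueue for delivery.
  create(message: OutgoingMessage): SavedNotification {
    const row: SavedNotification = { ...message, id: this.rows.length + 1 };
    this.rows.push(row);
    this.queue.push(row.id);
    return row;
  }

  // Worker step: one provider call per job, then store the request ID for traceability.
  async processNext(): Promise<SavedNotification | undefined> {
    const id = this.queue.shift();
    const row = this.rows.find((r) => r.id === id);
    if (!row) return undefined;
    row.providerRequestId = await this.send({
      userId: row.userId,
      title: row.title,
      body: row.body,
      actionUrl: row.actionUrl,
    });
    return row;
  }
}
```

The worker never decides which channel a message goes to. It hands over user ID, title, body, and action URL, and routing stays on the provider's side.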

That's it. No Redis live queue. No polling. No fallbacks. No TTL drift.

The diagram looks simple because the architecture is simple. The interesting complexity isn't in the flow; it's in the recipient resolution logic, which lives at the code layer, not the architecture layer. And that was a deliberate choice.

The interface supports three modes: explicit user IDs, resource-based access (pass a lease or settlement and the service resolves who has access using the existing RBAC layer), or broadcast to all active account members. The caller doesn't need to know the access rules. Any service can call the same interface with a category and a message. Adding a new notification type is one function call. The architecture is deliberately simple so the logic can be where it needs to be — in the code, not the infrastructure.
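
The three modes map naturally onto a discriminated union. The sketch below is one way to express that, under assumptions: the type names, the AccessLookup interface, and its two methods are invented for illustration and stand in for whatever the existing RBAC layer exposes.

```typescript
// Hypothetical shapes; the three modes themselves come from the description above.
type Recipients =
  | { mode: "users"; userIds: string[] }
  | { mode: "resource"; resourceType: "lease" | "settlement"; resourceId: string }
  | { mode: "broadcast"; accountId: string };

// Stand-in for the existing RBAC layer (method names are assumptions).
interface AccessLookup {
  usersWithAccess(resourceType: string, resourceId: string): string[];
  activeMembers(accountId: string): string[];
}

// Remove duplicate user IDs while keeping first-seen order.
function dedupe(ids: string[]): string[] {
  return Array.from(new Set(ids));
}

// The caller picks a mode; the access rules stay inside the resolver.
function resolveRecipients(target: Recipients, access: AccessLookup): string[] {
  switch (target.mode) {
    case "users":
      return dedupe(target.userIds);
    case "resource":
      return dedupe(access.usersWithAccess(target.resourceType, target.resourceId));
    case "broadcast":
      return dedupe(access.activeMembers(target.accountId));
  }
}
```

Because the union is exhaustive, adding a fourth mode later forces the compiler to point at every place that needs to handle it, which is the kind of guardrail a small team wants.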

The tradeoff is vendor dependency. If Courier has an outage, notifications go down. That's a real risk we accepted — but for a four-person team, owning delivery reliability ourselves across every channel wasn't a trade we could win.

The question I should have asked first

I spent two days designing a solid architecture. BullMQ queue, Redis cache, polling logic, fallbacks for every edge case. I had reasoned through each decision and could defend them.

What I hadn't done was ask one question: where else do you want to send these notifications?

Business briefs describe what people want today. They don't always describe what the system needs to support tomorrow — not because anyone is hiding it, but because a CEO mentioning a mobile app in a standup isn't thinking about your queue architecture. That's their product, not their system.

The system has three distinct responsibilities: generating notifications, resolving who receives them, and delivering them. The first architecture conflated all three. The new one separates them — generation stays in each service, resolution lives in NotificationHubService, delivery belongs to Courier. Each part is independently replaceable. That separation is also what made the build-vs-buy call clear: delivery is a commodity problem. Recipient resolution, tied to your RBAC model and your business rules, is not. Own what's specific to your domain. Buy what isn't.

If I'd asked that question in the first meeting, none of this would have needed unwinding. I caught it at two days instead of six months into a mobile launch. Next time I start a system design, I know what the first question is.

If you've had a similar "comment that broke the design" moment — when did you catch it? Drop it in the comments. Always curious how other teams pressure-test scope before they commit.