<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Damir Karimov</title>
    <description>The latest articles on DEV Community by Damir Karimov (@damir-karimov).</description>
    <link>https://dev.to/damir-karimov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2575304%2Fe501ae75-9f5b-4d85-9dd7-670b54fe522c.png</url>
      <title>DEV Community: Damir Karimov</title>
      <link>https://dev.to/damir-karimov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/damir-karimov"/>
    <language>en</language>
    <item>
      <title>Why Your Background Jobs Fail in Production</title>
      <dc:creator>Damir Karimov</dc:creator>
      <pubDate>Mon, 29 Jun 2026 10:41:23 +0000</pubDate>
      <link>https://dev.to/damir-karimov/why-your-background-jobs-fail-in-production-3nne</link>
      <guid>https://dev.to/damir-karimov/why-your-background-jobs-fail-in-production-3nne</guid>
      <description>&lt;p&gt;Most developers first meet background jobs through a deceptively simple model: put work in a queue, let a worker pick it up, and assume the job will be processed once. In production, that mental model breaks quickly because queue systems are designed for reliability under failure, not for “perfect” one-shot execution.&lt;/p&gt;

&lt;p&gt;Need to send an email? Queue it. Generate a PDF? Queue it. Process images? Queue it.&lt;/p&gt;

&lt;p&gt;That feels elegant at first. The request finishes fast, the slow work happens elsewhere, and everything looks stable.&lt;/p&gt;

&lt;p&gt;Then production happens.&lt;/p&gt;

&lt;p&gt;A worker crashes after sending the email.&lt;br&gt;&lt;br&gt;
A network hiccup prevents the ACK from reaching the broker.&lt;br&gt;&lt;br&gt;
A third-party API times out halfway through processing.&lt;br&gt;&lt;br&gt;
The same job gets retried and runs again.&lt;/p&gt;

&lt;p&gt;Now you have duplicate emails, duplicate charges, or jobs that appear to vanish and then come back later. That is not an edge case. That is normal distributed-systems behavior.&lt;/p&gt;


&lt;h2&gt;
  
  
  The happy path is a lie
&lt;/h2&gt;

&lt;p&gt;The classic flow looks clean:&lt;/p&gt;

&lt;p&gt;User registration&lt;br&gt;&lt;br&gt;
→ Create user&lt;br&gt;&lt;br&gt;
→ Push job to queue&lt;br&gt;&lt;br&gt;
→ Worker processes job&lt;br&gt;&lt;br&gt;
→ Send email&lt;/p&gt;

&lt;p&gt;That flow is useful for explaining the idea, but it hides the real failure modes. In practice, each step can fail independently: enqueue can succeed while downstream storage is unavailable, a worker can complete side effects and then crash before acknowledging the message, or a retry can happen because the system cannot tell whether the previous attempt fully finished.&lt;/p&gt;

&lt;p&gt;A queue does not guarantee execution exactly once.&lt;br&gt;&lt;br&gt;
It coordinates work, and it does so in a failure-prone environment.&lt;/p&gt;

&lt;p&gt;That distinction matters because it changes how you design the job handler, the data model, and the retry strategy.&lt;/p&gt;


&lt;h2&gt;
  
  
  At-least-once delivery
&lt;/h2&gt;

&lt;p&gt;Most production queue systems provide &lt;strong&gt;at-least-once delivery&lt;/strong&gt;. Amazon SQS standard queues, for example, document that a message can occasionally be delivered more than once, and the guidance is to design applications to be idempotent so repeated processing does not cause harm.&lt;/p&gt;

&lt;p&gt;That means a job may run more than once.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;Worker receives job&lt;br&gt;&lt;br&gt;
→ Sends email&lt;br&gt;&lt;br&gt;
→ Crashes before ACK&lt;br&gt;&lt;br&gt;
→ Queue retries job&lt;br&gt;&lt;br&gt;
→ Email is sent again&lt;/p&gt;

&lt;p&gt;From the queue’s point of view, the job was never confirmed as complete. Retrying is the correct behavior. The problem is that the side effect already happened.&lt;/p&gt;
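
&lt;p&gt;The redelivery sequence above can be reproduced with a tiny in-memory model. This is only a sketch: &lt;code&gt;TinyQueue&lt;/code&gt;, &lt;code&gt;runWorker&lt;/code&gt;, and the &lt;code&gt;crashBeforeAck&lt;/code&gt; flag are hypothetical names, not the API of any real broker.&lt;/p&gt;

```typescript
// Minimal in-memory model of at-least-once delivery (not a real broker client).
type QueueMessage = { id: string; body: string };

class TinyQueue {
  private pending: QueueMessage[] = [];

  enqueue(message: QueueMessage): void {
    this.pending.push(message);
  }

  // Hand out the oldest message without removing it.
  // It stays in the queue until the worker acknowledges it.
  receive(): QueueMessage | undefined {
    return this.pending[0];
  }

  // Acknowledge: only now is the message removed.
  ack(id: string): void {
    this.pending = this.pending.filter((m) => m.id !== id);
  }

  size(): number {
    return this.pending.length;
  }
}

const sentEmails: string[] = [];

// The worker performs its side effect first and acknowledges afterwards.
function runWorker(source: TinyQueue, crashBeforeAck: boolean): void {
  const message = source.receive();
  if (message === undefined) return;
  sentEmails.push(message.body); // side effect: "send" the email
  if (crashBeforeAck) return; // simulated crash: the ACK never happens
  source.ack(message.id);
}

const emailQueue = new TinyQueue();
emailQueue.enqueue({ id: "job-1", body: "Welcome!" });

runWorker(emailQueue, true); // first attempt crashes after sending
runWorker(emailQueue, false); // redelivery: the same job runs again

// sentEmails now holds two copies of "Welcome!", and the queue is empty.
```

&lt;p&gt;The queue behaved correctly in both attempts. The duplicate comes entirely from the side effect happening before the acknowledgement.&lt;/p&gt;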

&lt;p&gt;This is why “the queue handled it” is not the same as “the system handled it.”&lt;/p&gt;


&lt;h2&gt;
  
  
  Idempotency is required
&lt;/h2&gt;

&lt;p&gt;Once duplicate delivery is possible, idempotency becomes mandatory. The handler should produce the same final result even if it runs multiple times with the same input.&lt;/p&gt;

&lt;p&gt;Bad example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;account&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this runs twice, the user gets credited twice.&lt;/p&gt;

&lt;p&gt;Better:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;processPaymentJob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;transactionId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;alreadyProcessed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;paymentEvents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findUnique&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;transactionId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transactionId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;alreadyProcessed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;$transaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;paymentEvents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;transactionId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transactionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;accounts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;findUnique&lt;/code&gt; check alone is not enough: two workers can both pass it before either one writes. You also need a unique database constraint on &lt;code&gt;transactionId&lt;/code&gt;, plus a transactional write boundary, so the second insert fails and its balance update rolls back instead of applying the same effect twice. That combination holds up in production in a way a simple in-memory guard does not.&lt;/p&gt;

&lt;p&gt;A good rule: store a stable business key for every side effect, then make that key unique.&lt;/p&gt;
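
&lt;p&gt;As a rough illustration of that rule, the sketch below uses an in-memory set as a stand-in for a unique constraint. &lt;code&gt;PaymentLedger&lt;/code&gt; and &lt;code&gt;applyCredit&lt;/code&gt; are hypothetical names, and a real system would rely on the database to enforce uniqueness atomically.&lt;/p&gt;

```typescript
// In-memory stand-in for a table with a UNIQUE constraint on transactionId.
class PaymentLedger {
  private seenTransactionIds = new Set();
  private balances = new Map();

  // Returns true if the credit was applied, false if it was a duplicate.
  applyCredit(transactionId: string, userId: string, amount: number): boolean {
    // A real database would reject the duplicate insert atomically;
    // here the check and the write happen in one synchronous step.
    if (this.seenTransactionIds.has(transactionId)) return false;
    this.seenTransactionIds.add(transactionId);
    const current: number = this.balances.get(userId) ?? 0;
    this.balances.set(userId, current + amount);
    return true;
  }

  balanceOf(userId: string): number {
    return this.balances.get(userId) ?? 0;
  }
}

const ledger = new PaymentLedger();
const firstAttempt = ledger.applyCredit("tx-1001", "user-7", 100);
const retriedAttempt = ledger.applyCredit("tx-1001", "user-7", 100);
// firstAttempt is true, retriedAttempt is false, and the balance stays at 100.
```

&lt;p&gt;The key is a business identifier (the transaction), not the queue's message ID, so the guard still works if the same payment is enqueued twice.&lt;/p&gt;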




&lt;h2&gt;
  
  
  Exactly-once is a trap
&lt;/h2&gt;

&lt;p&gt;Engineers often ask whether exactly-once processing is possible. In practice, the answer is usually “not in the way people imagine.”&lt;/p&gt;

&lt;p&gt;A job may charge a customer successfully, then fail while persisting the result. Or the database write may succeed while the worker crashes before confirming the message. Either way, the system ends up with inconsistent state unless it is explicitly designed for idempotency and reconciliation.&lt;/p&gt;

&lt;p&gt;What people call “exactly-once” in production is usually a mix of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deduplication,&lt;/li&gt;
&lt;li&gt;idempotent handlers,&lt;/li&gt;
&lt;li&gt;atomic state transitions,&lt;/li&gt;
&lt;li&gt;checkpointing,&lt;/li&gt;
&lt;li&gt;and careful recovery logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The useful mental model is not “never run twice.”&lt;br&gt;&lt;br&gt;
It is “running twice must not break anything.”&lt;/p&gt;




&lt;h2&gt;
  
  
  Retries can make outages worse
&lt;/h2&gt;

&lt;p&gt;Retries are necessary, but uncontrolled retries can amplify a problem.&lt;/p&gt;

&lt;p&gt;Imagine 50,000 jobs depending on a third-party API. If that API slows down or goes offline and every worker retries immediately, you create a retry storm. Instead of letting the system recover, you add more pressure to the failing dependency.&lt;/p&gt;

&lt;p&gt;A better pattern is exponential backoff with jitter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retry #1 after a short delay&lt;/li&gt;
&lt;li&gt;Retry #2 after a longer delay&lt;/li&gt;
&lt;li&gt;Retry #3 after an even longer delay&lt;/li&gt;
&lt;li&gt;Add random jitter so retries do not synchronize&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getRetryDelay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="nx"&gt;_600_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jitter&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without jitter, thousands of jobs can wake up at the same time and hit the same dependency together. With jitter, the pressure spreads out, which is exactly what you want during recovery.&lt;/p&gt;
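
&lt;p&gt;To make the difference concrete, here is a small sketch that computes the third retry delay for 1,000 jobs with and without jitter. It reuses the constants from &lt;code&gt;getRetryDelay&lt;/code&gt;; the seeded generator &lt;code&gt;makeRandom&lt;/code&gt; is an assumption added only so the result is reproducible.&lt;/p&gt;

```typescript
// Deterministic pseudo-random generator so the sketch is reproducible.
// (A linear congruential generator; production code can use Math.random.)
function makeRandom(seed: number): () => number {
  let state = seed;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

const BASE_DELAY_MS = 60_000;
const MAX_DELAY_MS = 3_600_000;

function backoffWithoutJitter(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

// Same shape as getRetryDelay: between 50% and 100% of the capped delay.
function backoffWithJitter(attempt: number, random: () => number): number {
  const capped = backoffWithoutJitter(attempt);
  return Math.floor(capped * (0.5 + random() * 0.5));
}

const seededRandom = makeRandom(42);
const jobIds = Array.from({ length: 1000 }, (_, index) => index);

const fixedDelays = new Set(jobIds.map(() => backoffWithoutJitter(2)));
const jitteredDelays = jobIds.map(() => backoffWithJitter(2, seededRandom));
const distinctJitteredDelays = new Set(jitteredDelays);

// fixedDelays.size is 1: every job wakes up at exactly 240,000 ms.
// The jittered delays fall between 120,000 ms and 240,000 ms and are spread out.
```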




&lt;h2&gt;
  
  
  Dead letter queues matter early
&lt;/h2&gt;

&lt;p&gt;Some jobs are never going to succeed.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;invalid payloads,&lt;/li&gt;
&lt;li&gt;missing records,&lt;/li&gt;
&lt;li&gt;deleted resources,&lt;/li&gt;
&lt;li&gt;permanent validation errors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a job fails for a non-transient reason, retrying it forever wastes capacity and makes the queue harder to reason about. That is what a dead letter queue is for: a place to send messages that have exceeded their retry budget or failed in a non-recoverable way.&lt;/p&gt;

&lt;p&gt;A healthy production setup usually treats the DLQ as a first-class path, not an afterthought. It gives you a clear place to inspect poison messages, track recurring failures, and keep the main queue clean.&lt;/p&gt;

&lt;p&gt;If you do one thing after adding retries, make sure you also define where failed jobs go next.&lt;/p&gt;
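
&lt;p&gt;One way to picture the routing decision is the sketch below. The retry budget of three attempts, the &lt;code&gt;JobOutcome&lt;/code&gt; labels, and the function names are all illustrative assumptions, and the queues are plain arrays rather than a real broker.&lt;/p&gt;

```typescript
type JobOutcome = "ok" | "transient-error" | "permanent-error";
type QueuedJob = { id: string; attempts: number };

const MAX_ATTEMPTS = 3; // hypothetical retry budget

const mainQueue: QueuedJob[] = [];
const deadLetterQueue: QueuedJob[] = [];
const completedJobs: string[] = [];

// Decide where a job goes after one processing attempt.
function routeAfterAttempt(job: QueuedJob, outcome: JobOutcome): void {
  const attempted = { id: job.id, attempts: job.attempts + 1 };
  if (outcome === "ok") {
    completedJobs.push(job.id);
  } else if (outcome === "permanent-error") {
    // Retrying cannot help (for example, an invalid payload).
    deadLetterQueue.push(attempted);
  } else if (attempted.attempts >= MAX_ATTEMPTS) {
    // Transient failure, but the retry budget is spent.
    deadLetterQueue.push(attempted);
  } else {
    mainQueue.push(attempted); // schedule another attempt
  }
}

// Drain the queue with a handler that reports an outcome per job.
function drain(handler: (job: QueuedJob) => JobOutcome): void {
  let job = mainQueue.shift();
  while (job !== undefined) {
    routeAfterAttempt(job, handler(job));
    job = mainQueue.shift();
  }
}

mainQueue.push({ id: "send-email", attempts: 0 });
mainQueue.push({ id: "bad-payload", attempts: 0 });
mainQueue.push({ id: "flaky-api", attempts: 0 });

drain((job) => {
  if (job.id === "bad-payload") return "permanent-error";
  if (job.id === "flaky-api") return "transient-error";
  return "ok";
});

// completedJobs: send-email
// deadLetterQueue: bad-payload (1 attempt), flaky-api (3 attempts)
```

&lt;p&gt;The permanent failure goes to the DLQ immediately, while the transient one only arrives there after the budget is exhausted. A real worker would also wait between attempts instead of retrying at once.&lt;/p&gt;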




&lt;h2&gt;
  
  
  Long jobs should be split up
&lt;/h2&gt;

&lt;p&gt;A common mistake is to make one job do everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generate a report,&lt;/li&gt;
&lt;li&gt;download assets,&lt;/li&gt;
&lt;li&gt;resize images,&lt;/li&gt;
&lt;li&gt;upload files,&lt;/li&gt;
&lt;li&gt;update the database,&lt;/li&gt;
&lt;li&gt;send a notification.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That looks convenient until step five fails and you have to decide whether to rerun the whole thing. The larger the job, the harder it is to retry safely and the more expensive each failure becomes.&lt;/p&gt;

&lt;p&gt;A better shape is a chain of smaller units:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generate report,&lt;/li&gt;
&lt;li&gt;process assets,&lt;/li&gt;
&lt;li&gt;upload results,&lt;/li&gt;
&lt;li&gt;notify user.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Smaller jobs are easier to retry, easier to observe, and easier to scale independently. They also reduce blast radius when something breaks.&lt;/p&gt;
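
&lt;p&gt;A chain like this can be sketched as steps that each enqueue their successor. The step names follow the list above; &lt;code&gt;nextStep&lt;/code&gt;, &lt;code&gt;processChain&lt;/code&gt;, and the simulated upload failure are assumptions made for illustration.&lt;/p&gt;

```typescript
type StepName = "generate-report" | "process-assets" | "upload-results" | "notify-user";
type StepJob = { step: StepName; reportId: string };

// Each step names its successor, so a failure only re-runs one small unit.
const nextStep: { [step in StepName]: StepName | null } = {
  "generate-report": "process-assets",
  "process-assets": "upload-results",
  "upload-results": "notify-user",
  "notify-user": null,
};

const stepQueue: StepJob[] = [];
const executionLog: string[] = [];
let uploadFailuresLeft = 1; // simulate one transient upload failure

// Returns true when the step succeeded.
function runStep(job: StepJob): boolean {
  executionLog.push(job.step);
  if (job.step === "upload-results") {
    if (uploadFailuresLeft > 0) {
      uploadFailuresLeft -= 1;
      return false;
    }
  }
  return true;
}

function processChain(): void {
  let job = stepQueue.shift();
  while (job !== undefined) {
    if (runStep(job)) {
      const following = nextStep[job.step];
      if (following !== null) stepQueue.push({ step: following, reportId: job.reportId });
    } else {
      stepQueue.push(job); // retry only the failed step
    }
    job = stepQueue.shift();
  }
}

stepQueue.push({ step: "generate-report", reportId: "report-1" });
processChain();
// executionLog: generate-report, process-assets, upload-results, upload-results, notify-user
```

&lt;p&gt;Only the upload runs twice; the report is not regenerated. Each step still needs to be idempotent on its own, since any of them can be redelivered.&lt;/p&gt;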




&lt;h2&gt;
  
  
  Monitoring matters more than processing
&lt;/h2&gt;

&lt;p&gt;Many teams spend a lot of time building queue workers and very little time watching them. That is a mistake.&lt;/p&gt;

&lt;p&gt;At minimum, track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queue length,&lt;/li&gt;
&lt;li&gt;queue lag,&lt;/li&gt;
&lt;li&gt;processing latency,&lt;/li&gt;
&lt;li&gt;job duration,&lt;/li&gt;
&lt;li&gt;failure rate,&lt;/li&gt;
&lt;li&gt;retry volume,&lt;/li&gt;
&lt;li&gt;DLQ growth.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These signals tell you whether workers are keeping up, whether a dependency is slowing the pipeline, and whether failures are starting to accumulate. Azure’s background-job guidance also emphasizes reliability and scaling as first-class concerns, not optional extras.&lt;/p&gt;
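
&lt;p&gt;As a minimal sketch of how a few of these signals could be derived, the snippet below summarizes a window of finished jobs. The &lt;code&gt;JobRecord&lt;/code&gt; shape and the sample timestamps are made up for the example; in practice the numbers would come from your metrics pipeline.&lt;/p&gt;

```typescript
type JobRecord = {
  enqueuedAtMs: number;
  startedAtMs: number;
  finishedAtMs: number;
  succeeded: boolean;
};

function average(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Summarize a window of finished jobs into a few health signals.
function summarize(records: JobRecord[]) {
  const failures = records.filter((r) => !r.succeeded).length;
  return {
    // How long jobs waited before a worker picked them up.
    averageQueueLagMs: average(records.map((r) => r.startedAtMs - r.enqueuedAtMs)),
    // How long the work itself took.
    averageDurationMs: average(records.map((r) => r.finishedAtMs - r.startedAtMs)),
    failureRate: records.length === 0 ? 0 : failures / records.length,
  };
}

const windowSummary = summarize([
  { enqueuedAtMs: 0, startedAtMs: 200, finishedAtMs: 500, succeeded: true },
  { enqueuedAtMs: 100, startedAtMs: 900, finishedAtMs: 1000, succeeded: false },
  { enqueuedAtMs: 300, startedAtMs: 1300, finishedAtMs: 1500, succeeded: true },
  { enqueuedAtMs: 400, startedAtMs: 2400, finishedAtMs: 2600, succeeded: true },
]);
// averageQueueLagMs: 1000, averageDurationMs: 200, failureRate: 0.25
```

&lt;p&gt;In this sample the lag climbs from 200 ms to 2,000 ms while job duration stays flat, which is the typical signature of workers falling behind rather than jobs getting slower.&lt;/p&gt;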

&lt;p&gt;If you cannot see your queue health, you are debugging blind.&lt;/p&gt;




&lt;h2&gt;
  
  
  Backpressure is the silent killer
&lt;/h2&gt;

&lt;p&gt;Backpressure becomes a problem when work arrives faster than you can process it and nothing tells producers to slow down.&lt;/p&gt;

&lt;p&gt;If your system receives 100 jobs per second but can only complete 50, the backlog grows by 50 jobs every second, which is 180,000 jobs after an hour. Eventually that leads to rising latency, memory pressure, and a degraded user experience.&lt;/p&gt;

&lt;p&gt;Common ways to deal with it include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rate limiting,&lt;/li&gt;
&lt;li&gt;worker autoscaling,&lt;/li&gt;
&lt;li&gt;priority queues,&lt;/li&gt;
&lt;li&gt;load shedding,&lt;/li&gt;
&lt;li&gt;rejecting or delaying non-critical work.&lt;/li&gt;
&lt;/ul&gt;
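
&lt;p&gt;Two of these ideas fit in a short sketch: the backlog arithmetic from the example above, and a bounded queue that sheds load once it is full. &lt;code&gt;backlogAfter&lt;/code&gt; and &lt;code&gt;BoundedQueue&lt;/code&gt; are hypothetical names, and the capacity of 3 is chosen only to keep the example small.&lt;/p&gt;

```typescript
// Backlog after a number of seconds when arrivals outpace completions.
function backlogAfter(seconds: number, arrivalsPerSecond: number, completionsPerSecond: number): number {
  return Math.max(0, (arrivalsPerSecond - completionsPerSecond) * seconds);
}

// A bounded queue refuses new work once it is full instead of growing forever.
class BoundedQueue {
  private items: string[] = [];
  private capacity: number;
  rejected = 0;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Returns false when the job was shed; the caller can retry later or drop it.
  offer(jobId: string): boolean {
    if (this.items.length >= this.capacity) {
      this.rejected += 1;
      return false;
    }
    this.items.push(jobId);
    return true;
  }

  take(): string | undefined {
    return this.items.shift();
  }

  depth(): number {
    return this.items.length;
  }
}

const oneHourBacklog = backlogAfter(3600, 100, 50); // 180000 jobs

const bounded = new BoundedQueue(3);
const accepted = ["a", "b", "c", "d", "e"].map((id) => bounded.offer(id));
// accepted: true, true, true, false, false
```

&lt;p&gt;Rejecting at the edge turns a slow, invisible backlog into an explicit signal that producers and dashboards can react to.&lt;/p&gt;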

&lt;p&gt;Ignoring backpressure does not remove it. It just makes the failure slower and more expensive.&lt;/p&gt;




&lt;h2&gt;
  
  
  What production systems actually look like
&lt;/h2&gt;

&lt;p&gt;What tutorials often show:&lt;/p&gt;

&lt;p&gt;Queue → Worker → Done&lt;/p&gt;

&lt;p&gt;What production systems usually need:&lt;/p&gt;

&lt;p&gt;App&lt;br&gt;&lt;br&gt;
→ Queue&lt;br&gt;&lt;br&gt;
→ Workers&lt;br&gt;&lt;br&gt;
→ Retry policy&lt;br&gt;&lt;br&gt;
→ Dead letter queue&lt;br&gt;&lt;br&gt;
→ Metrics&lt;br&gt;&lt;br&gt;
→ Alerting&lt;br&gt;&lt;br&gt;
→ Recovery process&lt;/p&gt;

&lt;p&gt;The processing code is rarely the hardest part. The hard part is designing for failure, visibility, and recovery.&lt;/p&gt;




&lt;h2&gt;
  
  
  Production checklist
&lt;/h2&gt;

&lt;p&gt;Before shipping a queue-based system, verify that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The job is idempotent, so the same job can run multiple times safely.&lt;/li&gt;
&lt;li&gt;Retry delays use exponential backoff.&lt;/li&gt;
&lt;li&gt;Jitter is enabled.&lt;/li&gt;
&lt;li&gt;A dead letter queue exists.&lt;/li&gt;
&lt;li&gt;Metrics and alerts are in place.&lt;/li&gt;
&lt;li&gt;Failure scenarios were tested intentionally.&lt;/li&gt;
&lt;li&gt;Workers can scale horizontally.&lt;/li&gt;
&lt;li&gt;Backpressure is handled explicitly.&lt;/li&gt;
&lt;li&gt;Jobs are small enough to retry without fear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If several of these are missing, the system is probably more fragile than it looks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;Background jobs fail in production not because queues are bad, but because the happy-path mental model is incomplete.&lt;/p&gt;

&lt;p&gt;Real systems deal with crashes, retries, duplicate delivery, slow dependencies, and poisoned messages. The goal is not to pretend those failures do not exist. The goal is to make them survivable.&lt;/p&gt;

&lt;p&gt;Once you design for failure instead of assuming success, background jobs become much more trustworthy.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>systemdesign</category>
      <category>distributedsystems</category>
      <category>queue</category>
    </item>
    <item>
      <title>Designing a Real-Time Chat System at Scale</title>
      <dc:creator>Damir Karimov</dc:creator>
      <pubDate>Tue, 23 Jun 2026 09:02:11 +0000</pubDate>
      <link>https://dev.to/damir-karimov/designing-a-real-time-chat-system-at-scale-53k7</link>
      <guid>https://dev.to/damir-karimov/designing-a-real-time-chat-system-at-scale-53k7</guid>
      <description>&lt;p&gt;Most developers underestimate how hard messaging systems really are.&lt;/p&gt;

&lt;p&gt;A basic chat demo with WebSockets is easy to build. A production-grade messaging platform like WhatsApp, Telegram, Discord, or Slack is a completely different engineering problem. The hard part is not rendering messages in the UI. The hard parts are keeping millions of persistent connections alive, delivering messages reliably, preserving ordering, handling offline sync, scaling group chats, surviving partial failures, and keeping latency low around the world.&lt;/p&gt;

&lt;p&gt;In this post, we'll design a scalable real-time chat architecture and walk through the trade-offs behind modern messaging systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why chat gets hard
&lt;/h2&gt;

&lt;p&gt;At first glance, chat seems simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;send a message&lt;/li&gt;
&lt;li&gt;store it&lt;/li&gt;
&lt;li&gt;show it to the other user&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That works for a demo. It falls apart fast in production.&lt;/p&gt;

&lt;p&gt;Once you add real users, the system has to handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;persistent connections&lt;/li&gt;
&lt;li&gt;reconnects&lt;/li&gt;
&lt;li&gt;message ordering&lt;/li&gt;
&lt;li&gt;offline delivery&lt;/li&gt;
&lt;li&gt;read receipts&lt;/li&gt;
&lt;li&gt;typing indicators&lt;/li&gt;
&lt;li&gt;multi-device sync&lt;/li&gt;
&lt;li&gt;large group chats&lt;/li&gt;
&lt;li&gt;global latency&lt;/li&gt;
&lt;li&gt;partial failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where chat stops being a UI feature and becomes a distributed systems problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;Before designing the system, we need clear requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Functional requirements
&lt;/h3&gt;

&lt;p&gt;Our chat system should support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-to-one chats&lt;/li&gt;
&lt;li&gt;Group chats&lt;/li&gt;
&lt;li&gt;Real-time message delivery&lt;/li&gt;
&lt;li&gt;Read receipts&lt;/li&gt;
&lt;li&gt;Typing indicators&lt;/li&gt;
&lt;li&gt;Push notifications&lt;/li&gt;
&lt;li&gt;Media attachments&lt;/li&gt;
&lt;li&gt;Message history&lt;/li&gt;
&lt;li&gt;Multi-device synchronization&lt;/li&gt;
&lt;li&gt;Online and offline presence&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Non-functional requirements
&lt;/h3&gt;

&lt;p&gt;The system must also provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low latency&lt;/li&gt;
&lt;li&gt;High availability&lt;/li&gt;
&lt;li&gt;Horizontal scalability&lt;/li&gt;
&lt;li&gt;Fault tolerance&lt;/li&gt;
&lt;li&gt;Reliable delivery&lt;/li&gt;
&lt;li&gt;Event ordering&lt;/li&gt;
&lt;li&gt;Efficient storage&lt;/li&gt;
&lt;li&gt;Global distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At small scale, these are manageable. At millions of users, every one of them becomes a distributed systems concern.&lt;/p&gt;




&lt;h2&gt;
  
  
  High-level architecture
&lt;/h2&gt;

&lt;p&gt;A modern chat platform is usually event-driven and distributed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Clients
   │
   ▼
Load Balancer
   │
   ▼
WebSocket Gateway Cluster
   │
   ├── Authentication Service
   ├── Presence Service
   ├── Chat Service
   ├── Notification Service
   └── Media Service
          │
          ▼
    Kafka / Redis Streams
          │
          ▼
     Storage Layer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each component has a separate responsibility. That separation is what makes the system scalable and easier to evolve.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why polling fails
&lt;/h2&gt;

&lt;p&gt;Many beginner chat apps start with polling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /messages every 2 seconds
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works for a demo. It fails badly at scale.&lt;/p&gt;

&lt;p&gt;Polling creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;massive request overhead&lt;/li&gt;
&lt;li&gt;unnecessary database reads&lt;/li&gt;
&lt;li&gt;increased latency&lt;/li&gt;
&lt;li&gt;battery drain on mobile devices&lt;/li&gt;
&lt;li&gt;poor real-time responsiveness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now imagine 2 million connected users polling every 2 seconds. That becomes 1 million requests per second even when nobody is sending anything. Real chat systems avoid this by using persistent connections.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why WebSockets became the default
&lt;/h2&gt;

&lt;p&gt;WebSockets keep a bidirectional connection open between the client and the server.&lt;/p&gt;

&lt;p&gt;That gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;near real-time communication&lt;/li&gt;
&lt;li&gt;lower overhead&lt;/li&gt;
&lt;li&gt;reduced latency&lt;/li&gt;
&lt;li&gt;efficient server push&lt;/li&gt;
&lt;li&gt;better mobile performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The client connects once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client ───── persistent connection ───── Server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, both sides can exchange events instantly. That is the foundation of modern messaging systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hidden WebSocket cost
&lt;/h2&gt;

&lt;p&gt;WebSockets solve one problem and introduce another.&lt;/p&gt;

&lt;p&gt;A persistent connection is not free. Every connected user consumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a TCP socket&lt;/li&gt;
&lt;li&gt;memory buffers&lt;/li&gt;
&lt;li&gt;heartbeat state&lt;/li&gt;
&lt;li&gt;authentication context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At 5 million concurrent users, this is no longer a normal API problem. It becomes a connection management problem.&lt;/p&gt;

&lt;p&gt;That is why serious chat systems build specialized gateway infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  WebSocket gateway layer
&lt;/h2&gt;

&lt;p&gt;The gateway layer is responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maintaining persistent connections&lt;/li&gt;
&lt;li&gt;authenticating users&lt;/li&gt;
&lt;li&gt;routing events&lt;/li&gt;
&lt;li&gt;managing heartbeats&lt;/li&gt;
&lt;li&gt;detecting disconnects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apart from the open connections themselves, gateways should stay as stateless as possible. Gateways that hold little state are much easier to scale, replace, and recover after failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authentication flow
&lt;/h3&gt;

&lt;p&gt;A common flow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User logs in via HTTP.&lt;/li&gt;
&lt;li&gt;Backend issues a JWT.&lt;/li&gt;
&lt;li&gt;Client opens a WebSocket connection.&lt;/li&gt;
&lt;li&gt;Token is validated during the handshake.&lt;/li&gt;
&lt;li&gt;Connection is associated with a user session.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://chat.example.com?token=JWT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After authentication, the gateway knows which user owns the connection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sending a message
&lt;/h2&gt;

&lt;p&gt;Now let's look at the real delivery pipeline.&lt;/p&gt;

&lt;p&gt;Suppose User A sends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hello
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A simplified flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Client sends a message event.&lt;/li&gt;
&lt;li&gt;Gateway validates authentication.&lt;/li&gt;
&lt;li&gt;Chat service validates permissions.&lt;/li&gt;
&lt;li&gt;Message is persisted in durable storage.&lt;/li&gt;
&lt;li&gt;Event is published into Kafka.&lt;/li&gt;
&lt;li&gt;Recipient gateway receives the event.&lt;/li&gt;
&lt;li&gt;Message is pushed to the recipient.&lt;/li&gt;
&lt;li&gt;Delivery ACK is generated.&lt;/li&gt;
&lt;li&gt;Read receipt is generated later.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This looks simple on paper. In reality, every step can fail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why persistence comes first
&lt;/h2&gt;

&lt;p&gt;A lot of beginners try this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deliver → save later
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is dangerous.&lt;/p&gt;

&lt;p&gt;If the server crashes before persistence, the recipient has already seen the message, but the database never stored it. Now the system is inconsistent.&lt;/p&gt;

&lt;p&gt;Production systems usually do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;persist → publish → deliver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Durability comes first.&lt;/p&gt;
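
&lt;p&gt;A minimal sketch of that ordering, with arrays standing in for the database, the event stream, and the recipient's connection. The &lt;code&gt;crashAfter&lt;/code&gt; parameter and the &lt;code&gt;undelivered&lt;/code&gt; helper are assumptions added to simulate a failure and the recovery that follows.&lt;/p&gt;

```typescript
type ChatMessage = { id: string; conversationId: string; text: string };

const durableStore: ChatMessage[] = []; // stand-in for the database
const publishedEvents: ChatMessage[] = []; // stand-in for the event stream
const deliveredToRecipient: ChatMessage[] = [];

// Simulated crash point: "none", or the stage after which the server dies.
type CrashAfter = "none" | "persist" | "publish";

function handleIncoming(message: ChatMessage, crashAfter: CrashAfter): void {
  durableStore.push(message); // 1. persist
  if (crashAfter === "persist") return;
  publishedEvents.push(message); // 2. publish
  if (crashAfter === "publish") return;
  deliveredToRecipient.push(message); // 3. deliver
}

// Recovery: anything stored but never delivered can be found and re-sent.
function undelivered(): ChatMessage[] {
  const deliveredIds = new Set(deliveredToRecipient.map((m) => m.id));
  return durableStore.filter((m) => !deliveredIds.has(m.id));
}

handleIncoming({ id: "m1", conversationId: "42", text: "Hello" }, "none");
handleIncoming({ id: "m2", conversationId: "42", text: "Are you there?" }, "persist");

const toRedeliver = undelivered(); // contains only m2
```

&lt;p&gt;Because storage is written first, a crash can delay a message but cannot produce one that a user saw and the system has no record of.&lt;/p&gt;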




&lt;h2&gt;
  
  
  Message IDs and ordering
&lt;/h2&gt;

&lt;p&gt;Distributed systems do not guarantee ordering automatically.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message A&lt;/li&gt;
&lt;li&gt;Message B&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recipient receives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message B&lt;/li&gt;
&lt;li&gt;Message A&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why does this happen? Because messages may travel through different gateway servers, queues, and network paths.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common ordering strategies
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Timestamp ordering
&lt;/h4&gt;

&lt;p&gt;Simple, but unreliable. Clocks on different clients and servers drift, so timestamps from two machines cannot be compared safely.&lt;/p&gt;

&lt;h4&gt;
  
  
  Incremental sequence IDs
&lt;/h4&gt;

&lt;p&gt;More reliable: the server assigns each message the next number in a per-conversation counter.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conversation_id: 42

messages:
1
2
3
4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This guarantees ordering inside a conversation. Most real systems only guarantee local ordering per chat. Global ordering across the entire platform is usually impossible at scale.&lt;/p&gt;
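
&lt;p&gt;Here is one possible sketch of both halves: a server-side counter per conversation and a client-side buffer that holds back early arrivals. &lt;code&gt;SequenceAllocator&lt;/code&gt; and &lt;code&gt;ReorderBuffer&lt;/code&gt; are hypothetical names, and the counter lives in memory here, whereas a real system would keep it wherever the conversation's data is stored.&lt;/p&gt;

```typescript
// Server side: one monotonically increasing counter per conversation.
class SequenceAllocator {
  private lastSequence = new Map();

  next(conversationId: string): number {
    const sequence: number = (this.lastSequence.get(conversationId) ?? 0) + 1;
    this.lastSequence.set(conversationId, sequence);
    return sequence;
  }
}

type SequencedMessage = { sequence: number; text: string };

// Client side: hold back messages that arrive early and release them in order.
class ReorderBuffer {
  private nextExpected = 1;
  private heldBack = new Map();

  // Returns the messages that can be shown now, in sequence order.
  accept(message: SequencedMessage): SequencedMessage[] {
    this.heldBack.set(message.sequence, message);
    const ready: SequencedMessage[] = [];
    while (this.heldBack.has(this.nextExpected)) {
      ready.push(this.heldBack.get(this.nextExpected));
      this.heldBack.delete(this.nextExpected);
      this.nextExpected += 1;
    }
    return ready;
  }
}

const allocator = new SequenceAllocator();
const messageA = { sequence: allocator.next("42"), text: "A" }; // sequence 1
const messageB = { sequence: allocator.next("42"), text: "B" }; // sequence 2

const reorderBuffer = new ReorderBuffer();
const shownFirst = reorderBuffer.accept(messageB); // empty: A has not arrived yet
const shownSecond = reorderBuffer.accept(messageA); // A, then B
```

&lt;p&gt;A gap in the sequence also tells the client that something is missing, which gives it a natural trigger to fetch history from the server.&lt;/p&gt;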




&lt;h2&gt;
  
  
  Database design
&lt;/h2&gt;

&lt;p&gt;Messaging systems are write-heavy. A popular group chat can generate thousands of writes per second.&lt;/p&gt;

&lt;p&gt;Typical tables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;users&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;conversations&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;conversation_members&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;messages&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;message_status&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;attachments&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real challenge is partitioning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why a single database eventually fails
&lt;/h3&gt;

&lt;p&gt;A single PostgreSQL instance works at the beginning. Over time, problems show up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;write bottlenecks&lt;/li&gt;
&lt;li&gt;storage growth&lt;/li&gt;
&lt;li&gt;replication lag&lt;/li&gt;
&lt;li&gt;index size explosion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At scale, systems introduce sharding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sharding strategy
&lt;/h3&gt;

&lt;p&gt;A common strategy is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;shard by conversation_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;messages for the same chat stay colocated&lt;/li&gt;
&lt;li&gt;ordering becomes easier&lt;/li&gt;
&lt;li&gt;queries stay efficient&lt;/li&gt;
&lt;/ul&gt;

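&lt;p&gt;A poorly chosen shard key either creates hotspots or scatters related data. For example, sharding by &lt;code&gt;user_id&lt;/code&gt; spreads a large group chat's messages across many shards, so reading one conversation means querying several shards and fan-out becomes expensive.&lt;/p&gt;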
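&lt;p&gt;As a rough illustration of routing by &lt;code&gt;conversation_id&lt;/code&gt;, the sketch below hashes the ID and takes it modulo a shard count. The &lt;code&gt;NUM_SHARDS&lt;/code&gt; value and the &lt;code&gt;shard_for&lt;/code&gt; helper are assumptions made for the example, not part of any particular database.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

NUM_SHARDS = 16  # hypothetical cluster size


def shard_for(conversation_id):
    """Map a conversation to a shard using a stable hash of its ID."""
    digest = hashlib.sha256(str(conversation_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Every message in conversation 42 is routed to the same shard.
messages = [{"conversation_id": 42, "seq": seq} for seq in (1, 2, 3)]
shards_used = {shard_for(message["conversation_id"]) for message in messages}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A stable hash is used instead of Python's built-in &lt;code&gt;hash()&lt;/code&gt;, which is randomized per process for strings. Plain modulo routing moves most keys when the shard count changes, which is why consistent hashing or a lookup table is a common alternative.&lt;/p&gt;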




&lt;h2&gt;
  
  
  Kafka and event streaming
&lt;/h2&gt;

&lt;p&gt;Modern messaging systems are heavily event-driven.&lt;/p&gt;

&lt;p&gt;Kafka is useful because it provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;durable event logs&lt;/li&gt;
&lt;li&gt;replayability&lt;/li&gt;
&lt;li&gt;partitioned scalability&lt;/li&gt;
&lt;li&gt;consumer groups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of services calling each other directly, the chat service publishes events to Kafka and each consumer reads them at its own pace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chat Service → Kafka → Consumers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consumers may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;delivery service&lt;/li&gt;
&lt;li&gt;notification service&lt;/li&gt;
&lt;li&gt;analytics&lt;/li&gt;
&lt;li&gt;moderation&lt;/li&gt;
&lt;li&gt;push notifications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This decouples the system and makes failures easier to isolate.&lt;/p&gt;
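&lt;p&gt;The properties listed above (a durable log, replay, and independent consumer groups) can be illustrated with a toy in-memory log. This is not Kafka's client API; the &lt;code&gt;EventLog&lt;/code&gt; class and its methods are hypothetical stand-ins meant only to show the idea of per-group offsets.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class EventLog:
    """Append-only log with an independent read offset per consumer group."""

    def __init__(self):
        self._events = []
        self._offsets = {}  # maps group name to the index of its next event

    def publish(self, event):
        self._events.append(event)

    def poll(self, group):
        start = self._offsets.get(group, 0)
        batch = self._events[start:]
        self._offsets[group] = len(self._events)
        return batch

    def rewind(self, group, offset=0):
        # Replay: move the group's offset back and read the history again.
        self._offsets[group] = offset


log = EventLog()
log.publish({"type": "message_sent", "conversation_id": 42})
log.publish({"type": "message_sent", "conversation_id": 7})

delivery_batch = log.poll("delivery")    # both events
analytics_batch = log.poll("analytics")  # the same two events, independently
nothing_new = log.poll("delivery")       # empty until something is published

log.rewind("analytics")
replayed = log.poll("analytics")         # both events again
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A slow or crashed analytics consumer only delays its own offset. The delivery consumer keeps reading, which is the isolation the decoupling buys.&lt;/p&gt;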




&lt;h2&gt;
  
  
  Presence system
&lt;/h2&gt;

&lt;p&gt;Presence is deceptively expensive.&lt;/p&gt;

&lt;p&gt;Tracking online, offline, typing, and &lt;code&gt;last_seen&lt;/code&gt; for millions of users creates a huge amount of event traffic. That is why most systems isolate presence into a dedicated service.&lt;/p&gt;

&lt;h3&gt;
  
  
  Presence implementation
&lt;/h3&gt;

&lt;p&gt;A common architecture is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Gateway → Redis → Presence Service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gateways periodically send heartbeats.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PING every 30 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If heartbeats stop, the user is marked offline.&lt;/p&gt;

&lt;p&gt;Redis works well here because presence data is ephemeral. Not everything belongs in a relational database.&lt;/p&gt;
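&lt;p&gt;A sketch of heartbeat-based presence, assuming an in-memory stand-in for expiring Redis keys. It groups heartbeats into fixed time buckets so that stale entries disappear without per-user timers. The &lt;code&gt;PresenceTracker&lt;/code&gt; class is hypothetical, and timestamps are passed in explicitly to keep the example deterministic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import defaultdict

HEARTBEAT_SECONDS = 30  # matches the 30-second PING interval above


class PresenceTracker:
    """Tracks who is online using fixed time buckets instead of per-user timers."""

    def __init__(self):
        self._buckets = defaultdict(set)  # maps bucket number to user IDs seen in it

    @staticmethod
    def _bucket(now):
        return int(now // HEARTBEAT_SECONDS)

    def heartbeat(self, user_id, now):
        self._buckets[self._bucket(now)].add(user_id)

    def is_online(self, user_id, now):
        # Online means a heartbeat landed in the current or the previous bucket.
        current = self._bucket(now)
        recent = (current, current - 1)
        return any(user_id in self._buckets.get(b, ()) for b in recent)

    def expire(self, now):
        # Drop buckets that can no longer affect is_online().
        current = self._bucket(now)
        for bucket in list(self._buckets):
            if bucket not in (current, current - 1):
                del self._buckets[bucket]


presence = PresenceTracker()
presence.heartbeat("alice", now=0)
online_soon_after = presence.is_online("alice", now=40)  # previous bucket still counts
online_much_later = presence.is_online("alice", now=70)  # heartbeats stopped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this scheme a user is reported offline somewhere between 30 and 60 seconds after the last heartbeat. That imprecision is usually acceptable for presence and is the price of not tracking an exact deadline per user.&lt;/p&gt;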

&lt;h2&gt;
  
  
  Typing indicators
&lt;/h2&gt;

&lt;p&gt;Typing indicators look trivial. They are not.&lt;/p&gt;

&lt;p&gt;Problems include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;high event frequency&lt;/li&gt;
&lt;li&gt;noisy updates&lt;/li&gt;
&lt;li&gt;unnecessary fan-out&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most systems heavily throttle typing events.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User typing → emit once every 3 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without throttling, typing indicators can overload the infrastructure faster than messages.&lt;/p&gt;
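&lt;p&gt;The "emit once every 3 seconds" rule can be sketched as a fixed-window throttle. The &lt;code&gt;TypingThrottle&lt;/code&gt; class below is a hypothetical helper, and timestamps are passed in explicitly so the behaviour is easy to follow.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;THROTTLE_SECONDS = 3


class TypingThrottle:
    """Allows at most one typing event per user, per conversation, per window."""

    def __init__(self):
        self._last_window = {}  # maps (user_id, conversation_id) to the last window emitted in

    def should_emit(self, user_id, conversation_id, now):
        window = int(now // THROTTLE_SECONDS)
        key = (user_id, conversation_id)
        if self._last_window.get(key) == window:
            return False  # already emitted in this 3-second window
        self._last_window[key] = window
        return True


throttle = TypingThrottle()
keystroke_times = (0.0, 0.4, 1.5, 2.9, 3.2, 4.0)
decisions = [throttle.should_emit("alice", 42, now) for now in keystroke_times]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Six keystrokes produce two events here, one per window. Fixed windows are simple but can emit twice in quick succession across a window boundary; a sliding interval avoids that at the cost of storing an exact timestamp.&lt;/p&gt;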

&lt;h2&gt;
  
  
  Fan-out in group chats
&lt;/h2&gt;

&lt;p&gt;Suppose a group contains 500,000 users.&lt;/p&gt;

&lt;p&gt;One message may require 500,000 deliveries. This is called fan-out.&lt;/p&gt;

&lt;p&gt;Large fan-out is one of the hardest messaging problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fan-out strategies
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Fan-out on write
&lt;/h4&gt;

&lt;p&gt;The server writes a copy of the message (or a delivery record) into every recipient's inbox at send time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fast reads&lt;/li&gt;
&lt;li&gt;expensive writes&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Fan-out on read
&lt;/h4&gt;

&lt;p&gt;Messages are stored once, and each recipient's view is assembled when they read.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cheaper writes&lt;/li&gt;
&lt;li&gt;more expensive reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different platforms choose different trade-offs.&lt;/p&gt;
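&lt;p&gt;To make the trade-off concrete, here is a small sketch of both strategies over in-memory dictionaries. The class names and the &lt;code&gt;MEMBERS&lt;/code&gt; table are hypothetical; the point is only where the per-recipient work happens.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import defaultdict

MEMBERS = {42: ["alice", "bob", "carol"]}  # conversation_id mapped to member IDs


class FanOutOnWrite:
    """Copy each message into every member's inbox at send time."""

    def __init__(self, members):
        self.members = members
        self.inboxes = defaultdict(list)

    def send(self, conversation_id, message):
        for user_id in self.members[conversation_id]:  # one write per recipient
            self.inboxes[user_id].append(message)

    def read(self, user_id):
        return list(self.inboxes.get(user_id, []))  # a single cheap lookup


class FanOutOnRead:
    """Store each message once and assemble the inbox when a member reads."""

    def __init__(self, members):
        self.members = members
        self.timelines = defaultdict(list)

    def send(self, conversation_id, message):
        self.timelines[conversation_id].append(message)  # a single write

    def read(self, user_id):
        return [
            message
            for conversation_id, users in self.members.items()
            if user_id in users
            for message in self.timelines.get(conversation_id, [])
        ]


on_write, on_read = FanOutOnWrite(MEMBERS), FanOutOnRead(MEMBERS)
for strategy in (on_write, on_read):
    strategy.send(42, "hello")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both return the same inbox for a member. With 500,000 members, the first version performs 500,000 writes per message, while the second performs one write and pushes the cost onto every read.&lt;/p&gt;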




&lt;h2&gt;
  
  
  Offline synchronization
&lt;/h2&gt;

&lt;p&gt;Users disconnect constantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mobile app closed&lt;/li&gt;
&lt;li&gt;network loss&lt;/li&gt;
&lt;li&gt;airplane mode&lt;/li&gt;
&lt;li&gt;battery saver&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system must synchronize missed events efficiently.&lt;/p&gt;

&lt;p&gt;Typical approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fetch all events after last_sequence_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;last_sequence_id = 10451
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Server returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10452+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Incremental synchronization is critical. Full synchronization is too expensive.&lt;/p&gt;
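&lt;p&gt;A minimal sketch of the incremental fetch, assuming the server keeps a conversation's events sorted by sequence ID. The &lt;code&gt;events_after&lt;/code&gt; function and its &lt;code&gt;limit&lt;/code&gt; parameter are hypothetical; a real backend would run the equivalent range query against an indexed column.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from bisect import bisect_right


def events_after(events, last_sequence_id, limit=100):
    """Return up to `limit` events newer than the client's cursor.

    `events` is a list of (sequence_id, payload) tuples sorted by sequence_id.
    """
    sequence_ids = [sequence_id for sequence_id, _ in events]
    start = bisect_right(sequence_ids, last_sequence_id)
    return events[start:start + limit]


server_log = [
    (10450, "hi"),
    (10451, "are you there?"),
    (10452, "ping"),
    (10453, "call me"),
]

# The client reconnects and reports the last sequence ID it has seen.
missed = events_after(server_log, last_sequence_id=10451)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;limit&lt;/code&gt; keeps a long-offline client from pulling its whole backlog in one response; it can page by sending the last sequence ID of each batch as the next cursor.&lt;/p&gt;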




&lt;h2&gt;
  
  
  Push notifications
&lt;/h2&gt;

&lt;p&gt;Offline users still need notifications.&lt;/p&gt;

&lt;p&gt;Pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;message event
   ↓
notification service
   ↓
APNs / FCM
   ↓
mobile device
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Push delivery is best-effort. Notifications may arrive late, duplicated, or out of order, and some never arrive at all. Clients need to tolerate that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-device sync
&lt;/h2&gt;

&lt;p&gt;Modern users expect sync across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;phone&lt;/li&gt;
&lt;li&gt;desktop&lt;/li&gt;
&lt;li&gt;browser&lt;/li&gt;
&lt;li&gt;tablet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each device may keep its own session. The backend tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;user_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;device_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;connection_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;last_sync_state&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Events are usually delivered independently per device.&lt;/p&gt;
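&lt;p&gt;One way to picture per-device delivery is a session record per device, each with its own cursor. In the sketch below the field names come from the list above, but their types, and the reading of &lt;code&gt;last_sync_state&lt;/code&gt; as "last acknowledged sequence ID", are assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass


@dataclass
class DeviceSession:
    user_id: str
    device_id: str
    connection_id: str    # empty string while the device is disconnected
    last_sync_state: int  # assumption: last sequence ID this device acknowledged


def pending_events(session, user_events):
    """Events the device has not acknowledged yet.

    `user_events` is a list in which index i holds the event with sequence ID i + 1.
    """
    return user_events[session.last_sync_state:]


user_events = ["msg-1", "msg-2", "msg-3", "msg-4"]
phone = DeviceSession("u1", "phone", "conn-17", last_sync_state=4)
laptop = DeviceSession("u1", "laptop", "", last_sync_state=1)

phone_backlog = pending_events(phone, user_events)    # already up to date
laptop_backlog = pending_events(laptop, user_events)  # msg-2 through msg-4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the cursor on the device session rather than on the user is what lets the phone be fully caught up while the laptop still has a backlog.&lt;/p&gt;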




&lt;h2&gt;
  
  
  Reliability guarantees
&lt;/h2&gt;

&lt;p&gt;Messaging systems need clear delivery semantics.&lt;/p&gt;

&lt;h3&gt;
  
  
  At-most-once
&lt;/h3&gt;

&lt;p&gt;Fastest. Messages may disappear.&lt;/p&gt;

&lt;h3&gt;
  
  
  At-least-once
&lt;/h3&gt;

&lt;p&gt;Reliable. Duplicates are possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exactly-once
&lt;/h3&gt;

&lt;p&gt;Extremely expensive and difficult in distributed systems.&lt;/p&gt;

&lt;p&gt;Most production systems use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;at-least-once + idempotency
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the practical choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why exactly-once is mostly marketing
&lt;/h2&gt;

&lt;p&gt;Exactly-once delivery across distributed infrastructure is incredibly hard.&lt;/p&gt;

&lt;p&gt;Network failures create ambiguity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did the recipient receive the message?&lt;/li&gt;
&lt;li&gt;Did the ACK get lost?&lt;/li&gt;
&lt;li&gt;Did the retry create a duplicate?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes the sender cannot know for sure.&lt;/p&gt;

&lt;p&gt;That is why many systems rely on retries, deduplication, and idempotent consumers instead of true exactly-once guarantees.&lt;/p&gt;
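&lt;p&gt;The lost-ACK ambiguity can be simulated in a few lines. In this sketch the sender retries until it sees an ACK, and the receiver deduplicates by message ID. All names are hypothetical, and the "network" is just a set of attempt numbers on which the ACK is dropped.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class Receiver:
    """Idempotent receiver: each message ID is applied at most once."""

    def __init__(self):
        self.seen_ids = set()
        self.applied = []

    def handle(self, message_id, payload):
        if message_id not in self.seen_ids:
            self.seen_ids.add(message_id)
            self.applied.append(payload)
        return "ACK"  # always acknowledge, including for duplicates


def send_with_retries(receiver, message_id, payload, ack_lost_on=(), max_attempts=5):
    """Retry until an ACK gets through. Returns the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        ack = receiver.handle(message_id, payload)
        if attempt in ack_lost_on:
            continue  # the receiver processed it, but the sender never saw the ACK
        if ack == "ACK":
            return attempt
    raise TimeoutError("no ACK after %d attempts" % max_attempts)


receiver = Receiver()
attempts = send_with_retries(receiver, "msg-1", "hello", ack_lost_on={1, 2})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The sender needed three attempts, yet the payload was applied once. The transport is still at-least-once; the receiver's dedupe set is what makes the outcome look exactly-once.&lt;/p&gt;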




&lt;h2&gt;
  
  
  Common failure scenarios
&lt;/h2&gt;

&lt;p&gt;Real systems fail all the time.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gateway crashes&lt;/li&gt;
&lt;li&gt;Kafka partition unavailable&lt;/li&gt;
&lt;li&gt;Redis outage&lt;/li&gt;
&lt;li&gt;slow consumers&lt;/li&gt;
&lt;li&gt;duplicate events&lt;/li&gt;
&lt;li&gt;partial synchronization&lt;/li&gt;
&lt;li&gt;network splits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reliable systems assume failure is normal.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scaling strategies
&lt;/h2&gt;

&lt;p&gt;As traffic grows, teams usually add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;horizontal scaling for gateway nodes&lt;/li&gt;
&lt;li&gt;partitioned Kafka topics&lt;/li&gt;
&lt;li&gt;regional infrastructure&lt;/li&gt;
&lt;li&gt;CDN for media&lt;/li&gt;
&lt;li&gt;caching to reduce database pressure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest bottleneck is usually not CPU.&lt;/p&gt;

&lt;p&gt;At scale, bottlenecks are often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;network throughput&lt;/li&gt;
&lt;li&gt;memory usage&lt;/li&gt;
&lt;li&gt;hot partitions&lt;/li&gt;
&lt;li&gt;connection limits&lt;/li&gt;
&lt;li&gt;disk I/O&lt;/li&gt;
&lt;li&gt;replication lag&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Chat systems are infrastructure-heavy workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  What makes messaging hard
&lt;/h2&gt;

&lt;p&gt;The frontend UI may look simple. Underneath is a distributed system balancing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;consistency&lt;/li&gt;
&lt;li&gt;availability&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;reliability&lt;/li&gt;
&lt;li&gt;scalability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every architectural decision introduces trade-offs. You can optimize latency, durability, throughput, or cost, but rarely all at once.&lt;/p&gt;

&lt;p&gt;That is why messaging systems remain one of the most interesting areas in system design.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical takeaway
&lt;/h2&gt;

&lt;p&gt;If you are building chat, do not think only in terms of sockets and message lists. Think in terms of delivery guarantees, storage strategy, ordering, recovery, and fan-out.&lt;/p&gt;

&lt;p&gt;A chat product becomes serious very quickly. The moment you need offline sync, presence, multi-device support, and reliable delivery, you are no longer building a UI feature. You are building a distributed messaging platform.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>backend</category>
      <category>websockets</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>CI/CD for Modern Applications: From Manual Deployments to Reliable Delivery Pipelines</title>
      <dc:creator>Damir Karimov</dc:creator>
      <pubDate>Mon, 15 Jun 2026 10:20:05 +0000</pubDate>
      <link>https://dev.to/damir-karimov/cicd-for-modern-applications-from-manual-deployments-to-reliable-delivery-pipelines-52j5</link>
      <guid>https://dev.to/damir-karimov/cicd-for-modern-applications-from-manual-deployments-to-reliable-delivery-pipelines-52j5</guid>
      <description>&lt;p&gt;Last year my team had a production incident because someone manually deployed without running tests. The fix took 47 minutes. That's when I realized: CI/CD isn't about automation — it's about controlling risk in software delivery.&lt;/p&gt;

&lt;p&gt;In this article, you'll learn how to design reliable pipelines for frontend (React/Next.js) and backend systems, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Artifact strategies and reproducible builds&lt;/li&gt;
&lt;li&gt;Testing pyramid (unit, integration, contract, E2E)&lt;/li&gt;
&lt;li&gt;Safe database migrations&lt;/li&gt;
&lt;li&gt;Production observability and DORA metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus: a production checklist you can use immediately.&lt;/p&gt;




&lt;p&gt;Modern application development is constrained less by coding speed and more by delivery reliability. Teams can ship features quickly, but production stability breaks when deployments remain manual or partially automated. CI/CD solves this by turning software delivery into a deterministic, repeatable pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What CI/CD Actually Solves
&lt;/h2&gt;

&lt;p&gt;CI/CD (Continuous Integration and Continuous Delivery/Deployment) is not a toolset — it is a system design approach for software delivery.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core problems it addresses
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Integration conflicts in branches&lt;/td&gt;
&lt;td&gt;Merge hell, delayed releases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unpredictable release cycles&lt;/td&gt;
&lt;td&gt;Business uncertainty, missed SLAs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual deployment errors&lt;/td&gt;
&lt;td&gt;Production incidents, data corruption&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environment drift&lt;/td&gt;
&lt;td&gt;"Works in CI, fails in prod"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slow rollback processes&lt;/td&gt;
&lt;td&gt;Extended MTTR, higher user impact&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Not "faster deploys", but &lt;strong&gt;"reliable, repeatable deploys"&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  CI: Continuous Integration as a Validation Layer
&lt;/h2&gt;

&lt;p&gt;Continuous Integration ensures every change is validated in a controlled pipeline &lt;strong&gt;before merging&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A minimal CI pipeline includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linting (code quality enforcement)&lt;/li&gt;
&lt;li&gt;Unit tests&lt;/li&gt;
&lt;li&gt;Type checking (TypeScript, etc.)&lt;/li&gt;
&lt;li&gt;Build verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example flow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;feature branch → PR → CI pipeline → checks pass → merge to main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key principle:&lt;/strong&gt; Every commit that passes CI must produce a &lt;strong&gt;deployable artifact&lt;/strong&gt;. Without this rule, CI becomes cosmetic.&lt;/p&gt;




&lt;h2&gt;
  
  
  CD: Continuous Delivery vs Continuous Deployment
&lt;/h2&gt;

&lt;p&gt;These terms are often used interchangeably, but they differ in how a release reaches production:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Continuous Delivery&lt;/th&gt;
&lt;th&gt;Continuous Deployment&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Build state&lt;/td&gt;
&lt;td&gt;Always deployable&lt;/td&gt;
&lt;td&gt;Always deployable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment trigger&lt;/td&gt;
&lt;td&gt;Manual (human approval)&lt;/td&gt;
&lt;td&gt;Automatic (after CI success)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Release decision&lt;/td&gt;
&lt;td&gt;Controlled (gates, approvals)&lt;/td&gt;
&lt;td&gt;Fully automated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Regulated/large systems&lt;/td&gt;
&lt;td&gt;Fast-moving products with strong test coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;CD controls to include:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Approval gates&lt;/li&gt;
&lt;li&gt;Feature flags&lt;/li&gt;
&lt;li&gt;RBAC &amp;amp; audit trails&lt;/li&gt;
&lt;li&gt;Environment-specific configuration (not baked into artifacts)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Typical CI/CD Architecture
&lt;/h2&gt;

&lt;p&gt;A production-grade pipeline includes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Code → CI → Build artifact → Store → Deploy → Monitor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Components
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source control&lt;/td&gt;
&lt;td&gt;GitHub, GitLab&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI runner&lt;/td&gt;
&lt;td&gt;GitHub Actions, GitLab CI, Jenkins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build system&lt;/td&gt;
&lt;td&gt;Docker, native builds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Artifact storage&lt;/td&gt;
&lt;td&gt;Docker registry, S3, artifact repo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment system&lt;/td&gt;
&lt;td&gt;Kubernetes, Vercel, ECS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;Logs, metrics (Prometheus), traces (Jaeger)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Pipeline Stages in Modern Apps
&lt;/h2&gt;

&lt;p&gt;A realistic pipeline is &lt;strong&gt;layered validation&lt;/strong&gt;, not linear "test → build → deploy".&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Pre-merge checks (on PR)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;ESLint / Prettier&lt;/li&gt;
&lt;li&gt;TypeScript compilation&lt;/li&gt;
&lt;li&gt;Unit tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target:&lt;/strong&gt; &amp;lt; 10 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Post-merge CI
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Full test suite&lt;/li&gt;
&lt;li&gt;Integration tests&lt;/li&gt;
&lt;li&gt;Security scanning (SAST)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target:&lt;/strong&gt; &amp;lt; 30 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Build stage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Docker image build&lt;/li&gt;
&lt;li&gt;Dependency locking&lt;/li&gt;
&lt;li&gt;Artifact versioning (semver + git SHA)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Deployment stage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Staging rollout&lt;/li&gt;
&lt;li&gt;Smoke tests&lt;/li&gt;
&lt;li&gt;Production deployment (manual or automatic)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Post-deployment validation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Health checks&lt;/li&gt;
&lt;li&gt;Observability signals (logs, metrics, traces)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Pipeline Internals: Artifacts, Provenance, Reproducibility
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Artifact Strategy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Store &lt;strong&gt;immutable artifacts&lt;/strong&gt;: Docker images, frontend build bundles, npm tarballs&lt;/li&gt;
&lt;li&gt;Tag by &lt;strong&gt;semver + git SHA&lt;/strong&gt;, store digest&lt;/li&gt;
&lt;li&gt;Deploy by &lt;strong&gt;digest&lt;/strong&gt; (not tag) to ensure consistency&lt;/li&gt;
&lt;/ul&gt;
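&lt;p&gt;A sketch of why deploying by digest is safer than deploying by tag. It hashes raw bytes for illustration; a container registry computes its digest over the image manifest instead. The function names, version, and Git SHA below are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib


def artifact_digest(data):
    """Content digest of an artifact's bytes, in the familiar sha256:hex form."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def artifact_tag(version, git_sha):
    # Human-readable label. A tag can be moved or overwritten, so it is not an identity.
    return "%s-%s" % (version, git_sha[:7])


def verify_before_deploy(data, expected_digest):
    # Refuse to deploy bytes that differ from what CI built and recorded.
    if artifact_digest(data) != expected_digest:
        raise ValueError("artifact does not match the recorded digest")
    return True


bundle = b"contents of the frontend build bundle"
recorded = artifact_digest(bundle)                   # stored by the build stage
tag = artifact_tag("1.4.2", "3f9c2ab7d1e0c4a5b6f7")  # hypothetical version and SHA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The tag tells people what they are looking at; the digest tells the deploy step whether the bytes are the ones CI actually produced.&lt;/p&gt;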

&lt;h3&gt;
  
  
  Reproducible Builds
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use Docker images with pinned base versions&lt;/li&gt;
&lt;li&gt;Lock dependency versions (&lt;code&gt;package-lock.json&lt;/code&gt;, &lt;code&gt;pnpm-lock.yaml&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Deterministic builds (no embedded build timestamps or other nondeterministic inputs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Caching &amp;amp; Parallelization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cache &lt;code&gt;node_modules&lt;/code&gt; / &lt;code&gt;pnpm store&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Parallelize test suites&lt;/li&gt;
&lt;li&gt;Incremental builds (Next.js, TypeScript)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Testing Strategy: The Pyramid
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Type&lt;/th&gt;
&lt;th&gt;When to Run&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unit tests&lt;/td&gt;
&lt;td&gt;Pre-merge (PR)&lt;/td&gt;
&lt;td&gt;Fast feedback on logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration&lt;/td&gt;
&lt;td&gt;Post-merge&lt;/td&gt;
&lt;td&gt;API, DB, service interactions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contract tests&lt;/td&gt;
&lt;td&gt;Post-merge&lt;/td&gt;
&lt;td&gt;Cross-service compatibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E2E tests&lt;/td&gt;
&lt;td&gt;Staging/Nightly&lt;/td&gt;
&lt;td&gt;Full user flow validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Nightly&lt;/td&gt;
&lt;td&gt;Latency, throughput baselines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Flaky tests:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Measure flakiness rate (target &amp;lt; 1–2%)&lt;/li&gt;
&lt;li&gt;Quarantine flaky tests (separate suite)&lt;/li&gt;
&lt;li&gt;Fix rather than retry: repair flaky tests on critical paths first, and quarantine low-priority ones until they are fixed&lt;/li&gt;
&lt;/ul&gt;
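&lt;p&gt;One possible way to measure flakiness, shown as a sketch: treat a test as flaky when the same commit produced both a pass and a failure, and report the share of tests that meet that definition. Both the definition and the function names are assumptions; teams measure this in different ways.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import defaultdict


def flaky_tests(runs):
    """A test is flaky if the same commit produced both a pass and a failure.

    `runs` is an iterable of (test_name, commit_sha, passed) tuples.
    """
    outcomes = defaultdict(set)
    for test_name, commit_sha, passed in runs:
        outcomes[(test_name, commit_sha)].add(passed)
    return {test for (test, _), seen in outcomes.items() if seen == {True, False}}


def flakiness_rate(runs):
    runs = list(runs)
    all_tests = {test_name for test_name, _, _ in runs}
    if not all_tests:
        return 0.0
    return len(flaky_tests(runs)) / len(all_tests)


history = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),     # same commit, different result: flaky
    ("test_checkout", "abc123", True),
    ("test_checkout", "def456", False),  # different commit: a regression, not flaky
    ("test_search", "abc123", True),
    ("test_profile", "abc123", True),
]
rate = flakiness_rate(history)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Comparing results only within the same commit separates genuine regressions from nondeterministic tests, which is the distinction the quarantine decision depends on.&lt;/p&gt;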




&lt;h2&gt;
  
  
  CI/CD in Frontend-heavy Systems (React / Next.js)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Typical Pipeline
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Install dependencies (cached)&lt;/li&gt;
&lt;li&gt;Lint + TypeScript check&lt;/li&gt;
&lt;li&gt;Unit tests (Jest / Vitest)&lt;/li&gt;
&lt;li&gt;Build (Next.js build)&lt;/li&gt;
&lt;li&gt;Static analysis (bundle size, unused deps)&lt;/li&gt;
&lt;li&gt;Deploy (Vercel / Docker / CDN)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Key Optimizations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cache &lt;code&gt;node_modules&lt;/code&gt; / &lt;code&gt;pnpm store&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Incremental builds&lt;/li&gt;
&lt;li&gt;Separate preview deployment for each PR&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  CI/CD for Scalable Backend Systems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Backend Pipeline Requirements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Database migration strategy&lt;/li&gt;
&lt;li&gt;Backward compatibility checks&lt;/li&gt;
&lt;li&gt;Blue-green or canary deployments&lt;/li&gt;
&lt;li&gt;Queue compatibility (Kafka / RabbitMQ)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deployment Models
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rolling&lt;/td&gt;
&lt;td&gt;Simple services&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Blue-green&lt;/td&gt;
&lt;td&gt;Critical services, fast rollback&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Canary&lt;/td&gt;
&lt;td&gt;Gradual traffic shift, metrics&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Safe Database Migrations in CI/CD
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Migration Checklist
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Add column &lt;strong&gt;nullable&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Deploy service with &lt;strong&gt;backward-compatible&lt;/strong&gt; code (write the new column on every change; read it with a fallback to the old data)&lt;/li&gt;
&lt;li&gt;Run &lt;strong&gt;backfill&lt;/strong&gt; job&lt;/li&gt;
&lt;li&gt;Make column &lt;strong&gt;non-nullable&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Deploy service using new column &lt;strong&gt;without fallback&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Clean up old code/columns&lt;/li&gt;
&lt;/ol&gt;
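&lt;p&gt;The first steps of the checklist can be tried end to end with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. This is a sketch under stated assumptions: the &lt;code&gt;users&lt;/code&gt; table and &lt;code&gt;display_name&lt;/code&gt; column are hypothetical, and SQLite stands in for the production database.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL)")
db.executemany("INSERT INTO users (username) VALUES (?)", [("ana",), ("bo",)])

# Step 1: add the column as nullable, so existing rows and old code keep working.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


# Step 2: backward-compatible read that falls back while the column is still empty.
def display_name(user_id):
    row = db.execute(
        "SELECT COALESCE(display_name, username) FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]


name_before_backfill = display_name(1)

# Step 3: backfill rows that have no value yet (a real job would work in batches).
db.execute("UPDATE users SET display_name = username WHERE display_name IS NULL")

# Gate for step 4: only enforce NOT NULL once no NULLs remain.
remaining_nulls = db.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]
safe_to_enforce_not_null = remaining_nulls == 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The constraint itself is left out because SQLite cannot add &lt;code&gt;NOT NULL&lt;/code&gt; to an existing column in place; in PostgreSQL that step would be an &lt;code&gt;ALTER TABLE ... ALTER COLUMN ... SET NOT NULL&lt;/code&gt;, run only after the gate above passes.&lt;/p&gt;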

&lt;p&gt;&lt;strong&gt;Orchestration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Migration step in pipeline&lt;/li&gt;
&lt;li&gt;Feature flag to enable new behavior&lt;/li&gt;
&lt;li&gt;Health gate: error rate &amp;lt; 0.5%, latency P95 stable&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Security in CI/CD
&lt;/h2&gt;

&lt;p&gt;Modern pipelines must include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dependency scanning (&lt;code&gt;npm audit&lt;/code&gt;, Snyk)&lt;/li&gt;
&lt;li&gt;Secret detection&lt;/li&gt;
&lt;li&gt;Container vulnerability scanning&lt;/li&gt;
&lt;li&gt;Signed artifacts (supply chain security)&lt;/li&gt;
&lt;li&gt;SBOM (Software Bill of Materials)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CI/CD without security gates becomes an attack surface.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability as Final Stage of CI/CD
&lt;/h2&gt;

&lt;p&gt;Delivery is incomplete without a feedback loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs (structured logging)&lt;/li&gt;
&lt;li&gt;Metrics (latency, error rate)&lt;/li&gt;
&lt;li&gt;Traces (request flow)&lt;/li&gt;
&lt;li&gt;Alerting (SLA violations)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Pipeline &amp;amp; Production Metrics (DORA)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lead time for changes&lt;/td&gt;
&lt;td&gt;&amp;lt; 1 day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment frequency&lt;/td&gt;
&lt;td&gt;Daily or on-demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change failure rate&lt;/td&gt;
&lt;td&gt;&amp;lt; 5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MTTR (Mean Time to Restore)&lt;/td&gt;
&lt;td&gt;&amp;lt; 1 hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flaky-test rate (pipeline health; not one of the four DORA metrics)&lt;/td&gt;
&lt;td&gt;&amp;lt; 1–2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
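&lt;p&gt;These metrics are simple to compute once deployments are recorded as events. The sketch below uses a hypothetical list of deployment records with made-up timestamps; the record shape is an assumption, not a standard format.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timedelta
from statistics import mean

# Hypothetical deployment records: commit time, deploy time, and outcome.
deployments = [
    {"committed": datetime(2026, 6, 1, 9), "deployed": datetime(2026, 6, 1, 15), "failed": False},
    {"committed": datetime(2026, 6, 2, 10), "deployed": datetime(2026, 6, 2, 12), "failed": True,
     "restored": datetime(2026, 6, 2, 12, 30)},
    {"committed": datetime(2026, 6, 3, 8), "deployed": datetime(2026, 6, 3, 12), "failed": False},
    {"committed": datetime(2026, 6, 3, 13), "deployed": datetime(2026, 6, 3, 17), "failed": False},
]

# Lead time for changes: commit to running in production.
lead_time_hours = mean(
    (d["deployed"] - d["committed"]) / timedelta(hours=1) for d in deployments
)

# Change failure rate: share of deployments that caused a failure.
failures = [d for d in deployments if d["failed"]]
change_failure_rate = len(failures) / len(deployments)

# Time to restore: failed deploy to recovery, averaged over failures.
mttr_minutes = mean(
    (d["restored"] - d["deployed"]) / timedelta(minutes=1) for d in failures
)

# Deployment frequency: deployments per day on which anything was deployed.
active_days = {d["deployed"].date() for d in deployments}
deploys_per_active_day = len(deployments) / len(active_days)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Averages hide outliers, so a median or percentile may describe lead time and restore time better on real data.&lt;/p&gt;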

&lt;p&gt;&lt;strong&gt;CI/CD ends where production behavior becomes visible.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Failure Points in CI/CD Systems
&lt;/h2&gt;

&lt;p&gt;Most failures are due to &lt;strong&gt;design flaws&lt;/strong&gt;, not tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Point&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No clear artifact strategy&lt;/td&gt;
&lt;td&gt;Inconsistent production builds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing rollback strategy&lt;/td&gt;
&lt;td&gt;Increased risk, extended MTTR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overloaded pipelines&lt;/td&gt;
&lt;td&gt;Slow feedback, reduced developer productivity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lack of environment parity&lt;/td&gt;
&lt;td&gt;"Works in CI, fails in prod"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rollback should be&lt;/strong&gt; a version switch that routes traffic back to the previous artifact digest, &lt;strong&gt;not&lt;/strong&gt; a manual hotfix process.&lt;/p&gt;




&lt;h2&gt;
  
  
  Production Checklist
&lt;/h2&gt;

&lt;p&gt;Before promoting to production:&lt;/p&gt;

&lt;p&gt;✅ Artifact immutable and signed&lt;br&gt;&lt;br&gt;
✅ DB migration safe (nullable → backfill → non-nullable)&lt;br&gt;&lt;br&gt;
✅ Feature flag ready for gradual enable&lt;br&gt;&lt;br&gt;
✅ Health gates configured (error rate, latency)&lt;br&gt;&lt;br&gt;
✅ Monitoring dashboards updated (deploy events visible)&lt;br&gt;&lt;br&gt;
✅ Rollback plan tested (traffic switch &amp;lt; 5 min)&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;CI/CD is not automation of deployment. It is a &lt;strong&gt;system for controlling risk in software delivery&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A well-designed pipeline ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictable releases&lt;/li&gt;
&lt;li&gt;Consistent environments&lt;/li&gt;
&lt;li&gt;Fast feedback loops&lt;/li&gt;
&lt;li&gt;Safe iteration speed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Teams that implement CI/CD correctly shift from &lt;strong&gt;"deploying code"&lt;/strong&gt; to &lt;strong&gt;"managing delivery systems"&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>webdev</category>
      <category>kubernetes</category>
      <category>node</category>
    </item>
    <item>
      <title>Real-Time Notification Systems Are Harder Than Most Teams Expect</title>
      <dc:creator>Damir Karimov</dc:creator>
      <pubDate>Tue, 09 Jun 2026 10:26:13 +0000</pubDate>
      <link>https://dev.to/damir-karimov/real-time-notification-systems-are-harder-than-most-teams-expect-od1</link>
      <guid>https://dev.to/damir-karimov/real-time-notification-systems-are-harder-than-most-teams-expect-od1</guid>
      <description>&lt;p&gt;If you’ve ever thought, “It’s just a WebSocket event,” this article is for you.&lt;/p&gt;

&lt;p&gt;Notification systems look simple on the surface, but in production they fail in annoying, expensive, and user-visible ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;duplicate notifications&lt;/li&gt;
&lt;li&gt;missing events&lt;/li&gt;
&lt;li&gt;race conditions&lt;/li&gt;
&lt;li&gt;delayed delivery&lt;/li&gt;
&lt;li&gt;mobile disconnects&lt;/li&gt;
&lt;li&gt;retry storms&lt;/li&gt;
&lt;li&gt;ordering bugs&lt;/li&gt;
&lt;li&gt;state drift across regions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tricky part is not sending a message.&lt;/p&gt;

&lt;p&gt;The tricky part is making sure the right user gets the right notification, in the right order, with enough reliability that the system can survive crashes, retries, and mobile networks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem with “just send a WebSocket event”
&lt;/h2&gt;

&lt;p&gt;A basic notification flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Backend → WebSocket Server → Client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That works in local dev. It even works for a while in production.&lt;/p&gt;

&lt;p&gt;Then real traffic arrives, and the system suddenly has to handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reconnects&lt;/li&gt;
&lt;li&gt;offline users&lt;/li&gt;
&lt;li&gt;multiple devices&lt;/li&gt;
&lt;li&gt;persistence&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;fan-out&lt;/li&gt;
&lt;li&gt;backpressure&lt;/li&gt;
&lt;li&gt;push fallbacks&lt;/li&gt;
&lt;li&gt;deduplication&lt;/li&gt;
&lt;li&gt;ordering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, your “WebSocket feature” has become a distributed messaging system.&lt;/p&gt;

&lt;p&gt;And that is a very different problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Delivery semantics matter first
&lt;/h2&gt;

&lt;p&gt;Before you design the system, decide what guarantees you need.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Semantics&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;At-most-once&lt;/td&gt;
&lt;td&gt;Messages may be lost, but won’t be duplicated&lt;/td&gt;
&lt;td&gt;Low-priority updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;At-least-once&lt;/td&gt;
&lt;td&gt;Messages won’t be lost, but may be duplicated&lt;/td&gt;
&lt;td&gt;Payments, security alerts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Effectively-once&lt;/td&gt;
&lt;td&gt;Duplicates are removed with dedupe logic&lt;/td&gt;
&lt;td&gt;Critical product events&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most teams make a mistake here.&lt;/p&gt;

&lt;p&gt;They start building transport first, then discover later that they actually needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;idempotency keys&lt;/li&gt;
&lt;li&gt;durable cursors&lt;/li&gt;
&lt;li&gt;sequence numbers&lt;/li&gt;
&lt;li&gt;replay support&lt;/li&gt;
&lt;li&gt;acknowledgements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why notification systems become expensive: the real problem is not delivery, it is &lt;strong&gt;delivery semantics&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why fan-out breaks systems
&lt;/h2&gt;

&lt;p&gt;One event is easy.&lt;/p&gt;

&lt;p&gt;One event to 10,000 users is not.&lt;/p&gt;

&lt;p&gt;A single action can trigger:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;feed updates&lt;/li&gt;
&lt;li&gt;badge counter updates&lt;/li&gt;
&lt;li&gt;push notifications&lt;/li&gt;
&lt;li&gt;email digests&lt;/li&gt;
&lt;li&gt;analytics events&lt;/li&gt;
&lt;li&gt;moderation triggers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queue amplification&lt;/li&gt;
&lt;li&gt;retry cascades&lt;/li&gt;
&lt;li&gt;hot partitions&lt;/li&gt;
&lt;li&gt;uneven load&lt;/li&gt;
&lt;li&gt;latency spikes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the system stops being “send a notification” and becomes “shape traffic safely under failure.”&lt;/p&gt;




&lt;h2&gt;
  
  
  Why duplicates happen
&lt;/h2&gt;

&lt;p&gt;Duplicates usually do not come from a single bug. They emerge from the interaction of retries, crashes, and missing idempotency.&lt;/p&gt;

&lt;p&gt;A common chain looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Message is written to Kafka
2. Consumer processes it
3. Consumer crashes before committing offset
4. Partition is reassigned
5. Another consumer reads the same message
6. User gets the notification twice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is not random.&lt;/p&gt;

&lt;p&gt;That is at-least-once delivery with missing deduplication.&lt;/p&gt;
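&lt;p&gt;The chain can be reproduced with a toy model. This is a simulation, not Kafka client code: the log is a plain list, &lt;code&gt;consume&lt;/code&gt; and &lt;code&gt;crash_after&lt;/code&gt; are hypothetical names, and the "commit" is just an integer offset.&lt;/p&gt;

```python
def consume(log, committed_offset, crash_after=None):
    """Process messages from committed_offset; return (delivered, new_committed_offset).

    crash_after simulates a crash right after the side effect for that offset,
    before the offset commit happens.
    """
    delivered = []
    for offset in range(committed_offset, len(log)):
        delivered.append(log[offset])            # side effect: notification is sent
        if crash_after == offset:
            return delivered, committed_offset   # crash: the commit never happens
        committed_offset = offset + 1            # commit only after processing
    return delivered, committed_offset


log = ["payment_received:42"]

# Consumer A sends the notification, then crashes before committing.
sent_by_a, committed = consume(log, committed_offset=0, crash_after=0)
# After reassignment, consumer B resumes from the last committed offset.
sent_by_b, committed = consume(log, committed_offset=committed)

print(sent_by_a + sent_by_b)
```

&lt;p&gt;Both consumers behave correctly in isolation, yet the same message is delivered twice, which is why the protection has to sit in front of the side effect.&lt;/p&gt;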

&lt;h3&gt;
  
  
  The fix
&lt;/h3&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;idempotency keys&lt;/li&gt;
&lt;li&gt;dedupe storage&lt;/li&gt;
&lt;li&gt;sequence numbers&lt;/li&gt;
&lt;li&gt;consumer-side protection before side effects&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Ordering is harder than throughput
&lt;/h2&gt;

&lt;p&gt;Most users don’t complain about 200ms of delay.&lt;/p&gt;

&lt;p&gt;They absolutely notice this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Payment refunded” arrives before “Payment received”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That destroys trust immediately.&lt;/p&gt;

&lt;p&gt;Global ordering is usually too expensive. In practice, teams often choose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;per-user ordering&lt;/li&gt;
&lt;li&gt;per-conversation ordering&lt;/li&gt;
&lt;li&gt;approximate ordering&lt;/li&gt;
&lt;li&gt;causal consistency where needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most products, per-user ordering is the best balance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: sequence numbers per user
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NotificationLog&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notification&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequence&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sequence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;notification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idempotency_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequence&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;kafka&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;produce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notifications&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



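&lt;p&gt;This is a simple way to keep ordering stable inside a user shard: keying every record by &lt;code&gt;user_id&lt;/code&gt; routes all of that user’s notifications to the same partition, and Kafka preserves order within a partition. The in-memory counter is only a sketch. In production the sequence has to come from durable storage, otherwise it resets to zero whenever the process restarts.&lt;/p&gt;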
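&lt;p&gt;On the receiving side, the same sequence numbers let a gateway or client detect gaps and reordering when records arrive over more than one path, for example a live push followed by a later sync. A minimal sketch, assuming sequences start at 1 as in the producer above; &lt;code&gt;UserResequencer&lt;/code&gt; is a hypothetical name, and a real version would also need a timeout for gaps that never fill.&lt;/p&gt;

```python
class UserResequencer:
    """Release one user's notifications strictly in sequence order."""

    def __init__(self) -> None:
        self.next_sequence = 1               # assumes the producer starts at 1
        self.pending: dict[int, str] = {}

    def accept(self, sequence: int, notification: str) -> list[str]:
        """Buffer a record and return every notification that is now deliverable."""
        if self.next_sequence > sequence or sequence in self.pending:
            return []                        # already delivered or already buffered
        self.pending[sequence] = notification
        ready = []
        while self.next_sequence in self.pending:
            ready.append(self.pending.pop(self.next_sequence))
            self.next_sequence += 1
        return ready


resequencer = UserResequencer()
first = resequencer.accept(2, "Payment refunded")    # held back: sequence 1 is missing
second = resequencer.accept(1, "Payment received")   # releases both, in order
print(first, second)
```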




&lt;h2&gt;
  
  
  Mobile makes everything worse
&lt;/h2&gt;

&lt;p&gt;Desktop clients are relatively stable.&lt;/p&gt;

&lt;p&gt;Mobile clients are not.&lt;/p&gt;

&lt;p&gt;You have to deal with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;app backgrounding&lt;/li&gt;
&lt;li&gt;battery optimization&lt;/li&gt;
&lt;li&gt;network switching&lt;/li&gt;
&lt;li&gt;silent disconnects&lt;/li&gt;
&lt;li&gt;delayed push delivery&lt;/li&gt;
&lt;li&gt;OS throttling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s why real systems often combine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WebSockets for active sessions&lt;/li&gt;
&lt;li&gt;APNs for iOS&lt;/li&gt;
&lt;li&gt;FCM for Android&lt;/li&gt;
&lt;li&gt;polling or pull fallback&lt;/li&gt;
&lt;li&gt;local persistence&lt;/li&gt;
&lt;li&gt;sync checkpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Important detail
&lt;/h3&gt;

&lt;p&gt;APNs and FCM are best-effort transports: neither guarantees that a notification arrives, or that it arrives exactly once.&lt;/p&gt;

&lt;p&gt;They can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;delay notifications&lt;/li&gt;
&lt;li&gt;drop messages under pressure&lt;/li&gt;
&lt;li&gt;coalesce updates&lt;/li&gt;
&lt;li&gt;expire tokens&lt;/li&gt;
&lt;li&gt;throttle traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if the notification matters, the server still needs durable state.&lt;/p&gt;
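&lt;p&gt;In practice, durable state plus a sync checkpoint means the client can always ask for everything after the last sequence it has seen. The sketch below uses an in-memory list as a stand-in for a real database; &lt;code&gt;NotificationInbox&lt;/code&gt; and &lt;code&gt;since&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
class NotificationInbox:
    """Server-side record of one user's notifications (in-memory stand-in for a database)."""

    def __init__(self) -> None:
        self._records: list[tuple[int, str]] = []

    def append(self, notification: str) -> int:
        """Store a notification and return its sequence number."""
        sequence = len(self._records) + 1
        self._records.append((sequence, notification))
        return sequence

    def since(self, checkpoint: int) -> list[tuple[int, str]]:
        """Return everything with a sequence number above the client's checkpoint."""
        return [record for record in self._records if record[0] > checkpoint]


inbox = NotificationInbox()
inbox.append("Payment received")
inbox.append("Payment refunded")     # assume the push for this one was dropped

client_checkpoint = 1                # the client only ever saw sequence 1
missed = inbox.since(client_checkpoint)
print(missed)
```

&lt;p&gt;When the app wakes up or reconnects, it sends its checkpoint and receives whatever the push channel failed to deliver.&lt;/p&gt;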




&lt;h2&gt;
  
  
  A real incident example
&lt;/h2&gt;

&lt;p&gt;At 3 AM, an on-call engineer gets paged because one user received dozens of duplicate payment emails.&lt;/p&gt;

&lt;p&gt;What happened?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the consumer crashed mid-batch&lt;/li&gt;
&lt;li&gt;the offset was not committed&lt;/li&gt;
&lt;li&gt;Kafka redelivered the same event&lt;/li&gt;
&lt;li&gt;the email sender had no dedupe check&lt;/li&gt;
&lt;li&gt;the user got spammed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That kind of issue is painful because it is not one bug.&lt;/p&gt;

&lt;p&gt;It is a chain of small design decisions that only becomes visible under failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  The practical fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_notification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notification&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;idempotency_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;notification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;notification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequence&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dedupe:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="n"&gt;email_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notification&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dedupe:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part is not the code style.&lt;/p&gt;

&lt;p&gt;It is the fact that the system now assumes duplicates can happen and is built to survive them.&lt;/p&gt;
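&lt;p&gt;One caveat about the snippet above: the &lt;code&gt;exists&lt;/code&gt; check and the &lt;code&gt;setex&lt;/code&gt; write are separate calls, so two consumers handling the same message at the same moment can both pass the check and both send. A common variation is to claim the key atomically before the side effect, using Redis &lt;code&gt;SET&lt;/code&gt; with the &lt;code&gt;NX&lt;/code&gt; and &lt;code&gt;EX&lt;/code&gt; options. The sketch below assumes a redis-py style &lt;code&gt;set(name, value, nx=True, ex=...)&lt;/code&gt; call and uses a small in-memory &lt;code&gt;FakeRedis&lt;/code&gt; stand-in so it runs without a server; &lt;code&gt;send_once&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import time


class FakeRedis:
    """In-memory stand-in that mimics SET with NX and EX for this sketch."""

    def __init__(self) -> None:
        self._expires_at: dict[str, float] = {}

    def set(self, name: str, value: str, nx: bool = False, ex: int = 0):
        now = time.monotonic()
        live = self._expires_at.get(name, 0.0) > now
        if nx and live:
            return None                  # key already claimed by an earlier call
        self._expires_at[name] = now + ex
        return True

    def delete(self, name: str) -> None:
        self._expires_at.pop(name, None)


def send_once(store, send, user_id: str, sequence: int) -> bool:
    """Send only if this call wins the claim; return True when a send happened."""
    key = f"dedupe:{user_id}:{sequence}"
    if not store.set(key, "1", nx=True, ex=24 * 3600):
        return False                     # duplicate delivery: skip the side effect
    try:
        send(user_id, sequence)
    except Exception:
        store.delete(key)                # release the claim so a retry can send
        raise
    return True


sent = []
store = FakeRedis()
results = [send_once(store, lambda u, s: sent.append((u, s)), "u1", 7) for _ in range(3)]
print(results, sent)
```

&lt;p&gt;The trade-off is explicit: claiming first removes the race, but a process that dies between the claim and the send leaves the notification unsent until the key expires or something reconciles it. Which side of that trade-off is acceptable depends on the notification.&lt;/p&gt;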




&lt;h2&gt;
  
  
  Observability is not optional
&lt;/h2&gt;

&lt;p&gt;If you cannot observe the pipeline, you cannot debug it.&lt;/p&gt;

&lt;p&gt;Useful signals include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queue lag&lt;/li&gt;
&lt;li&gt;retry count&lt;/li&gt;
&lt;li&gt;delivery success rate&lt;/li&gt;
&lt;li&gt;connection churn&lt;/li&gt;
&lt;li&gt;consumer health&lt;/li&gt;
&lt;li&gt;fan-out latency&lt;/li&gt;
&lt;li&gt;push provider errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real question is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Did we send the event?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can we prove the user received it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Those are very different questions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics worth tracking
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Good target&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Delivery success rate&lt;/td&gt;
&lt;td&gt;&amp;gt;99.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99 delivery latency&lt;/td&gt;
&lt;td&gt;&amp;lt;500ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consumer lag&lt;/td&gt;
&lt;td&gt;low and stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retry rate&lt;/td&gt;
&lt;td&gt;close to zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Connection churn&lt;/td&gt;
&lt;td&gt;predictable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
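&lt;p&gt;Two of these metrics are simple to compute from raw samples. The sketch below uses the nearest-rank definition of a percentile; the latency samples and delivery counts are made-up numbers for illustration, and &lt;code&gt;percentile&lt;/code&gt; is a hypothetical helper rather than a monitoring library call.&lt;/p&gt;

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [float(ms) for ms in range(1, 201)]   # hypothetical samples: 1..200 ms
attempted, delivered = 2000, 1994                    # hypothetical delivery counts

p99 = percentile(latencies_ms, 99)
success_rate = delivered / attempted
print(p99, success_rate)
```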




&lt;h2&gt;
  
  
  What to do in production
&lt;/h2&gt;

&lt;p&gt;A good notification system needs a runbook, not just code.&lt;/p&gt;

&lt;p&gt;If retries spike:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Throttle producers.&lt;/li&gt;
&lt;li&gt;Pause non-critical workers.&lt;/li&gt;
&lt;li&gt;Increase retry backoff.&lt;/li&gt;
&lt;li&gt;Check consumer lag.&lt;/li&gt;
&lt;li&gt;Check push provider errors.&lt;/li&gt;
&lt;li&gt;Rehydrate missed clients from durable state.&lt;/li&gt;
&lt;li&gt;Replay safely with idempotency keys.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is what makes the system operationally survivable.&lt;/p&gt;
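&lt;p&gt;For the "increase retry backoff" step, exponential backoff with jitter is a widely used approach, because it spreads retries out instead of letting them arrive in synchronized waves. A minimal sketch of the full-jitter variant; the &lt;code&gt;base&lt;/code&gt; and &lt;code&gt;cap&lt;/code&gt; defaults are arbitrary assumptions and &lt;code&gt;backoff_delay&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0, rng=random) -> float:
    """Full-jitter exponential backoff: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Delay in seconds before each retry attempt; values differ per run because of the jitter.
for attempt in range(6):
    print(attempt, round(backoff_delay(attempt), 2))
```

&lt;p&gt;Raising &lt;code&gt;base&lt;/code&gt; or &lt;code&gt;cap&lt;/code&gt; during an incident is a cheap way to take pressure off a struggling downstream provider.&lt;/p&gt;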




&lt;h2&gt;
  
  
  The scaling path
&lt;/h2&gt;

&lt;p&gt;A lot of teams go through the same evolution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Startup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API → WebSocket Server → Client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mid-scale
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API → Kafka → Notification Workers → WebSocket Gateway Cluster → Client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  High-scale
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API → Multi-region event bus → regional workers → regional gateways → clients
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At higher scale, the hardest problems are usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;state distribution&lt;/li&gt;
&lt;li&gt;per-user ordering&lt;/li&gt;
&lt;li&gt;region routing&lt;/li&gt;
&lt;li&gt;dedupe&lt;/li&gt;
&lt;li&gt;offline recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not CPU.&lt;/p&gt;

&lt;p&gt;State.&lt;/p&gt;




&lt;h2&gt;
  
  
  What strong teams optimize for
&lt;/h2&gt;

&lt;p&gt;Early teams optimize for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;speed of delivery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strong teams optimize for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;correctness&lt;/li&gt;
&lt;li&gt;recoverability&lt;/li&gt;
&lt;li&gt;observability&lt;/li&gt;
&lt;li&gt;graceful degradation&lt;/li&gt;
&lt;li&gt;idempotency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That difference matters a lot in production.&lt;/p&gt;

&lt;p&gt;A notification that is slightly late is usually acceptable.&lt;/p&gt;

&lt;p&gt;A notification that is duplicated, lost, or out of order is not.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing the system
&lt;/h2&gt;

&lt;p&gt;You should test the failure modes, not just the happy path.&lt;/p&gt;

&lt;p&gt;Try:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dropping WebSocket connections mid-message&lt;/li&gt;
&lt;li&gt;killing consumers during processing&lt;/li&gt;
&lt;li&gt;simulating mobile sleep/wake cycles&lt;/li&gt;
&lt;li&gt;forcing Kafka rebalances&lt;/li&gt;
&lt;li&gt;replaying duplicate events&lt;/li&gt;
&lt;li&gt;load testing fan-out spikes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the system only works when nothing fails, it is not ready.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;Real-time notification systems look simple until scale, retries, mobile behavior, ordering, and distributed state show up.&lt;/p&gt;

&lt;p&gt;Then they become one of the hardest backend problems in the product.&lt;/p&gt;

&lt;p&gt;The goal is not just to send events.&lt;/p&gt;

&lt;p&gt;The goal is to make sure the right user gets the right notification, with the right semantics, even when the system is under stress.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>websockets</category>
      <category>kafka</category>
      <category>backend</category>
    </item>
    <item>
      <title>AI Wrappers Are Dying: Why Most AI Products Fail</title>
      <dc:creator>Damir Karimov</dc:creator>
      <pubDate>Wed, 27 May 2026 12:01:10 +0000</pubDate>
      <link>https://dev.to/damir-karimov/ai-wrappers-are-dying-why-most-ai-products-fail-ano</link>
      <guid>https://dev.to/damir-karimov/ai-wrappers-are-dying-why-most-ai-products-fail-ano</guid>
      <description>&lt;p&gt;In 2026, building an app on top of OpenAI or Anthropic is easier than ever. But wrappers are dying.&lt;/p&gt;

&lt;p&gt;A polished UI and a few RAG pipelines can get you to launch. They will not get you lasting advantage.&lt;/p&gt;

&lt;p&gt;The OpenAI API is not a competitive moat.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrappers Are Dying
&lt;/h2&gt;

&lt;p&gt;The first wave of AI startups was inevitable. Foundation models became powerful enough that developers could ship useful products without training models from scratch. The barrier to entry dropped dramatically.&lt;/p&gt;

&lt;p&gt;The market filled up with wrappers.&lt;/p&gt;

&lt;p&gt;That was not irrational. It was the fastest way to test demand and prove people would pay for AI-enabled outcomes. For many founders, a wrapper was the right starting point. It reduced time-to-market and let them focus on distribution.&lt;/p&gt;

&lt;p&gt;But wrappers that worked for speed do not work for defensibility.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Wrappers Are Fragile
&lt;/h2&gt;

&lt;p&gt;A wrapper around an LLM is a thin interface over someone else's intelligence.&lt;/p&gt;

&lt;p&gt;When the underlying model improves, your product advantage shrinks. When a competitor copies your UX, your edge disappears. When the model provider ships your core feature natively, your differentiation collapses overnight.&lt;/p&gt;

&lt;p&gt;The closer your product is to a generic interface over a foundation model, the easier it is to clone.&lt;/p&gt;

&lt;p&gt;Three problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The UI is visible and easy to imitate.&lt;/li&gt;
&lt;li&gt;The prompts and workflows are often not deeply proprietary.&lt;/li&gt;
&lt;li&gt;The core model capability is rented, not owned.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Many AI products compete on packaging rather than infrastructure.&lt;/p&gt;

&lt;p&gt;If your product can be described as "ChatGPT, but for X," you have product-market fit risk before you have a moat.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Creates Real Moat
&lt;/h2&gt;

&lt;p&gt;A real moat in AI is not "we use GPT." It is owning something the next startup cannot easily replicate.&lt;/p&gt;

&lt;p&gt;That includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proprietary data&lt;/li&gt;
&lt;li&gt;Embedded workflows&lt;/li&gt;
&lt;li&gt;Deep enterprise integration&lt;/li&gt;
&lt;li&gt;Distribution advantages&lt;/li&gt;
&lt;li&gt;Domain-specific expertise&lt;/li&gt;
&lt;li&gt;Feedback loops that improve the product over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model access is replaceable. Workflow capture is sticky.&lt;/p&gt;

&lt;p&gt;If your product becomes part of how a team actually works, not just a tool they try once, you build defensibility. If you own the system of record, the approval flow, the compliance layer, or the operational pipeline, you are selling infrastructure, not AI.&lt;/p&gt;

&lt;p&gt;The more your product learns from user behavior, customer data, and domain-specific outcomes, the harder it becomes to copy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Moat Patterns That Survive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Proprietary Data Moat
&lt;/h3&gt;

&lt;p&gt;If your product collects high-signal, domain-specific data that competitors cannot access, you improve faster over time.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;labeled support cases&lt;/li&gt;
&lt;li&gt;medical annotations&lt;/li&gt;
&lt;li&gt;legal review outcomes&lt;/li&gt;
&lt;li&gt;sales conversation feedback&lt;/li&gt;
&lt;li&gt;codebase-specific assistant traces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The moat works only if the data turns into better predictions, better retrieval, or better workflow decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflow Moat
&lt;/h3&gt;

&lt;p&gt;If your product becomes the place where work starts, gets reviewed, and gets approved, switching becomes painful.&lt;/p&gt;

&lt;p&gt;Workflow moats require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;native integrations&lt;/li&gt;
&lt;li&gt;permissions and access control&lt;/li&gt;
&lt;li&gt;human-in-the-loop steps&lt;/li&gt;
&lt;li&gt;audit logs&lt;/li&gt;
&lt;li&gt;reliable outputs that fit existing processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enterprise AI products win by becoming infrastructure, not assistants.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distribution Moat
&lt;/h3&gt;

&lt;p&gt;If your product is embedded in Slack, email, CRM, IDEs, or internal tooling, it becomes harder to displace. Adoption is already inside the user's daily flow.&lt;/p&gt;

&lt;p&gt;The best model in the world loses if users never reach it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trust and Compliance Moat
&lt;/h3&gt;

&lt;p&gt;In regulated environments, trust is product value.&lt;/p&gt;

&lt;p&gt;If you can prove data handling, retention rules, access controls, auditability, and predictable behavior, you compete on more than output quality. For enterprise buyers, this is the difference between a demo and a contract.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost and Infrastructure Moat
&lt;/h3&gt;

&lt;p&gt;Some AI products create advantage by reducing inference cost, latency, or operational overhead at scale.&lt;/p&gt;

&lt;p&gt;This moat is weaker than proprietary data or workflow lock-in. It matters when usage volume is high. If you deliver similar quality at lower cost, your margin improves and pricing flexibility increases.&lt;/p&gt;




&lt;h2&gt;
  
  
  RAG Alone Is Not Enough
&lt;/h2&gt;

&lt;p&gt;RAG is useful. It is not a moat.&lt;/p&gt;

&lt;p&gt;Retrieval connects foundation models to private corpora, internal docs, and customer-specific context. But if every competitor can index similar documents and call the same model, the architecture is not defensible.&lt;/p&gt;

&lt;p&gt;RAG becomes valuable when paired with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;proprietary corpora&lt;/li&gt;
&lt;li&gt;strong ranking and retrieval quality&lt;/li&gt;
&lt;li&gt;feedback loops&lt;/li&gt;
&lt;li&gt;domain-specific evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The moat is not the retrieval layer. It is the combination of retrieval, data quality, and embedded usage over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Platform Dependency Is a Liability
&lt;/h2&gt;

&lt;p&gt;The biggest hidden risk in AI startups is platform dependency.&lt;/p&gt;

&lt;p&gt;If your roadmap depends on a single provider, you inherit their pricing, latency, policy changes, rate limits, and feature roadmap. That is not a moat. That is a liability.&lt;/p&gt;

&lt;p&gt;When OpenAI improves a capability, it helps the whole market, including your competitors. When OpenAI ships a built-in feature that overlaps with your product, your differentiation evaporates overnight.&lt;/p&gt;

&lt;p&gt;Relying entirely on external model APIs is dangerous for long-term architecture. The more your product is a front-end to a general model, the more exposed you are to commoditization.&lt;/p&gt;

&lt;p&gt;Ask this: if model prices change, if output quality improves, or if the model vendor ships your core feature natively, what still makes you valuable?&lt;/p&gt;




&lt;h2&gt;
  
  
  Enterprise Workflows Are Where Winners Live
&lt;/h2&gt;

&lt;p&gt;The strongest AI products solve a workflow that already exists inside a company. They do more than "answer questions."&lt;/p&gt;

&lt;p&gt;Enterprise buyers care about more than output quality. They care about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access control&lt;/li&gt;
&lt;li&gt;Compliance&lt;/li&gt;
&lt;li&gt;Auditability&lt;/li&gt;
&lt;li&gt;Data retention&lt;/li&gt;
&lt;li&gt;Integrations with existing systems&lt;/li&gt;
&lt;li&gt;Human approval steps&lt;/li&gt;
&lt;li&gt;Reliability at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Workflow-based products have stronger moats than generic assistants. They do not just generate text. They become part of operational machinery.&lt;/p&gt;

&lt;p&gt;Once AI is embedded in billing, support, procurement, legal review, or internal knowledge systems, switching costs rise quickly.&lt;/p&gt;

&lt;p&gt;The best products feel "boring" from the outside. They are not flashy consumer apps. They are operational systems that save time, reduce risk, or increase throughput.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vertical AI Wins
&lt;/h2&gt;

&lt;p&gt;Vertical AI is stronger than horizontal AI because it combines domain data, workflow design, and distribution.&lt;/p&gt;

&lt;p&gt;A vertical product knows the problem deeply. It understands terminology, edge cases, compliance rules, and customer expectations in a specific domain. This makes it harder to replace with a generic chatbot.&lt;/p&gt;

&lt;p&gt;Proprietary data becomes especially important here. The more your product learns from a narrow, high-value domain, the more its quality ties to data that others do not have.&lt;/p&gt;

&lt;p&gt;Winners connect three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;domain-specific data&lt;/li&gt;
&lt;li&gt;operational workflow&lt;/li&gt;
&lt;li&gt;recurring business value&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A good vertical AI product is deeply fitted to a single job. That fit becomes harder to copy with every interaction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Which AI Companies Survive
&lt;/h2&gt;

&lt;p&gt;AI companies that survive are not the ones with the flashiest demos. They turn model capability into durable product advantage.&lt;/p&gt;

&lt;p&gt;They:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;own proprietary or hard-to-access data&lt;/li&gt;
&lt;li&gt;sit inside critical workflows&lt;/li&gt;
&lt;li&gt;integrate deeply into enterprise systems&lt;/li&gt;
&lt;li&gt;build operational infrastructure, not interfaces&lt;/li&gt;
&lt;li&gt;create switching costs through usage, trust, and process&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model may be replaceable. The product around it should not be.&lt;/p&gt;

&lt;p&gt;This is the difference between a temporary AI app and a lasting business.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Measure Moat
&lt;/h2&gt;

&lt;p&gt;Signals that the moat is getting stronger:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retention stays high even when model quality changes&lt;/li&gt;
&lt;li&gt;Customers rely on the product as part of a repeatable workflow&lt;/li&gt;
&lt;li&gt;The cost to replicate your dataset is high&lt;/li&gt;
&lt;li&gt;More value comes from your proprietary layer than from the base model&lt;/li&gt;
&lt;li&gt;Integrations increase switching costs over time&lt;/li&gt;
&lt;li&gt;Unit economics improve as usage and feedback grow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Test: if a competitor copied your UI tomorrow, would they still need the same data, trust, integrations, and operational context to match your product?&lt;/p&gt;

&lt;p&gt;If yes, you are building a real moat.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The problem with most AI products is not that they use AI. It is that they confuse access to AI with defensibility.&lt;/p&gt;

&lt;p&gt;A great interface gets attention. It rarely creates a moat. Real technical moats come from data, workflow, infrastructure, and integration: things that are hard to copy and harder to unwind.&lt;/p&gt;

&lt;p&gt;The right question is not "How can we add a model?" The right question is: What do we own that becomes more valuable over time?&lt;/p&gt;

&lt;p&gt;The best AI companies are not the ones with the loudest demo. They are the ones whose product gets more embedded, more trusted, and more expensive to replace every quarter.&lt;/p&gt;

&lt;p&gt;Wrappers are dying. Build a moat instead.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>startup</category>
      <category>openai</category>
    </item>
    <item>
      <title>Why Good Abstractions Make Debugging Harder</title>
      <dc:creator>Damir Karimov</dc:creator>
      <pubDate>Thu, 21 May 2026 15:03:00 +0000</pubDate>
      <link>https://dev.to/damir-karimov/why-good-abstractions-make-debugging-harder-lo</link>
      <guid>https://dev.to/damir-karimov/why-good-abstractions-make-debugging-harder-lo</guid>
      <description>&lt;p&gt;Good abstractions are great when you are building software.&lt;/p&gt;

&lt;p&gt;They are much less great when you are debugging production.&lt;/p&gt;

&lt;p&gt;The reason is simple: abstraction hides details, and debugging often depends on the details you hoped to ignore.&lt;/p&gt;

&lt;p&gt;In small codebases, this is barely noticeable. In real systems, especially with caches, async flows, optimistic UI, and multiple state owners, it becomes a serious problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The core issue
&lt;/h2&gt;

&lt;p&gt;The more layers you add, the easier it is for the system to become “locally correct” and “globally wrong”.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the frontend thinks the payment succeeded,&lt;/li&gt;
&lt;li&gt;the backend committed the transaction,&lt;/li&gt;
&lt;li&gt;the event was published,&lt;/li&gt;
&lt;li&gt;the cache still serves the old value,&lt;/li&gt;
&lt;li&gt;the UI shows stale data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every layer is doing something reasonable.&lt;/p&gt;

&lt;p&gt;The problem is that they are not all talking about the same version of reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  A simple example
&lt;/h2&gt;

&lt;p&gt;Imagine this flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User clicks &lt;strong&gt;Retry payment&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Frontend updates UI optimistically&lt;/li&gt;
&lt;li&gt;API returns &lt;code&gt;200 OK&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Database is updated&lt;/li&gt;
&lt;li&gt;Event is sent to downstream systems&lt;/li&gt;
&lt;li&gt;Redis still serves old state&lt;/li&gt;
&lt;li&gt;UI refreshes from cache and shows stale data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the kind of bug that wastes hours.&lt;/p&gt;

&lt;p&gt;Not because any single line of code is hard, but because the truth is spread across several places.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example in code
&lt;/h2&gt;

&lt;p&gt;Let’s say the frontend uses optimistic updates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;onRetryPayment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;setPaymentStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PAID&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/api/payments/retry&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Retry failed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;setPaymentStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;FAILED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance, this looks fine.&lt;/p&gt;

&lt;p&gt;But now imagine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the API succeeds,&lt;/li&gt;
&lt;li&gt;the DB is updated,&lt;/li&gt;
&lt;li&gt;an event is emitted,&lt;/li&gt;
&lt;li&gt;a consumer deduplicates the event incorrectly,&lt;/li&gt;
&lt;li&gt;Redis still contains the old value,&lt;/li&gt;
&lt;li&gt;the UI re-renders from stale cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bug is no longer in this function.&lt;/p&gt;

&lt;p&gt;The bug is in the &lt;strong&gt;propagation path&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why abstractions make this worse
&lt;/h2&gt;

&lt;p&gt;Abstractions hide the exact mechanics that matter during incidents.&lt;/p&gt;

&lt;p&gt;They hide things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;who owns the state,&lt;/li&gt;
&lt;li&gt;when the state changes,&lt;/li&gt;
&lt;li&gt;whether the update is synchronous or async,&lt;/li&gt;
&lt;li&gt;whether caches are invalidated,&lt;/li&gt;
&lt;li&gt;whether retries are safe,&lt;/li&gt;
&lt;li&gt;whether events can arrive out of order.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is useful in normal development.&lt;/p&gt;

&lt;p&gt;It is terrible during debugging.&lt;/p&gt;

&lt;p&gt;Because when something is wrong, you do not need another clean interface. You need visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Typical failure patterns
&lt;/h2&gt;

&lt;p&gt;These are the patterns I see most often in real systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Stale read
&lt;/h3&gt;

&lt;p&gt;The data was updated, but one layer still serves an old version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// DB updated successfully&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;paymentId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PAID&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Cache not invalidated&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DB = &lt;code&gt;PAID&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;cache = &lt;code&gt;PENDING&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;UI = &lt;code&gt;PENDING&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
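
&lt;p&gt;One way to close this gap is to invalidate the cached copy on the same code path that performs the write. The sketch below uses plain in-memory objects as stand-ins for the database and the cache; &lt;code&gt;readStatus&lt;/code&gt; and &lt;code&gt;markPaid&lt;/code&gt; are hypothetical names, not part of any library.&lt;/p&gt;

```typescript
// Hypothetical in-memory stand-ins for the database and the cache layer.
type PaymentStatus = "PENDING" | "PAID";
const db: { [id: string]: PaymentStatus } = { pay_1: "PENDING" };
const cache: { [id: string]: PaymentStatus } = {};

// Read path: serve from the cache, otherwise read the DB and populate the cache.
function readStatus(id: string): PaymentStatus | undefined {
  const cached = cache[id];
  if (cached !== undefined) return cached;
  const fresh = db[id];
  if (fresh !== undefined) cache[id] = fresh;
  return fresh;
}

// Write path: update the source of truth, then drop the cached copy
// so the next read cannot serve the old value.
function markPaid(id: string): void {
  db[id] = "PAID";
  delete cache[id];
}

readStatus("pay_1"); // warms the cache with PENDING
markPaid("pay_1");
console.log(readStatus("pay_1")); // the re-read goes back to the DB
```

&lt;p&gt;In a real system the invalidation can still fail or race with a concurrent read, so treat this as the minimum, not a guarantee.&lt;/p&gt;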

&lt;h3&gt;
  
  
  2. Lost update
&lt;/h3&gt;

&lt;p&gt;Two writes happen close together, and one silently overwrites the other.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Client A read the profile, then writes its change&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;updateProfile&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Alex&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="c1"&gt;// Client B read the same old profile and writes without seeing A's change&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;updateProfile&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;John&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the system uses last-write-wins without proper locking or versioning, the final state may not match user intent.&lt;/p&gt;
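
&lt;p&gt;Versioning can be sketched as an optimistic concurrency check: every write states which version it was based on, and a stale writer is rejected instead of silently winning. The record shape and the two-argument &lt;code&gt;updateProfile&lt;/code&gt; below are assumptions made for illustration.&lt;/p&gt;

```typescript
// Hypothetical profile record carrying a version counter.
type ProfileRecord = { name: string; version: number };
let stored: ProfileRecord = { name: "Sam", version: 1 };

// Accept a write only if the caller based it on the latest version.
function updateProfile(name: string, expectedVersion: number): boolean {
  if (stored.version !== expectedVersion) return false; // stale writer
  stored = { name, version: stored.version + 1 };
  return true;
}

// Two clients both read version 1, then write.
const first = updateProfile("Alex", 1);  // accepted, version becomes 2
const second = updateProfile("John", 1); // rejected: based on an old version
console.log(first, second, stored.name);
```

&lt;p&gt;The rejected client can then re-read and decide what to do, which turns a silent overwrite into a visible conflict.&lt;/p&gt;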

&lt;h3&gt;
  
  
  3. Ghost update
&lt;/h3&gt;

&lt;p&gt;One layer changes, but another never receives the update.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;updateOrderStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PAID&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="c1"&gt;// but query cache is never invalidated&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is a UI that looks stuck even though the backend is correct.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Event reorder bug
&lt;/h3&gt;

&lt;p&gt;Events arrive in a different order than they were produced.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Event B processed before Event A&lt;/span&gt;
&lt;span class="nf"&gt;processEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;payment_succeeded&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;processEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;payment_pending&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the final state may be wrong even if both handlers are valid.&lt;/p&gt;
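
&lt;p&gt;A common guard, assuming each event carries a per-payment sequence number (an assumption; the snippet above does not show one), is to ignore anything that is not newer than what was already applied. The types and the object-based &lt;code&gt;processEvent&lt;/code&gt; signature here are hypothetical.&lt;/p&gt;

```typescript
type StatusEvent = { paymentId: string; seq: number; status: string };
type AppliedState = { seq: number; status: string };
const paymentState: { [paymentId: string]: AppliedState } = {};

// Apply an event only if it is newer than the last one applied for that payment.
function processEvent(event: StatusEvent): boolean {
  const current = paymentState[event.paymentId];
  if (current !== undefined) {
    if (current.seq >= event.seq) return false; // stale or duplicate event
  }
  paymentState[event.paymentId] = { seq: event.seq, status: event.status };
  return true;
}

// Delivered out of order: "succeeded" (seq 2) arrives before "pending" (seq 1).
const appliedNewer = processEvent({ paymentId: "pay_1", seq: 2, status: "payment_succeeded" });
const appliedOlder = processEvent({ paymentId: "pay_1", seq: 1, status: "payment_pending" });
console.log(appliedNewer, appliedOlder, paymentState["pay_1"]);
```

&lt;p&gt;The late &lt;code&gt;payment_pending&lt;/code&gt; event is dropped, so the final state stays at &lt;code&gt;payment_succeeded&lt;/code&gt;.&lt;/p&gt;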

&lt;h2&gt;
  
  
  The debugging trap
&lt;/h2&gt;

&lt;p&gt;The trap is assuming this is a code bug.&lt;/p&gt;

&lt;p&gt;Very often it is not.&lt;/p&gt;

&lt;p&gt;It is a &lt;strong&gt;state ownership bug&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That means the real question is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Which function crashed?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real question is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Which layer is the source of truth right now?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you cannot answer that clearly, debugging becomes guesswork.&lt;/p&gt;

&lt;h2&gt;
  
  
  A better way to think about it
&lt;/h2&gt;

&lt;p&gt;Instead of thinking in terms of “where is the bug?”, think in terms of “where does state live?”&lt;/p&gt;

&lt;p&gt;A useful checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where is the canonical value stored?&lt;/li&gt;
&lt;li&gt;Which layer may cache it?&lt;/li&gt;
&lt;li&gt;Which layer may derive it?&lt;/li&gt;
&lt;li&gt;Which layer may overwrite it?&lt;/li&gt;
&lt;li&gt;Which layer may delay it?&lt;/li&gt;
&lt;li&gt;Which layer may retry it?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the same value exists in five places, you now have five opportunities for disagreement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debugging strategy
&lt;/h2&gt;

&lt;p&gt;When a bug crosses abstraction boundaries, I usually inspect it in this order:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Check the source of truth
&lt;/h3&gt;

&lt;p&gt;Confirm where the canonical data lives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Rebuild the timeline
&lt;/h3&gt;

&lt;p&gt;Trace the state from user action to backend write to cache update to UI read.&lt;/p&gt;
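
&lt;p&gt;If each layer records what it saw under a shared trace ID, the timeline can be replayed to find the first layer that disagrees with the source of truth. This is a minimal sketch; &lt;code&gt;observe&lt;/code&gt;, &lt;code&gt;firstDivergence&lt;/code&gt;, and the layer names are made up for the example.&lt;/p&gt;

```typescript
type LayerObservation = { traceId: string; layer: string; value: string };
const timeline: LayerObservation[] = [];

// Called at each boundary with the value that layer holds for this trace.
function observe(traceId: string, layer: string, value: string): void {
  timeline.push({ traceId, layer, value });
}

// Walk the timeline in recorded order and return the first layer
// whose value differs from the canonical one.
function firstDivergence(traceId: string, truth: string): string | undefined {
  for (const entry of timeline) {
    if (entry.traceId !== traceId) continue;
    if (entry.value !== truth) return entry.layer;
  }
  return undefined;
}

observe("t-1", "api", "PAID");
observe("t-1", "db", "PAID");
observe("t-1", "redis", "PENDING");
observe("t-1", "ui", "PENDING");
console.log(firstDivergence("t-1", "PAID"));
```

&lt;p&gt;Here the UI is wrong, but the first divergence is one layer earlier, in the cache.&lt;/p&gt;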

&lt;h3&gt;
  
  
  Step 3: Check invalidation
&lt;/h3&gt;

&lt;p&gt;If a cache exists, verify it is updated or cleared at the right moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Check idempotency
&lt;/h3&gt;

&lt;p&gt;If retries or events are involved, verify the operation can safely happen more than once.&lt;/p&gt;
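
&lt;p&gt;One way to make a retry safe is an idempotency key: the first call performs the side effect and stores the result, and every replay with the same key returns the stored result. The in-memory store and the &lt;code&gt;retryPayment&lt;/code&gt; function below are illustrative assumptions, not a specific payment API.&lt;/p&gt;

```typescript
type ChargeResult = { chargeId: string };
const processed: { [idempotencyKey: string]: ChargeResult } = {};
let chargesMade = 0;

function retryPayment(idempotencyKey: string): ChargeResult {
  const previous = processed[idempotencyKey];
  if (previous !== undefined) return previous; // replay: no second side effect
  chargesMade += 1; // the side effect happens once per key
  const result = { chargeId: "ch_" + chargesMade };
  processed[idempotencyKey] = result;
  return result;
}

const firstAttempt = retryPayment("retry-42");
const secondAttempt = retryPayment("retry-42"); // same key, e.g. a client retry after a timeout
console.log(firstAttempt.chargeId === secondAttempt.chargeId, chargesMade);
```

&lt;p&gt;In production the key store has to be durable and shared between workers, which this sketch does not attempt.&lt;/p&gt;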

&lt;h3&gt;
  
  
  Step 5: Check ordering
&lt;/h3&gt;

&lt;p&gt;If events are async, verify the system does not depend on strict ordering unless it actually guarantees it.&lt;/p&gt;

&lt;h2&gt;
  
  
  When abstractions do help
&lt;/h2&gt;

&lt;p&gt;This is not an anti-abstraction argument.&lt;/p&gt;

&lt;p&gt;Good abstractions are still valuable when they:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduce search space,&lt;/li&gt;
&lt;li&gt;make ownership clear,&lt;/li&gt;
&lt;li&gt;keep state local,&lt;/li&gt;
&lt;li&gt;expose transitions explicitly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, a small component with local state is easier to debug than three caches and two event consumers trying to keep the same value in sync.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setCount&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;button&lt;/span&gt; &lt;span class="na"&gt;onClick&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setCount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      Count: &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;button&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is easy to reason about because there is one owner of the state.&lt;/p&gt;

&lt;p&gt;That is the difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do in real systems
&lt;/h2&gt;

&lt;p&gt;If you want abstractions to stay helpful in production, make them observable.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add logs at boundaries,&lt;/li&gt;
&lt;li&gt;use trace IDs,&lt;/li&gt;
&lt;li&gt;keep ownership explicit,&lt;/li&gt;
&lt;li&gt;invalidate caches intentionally,&lt;/li&gt;
&lt;li&gt;design retries to be safe,&lt;/li&gt;
&lt;li&gt;avoid hidden duplicated state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good abstraction should reduce complexity, not hide the mechanics that make incidents debuggable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;The best abstractions are honest.&lt;/p&gt;

&lt;p&gt;They do not pretend the system is simpler than it is. They make the system easier to understand &lt;strong&gt;without hiding where truth lives&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is why debugging gets harder as systems grow: not because abstraction is bad, but because abstraction is often too successful at hiding the exact thing you need under pressure.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>systemdesign</category>
      <category>frontend</category>
      <category>software</category>
    </item>
    <item>
      <title>AI-generated code doesn't fail loudly. It fails correctly-looking.</title>
      <dc:creator>Damir Karimov</dc:creator>
      <pubDate>Wed, 13 May 2026 12:39:58 +0000</pubDate>
      <link>https://dev.to/damir-karimov/ai-generated-code-doesnt-fail-loudly-it-fails-correctly-looking-1acc</link>
      <guid>https://dev.to/damir-karimov/ai-generated-code-doesnt-fail-loudly-it-fails-correctly-looking-1acc</guid>
      <description>&lt;p&gt;AI-generated code rarely breaks in obvious ways. It passes review, ships&lt;br&gt;
to production, and behaves correctly in controlled scenarios. The&lt;br&gt;
problem is what happens after: failures appear only under timing, load,&lt;br&gt;
retries, or inconsistent state transitions.&lt;/p&gt;

&lt;p&gt;The core issue is not obvious bugs. It is code that looks structurally correct while silently ignoring real-world failure modes.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why AI code feels correct
&lt;/h2&gt;

&lt;p&gt;AI tends to generate implementations with strong surface-level signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  consistent TypeScript types&lt;/li&gt;
&lt;li&gt;  standard architectural patterns&lt;/li&gt;
&lt;li&gt;  clean async/await flows&lt;/li&gt;
&lt;li&gt;  readable naming conventions&lt;/li&gt;
&lt;li&gt;  familiar framework usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This produces a strong cognitive bias during review. The code does not look "risky", so it is assumed to be correct.&lt;/p&gt;

&lt;p&gt;The gap appears because readability is not equivalent to correctness under production conditions.&lt;/p&gt;


&lt;h2&gt;
  
  
  Where AI-generated code typically fails
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Concurrency and race conditions
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;updateProfile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Profile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;setLoading&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;updateProfile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nf"&gt;setUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;setLoading&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This assumes a single linear execution.&lt;/p&gt;

&lt;p&gt;In real systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  multiple requests can run in parallel&lt;/li&gt;
&lt;li&gt;  responses can resolve out of order&lt;/li&gt;
&lt;li&gt;  a slow response to an older request can arrive last and overwrite newer state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: stale state overwrite without errors or crashes.&lt;/p&gt;
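
&lt;p&gt;A small guard against this is to tag every request and apply a response only if it belongs to the most recent one. The sketch below simulates the responses with plain callbacks so the out-of-order arrival is explicit; &lt;code&gt;startUpdate&lt;/code&gt; and the variable names are hypothetical.&lt;/p&gt;

```typescript
// Hypothetical UI state and a counter identifying the most recent request.
let user = "initial";
let latestRequestId = 0;

// Start a request and return the callback that runs when its response arrives.
function startUpdate(): (responseUser: string) => void {
  latestRequestId += 1;
  const requestId = latestRequestId;
  return (responseUser: string) => {
    if (requestId !== latestRequestId) return; // a newer request exists: drop this response
    user = responseUser;
  };
}

const onFirstResponse = startUpdate();  // the user saves "Alex"
const onSecondResponse = startUpdate(); // the user then saves "John"

// Responses resolve out of order: the newer one first, the older one last.
onSecondResponse("John");
onFirstResponse("Alex");
console.log(user);
```

&lt;p&gt;The older response is ignored, so the state reflects the last thing the user asked for rather than the last thing the network delivered.&lt;/p&gt;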


&lt;h3&gt;
  
  
  2. Optimistic updates without consistency guarantees
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;setTodos&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;newTodo&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createTodo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;newTodo&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This assumes success.&lt;/p&gt;

&lt;p&gt;Failure scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  request fails but UI is not rolled back&lt;/li&gt;
&lt;li&gt;  retry creates duplicate entries&lt;/li&gt;
&lt;li&gt;  frontend state diverges from backend state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system remains "visually correct" while data integrity is broken.&lt;/p&gt;
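
&lt;p&gt;A safer shape, sketched here as pure functions over a list, is to give each optimistic item a client-generated ID and a pending flag, then either confirm or roll back exactly that item. The &lt;code&gt;Todo&lt;/code&gt; shape and the function names are assumptions for illustration.&lt;/p&gt;

```typescript
type Todo = { clientId: string; title: string; pending: boolean };

// Optimistic step: show the item immediately, marked as pending.
function addOptimistic(todos: Todo[], clientId: string, title: string): Todo[] {
  return todos.concat([{ clientId, title, pending: true }]);
}

// Success: the server confirmed the item, so clear the pending flag.
function confirmTodo(todos: Todo[], clientId: string): Todo[] {
  return todos.map((t) =>
    t.clientId === clientId ? { clientId: t.clientId, title: t.title, pending: false } : t
  );
}

// Failure: remove exactly the item that was added optimistically.
function rollbackTodo(todos: Todo[], clientId: string): Todo[] {
  return todos.filter((t) => t.clientId !== clientId);
}

let todos: Todo[] = [];
todos = addOptimistic(todos, "c1", "Buy milk");
todos = addOptimistic(todos, "c2", "Pay rent");
todos = rollbackTodo(todos, "c1"); // the request for c1 failed
todos = confirmTodo(todos, "c2");  // the request for c2 succeeded
console.log(todos);
```

&lt;p&gt;If the same client ID is also sent to the backend, the server can use it to recognize a retry instead of creating a duplicate entry.&lt;/p&gt;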


&lt;h3&gt;
  
  
  3. Stale closures and lifecycle assumptions
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;interval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setInterval&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;clearInterval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This pattern locks in initial state.&lt;/p&gt;

&lt;p&gt;In production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  values become stale over time&lt;/li&gt;
&lt;li&gt;  UI desynchronization occurs&lt;/li&gt;
&lt;li&gt;  behavior depends on render timing rather than logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No runtime error occurs, so the issue is often missed.&lt;/p&gt;
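
&lt;p&gt;The mechanism is easier to see outside React. The sketch below contrasts a callback that captures a value with one that reads through a mutable box, which is the role a &lt;code&gt;useRef&lt;/code&gt; object plays in a component; the helper names are made up for the example.&lt;/p&gt;

```typescript
// A mutable box, standing in for the object React's useRef returns.
type CountRef = { current: number };

// Stale version: the callback captures the number as it was at creation time.
function makeStaleLogger(count: number): () => number {
  return () => count;
}

// Fresh version: the callback captures the box and reads it on every call.
function makeFreshLogger(ref: CountRef): () => number {
  return () => ref.current;
}

const countRef: CountRef = { current: 0 };
const stale = makeStaleLogger(countRef.current);
const fresh = makeFreshLogger(countRef);

countRef.current = 5; // state changes after the callbacks were created
console.log(stale(), fresh());
```

&lt;p&gt;In the effect above, the equivalent fixes are to list &lt;code&gt;count&lt;/code&gt; in the dependency array so the interval is recreated, or to mirror it into a ref that the interval callback reads.&lt;/p&gt;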


&lt;h3&gt;
  
  
  4. Weak caching and invalidation logic
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cacheKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`user-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;cacheKey&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;cacheKey&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetchUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;cacheKey&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This assumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  stable data shape&lt;/li&gt;
&lt;li&gt;  stable identity rules&lt;/li&gt;
&lt;li&gt;  single write path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In real systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  partial updates invalidate assumptions&lt;/li&gt;
&lt;li&gt;  multiple services mutate the same entity&lt;/li&gt;
&lt;li&gt;  cache becomes silently stale rather than obviously wrong&lt;/li&gt;
&lt;/ul&gt;
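
&lt;p&gt;A minimal hardening of the snippet above, under the assumption that bounded staleness is acceptable, is to give entries an expiry and expose an explicit invalidation hook for every write path. The clock is passed in as a parameter so the behavior is easy to test; all names are hypothetical.&lt;/p&gt;

```typescript
type CacheEntry = { value: string; expiresAt: number };
const userCache: { [key: string]: CacheEntry } = {};

// Return the cached value only while it is still within its time-to-live.
function getCached(key: string, now: number): string | undefined {
  const entry = userCache[key];
  if (entry === undefined) return undefined;
  if (now >= entry.expiresAt) {
    delete userCache[key]; // expired: force a refetch
    return undefined;
  }
  return entry.value;
}

function setCached(key: string, value: string, now: number, ttlMs: number): void {
  userCache[key] = { value, expiresAt: now + ttlMs };
}

// Every write path for the entity should call this, not only the one that filled the cache.
function invalidate(key: string): void {
  delete userCache[key];
}

setCached("user-7", "Alex", 0, 1000);
console.log(getCached("user-7", 500));  // still within the TTL
console.log(getCached("user-7", 1500)); // past the TTL
```

&lt;p&gt;The TTL does not make the cache correct. It only puts an upper bound on how long a missed invalidation can stay invisible.&lt;/p&gt;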


&lt;h3&gt;
  
  
  5. Hidden assumptions about APIs
&lt;/h3&gt;

&lt;p&gt;AI can introduce plausible but non-existent APIs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refreshSession&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;force&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invalidateAllQueries&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These patterns often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  look consistent with ecosystem conventions&lt;/li&gt;
&lt;li&gt;  pass code review without deep verification&lt;/li&gt;
&lt;li&gt;  fail only at runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where the call site is untyped or loosely typed, nothing catches the missing method at compile time, and the error surfaces in production instead.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Accumulated lifecycle leaks
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;controller&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AbortController&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="nf"&gt;fetchData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Individually correct, but when repeated across systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  inconsistent cleanup patterns accumulate&lt;/li&gt;
&lt;li&gt;  aborted requests still resolve in edge cases&lt;/li&gt;
&lt;li&gt;  memory usage grows gradually&lt;/li&gt;
&lt;li&gt;  behavior becomes harder to reproduce&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Systemic issue: reduced verification depth
&lt;/h2&gt;

&lt;p&gt;The main shift introduced by AI-generated code is not implementation speed, but review behavior.&lt;/p&gt;

&lt;p&gt;Before AI, writing code required reasoning during implementation. After AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  code already looks complete&lt;/li&gt;
&lt;li&gt;  structure appears correct by default&lt;/li&gt;
&lt;li&gt;  reviewers focus on surface validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates a subtle degradation in engineering discipline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  fewer edge-case simulations&lt;/li&gt;
&lt;li&gt;  less reasoning about concurrency&lt;/li&gt;
&lt;li&gt;  weaker validation of failure states&lt;/li&gt;
&lt;li&gt;  acceptance of "looks correct" as correctness&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Impact on real systems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Frontend state drift
&lt;/h3&gt;

&lt;p&gt;UI remains stable visually while backend state diverges.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Authentication and session issues
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  race conditions during token refresh&lt;/li&gt;
&lt;li&gt;  inconsistent logout handling&lt;/li&gt;
&lt;li&gt;  background requests using invalid sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Payments and idempotency problems
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  duplicate transactions&lt;/li&gt;
&lt;li&gt;  retries without deduplication&lt;/li&gt;
&lt;li&gt;  partial failure inconsistencies&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Distributed system inconsistencies
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  assumption of ordering guarantees&lt;/li&gt;
&lt;li&gt;  reliance on immediate consistency&lt;/li&gt;
&lt;li&gt;  incorrect retry semantics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These issues are not immediately visible. They surface as rare, non-reproducible incidents.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real risk
&lt;/h2&gt;

&lt;p&gt;AI does not generate obviously wrong code.&lt;/p&gt;

&lt;p&gt;It generates code that satisfies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  type safety&lt;/li&gt;
&lt;li&gt;  structural conventions&lt;/li&gt;
&lt;li&gt;  expected patterns&lt;/li&gt;
&lt;li&gt;  readable abstractions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates false confidence during review.&lt;/p&gt;

&lt;p&gt;The critical failure is not bugs themselves, but reduced skepticism toward code that appears correct.&lt;/p&gt;

&lt;p&gt;Once that happens, correctness is no longer actively verified. It is assumed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AI increases development speed, but it also changes how correctness is perceived.&lt;/p&gt;

&lt;p&gt;The danger is code that looks correct enough that nobody questions it deeply.&lt;/p&gt;

&lt;p&gt;When that happens, production issues stop being introduced by obvious mistakes and start emerging from unexamined assumptions embedded in clean-looking code.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>codequality</category>
      <category>frontend</category>
    </item>
    <item>
      <title>LLM-Driven Client-Side Caching: A Hybrid Decision Architecture</title>
      <dc:creator>Damir Karimov</dc:creator>
      <pubDate>Mon, 04 May 2026 15:22:22 +0000</pubDate>
      <link>https://dev.to/damir-karimov/llm-driven-client-side-caching-a-hybrid-decision-architecture-322m</link>
      <guid>https://dev.to/damir-karimov/llm-driven-client-side-caching-a-hybrid-decision-architecture-322m</guid>
      <description>&lt;p&gt;Client-side caching is usually implemented as a storage optimization layer (TTL, SWR, invalidation rules). In practice it behaves like a decision system under uncertainty.&lt;/p&gt;

&lt;p&gt;Static strategies fail when data volatility is non-uniform across the same application. This leads to either stale UI or excessive network traffic.&lt;/p&gt;

&lt;p&gt;This article breaks down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;why standard caching approaches plateau&lt;/li&gt;
&lt;li&gt;where ML improves the system&lt;/li&gt;
&lt;li&gt;where LLMs actually fit&lt;/li&gt;
&lt;li&gt;how to design a production-grade decision pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Problem: caching is not a storage problem
&lt;/h2&gt;

&lt;p&gt;Different data types behave differently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;user profiles → low volatility&lt;/li&gt;
&lt;li&gt;feeds / notifications → high volatility&lt;/li&gt;
&lt;li&gt;search results → context-dependent volatility&lt;/li&gt;
&lt;li&gt;partially hydrated UI → unknown volatility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core issue:&lt;/p&gt;

&lt;p&gt;Caching requires a policy decision per request, not a static rule.&lt;/p&gt;

&lt;p&gt;So the real problem is:&lt;/p&gt;

&lt;p&gt;data → context → decision (cache / revalidate / bypass)&lt;/p&gt;

&lt;h2&gt;
  
  
  Baseline systems (what already exists)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. SWR / TTL-based caching
&lt;/h3&gt;

&lt;p&gt;Used in React Query / SWR:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stale-while-revalidate&lt;/li&gt;
&lt;li&gt;background refetch&lt;/li&gt;
&lt;li&gt;TTL invalidation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Works when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;update cycles are predictable&lt;/li&gt;
&lt;li&gt;data freshness is stable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fails when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;volatility varies inside the same dataset&lt;/li&gt;
&lt;li&gt;freshness depends on UI state&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Heuristic scoring systems
&lt;/h3&gt;

&lt;p&gt;Example adaptive TTL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
&lt;span class="nx"&gt;volatilityScore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EWMA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;changeFrequency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;priorityScore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;userInteractionWeight&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;dataImportance&lt;/span&gt;
&lt;span class="nx"&gt;ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;baseTTL&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;volatilityScore&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Improves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;adaptive cache lifetime&lt;/li&gt;
&lt;li&gt;frequency-aware invalidation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;requires manual feature design&lt;/li&gt;
&lt;li&gt;domain-specific tuning&lt;/li&gt;
&lt;li&gt;breaks under missing signals&lt;/li&gt;
&lt;/ul&gt;
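
&lt;p&gt;The adaptive TTL formula can be sketched as runnable code. This version assumes &lt;code&gt;changeFrequency&lt;/code&gt; is a stream of "changes per minute" samples, adds a clamp so a volatility of zero does not divide by zero, and leaves out &lt;code&gt;priorityScore&lt;/code&gt; because the formula above does not feed it into the TTL. The smoothing factor, the bounds, and the sample data are arbitrary choices for the example.&lt;/p&gt;

```typescript
// Exponentially weighted moving average: alpha near 1 favors recent samples.
function ewma(previous: number, sample: number, alpha: number): number {
  return alpha * sample + (1 - alpha) * previous;
}

// ttl = baseTTL / volatilityScore, clamped to stay within [minTtlMs, maxTtlMs].
function adaptiveTtl(baseTtlMs: number, volatilityScore: number, minTtlMs: number, maxTtlMs: number): number {
  if (volatilityScore > 0) {
    const raw = baseTtlMs / volatilityScore;
    return Math.min(maxTtlMs, Math.max(minTtlMs, raw));
  }
  return maxTtlMs; // no observed changes: cache as long as allowed
}

// Hypothetical samples: quiet at first, then the data starts changing.
let volatility = 0;
for (const sample of [0, 0, 4, 4]) {
  volatility = ewma(volatility, sample, 0.5);
}
console.log(volatility, adaptiveTtl(60000, volatility, 1000, 300000));
```

&lt;p&gt;The clamp is where the manual tuning mentioned above shows up: the bounds are product decisions, not something the score can derive.&lt;/p&gt;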

&lt;h3&gt;
  
  
  3. Lightweight ML models
&lt;/h3&gt;

&lt;p&gt;Typical approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;logistic regression&lt;/li&gt;
&lt;li&gt;XGBoost / LightGBM&lt;/li&gt;
&lt;li&gt;embedding classifiers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pros:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fast inference&lt;/li&gt;
&lt;li&gt;stable behavior&lt;/li&gt;
&lt;li&gt;cheaper than LLMs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;needs labeled “optimal cache decision” data (rare)&lt;/li&gt;
&lt;li&gt;retraining pipeline required&lt;/li&gt;
&lt;li&gt;brittle under product changes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why all baseline approaches plateau
&lt;/h2&gt;

&lt;p&gt;All classical systems assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;feature space is complete&lt;/li&gt;
&lt;li&gt;behavior is stationary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In real systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;user behavior is contextual&lt;/li&gt;
&lt;li&gt;volatility depends on UI state&lt;/li&gt;
&lt;li&gt;freshness is semantic, not numeric&lt;/li&gt;
&lt;li&gt;signals are incomplete&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;heuristics → saturate&lt;/li&gt;
&lt;li&gt;ML-light → overfit or drift&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key idea: caching is a decision system under uncertainty
&lt;/h2&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;p&gt;“how long do we cache this?”&lt;/p&gt;

&lt;p&gt;The correct formulation is:&lt;/p&gt;

&lt;p&gt;“what action should we take given incomplete information?”&lt;/p&gt;

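&lt;p&gt;The available actions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HIT: serve the cached entry as-is&lt;/li&gt;
&lt;li&gt;REVALIDATE: check the cached entry against the origin before serving it&lt;/li&gt;
&lt;li&gt;BYPASS: skip the cache and fetch from the network&lt;/li&gt;
&lt;li&gt;SWR: serve the stale entry immediately and refresh it in the background (stale-while-revalidate)&lt;/li&gt;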
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where LLMs fit (and where they don’t)
&lt;/h2&gt;

&lt;p&gt;LLMs are not a replacement layer.&lt;/p&gt;

&lt;p&gt;They function as a fallback policy engine for the ambiguous part of the decision space.&lt;/p&gt;

&lt;p&gt;They are useful only when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scoring model confidence is low&lt;/li&gt;
&lt;li&gt;signals conflict&lt;/li&gt;
&lt;li&gt;unseen patterns appear&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture: layered decision system
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UI Layer
   ↓
Context Builder
   ↓
Policy Engine
   ├── Rule Layer (deterministic)
   ├── ML Scoring Layer (probabilistic)
   └── LLM Fallback Layer (uncertainty)
   ↓
Cache Layer
   ↓
Network
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Context model (input abstraction)
&lt;/h2&gt;

&lt;p&gt;All decisions must be based on structured signals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_feed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lastUpdatedMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"accessFrequency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"volatilityScore"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.82&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"userAction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"scroll"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stalenessToleranceMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
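
&lt;p&gt;One way to enforce that shape is a builder that copies only known fields and rejects malformed input. This is a minimal sketch: the field names follow the JSON example above, while &lt;code&gt;buildContext&lt;/code&gt;, the allowed frequency values, and the non-negative number checks are assumptions:&lt;/p&gt;

```typescript
// Allowed values are an assumption; the example above only shows "high".
const FREQUENCIES = ["low", "medium", "high"];

interface CacheContext {
  key: string;
  lastUpdatedMs: number;
  accessFrequency: string;
  volatilityScore: number;
  userAction: string;
  stalenessToleranceMs: number;
}

function isNonNegativeNumber(value: unknown): value is number {
  if (typeof value !== "number") return false;
  if (!Number.isFinite(value)) return false;
  return value >= 0;
}

// Copies only the known fields, so free-form text never reaches the policy engine.
// Returns null when any field is missing or has the wrong type.
function buildContext(raw: { [field: string]: unknown }): CacheContext | null {
  const key = raw["key"];
  const lastUpdatedMs = raw["lastUpdatedMs"];
  const accessFrequency = raw["accessFrequency"];
  const volatilityScore = raw["volatilityScore"];
  const userAction = raw["userAction"];
  const stalenessToleranceMs = raw["stalenessToleranceMs"];

  if (typeof key !== "string") return null;
  if (typeof userAction !== "string") return null;
  if (typeof accessFrequency !== "string") return null;
  if (FREQUENCIES.indexOf(accessFrequency) === -1) return null;
  if (!isNonNegativeNumber(lastUpdatedMs)) return null;
  if (!isNonNegativeNumber(volatilityScore)) return null;
  if (!isNonNegativeNumber(stalenessToleranceMs)) return null;

  return { key, lastUpdatedMs, accessFrequency, volatilityScore, userAction, stalenessToleranceMs };
}

const context = buildContext({
  key: "user_feed",
  lastUpdatedMs: 1200,
  accessFrequency: "high",
  volatilityScore: 0.82,
  userAction: "scroll",
  stalenessToleranceMs: 500,
  note: "free-form text that should be dropped",
});
console.log(context);
```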



&lt;p&gt;Important constraint:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no raw prompts&lt;/li&gt;
&lt;li&gt;only structured features&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  LLM role (strictly bounded)
&lt;/h2&gt;

&lt;p&gt;The LLM acts only as a classifier. It returns one strategy, a TTL, and a confidence value:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"strategy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HIT | REVALIDATE | BYPASS | SWR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ttlMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.78&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM is invoked only when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML confidence &amp;lt; threshold&lt;/li&gt;
&lt;li&gt;feature signals conflict&lt;/li&gt;
&lt;li&gt;the context pattern has not been seen before&lt;/li&gt;
&lt;/ul&gt;
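
&lt;p&gt;Keeping the LLM bounded also means never trusting its output blindly. The sketch below validates a raw model response against the schema above and falls back to a conservative decision when anything is off. The function name &lt;code&gt;parseLlmDecision&lt;/code&gt;, the TTL cap, and the choice of &lt;code&gt;REVALIDATE&lt;/code&gt; as the safe default are assumptions:&lt;/p&gt;

```typescript
type Strategy = "HIT" | "REVALIDATE" | "BYPASS" | "SWR";

interface Decision {
  strategy: Strategy;
  ttlMs: number;
  confidence: number;
}

const MAX_TTL_MS = 60000; // assumed upper bound on any TTL the model may request

// Assumed conservative fallback: prefer freshness when the response is unusable.
const SAFE_DEFAULT: Decision = { strategy: "REVALIDATE", ttlMs: 0, confidence: 0 };

function isStrategy(value: unknown): value is Strategy {
  return value === "HIT" || value === "REVALIDATE" || value === "BYPASS" || value === "SWR";
}

// Parses the model's text output and accepts it only if every field is valid.
function parseLlmDecision(rawText: string): Decision {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawText);
  } catch {
    return SAFE_DEFAULT; // not JSON at all
  }
  if (typeof parsed !== "object" || parsed === null) return SAFE_DEFAULT;

  const fields = parsed as { [field: string]: unknown };
  const strategy = fields["strategy"];
  const ttlMs = fields["ttlMs"];
  const confidence = fields["confidence"];

  if (!isStrategy(strategy)) return SAFE_DEFAULT;
  if (typeof ttlMs !== "number" || !Number.isFinite(ttlMs)) return SAFE_DEFAULT;
  if (typeof confidence !== "number" || !Number.isFinite(confidence)) return SAFE_DEFAULT;
  if (confidence > 1 || 0 > confidence) return SAFE_DEFAULT;

  // Clamp the TTL instead of rejecting, since the strategy itself is still usable.
  return { strategy, ttlMs: Math.min(MAX_TTL_MS, Math.max(0, ttlMs)), confidence };
}

console.log(parseLlmDecision('{"strategy":"SWR","ttlMs":1200,"confidence":0.78}'));
console.log(parseLlmDecision("Sure! I would recommend caching this."));
```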

&lt;h2&gt;
  
  
  Meta-cache: caching the decision layer
&lt;/h2&gt;

&lt;p&gt;To reduce cost, cache the decision itself, keyed by a hash of the context:&lt;/p&gt;

&lt;p&gt;decisionCache(contextHash) → strategy&lt;/p&gt;

&lt;p&gt;Effects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;avoids repeated LLM calls&lt;/li&gt;
&lt;li&gt;stabilizes latency&lt;/li&gt;
&lt;li&gt;amortizes inference cost&lt;/li&gt;
&lt;/ul&gt;
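
&lt;p&gt;A minimal sketch of such a decision cache follows. It assumes that continuous features are bucketed before hashing, so near-identical contexts share one entry, and that cached decisions expire. The bucketing granularity, the decision lifetime, and all names here are illustrative assumptions:&lt;/p&gt;

```typescript
interface ContextFeatures {
  key: string;
  accessFrequency: string;
  volatilityScore: number;
  userAction: string;
}

interface CachedDecision {
  strategy: string;
  expiresAtMs: number;
}

// Narrow view of the store; a plain Map satisfies it.
interface DecisionStore {
  get(hash: string): CachedDecision | undefined;
  set(hash: string, entry: CachedDecision): unknown;
}

const DECISION_TTL_MS = 30000; // assumed lifetime of a cached decision
const store: DecisionStore = new Map();

// Buckets volatility to one decimal place so 0.82 and 0.84 map to the same entry.
function contextHash(ctx: ContextFeatures): string {
  const volatilityBucket = Math.round(ctx.volatilityScore * 10);
  return [ctx.key, ctx.accessFrequency, volatilityBucket, ctx.userAction].join("|");
}

// Returns a cached strategy when one is still valid, otherwise calls the slow path once.
function decide(
  ctx: ContextFeatures,
  nowMs: number,
  slowPath: (ctx: ContextFeatures) => string
): string {
  const hash = contextHash(ctx);
  const cached = store.get(hash);
  if (cached !== undefined) {
    if (cached.expiresAtMs > nowMs) return cached.strategy;
  }
  const strategy = slowPath(ctx);
  store.set(hash, { strategy, expiresAtMs: nowMs + DECISION_TTL_MS });
  return strategy;
}

// Stand-in for the expensive LLM call; counts how often it runs.
let slowPathCalls = 0;
const fakeLlm = (_ctx: ContextFeatures): string => {
  slowPathCalls += 1;
  return "SWR";
};

const first = decide({ key: "user_feed", accessFrequency: "high", volatilityScore: 0.82, userAction: "scroll" }, 0, fakeLlm);
const second = decide({ key: "user_feed", accessFrequency: "high", volatilityScore: 0.84, userAction: "scroll" }, 1000, fakeLlm);
console.log(first, second, slowPathCalls);
```

&lt;p&gt;The expiry is the part that is easy to forget: a decision cached forever turns the policy back into a static TTL.&lt;/p&gt;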

&lt;h2&gt;
  
  
  Cost-aware execution pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IF rule matches:
    use rule engine
ELSE IF ML confidence &amp;gt; threshold:
    use ML model
ELSE:
    use LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
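
&lt;p&gt;The same cascade as a runnable sketch. The three layers are stand-ins: the pull-to-refresh rule, the toy confidence formula, and the stubbed LLM result are assumptions chosen only to make the routing observable:&lt;/p&gt;

```typescript
type Strategy = "HIT" | "REVALIDATE" | "BYPASS" | "SWR";
type Source = "rule" | "ml" | "llm";

interface CacheContext {
  userAction: string;
  volatilityScore: number;
}

interface RoutedDecision {
  strategy: Strategy;
  source: Source;
}

// Deterministic layer: returns null when no rule applies.
// Hypothetical rule: an explicit user refresh always goes to the network.
function ruleLayer(ctx: CacheContext): Strategy | null {
  if (ctx.userAction === "pull_to_refresh") return "BYPASS";
  return null;
}

// Stand-in for a trained model: confidence grows with distance from the 0.5 boundary.
function mlLayer(ctx: CacheContext): { strategy: Strategy; confidence: number } {
  const strategy: Strategy = ctx.volatilityScore > 0.5 ? "REVALIDATE" : "HIT";
  const confidence = Math.abs(ctx.volatilityScore - 0.5) * 2;
  return { strategy, confidence };
}

// Stub for the LLM fallback; a real call would be asynchronous and validated.
function llmLayer(_ctx: CacheContext): Strategy {
  return "SWR";
}

// Cheapest layer first; each later layer runs only if the previous one declines.
function route(ctx: CacheContext, threshold: number): RoutedDecision {
  const ruled = ruleLayer(ctx);
  if (ruled !== null) return { strategy: ruled, source: "rule" };

  const scored = mlLayer(ctx);
  if (scored.confidence > threshold) return { strategy: scored.strategy, source: "ml" };

  return { strategy: llmLayer(ctx), source: "llm" };
}

console.log(route({ userAction: "pull_to_refresh", volatilityScore: 0.9 }, 0.6));
console.log(route({ userAction: "scroll", volatilityScore: 0.9 }, 0.6));
console.log(route({ userAction: "scroll", volatilityScore: 0.55 }, 0.6));
```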



&lt;p&gt;A typical split of decisions across the layers in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;80–90% rules&lt;/li&gt;
&lt;li&gt;10–20% ML&lt;/li&gt;
&lt;li&gt;&amp;lt;10% LLM&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Failure modes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Overuse of LLM
&lt;/h3&gt;

&lt;p&gt;Problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cost spikes&lt;/li&gt;
&lt;li&gt;unpredictable latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;strict confidence gating&lt;/li&gt;
&lt;li&gt;bounded invocation layer&lt;/li&gt;
&lt;/ul&gt;
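
&lt;p&gt;A bounded invocation layer can be as simple as a call budget per time window. The sketch below uses a fixed window; the class name &lt;code&gt;LlmBudget&lt;/code&gt; and the limits are assumptions. When the budget is exhausted, the caller is expected to fall back to a deterministic default instead of calling the LLM:&lt;/p&gt;

```typescript
// Fixed-window budget: at most maxCalls LLM invocations per windowMs.
class LlmBudget {
  private used = 0;
  private windowStartMs = 0;
  private readonly maxCalls: number;
  private readonly windowMs: number;

  constructor(maxCalls: number, windowMs: number) {
    this.maxCalls = maxCalls;
    this.windowMs = windowMs;
  }

  // Returns true and consumes one slot if a call is allowed at time nowMs.
  tryAcquire(nowMs: number): boolean {
    if (nowMs - this.windowStartMs >= this.windowMs) {
      // A new window starts: reset the counter.
      this.windowStartMs = nowMs;
      this.used = 0;
    }
    if (this.used >= this.maxCalls) return false;
    this.used += 1;
    return true;
  }
}

// Two calls allowed per second (illustrative numbers).
const budget = new LlmBudget(2, 1000);
const attempts = [0, 10, 20, 1000].map((nowMs) => budget.tryAcquire(nowMs));
console.log(attempts);
```

&lt;p&gt;Passing the clock in as a parameter keeps the limiter deterministic in tests; production code would pass &lt;code&gt;Date.now()&lt;/code&gt;.&lt;/p&gt;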

&lt;h3&gt;
  
  
  2. Latency variance
&lt;/h3&gt;

&lt;p&gt;Problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inconsistent response times in the UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;decision caching&lt;/li&gt;
&lt;li&gt;async precomputation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Model drift
&lt;/h3&gt;

&lt;p&gt;Problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML decisions degrade over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;feedback loop&lt;/li&gt;
&lt;li&gt;periodic recalibration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Engineering takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;caching is a decision system, not a storage optimization&lt;/li&gt;
&lt;li&gt;SWR plus heuristics solve the majority of cases&lt;/li&gt;
&lt;li&gt;lightweight ML works best when the feature space is stable&lt;/li&gt;
&lt;li&gt;LLMs are only for ambiguous cases&lt;/li&gt;
&lt;li&gt;production systems require a strict routing hierarchy&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Client-side caching becomes effective only when modeled as a layered decision system.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rules handle deterministic cases&lt;/li&gt;
&lt;li&gt;ML handles structured uncertainty&lt;/li&gt;
&lt;li&gt;LLM handles ambiguity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The correct design is hybrid rather than LLM-centric, with strict boundaries and cost control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;Where should the boundary be defined between ML confidence and LLM fallback in production caching systems?&lt;/p&gt;

</description>
      <category>frontend</category>
      <category>systemdesign</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
