<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dhruvi</title>
    <description>The latest articles on DEV Community by Dhruvi (@dhruvi_21).</description>
    <link>https://dev.to/dhruvi_21</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3894569%2Fe31cc617-f38a-4448-a25e-dfb161e3364d.png</url>
      <title>DEV Community: Dhruvi</title>
      <link>https://dev.to/dhruvi_21</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dhruvi_21"/>
    <language>en</language>
    <item>
      <title>How We Debug Issues That Only Happen Once Every Few Days</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Fri, 15 May 2026 12:45:23 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/how-we-debug-issues-that-only-happen-once-every-few-days-22kd</link>
      <guid>https://dev.to/dhruvi_21/how-we-debug-issues-that-only-happen-once-every-few-days-22kd</guid>
      <description>&lt;p&gt;The hardest bugs are not the ones that happen constantly.&lt;/p&gt;

&lt;p&gt;The hardest ones are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;once every few days&lt;/li&gt;
&lt;li&gt;under unknown conditions&lt;/li&gt;
&lt;li&gt;with no obvious pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Especially in systems that run continuously.&lt;/p&gt;

&lt;p&gt;Because by the time you notice the issue, the original state is already gone.&lt;/p&gt;

&lt;p&gt;Early on, I used to approach these bugs the wrong way.&lt;/p&gt;

&lt;p&gt;I would immediately start reading logs and trying to reproduce the issue locally.&lt;/p&gt;

&lt;p&gt;Most of the time, that went nowhere.&lt;/p&gt;

&lt;p&gt;Because these problems usually depend on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timing&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;load&lt;/li&gt;
&lt;li&gt;specific data states&lt;/li&gt;
&lt;li&gt;interactions between systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conditions that almost never exist in your local environment in the same way.&lt;/p&gt;

&lt;p&gt;What changed for me was realizing:&lt;/p&gt;

&lt;p&gt;The goal is not “find the bug immediately.”&lt;/p&gt;

&lt;p&gt;The goal is:&lt;br&gt;
make the system observable enough that the bug exposes itself next time.&lt;/p&gt;

&lt;p&gt;So instead of guessing, we start adding visibility around the problem.&lt;/p&gt;

&lt;p&gt;Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tracking state transitions&lt;/li&gt;
&lt;li&gt;storing retry history&lt;/li&gt;
&lt;li&gt;recording execution timing&lt;/li&gt;
&lt;li&gt;correlating events across systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not permanent debugging noise.&lt;/p&gt;

&lt;p&gt;Just enough context to reconstruct what actually happened later.&lt;/p&gt;
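
&lt;p&gt;Roughly the kind of thing I mean, as a minimal sketch (the field names and the record_transition helper are placeholders, not our actual code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json, time, uuid

def record_transition(store, entity_id, from_state, to_state, correlation_id, extra=None):
    """Append one state transition so the history can be replayed later."""
    event = {
        "entity_id": entity_id,
        "correlation_id": correlation_id,   # same id across every system involved
        "from_state": from_state,
        "to_state": to_state,
        "recorded_at": time.time(),
        "extra": extra or {},               # retry count, queue delay, upstream status, etc.
    }
    store.append(json.dumps(event))

# usage: one append-only history per entity is enough to reconstruct what happened
history = []
cid = str(uuid.uuid4())
record_transition(history, "order-42", "pending", "processing", cid)
record_transition(history, "order-42", "processing", "failed", cid, {"retry": 1})
&lt;/code&gt;&lt;/pre&gt;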

&lt;p&gt;Another thing I learned:&lt;/p&gt;

&lt;p&gt;Rare bugs are often not random.&lt;/p&gt;

&lt;p&gt;They usually happen when multiple small conditions align:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a delayed queue&lt;/li&gt;
&lt;li&gt;a retry arriving late&lt;/li&gt;
&lt;li&gt;stale data&lt;/li&gt;
&lt;li&gt;another service slowing down&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Individually, nothing breaks.&lt;/p&gt;

&lt;p&gt;Together, something weird appears for 30 seconds and disappears again.&lt;/p&gt;

&lt;p&gt;One mistake I made a lot before:&lt;/p&gt;

&lt;p&gt;Trying to “fix” the issue too early.&lt;/p&gt;

&lt;p&gt;When you don’t fully understand intermittent bugs, quick fixes usually just hide the symptom temporarily.&lt;/p&gt;

&lt;p&gt;So now I spend more time understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what sequence created the issue&lt;/li&gt;
&lt;li&gt;what state the system was in&lt;/li&gt;
&lt;li&gt;why recovery didn’t happen automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only then do we change the flow.&lt;/p&gt;

&lt;p&gt;The interesting part is that debugging these issues slowly changes how you design systems.&lt;/p&gt;

&lt;p&gt;You stop building only for normal operation.&lt;/p&gt;

&lt;p&gt;You start building for investigation too.&lt;/p&gt;

&lt;p&gt;Because eventually, every long-running system develops behaviors you didn’t predict.&lt;/p&gt;

&lt;p&gt;At BrainPack, a lot of debugging work involves understanding interactions between systems that only fail under very specific timing conditions. The more AI workflows and automations are layered on top, the more important observability and recoverability become.&lt;/p&gt;

</description>
      <category>backend</category>
    </item>
    <item>
      <title>A Tool That Saves Me Time Every Single Week</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Mon, 11 May 2026 12:50:54 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/a-tool-that-saves-me-time-every-single-week-219j</link>
      <guid>https://dev.to/dhruvi_21/a-tool-that-saves-me-time-every-single-week-219j</guid>
      <description>&lt;p&gt;One thing that saves me an absurd amount of time is building small internal debugging endpoints.&lt;/p&gt;

&lt;p&gt;Not dashboards.&lt;br&gt;
Not full admin panels.&lt;/p&gt;

&lt;p&gt;Just tiny routes or tools that answer very specific questions fast.&lt;/p&gt;

&lt;p&gt;Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“show the last sync status for this customer”&lt;/li&gt;
&lt;li&gt;“replay this failed webhook”&lt;/li&gt;
&lt;li&gt;“show all retries for this workflow”&lt;/li&gt;
&lt;li&gt;“compare the data between these two systems”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Early on, I used to debug everything directly from logs and databases.&lt;/p&gt;

&lt;p&gt;It worked when systems were smaller.&lt;/p&gt;

&lt;p&gt;But once multiple services, queues, integrations, and retries are involved, simple issues start taking way too long to trace manually.&lt;/p&gt;

&lt;p&gt;So now, whenever I notice:&lt;br&gt;
“I keep checking this manually”&lt;/p&gt;

&lt;p&gt;I usually turn it into a small internal tool.&lt;/p&gt;
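
&lt;p&gt;For example, a “show the last sync status for this customer” route can be this small. A sketch using Flask, with the SYNC_RUNS store and route path made up for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from flask import Flask, jsonify, request

app = Flask(__name__)

# stand-in for whatever the real storage is
SYNC_RUNS = {
    "customer-17": {"status": "failed", "last_run_at": "2026-05-10T08:12:00Z", "error": "timeout"},
}

@app.get("/internal/sync-status")
def sync_status():
    """Answer one question fast: what happened on the last sync for this customer?"""
    customer_id = request.args.get("customer_id", "")
    run = SYNC_RUNS.get(customer_id)
    if run is None:
        return jsonify({"customer_id": customer_id, "status": "never_synced"}), 404
    return jsonify({"customer_id": customer_id, **run})
&lt;/code&gt;&lt;/pre&gt;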

&lt;p&gt;The interesting part is that these tools are rarely complicated.&lt;/p&gt;

&lt;p&gt;Sometimes it’s:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one endpoint&lt;/li&gt;
&lt;li&gt;one query&lt;/li&gt;
&lt;li&gt;one button&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But removing 20 minutes of repeated investigation every day adds up fast.&lt;/p&gt;

&lt;p&gt;Especially in systems that run continuously.&lt;/p&gt;

&lt;p&gt;Another thing I realized:&lt;/p&gt;

&lt;p&gt;The best internal tools are usually built by the people operating the system directly.&lt;/p&gt;

&lt;p&gt;Because they come from real friction.&lt;/p&gt;

&lt;p&gt;Not assumptions about what might be useful.&lt;/p&gt;

&lt;p&gt;A lot of engineering time is lost not on fixing problems, but on figuring out where the problem actually is.&lt;/p&gt;

&lt;p&gt;Anything that shortens that feedback loop becomes valuable very quickly.&lt;/p&gt;

&lt;p&gt;At BrainPack, a lot of our internal tooling comes directly from operating live enterprise systems continuously. Once AI workflows and multiple integrations are involved, reducing investigation time becomes just as important as reducing failure rates.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>productivity</category>
      <category>devtools</category>
    </item>
    <item>
      <title>The Hidden Cost of “Quick Fixes” in Enterprise Systems</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Thu, 07 May 2026 13:27:45 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/the-hidden-cost-of-quick-fixes-in-enterprise-systems-258j</link>
      <guid>https://dev.to/dhruvi_21/the-hidden-cost-of-quick-fixes-in-enterprise-systems-258j</guid>
      <description>&lt;p&gt;Most enterprise systems don’t become messy all at once.&lt;/p&gt;

&lt;p&gt;They become messy one quick fix at a time.&lt;/p&gt;

&lt;p&gt;A temporary script.&lt;br&gt;
A manual spreadsheet.&lt;br&gt;
A copied database table.&lt;br&gt;
A workflow someone added “just for now.”&lt;/p&gt;

&lt;p&gt;Individually, none of these seem dangerous.&lt;/p&gt;

&lt;p&gt;But after a few years, the organization is running on layers of patches nobody fully understands anymore.&lt;/p&gt;

&lt;p&gt;The problem with quick fixes is that they solve the immediate issue while quietly increasing system complexity.&lt;/p&gt;

&lt;p&gt;And complexity compounds.&lt;/p&gt;

&lt;p&gt;What starts as one workaround becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple duplicate processes&lt;/li&gt;
&lt;li&gt;inconsistent data&lt;/li&gt;
&lt;li&gt;hidden dependencies&lt;/li&gt;
&lt;li&gt;workflows that only one person understands&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At some point, nobody trusts the system anymore.&lt;/p&gt;

&lt;p&gt;So teams create even more manual processes to compensate.&lt;/p&gt;

&lt;p&gt;That’s usually when things start slowing down operationally.&lt;/p&gt;

&lt;p&gt;One thing I noticed working on these systems:&lt;/p&gt;

&lt;p&gt;The biggest cost is rarely technical debt itself.&lt;/p&gt;

&lt;p&gt;It’s operational uncertainty.&lt;/p&gt;

&lt;p&gt;People stop knowing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which system is correct&lt;/li&gt;
&lt;li&gt;what process is actually being used&lt;/li&gt;
&lt;li&gt;whether automations can be trusted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And once trust disappears, everything becomes slower because humans start double-checking everything manually.&lt;/p&gt;

&lt;p&gt;The tricky part is that most quick fixes are not bad decisions at the time.&lt;/p&gt;

&lt;p&gt;The business needed something fast.&lt;br&gt;
The team solved the problem.&lt;br&gt;
Everyone moved on.&lt;/p&gt;

&lt;p&gt;But systems that run continuously remember every shortcut forever.&lt;/p&gt;

&lt;p&gt;What changed how I approach this:&lt;/p&gt;

&lt;p&gt;I stopped asking:&lt;br&gt;
“does this solve the problem?”&lt;/p&gt;

&lt;p&gt;Now the question is:&lt;br&gt;
“what does this make harder six months from now?”&lt;/p&gt;

&lt;p&gt;Because in long-running systems, future complexity is usually more expensive than the original issue.&lt;/p&gt;

&lt;p&gt;A lot of the work we do at BrainPack starts with untangling years of accumulated workarounds across existing systems. AI only becomes useful once the underlying operations are predictable enough to trust again.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>discuss</category>
      <category>softwareengineering</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Why Logging Is Not Enough When You Operate Systems Continuously</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Mon, 04 May 2026 17:08:59 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/why-logging-is-not-enough-when-you-operate-systems-continuously-3k0o</link>
      <guid>https://dev.to/dhruvi_21/why-logging-is-not-enough-when-you-operate-systems-continuously-3k0o</guid>
      <description>&lt;p&gt;At some point, logs stop helping.&lt;/p&gt;

&lt;p&gt;Not because logging is bad.&lt;br&gt;
Because the system is doing too much.&lt;/p&gt;

&lt;p&gt;When you’re running something continuously, across multiple systems, logs turn into noise fast.&lt;/p&gt;

&lt;p&gt;You still log everything.&lt;br&gt;
You just can’t rely on it to understand what’s actually happening.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The expectation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Early on, logging feels like the answer.&lt;/p&gt;

&lt;p&gt;Something breaks → check logs → find the issue → fix it&lt;/p&gt;

&lt;p&gt;Clean. Linear. Works in small systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What actually happens&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In production, it looks like this:&lt;/p&gt;

&lt;p&gt;thousands of log lines per minute&lt;br&gt;
multiple services writing at the same time&lt;br&gt;
retries creating duplicate entries&lt;br&gt;
partial failures that don’t throw clear errors&lt;/p&gt;

&lt;p&gt;You open logs and see everything.&lt;/p&gt;

&lt;p&gt;Which means you see nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The real problem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Logs tell you what happened.&lt;/p&gt;

&lt;p&gt;They don’t tell you:&lt;/p&gt;

&lt;p&gt;what state the system is in&lt;br&gt;
what is currently broken&lt;br&gt;
what needs attention right now&lt;/p&gt;

&lt;p&gt;And when things run continuously, that’s what you actually need.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What we started doing instead&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We still log. But we stopped treating logs as the source of truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Track state, not just events&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Instead of just writing logs like:&lt;/p&gt;

&lt;p&gt;“order created”&lt;br&gt;
“order failed”&lt;/p&gt;

&lt;p&gt;We track:&lt;/p&gt;

&lt;p&gt;current status of the order&lt;br&gt;
where it is in the flow&lt;br&gt;
what’s pending&lt;/p&gt;

&lt;p&gt;So at any moment, we can answer:&lt;/p&gt;

&lt;p&gt;what’s stuck right now&lt;/p&gt;
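
&lt;p&gt;A minimal sketch of what that can look like (the statuses and fields here are invented for the example):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from dataclasses import dataclass, field
import time

@dataclass
class OrderState:
    """One record per order: current status, not just a trail of log lines."""
    order_id: str
    status: str = "created"              # created, syncing, synced, failed
    step: str = "received"               # where it is in the flow
    pending: list = field(default_factory=list)
    updated_at: float = field(default_factory=time.time)

def advance(state, new_status, new_step):
    """Update the current state so answering what is stuck right now is a single lookup."""
    state.status = new_status
    state.step = new_step
    state.updated_at = time.time()
    return state
&lt;/code&gt;&lt;/pre&gt;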

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Surface problems, don’t search for them&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Logs require you to go looking.&lt;/p&gt;

&lt;p&gt;In real systems, you don’t have time for that.&lt;/p&gt;

&lt;p&gt;So we build:&lt;/p&gt;

&lt;p&gt;alerts when something is off&lt;br&gt;
dashboards that show broken flows&lt;br&gt;
queues that show backlog&lt;/p&gt;

&lt;p&gt;The system tells you where to look.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Group by flow, not by line&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Logs are isolated lines.&lt;/p&gt;

&lt;p&gt;But real issues happen across a sequence.&lt;/p&gt;

&lt;p&gt;So we group things by:&lt;/p&gt;

&lt;p&gt;request&lt;br&gt;
entity&lt;br&gt;
workflow&lt;/p&gt;

&lt;p&gt;Instead of reading 100 lines, you follow one story.&lt;/p&gt;

&lt;p&gt;That’s where things start making sense again.&lt;/p&gt;
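
&lt;p&gt;In practice this can be as simple as tagging every record with one id and grouping on it afterwards (the flow_id field is just an example name):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import defaultdict

def group_by_flow(log_records):
    """Turn isolated log lines into one story per workflow."""
    flows = defaultdict(list)
    for record in log_records:
        flows[record.get("flow_id", "unknown")].append(record)
    return flows

records = [
    {"flow_id": "wf-81", "msg": "webhook received"},
    {"flow_id": "wf-93", "msg": "sync started"},
    {"flow_id": "wf-81", "msg": "retry scheduled"},
]
for flow_id, story in group_by_flow(records).items():
    print(flow_id, [entry["msg"] for entry in story])
&lt;/code&gt;&lt;/pre&gt;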

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Accept that some issues won’t be obvious&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Some problems don’t throw errors.&lt;/p&gt;

&lt;p&gt;They just… stop moving.&lt;/p&gt;

&lt;p&gt;A process gets stuck.&lt;br&gt;
A sync silently fails.&lt;/p&gt;

&lt;p&gt;Logs might show nothing critical.&lt;/p&gt;

&lt;p&gt;So you need signals like:&lt;/p&gt;

&lt;p&gt;time thresholds&lt;br&gt;
missing updates&lt;br&gt;
“this should have finished by now”&lt;/p&gt;
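
&lt;p&gt;One way to turn “this should have finished by now” into a signal, as a sketch (the threshold and field names are assumptions):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time

STUCK_AFTER_SECONDS = 15 * 60  # assumption: this flow normally finishes well under 15 minutes

def find_stuck(states, now=None):
    """Flag anything still in progress that has not been updated recently."""
    now = now or time.time()
    stuck = []
    for s in states:
        idle = now - s["updated_at"]
        if s["status"] == "in_progress" and idle &gt; STUCK_AFTER_SECONDS:
            stuck.append({"id": s["id"], "idle_seconds": int(idle)})
    return stuck
&lt;/code&gt;&lt;/pre&gt;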

&lt;h2&gt;
  
  
  &lt;strong&gt;What changed for me&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I used to think:&lt;/p&gt;

&lt;p&gt;if it’s logged, we can debug it&lt;/p&gt;

&lt;p&gt;Now I think:&lt;/p&gt;

&lt;p&gt;if we need logs to notice something is broken, we’re already late&lt;/p&gt;

&lt;p&gt;Logs are for digging deeper.&lt;/p&gt;

&lt;p&gt;Not for discovering the problem.&lt;/p&gt;

&lt;p&gt;In systems that run all the time, you don’t watch everything manually.&lt;/p&gt;

&lt;p&gt;The system needs to show you where it’s struggling.&lt;/p&gt;

&lt;p&gt;Otherwise, you’re just scrolling and hoping you notice the right line.&lt;/p&gt;

&lt;p&gt;This is something we run into a lot at BrainPack, where multiple systems are always moving and interacting. AI workflows depend on knowing the current state of everything, not just what happened, so observability has to go beyond logs.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How We Design Systems That Keep Working Even When One Part Fails</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Thu, 30 Apr 2026 13:14:46 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/how-we-design-systems-that-keep-working-even-when-one-part-fails-3nmg</link>
      <guid>https://dev.to/dhruvi_21/how-we-design-systems-that-keep-working-even-when-one-part-fails-3nmg</guid>
      <description>&lt;p&gt;In real systems, something is always failing.&lt;/p&gt;

&lt;p&gt;An API times out.&lt;br&gt;
A database slows down.&lt;br&gt;
A third-party service returns garbage.&lt;/p&gt;

&lt;p&gt;If your system depends on everything working perfectly, it won’t last long in production.&lt;/p&gt;

&lt;p&gt;So the goal is not preventing failure.&lt;/p&gt;

&lt;p&gt;It’s designing so failure doesn’t break everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The wrong assumption&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A lot of systems are built like this:&lt;/p&gt;

&lt;p&gt;Step 1 → Step 2 → Step 3 → Done&lt;/p&gt;

&lt;p&gt;If Step 2 fails, the whole flow stops.&lt;/p&gt;

&lt;p&gt;In controlled environments, this works.&lt;/p&gt;

&lt;p&gt;In production, it creates fragile systems that break on the first issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What we do instead&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We design flows that can survive failure and continue.&lt;/p&gt;

&lt;p&gt;Not perfectly. But safely.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Break the dependency chain&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Instead of one long synchronous flow, we split things into independent steps.&lt;/p&gt;

&lt;p&gt;Each step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;does one thing&lt;/li&gt;
&lt;li&gt;stores its state&lt;/li&gt;
&lt;li&gt;can be retried&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if something fails, you don’t lose everything.&lt;/p&gt;

&lt;p&gt;You just retry that part.&lt;/p&gt;
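
&lt;p&gt;A rough sketch of the idea (the step names and the in-memory STATE store are stand-ins for a real persistence layer):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;STATE = {}  # flow_id -&gt; set of completed step names; in production this lives in a database

def run_step(flow_id, name, func):
    """Run one step only if it has not already completed, and record success."""
    done = STATE.setdefault(flow_id, set())
    if name in done:
        return  # already completed on a previous attempt, skip
    func()
    done.add(name)

def process_order(flow_id):
    run_step(flow_id, "validate", lambda: print("validated"))
    run_step(flow_id, "reserve_stock", lambda: print("stock reserved"))
    run_step(flow_id, "notify_erp", lambda: print("ERP notified"))

# if notify_erp fails, calling this again retries only that part
process_order("order-42")
&lt;/code&gt;&lt;/pre&gt;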

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Accept partial success&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This one is uncomfortable at first.&lt;/p&gt;

&lt;p&gt;Sometimes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;part of the system succeeds&lt;/li&gt;
&lt;li&gt;another part fails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of rolling everything back, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keep what succeeded&lt;/li&gt;
&lt;li&gt;fix what failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because in distributed systems, “all or nothing” is rarely realistic.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Make retries safe&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Failures lead to retries.&lt;/p&gt;

&lt;p&gt;Retries lead to duplication if you’re not careful.&lt;/p&gt;

&lt;p&gt;So every step needs to be safe to run again:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no duplicate records&lt;/li&gt;
&lt;li&gt;no repeated side effects&lt;/li&gt;
&lt;li&gt;no broken state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If retries are safe, failure becomes manageable.&lt;/p&gt;
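
&lt;p&gt;One common way to get there is to key every operation and refuse to repeat one that already ran. A sketch; in production the PROCESSED set would be a database table with a unique constraint:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PROCESSED = set()  # stand-in for durable storage with a unique constraint

def handle_once(idempotency_key, action):
    """Run the action only the first time this key is seen."""
    if idempotency_key in PROCESSED:
        return "skipped_duplicate"
    result = action()
    PROCESSED.add(idempotency_key)
    return result

# a retried webhook carries the same key, so the side effect fires once
handle_once("webhook-7f3a", lambda: print("order created"))
handle_once("webhook-7f3a", lambda: print("order created"))  # skipped
&lt;/code&gt;&lt;/pre&gt;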

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Isolate external dependencies&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Anything outside your control will fail eventually.&lt;/p&gt;

&lt;p&gt;So we isolate them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queues between systems&lt;/li&gt;
&lt;li&gt;timeouts and fallbacks&lt;/li&gt;
&lt;li&gt;delayed execution when needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is simple:&lt;/p&gt;

&lt;p&gt;If one system goes down, everything else should keep moving.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Design for recovery, not perfection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;p&gt;how do we make this never fail&lt;/p&gt;

&lt;p&gt;We ask:&lt;/p&gt;

&lt;p&gt;how does this recover when it fails&lt;/p&gt;

&lt;p&gt;That changes everything.&lt;/p&gt;

&lt;p&gt;You stop chasing edge cases and start building systems that handle them naturally.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What changed for me&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I stopped treating failure as an exception.&lt;/p&gt;

&lt;p&gt;Now it’s part of the normal flow.&lt;/p&gt;

&lt;p&gt;Every system I build assumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;something will fail&lt;/li&gt;
&lt;li&gt;it will fail at the wrong time&lt;/li&gt;
&lt;li&gt;and it will fail more than once&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the system needs to absorb that without collapsing.&lt;/p&gt;

&lt;p&gt;In systems that run continuously, reliability doesn’t come from everything working.&lt;/p&gt;

&lt;p&gt;It comes from everything being able to keep going when something doesn’t.&lt;/p&gt;

&lt;p&gt;This is something we deal with constantly at BrainPack, designing systems that keep operating even when parts of the infrastructure fail. AI workflows only work if the underlying systems can recover and continue without breaking the overall flow.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>backend</category>
      <category>systemdesign</category>
      <category>sre</category>
    </item>
    <item>
      <title>What Actually Breaks When You Connect AI to Real Enterprise Data</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Mon, 27 Apr 2026 13:15:43 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/what-actually-breaks-when-you-connect-ai-to-real-enterprise-data-55ba</link>
      <guid>https://dev.to/dhruvi_21/what-actually-breaks-when-you-connect-ai-to-real-enterprise-data-55ba</guid>
      <description>&lt;p&gt;Connecting AI to real enterprise data sounds straightforward.&lt;/p&gt;

&lt;p&gt;Give it access to your systems.&lt;br&gt;
Let it read data.&lt;br&gt;
Let it take actions.&lt;/p&gt;

&lt;p&gt;In reality, this is where things start breaking.&lt;/p&gt;

&lt;p&gt;Not because the AI is wrong.&lt;br&gt;
Because the data and systems underneath are not stable enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The assumption that fails&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Most people assume:&lt;/p&gt;

&lt;p&gt;if the data exists, AI can use it&lt;/p&gt;

&lt;p&gt;In real systems, data exists in inconsistent states.&lt;/p&gt;

&lt;p&gt;Same entity&lt;br&gt;
different systems&lt;br&gt;
different values&lt;/p&gt;

&lt;p&gt;An order might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;completed in one system&lt;/li&gt;
&lt;li&gt;pending in another&lt;/li&gt;
&lt;li&gt;duplicated somewhere else&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI doesn’t know which one is “correct”. It just sees all of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Inconsistent data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Enterprise systems are rarely in sync.&lt;/p&gt;

&lt;p&gt;You have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ERPs&lt;/li&gt;
&lt;li&gt;CRMs&lt;/li&gt;
&lt;li&gt;spreadsheets&lt;/li&gt;
&lt;li&gt;custom tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one updates at different times. Some fail silently.&lt;/p&gt;

&lt;p&gt;So when AI queries across them, it gets conflicting answers.&lt;/p&gt;

&lt;p&gt;This leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wrong insights&lt;/li&gt;
&lt;li&gt;incorrect decisions&lt;/li&gt;
&lt;li&gt;broken automations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The issue isn’t AI accuracy.&lt;br&gt;
It’s data consistency.&lt;/p&gt;
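
&lt;p&gt;A tiny reconciliation check makes this concrete (the system names and statuses are invented for the example):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def find_conflicts(erp_orders, crm_orders):
    """Compare the same order id across two systems and report disagreements."""
    conflicts = []
    for order_id, erp_status in erp_orders.items():
        crm_status = crm_orders.get(order_id, "missing")
        if erp_status != crm_status:
            conflicts.append({"order_id": order_id, "erp": erp_status, "crm": crm_status})
    return conflicts

print(find_conflicts({"A-1": "completed"}, {"A-1": "pending"}))
# [{'order_id': 'A-1', 'erp': 'completed', 'crm': 'pending'}]
&lt;/code&gt;&lt;/pre&gt;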

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Missing context&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AI works on what it can see.&lt;/p&gt;

&lt;p&gt;But a lot of enterprise logic lives outside the data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;manual processes&lt;/li&gt;
&lt;li&gt;unwritten rules&lt;/li&gt;
&lt;li&gt;team-specific workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
A record looks valid in the system.&lt;br&gt;
But internally, everyone knows it shouldn’t be processed yet.&lt;/p&gt;

&lt;p&gt;AI has no way to infer that unless the logic is formalized.&lt;/p&gt;

&lt;p&gt;So it acts on incomplete understanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Unreliable actions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Reading data is one problem. Acting on it is another.&lt;/p&gt;

&lt;p&gt;When AI triggers actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;create orders&lt;/li&gt;
&lt;li&gt;update records&lt;/li&gt;
&lt;li&gt;send communications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It depends on underlying systems behaving predictably.&lt;/p&gt;

&lt;p&gt;But those systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retry&lt;/li&gt;
&lt;li&gt;timeout&lt;/li&gt;
&lt;li&gt;partially fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without safeguards, AI actions can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;execute twice&lt;/li&gt;
&lt;li&gt;fail halfway&lt;/li&gt;
&lt;li&gt;create inconsistent states&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Timing issues&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Enterprise systems are not real-time in a clean way.&lt;/p&gt;

&lt;p&gt;There are delays:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sync jobs&lt;/li&gt;
&lt;li&gt;queues&lt;/li&gt;
&lt;li&gt;batch updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read data before it’s updated&lt;/li&gt;
&lt;li&gt;act on stale information&lt;/li&gt;
&lt;li&gt;trigger workflows too early&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything looks correct individually.&lt;br&gt;
But the sequence is wrong.&lt;/p&gt;
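
&lt;p&gt;One small guard that helps: check how fresh the data actually is before letting an automated action run (the freshness window and field names are assumptions):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(minutes=10)  # assumption: syncs normally land within 10 minutes

def safe_to_act(record, now=None):
    """Refuse to act on data that may not have been synced yet."""
    now = now or datetime.now(timezone.utc)
    age = now - record["last_synced_at"]  # expects a timezone-aware datetime
    if age &gt; MAX_STALENESS:
        return False  # possibly stale: wait for the next sync instead of acting now
    return True
&lt;/code&gt;&lt;/pre&gt;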

&lt;h2&gt;
  
  
  &lt;strong&gt;What changed for me&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I stopped thinking of AI as the hard part.&lt;/p&gt;

&lt;p&gt;The hard part is making the environment predictable enough for AI to operate.&lt;/p&gt;

&lt;p&gt;You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;consistent data&lt;/li&gt;
&lt;li&gt;clear state&lt;/li&gt;
&lt;li&gt;reliable execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without that, AI just amplifies existing problems faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The shift&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AI doesn’t fix messy systems.&lt;/p&gt;

&lt;p&gt;It exposes them.&lt;/p&gt;

&lt;p&gt;If your data is inconsistent, AI will surface conflicting answers.&lt;br&gt;
If your workflows are fragile, AI will break them faster.&lt;/p&gt;

&lt;p&gt;This is the kind of problem we deal with constantly at BrainPack, turning fragmented and inconsistent systems into something AI can actually operate on. The AI layer only works once the underlying infrastructure becomes predictable enough to trust.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>backend</category>
      <category>systemdesign</category>
      <category>architecture</category>
    </item>
    <item>
      <title>The Code Pattern That Keeps Our Integrations Stable in Production</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Thu, 23 Apr 2026 16:31:30 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/the-code-pattern-that-keeps-our-integrations-stable-in-production-3ad4</link>
      <guid>https://dev.to/dhruvi_21/the-code-pattern-that-keeps-our-integrations-stable-in-production-3ad4</guid>
<description>&lt;p&gt;When you connect real systems (ERPs, APIs, AI workflows), things don’t behave cleanly.&lt;/p&gt;

&lt;p&gt;Requests retry.&lt;br&gt;
Webhooks get sent twice.&lt;br&gt;
Sometimes something succeeds, but you don’t get the response.&lt;/p&gt;

&lt;p&gt;And then you see it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;duplicate orders&lt;/li&gt;
&lt;li&gt;repeated emails&lt;/li&gt;
&lt;li&gt;workflows triggering twice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is normal in production.&lt;/p&gt;

&lt;p&gt;The pattern that keeps this under control is idempotency.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The rule&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Every action should be safe to run more than once.&lt;/p&gt;

&lt;p&gt;Same input → same result.&lt;/p&gt;

&lt;p&gt;If the same request hits your system twice, nothing should break and nothing extra should happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where things usually go wrong&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Partial execution&lt;/strong&gt;&lt;br&gt;
Something starts, then crashes halfway.&lt;br&gt;
A retry comes in and runs everything again.&lt;/p&gt;

&lt;p&gt;If you’re not careful, you create duplicates.&lt;/p&gt;

&lt;p&gt;So instead of “just create”, you always check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;does this already exist?&lt;/li&gt;
&lt;li&gt;should I update instead?&lt;/li&gt;
&lt;/ul&gt;
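
&lt;p&gt;A minimal version of that check (find_by_external_id, create, and update are stand-ins for whatever your data layer provides):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def upsert_order(repo, payload):
    """Create the order only if this external id has never been seen, otherwise update it."""
    external_id = payload["external_id"]          # stable id carried by the webhook or request
    existing = repo.find_by_external_id(external_id)
    if existing is None:
        return repo.create(payload)               # first delivery: create
    return repo.update(existing["id"], payload)   # retry or duplicate: update in place
&lt;/code&gt;&lt;/pre&gt;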

&lt;p&gt;&lt;strong&gt;2. Multi-step flows&lt;/strong&gt;&lt;br&gt;
Most integrations don’t stop at one system.&lt;/p&gt;

&lt;p&gt;You might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;create something in one system&lt;/li&gt;
&lt;li&gt;then send it to another&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If it fails in the middle, the retry should continue from where it stopped, not start from zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Side effects&lt;/strong&gt;&lt;br&gt;
This is where it gets visible.&lt;/p&gt;

&lt;p&gt;Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sending emails&lt;/li&gt;
&lt;li&gt;charging payments&lt;/li&gt;
&lt;li&gt;triggering automations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If these run twice, users notice immediately.&lt;/p&gt;

&lt;p&gt;So you need to control when they run and make sure they don’t fire again on retries.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What changed for me&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I stopped assuming things run once.&lt;/p&gt;

&lt;p&gt;Now I assume:&lt;/p&gt;

&lt;p&gt;everything can retry&lt;br&gt;
everything can duplicate&lt;br&gt;
things can fail halfway&lt;/p&gt;

&lt;p&gt;So the question is always:&lt;/p&gt;

&lt;p&gt;what happens if this runs again?&lt;/p&gt;

&lt;p&gt;In systems that run all the time, this isn’t an edge case.&lt;/p&gt;

&lt;p&gt;This is how the system behaves every day.&lt;/p&gt;

&lt;p&gt;And once you build with that in mind, a lot of production issues just stop showing up.&lt;/p&gt;

&lt;p&gt;This is the kind of problem we deal with constantly at BrainPack, making unpredictable systems stable enough to layer AI on top of them. If the underlying operations are not reliable under retries, nothing built above them can be trusted.&lt;/p&gt;

</description>
      <category>api</category>
      <category>architecture</category>
      <category>backend</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
