<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dhruvi</title>
    <description>The latest articles on DEV Community by Dhruvi (@dhruvi_21).</description>
    <link>https://dev.to/dhruvi_21</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3894569%2Fe31cc617-f38a-4448-a25e-dfb161e3364d.png</url>
      <title>DEV Community: Dhruvi</title>
      <link>https://dev.to/dhruvi_21</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dhruvi_21"/>
    <language>en</language>
    <item>
      <title>What Developers Underestimate About Long-Running Workflows</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Wed, 24 Jun 2026 12:52:47 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/what-developers-underestimate-about-long-running-workflows-4009</link>
      <guid>https://dev.to/dhruvi_21/what-developers-underestimate-about-long-running-workflows-4009</guid>
      <description>&lt;p&gt;Long-running workflows look simple when you first build them.&lt;/p&gt;

&lt;p&gt;Something happens.&lt;/p&gt;

&lt;p&gt;A few systems exchange data.&lt;/p&gt;

&lt;p&gt;Everything completes.&lt;/p&gt;

&lt;p&gt;Done.&lt;/p&gt;

&lt;p&gt;At least that's the expectation.&lt;/p&gt;

&lt;p&gt;Reality is very different.&lt;/p&gt;

&lt;p&gt;The biggest thing I underestimated was time.&lt;/p&gt;

&lt;p&gt;Not execution time.&lt;/p&gt;

&lt;p&gt;Elapsed time.&lt;/p&gt;

&lt;p&gt;Because once workflows start running for hours, days, or continuously, strange things start happening.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APIs become temporarily unavailable&lt;/li&gt;
&lt;li&gt;Data changes halfway through the process&lt;/li&gt;
&lt;li&gt;Retries arrive much later than expected&lt;/li&gt;
&lt;li&gt;Someone manually updates a record&lt;/li&gt;
&lt;li&gt;Another system processes things in a different order&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing is broken.&lt;/p&gt;

&lt;p&gt;But everything is slightly different from when the workflow started.&lt;/p&gt;

&lt;p&gt;Early on, I assumed workflows were transactions.&lt;/p&gt;

&lt;p&gt;Start.&lt;/p&gt;

&lt;p&gt;Execute.&lt;/p&gt;

&lt;p&gt;Finish.&lt;/p&gt;

&lt;p&gt;Now I think of them as conversations between systems.&lt;/p&gt;

&lt;p&gt;And conversations can get interrupted.&lt;/p&gt;

&lt;p&gt;Another thing I underestimated:&lt;/p&gt;

&lt;p&gt;State changes.&lt;/p&gt;

&lt;p&gt;You might start processing an order that is "pending".&lt;/p&gt;

&lt;p&gt;Ten minutes later, another system marks it as "cancelled".&lt;/p&gt;

&lt;p&gt;An hour later, a retry comes in from an earlier step.&lt;/p&gt;

&lt;p&gt;If your workflow only thinks about the data it started with, and not the state things are in right now, weird things happen.&lt;/p&gt;

&lt;p&gt;Because the world has changed while the process was still running.&lt;/p&gt;
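
&lt;p&gt;One way to guard against this is to re-read the current state right before a step acts, instead of trusting what the workflow saw when it started. A minimal sketch, assuming a hypothetical in-memory &lt;code&gt;ORDERS&lt;/code&gt; store and a hypothetical &lt;code&gt;ship_order&lt;/code&gt; step (neither comes from a real system):&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical order store; in a real system this would be a database read.
ORDERS = {"order-1": "pending"}


@dataclass
class StepResult:
    applied: bool
    reason: str


def ship_order(order_id: str, expected_status: str = "pending") -> StepResult:
    """Apply the step only if the order is still in the state we expect."""
    current = ORDERS.get(order_id)  # re-read state at execution time
    if current != expected_status:
        # The world changed while the workflow was running: skip, don't guess.
        return StepResult(False, f"order is {current}, expected {expected_status}")
    ORDERS[order_id] = "shipped"
    return StepResult(True, "shipped")


first = ship_order("order-1")      # still pending, so the step applies
ORDERS["order-2"] = "cancelled"    # another system cancelled this one meanwhile
late = ship_order("order-2")       # a late retry finds the order cancelled
```

&lt;p&gt;The late step is skipped instead of shipping a cancelled order. In a real system the check and the write would need to happen atomically, which this sketch does not attempt.&lt;/p&gt;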

&lt;p&gt;Long-running workflows also expose assumptions you didn't know you made.&lt;/p&gt;

&lt;p&gt;Like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;this API will always respond quickly&lt;/li&gt;
&lt;li&gt;data will arrive in order&lt;/li&gt;
&lt;li&gt;users won't modify records manually&lt;/li&gt;
&lt;li&gt;retries will happen immediately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those assumptions survive in testing.&lt;/p&gt;

&lt;p&gt;Production removes them quickly.&lt;/p&gt;

&lt;p&gt;One thing that changed how I build these systems:&lt;/p&gt;

&lt;p&gt;I stopped asking:&lt;/p&gt;

&lt;p&gt;"Will this workflow finish?"&lt;/p&gt;

&lt;p&gt;And started asking:&lt;/p&gt;

&lt;p&gt;"What state will the world be in when it finishes?"&lt;/p&gt;

&lt;p&gt;Because those are two very different questions.&lt;/p&gt;

&lt;p&gt;Most problems in long-running systems aren't caused by one big failure.&lt;/p&gt;

&lt;p&gt;They're caused by lots of small changes happening while the workflow is still alive.&lt;/p&gt;

&lt;p&gt;And if you don't account for that, eventually the workflow finishes successfully and still produces the wrong outcome.&lt;/p&gt;

&lt;p&gt;This is something we think about constantly at BrainPack while operating workflows that span multiple systems and AI layers. Long-running processes are less about moving data and more about managing changing state over time.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>backend</category>
      <category>distributedsystems</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Why Retries Are More Dangerous Than Failures in Production Systems</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Fri, 19 Jun 2026 12:47:54 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/why-retries-are-more-dangerous-than-failures-in-production-systems-44jn</link>
      <guid>https://dev.to/dhruvi_21/why-retries-are-more-dangerous-than-failures-in-production-systems-44jn</guid>
      <description>&lt;p&gt;Failures are obvious.&lt;/p&gt;

&lt;p&gt;Retries are sneaky.&lt;/p&gt;

&lt;p&gt;When something fails, everyone notices.&lt;/p&gt;

&lt;p&gt;An alert goes off.&lt;br&gt;
A request errors out.&lt;br&gt;
Someone starts investigating.&lt;/p&gt;

&lt;p&gt;Retries are different.&lt;/p&gt;

&lt;p&gt;They look harmless.&lt;/p&gt;

&lt;p&gt;Most of the time, they save the system.&lt;/p&gt;

&lt;p&gt;But sometimes, retries create bigger problems than the original failure.&lt;/p&gt;

&lt;p&gt;Imagine an API call times out.&lt;/p&gt;

&lt;p&gt;No problem.&lt;/p&gt;

&lt;p&gt;The system retries.&lt;/p&gt;

&lt;p&gt;But what if the first request actually succeeded and only the response was lost?&lt;/p&gt;

&lt;p&gt;Now the retry creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;duplicate orders&lt;/li&gt;
&lt;li&gt;repeated emails&lt;/li&gt;
&lt;li&gt;inconsistent records&lt;/li&gt;
&lt;li&gt;workflows running twice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The failure happened once.&lt;/p&gt;

&lt;p&gt;The retry multiplied it.&lt;/p&gt;

&lt;p&gt;Another thing I've seen:&lt;/p&gt;

&lt;p&gt;One slow dependency causes requests to pile up.&lt;/p&gt;

&lt;p&gt;Retries start firing.&lt;/p&gt;

&lt;p&gt;Those retries create even more traffic.&lt;/p&gt;

&lt;p&gt;Which slows things down further.&lt;/p&gt;

&lt;p&gt;Which triggers even more retries.&lt;/p&gt;

&lt;p&gt;Suddenly, the system is spending more effort retrying than doing useful work.&lt;/p&gt;
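
&lt;p&gt;A common way to dampen this loop is to cap the number of attempts and spread retries out with exponential backoff plus random jitter. A minimal sketch, assuming a hypothetical &lt;code&gt;backoff_delays&lt;/code&gt; helper and illustrative default values:&lt;/p&gt;

```python
import random
from typing import Optional


def backoff_delays(max_attempts: int = 5, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> list[float]:
    """Return the wait (in seconds) before each retry, using full jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))   # exponential growth, capped
        delays.append(rng.uniform(0, ceiling))      # random wait in [0, ceiling]
    return delays


# A seeded generator keeps the example repeatable.
delays = backoff_delays(max_attempts=5, rng=random.Random(7))
```

&lt;p&gt;The cap on attempts bounds how much extra traffic one failing request can generate, and the jitter keeps many clients from retrying at the same instant.&lt;/p&gt;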

&lt;p&gt;Retries also hide problems.&lt;/p&gt;

&lt;p&gt;A temporary issue gets retried five times and eventually succeeds.&lt;/p&gt;

&lt;p&gt;Everything looks normal.&lt;/p&gt;

&lt;p&gt;Meanwhile:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;latency increases&lt;/li&gt;
&lt;li&gt;queues grow&lt;/li&gt;
&lt;li&gt;users experience delays&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing technically failed.&lt;/p&gt;

&lt;p&gt;But the system is getting less healthy.&lt;/p&gt;

&lt;p&gt;What changed for me is that I stopped treating retries as free.&lt;/p&gt;

&lt;p&gt;Every retry has a cost.&lt;/p&gt;

&lt;p&gt;It consumes resources.&lt;/p&gt;

&lt;p&gt;It increases load.&lt;/p&gt;

&lt;p&gt;And if actions aren't designed carefully, retries can repeat side effects that should only happen once.&lt;/p&gt;

&lt;p&gt;Now when I build something, I don't ask:&lt;/p&gt;

&lt;p&gt;"What happens if this fails?"&lt;/p&gt;

&lt;p&gt;I ask:&lt;/p&gt;

&lt;p&gt;"What happens if this runs again?"&lt;/p&gt;

&lt;p&gt;Because in production, things almost always run again.&lt;/p&gt;

&lt;p&gt;And if the answer is "bad things happen," the retry mechanism isn't helping.&lt;/p&gt;

&lt;p&gt;It's making things worse.&lt;/p&gt;

&lt;p&gt;Failures are part of every system.&lt;/p&gt;

&lt;p&gt;Retries are too.&lt;/p&gt;

&lt;p&gt;The difference is that failures usually happen once.&lt;/p&gt;

&lt;p&gt;Retries can turn one problem into hundreds if you don't design for them.&lt;/p&gt;

&lt;p&gt;This is something we think about constantly at BrainPack when operating long-running workflows across multiple systems. AI and automation layers make retries even more common, which means making actions safe to repeat becomes just as important as handling failures themselves.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>distributedsystems</category>
      <category>sre</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>An Experience From the Deployment Process Worth Sharing</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Mon, 15 Jun 2026 13:23:49 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/an-experience-from-the-deployment-process-worth-sharing-4ab5</link>
      <guid>https://dev.to/dhruvi_21/an-experience-from-the-deployment-process-worth-sharing-4ab5</guid>
      <description>&lt;p&gt;One deployment taught me a lesson I keep coming back to.&lt;/p&gt;

&lt;p&gt;Nothing was technically wrong.&lt;/p&gt;

&lt;p&gt;The code worked.&lt;/p&gt;

&lt;p&gt;Testing passed.&lt;/p&gt;

&lt;p&gt;The deployment completed successfully.&lt;/p&gt;

&lt;p&gt;And yet, a few hours later, things started behaving strangely.&lt;/p&gt;

&lt;p&gt;Some workflows were running slower than expected.&lt;/p&gt;

&lt;p&gt;A few records weren't updating.&lt;/p&gt;

&lt;p&gt;Nothing was completely broken.&lt;/p&gt;

&lt;p&gt;Just enough to make people question whether the system was working correctly.&lt;/p&gt;

&lt;p&gt;Those are often the hardest situations.&lt;/p&gt;

&lt;p&gt;Because there isn't a clear error pointing you in the right direction.&lt;/p&gt;

&lt;p&gt;After digging through the flow, we found the issue.&lt;/p&gt;

&lt;p&gt;The new deployment introduced a change that increased the number of requests between systems.&lt;/p&gt;

&lt;p&gt;Not enough to cause failures.&lt;/p&gt;

&lt;p&gt;But enough to create a small backlog.&lt;/p&gt;

&lt;p&gt;That backlog slowly grew throughout the day.&lt;/p&gt;
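
&lt;p&gt;To see how a small imbalance adds up, here is a back-of-the-envelope sketch. The rates are made-up numbers for illustration, not measurements from that deployment, and &lt;code&gt;backlog_after&lt;/code&gt; is a hypothetical helper that assumes constant rates:&lt;/p&gt;

```python
def backlog_after(hours: float, arrivals_per_min: float, processed_per_min: float,
                  start: float = 0.0) -> float:
    """Queue depth after `hours`, assuming constant arrival and processing rates."""
    growth_per_min = arrivals_per_min - processed_per_min
    return max(0.0, start + growth_per_min * hours * 60)


# Hypothetical numbers: 105 requests/min arriving, 100 requests/min processed.
for h in (1, 4, 8):
    print(h, backlog_after(h, arrivals_per_min=105, processed_per_min=100))
```

&lt;p&gt;With these numbers the queue grows by 5 items per minute: 300 after one hour and 2,400 after eight. No single minute looks alarming, which is why the problem only shows up later in the day.&lt;/p&gt;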

&lt;p&gt;The deployment itself was fine.&lt;/p&gt;

&lt;p&gt;The operational impact wasn't.&lt;/p&gt;

&lt;p&gt;What stuck with me was this:&lt;/p&gt;

&lt;p&gt;We had tested functionality.&lt;/p&gt;

&lt;p&gt;We hadn't tested behavior under real operating conditions.&lt;/p&gt;

&lt;p&gt;Those are very different things.&lt;/p&gt;

&lt;p&gt;Since then, I pay much more attention to questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What will this change do to traffic patterns?&lt;/li&gt;
&lt;li&gt;Will it create more retries?&lt;/li&gt;
&lt;li&gt;Will it increase queue sizes?&lt;/li&gt;
&lt;li&gt;How will it behave after running for several hours?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those questions rarely come up during feature development.&lt;/p&gt;

&lt;p&gt;But they matter a lot after deployment.&lt;/p&gt;

&lt;p&gt;One thing I've noticed about production systems:&lt;/p&gt;

&lt;p&gt;Problems often don't appear immediately.&lt;/p&gt;

&lt;p&gt;They accumulate.&lt;/p&gt;

&lt;p&gt;A small delay becomes a backlog.&lt;/p&gt;

&lt;p&gt;A backlog becomes slower processing.&lt;/p&gt;

&lt;p&gt;Slower processing creates more retries.&lt;/p&gt;

&lt;p&gt;And suddenly you're debugging something that started hours earlier.&lt;/p&gt;

&lt;p&gt;The experience changed how I think about deployments.&lt;/p&gt;

&lt;p&gt;A successful deployment is not when the code reaches production.&lt;/p&gt;

&lt;p&gt;A successful deployment is when the system continues behaving predictably after the change.&lt;/p&gt;

&lt;p&gt;This is something we think about constantly at BrainPack while operating systems that run continuously across enterprise environments. The deployment is usually the easy part. Understanding how a change affects the system over the next few hours and days is where the real work begins.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>performance</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Something I Learned Recently That Changed How I Approach a Problem</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Thu, 11 Jun 2026 12:49:58 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/something-i-learned-recently-that-changed-how-i-approach-a-problem-38dp</link>
      <guid>https://dev.to/dhruvi_21/something-i-learned-recently-that-changed-how-i-approach-a-problem-38dp</guid>
      <description>&lt;p&gt;One thing I learned recently:&lt;/p&gt;

&lt;p&gt;Most production problems are not technical problems.&lt;/p&gt;

&lt;p&gt;They're visibility problems.&lt;/p&gt;

&lt;p&gt;For a long time, my instinct was:&lt;/p&gt;

&lt;p&gt;Something breaks → find the bug → fix the code.&lt;/p&gt;

&lt;p&gt;Seems reasonable.&lt;/p&gt;

&lt;p&gt;But the more time I spend operating systems, the more I notice that many issues happen because we can't clearly see what's happening.&lt;/p&gt;

&lt;p&gt;A workflow gets stuck.&lt;/p&gt;

&lt;p&gt;Data stops syncing.&lt;/p&gt;

&lt;p&gt;An automation behaves unexpectedly.&lt;/p&gt;

&lt;p&gt;The first question isn't:&lt;/p&gt;

&lt;p&gt;"Why did it fail?"&lt;/p&gt;

&lt;p&gt;It's:&lt;/p&gt;

&lt;p&gt;"Can we actually see where it failed?"&lt;/p&gt;

&lt;p&gt;I've worked on issues where the fix took 15 minutes.&lt;/p&gt;

&lt;p&gt;Finding the issue took several hours.&lt;/p&gt;

&lt;p&gt;Not because the bug was complicated.&lt;/p&gt;

&lt;p&gt;Because there wasn't enough visibility into the system.&lt;/p&gt;

&lt;p&gt;That changed how I approach new work.&lt;/p&gt;

&lt;p&gt;Now, before I think about features, I think about observability.&lt;/p&gt;

&lt;p&gt;Questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How will we know this is broken?&lt;/li&gt;
&lt;li&gt;How will we know it's slow?&lt;/li&gt;
&lt;li&gt;How will we know it's stuck?&lt;/li&gt;
&lt;li&gt;How will we investigate it six months from now?&lt;/li&gt;
&lt;/ul&gt;
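
&lt;p&gt;As one concrete example, "How will we know it's stuck?" can be answered by recording when each workflow last made progress and flagging the ones that have gone quiet. A minimal sketch, assuming a hypothetical &lt;code&gt;find_stuck&lt;/code&gt; helper, made-up workflow names, and an arbitrary 15-minute threshold:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def find_stuck(last_progress: dict[str, datetime], now: datetime,
               max_idle: timedelta = timedelta(minutes=15)) -> list[str]:
    """Return workflow IDs that have not reported progress within max_idle."""
    return sorted(wf for wf, seen in last_progress.items() if now - seen > max_idle)


now = datetime(2026, 6, 11, 12, 0, tzinfo=timezone.utc)
last_progress = {
    "sync-customers": now - timedelta(minutes=2),   # recently active
    "sync-orders": now - timedelta(hours=3),        # silent for hours
}
stuck = find_stuck(last_progress, now)
```

&lt;p&gt;Here only &lt;code&gt;sync-orders&lt;/code&gt; is flagged. The useful part is less the check itself than the habit of writing a progress timestamp at every step, so there is something to query later.&lt;/p&gt;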

&lt;p&gt;Those questions often end up being more important than the implementation itself.&lt;/p&gt;

&lt;p&gt;The interesting thing is that adding visibility rarely feels urgent when you're building.&lt;/p&gt;

&lt;p&gt;Everything is working.&lt;/p&gt;

&lt;p&gt;Everything looks fine.&lt;/p&gt;

&lt;p&gt;Until the day it isn't.&lt;/p&gt;

&lt;p&gt;And that's usually when you realize how valuable those extra logs, status checks, and monitoring points actually are.&lt;/p&gt;

&lt;p&gt;One pattern I've noticed:&lt;/p&gt;

&lt;p&gt;Teams often spend more time locating problems than solving them.&lt;/p&gt;

&lt;p&gt;So improving visibility doesn't just reduce downtime.&lt;/p&gt;

&lt;p&gt;It makes engineering work faster.&lt;/p&gt;

&lt;p&gt;Now when I build something, I try to leave behind enough information that future me doesn't have to guess what happened.&lt;/p&gt;

&lt;p&gt;That's probably one of the highest-return investments I've found in software engineering.&lt;/p&gt;

&lt;p&gt;This lesson comes up constantly at BrainPack while operating systems that run continuously across multiple platforms and workflows. AI systems can only be as reliable as your ability to understand what the underlying infrastructure is doing at any given moment.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What Documentation Looks Like in a Permanently Operated System</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Thu, 04 Jun 2026 12:39:15 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/what-documentation-looks-like-in-a-permanently-operated-system-1gja</link>
      <guid>https://dev.to/dhruvi_21/what-documentation-looks-like-in-a-permanently-operated-system-1gja</guid>
      <description>&lt;p&gt;I used to think documentation was mostly for onboarding.&lt;/p&gt;

&lt;p&gt;A way to help new developers understand the system.&lt;/p&gt;

&lt;p&gt;That's part of it.&lt;/p&gt;

&lt;p&gt;But when you're operating systems continuously, documentation becomes something else entirely.&lt;/p&gt;

&lt;p&gt;It's operational infrastructure.&lt;/p&gt;

&lt;p&gt;The biggest misconception about documentation is that it's about explaining how things were built.&lt;/p&gt;

&lt;p&gt;Most of the time, that's not what people need.&lt;/p&gt;

&lt;p&gt;What they need is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how does this actually work today?&lt;/li&gt;
&lt;li&gt;what happens if it fails?&lt;/li&gt;
&lt;li&gt;who depends on it?&lt;/li&gt;
&lt;li&gt;what should happen next?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One thing I learned pretty quickly:&lt;/p&gt;

&lt;p&gt;Nobody reads long documentation during an incident.&lt;/p&gt;

&lt;p&gt;If something breaks, people need answers fast.&lt;/p&gt;

&lt;p&gt;So the most useful documentation is usually the simplest.&lt;/p&gt;

&lt;p&gt;Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;workflow diagrams&lt;/li&gt;
&lt;li&gt;system dependencies&lt;/li&gt;
&lt;li&gt;retry behavior&lt;/li&gt;
&lt;li&gt;recovery steps&lt;/li&gt;
&lt;li&gt;known failure points&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another thing that changes in long-running systems:&lt;/p&gt;

&lt;p&gt;Documentation can't be static.&lt;/p&gt;

&lt;p&gt;The system evolves.&lt;/p&gt;

&lt;p&gt;Integrations change.&lt;/p&gt;

&lt;p&gt;Business processes change.&lt;/p&gt;

&lt;p&gt;Automations get added.&lt;/p&gt;

&lt;p&gt;If documentation doesn't evolve too, it slowly becomes misleading.&lt;/p&gt;

&lt;p&gt;And outdated documentation is often worse than no documentation at all.&lt;/p&gt;

&lt;p&gt;The documentation I use most is rarely technical.&lt;/p&gt;

&lt;p&gt;It's operational.&lt;/p&gt;

&lt;p&gt;Questions like:&lt;/p&gt;

&lt;p&gt;Why does this process exist?&lt;/p&gt;

&lt;p&gt;What happens if this service is unavailable?&lt;/p&gt;

&lt;p&gt;Which systems depend on this workflow?&lt;/p&gt;

&lt;p&gt;Those answers save more time than implementation details.&lt;/p&gt;

&lt;p&gt;One thing I appreciate now:&lt;/p&gt;

&lt;p&gt;Good documentation reduces dependency on specific people.&lt;/p&gt;

&lt;p&gt;Without it, knowledge gets trapped.&lt;/p&gt;

&lt;p&gt;One person knows how something works.&lt;/p&gt;

&lt;p&gt;One person knows how to recover it.&lt;/p&gt;

&lt;p&gt;One person knows why it was built that way.&lt;/p&gt;

&lt;p&gt;That's a risk.&lt;/p&gt;

&lt;p&gt;In systems that run continuously, documentation is less about explaining code and more about preserving operational knowledge.&lt;/p&gt;

&lt;p&gt;Because eventually, everyone forgets why a decision was made.&lt;/p&gt;

&lt;p&gt;The documentation is what remains.&lt;/p&gt;

&lt;p&gt;At BrainPack, a lot of the systems we operate involve multiple integrations, workflows, and AI layers running together. Good documentation helps turn individual knowledge into infrastructure that the entire team can rely on over time.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>softwareengineering</category>
      <category>sre</category>
      <category>writing</category>
    </item>
    <item>
      <title>What Building Software That Runs 24/7 Actually Means Day to Day</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Wed, 03 Jun 2026 13:04:16 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/what-building-software-that-runs-247-actually-means-day-to-day-14fl</link>
      <guid>https://dev.to/dhruvi_21/what-building-software-that-runs-247-actually-means-day-to-day-14fl</guid>
      <description>&lt;p&gt;When people hear that a system runs 24/7, they usually think about uptime.&lt;/p&gt;

&lt;p&gt;Servers running.&lt;/p&gt;

&lt;p&gt;Services responding.&lt;/p&gt;

&lt;p&gt;No outages.&lt;/p&gt;

&lt;p&gt;But day to day, that's not what I spend most of my time thinking about.&lt;/p&gt;

&lt;p&gt;What I actually think about is:&lt;/p&gt;

&lt;p&gt;What happens at 2:13 AM when something unexpected occurs?&lt;/p&gt;

&lt;p&gt;Because eventually, it will.&lt;/p&gt;

&lt;p&gt;A queue gets stuck.&lt;/p&gt;

&lt;p&gt;A third-party API slows down.&lt;/p&gt;

&lt;p&gt;A workflow starts behaving differently.&lt;/p&gt;

&lt;p&gt;A retry arrives hours later than expected.&lt;/p&gt;

&lt;p&gt;The interesting part is that most problems aren't dramatic.&lt;/p&gt;

&lt;p&gt;The system doesn't crash.&lt;/p&gt;

&lt;p&gt;It keeps running.&lt;/p&gt;

&lt;p&gt;Just slightly wrong.&lt;/p&gt;

&lt;p&gt;And those are often the hardest issues to catch.&lt;/p&gt;

&lt;p&gt;Building software that runs continuously means caring about things that demos never show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;recovery&lt;/li&gt;
&lt;li&gt;observability&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;data consistency&lt;/li&gt;
&lt;li&gt;failure handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not because they're exciting.&lt;/p&gt;

&lt;p&gt;Because they become important every single day.&lt;/p&gt;

&lt;p&gt;One thing I learned pretty quickly:&lt;/p&gt;

&lt;p&gt;The goal isn't building a system that never fails.&lt;/p&gt;

&lt;p&gt;The goal is building a system that can recover without someone jumping in every time.&lt;/p&gt;

&lt;p&gt;If a process gets stuck, can it restart?&lt;/p&gt;

&lt;p&gt;If an API fails, can it retry safely?&lt;/p&gt;

&lt;p&gt;If data arrives late, can the workflow still complete correctly?&lt;/p&gt;

&lt;p&gt;Those questions matter more than most features.&lt;/p&gt;

&lt;p&gt;Another reality is that software running 24/7 creates a different relationship with technical decisions.&lt;/p&gt;

&lt;p&gt;Small shortcuts last a long time.&lt;/p&gt;

&lt;p&gt;Small bugs eventually surface.&lt;/p&gt;

&lt;p&gt;Small assumptions eventually get tested.&lt;/p&gt;

&lt;p&gt;The system has a lot of time to find weaknesses.&lt;/p&gt;

&lt;p&gt;What surprised me most is how much of the work is actually about predictability.&lt;/p&gt;

&lt;p&gt;Not speed.&lt;/p&gt;

&lt;p&gt;Not new features.&lt;/p&gt;

&lt;p&gt;Predictability.&lt;/p&gt;

&lt;p&gt;Knowing how the system behaves when things go right and when they don't.&lt;/p&gt;

&lt;p&gt;Because people eventually start depending on that behavior.&lt;/p&gt;

&lt;p&gt;Building software that runs continuously has changed how I think about engineering. The feature is only the beginning. The real work starts when the system has to keep doing its job reliably every hour of every day.&lt;/p&gt;

&lt;p&gt;This is the reality of a lot of the systems we operate at BrainPack. Once enterprise workflows and AI automations are running continuously, reliability becomes less about uptime and more about making sure the system behaves predictably under real-world conditions.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>sre</category>
    </item>
    <item>
      <title>The Hardest Part of Integrating with Legacy ERPs Is Not the API</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Mon, 01 Jun 2026 13:03:15 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/the-hardest-part-of-integrating-with-legacy-erps-is-not-the-api-52fh</link>
      <guid>https://dev.to/dhruvi_21/the-hardest-part-of-integrating-with-legacy-erps-is-not-the-api-52fh</guid>
      <description>&lt;p&gt;When people hear "ERP integration," they usually think the difficult part is the API.&lt;/p&gt;

&lt;p&gt;Sometimes there isn't even an API.&lt;/p&gt;

&lt;p&gt;But honestly, that's rarely the biggest problem.&lt;/p&gt;

&lt;p&gt;The harder problem is understanding how the business actually uses the system.&lt;/p&gt;

&lt;p&gt;Because what the ERP does and what the organization does are often two different things.&lt;/p&gt;

&lt;p&gt;I've seen situations where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a field means one thing in the documentation and something completely different in practice&lt;/li&gt;
&lt;li&gt;a workflow exists because someone created a workaround five years ago&lt;/li&gt;
&lt;li&gt;critical business rules live in spreadsheets instead of the ERP&lt;/li&gt;
&lt;li&gt;important decisions happen outside the system entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Technically, the integration works.&lt;/p&gt;

&lt;p&gt;Operationally, it's wrong.&lt;/p&gt;

&lt;p&gt;This is where many integration projects get stuck.&lt;/p&gt;

&lt;p&gt;The data moves successfully.&lt;/p&gt;

&lt;p&gt;The API calls succeed.&lt;/p&gt;

&lt;p&gt;The sync completes.&lt;/p&gt;

&lt;p&gt;And yet users immediately tell you something is broken.&lt;/p&gt;

&lt;p&gt;Because the integration followed the system.&lt;/p&gt;

&lt;p&gt;Not the business process.&lt;/p&gt;

&lt;p&gt;One thing I learned early on:&lt;/p&gt;

&lt;p&gt;Never assume the ERP is the source of truth for how work gets done.&lt;/p&gt;

&lt;p&gt;It's often just one piece of a much larger process.&lt;/p&gt;

&lt;p&gt;The real workflow usually spans:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ERP systems&lt;/li&gt;
&lt;li&gt;spreadsheets&lt;/li&gt;
&lt;li&gt;emails&lt;/li&gt;
&lt;li&gt;manual approvals&lt;/li&gt;
&lt;li&gt;undocumented habits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The API is a technical challenge.&lt;/p&gt;

&lt;p&gt;Understanding operational behavior is a people challenge.&lt;/p&gt;

&lt;p&gt;And in my experience, the people challenge takes longer.&lt;/p&gt;

&lt;p&gt;The best integrations I've worked on started with questions like:&lt;/p&gt;

&lt;p&gt;Why does this process exist?&lt;/p&gt;

&lt;p&gt;Who actually uses this field?&lt;/p&gt;

&lt;p&gt;What happens if this step is skipped?&lt;/p&gt;

&lt;p&gt;Those conversations usually uncover more than the technical documentation ever does.&lt;/p&gt;

&lt;p&gt;The interesting part is that older systems are often doing exactly what they were designed to do.&lt;/p&gt;

&lt;p&gt;The complexity comes from everything built around them over the years.&lt;/p&gt;

&lt;p&gt;This is something we see regularly at BrainPack when connecting existing enterprise systems to modern workflows and AI capabilities. The technical connection is usually the easy part. Understanding how the organization actually operates is where most of the real integration work happens.&lt;/p&gt;

</description>
      <category>api</category>
      <category>discuss</category>
      <category>softwareengineering</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Something Honest About Being a Developer on This Kind of Team</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Thu, 28 May 2026 12:38:43 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/something-honest-about-being-a-developer-on-this-kind-of-team-5ehp</link>
      <guid>https://dev.to/dhruvi_21/something-honest-about-being-a-developer-on-this-kind-of-team-5ehp</guid>
      <description>&lt;p&gt;One thing I didn’t expect working on systems like this:&lt;/p&gt;

&lt;p&gt;A lot of the job is uncertainty.&lt;/p&gt;

&lt;p&gt;Not coding itself.&lt;/p&gt;

&lt;p&gt;Uncertainty.&lt;/p&gt;

&lt;p&gt;You’re constantly working across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;systems you didn’t build&lt;/li&gt;
&lt;li&gt;workflows nobody fully documented&lt;/li&gt;
&lt;li&gt;integrations that behave differently under production load&lt;/li&gt;
&lt;li&gt;business logic hidden inside years of habits and manual processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes the hardest part is simply figuring out what is actually happening.&lt;/p&gt;

&lt;p&gt;Another thing people don’t talk about enough:&lt;/p&gt;

&lt;p&gt;You rarely get the satisfaction of “finished.”&lt;/p&gt;

&lt;p&gt;Because the systems keep evolving.&lt;/p&gt;

&lt;p&gt;You fix one workflow.&lt;/p&gt;

&lt;p&gt;Then another dependency appears.&lt;/p&gt;

&lt;p&gt;You stabilize one integration.&lt;/p&gt;

&lt;p&gt;Then business priorities change and the flow changes again.&lt;/p&gt;

&lt;p&gt;The system keeps moving underneath you.&lt;/p&gt;

&lt;p&gt;There’s also a different kind of pressure when systems run continuously.&lt;/p&gt;

&lt;p&gt;You know real operations depend on them.&lt;/p&gt;

&lt;p&gt;If something breaks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;people stop receiving orders&lt;/li&gt;
&lt;li&gt;workflows stop moving&lt;/li&gt;
&lt;li&gt;teams lose visibility&lt;/li&gt;
&lt;li&gt;data becomes unreliable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It changes how carefully you think about small decisions.&lt;/p&gt;

&lt;p&gt;At the same time, this kind of work made me much calmer technically.&lt;/p&gt;

&lt;p&gt;You stop panicking when things fail.&lt;/p&gt;

&lt;p&gt;Because eventually you realize:&lt;br&gt;
production systems always fail somewhere.&lt;/p&gt;

&lt;p&gt;The goal is not perfection.&lt;/p&gt;

&lt;p&gt;The goal is building systems that recover safely and predictably.&lt;/p&gt;

&lt;p&gt;One thing I genuinely like about this work though:&lt;/p&gt;

&lt;p&gt;You get very close to how businesses actually operate.&lt;/p&gt;

&lt;p&gt;Not the clean diagrams.&lt;/p&gt;

&lt;p&gt;The real workflows.&lt;/p&gt;

&lt;p&gt;The weird edge cases.&lt;/p&gt;

&lt;p&gt;The manual processes people invented just to keep things moving.&lt;/p&gt;

&lt;p&gt;You learn quickly that software is usually less about code and more about understanding operational behavior.&lt;/p&gt;

&lt;p&gt;Working at BrainPack exposed me to how complex enterprise environments actually are once multiple systems, teams, and AI workflows start interacting. Most of the engineering work is not about building isolated features; it’s about making entire operational flows stable enough to trust long term.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A Technical Problem I Worked On This Week</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Fri, 22 May 2026 12:43:15 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/a-technical-problem-i-worked-on-this-week-1fle</link>
      <guid>https://dev.to/dhruvi_21/a-technical-problem-i-worked-on-this-week-1fle</guid>
      <description>&lt;p&gt;This week, I spent more time than expected debugging something that looked simple.&lt;/p&gt;

&lt;p&gt;Data syncing between two systems.&lt;/p&gt;

&lt;p&gt;One side said the record was updated.&lt;/p&gt;

&lt;p&gt;The other side disagreed.&lt;/p&gt;

&lt;p&gt;No errors.&lt;/p&gt;

&lt;p&gt;No failed requests.&lt;/p&gt;

&lt;p&gt;Everything looked normal.&lt;/p&gt;

&lt;p&gt;Which usually means the problem is not obvious.&lt;/p&gt;

&lt;p&gt;The issue ended up being timing.&lt;/p&gt;

&lt;p&gt;One system updated immediately.&lt;/p&gt;

&lt;p&gt;The other processed updates through a delayed workflow.&lt;/p&gt;

&lt;p&gt;Most of the time it worked.&lt;/p&gt;

&lt;p&gt;Sometimes updates arrived in a different order.&lt;/p&gt;

&lt;p&gt;Which created small inconsistencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;outdated values appearing temporarily&lt;/li&gt;
&lt;li&gt;automation triggering from stale information&lt;/li&gt;
&lt;li&gt;users seeing different states in different systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difficult part was that it only happened occasionally.&lt;/p&gt;

&lt;p&gt;So reproducing it locally was almost impossible.&lt;/p&gt;

&lt;p&gt;The fix itself was not complicated.&lt;/p&gt;

&lt;p&gt;We changed how updates were processed.&lt;/p&gt;

&lt;p&gt;Instead of assuming data arrives in the right order, we added validation around state changes before applying updates.&lt;/p&gt;

&lt;p&gt;Small change.&lt;/p&gt;

&lt;p&gt;Big difference.&lt;/p&gt;
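
&lt;p&gt;The post doesn't spell out what that validation looked like, so the following is only one possible shape for it: attach a version to each update and refuse to apply anything older than what is already stored. The &lt;code&gt;Record&lt;/code&gt; type, its &lt;code&gt;version&lt;/code&gt; field, and &lt;code&gt;apply_update&lt;/code&gt; are all hypothetical:&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Record:
    status: str
    version: int   # hypothetical monotonically increasing version from the source


def apply_update(current: Record, incoming: Record) -> Record:
    """Keep whichever record is newer; ignore stale or duplicate updates."""
    if current.version >= incoming.version:
        return current          # stale or duplicate update: do not overwrite
    return incoming


record = Record(status="created", version=1)
# Updates arrive out of order: version 3 shows up before version 2.
record = apply_update(record, Record(status="approved", version=3))
record = apply_update(record, Record(status="in_review", version=2))
```

&lt;p&gt;The record ends up as &lt;code&gt;approved&lt;/code&gt; at version 3, and the late version 2 update is dropped instead of overwriting newer data. This only works if the source system can supply a reliable ordering, such as a version counter or an update timestamp.&lt;/p&gt;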

&lt;p&gt;One thing I keep learning while working on systems that run continuously:&lt;/p&gt;

&lt;p&gt;A lot of problems are not caused by failures.&lt;/p&gt;

&lt;p&gt;They come from assumptions.&lt;/p&gt;

&lt;p&gt;Assuming systems process things instantly.&lt;/p&gt;

&lt;p&gt;Assuming updates arrive in order.&lt;/p&gt;

&lt;p&gt;Assuming timing stays consistent.&lt;/p&gt;

&lt;p&gt;Production environments break assumptions very quickly.&lt;/p&gt;

&lt;p&gt;The interesting part about enterprise systems is that most technical problems are not isolated.&lt;/p&gt;

&lt;p&gt;One small inconsistency spreads.&lt;/p&gt;

&lt;p&gt;An automation behaves differently.&lt;/p&gt;

&lt;p&gt;A report becomes inaccurate.&lt;/p&gt;

&lt;p&gt;Another system trusts the wrong data.&lt;/p&gt;

&lt;p&gt;Small problems travel.&lt;/p&gt;

&lt;p&gt;This is something we deal with regularly at BrainPack while connecting systems that were never originally designed to operate together. AI workflows become much more reliable once the underlying data movement becomes predictable.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>devjournal</category>
      <category>distributedsystems</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>The Difference Between Building for Demo and Building for Production</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Wed, 20 May 2026 12:45:36 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/the-difference-between-building-for-demo-and-building-for-production-52e7</link>
      <guid>https://dev.to/dhruvi_21/the-difference-between-building-for-demo-and-building-for-production-52e7</guid>
      <description>&lt;p&gt;A lot of software looks great in demos.&lt;/p&gt;

&lt;p&gt;Clean data.&lt;br&gt;
Fast responses.&lt;br&gt;
Perfect flow.&lt;/p&gt;

&lt;p&gt;Production is where reality shows up.&lt;/p&gt;

&lt;p&gt;A demo assumes everything behaves correctly.&lt;/p&gt;

&lt;p&gt;Production assumes eventually something will break.&lt;/p&gt;

&lt;p&gt;That changes how you build.&lt;/p&gt;

&lt;p&gt;For demos:&lt;/p&gt;

&lt;p&gt;You optimize for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;speed&lt;/li&gt;
&lt;li&gt;presentation&lt;/li&gt;
&lt;li&gt;showing capability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For production:&lt;/p&gt;

&lt;p&gt;You optimize for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;failure recovery&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;monitoring&lt;/li&gt;
&lt;li&gt;stability&lt;/li&gt;
&lt;li&gt;weird edge cases nobody planned for&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A demo works when everything goes right.&lt;/p&gt;

&lt;p&gt;Production works when things go wrong.&lt;/p&gt;

&lt;p&gt;One thing I noticed early on:&lt;/p&gt;

&lt;p&gt;Demo environments are predictable.&lt;/p&gt;

&lt;p&gt;Production environments are messy.&lt;/p&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;duplicate events&lt;/li&gt;
&lt;li&gt;incomplete data&lt;/li&gt;
&lt;li&gt;slow third-party systems&lt;/li&gt;
&lt;li&gt;retries arriving late&lt;/li&gt;
&lt;li&gt;users doing things you never expected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code that looked perfect in testing suddenly behaves very differently.&lt;/p&gt;
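
&lt;p&gt;Duplicate events and late retries are a good example. A rough sketch of one common defence, assuming every event carries a unique id (the id and amount fields, and the handle_event name, are made up for illustration):&lt;/p&gt;

```python
processed_ids = set()
balance = {"total": 0}


def handle_event(event: dict) -> bool:
    """Process an event once; ignore repeats of the same event id."""
    event_id = event["id"]  # assumption: every event carries a unique id
    if event_id in processed_ids:
        return False  # duplicate delivery or late retry
    balance["total"] += event["amount"]
    processed_ids.add(event_id)
    return True


events = [
    {"id": "evt-1", "amount": 10},
    {"id": "evt-2", "amount": 5},
    {"id": "evt-1", "amount": 10},  # duplicate of the first event
]
results = [handle_event(e) for e in events]
```

&lt;p&gt;In a real system the set of processed ids would need to live in durable storage rather than in memory, but the idea is the same: handling the same event twice should not change the result.&lt;/p&gt;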

&lt;p&gt;Another difference:&lt;/p&gt;

&lt;p&gt;Demo code usually answers:&lt;/p&gt;

&lt;p&gt;"Can we do this?"&lt;/p&gt;

&lt;p&gt;Production code answers:&lt;/p&gt;

&lt;p&gt;"Can this keep working for months while real people depend on it?"&lt;/p&gt;

&lt;p&gt;Very different problem.&lt;/p&gt;

&lt;p&gt;One thing that changed how I build systems:&lt;/p&gt;

&lt;p&gt;I stopped asking:&lt;/p&gt;

&lt;p&gt;"Does this work?"&lt;/p&gt;

&lt;p&gt;Now I ask:&lt;/p&gt;

&lt;p&gt;"What happens when this fails?"&lt;/p&gt;

&lt;p&gt;Because eventually it will.&lt;/p&gt;

&lt;p&gt;The question is whether the system recovers safely.&lt;/p&gt;

&lt;p&gt;A lot of engineering work happens after the feature already works.&lt;/p&gt;

&lt;p&gt;Observability. Recovery. Reliability.&lt;/p&gt;

&lt;p&gt;The things users never notice.&lt;/p&gt;

&lt;p&gt;Until they stop existing.&lt;/p&gt;

&lt;p&gt;This comes up constantly at BrainPack while running systems that operate continuously across enterprise environments. Layering AI on top becomes much easier once the underlying infrastructure is designed for production conditions instead of demo conditions.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>A Small Fix That Helped a Live Deployment Immediately</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Tue, 19 May 2026 12:44:30 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/a-small-fix-that-helped-a-live-deployment-immediately-1hng</link>
      <guid>https://dev.to/dhruvi_21/a-small-fix-that-helped-a-live-deployment-immediately-1hng</guid>
      <description>&lt;p&gt;One of the most useful fixes I worked on recently was not complicated at all.&lt;/p&gt;

&lt;p&gt;We added a queue between two systems that were talking to each other directly.&lt;/p&gt;

&lt;p&gt;That was it.&lt;/p&gt;

&lt;p&gt;Before that, everything worked fine most of the time.&lt;/p&gt;

&lt;p&gt;Until traffic increased or one system slowed down for a few seconds.&lt;/p&gt;

&lt;p&gt;Then things started piling up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;requests timing out&lt;/li&gt;
&lt;li&gt;retries triggering&lt;/li&gt;
&lt;li&gt;duplicate operations&lt;/li&gt;
&lt;li&gt;random failures appearing across workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem was that both systems expected immediate responses from each other.&lt;/p&gt;

&lt;p&gt;So when one slowed down, the other started failing too.&lt;/p&gt;

&lt;p&gt;Classic cascading failure.&lt;/p&gt;

&lt;p&gt;The fix was surprisingly small.&lt;/p&gt;

&lt;p&gt;Instead of:&lt;br&gt;
System A → direct request → System B&lt;/p&gt;

&lt;p&gt;We changed it to:&lt;br&gt;
System A → queue → System B&lt;/p&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;requests could wait safely&lt;/li&gt;
&lt;li&gt;retries became manageable&lt;/li&gt;
&lt;li&gt;temporary slowdowns stopped affecting the entire flow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The deployment stabilized almost immediately.&lt;/p&gt;

&lt;p&gt;What I liked about this fix is that it changed the behavior of the system more than the complexity of the code.&lt;/p&gt;

&lt;p&gt;No massive rewrite.&lt;br&gt;
No new infrastructure layer.&lt;/p&gt;

&lt;p&gt;Just removing the assumption that everything has to happen instantly.&lt;/p&gt;

&lt;p&gt;A lot of production issues come from systems being too tightly coupled.&lt;/p&gt;

&lt;p&gt;One delay becomes everybody’s problem.&lt;/p&gt;

&lt;p&gt;Queues don’t remove failures.&lt;/p&gt;

&lt;p&gt;They absorb pressure long enough for the rest of the system to keep operating normally.&lt;/p&gt;
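
&lt;p&gt;The System A → queue → System B shape can be sketched in a few lines. This is only an in-process illustration using Python's standard queue module; a live deployment would more likely use a durable message broker, and the worker function and message names here are hypothetical.&lt;/p&gt;

```python
import queue
import threading

# Hypothetical stand-in for the queue between System A and System B.
work_queue = queue.Queue()
processed = []


def system_b_worker() -> None:
    """System B drains the queue at its own pace."""
    while True:
        message = work_queue.get()
        if message is None:  # sentinel: no more work
            work_queue.task_done()
            break
        processed.append(message.upper())  # placeholder for real work
        work_queue.task_done()


worker = threading.Thread(target=system_b_worker)
worker.start()

# System A enqueues and moves on instead of waiting for a response.
for message in ["create", "update", "close"]:
    work_queue.put(message)

work_queue.put(None)  # tell the worker to stop
work_queue.join()     # wait until everything queued has been handled
worker.join()
```

&lt;p&gt;System A never blocks on System B here. If System B slows down, messages simply wait in the queue until it catches up.&lt;/p&gt;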

&lt;p&gt;One thing I learned working on live systems:&lt;/p&gt;

&lt;p&gt;Performance issues are often really coordination issues.&lt;/p&gt;

&lt;p&gt;The systems themselves are usually capable.&lt;/p&gt;

&lt;p&gt;They just fail because everything depends on perfect timing.&lt;/p&gt;

&lt;p&gt;This is something we run into constantly at BrainPack while integrating multiple enterprise systems and AI workflows together. A lot of stability comes from reducing tight coupling between systems so temporary failures don’t spread across the entire infrastructure.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>deployment</category>
      <category>engineering</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How We Debug Issues That Only Happen Once Every Few Days</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Fri, 15 May 2026 12:45:23 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/how-we-debug-issues-that-only-happen-once-every-few-days-22kd</link>
      <guid>https://dev.to/dhruvi_21/how-we-debug-issues-that-only-happen-once-every-few-days-22kd</guid>
      <description>&lt;p&gt;The hardest bugs are not the ones that happen constantly.&lt;/p&gt;

&lt;p&gt;The hardest ones happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;once every few days&lt;/li&gt;
&lt;li&gt;under unknown conditions&lt;/li&gt;
&lt;li&gt;with no obvious pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Especially in systems that run continuously.&lt;/p&gt;

&lt;p&gt;Because by the time you notice the issue, the original state is already gone.&lt;/p&gt;

&lt;p&gt;Early on, I used to approach these bugs the wrong way.&lt;/p&gt;

&lt;p&gt;I would immediately start reading logs and trying to reproduce the issue locally.&lt;/p&gt;

&lt;p&gt;Most of the time, that went nowhere.&lt;/p&gt;

&lt;p&gt;Because these problems usually depend on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timing&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;load&lt;/li&gt;
&lt;li&gt;specific data states&lt;/li&gt;
&lt;li&gt;interactions between systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Things that almost never exist in your local environment the same way.&lt;/p&gt;

&lt;p&gt;What changed for me was realizing:&lt;/p&gt;

&lt;p&gt;The goal is not “find the bug immediately.”&lt;/p&gt;

&lt;p&gt;The goal is:&lt;br&gt;
make the system observable enough that the bug exposes itself next time.&lt;/p&gt;

&lt;p&gt;So instead of guessing, we start adding visibility around the problem.&lt;/p&gt;

&lt;p&gt;Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tracking state transitions&lt;/li&gt;
&lt;li&gt;storing retry history&lt;/li&gt;
&lt;li&gt;recording execution timing&lt;/li&gt;
&lt;li&gt;correlating events across systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not permanent debugging noise.&lt;/p&gt;

&lt;p&gt;Just enough context to reconstruct what actually happened later.&lt;/p&gt;
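
&lt;p&gt;As a rough sketch of what that context can look like: one structured entry per state transition, tagged with a correlation id so events from different systems can be lined up later. The field names and the record_event helper are assumptions for illustration, not a specific logging library.&lt;/p&gt;

```python
import json
import time
import uuid


def record_event(log: list, correlation_id: str, step: str,
                 old_state: str, new_state: str, attempt: int) -> None:
    """Append one structured entry describing a state transition."""
    log.append({
        "correlation_id": correlation_id,  # ties events together across systems
        "step": step,
        "from": old_state,
        "to": new_state,
        "attempt": attempt,                # retry history
        "at": time.time(),                 # execution timing
    })


event_log = []
cid = str(uuid.uuid4())
record_event(event_log, cid, "sync_record", "queued", "processing", attempt=1)
record_event(event_log, cid, "sync_record", "processing", "failed", attempt=1)
record_event(event_log, cid, "sync_record", "failed", "done", attempt=2)

# Reconstruct what happened to one workflow run after the fact.
history = [(e["from"], e["to"], e["attempt"]) for e in event_log
           if e["correlation_id"] == cid]
print(json.dumps(history))
```

&lt;p&gt;Filtering by correlation id gives back the full sequence for one run, including which attempt each transition belonged to.&lt;/p&gt;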

&lt;p&gt;Another thing I learned:&lt;/p&gt;

&lt;p&gt;Rare bugs are often not random.&lt;/p&gt;

&lt;p&gt;They usually happen when multiple small conditions align:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a delayed queue&lt;/li&gt;
&lt;li&gt;a retry arriving late&lt;/li&gt;
&lt;li&gt;stale data&lt;/li&gt;
&lt;li&gt;another service slowing down&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Individually, nothing breaks.&lt;/p&gt;

&lt;p&gt;Together, something weird appears for 30 seconds and disappears again.&lt;/p&gt;

&lt;p&gt;One mistake I made a lot before:&lt;/p&gt;

&lt;p&gt;Trying to “fix” the issue too early.&lt;/p&gt;

&lt;p&gt;When you don’t fully understand intermittent bugs, quick fixes usually just hide the symptom temporarily.&lt;/p&gt;

&lt;p&gt;So now I spend more time understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what sequence created the issue&lt;/li&gt;
&lt;li&gt;what state the system was in&lt;/li&gt;
&lt;li&gt;why recovery didn’t happen automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only then do we change the flow.&lt;/p&gt;

&lt;p&gt;The interesting part is that debugging these issues slowly changes how you design systems.&lt;/p&gt;

&lt;p&gt;You stop building only for normal operation.&lt;/p&gt;

&lt;p&gt;You start building for investigation too.&lt;/p&gt;

&lt;p&gt;Because eventually, every long-running system develops behaviors you didn’t predict.&lt;/p&gt;

&lt;p&gt;At BrainPack, a lot of debugging work involves understanding interactions between systems that only fail under very specific timing conditions. The more AI workflows and automations are layered on top, the more important observability and recoverability become.&lt;/p&gt;

</description>
      <category>backend</category>
    </item>
  </channel>
</rss>
