<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dhruvi</title>
    <description>The latest articles on DEV Community by Dhruvi (@dhruvi_21).</description>
    <link>https://dev.to/dhruvi_21</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3894569%2Fe31cc617-f38a-4448-a25e-dfb161e3364d.png</url>
      <title>DEV Community: Dhruvi</title>
      <link>https://dev.to/dhruvi_21</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dhruvi_21"/>
    <language>en</language>
    <item>
      <title>What Documentation Looks Like in a Permanently Operated System</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Thu, 04 Jun 2026 12:39:15 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/what-documentation-looks-like-in-a-permanently-operated-system-1gja</link>
      <guid>https://dev.to/dhruvi_21/what-documentation-looks-like-in-a-permanently-operated-system-1gja</guid>
      <description>&lt;p&gt;I used to think documentation was mostly for onboarding.&lt;/p&gt;

&lt;p&gt;A way to help new developers understand the system.&lt;/p&gt;

&lt;p&gt;That's part of it.&lt;/p&gt;

&lt;p&gt;But when you're operating systems continuously, documentation becomes something else entirely.&lt;/p&gt;

&lt;p&gt;It's operational infrastructure.&lt;/p&gt;

&lt;p&gt;The biggest misconception about documentation is that it's about explaining how things were built.&lt;/p&gt;

&lt;p&gt;Most of the time, that's not what people need.&lt;/p&gt;

&lt;p&gt;What they need is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how does this actually work today?&lt;/li&gt;
&lt;li&gt;what happens if it fails?&lt;/li&gt;
&lt;li&gt;who depends on it?&lt;/li&gt;
&lt;li&gt;what should happen next?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One thing I learned pretty quickly:&lt;/p&gt;

&lt;p&gt;Nobody reads long documentation during an incident.&lt;/p&gt;

&lt;p&gt;If something breaks, people need answers fast.&lt;/p&gt;

&lt;p&gt;So the most useful documentation is usually the simplest.&lt;/p&gt;

&lt;p&gt;Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;workflow diagrams&lt;/li&gt;
&lt;li&gt;system dependencies&lt;/li&gt;
&lt;li&gt;retry behavior&lt;/li&gt;
&lt;li&gt;recovery steps&lt;/li&gt;
&lt;li&gt;known failure points&lt;/li&gt;
&lt;/ul&gt;
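
&lt;p&gt;As a rough illustration, the items in that list can be captured as a small structured record instead of a long page. This is only a sketch: &lt;code&gt;RunbookEntry&lt;/code&gt;, its field names, and &lt;code&gt;missing_sections&lt;/code&gt; are hypothetical names chosen for this example, not a format we necessarily use.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class RunbookEntry:
    """One short page of operational documentation for a single workflow.

    All field names are hypothetical and only mirror the list above.
    """
    workflow: str
    depends_on: list = field(default_factory=list)
    retry_behavior: str = ""
    recovery_steps: list = field(default_factory=list)
    known_failure_points: list = field(default_factory=list)

    def missing_sections(self):
        """Return the names of sections that are still empty."""
        sections = {
            "depends_on": self.depends_on,
            "retry_behavior": self.retry_behavior,
            "recovery_steps": self.recovery_steps,
            "known_failure_points": self.known_failure_points,
        }
        return [name for name, value in sections.items() if not value]


entry = RunbookEntry(
    workflow="order-sync",
    depends_on=["erp"],
    retry_behavior="3 attempts, then manual review",
)
print(entry.missing_sections())
```

&lt;p&gt;A check like &lt;code&gt;missing_sections&lt;/code&gt; is one cheap way to notice which answers nobody has written down yet.&lt;/p&gt;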

&lt;p&gt;Another thing that changes in long-running systems:&lt;/p&gt;

&lt;p&gt;Documentation can't be static.&lt;/p&gt;

&lt;p&gt;The system evolves.&lt;/p&gt;

&lt;p&gt;Integrations change.&lt;/p&gt;

&lt;p&gt;Business processes change.&lt;/p&gt;

&lt;p&gt;Automations get added.&lt;/p&gt;

&lt;p&gt;If documentation doesn't evolve too, it slowly becomes misleading.&lt;/p&gt;

&lt;p&gt;And outdated documentation is often worse than no documentation at all.&lt;/p&gt;

&lt;p&gt;The documentation I use most is rarely technical.&lt;/p&gt;

&lt;p&gt;It's operational.&lt;/p&gt;

&lt;p&gt;Questions like:&lt;/p&gt;

&lt;p&gt;Why does this process exist?&lt;/p&gt;

&lt;p&gt;What happens if this service is unavailable?&lt;/p&gt;

&lt;p&gt;Which systems depend on this workflow?&lt;/p&gt;

&lt;p&gt;Those answers save more time than implementation details.&lt;/p&gt;

&lt;p&gt;One thing I appreciate now:&lt;/p&gt;

&lt;p&gt;Good documentation reduces dependency on specific people.&lt;/p&gt;

&lt;p&gt;Without it, knowledge gets trapped.&lt;/p&gt;

&lt;p&gt;One person knows how something works.&lt;/p&gt;

&lt;p&gt;One person knows how to recover it.&lt;/p&gt;

&lt;p&gt;One person knows why it was built that way.&lt;/p&gt;

&lt;p&gt;That's a risk.&lt;/p&gt;

&lt;p&gt;In systems that run continuously, documentation is less about explaining code and more about preserving operational knowledge.&lt;/p&gt;

&lt;p&gt;Because eventually, everyone forgets why a decision was made.&lt;/p&gt;

&lt;p&gt;The documentation is what remains.&lt;/p&gt;

&lt;p&gt;At BrainPack, a lot of the systems we operate involve multiple integrations, workflows, and AI layers running together. Good documentation helps turn individual knowledge into infrastructure that the entire team can rely on over time.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>softwareengineering</category>
      <category>sre</category>
      <category>writing</category>
    </item>
    <item>
      <title>What Building Software That Runs 24/7 Actually Means Day to Day</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Wed, 03 Jun 2026 13:04:16 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/what-building-software-that-runs-247-actually-means-day-to-day-14fl</link>
      <guid>https://dev.to/dhruvi_21/what-building-software-that-runs-247-actually-means-day-to-day-14fl</guid>
      <description>&lt;p&gt;When people hear that a system runs 24/7, they usually think about uptime.&lt;/p&gt;

&lt;p&gt;Servers running.&lt;/p&gt;

&lt;p&gt;Services responding.&lt;/p&gt;

&lt;p&gt;No outages.&lt;/p&gt;

&lt;p&gt;But day to day, that's not what I spend most of my time thinking about.&lt;/p&gt;

&lt;p&gt;What I actually think about is:&lt;/p&gt;

&lt;p&gt;What happens at 2:13 AM when something unexpected occurs?&lt;/p&gt;

&lt;p&gt;Because eventually, it will.&lt;/p&gt;

&lt;p&gt;A queue gets stuck.&lt;/p&gt;

&lt;p&gt;A third-party API slows down.&lt;/p&gt;

&lt;p&gt;A workflow starts behaving differently.&lt;/p&gt;

&lt;p&gt;A retry arrives hours later than expected.&lt;/p&gt;

&lt;p&gt;The interesting part is that most problems aren't dramatic.&lt;/p&gt;

&lt;p&gt;The system doesn't crash.&lt;/p&gt;

&lt;p&gt;It keeps running.&lt;/p&gt;

&lt;p&gt;Just slightly wrong.&lt;/p&gt;

&lt;p&gt;And those are often the hardest issues to catch.&lt;/p&gt;

&lt;p&gt;Building software that runs continuously means caring about things that demos never show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;recovery&lt;/li&gt;
&lt;li&gt;observability&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;data consistency&lt;/li&gt;
&lt;li&gt;failure handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not because they're exciting.&lt;/p&gt;

&lt;p&gt;Because they become important every single day.&lt;/p&gt;

&lt;p&gt;One thing I learned pretty quickly:&lt;/p&gt;

&lt;p&gt;The goal isn't building a system that never fails.&lt;/p&gt;

&lt;p&gt;The goal is building a system that can recover without someone jumping in every time.&lt;/p&gt;

&lt;p&gt;If a process gets stuck, can it restart?&lt;/p&gt;

&lt;p&gt;If an API fails, can it retry safely?&lt;/p&gt;

&lt;p&gt;If data arrives late, can the workflow still complete correctly?&lt;/p&gt;

&lt;p&gt;Those questions matter more than most features.&lt;/p&gt;
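
&lt;p&gt;For the retry question, here is a minimal sketch of retrying with exponential backoff. &lt;code&gt;TransientError&lt;/code&gt; and &lt;code&gt;call_with_retry&lt;/code&gt; are hypothetical names for this example, and the delays are arbitrary. It only counts as retrying safely if the operation itself is idempotent, meaning a repeated call does not repeat the side effect.&lt;/p&gt;

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

&lt;p&gt;The &lt;code&gt;sleep&lt;/code&gt; parameter is there so the waiting can be replaced in tests; in normal use the default &lt;code&gt;time.sleep&lt;/code&gt; applies.&lt;/p&gt;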

&lt;p&gt;Another reality is that software running 24/7 creates a different relationship with technical decisions.&lt;/p&gt;

&lt;p&gt;Small shortcuts last a long time.&lt;/p&gt;

&lt;p&gt;Small bugs eventually surface.&lt;/p&gt;

&lt;p&gt;Small assumptions eventually get tested.&lt;/p&gt;

&lt;p&gt;The system has a lot of time to find weaknesses.&lt;/p&gt;

&lt;p&gt;What surprised me most is how much of the work is actually about predictability.&lt;/p&gt;

&lt;p&gt;Not speed.&lt;/p&gt;

&lt;p&gt;Not new features.&lt;/p&gt;

&lt;p&gt;Predictability.&lt;/p&gt;

&lt;p&gt;Knowing how the system behaves when things go right and when they don't.&lt;/p&gt;

&lt;p&gt;Because people eventually start depending on that behavior.&lt;/p&gt;

&lt;p&gt;Building software that runs continuously has changed how I think about engineering. The feature is only the beginning. The real work starts when the system has to keep doing its job reliably every hour of every day.&lt;/p&gt;

&lt;p&gt;This is the reality of a lot of the systems we operate at BrainPack. Once enterprise workflows and AI automations are running continuously, reliability becomes less about uptime and more about making sure the system behaves predictably under real-world conditions.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>sre</category>
    </item>
    <item>
      <title>The Hardest Part of Integrating with Legacy ERPs Is Not the API</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Mon, 01 Jun 2026 13:03:15 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/the-hardest-part-of-integrating-with-legacy-erps-is-not-the-api-52fh</link>
      <guid>https://dev.to/dhruvi_21/the-hardest-part-of-integrating-with-legacy-erps-is-not-the-api-52fh</guid>
      <description>&lt;p&gt;When people hear "ERP integration," they usually think the difficult part is the API.&lt;/p&gt;

&lt;p&gt;Sometimes there isn't even an API.&lt;/p&gt;

&lt;p&gt;But honestly, that's rarely the biggest problem.&lt;/p&gt;

&lt;p&gt;The harder problem is understanding how the business actually uses the system.&lt;/p&gt;

&lt;p&gt;Because what the ERP does and what the organization does are often two different things.&lt;/p&gt;

&lt;p&gt;I've seen situations where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a field means one thing in the documentation and something completely different in practice&lt;/li&gt;
&lt;li&gt;a workflow exists because someone created a workaround five years ago&lt;/li&gt;
&lt;li&gt;critical business rules live in spreadsheets instead of the ERP&lt;/li&gt;
&lt;li&gt;important decisions happen outside the system entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Technically, the integration works.&lt;/p&gt;

&lt;p&gt;Operationally, it's wrong.&lt;/p&gt;

&lt;p&gt;This is where many integration projects get stuck.&lt;/p&gt;

&lt;p&gt;The data moves successfully.&lt;/p&gt;

&lt;p&gt;The API calls succeed.&lt;/p&gt;

&lt;p&gt;The sync completes.&lt;/p&gt;

&lt;p&gt;And yet users immediately tell you something is broken.&lt;/p&gt;

&lt;p&gt;Because the integration followed the system.&lt;/p&gt;

&lt;p&gt;Not the business process.&lt;/p&gt;

&lt;p&gt;One thing I learned early on:&lt;/p&gt;

&lt;p&gt;Never assume the ERP is the source of truth for how work gets done.&lt;/p&gt;

&lt;p&gt;It's often just one piece of a much larger process.&lt;/p&gt;

&lt;p&gt;The real workflow usually spans:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ERP systems&lt;/li&gt;
&lt;li&gt;spreadsheets&lt;/li&gt;
&lt;li&gt;emails&lt;/li&gt;
&lt;li&gt;manual approvals&lt;/li&gt;
&lt;li&gt;undocumented habits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The API is a technical challenge.&lt;/p&gt;

&lt;p&gt;Understanding operational behavior is a people challenge.&lt;/p&gt;

&lt;p&gt;And in my experience, the people challenge takes longer.&lt;/p&gt;

&lt;p&gt;The best integrations I've worked on started with questions like:&lt;/p&gt;

&lt;p&gt;Why does this process exist?&lt;/p&gt;

&lt;p&gt;Who actually uses this field?&lt;/p&gt;

&lt;p&gt;What happens if this step is skipped?&lt;/p&gt;

&lt;p&gt;Those conversations usually uncover more than the technical documentation ever does.&lt;/p&gt;

&lt;p&gt;The interesting part is that older systems are often doing exactly what they were designed to do.&lt;/p&gt;

&lt;p&gt;The complexity comes from everything built around them over the years.&lt;/p&gt;

&lt;p&gt;This is something we see regularly at BrainPack when connecting existing enterprise systems to modern workflows and AI capabilities. The technical connection is usually the easy part. Understanding how the organization actually operates is where most of the real integration work happens.&lt;/p&gt;

</description>
      <category>api</category>
      <category>discuss</category>
      <category>softwareengineering</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Something Honest About Being a Developer on This Kind of Team</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Thu, 28 May 2026 12:38:43 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/something-honest-about-being-a-developer-on-this-kind-of-team-5ehp</link>
      <guid>https://dev.to/dhruvi_21/something-honest-about-being-a-developer-on-this-kind-of-team-5ehp</guid>
      <description>&lt;p&gt;One thing I didn’t expect working on systems like this:&lt;/p&gt;

&lt;p&gt;A lot of the job is uncertainty.&lt;/p&gt;

&lt;p&gt;Not coding itself.&lt;/p&gt;

&lt;p&gt;Uncertainty.&lt;/p&gt;

&lt;p&gt;You’re constantly working across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;systems you didn’t build&lt;/li&gt;
&lt;li&gt;workflows nobody fully documented&lt;/li&gt;
&lt;li&gt;integrations that behave differently under production load&lt;/li&gt;
&lt;li&gt;business logic hidden inside years of habits and manual processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes the hardest part is simply figuring out what is actually happening.&lt;/p&gt;

&lt;p&gt;Another thing people don’t talk about enough:&lt;/p&gt;

&lt;p&gt;You rarely get the satisfaction of “finished.”&lt;/p&gt;

&lt;p&gt;Because the systems keep evolving.&lt;/p&gt;

&lt;p&gt;You fix one workflow.&lt;/p&gt;

&lt;p&gt;Then another dependency appears.&lt;/p&gt;

&lt;p&gt;You stabilize one integration.&lt;/p&gt;

&lt;p&gt;Then business priorities change and the flow changes again.&lt;/p&gt;

&lt;p&gt;The system keeps moving underneath you.&lt;/p&gt;

&lt;p&gt;There’s also a different kind of pressure when systems run continuously.&lt;/p&gt;

&lt;p&gt;You know real operations depend on them.&lt;/p&gt;

&lt;p&gt;If something breaks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;people stop receiving orders&lt;/li&gt;
&lt;li&gt;workflows stop moving&lt;/li&gt;
&lt;li&gt;teams lose visibility&lt;/li&gt;
&lt;li&gt;data becomes unreliable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It changes how carefully you think about small decisions.&lt;/p&gt;

&lt;p&gt;At the same time, this kind of work made me much calmer technically.&lt;/p&gt;

&lt;p&gt;You stop panicking when things fail.&lt;/p&gt;

&lt;p&gt;Because eventually you realize:&lt;br&gt;
production systems always fail somewhere.&lt;/p&gt;

&lt;p&gt;The goal is not perfection.&lt;/p&gt;

&lt;p&gt;The goal is building systems that recover safely and predictably.&lt;/p&gt;

&lt;p&gt;One thing I genuinely like about this work though:&lt;/p&gt;

&lt;p&gt;You get very close to how businesses actually operate.&lt;/p&gt;

&lt;p&gt;Not the clean diagrams.&lt;/p&gt;

&lt;p&gt;The real workflows.&lt;/p&gt;

&lt;p&gt;The weird edge cases.&lt;/p&gt;

&lt;p&gt;The manual processes people invented just to keep things moving.&lt;/p&gt;

&lt;p&gt;You learn quickly that software is usually less about code and more about understanding operational behavior.&lt;/p&gt;

&lt;p&gt;Working at BrainPack exposed me to how complex enterprise environments actually are once multiple systems, teams, and AI workflows start interacting. Most of the engineering work is not about building isolated features; it’s about making entire operational flows stable enough to trust long term.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A Technical Problem I Worked On This Week</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Fri, 22 May 2026 12:43:15 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/a-technical-problem-i-worked-on-this-week-1fle</link>
      <guid>https://dev.to/dhruvi_21/a-technical-problem-i-worked-on-this-week-1fle</guid>
      <description>&lt;p&gt;This week, I spent more time than expected debugging something that looked simple.&lt;/p&gt;

&lt;p&gt;Data syncing between two systems.&lt;/p&gt;

&lt;p&gt;One side said the record was updated.&lt;/p&gt;

&lt;p&gt;The other side disagreed.&lt;/p&gt;

&lt;p&gt;No errors.&lt;/p&gt;

&lt;p&gt;No failed requests.&lt;/p&gt;

&lt;p&gt;Everything looked normal.&lt;/p&gt;

&lt;p&gt;Which usually means the problem is not obvious.&lt;/p&gt;

&lt;p&gt;The issue ended up being timing.&lt;/p&gt;

&lt;p&gt;One system updated immediately.&lt;/p&gt;

&lt;p&gt;The other processed updates through a delayed workflow.&lt;/p&gt;

&lt;p&gt;Most of the time it worked.&lt;/p&gt;

&lt;p&gt;Sometimes updates arrived in a different order.&lt;/p&gt;

&lt;p&gt;Which created small inconsistencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;outdated values appearing temporarily&lt;/li&gt;
&lt;li&gt;automation triggering from stale information&lt;/li&gt;
&lt;li&gt;users seeing different states in different systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difficult part was that it only happened occasionally.&lt;/p&gt;

&lt;p&gt;So reproducing it locally was almost impossible.&lt;/p&gt;

&lt;p&gt;The fix itself was not complicated.&lt;/p&gt;

&lt;p&gt;We changed how updates were processed.&lt;/p&gt;

&lt;p&gt;Instead of assuming data arrives in the right order, we added validation around state changes before applying updates.&lt;/p&gt;
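
&lt;p&gt;One common way to do this kind of validation is a version check: every update carries a version number, and an update is applied only if it is newer than what is already stored. The sketch below assumes that approach. The &lt;code&gt;version&lt;/code&gt; field, the dictionary store, and &lt;code&gt;apply_update&lt;/code&gt; are assumptions made for illustration, not the exact check we shipped.&lt;/p&gt;

```python
def apply_update(store, update):
    """Apply an update only if it is newer than the stored record.

    store maps record id -> {"version": int, "data": dict}.
    The version field is an assumption made for this sketch.
    """
    current = store.get(update["id"])
    if current is not None and current["version"] >= update["version"]:
        return False  # stale or duplicate update: leave the record untouched
    store[update["id"]] = {"version": update["version"], "data": update["data"]}
    return True


store = {}
# Version 2 arrives first, then the delayed version 1 shows up late.
apply_update(store, {"id": "r1", "version": 2, "data": {"status": "shipped"}})
apply_update(store, {"id": "r1", "version": 1, "data": {"status": "packed"}})
print(store["r1"]["data"]["status"])
```

&lt;p&gt;The late update is dropped instead of overwriting the newer state, so the record stays on the latest value.&lt;/p&gt;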

&lt;p&gt;Small change.&lt;/p&gt;

&lt;p&gt;Big difference.&lt;/p&gt;

&lt;p&gt;One thing I keep learning while working on systems that run continuously:&lt;/p&gt;

&lt;p&gt;A lot of problems are not caused by failures.&lt;/p&gt;

&lt;p&gt;They come from assumptions.&lt;/p&gt;

&lt;p&gt;Assuming systems process things instantly.&lt;/p&gt;

&lt;p&gt;Assuming updates arrive in order.&lt;/p&gt;

&lt;p&gt;Assuming timing stays consistent.&lt;/p&gt;

&lt;p&gt;Production environments break assumptions very quickly.&lt;/p&gt;

&lt;p&gt;The interesting part about enterprise systems is that most technical problems are not isolated.&lt;/p&gt;

&lt;p&gt;One small inconsistency spreads.&lt;/p&gt;

&lt;p&gt;An automation behaves differently.&lt;/p&gt;

&lt;p&gt;A report becomes inaccurate.&lt;/p&gt;

&lt;p&gt;Another system trusts the wrong data.&lt;/p&gt;

&lt;p&gt;Small problems travel.&lt;/p&gt;

&lt;p&gt;This is something we deal with regularly at BrainPack while connecting systems that were never originally designed to operate together. AI workflows become much more reliable once the underlying data movement becomes predictable.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>devjournal</category>
      <category>distributedsystems</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>The Difference Between Building for Demo and Building for Production</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Wed, 20 May 2026 12:45:36 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/the-difference-between-building-for-demo-and-building-for-production-52e7</link>
      <guid>https://dev.to/dhruvi_21/the-difference-between-building-for-demo-and-building-for-production-52e7</guid>
      <description>&lt;p&gt;A lot of software looks great in demos.&lt;/p&gt;

&lt;p&gt;Clean data.&lt;br&gt;
Fast responses.&lt;br&gt;
Perfect flow.&lt;/p&gt;

&lt;p&gt;Production is where reality shows up.&lt;/p&gt;

&lt;p&gt;A demo assumes everything behaves correctly.&lt;/p&gt;

&lt;p&gt;Production assumes eventually something will break.&lt;/p&gt;

&lt;p&gt;That changes how you build.&lt;/p&gt;

&lt;p&gt;For demos:&lt;/p&gt;

&lt;p&gt;You optimize for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;speed&lt;/li&gt;
&lt;li&gt;presentation&lt;/li&gt;
&lt;li&gt;showing capability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For production:&lt;/p&gt;

&lt;p&gt;You optimize for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;failure recovery&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;monitoring&lt;/li&gt;
&lt;li&gt;stability&lt;/li&gt;
&lt;li&gt;weird edge cases nobody planned for&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A demo works when everything goes right.&lt;/p&gt;

&lt;p&gt;Production works when things go wrong.&lt;/p&gt;

&lt;p&gt;One thing I noticed early on:&lt;/p&gt;

&lt;p&gt;Demo environments are predictable.&lt;/p&gt;

&lt;p&gt;Production environments are messy.&lt;/p&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;duplicate events&lt;/li&gt;
&lt;li&gt;incomplete data&lt;/li&gt;
&lt;li&gt;slow third-party systems&lt;/li&gt;
&lt;li&gt;retries arriving late&lt;/li&gt;
&lt;li&gt;users doing things you never expected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code that looked perfect in testing suddenly behaves very differently.&lt;/p&gt;
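
&lt;p&gt;Duplicate events are a good example. A minimal sketch of one defense, assuming every event carries a unique id: remember the ids already processed and skip repeats. &lt;code&gt;IdempotentHandler&lt;/code&gt; and the &lt;code&gt;event_id&lt;/code&gt; field are hypothetical names, and the in-memory set stands in for whatever durable store a real system would need.&lt;/p&gt;

```python
class IdempotentHandler:
    """Wrap a handler so that each event id is processed at most once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # in-memory only; a real system would persist this

    def handle(self, event):
        event_id = event["event_id"]  # hypothetical field name
        if event_id in self._seen:
            return False  # duplicate delivery: skip it
        self._handler(event)
        # Mark as seen only after the handler succeeded, so a failed
        # attempt can still be retried later.
        self._seen.add(event_id)
        return True


processed = []
handler = IdempotentHandler(processed.append)
handler.handle({"event_id": "e1", "type": "order.created"})
handler.handle({"event_id": "e1", "type": "order.created"})  # redelivered
print(len(processed))
```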

&lt;p&gt;Another difference:&lt;/p&gt;

&lt;p&gt;Demo code usually answers:&lt;/p&gt;

&lt;p&gt;"Can we do this?"&lt;/p&gt;

&lt;p&gt;Production code answers:&lt;/p&gt;

&lt;p&gt;"Can this keep working for months while real people depend on it?"&lt;/p&gt;

&lt;p&gt;Very different problem.&lt;/p&gt;

&lt;p&gt;One thing that changed how I build systems:&lt;/p&gt;

&lt;p&gt;I stopped asking:&lt;/p&gt;

&lt;p&gt;"Does this work?"&lt;/p&gt;

&lt;p&gt;Now I ask:&lt;/p&gt;

&lt;p&gt;"What happens when this fails?"&lt;/p&gt;

&lt;p&gt;Because eventually it will.&lt;/p&gt;

&lt;p&gt;The question is whether the system recovers safely.&lt;/p&gt;

&lt;p&gt;A lot of engineering work happens after the feature already works.&lt;/p&gt;

&lt;p&gt;Observability. Recovery. Reliability.&lt;/p&gt;

&lt;p&gt;The things users never notice.&lt;/p&gt;

&lt;p&gt;Until they stop existing.&lt;/p&gt;

&lt;p&gt;This comes up constantly at BrainPack while operating systems that run continuously across enterprise environments. Layering AI on top becomes much easier once the underlying infrastructure is designed for production conditions instead of demo conditions.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>A Small Fix That Helped a Live Deployment Immediately</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Tue, 19 May 2026 12:44:30 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/a-small-fix-that-helped-a-live-deployment-immediately-1hng</link>
      <guid>https://dev.to/dhruvi_21/a-small-fix-that-helped-a-live-deployment-immediately-1hng</guid>
      <description>&lt;p&gt;One of the most useful fixes I worked on recently was not complicated at all.&lt;/p&gt;

&lt;p&gt;We added a queue between two systems that were talking to each other directly.&lt;/p&gt;

&lt;p&gt;That was it.&lt;/p&gt;

&lt;p&gt;Before that, everything worked fine most of the time.&lt;/p&gt;

&lt;p&gt;Until traffic increased or one system slowed down for a few seconds.&lt;/p&gt;

&lt;p&gt;Then things started piling up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;requests timing out&lt;/li&gt;
&lt;li&gt;retries triggering&lt;/li&gt;
&lt;li&gt;duplicate operations&lt;/li&gt;
&lt;li&gt;random failures appearing across workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem was that both systems expected immediate responses from each other.&lt;/p&gt;

&lt;p&gt;So when one slowed down, the other started failing too.&lt;/p&gt;

&lt;p&gt;Classic cascading failure.&lt;/p&gt;

&lt;p&gt;The fix was surprisingly small.&lt;/p&gt;

&lt;p&gt;Instead of:&lt;br&gt;
System A → direct request → System B&lt;/p&gt;

&lt;p&gt;We changed it to:&lt;br&gt;
System A → queue → System B&lt;/p&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;requests could wait safely&lt;/li&gt;
&lt;li&gt;retries became manageable&lt;/li&gt;
&lt;li&gt;temporary slowdowns stopped affecting the entire flow&lt;/li&gt;
&lt;/ul&gt;
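
&lt;p&gt;To make the shape of the change concrete, here is a small in-process sketch. Python's standard &lt;code&gt;queue.Queue&lt;/code&gt; stands in for whatever message queue sits between the two systems, and &lt;code&gt;start_worker&lt;/code&gt; is a hypothetical helper playing the role of System B. The real setup is not shown here; this only illustrates the decoupling.&lt;/p&gt;

```python
import queue
import threading


def start_worker(jobs, handle):
    """Start a background thread that takes jobs off the queue one by one."""
    def run():
        while True:
            job = jobs.get()
            if job is None:  # sentinel value: stop the worker
                jobs.task_done()
                break
            try:
                handle(job)
            finally:
                jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


jobs = queue.Queue()
results = []
worker = start_worker(jobs, results.append)  # "System B" consumes at its own pace

# "System A" enqueues and moves on without waiting for a response.
for order_id in (101, 102, 103):
    jobs.put(order_id)

jobs.put(None)
jobs.join()    # block until every queued item has been handled
worker.join()
print(results)
```

&lt;p&gt;The producer never waits on the consumer, so a slow consumer only makes the queue longer instead of making the producer fail.&lt;/p&gt;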

&lt;p&gt;The deployment stabilized almost immediately.&lt;/p&gt;

&lt;p&gt;What I liked about this fix is that it changed the behavior of the system more than the complexity of the code.&lt;/p&gt;

&lt;p&gt;No massive rewrite.&lt;br&gt;
No new infrastructure layer.&lt;/p&gt;

&lt;p&gt;Just removing the assumption that everything has to happen instantly.&lt;/p&gt;

&lt;p&gt;A lot of production issues come from systems being too tightly coupled.&lt;/p&gt;

&lt;p&gt;One delay becomes everybody’s problem.&lt;/p&gt;

&lt;p&gt;Queues don’t remove failures.&lt;/p&gt;

&lt;p&gt;They absorb pressure long enough for the rest of the system to keep operating normally.&lt;/p&gt;

&lt;p&gt;One thing I learned working on live systems:&lt;/p&gt;

&lt;p&gt;Performance issues are often really coordination issues.&lt;/p&gt;

&lt;p&gt;The systems themselves are usually capable.&lt;/p&gt;

&lt;p&gt;They just fail because everything depends on perfect timing.&lt;/p&gt;

&lt;p&gt;This is something we run into constantly at BrainPack while integrating multiple enterprise systems and AI workflows together. A lot of stability comes from reducing tight coupling between systems so temporary failures don’t spread across the entire infrastructure.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>deployment</category>
      <category>engineering</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How We Debug Issues That Only Happen Once Every Few Days</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Fri, 15 May 2026 12:45:23 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/how-we-debug-issues-that-only-happen-once-every-few-days-22kd</link>
      <guid>https://dev.to/dhruvi_21/how-we-debug-issues-that-only-happen-once-every-few-days-22kd</guid>
      <description>&lt;p&gt;The hardest bugs are not the ones that happen constantly.&lt;/p&gt;

&lt;p&gt;The hardest ones are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;once every few days&lt;/li&gt;
&lt;li&gt;under unknown conditions&lt;/li&gt;
&lt;li&gt;with no obvious pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Especially in systems that run continuously.&lt;/p&gt;

&lt;p&gt;Because by the time you notice the issue, the original state is already gone.&lt;/p&gt;

&lt;p&gt;Early on, I used to approach these bugs the wrong way.&lt;/p&gt;

&lt;p&gt;I would immediately start reading logs and trying to reproduce the issue locally.&lt;/p&gt;

&lt;p&gt;Most of the time, that went nowhere.&lt;/p&gt;

&lt;p&gt;Because these problems usually depend on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timing&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;load&lt;/li&gt;
&lt;li&gt;specific data states&lt;/li&gt;
&lt;li&gt;interactions between systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Things that almost never exist in your local environment the same way.&lt;/p&gt;

&lt;p&gt;What changed for me was realizing:&lt;/p&gt;

&lt;p&gt;The goal is not “find the bug immediately.”&lt;/p&gt;

&lt;p&gt;The goal is:&lt;br&gt;
make the system observable enough that the bug exposes itself next time.&lt;/p&gt;

&lt;p&gt;So instead of guessing, we start adding visibility around the problem.&lt;/p&gt;

&lt;p&gt;Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tracking state transitions&lt;/li&gt;
&lt;li&gt;storing retry history&lt;/li&gt;
&lt;li&gt;recording execution timing&lt;/li&gt;
&lt;li&gt;correlating events across systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not permanent debugging noise.&lt;/p&gt;

&lt;p&gt;Just enough context to reconstruct what actually happened later.&lt;/p&gt;
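
&lt;p&gt;A minimal sketch of what that can look like, assuming each request or workflow run carries a correlation id across systems. The function names, the field names, and the plain list used as a log are all hypothetical; a real setup would write to whatever log store is already in place.&lt;/p&gt;

```python
import json
import time


def record_event(log, correlation_id, system, state, attempt=1, clock=time.time):
    """Append one structured event so the sequence can be rebuilt later."""
    entry = {
        "ts": clock(),
        "correlation_id": correlation_id,
        "system": system,
        "state": state,
        "attempt": attempt,
    }
    log.append(json.dumps(entry))
    return entry


def timeline(log, correlation_id):
    """Return the events for one correlation id, oldest first."""
    events = [json.loads(line) for line in log]
    matching = [e for e in events if e["correlation_id"] == correlation_id]
    return sorted(matching, key=lambda e: e["ts"])
```

&lt;p&gt;Calling &lt;code&gt;timeline(log, some_id)&lt;/code&gt; after the fact gives the state transitions, retry attempts, and timing for a single run in one place, which is usually what is missing when the bug has already disappeared.&lt;/p&gt;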

&lt;p&gt;Another thing I learned:&lt;/p&gt;

&lt;p&gt;Rare bugs are often not random.&lt;/p&gt;

&lt;p&gt;They usually happen when multiple small conditions align:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a delayed queue&lt;/li&gt;
&lt;li&gt;a retry arriving late&lt;/li&gt;
&lt;li&gt;stale data&lt;/li&gt;
&lt;li&gt;another service slowing down&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Individually, nothing breaks.&lt;/p&gt;

&lt;p&gt;Together, something weird appears for 30 seconds and disappears again.&lt;/p&gt;

&lt;p&gt;One mistake I made a lot before:&lt;/p&gt;

&lt;p&gt;Trying to “fix” the issue too early.&lt;/p&gt;

&lt;p&gt;When you don’t fully understand intermittent bugs, quick fixes usually just hide the symptom temporarily.&lt;/p&gt;

&lt;p&gt;So now I spend more time understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what sequence created the issue&lt;/li&gt;
&lt;li&gt;what state the system was in&lt;/li&gt;
&lt;li&gt;why recovery didn’t happen automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only then do we change the flow.&lt;/p&gt;

&lt;p&gt;The interesting part is that debugging these issues slowly changes how you design systems.&lt;/p&gt;

&lt;p&gt;You stop building only for normal operation.&lt;/p&gt;

&lt;p&gt;You start building for investigation too.&lt;/p&gt;

&lt;p&gt;Because eventually, every long-running system develops behaviors you didn’t predict.&lt;/p&gt;

&lt;p&gt;At BrainPack, a lot of debugging work involves understanding interactions between systems that only fail under very specific timing conditions. The more AI workflows and automations are layered on top, the more important observability and recoverability become.&lt;/p&gt;

</description>
      <category>backend</category>
    </item>
    <item>
      <title>A Tool That Saves Me Time Every Single Week</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Mon, 11 May 2026 12:50:54 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/a-tool-that-saves-me-time-every-single-week-219j</link>
      <guid>https://dev.to/dhruvi_21/a-tool-that-saves-me-time-every-single-week-219j</guid>
      <description>&lt;p&gt;One thing that saves me an absurd amount of time is building small internal debugging endpoints.&lt;/p&gt;

&lt;p&gt;Not dashboards.&lt;br&gt;
Not full admin panels.&lt;/p&gt;

&lt;p&gt;Just tiny routes or tools that answer very specific questions fast.&lt;/p&gt;

&lt;p&gt;Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“show the last sync status for this customer”&lt;/li&gt;
&lt;li&gt;“replay this failed webhook”&lt;/li&gt;
&lt;li&gt;“show all retries for this workflow”&lt;/li&gt;
&lt;li&gt;“compare the data between these two systems”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Early on, I used to debug everything directly from logs and databases.&lt;/p&gt;

&lt;p&gt;It worked when systems were smaller.&lt;/p&gt;

&lt;p&gt;But once multiple services, queues, integrations, and retries are involved, simple issues start taking way too long to trace manually.&lt;/p&gt;

&lt;p&gt;So now, whenever I notice:&lt;br&gt;
“I keep checking this manually”&lt;/p&gt;

&lt;p&gt;I usually turn it into a small internal tool.&lt;/p&gt;
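
&lt;p&gt;As an illustration, "show all retries for this workflow" can be a single query wrapped in a function. This sketch uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; with an in-memory database; the &lt;code&gt;retries&lt;/code&gt; table, its columns, and &lt;code&gt;retries_for_workflow&lt;/code&gt; are hypothetical and would be replaced by whatever the real system stores.&lt;/p&gt;

```python
import sqlite3


def retries_for_workflow(conn, workflow_id):
    """Return (attempt, status, error) rows for one workflow, oldest first."""
    rows = conn.execute(
        "SELECT attempt, status, error FROM retries "
        "WHERE workflow_id = ? ORDER BY attempt",
        (workflow_id,),
    )
    return rows.fetchall()


# Demo data in an in-memory database; the retries table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE retries (workflow_id TEXT, attempt INTEGER, status TEXT, error TEXT)"
)
conn.executemany(
    "INSERT INTO retries VALUES (?, ?, ?, ?)",
    [
        ("wf-1", 2, "ok", None),
        ("wf-1", 1, "failed", "timeout"),
        ("wf-2", 1, "ok", None),
    ],
)
print(retries_for_workflow(conn, "wf-1"))
```

&lt;p&gt;Exposing a function like this behind an internal route is what turns a repeated manual database check into a tool.&lt;/p&gt;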

&lt;p&gt;The interesting part is that these tools are rarely complicated.&lt;/p&gt;

&lt;p&gt;Sometimes it’s:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one endpoint&lt;/li&gt;
&lt;li&gt;one query&lt;/li&gt;
&lt;li&gt;one button&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But removing 20 minutes of repeated investigation every day adds up fast.&lt;/p&gt;

&lt;p&gt;Especially in systems that run continuously.&lt;/p&gt;

&lt;p&gt;Another thing I realized:&lt;/p&gt;

&lt;p&gt;The best internal tools are usually built by the people operating the system directly.&lt;/p&gt;

&lt;p&gt;Because they come from real friction.&lt;/p&gt;

&lt;p&gt;Not assumptions about what might be useful.&lt;/p&gt;

&lt;p&gt;A lot of engineering time is lost not on fixing problems, but on figuring out where the problem actually is.&lt;/p&gt;

&lt;p&gt;Anything that shortens that feedback loop becomes valuable very quickly.&lt;/p&gt;

&lt;p&gt;At BrainPack, a lot of our internal tooling comes directly from operating live enterprise systems continuously. Once AI workflows and multiple integrations are involved, reducing investigation time becomes just as important as reducing failure rates.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>productivity</category>
      <category>devtools</category>
    </item>
    <item>
      <title>The Hidden Cost of “Quick Fixes” in Enterprise Systems</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Thu, 07 May 2026 13:27:45 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/the-hidden-cost-of-quick-fixes-in-enterprise-systems-258j</link>
      <guid>https://dev.to/dhruvi_21/the-hidden-cost-of-quick-fixes-in-enterprise-systems-258j</guid>
      <description>&lt;p&gt;Most enterprise systems don’t become messy all at once.&lt;/p&gt;

&lt;p&gt;They become messy one quick fix at a time.&lt;/p&gt;

&lt;p&gt;A temporary script.&lt;br&gt;
A manual spreadsheet.&lt;br&gt;
A copied database table.&lt;br&gt;
A workflow someone added “just for now.”&lt;/p&gt;

&lt;p&gt;Individually, none of these seem dangerous.&lt;/p&gt;

&lt;p&gt;But after a few years, the organization is running on layers of patches nobody fully understands anymore.&lt;/p&gt;

&lt;p&gt;The problem with quick fixes is that they solve the immediate issue while quietly increasing system complexity.&lt;/p&gt;

&lt;p&gt;And complexity compounds.&lt;/p&gt;

&lt;p&gt;What starts as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one workaround&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple duplicate processes&lt;/li&gt;
&lt;li&gt;inconsistent data&lt;/li&gt;
&lt;li&gt;hidden dependencies&lt;/li&gt;
&lt;li&gt;workflows that only one person understands&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At some point, nobody trusts the system anymore.&lt;/p&gt;

&lt;p&gt;So teams create even more manual processes to compensate.&lt;/p&gt;

&lt;p&gt;That’s usually when things start slowing down operationally.&lt;/p&gt;

&lt;p&gt;One thing I noticed working on these systems:&lt;/p&gt;

&lt;p&gt;The biggest cost is rarely technical debt itself.&lt;/p&gt;

&lt;p&gt;It’s operational uncertainty.&lt;/p&gt;

&lt;p&gt;People stop knowing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which system is correct&lt;/li&gt;
&lt;li&gt;what process is actually being used&lt;/li&gt;
&lt;li&gt;whether automations can be trusted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And once trust disappears, everything becomes slower because humans start double-checking everything manually.&lt;/p&gt;

&lt;p&gt;The tricky part is that most quick fixes are not bad decisions at the time.&lt;/p&gt;

&lt;p&gt;The business needed something fast.&lt;br&gt;
The team solved the problem.&lt;br&gt;
Everyone moved on.&lt;/p&gt;

&lt;p&gt;But systems that run continuously remember every shortcut forever.&lt;/p&gt;

&lt;p&gt;What changed how I approach this:&lt;/p&gt;

&lt;p&gt;I stopped asking:&lt;br&gt;
“does this solve the problem?”&lt;/p&gt;

&lt;p&gt;Now the question is:&lt;br&gt;
“what does this make harder six months from now?”&lt;/p&gt;

&lt;p&gt;Because in long-running systems, future complexity is usually more expensive than the original issue.&lt;/p&gt;

&lt;p&gt;A lot of the work we do at BrainPack starts with untangling years of accumulated workarounds across existing systems. AI only becomes useful once the underlying operations are predictable enough to trust again.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>discuss</category>
      <category>softwareengineering</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Why Logging Is Not Enough When You Operate Systems Continuously</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Mon, 04 May 2026 17:08:59 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/why-logging-is-not-enough-when-you-operate-systems-continuously-3k0o</link>
      <guid>https://dev.to/dhruvi_21/why-logging-is-not-enough-when-you-operate-systems-continuously-3k0o</guid>
      <description>&lt;p&gt;At some point, logs stop helping.&lt;/p&gt;

&lt;p&gt;Not because logging is bad.&lt;br&gt;
Because the system is doing too much.&lt;/p&gt;

&lt;p&gt;When you’re running something continuously, across multiple systems, logs turn into noise fast.&lt;/p&gt;

&lt;p&gt;You still log everything.&lt;br&gt;
You just can’t rely on it to understand what’s actually happening.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The expectation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Early on, logging feels like the answer.&lt;/p&gt;

&lt;p&gt;Something breaks → check logs → find the issue → fix it&lt;/p&gt;

&lt;p&gt;Clean. Linear. Works in small systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What actually happens&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In production, it looks like this:&lt;/p&gt;

&lt;p&gt;thousands of log lines per minute&lt;br&gt;
multiple services writing at the same time&lt;br&gt;
retries creating duplicate entries&lt;br&gt;
partial failures that don’t throw clear errors&lt;/p&gt;

&lt;p&gt;You open logs and see everything.&lt;/p&gt;

&lt;p&gt;Which means you see nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The real problem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Logs tell you what happened.&lt;/p&gt;

&lt;p&gt;They don’t tell you:&lt;/p&gt;

&lt;p&gt;what state the system is in&lt;br&gt;
what is currently broken&lt;br&gt;
what needs attention right now&lt;/p&gt;

&lt;p&gt;And when things run continuously, that’s what you actually need.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What we started doing instead&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We still log. But we stopped treating logs as the source of truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Track state, not just events&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Instead of just writing logs like:&lt;/p&gt;

&lt;p&gt;“order created”&lt;br&gt;
“order failed”&lt;/p&gt;

&lt;p&gt;We track:&lt;/p&gt;

&lt;p&gt;current status of the order&lt;br&gt;
where it is in the flow&lt;br&gt;
what’s pending&lt;/p&gt;

&lt;p&gt;So at any moment, we can answer:&lt;/p&gt;

&lt;p&gt;what’s stuck right now&lt;/p&gt;
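
&lt;p&gt;A minimal sketch of that idea, not our actual implementation. The &lt;code&gt;OrderTracker&lt;/code&gt; class, the status names, and the step names are all hypothetical. The point is only that the latest state is stored per order, so “what still has work pending” becomes a lookup rather than a log search.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Hypothetical statuses, chosen for illustration only.
    CREATED = "created"
    PAYMENT_PENDING = "payment_pending"
    SHIPPED = "shipped"
    FAILED = "failed"


@dataclass
class OrderState:
    order_id: str
    status: Status
    pending: list[str]  # hypothetical names of steps not yet completed


class OrderTracker:
    """Keeps the latest state per order instead of an append-only event log."""

    def __init__(self) -> None:
        self._orders: dict[str, OrderState] = {}

    def update(self, order_id: str, status: Status, pending: list[str]) -> None:
        # Overwrite the previous state: we only care where the order is now.
        self._orders[order_id] = OrderState(order_id, status, pending)

    def with_pending_work(self) -> list[OrderState]:
        # Every order that still has steps outstanding.
        return [o for o in self._orders.values() if o.pending]


tracker = OrderTracker()
tracker.update("A1", Status.SHIPPED, [])
tracker.update("B2", Status.PAYMENT_PENDING, ["charge_card", "ship"])
print([o.order_id for o in tracker.with_pending_work()])
```

&lt;p&gt;In a real system this state would live in a database rather than in memory.&lt;/p&gt;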

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Surface problems, don’t search for them&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Logs require you to go looking.&lt;/p&gt;

&lt;p&gt;In real systems, you don’t have time for that.&lt;/p&gt;

&lt;p&gt;So we build:&lt;/p&gt;

&lt;p&gt;alerts when something is off&lt;br&gt;
dashboards that show broken flows&lt;br&gt;
queues that show backlog&lt;/p&gt;

&lt;p&gt;The system tells you where to look.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Group by flow, not by line&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Logs are isolated lines.&lt;/p&gt;

&lt;p&gt;But real issues happen across a sequence.&lt;/p&gt;

&lt;p&gt;So we group things by:&lt;/p&gt;

&lt;p&gt;request&lt;br&gt;
entity&lt;br&gt;
workflow&lt;/p&gt;

&lt;p&gt;Instead of reading 100 lines, you follow one story.&lt;/p&gt;

&lt;p&gt;That’s where things start making sense again.&lt;/p&gt;
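
&lt;p&gt;As a rough illustration, assuming log records are structured and carry some shared identifier (the &lt;code&gt;request_id&lt;/code&gt;, &lt;code&gt;ts&lt;/code&gt;, and &lt;code&gt;msg&lt;/code&gt; field names here are made up), grouping by flow can be as small as this:&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical structured log records; the field names are assumptions.
records = [
    {"request_id": "r1", "ts": 1, "msg": "order received"},
    {"request_id": "r2", "ts": 2, "msg": "order received"},
    {"request_id": "r1", "ts": 3, "msg": "payment authorized"},
    {"request_id": "r2", "ts": 4, "msg": "payment timed out"},
    {"request_id": "r1", "ts": 5, "msg": "order shipped"},
]


def group_by_flow(records, key="request_id"):
    """Return a dict mapping each flow id to its messages in timestamp order."""
    flows = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["ts"]):
        flows[rec[key]].append(rec["msg"])
    return dict(flows)


flows = group_by_flow(records)
print(flows["r2"])
```

&lt;p&gt;The same function works for an entity or workflow id by passing a different &lt;code&gt;key&lt;/code&gt;.&lt;/p&gt;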

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Accept that some issues won’t be obvious&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Some problems don’t throw errors.&lt;/p&gt;

&lt;p&gt;They just… stop moving.&lt;/p&gt;

&lt;p&gt;A process gets stuck.&lt;br&gt;
A sync silently fails.&lt;/p&gt;

&lt;p&gt;Logs might show nothing critical.&lt;/p&gt;

&lt;p&gt;So you need signals like:&lt;/p&gt;

&lt;p&gt;time thresholds&lt;br&gt;
missing updates&lt;br&gt;
“this should have finished by now”&lt;/p&gt;
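
&lt;p&gt;One way to express “this should have finished by now” is a staleness check over last-update timestamps. This is a sketch under assumptions: the job names and the 15-minute threshold are invented for the example.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def find_stalled(last_update, now, max_age):
    """Return ids whose last update is older than max_age, sorted for stable output."""
    return sorted(
        job_id for job_id, ts in last_update.items() if now - ts > max_age
    )


# Fixed "now" so the example is deterministic; real code would use datetime.now().
now = datetime(2026, 5, 4, 12, 0, tzinfo=timezone.utc)
last_update = {
    "sync-crm": now - timedelta(minutes=2),       # hypothetical job, recently active
    "sync-billing": now - timedelta(minutes=45),  # hypothetical job, gone quiet
}
stalled = find_stalled(last_update, now, max_age=timedelta(minutes=15))
print(stalled)
```

&lt;p&gt;Nothing in the logs has to say “error” for this to fire. Silence is the signal.&lt;/p&gt;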

&lt;h2&gt;
  
  
  &lt;strong&gt;What changed for me&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I used to think:&lt;/p&gt;

&lt;p&gt;if it’s logged, we can debug it&lt;/p&gt;

&lt;p&gt;Now I think:&lt;/p&gt;

&lt;p&gt;if we need logs to notice something is broken, we’re already late&lt;/p&gt;

&lt;p&gt;Logs are for digging deeper.&lt;/p&gt;

&lt;p&gt;Not for discovering the problem.&lt;/p&gt;

&lt;p&gt;In systems that run all the time, you don’t watch everything manually.&lt;/p&gt;

&lt;p&gt;The system needs to show you where it’s struggling.&lt;/p&gt;

&lt;p&gt;Otherwise, you’re just scrolling and hoping you notice the right line.&lt;/p&gt;

&lt;p&gt;This is something we run into a lot at BrainPack, where multiple systems are always moving and interacting. AI workflows depend on knowing the current state of everything, not just what happened, so observability has to go beyond logs.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How We Design Systems That Keep Working Even When One Part Fails</title>
      <dc:creator>Dhruvi</dc:creator>
      <pubDate>Thu, 30 Apr 2026 13:14:46 +0000</pubDate>
      <link>https://dev.to/dhruvi_21/how-we-design-systems-that-keep-working-even-when-one-part-fails-3nmg</link>
      <guid>https://dev.to/dhruvi_21/how-we-design-systems-that-keep-working-even-when-one-part-fails-3nmg</guid>
      <description>&lt;p&gt;In real systems, something is always failing.&lt;/p&gt;

&lt;p&gt;An API times out.&lt;br&gt;
A database slows down.&lt;br&gt;
A third-party service returns garbage.&lt;/p&gt;

&lt;p&gt;If your system depends on everything working perfectly, it won’t last long in production.&lt;/p&gt;

&lt;p&gt;So the goal is not preventing failure.&lt;/p&gt;

&lt;p&gt;It’s designing so failure doesn’t break everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The wrong assumption&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A lot of systems are built like this:&lt;/p&gt;

&lt;p&gt;Step 1 → Step 2 → Step 3 → Done&lt;/p&gt;

&lt;p&gt;If Step 2 fails, the whole flow stops.&lt;/p&gt;

&lt;p&gt;In controlled environments, this works.&lt;/p&gt;

&lt;p&gt;In production, it creates fragile systems that break on the first issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What we do instead&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We design flows that can survive failure and continue.&lt;/p&gt;

&lt;p&gt;Not perfectly. But safely.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Break the dependency chain&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Instead of one long synchronous flow, we split things into independent steps.&lt;/p&gt;

&lt;p&gt;Each step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;does one thing&lt;/li&gt;
&lt;li&gt;stores its state&lt;/li&gt;
&lt;li&gt;can be retried&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if something fails, you don’t lose everything.&lt;/p&gt;

&lt;p&gt;You just retry that part.&lt;/p&gt;
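
&lt;p&gt;A simplified sketch of what that can look like. The step names and the simulated timeout are hypothetical, and the &lt;code&gt;state&lt;/code&gt; dict stands in for whatever durable store a real system would use.&lt;/p&gt;

```python
def run_flow(steps, state):
    """Run each step at most once; `state` records which steps already finished."""
    for name, func in steps:
        if state.get(name) == "done":
            continue  # finished on an earlier run, skip it
        try:
            func()
            state[name] = "done"
        except Exception as exc:
            state[name] = f"failed: {exc}"
            return False  # stop here; a later call resumes from this step
    return True


calls = []
attempts = {"charge": 0}


def reserve():
    calls.append("reserve")


def charge():
    attempts["charge"] += 1
    if attempts["charge"] == 1:
        raise RuntimeError("payment API timeout")  # simulated first-attempt failure
    calls.append("charge")


def notify():
    calls.append("notify")


steps = [("reserve", reserve), ("charge", charge), ("notify", notify)]
state = {}  # in a real system this would be persisted, not kept in memory
first_run = run_flow(steps, state)   # stops at "charge"
second_run = run_flow(steps, state)  # skips "reserve", retries "charge"
print(first_run, second_run, calls)
```

&lt;p&gt;The second run does not redo the reservation. Only the failed step and the steps after it execute.&lt;/p&gt;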

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Accept partial success&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This one is uncomfortable at first.&lt;/p&gt;

&lt;p&gt;Sometimes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;part of the system succeeds&lt;/li&gt;
&lt;li&gt;another part fails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of rolling everything back, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keep what succeeded&lt;/li&gt;
&lt;li&gt;fix what failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because in distributed systems, “all or nothing” is rarely realistic.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Make retries safe&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Failures lead to retries.&lt;/p&gt;

&lt;p&gt;Retries lead to duplication if you’re not careful.&lt;/p&gt;

&lt;p&gt;So every step needs to be safe to run again:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no duplicate records&lt;/li&gt;
&lt;li&gt;no repeated side effects&lt;/li&gt;
&lt;li&gt;no broken state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If retries are safe, failure becomes manageable.&lt;/p&gt;
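
&lt;p&gt;A common way to get there is an idempotency key: the caller sends the same key on every retry, and the receiver returns the stored result instead of repeating the side effect. The sketch below is an assumption about how that could look, with a hypothetical &lt;code&gt;PaymentStore&lt;/code&gt; standing in for a table with a unique key column.&lt;/p&gt;

```python
class PaymentStore:
    """In-memory stand-in for a table with a unique idempotency key column."""

    def __init__(self):
        self._by_key = {}

    def charge(self, idempotency_key, amount):
        # If this key was already processed, return the stored result
        # instead of performing the side effect again.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        record = {"id": len(self._by_key) + 1, "amount": amount}
        self._by_key[idempotency_key] = record
        return record

    def count(self):
        return len(self._by_key)


store = PaymentStore()
original = store.charge("order-42-charge", 100)
retried = store.charge("order-42-charge", 100)  # same key: no second charge
print(original == retried, store.count())
```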

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Isolate external dependencies&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Anything outside your control will fail eventually.&lt;/p&gt;

&lt;p&gt;So we isolate them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queues between systems&lt;/li&gt;
&lt;li&gt;timeouts and fallbacks&lt;/li&gt;
&lt;li&gt;delayed execution when needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is simple:&lt;/p&gt;

&lt;p&gt;If one system goes down, everything else should keep moving.&lt;/p&gt;
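
&lt;p&gt;Here is a small sketch of the fallback-plus-queue pattern. Everything in it is hypothetical: &lt;code&gt;fetch_rate&lt;/code&gt; simulates an external service that is down, the cached value is invented, and a &lt;code&gt;deque&lt;/code&gt; stands in for a durable queue.&lt;/p&gt;

```python
from collections import deque

retry_queue = deque()  # stand-in for a durable queue between systems


def fetch_rate(currency):
    # Hypothetical external call; it always fails here to simulate an outage.
    raise TimeoutError("rates service did not respond")


def get_rate(currency, cached_rates):
    """Try the external service; on failure, queue a retry and use the cached value."""
    try:
        return fetch_rate(currency)
    except (TimeoutError, ConnectionError):
        retry_queue.append(currency)  # refresh later instead of blocking now
        return cached_rates[currency]


rate = get_rate("EUR", {"EUR": 1.08})
print(rate, list(retry_queue))
```

&lt;p&gt;The caller keeps moving with a slightly stale value, and the failed call is picked up later from the queue.&lt;/p&gt;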

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Design for recovery, not perfection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;p&gt;how do we make this never fail&lt;/p&gt;

&lt;p&gt;We ask:&lt;/p&gt;

&lt;p&gt;how does this recover when it fails&lt;/p&gt;

&lt;p&gt;That changes everything.&lt;/p&gt;

&lt;p&gt;You stop chasing edge cases and start building systems that handle them naturally.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What changed for me&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I stopped treating failure as an exception.&lt;/p&gt;

&lt;p&gt;Now it’s part of the normal flow.&lt;/p&gt;

&lt;p&gt;Every system I build assumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;something will fail&lt;/li&gt;
&lt;li&gt;it will fail at the wrong time&lt;/li&gt;
&lt;li&gt;and it will fail more than once&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the system needs to absorb that without collapsing.&lt;/p&gt;

&lt;p&gt;In systems that run continuously, reliability doesn’t come from everything working.&lt;/p&gt;

&lt;p&gt;It comes from everything being able to keep going when something doesn’t.&lt;/p&gt;

&lt;p&gt;This is something we deal with constantly at BrainPack, designing systems that keep operating even when parts of the infrastructure fail. AI workflows only work if the underlying systems can recover and continue without breaking the overall flow.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>backend</category>
      <category>systemdesign</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
