In real systems, something is always failing.
An API times out.
A database slows down.
A third-party service returns garbage.
If your system depends on everything working perfectly, it won’t last long in production.
So the goal is not preventing failure.
It’s designing so failure doesn’t break everything.
## The wrong assumption
A lot of systems are built like this:
Step 1 → Step 2 → Step 3 → Done
If Step 2 fails, the whole flow stops.
In controlled environments, this works.
In production, it creates fragile systems that break on the first issue.
## What we do instead
We design flows that can survive failure and continue.
Not perfectly. But safely.
## 1. Break the dependency chain
Instead of one long synchronous flow, we split things into independent steps.
Each step:
- does one thing
- stores its state
- can be retried
So if something fails, you don’t lose everything.
You just retry that part.
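A minimal sketch of this idea, assuming an in-memory state dict (a real system would persist state to a database or queue) and hypothetical step names:

```python
def run_step(name, fn, state):
    """Run a step only if it has not already succeeded."""
    if state.get(name) == "done":
        return              # skip: completed on an earlier run
    fn()                    # may raise; state stays unset on failure
    state[name] = "done"    # record success so retries skip this step

def run_flow(steps, state):
    """Advance the flow; a failed step stops here, not the whole system."""
    for name, fn in steps:
        try:
            run_step(name, fn, state)
        except Exception:
            break           # retry later from this step; earlier work is kept
```

Calling `run_flow` again after a failure resumes at the failed step instead of repeating everything before it.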
## 2. Accept partial success
This one is uncomfortable at first.
Sometimes:
- part of the system succeeds
- another part fails
Instead of rolling everything back, we:
- keep what succeeded
- fix what failed
Because in distributed systems, “all or nothing” is rarely realistic.
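One way to sketch partial success, assuming a batch of independent items and a hypothetical `handler` function: keep what worked, and return the failures for later repair instead of rolling everything back.

```python
def process_batch(items, handler):
    """Process each item independently: keep successes, collect failures."""
    succeeded, failed = [], []
    for item in items:
        try:
            succeeded.append(handler(item))
        except Exception as exc:
            failed.append((item, exc))  # do not undo earlier successes
    return succeeded, failed
```

The failed list becomes the input for a fix-and-retry pass rather than a reason to discard the whole batch.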
## 3. Make retries safe
Failures lead to retries.
Retries lead to duplication if you’re not careful.
So every step needs to be safe to run again:
- no duplicate records
- no repeated side effects
- no broken state
If retries are safe, failure becomes manageable.
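A common way to get this is an idempotency key: each request carries a key, and the step checks a store of already-processed keys before running. This sketch uses an in-memory dict; a real system would use a unique database constraint or a persisted key table.

```python
processed = {}  # idempotency_key -> result of the original run

def create_order(key, order, save):
    """Safe to call again with the same key: no duplicate records."""
    if key in processed:
        return processed[key]   # replay: return the original result
    result = save(order)        # the side effect runs once per key
    processed[key] = result
    return result
```

A retried request with the same key gets the same result back, and the side effect never repeats.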
## 4. Isolate external dependencies
Anything outside your control will fail eventually.
So we isolate them:
- queues between systems
- timeouts and fallbacks
- delayed execution when needed

The goal is simple:
If one system goes down, everything else should keep moving.
## 5. Design for recovery, not perfection
Instead of asking:

"How do we make this never fail?"

We ask:

"How does this recover when it fails?"
That changes everything.
You stop chasing edge cases and start building systems that handle them naturally.
## What changed for me
I stopped treating failure as an exception.
Now it’s part of the normal flow.
Every system I build assumes:
- something will fail
- it will fail at the wrong time
- and it will fail more than once
So the system needs to absorb that without collapsing.
In systems that run continuously, reliability doesn’t come from everything working.
It comes from everything being able to keep going when something doesn’t.
This is something we deal with constantly at BrainPack, designing systems that keep operating even when parts of the infrastructure fail. AI workflows only work if the underlying systems can recover and continue without breaking the overall flow.