
Matías Denda

What cave diving taught me about distributed systems

I've been building backend systems for 14 years. I've also spent a decent chunk of the last decade underwater, mostly in caves.

At some point I stopped being surprised by how often the two worlds rhyme. The deeper you go into either, the more you notice the same ideas showing up in different costumes. Here are a few that stuck with me.

You plan the dive, then you dive the plan

In open water, if something goes wrong, you go up. That's it. The surface is always there, a few kicks away, a guaranteed exit.

In a cave, there is no "up". There's a ceiling, and between you and air there's sometimes hundreds of meters of rock and a specific path you came in through. If something goes wrong at the end of a one-hour penetration, the solution is still one hour of swimming away — and you're the one who has to swim it, with whatever gas, light, and composure you have left.

So technical divers plan everything before getting in the water. Gas volumes for every phase, with reserves for the worst case and the worst case after that. Turn points. Decompression schedules. Equipment failures and who does what when they happen. Team positions, signals, lost-diver procedures. Murphy's law isn't a joke in this context — it's a design input.

The rule is: plan the dive, dive the plan. You don't improvise underwater. You execute what you already decided on land, when your brain had oxygen and no time pressure.

Software has the same trap, and most teams fall into it. "We'll figure it out in production" is the engineering equivalent of "we'll figure it out at 80 meters." Sometimes you get lucky. Often you don't.

The work that matters — capacity planning, failure mode analysis, runbooks, rollback procedures, on-call rotations, dependency mapping — happens before the system is under load. Before the incident. Before anyone is stressed. Because the incident is not the time to start thinking. It's the time to execute what you already thought through.

And just like diving, the planning doesn't eliminate failure. It just makes sure that when failure shows up, you've already met it on paper.

Failures cascade. Plan for the second failure, not the first.

The thing that kills divers isn't usually the first problem. It's the panic reaction to the first problem that causes the second one — and the second one is the one you weren't ready for.

Same in distributed systems. The database slowdown isn't what takes you down. It's the retry storm from 400 service instances hammering the recovering database that takes you down.

Good divers train for compound failures: light out and low on gas, lost line and silted visibility. Good systems are designed for compound failures too: circuit breakers, exponential backoff with jitter, bulkheads, and graceful degradation. Not because the first failure is rare, but because the second one, triggered by your response to the first, is where the real damage happens.
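The backoff-with-jitter part is easy to sketch. Here's a minimal, hypothetical version in Python — the names `backoff_with_jitter` and `call_with_retries` and all the parameter values are mine, not from any particular library:

```python
import random
import time

def backoff_with_jitter(attempt, base=0.5, cap=30.0):
    """'Full jitter': sleep a random amount between 0 and an
    exponentially growing cap, so hundreds of instances don't
    retry in lockstep against a recovering dependency."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(fn, max_attempts=5, base=0.5):
    """Retry fn on connection errors, spacing attempts out with
    jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of reserves: surface the failure
            time.sleep(backoff_with_jitter(attempt, base=base))
```

The randomness is the whole point: plain exponential backoff still lets every client wake up at the same moment, while the jitter spreads the retries out and gives the database room to recover.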

Turn pressure is a circuit breaker

Before a cave dive, you calculate your "turn pressure" — the tank pressure at which you stop going in and start coming out, regardless of how close you are to the thing you wanted to see. It's non-negotiable. You don't get to feel your way through it.

Circuit breakers work the same way. You pick a threshold in advance, when you're calm and have a clear head. And when the threshold trips, the system doesn't get to argue with it. It just turns around.

The hardest part of both is the same: accepting the limit you set for yourself when you were thinking clearly, even when the situation makes you want to push past it.
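As a rough sketch of that idea — illustrative names and thresholds, not a production implementation — a circuit breaker is just a counter and a pre-committed limit:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch. The threshold is fixed up
    front -- the 'turn pressure' -- and once it trips, calls fail
    fast instead of pushing deeper."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: turning around")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the breaker again
        return result
```

Note that `call` never re-evaluates whether the threshold was a good idea. That decision was made once, in advance, and the open breaker enforces it.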

Checklists feel stupid until they save you

Every cave diver I respect uses a pre-dive checklist. Not because they forget things — but because under stress, everyone forgets things. The checklist is what your past, calm self leaves behind to protect your future, stressed self.

Runbooks are the same. The incident is not the time to remember the command. The deployment at 2 am is not the time to improvise the rollback procedure. Write it down when it's quiet. Read it when it's loud.

The real lesson

Both disciplines teach you the same uncomfortable thing: most disasters are built in advance, by people who assumed the happy path was the only path.

The habits that keep you alive in a cave are the same ones that keep systems running at 3 am on a Saturday. Redundancy, calm limits, planning for the compound failure, trusting your past self's checklist over your present self's instincts.

The costume is different. The physics of failure is the same.


If you're into either distributed systems or cave diving, I'd love to hear what overlaps you've noticed. It's always surprising how many fields converge on the same answers.
