Ant(on) Weiss for Canarian

Resilience Engineering and Life

Thirty-one years ago my life boarded an airplane and crashed into a rock. A Soviet teenager, brought up in the breathtakingly beautiful city of Leningrad, I found myself standing in the midst of Geula - a Jerusalem neighbourhood populated mainly by ultra-Orthodox Haredi Jews.
It is a grey, chilly morning in January 1990. I am surrounded by ugly dirty buildings and bearded men wearing weird black overcoats and fedora hats. Staring around in shock and disbelief, desperately wishing this was all a dream. Wishing I was back in Russia surrounded by my friends, rock music and perestroika.

But alas, there was no going back. In the years that followed Israel never ceased to surprise me with more and more things that nothing in my previous life had prepared me for. It was a bumpy ride with street fights, drugs and even imprisonment. And one could say that it’s a kind of miracle that here I am today - a well-respected Israeli citizen blessed with a family and a successful business, building a new company.

Me in 1991. Not yet ready to adapt.

Like thousands of my fellow immigrants I adapted, I overcame the unexpected challenges of the new reality and found my place. But there were also others. Those who failed to acclimate and fell victim to addiction, delinquency, depression or suicide.

So what is it - that thing that helps some of us to adapt and succeed while others crumble?
In scientific terms it is called resilience or adaptive capacity. I was in no way better than those immigrants who failed. Instead there were certain choices and actions I took when faced with difficulties that allowed me to regain my social status, to learn the new skills and understandings needed to succeed in my newfound home.
As John Allspaw says - resilience is not something a system has, resilience is something a system does. It’s not a property but rather an activity, something we actively pursue and develop. Resilience is the ability of a system, be it a human being, an organization or a software component, to withstand unforeseen adversities, to adapt to the changes they require and to spring back, to recover, to continue providing the previously expected capabilities.

So why are we now talking about resilience at IT conferences? Why is the topic of resilience becoming so top-of-mind for many of the most profound visionaries of our industry? Well, it’s because the information systems we are building are becoming increasingly complex and unpredictable, interconnected and chaotic, while we become increasingly dependent on them to function as a society, as a civilization.

How many production incidents did you have in the last year?

How many of those were expected to occur?

What was the total cost of those incidents?

How long did it take you to go back to normal?

As an industry we’ve come to an understanding that in complex distributed systems failure is a feature, not a bug. And what really matters is the system's resilience - its ability to withstand the failure, to bounce back and recover.

And we’re also becoming painfully aware of the human factor in the resilience of information systems. A program does what it is programmed to do, but in most cases - not what the programmer intended. As Stafford Beer put it - the purpose of a system is what it does.
And right now it’s only us humans who can step in to fill that gap between the intended and the actual purpose. Paraphrasing Conway's law, one could say that the resilience of an information system is defined by the resilience of the organization that builds it. And let me say, 2020 was a great testbed for organizational resilience, with test results still being calculated.

Now - as engineers the first question we should ask is: can this be engineered? Can we intentionally build our organizations and consequently our systems to do more resilience and less brittleness?
The answer is - maybe! We now somewhat understand the principles and the algorithms of how resilience works. But the paradox lies in the fact that resilience is about being prepared for the unexpected, about being ready for the unknown. Therefore it can only be tested in real time - when the unexpected event happens. Much like many other activities in software delivery - resilience is a continuous quest based on never-ending learning and adaptation.

The quest has begun! There’s already a great deal of knowledge to learn from. But there’s still a long road ahead - and it is now up to us to walk that road, to build the resilient systems of tomorrow. And then maybe, just maybe we will all be ready for whatever unexpected crap comes our way.
