The coronavirus pandemic has taught us two things: all online systems will fail eventually, and they will fail in unpredictable ways. But have no fear; resilience engineering is here!
Building and maintaining event-driven, highly scalable, and quickly adaptable systems is a pain in the neck. Resilience engineering embraces this complexity and helps you manage it.
In the TEQnation lightning talk below, I will explain what resilience engineering is, why every software person should know about it, and how you can start practicing it. You can find the transcript below.
Hungry for more? Great! This might help:
The Resilience Engineering Association (REA) has created a lovely introductory guide on resilience engineering. After reading or scanning it, you should have a pretty good idea about what resilience engineering is, where it came from and who to follow if you want to learn more.
Resilience engineering takes a lot of inspiration from safety science. In short, safety science aims to help people die less and in less horrible ways. A prominent researcher is Erik Hollnagel, who introduced the terms Safety I and Safety II. The former focuses on preventing error, the latter on increasing success.
The term system has a special meaning in the context of resilience engineering. It covers both humans and technology, hence the term socio-technical system. When resilience engineers talk about complex systems and how they fail, they focus on the whole, from broken machine parts to overworked maintenance workers and increasing market pressures.
Richard Cook has written a surprisingly readable paper on how complex systems fail. Another great resource is the book Drift into Failure by Sidney Dekker, which contains examples ranging from economic collapse to airplanes crashing into the sea.
Here’s Richard Cook again, this time using the human body to explain what resilience engineering is all about:
An essential element of resilience engineering is the post mortem. To emerge from failure stronger than before (a critical property of a resilient system), engineers need to learn how to think about finding and fixing causes of failure. That’s why you can often hear resilience engineering advocates (confusingly) proclaim: there is no root cause.
John Allspaw can explain really well what is needed to dig into failure and learn from it:
Resilience engineering has a major, ongoing influence on the DevOps movement. Disciplines like chaos engineering and site reliability engineering lean heavily on resilience engineering principles and research.
Gene Kim and John Willis extensively talk about the origins of this side of DevOps (and much more) in their audio series Beyond the Phoenix Project.
Learning about resilience engineering is an ongoing effort here at Luminis. That’s why we decided to start a blog series on resilience engineering. We will continue to add to it over the next few years, and I invite you to follow us and give us your feedback. I believe that writing, reading, and reflecting are excellent (and fun!) ways to become experts on complex subjects like resilience engineering.
Thanks again for your attention!
Hi everyone! And welcome to my TEQnation lightning talk titled: What is resilience engineering?
Before we dive into the details, I want to talk about my favorite subject: me! I am Piet van Dongen. I am 37 years old. I work at Luminis, a super nice Dutch company that is all about technology leadership, software craftsmanship, and sharing knowledge. Check us out at www.luminis.eu!
I once started my career as a primary school teacher, then moved to web design, front end engineering, back end engineering, and now I am a Cloud Consultant.
I am really, really excited about cloud technology. And I think understanding resilience engineering is essential for anyone working with technology — especially cloud technology, which enables us to build and evolve super-complex systems.
I’d love to tell you more about me, but this is a lightning talk. Let’s get cracking!
What will you learn in the next 10 minutes? This is a talk about resilience engineering, so we’ll start off with resilience.
What is resilience exactly? This will give you an idea of what resilience is and is not.
Then, the engineering part comes into play. I’ll talk about safety science, systems thinking and the socio-technical system.
Finally, I’ll give you some pointers. So you can do a resilience engineering deep dive yourself.
Now, what is resilience?
To understand resilience, you need to think about robustness first. We — you know, software people — are really good at building robust systems. Systems that are highly available, that can handle a little peak load, a little stress. To do so, we use techniques like:
Sound familiar? Great.
We do this because we tend to think about what might go wrong. You know:
- Servers might fail.
- Networks might become unreliable.
- Things go brrrrr.
We don’t want that. So we design around all the things we know can go wrong. Resilience engineering people sometimes call this: the known unknowns.
Robustness is nice, but it does not protect us when things go south in the most unexpected ways. Redundancy won’t help when the Retro Encabulator’s hydrocoptic marzel vanes go from horizontal to vertical fumbling. We call these: unknown unknowns. Stuff we don’t know we don’t know, to quote Donald Rumsfeld.
That’s the stuff resilient systems can handle. And robust systems cannot.
Take this glass, for example. I can do this (Piet taps the glass), even this (Piet taps the concrete floor with the glass). It’s designed to withstand all that. It’s robust. But it is not very resilient. If I do this (Piet smashes the glass with a hammer), it breaks.
But smashing glasses with a hammer is not a good example of resilience engineering. Smashing with a hammer is not an unknown unknown for glassware designers. It’s just not a requirement.
Hmm, lemme think…
Yes, I have a better example!
(Piet takes a hammer. And smashes fingers. He screams loudly. Fade to black. Elevator music plays. Fade back. He now has bandages on his fingers.)
Piet, now with a trembling, high-pitched voice:
Yes, a much better example. The human body is an excellent example of a resilient system. You see, even though I smashed my fingers with a hammer, my body still works. Even better: in a few weeks, these bones will be stronger than before. Great.
Now that I’ve explained the difference between robust and resilient systems, let’s dive into the engineering part of resilience engineering.
Let’s go back to my example about known unknowns. The things we know can go wrong. If you think about things that can go wrong a lot, you tend to focus on preventing error.
And how do we prevent error?
- By minimizing variance. For example, by not allowing certain input.
- By writing a LOT of documentation. The Boeing 747 flight crew operating manual is 3,000 pages long.
- By enforcing things. Like making it impossible to shift to reverse while driving your car.
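As a rough illustration of the first and last points, here is a minimal Python sketch. The `Gearbox` class, the `InterlockError` exception, and the gear names are hypothetical and only serve the example: unknown input is rejected outright, and shifting into reverse is blocked while the car is moving.

```python
# Hypothetical example: names and rules are assumptions made for illustration.
ALLOWED_GEARS = {"P", "R", "N", "D"}


class InterlockError(Exception):
    """Raised when an action is blocked by an enforced safety rule."""


class Gearbox:
    def __init__(self):
        self.speed_kmh = 0.0
        self.gear = "P"

    def shift(self, gear):
        # Minimize variance: refuse any input that is not a known gear.
        if gear not in ALLOWED_GEARS:
            raise ValueError(f"unknown gear: {gear!r}")
        # Enforce things: reverse is impossible while the car is moving.
        if gear == "R" and self.speed_kmh > 0:
            raise InterlockError("cannot shift to reverse while driving")
        self.gear = gear
```

Both checks only cover failure modes somebody thought of in advance, which is exactly the limitation of this approach.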
That’s all fine and dandy, but it only helps us prevent the known unknowns. And there are a LOT more unknown unknowns.
Have you seen the HBO series Chernobyl? A great example of disaster caused by unknown unknowns:
- Graphite tips.
- Bad management.
- Cover up by the government.
Just to name a few. So how do you prevent that? That’s what the field of safety science is all about, the birthplace of resilience engineering.
In trying to create a safer world, safety scientists are trying to shift the focus from avoiding things that go wrong, sometimes, to making sure things go right, most of the time. They do this by focusing not only on technology but on humans as well.
When resilience engineering advocates talk about systems, they do not mean components, computers, or networks. They are talking about the so-called socio-technical system. Socio as in: the people that interact with the technology. This perspective is known as systems thinking.
You see, whether you like it or not, humans are always part of the systems we create. Take the Chernobyl example again. What was ‘the system’ there? Just the nuclear power plant? The control room, the reactor, the steam turbines. Or the people as well? The operators, the managers, the government.
When thinking in socio-technical systems, the humans are just as important as the machines.
So what? How does this prevent error? What’s the difference? How do you engineer resilience, then? Surely, you can only create systems, not the humans as well? That’s where the engineering part of resilience engineering finally comes into play.
When things go wrong, they do not go wrong suddenly and all at once. Again, Chernobyl. The explosion of the reactor core was the climax, for sure. But what caused it?
The operators in the control room pushing a shutdown button? Maybe, but they did not anticipate the disastrous outcome. They didn’t even completely understand what was happening at the time, even though they were surrounded by tons of data. Also, they were under a lot of pressure from their superiors. Who were under pressure themselves as well. Plus, nobody imagined that a shutdown would cause an explosion. What happened was radically different from what people imagined would happen.
As you see, a lot of factors are at play here. Both human factors and technical factors. Socio. Technical. And I think that is the central theme of resilience engineering: to make sure things go right, most of the time, using the engineering skills we have.
How do we do that? By helping the human part of our systems as much as possible.
One way to do that is by creating transparent systems. Systems that can tell you what is happening and what state they are in at any point in time. Systems with clear controls, interfaces that enable operators to control the system when they decide they need to. That way, we enable humans to imagine systems as close as possible to their actual state, even though a perfect match is, of course, impossible.
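To make that a bit more concrete, here is a minimal Python sketch of a component that reports its own state and gives operators explicit controls. The `JobQueue` class and its method names are hypothetical, chosen only for this example; a real system would expose the same information through its own monitoring and control interfaces.

```python
from collections import deque


class JobQueue:
    """Hypothetical bounded work queue that is open about its own state."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._jobs = deque()
        self._paused = False
        self._processed = 0
        self._rejected = 0

    def submit(self, job):
        # Reject work when full, and count the rejection so it stays visible.
        if len(self._jobs) >= self.capacity:
            self._rejected += 1
            return False
        self._jobs.append(job)
        return True

    def process_one(self):
        # Do nothing while an operator has paused the queue.
        if self._paused or not self._jobs:
            return None
        self._processed += 1
        return self._jobs.popleft()

    # Clear controls: operators decide when to stop and restart processing.
    def pause(self):
        self._paused = True

    def resume(self):
        self._paused = False

    def status(self):
        # One snapshot that answers "what state are you in right now?"
        return {
            "paused": self._paused,
            "queued": len(self._jobs),
            "capacity": self.capacity,
            "processed": self._processed,
            "rejected": self._rejected,
        }
```

The point of the sketch is the shape, not the queue: every decision the component makes (rejecting, pausing, processing) shows up in `status()`, and the controls are plain methods an operator can call at any time.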
Another way to practice resilience engineering is by creating easily extensible systems. Systems that can be improved over time. So write clear, modular code. Implement architectures that can evolve. Make sure you are clear about what your code does, or at least: should and should not do.
Lastly, when things go wrong, and they will go wrong, keep an open mind. Don’t focus on a single root cause because it does not exist, and it only limits your chances of success. Don’t blame each other, but use the opportunity to create an even better system. A system that increases the chances of success over time.
That is what resilience engineering is all about.
So in conclusion:
- Resilience is not robustness.
- Resilience engineering is about creating success, not eliminating error.
- Resilience engineering is about humans and technology.
I hope you have learned something today. This talk only scratches the surface, of course. And the same goes for me: the more I learn about resilience engineering, the more I realize there is much more to learn. That’s why I’ve created a blog post that will help you dive deeper. You can find it here.
Thank you very much for listening to me. I hope you have a great day!