I decided to read “Drift Into Failure” by Sidney Dekker because of an appraisal by Lorin Hochstein (thanks to him for curating his reading notes). I expected to learn something about software systems reliability from this book. In fact, the book gave me more: it introduced me to a number of fascinating philosophical ideas and to complexity and systems theory, with which I was unfamiliar before.
I highly recommend this book, despite it being a bit repetitive. You might be able to learn the same concepts from other books or papers that are shorter than “Drift Into Failure”, including some of Dekker’s newer books.
Below are my notes on the book. All citations are from the book, © Sidney Dekker.
Drift into failure (DIF) is a gradual, incremental decline into disaster driven by environmental pressure, unruly technology, and social processes that normalize growing risk. It's an inevitable byproduct of normally functioning processes.
Competition and scarcity dictate that organisations borrow more and more from near the safety boundaries and take greater risks.
The traditional model claims that for accidents to happen something must break, give, malfunction: component, part, person. But organisations drift into failure precisely because they are doing well.
Cartesian-Newtonian epistemology = rationalism, reductionism, belief that there is the best (optimal) decision, solution, theory, explanation of events.
Optimal, rational decision-maker:
- Completely informed about alternatives, the consequences of decisions, probability of events.
- Capable of full objective logical analysis.
- Sees the finest differences between alternatives.
- Rationally ranks alternatives based on the priorities.
- (Doesn’t exist :)
Before the 1970s, accidents were seen as truly accidental or sent by god. Afterwards, they were seen as failures in risk management or the result of deliberate amoral choices or neglect. We also tie the moral gravity of an accident to its outcomes.
The growth of complexity in our society has got ahead of our understanding of how complex systems (CS) work and fail.
Simplicity and linearity remain the defining characteristics of the stories and theories that we use to explain bad events that emerge from the complexity of the world.
In complex systems, we can predict only probabilities, not results. (In Antifragile, Nassim Taleb argues that we cannot estimate narrow probabilities with any good precision, too.)
Complexity is not designed. It happens or grows (against our will).
In the 2008 debt crisis, the absence of liability for asset managers skewed the incentives from loan quality to loan quantity. More was always better.
Knowledge is inherently subjective. Direct, unmediated, objective knowledge of how a whole complex system works is impossible to get.
Local rationality principle (LRP): people are doing what makes sense to them given the situational indications, organisational pressures, and operational norms that exist at the time. Fully rational decision making, however, requires massive cognitive resources and all the time in the world.
In ambiguity and uncertainty of complex systems, options that appear to work are better than perfect options that never get computed. However, decisions that look nice locally can fail globally.
In complex systems, local actions can have global results.
- DIF is caused by resource scarcity and competition.
- DIF occurs in small steps. In CS, past success is not a guarantee of future success or safety of the same operation.
- CSs are sensitive to small changes in input conditions and early decisions. The potential to DIF may be baked in by a very small event that occurred when the system was much simpler.
- CSs that can drift into failure are characterized by unruly (unproven, delusional) technology that creates uncertainty.
- Protective (regulatory) structure between CS and the environment which is supposed to protect CS from failure may contribute to DIF because it is subject to the same pressures as the operator.
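The sensitivity of complex systems to tiny differences in initial conditions can be illustrated with a toy model. This sketch is mine, not from the book: the logistic map, a classic one-line example from chaos theory, where two trajectories that start a millionth apart soon bear no resemblance to each other.

```python
# Toy sketch (not from the book): the logistic map, a minimal model of
# sensitivity to initial conditions. Two starting points that differ by
# one millionth quickly diverge.
def logistic_trajectory(x0, r=3.9, steps=50):
    """Iterate x -> r * x * (1 - x) and return every state visited."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.500000)
b = logistic_trajectory(0.500001)  # initial state differs by one millionth

# The largest gap between the two trajectories grows to the same order
# of magnitude as the state itself:
print(max(abs(x - y) for x, y in zip(a, b)))
```

The analogy is loose, of course: an organisation is not a difference equation. The point is only that in a non-linear system, "we started from practically the same place" guarantees nothing about where you end up.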
It takes more than one error to push a CS into failure. Theories explaining failure by mistake or broken component (Cartesian-Newtonian) are overapplied and overdeveloped.
Complexity doesn't allow us to think in unidirectional terms along which “progress” or “regress” could be plotted.
Though DIF invites the idea of regress (smaller margins, less adaptive capacity, declining standards), a CS always simply adapts to whatever is happening now or has just happened. Any larger historical story is ours, not the system's.
DIF is a construct in retrospect, or, at the most, it is a construct that can be applied to a CS from the outside. From the inside, drift is invisible.
Evolution has no visible direction when you are in the middle of it. Evolution, progress, and regress in CSs (including society) are perhaps just illusions, our psychological impositions.
The idea that there is a vector of evolution, a “hand” behind it, a cause for its speed and direction is Newtonian. (Vectors of forces.) (The “hand” also made me think about the connection of this idea to religion.)
Exegesis is an assumption that the essence of the story is already in the story, ready formed, waiting to be discovered. All we need to do is to read the story well, apply the right method, use the correct analytic tools, put things in context, and the essence will be revealed to us. Such structuralism sees the truth as being within or behind the text.
Eisegesis is reading something into a story. Any directions in adaptations are of our own making. Post-structuralism stresses the relationship between the reader and the text as the engine of truth. Reading is not passive consumption of what is already there, provided by somebody who possessed the truth and passed it on. Reading is a creative act. Readers generate meanings out of their own experience and history with the text.
Nietzsche, and post-structuralism in general, don't believe that it is a single world that we are interpreting differently. That we could in principle reach agreement when we put all different pictures together. More perspectives don't mean a greater representation of some underlying truth.
In CSs, more perspectives mean more conflicts. CSs can never be understood or exhaustively described. If they could, they would either be not complex, or the entity understanding them would have to be as complex as the whole system. This contradicts the local rationality principle.
Events that we want to study, just like any CS, can never be fixed, tied down, circumscribed with a conclusive perimeter telling us what is part of it and what is not.
By reading drift into a particular failure we will probably learn some interesting things, but we will surely miss or misconstrue other things.
Whether there is a drift into failure depends not on what is in the story, but on us: what we bring into the story, how far we read, how deeply, what else we read. It depends on the assumptions and expectations that we have about knowledge, cause, and, ultimately, about morality.
Newton’s and Descartes’ model, like any model, prevents us from seeing the world in other ways, from learning things about the world that we didn't even know how to ask.
Galileo Galilei insisted that we should not worry too much about the things that we cannot quantify, like how something tastes or smells, because he thought those are subjective mental projections, not material properties of the world that can be measured; and since you cannot measure them, they are not very scientific. That legacy lives with us today: an obsession with metrics and quantification, and a downplaying of the things that not only help determine the outcome of many phenomena (our values, feelings, motives, intentions) but also make us human.
Francis Bacon: “Science should torture nature’s secrets from her. Nature has to be hounded in her wanderings, bound into service, put into constraint and made a slave.”
Take the context surrounding component failures seriously. Look for sources of trouble in the organisational, administrative, and regulatory levels of the system, not just the operational or engineering sharp end.
Systems thinking is about
— relationships, not parts
— the complexity of the whole, not the simplicity of the carved out bits
— nonlinearity and dynamics, not linear cause-effect
— accidents that are more than a sum of broken parts
— understanding how accidents can happen when no parts are broken, or no parts are seen as broken.
The work in complex systems is bounded by three types of constraints:
— economic boundary, beyond which the system cannot sustain itself financially.
— workload boundary: people or technology are not able to perform the task.
— safety boundary, beyond which the system will functionally fail.
Finance departments and competition push organisations into the workload/safety boundary corner.
In making tradeoffs for efficiency there is a feedback imbalance. Information on whether a tradeoff is cost-efficient is easy to get. How much was borrowed from safety to achieve this benefit is much harder to quantify.
Operation rules must not be set in stone. They should emerge from production experience and data.
In practice, practices don’t follow the rules. Rules follow emerging practices.
Declaring that a CS has a safety hole doesn’t help to find when the hole occurred and why.
The banality of accidents: incidents don’t precede accidents. Normal work does. Accidents result from a combination of factors none of which in isolation cause an accident. These combinations are not usually taken into account in the safety analysis. For this reason, reporting is very ineffective at predicting major disasters.
The etiology of accidents is fundamentally different from that of incidents: hidden in residual risks of doing normal work under pressure of scarcity and competition.
The signals now seen as worthy of reporting, or organisational decisions now seen as bad (even though they looked good at the time) are not big, risky events or order-of-magnitude steps. Rather, there is a succession of weak signals and decisions along a steady progression of decremental steps. Each has an empirical success and no obvious sacrifice to safety.
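A toy simulation of this "succession of decremental steps" (my sketch, not the book's): each step borrows a small, locally reasonable amount from the safety margin, and every step is empirically successful, right up until the accumulated borrowing silently crosses the boundary.

```python
# Toy sketch (not from the book): drift as small decremental steps.
# Each step shaves a little off the safety margin and "succeeds"
# (no accident), until the margin is silently exhausted.
def simulate_drift(initial_margin=10.0, saving_per_step=0.4):
    """Return a history of (step, remaining_margin, still_safe) tuples."""
    margin = initial_margin
    history = []
    step = 0
    while margin > 0:
        margin -= saving_per_step  # a small, locally sensible saving
        step += 1
        history.append((step, margin, margin > 0))
    return history

history = simulate_drift()
# Every step except the last one looked fine in isolation:
print(all(ok for _, _, ok in history[:-1]))  # prints True
# ...and no single step was an order-of-magnitude change.
```

The numbers (`10.0`, `0.4`) are arbitrary. What the sketch captures is the feedback imbalance from above: each step's saving is immediately visible, while the shrinking margin is not observed by anyone inside the loop.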
We need to raise awareness through a language that enables a more functional account. A story of living processes run by people with real jobs and real constraints but no ill intentions, whose understanding co-evolves with the system and the environmental conditions as they try to stay up to date with them.
Looking for organisational, latent causes of incidents is still very Newtonian (cause-effect): broken parts, failures of higher management.
Newtonian idea: data (metrics) help to understand something real, and some data may be better than other data.
More redundancy (which itself comes with a fixation on parts) creates greater opacity and complexity. The system can become more difficult to manage and understand.
Individualism was in part fueled by Cartesian dualistic psychology. In an individualistic worldview, it’s unnatural to think that it may take a team, an organisation, or an entire industry to be responsible for a failure.
The conceptualisation of risk/failure as a build-up and release of energy (a physical metaphor) is not necessarily well-suited to explain the organisational and sociotechnical factors behind system breakdown, nor does it equip us with a language that can meaningfully handle processes of gradual adaptation or the social processes of risk management and human decision making.
Four components of a high-reliability organisation (HRO) (according to high-reliability theory):
— Leadership commitment
— Redundancy and layers of defence
— Entrusting people at the sharp end to do what they think is needed for safety
— Incremental organisational learning through trial and error, analysis, reflection, simulation. Learning should not be organised only centrally; like decision making, it can (and should) be distributed locally as well.
An HRO needs to keep expecting a surprise. Past success should not breed a belief in sustained future safety. Tactics:
— Preoccupation with failure
— Reluctance to simplify. Paying close attention to context and contingencies of events. More differentiation, mindsets, worldviews. Diversity of interpretation helps to anticipate what might go wrong in the future.
— Sensitivity to actual operations, rather than predefined procedures and conclusions.
— Deference to expertise. Listen to the minority opinion.
— Commitment to resilience
Deference to expertise might still be slippery. Experts cannot ultimately escape the push towards the margins in their expert judgement, which is based on an incomplete understanding of the data. From within an organisation, it might be impossible to transcend the routine and political stuff and notice an emerging problem. What we don’t believe, we cannot see. Thus, relying on human rationality alone is unreasonable.
Conflicting goals are not an exception, they are a rule. They are the essence of most operational systems. Systems don’t exist to be safe, they exist to provide a service.
The most important goal conflicts are not explicit; they emerge from multiple, irreconcilable directives from different levels and sources, from subtle and tacit pressures, from management’s or customers’ reactions to particular tradeoffs.
Practitioners see their ability to reconcile the irreconcilable as a source of professional pride and a sign of their expertise.
Normalisation of deviance is the mechanism of how production pressures incubate failures.
Redundancies, the presence of extraordinary competence, or the use of proven technology can all add to the impression that nothing could go wrong.
A solution to risk, if there is any, is for an organisation to continually reflect on and challenge its own definition of “normal”, and to prioritize chronic safety concerns over acute production pressures.
Large organisations cannot act as closed, rational mechanisms. They themselves have local rationality precisely because they are made of people put together. Problems are ill-defined, often unrecognised. Decision-makers have limited information, shifting allegiances, and uncertain intentions. Solutions may be lying around, actively searching for problems to attach themselves to.
According to control theory, degradation of safety control structures can be due to asynchronous evolution: one part of the system changes without related changes in the other parts.
Changes to subsystems may be carefully planned, but their role in overall safety or effect on each other may be neglected or viewed inadequately.
Mechanistic thinking about failures means going down and into individual “broken” components. Systems thinking means going up and out: understanding comes from seeing how the system is configured in a larger network of systems, tracing its relationships with those, and seeing how those spread out to affect and be affected by factors lying far away in time and space from where things went wrong.
At the subatomic level, the interrelations and interactions between the parts of the whole are more fundamental than the parts themselves. There is motion, but, ultimately, no moving objects. There is activity, but no actors.
In complex systems, just like in physics, there are no definite predictions, only probabilities.
Ironically, as natural and social sciences both shift towards complexity and systems thinking, they become closer again. Natural sciences become “softer”, with an emphasis on unpredictability, relationships, irreducibility, non-linearity, time irreversibility, adaptivity, self-organisation, emergence; all sorts of things that have always been better suited to capture the social order.
Systems thinking predates the scientific revolution that gave birth to the Newtonian-Cartesian paradigm. Leonardo da Vinci was the first systems thinker and complexity theorist. He was a relentless empiricist and inventor, a humanist fusion of art and science. He embraced the profound interconnectedness of ideas from multiple fields. His goal was to combine, advance, investigate, and understand processes in the natural world through the interdisciplinary view.
Foundations of the complexity and systems theory:
Complex system is open, affecting and affected by the environment
Each component is ignorant of the behaviour of the system as a whole and doesn’t know the full effect of its actions. Components respond locally to the information presented to them there and then. Complexity arises from the huge web of relations between components and their local actions.
Complexity is a feature of the system, not the components.
CS operates far from equilibrium. Components need to get inputs constantly to keep functioning. Without it, CS won’t survive in a changing environment. The performance of a CS is typically optimised at the edge of chaos, just before the system’s behaviour can become unrecognisably turbulent.
CS has a history, a path dependence. The past is co-responsible for its present behaviour. Descriptions of complexity have to take history into account. Synchronous snapshot never represents the CS, only the full diachronic lineage can.
Interactions in CS are non-linear. Small events can produce large results.
Complicated systems consist of a lot of parts and interactions but are closed, unlike complex systems which are open.
Paradox: complexity should support resilience because a CS can adapt to the environment and survive. How, then, can complexity lead to failure? Complexity opens a path to a particular kind of brittleness:
Drift into failure cannot be seen synchronically. Only a diachronic study can reveal where the system is headed.
Non-linearity can amplify small events into a failure.
Local reactions and reasoning support decremental steps toward failure through normalisation of deviance.
Unruly technology introduces and sustains uncertainties about how and when things may fail.
Technology can be a source of complexity. Even if parts can be modelled exhaustively (thus are merely complicated), their operation with each other in a dynamic environment generates unpredictabilities and unruliness of complexity.
Emergence: a system has an emergent property if what it produces cannot be explained by the properties of the parts.
Organized complexity is emergent, bottom-up, a result of local interactions and a cumulative effect of simple rules.
Accidents are emergent properties of CS.
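A classic illustration of organized complexity (my sketch, not the book's) is Conway's Game of Life: each cell follows a simple local rule and knows nothing beyond its eight neighbours, yet a "glider" emerges and travels across the grid; a property of no individual cell.

```python
# Toy sketch (not from the book): emergence in Conway's Game of Life.
# Each cell applies a simple local rule; the moving "glider" is an
# emergent property of the whole, not of any part.
from itertools import product

def step(live):
    """One generation: a cell's fate depends only on its 8 neighbours."""
    counts = {}
    for (x, y) in live:
        for dx, dy in product((-1, 0, 1), repeat=2):
            if (dx, dy) != (0, 0):
                cell = (x + dx, y + dy)
                counts[cell] = counts.get(cell, 0) + 1
    # A cell is alive next generation if it has exactly 3 live
    # neighbours, or if it is alive now and has exactly 2.
    return {c for c, n in counts.items()
            if n == 3 or (n == 2 and c in live)}

glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
state = glider
for _ in range(4):
    state = step(state)

# After 4 purely local steps, the whole pattern has moved diagonally:
print(state == {(x + 1, y + 1) for (x, y) in glider})  # prints True
```

No rule mentions a glider, let alone its direction of travel; like accidents in a CS, the glider exists only at the level of the whole.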
Beware of renaming things in a way that renegotiates the perceived risk down from what it was before; it may have far-reaching consequences (a phase transition).
The optimum (maximum) behaviour is determined in the relationship between the CS and the environment.
A tradeoff between exploration and exploitation (by local agents) lies behind much of the system’s adaptation. This is also why adaptation is sometimes unsuccessful.
Exploration can produce big failures (e.g. trying a new technology in production can produce an outage because the technology fails under the load). This is why the optimum balance between exploration and exploitation puts the system near the edge of chaos.
Arriving at the edge of chaos is the logical endpoint of DIF. The system has tuned itself into maximum capability.
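The exploration/exploitation tradeoff has a standard toy formalisation (my sketch, not the book's): the multi-armed bandit with an epsilon-greedy agent, which mostly exploits the option that looks best so far but occasionally explores the others.

```python
import random

# Toy sketch (not from the book): an epsilon-greedy agent balancing
# exploration (trying arms it knows less about) against exploitation
# (pulling the arm that looks best so far).
def run_bandit(payoffs, steps=200, epsilon=0.1, seed=0):
    """Return per-arm reward estimates and pull counts."""
    rng = random.Random(seed)
    estimates = [0.0] * len(payoffs)
    counts = [0] * len(payoffs)

    def pull(arm):
        reward = payoffs[arm]  # deterministic toy rewards
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    for arm in range(len(payoffs)):  # try every arm once first
        pull(arm)
    for _ in range(steps):
        if rng.random() < epsilon:   # explore: a random arm
            arm = rng.randrange(len(payoffs))
        else:                        # exploit: the best-looking arm
            arm = max(range(len(payoffs)), key=lambda a: estimates[a])
        pull(arm)
    return estimates, counts

estimates, counts = run_bandit([1.0, 0.5])
print(estimates)              # the agent has learned both arms' payoffs
print(counts[0] > counts[1])  # ...and mostly exploits the better one
```

With `epsilon = 0`, the agent never revisits its beliefs, which is exactly the brittleness described above: pure exploitation tunes the system to past conditions, while some exploration, despite its occasional failures, keeps the adaptive capacity alive.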
In a CS, an action controls almost nothing but influences almost everything.
Diversity is critical for resilience. One way to ensure diversity is to push authority about safety decisions down to people closest to the problem.
High-reliability theory also recommends attuning to minority opinions, which helps diversity.
Greater diversity of agents in a CS leads to richer emergent patterns (including for good qualities). It typically produces greater adaptive capacity.
=> Diversity is good for an organisation.
Diversity of people will likely enhance creativity on a team. More stories get told about the things we should be afraid of. More hypotheses might be generated about how to improve the system.
Diversity of opinion in a CS should be seen as a strength, not a weakness. The more angles and perspectives, the more opportunities to learn.
1. Resource/safety tradeoff:
Full optimization of CS is undesirable because the lack of slack and margins can turn small perturbations into large events. Diversity of opinion helps to prevent exhaustive optimization.
2. Small steps offer two levers for preventing drift:
Small steps happen all the time, providing a lot of opportunities for reflection. What can this step mean for safety? Ask people in different parts of the org (since the step can reverberate there).
Small steps don’t generate defensive posturing when they are questioned, and they are cheap to roll back.
Hiring people from outside provides an opportunity to calibrate the practices and correct deviance. Rotation between teams also works. An additional benefit of people rotation is constant ongoing training, including of the safety procedures.
3. Unruly technology:
We may have traditionally seen unruly technology as a pest that needs further tweaking before it becomes optimal. This is a Newtonian commitment. Complexity theory suggests that we see unruly technology not as a partially finished product but as a tool for discovery. Turn small steps into experiments. If the results surprise us, it’s an opportunity to learn about the system, or about ourselves.
Why did this happen?
Why are we surprised that this happened?
But consider the post-structural view, and that the insights that we could possibly have are limited by our local rationality and prior experience. This is where diversity matters, again.
4. Protective structure:
The idea of safety “oversight” is problematic. Oversight implies a “big picture”. A big picture in the sense of a complete description is impossible to achieve over a CS. CSs evolve and adapt all the time. Nailing a description down at any one moment means very little for how the system will look in the next moment.
If oversight implies sensitivity to the features of complexity and drift, it might work. Oversight may try to explore complex properties:
Interconnectedness, interdependence: less is better
Diversity, rates of learning: more is better
Selection (e.g. for groupthink culture)
Protective structure should adopt the co-evolving or counter-evolving mindset. When inspecting parts, it should relentlessly follow dependencies and connections and explore interactions.
Rather than working on the safety strategy or the accident prevention program, we should create preconditions that can give rise to innovation, to new emergent orders and ideas, without necessarily our own careful attending and crafting.
There are thousands of small steps that can lead to failure. No prevention strategy put together by a single designer can foresee them all.
In a CS, it’s impossible to determine whose view is right and whose view is wrong, since the agents effectively live in different environments.
When blaming someone, beware of the limited knowledge, selection bias, and post-structural nature of this act.
There may almost always be several authentic stories of an incident.
Complexity theory doesn’t provide the answer about who is responsible for DIF, but it at least dispels the belief that there exists an easy answer.
Thanks for reading thus far!