HBO’s Chernobyl drew me in instantly, and not just because of the drama or the history of the famed incident. What mesmerized me was the way Jared Harris’s character, Valery Legasov, approached mitigating the nuclear disaster. Watching him mentally engineer the least risky solution based on the current assumptions, only to run into unforeseen issues later, resonated with my engineering side.
SPOILER ALERT
Technical Debt
More than anything else, Chernobyl is a stark reminder of the true cost of technical debt.
Technical debt, for those who don’t work in the software world, is the idea that poorly written code and bad design decisions accumulate a cost: eventually they must be fixed, or they will cause future problems. Unlike at Chernobyl, technical debt in software is usually invisible, yet it can cost companies millions to fix. What drives technical debt is very similar to what was portrayed in the final episode of the mini-series: a combination of bad design decisions driven by cost cutting (the graphite tips), human error, and productivity/business needs.
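As a toy sketch of what that looks like in code (the function names and numbers here are hypothetical, purely for illustration), compare a "cheaper" shortcut with a version that pays its debt up front by validating its assumptions:

```python
# Hypothetical illustration of technical debt: a shortcut that works today
# but leaves a hidden cost for whoever touches the code later.

def discount_price_quick(price, percent):
    # Shortcut: assumes percent is always 0-100 and price is non-negative.
    # Cheaper to write, like the graphite tips were cheaper to build.
    return price * (1 - percent / 100)

def discount_price_robust(price, percent):
    # Paying the "debt" up front: validate the assumptions instead of
    # letting a bad input silently corrupt downstream calculations.
    if not 0 <= percent <= 100:
        raise ValueError(f"percent must be between 0 and 100, got {percent}")
    if price < 0:
        raise ValueError(f"price must be non-negative, got {price}")
    return price * (1 - percent / 100)

print(discount_price_quick(100, 150))   # -50.0: a nonsense price, accepted silently
print(discount_price_robust(100, 25))   # 75.0
```

The shortcut version is faster to ship, and nothing goes wrong until a future engineer calls it with an input the original author never imagined.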
For the purpose of this analogy, I will use the term “business needs” to reference the productivity targets that the factories were trying to meet.
Cost Cutting
In the case of Chernobyl, the technical debt started with the graphite tips of the control rods. The control rods were made of boron, which slows the reaction rate in the nuclear reactor. The tips of these rods, however, were made of graphite, which increases the reaction rate. So when the engineer pressed the emergency shutdown switch to slow the core’s reaction, the rods got stuck with just the graphite tips in the reactor, and the reactor exploded.
The graphite tips were not revealed as the straw that broke the camel’s back until the final episode. The moment Valery mentions the graphite tips, the prosecutor asks, “Why graphite?”, to which Valery responds, in short, “because it is cheaper.”
The graphite tips were the physical manifestation of a programmer deciding not to write robust code and failing to consider how future engineers would interact with it. They were a shortcut the Soviet Union took to save money, and it led to one of the worst disasters of all time.
In the end, the designers did not account for the series of mistakes the engineers at Chernobyl would make that would place reactor #4 in a situation where it could explode.
Human Error
Technical debt does not often surface in the normal day-to-day use of software, just as it wasn’t revealed in the standard use of the Chernobyl nuclear reactor.
More than likely, as in software, the engineers designed the reactor to handle 95% of the scenarios it would face. In software, we call these scenarios use cases.
The scenarios that aren’t in that 95% are called edge cases. These are typically scenarios that seem rare, shouldn’t happen, or are caused by user error rather than system error.
Yet, because these systems are so complex, you can never foresee every possible scenario. Predicting every edge case is impossible.
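A minimal sketch of how this plays out in code (hypothetical function and data, not from any real system): a routine that handles the 95% of use cases perfectly, and the edge case that nobody thought would ever happen.

```python
# Hypothetical sketch of a "95% use case" function and the edge case it misses.

def average_reading(readings):
    # Handles the common case: a non-empty list of sensor readings.
    return sum(readings) / len(readings)

# Use case (the 95%): works fine.
print(average_reading([3, 2, 4]))  # 3.0

# Edge case (the other 5%): an empty list "shouldn't happen"... until it does.
try:
    average_reading([])
except ZeroDivisionError:
    print("unhandled edge case: no readings")
```

The author of `average_reading` wasn’t wrong that empty input is rare; the problem is that rare and impossible are not the same thing.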
Based on the show (as I am not a nuclear engineer), I assume no one thought someone would first poison the nuclear reactor with xenon, causing it to stall, and then try bringing it back up to power without reinserting the rods.
And then…to make matters worse…have the graphite tips of the control rods get stuck, causing an increase, rather than a decrease, in nuclear reactivity.
Perfect engineering would account for every possible scenario and every possible human error, and would steer users in the right direction. But there is no such thing as perfect engineering. We make assumptions, focus on the majority of use cases, and are pushed by deadlines. Even so, the reactor would never have been in that position without pressure from upper management to meet productivity quotas.
Business Needs
Although dramatized, the final episode depicts the directors fantasizing about being promoted because of their successful test of reactor number 4. When they get the call to stall the test for a few hours, they conclude that it will be safe.
Did they conclude it was safe because they thought through the consequences of stalling the test for ten hours, or because they wanted to keep their superiors happy? I won’t presume to know, but I have seen the need to meet deadlines or quarterly budgets force decisions that serve short-term goals but eventually cause long-term problems.
The push to meet short-term goals like monthly quotas and quarterly forecasts often forces management to game the system. They focus on meeting current demands without considering what might happen later. For example, an engineering manager might push his or her team to skip a security module because it doesn’t impact visible functionality and shipping on time will make the team look good. They might be tempted to do so even knowing it is wrong.
After all, once they are promoted, it is someone else’s problem.
You’ll Always Have That One Engineer
I enjoyed watching the engineer speak out in the final episode. The truth is, there is always one good engineer who tries to stop bad practices from taking hold. They are not worried about the bottom line, and they are not afraid to tell management it is making an error. Their recommendations can be passed over, put into the “fix it later” category (which means never), or ignored altogether.
Honestly, we need to thank these people in every company.
These are the only people sitting between us and planes falling from the sky or our banking systems going on the fritz. Their discipline for good QA and integration testing is what ensures we don’t die.
Conclusion
Chernobyl is the physical embodiment of technical debt in the software world. It can be hard to see the impact technical debt has on a company because it is invisible. Oftentimes, it won’t strike until the original engineers that developed the systems are long gone.
This doesn’t make it any less real.
This HBO show speaks to a concern I have as software becomes integrated into everything.
What kind of technical debt is being laced into our cars and planes as we further try to integrate machine learning and AI into everything?
With the sheer complexity of managing IoT devices, what are the chances engineers won’t make a mistake?
What edge cases are we failing to capture? How are tight deadlines and bombastic CEOs forcing engineers to make short-term decisions just to ship their code on time?
Companies attempt to pretend they are driven by higher mission statements, but at the end of the day, they are driven by the same productivity metrics that acted as a catalyst for the disaster at Chernobyl.
“Every lie we tell incurs a debt to the truth. Sooner or later that debt must be paid.” — Valery Legasov, Chernobyl
Are You Interested In Learning About Data Science Or Tech?
Learning Data Science: Our Favorite Data Science Books
How To Get Your First Consulting Client As A Data Scientist
Analyzing Meetup Data With SQL And Tableau
How Algorithms Can Become Unethical and Biased
Data Science Consulting: How To Get Clients
How To Develop Robust Algorithms
Dynamically Bulk Inserting CSV Data Into A SQL Server
4 Must Have Skills For Data Scientists
SQL Best Practices — Designing An ETL Video


Top comments (3)
Great read!
If we're talking about nuclear reactors and the metaphors with software development, my mind immediately goes to "The Law of Triviality", responsible for "bikeshedding".
The scenario is that in building a nuclear reactor, there might be 2,000 people simultaneously working on the project, but only a handful actually understand how the reactor works, so nobody questions their judgment and there aren't a lot of silly arguments. But everyone has the requisite knowledge to have an opinion on what materials to use to construct the bike shed, so that argument gets a disproportionate amount of attention.
FWIW, your analysis is still valid, but the issue wasn't exactly the graphite tips. Overall, I think your analogy to software development holds, but due to the complicated technical details of Chernobyl, things get much more complex.
There seems to be a lot of confusion on this, partly because of the fiction aspect of the HBO series, partly because of the drama aspect.
First, the "graphite tips" were in fact rods just like the control rods. As the control rod raised, the tips exited from below the bottom of the reactor into the reactor itself, having the combined effect of removing the depressing effect of the control rods, and the graphite increasing the reactivity from the bottom of the reactor up. Completely removing the control rods (and, primarily, SO MANY of the control rods) has a two-fold effect. One was that it jacked the reactivity way, way up, while removing all depressing effects from the control rods, setting up an "explosive" situation. Second, there was a ~1.2m gap between the "channels" the rods (control and graphite) ran in, and the size of the graphite "tips." That meant that water would rush into the channel if (and only if) the control rods were completely removed as they were. Then the resultant immediate increase in reactivity of reinserting the graphite rods (specifically, due to one of the many, many flaws of the RBMK reactor design) at the bottom BEFORE the boron control rods even began to depress reactivity, caused the water in the channel to vaporize instantly, steam explosion, and history was made.
The other major thing, regarding your analysis and its relation to software development, is that the engineers did anticipate many of the exact choices the operators made, and there were protections in place to prevent them from occurring. But, as part of the protocol (written, I can only imagine, not by the original engineers!), those safety systems were disabled.
In any case, as I said, I think the conclusions you reach are valid, but there are even more conclusions we can draw.
Were the original choices of the engineers properly documented? The answer appears to be no, but there's also some evidence Dyatlov was aware of this fundamental design flaw of the RBMK reactor and chose to move ahead with the non-critical function test anyway. Incidentally, there's also significant evidence that he was not operating outside of safety guidelines in doing so, so it's harder to make the case that Dyatlov was the villain portrayed in the HBO show as well.
There was also a culture of "we can do no wrong" operating that lent a false confidence to disabling some of the safety measures -- because "we have defense in depth" (except some of those defenses were insufficient or irrelevant)! The RBMK reactor is known today as one of the worst designed of that type of reactor at that size, but that's not what the operators "knew" at the time.
Then there are the issues no one was aware of. I'll jump to the Three Mile Island incident for the best example: no one was aware of the stuck-valve issue until it caused a problem. Luckily, multiple safeguards were in place to prevent disaster there. In Chernobyl's case, no one was aware of the problem with the shortened graphite tips until multiple RBMK reactors were already running -- finding a bug in production. This is inevitable, so we should also be aware of the ways our systems can fail unpredictably. In fact, reactor 4 was scheduled for an upgrade to address this issue until the explosion rendered the upgrade somewhat unnecessary; perhaps the issue was no longer on Dyatlov's mind because the upgrade had been scheduled? I don't know.
It is a good read and thanks for sharing the links.