
Safety-Critical Software: 15 things every developer should know

Blaine Osepchuk on March 01, 2020

Despite being all around us, safety-critical software isn't on the average developer's radar. But recent failures of safety-critical software syste...
Dian Fay • Edited

Trevor Kletz devotes chapter 20 of What Went Wrong? Case Histories of Process Plant Disasters to "problems with computer control". It has some good general points:

Computer hardware is similar to other hardware. Once initial faults have been removed and before wear becomes significant, failure can be considered random and treated probabilistically. In contrast, failure of software is systemic. Once a fault is present, it will always produce the same result when the right conditions arise, wherever and whenever that piece of software is used.

(discussing a computer-enabled methanol spill) A thorough hazop [hazard and operability study] would have revealed that this error could have occurred. The control system could have been modified, or better still, separate lines could have been installed for the various different movements, thus greatly reducing the opportunities for error. The incident shows how easily errors in complex systems can be overlooked if the system is not thoroughly analyzed. In addition, it illustrates the paradox that we are very willing to spend money on complexity but are less willing to spend it on simplicity. Yet the simpler solution, independent lines (actually installed after the spillage), makes errors much less likely and may not be more expensive if lifetime costs are considered. Control systems need regular testing and maintenance, which roughly doubles their lifetime cost (even after discounting), while extra pipelines involve little extra operating cost.

Computers do not introduce new errors, but they can provide new opportunities for making old errors; they allow us to make more errors faster than ever before. Incidents will occur on any plant if we do not check readings from time to time or if instructions do not allow for foreseeable failures of equipment.

Blaine Osepchuk

Good points. Thanks for sharing.

The belief that all software errors are systemic appears to be outdated.

I read Embedded Software Development for Safety-Critical Systems by Chris Hobbs as part of my research for this post. And he writes extensively about heisenbugs, which are bugs caused by subtle timing errors, memory corruption, etc. In fact, he shared a simple 15 line C program in his book that crashes once or twice every few million times it's run.

In the multicore, out-of-order executing, distributed computing era, systems aren't nearly as deterministic as they used to be.
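
I haven't reproduced Hobbs's program, but here's a minimal sketch of my own showing the kind of timing-dependent failure he describes: a data race on a shared pointer. Most runs finish cleanly; the occasional run crashes, depending entirely on how the scheduler happens to interleave the two threads (everything here is invented for illustration):

```c
/* Heisenbug sketch (illustrative only, not Hobbs's example): a data race on
 * a shared pointer. The reader can dereference the pointer in the brief
 * window where the writer has set it to NULL. Compile with: gcc -pthread */
#include <pthread.h>
#include <stdio.h>

static int value = 42;
static int * volatile shared = &value; /* volatile only stops the compiler from
                                          caching the pointer; it does NOT make
                                          the unsynchronized access safe */

static void *writer(void *arg)
{
    (void)arg;
    for (long i = 0; i < 50000000L; i++) {
        shared = NULL;     /* brief window of invalid state */
        shared = &value;
    }
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    long sum = 0;
    for (long i = 0; i < 50000000L; i++) {
        sum += *shared;    /* crashes only on unlucky interleaving */
    }
    printf("sum = %ld\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;              /* most runs end here; some die with a segfault */
}
```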

Dian Fay • Edited

They're exactly as deterministic as they used to be! What Went Wrong's first edition dates to 1998 -- at that point hardware and software engineers had been dealing with race conditions, scheduling issues, and the like for decades, although Kletz doesn't get into the gory details as he's writing for process engineers rather than software developers. Computer systems have not become non-deterministic (barring maybe the quantum stuff, which I know nothing about); rather, they've become so complex that working out the conditions or classes of conditions under which an error occurs tests the limits of human analytical capacity. From our perspective, this can look a lot like nondeterministic behavior, but that's on us, not the systems.

Blaine Osepchuk

Doesn't what you're saying effectively amount to non-determinism? If your safety-critical product crashes dangerously once every million hours of operation on average, for reasons you can't explain or reproduce no matter how hard you try, isn't it hard to call that a systemic error for all practical purposes?

This really isn't my area of expertise by the way. Chris Hobbs explains what he means in this YouTube talk.

Phil Ashby

Great article Blaine!

In the multicore, out-of-order executing, distributed computing era,
systems aren't nearly as deterministic as they used to be.

As Dian has noted, it's the complexity of such systems that produces apparently stochastic behaviour (with a little help from jitter: chronox.de/jent.html), and as you mention in the article itself, that's why engineers often prefer to choose their own hardware, typically picking the simplest system that meets the processing needs and then writing their own system software for it, or perhaps starting with a verified kernel (sigops.org/s/conferences/sosp/2009...) and building carefully on that.

I wonder how the safety experts feel about more nature-inspired, evolutionary-pressure approaches that use dynamic testing (fuzzing, simian army) to harden software against bad inputs, power failures, etc.? This sort of fits back in with the modern security view that failure is inevitable; what matters is how the whole system behaves under continuous failure conditions, and how newer properties of modern software deployment let it 'roll with the punches' and carry on working: slideshare.net/sounilyu/distribute...

Disclaimer: I've not worked on safety critical systems: the nearest I have been is satellite firmware (dev.to/phlash909/space-the-final-d...), which was important from a reputation and usefulness viewpoint and very much not fixable post deployment :)

Blaine Osepchuk

Thanks, Phil.

It's perfectly acceptable to go over and above the standards and do as much fuzz/dynamic/exploratory testing as you like. I don't think you would have much luck convincing regulators that it's a good substitute for MC/DC unit test coverage. But you could capture all the inputs that cause faults, fix the errors, and then add them to your official regression test suite.
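
For anyone who hasn't met MC/DC (modified condition/decision coverage): it requires tests showing that each boolean condition in a decision independently affects the decision's outcome. A hypothetical sketch, with a made-up decision and test vectors, just to show the shape of it:

```c
#include <assert.h>

/* Invented decision under test, purely for illustration. */
static int allow_deploy(int armed, int altitude_ok, int manual_override)
{
    return armed && (altitude_ok || manual_override);
}

int main(void)
{
    /* MC/DC: for each condition there is a pair of tests where only that
     * condition changes and the decision's outcome changes with it.
     * Four vectors (out of the eight possible) are enough here. */
    assert(allow_deploy(1, 1, 0) == 1);
    assert(allow_deploy(0, 1, 0) == 0); /* flipping 'armed' flips the result           */
    assert(allow_deploy(1, 0, 0) == 0); /* flipping 'altitude_ok' flips the result     */
    assert(allow_deploy(1, 0, 1) == 1); /* flipping 'manual_override' flips the result */
    /* Decision (branch) coverage alone would be satisfied by the first two. */
    return 0;
}
```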

Your SlideShare link appears to be broken. I'm curious to read what was there.

I've bookmarked your satellite project post and I'll read it when I get a minute. Writing code that either flies or runs in space is on my bucket list. I'm envious.

Phil Ashby

Ah ok, here's an InfoQ page on the topic that refers back to my favourite infosec speaker, Kelly Shortridge: infoq.com/news/2019/11/infosec-dev... The topic is Distributed, Immutable, Ephemeral (yep, DIE), using chaos engineering to defend information systems.

I get the envy reaction quite a bit :) - it was however plain luck that I was asked by a work colleague who is an AMSAT member to help out, and ended up with another friend writing firmware for a tiny CPU going to space.

Blaine Osepchuk

Thanks for the updated link. Interesting article. I don't think the details of the technique are exactly applicable to safety-critical systems. But I have read about how teams building complicated safety-critical systems with redundancies and fail-overs test how those systems respond to failures, disagreement in voting architectures, power brownouts, missed deadlines, etc. I suppose it would all fall under the banner of chaos engineering.
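
For readers who haven't seen a voting architecture: the idea is that several redundant channels measure or compute the same thing and a simple voter masks a single bad channel. A minimal, hypothetical 2-out-of-3 sketch (the channel values and tolerance are invented for illustration):

```c
#include <stdbool.h>
#include <stdio.h>

#define TOLERANCE 0.5  /* invented agreement threshold */

static bool agree(double a, double b)
{
    double d = a - b;
    return (d < 0 ? -d : d) <= TOLERANCE;
}

/* Accept a value if at least two of three redundant channels agree;
 * otherwise report a voting failure so the system can enter a safe state. */
static bool vote_2oo3(double ch1, double ch2, double ch3, double *out)
{
    if (agree(ch1, ch2)) { *out = (ch1 + ch2) / 2.0; return true; }
    if (agree(ch1, ch3)) { *out = (ch1 + ch3) / 2.0; return true; }
    if (agree(ch2, ch3)) { *out = (ch2 + ch3) / 2.0; return true; }
    return false; /* all three disagree: treat as a dangerous fault */
}

int main(void)
{
    double v;
    /* Channel 2 has drifted; it is outvoted by channels 1 and 3. */
    if (vote_2oo3(10.1, 17.9, 10.3, &v))
        printf("voted value: %.2f\n", v);
    else
        printf("voting failure: enter safe state\n");
    return 0;
}
```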

I doubt very much it was plain luck that you were asked to participate. I'm sure your engineering skills had something to do with your invitation.

Cheers.

Thomas J Owens

This is a great article!

I do have a few comments.

Despite being all around us, safety-critical software isn't on the average developer's radar.

Not only is it not on the average developer's radar, but it's almost certainly not on the average consumer's radar either. And as software systems get more information about our personal lives - what we look like, who we talk to, where we go - the systems become more safety-critical for more people. Maybe we aren't talking about the system itself causing bodily harm, but the system containing data that, in the wrong hands, could cause harm to the individual. We don't all need NASA levels of development processes and procedures to build systems that run for tens of thousands of years without errors, but many of us can learn from some of the techniques that go into building these critical systems.

Safety-critical software is about as far from agile as you can get

This is probably the only thing that I don't agree with at all.

Agile in safety-critical systems isn't about improving speed. In fact, nothing about Agile Software Development is about improving speed - the "speed improvement" is mostly perceived, due to more frequent delivery and demonstration of software. The advantage of agility is in responding to uncertainty and change.

The ability to respond to changing requirements is important, even in critical systems. I can't tell you the number of times that the software requirements changed because the hardware was designed and built in a manner that didn't fully support the system requirements. It was more cost effective to fix the hardware problems in software and go through the software release process than it was to redesign, manufacture, and correct the hardware in fielded systems. In some cases, the hardware was fixed for future builds of the system, but the software fix was necessary for systems already deployed in the field.

The basic values and principles still apply to safety-critical systems, and I highly recommend considering many of the techniques commonly associated with Agile Software Development, as I've personally seen them improve the quality of software components going into integration and validation.

Blaine Osepchuk • Edited

Hi Thomas, thanks for taking the time to leave a comment.

I totally agree with your first comment.

I have some thoughts about your second comment though. It's probably more correct to say that teams adopt agile to better respond to uncertainty and change than to go faster and reduce costs. Thanks for pointing that out.

I want to preface my next comments by stating that I've never worked on a safety-critical project. I'm just writing about what I learned from reading and watching talks about it.

Let's look at the agile principles and how they might present themselves in a large safety-critical system:

  1. Customer satisfaction by early and continuous delivery of valuable software. (Continuous delivery is unlikely. Daily builds are possible and desirable but that's not the same thing)
  2. Welcome changing requirements, even in late development. (I doubt they will be welcome, but they may be grudgingly accepted as the best way forward, given the enormous amount of extra work changes to requirements may entail)
  3. Deliver working software frequently (weeks rather than months) (Doubtful. Product will be delivered after it is certified)
  4. Close, daily cooperation between business people and developers (Possible. Desirable)
  5. Projects are built around motivated individuals, who should be trusted (Products are built from detailed plans and documented processes. Motivated individuals are good but everybody makes mistakes, so extensive checks and balances are required to ensure correctness and quality)
  6. Face-to-face conversation is the best form of communication (co-location) (Agreed. But many large systems are built by different teams or even different companies. Colocating 500 developers is rarely practical)
  7. Working software is the primary measure of progress (I'm not sure how this one would be viewed in a safety-critical project)
  8. Sustainable development, able to maintain a constant pace (Definitely desirable. But I have read about death marches in safety-critical software development)
  9. Continuous attention to technical excellence and good design (Absolutely, although the design and the code are almost certainly not created by the same people)
  10. Simplicity—the art of maximizing the amount of work not done—is essential (Excellent goal for all projects but on a safety critical project the individual has little discretion over what not to do)
  11. Best architectures, requirements, and designs emerge from self-organizing teams (It's more likely that people are assigned roles in the project by management. Self-organization will likely be discouraged. You must follow the process)
  12. Regularly, the team reflects on how to become more effective, and adjusts accordingly (Great ideal. I haven't read anything about team learning or retrospectives on safety-critical projects. Training is emphasized in several documents but that's not the same thing)

So, after going through that exercise I think I agree that the agile principles can add value to a safety-critical development effort. But I think several of them are in direct conflict with the processes imposed by the standards and the nature of these projects. We are therefore unlikely to see them as significant drivers of behavior in these kinds of projects.

If you were asked to look at a safety-critical project, examine its documents and plans, and even watch people work, and then rate the project from one to ten where one was for a waterfall project and ten was for an agile project, my impression is that most people would rate safety-critical projects closer to one than ten. Would you agree with that?

Thomas J Owens

Customer satisfaction by early and continuous delivery of valuable software. (Continuous delivery is unlikely. Daily builds are possible and desirable but that's not the same thing)

It depends on how you define "customer". Consider a value stream map in a critical system. The immediate downstream "customer" of the software development process isn't the end user. You won't be able to continuously deliver to the end user or end customer - going through an assessment or validation process is simply too costly. In a hardware/software system, the immediate customer is likely to be a systems integration team. It could also be an independent software quality assurance team.

Continuous integration is almost certainly achievable in critical systems. Continuous delivery (when defined as delivery to the right people) is also achievable. Depending on the type of system, continuous deployment to a test environment (perhaps even a customer-facing test environment) is possible, but it's not going to be deployment to a production environment the way it can be for some other kinds of systems.

Welcome changing requirements, even in late development. (I doubt they will be welcome, but they may be grudgingly accepted as the best way forward, given the enormous amount of extra work changes to requirements may entail)

If you were to change a requirement after certification or validation, yeah, it's a mess. You would need to go through the certification or validation process again. That's going to be product and industry specific, but it likely takes time and costs a bunch of money.

However, changing requirements before certification or validation is a different beast. It's much easier, though it can still have impacts on how the certification or validation is done. It also matters a lot whether it's a new or modified system requirement or a new or modified software requirement (in the case of hardware/software systems).

This is one of the harder ones, but in my experience, most of the "changing requirements" in critical systems comes in one of two forms. First is changing the software requirements to account for hardware problems to ensure the system meets its system requirements. Second is reuse of the system in a new context that may require software to be changed.

Deliver working software frequently (weeks rather than months) (Doubtful. Product will be delivered after it is certified)

Again, you can think outside the box on what it means to "deliver working software". The development team frequently delivers software not to end users or end customers but to integration and test teams. Those teams can be set up to receive a new iteration of the software every few weeks or months, create and dry run system-level verification and validation tests, and get feedback to the development teams on the appropriate cadence.

Close, daily cooperation between business people and developers (Possible. Desirable)

I don't think there's a difference here. Hardware/software systems also need this collaboration between software developers and the hardware engineering teams. There may also be independent test teams and such to collaborate with. But the ideal of collaboration is still vital. Throwing work over the wall is antithetical to not only agile values and principles, but lean values and principles.

Projects are built around motivated individuals, who should be trusted (Products are built from detailed plans and documented processes. Motivated individuals are good but everybody makes mistakes, so extensive checks and balances are required to ensure correctness and quality)

Yes, you need more documentation around the product and the processes used to build it. But highly motivated individuals go a long way to supporting continuous improvement and building a high quality product.

Face-to-face conversation is the best form of communication (co-location) (Agreed. But many large systems are built by different teams or even different companies. Colocating 500 developers is rarely practical)

I'm not familiar with any instance with anywhere close to 500 developers. Maybe on an entire large scale system, but you typically build to agreed upon interfaces. Each piece may have a team or a few teams working on it. This is hard to do on large programs, but when you look at the individual products that make up that large program, it's definitely achievable.

Working software is the primary measure of progress (I'm not sure how this one would be viewed in a safety-critical project)

This goes back to defining who you deliver to. Getting working software to integration and test teams so they can integrate it with hardware and check it out helps them prepare for the real system testing much earlier. They can make sure all the tests are in place and dry run. Any test harnesses or tools can be built iteratively just like the software. Since testing usually takes a hit in project scheduling and budgeting anyway, this helps identify risk early.

Sustainable development, able to maintain a constant pace (Definitely desirable. But I have read about death marches in safety-critical software development)

I've also read about death marches in non-safety-critical software development. Other techniques, such as frequent delivery and involvement of the downstream parties, help to identify and mitigate risk early.

Continuous attention to technical excellence and good design (Absolutely, although the design and the code are almost certainly not created by the same people)

The idea of a bunch of people sitting in a room coming up with a design and then throwing it over the wall exists, but it's not as common as you'd think. When I worked in aerospace, it started with the systems engineers working with senior engineers from various disciplines to figure out the building blocks. When it came to software, the development team that wrote the code also did the detailed design. The senior engineer who was involved at the system level was usually on the team that did the detailed design and coding as well.

Simplicity—the art of maximizing the amount of work not done—is essential (Excellent goal for all projects but on a safety critical project, the individual has little discretion over what not to do)

This is very closely related to the lean principles of reducing waste. It's true that there is little discretion over the requirements that need to be implemented before the system can be used, but there can still be ways to ensure that all the requirements do trace back to a need. There's also room to lean out the process and make sure that the documentation being produced is required to support downstream activities. Going electronic for things like bidirectional traceability between requirements, code, tests, and test results, and using tools that let reports and artifacts be generated and "fall out of doing the work", also goes a long way toward agility in a regulated context.
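
As a concrete (and entirely hypothetical) illustration of traceability "falling out of doing the work": some teams embed requirement identifiers in structured comments so a script can generate the traceability matrix straight from the code and tests. The tag format and the REQ-SW-xxx identifiers below are invented for this sketch:

```c
#include <assert.h>

/* Hypothetical convention: every function and test carries a machine-readable
 * tag that a script can grep to build a bidirectional traceability matrix. */

/* @satisfies REQ-SW-017: commanded motor current shall not exceed 500 mA. */
static int clamp_current_ma(int requested_ma)
{
    const int max_safe_ma = 500;
    return (requested_ma > max_safe_ma) ? max_safe_ma : requested_ma;
}

/* @verifies REQ-SW-017 */
static void test_clamp_current_ma(void)
{
    assert(clamp_current_ma(200) == 200);
    assert(clamp_current_ma(9999) == 500);
}

int main(void)
{
    test_clamp_current_ma();
    return 0;
}
```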

Best architectures, requirements, and designs emerge from self-organizing teams (It's more likely that people are assigned roles in the project by management. Self-organization will likely be discouraged. You must follow the process)

This depends greatly on the organization and the criticality of the system being developed. It's important to realize that the regulations and guidelines around building critical systems almost always tell you what you must do, not how to do it. With the right support, a team can develop methods that facilitate agility while meeting any rules they must follow.

Regularly, the team reflects on how to become more effective, and adjusts accordingly (Great ideal. I haven't read anything about team learning or retrospectives on safety-critical projects. Training is emphasized in several documents but that's not the same thing)

Retros for a safety-critical project aren't that different from retros for anything else. The biggest difference is that the team is more constrained in what it is allowed to do with its process by regulatory compliance and perhaps the organization's quality management system.

If you were asked to look at a safety-critical project, examine its documents and plans, and even watch people work, and then rate the project from one to ten where one was for a waterfall project and ten was for an agile project, my impression is that most people would rate safety-critical projects closer to one than ten. Would you agree with that?

I believe that you could get up to a 7 or 8. I think that agility is still gaining traction in the safety-critical community, and it's probably at a 1 or 2 now. It's extremely difficult to coach a development team operating in a safety critical or regulated space without a background in that space. But after having done it, it's possible to see several benefits from agility.

Blaine Osepchuk

Awesome feedback! Thanks for sharing your knowledge and experience.

Eljay-Adobe

Hi Blaine, another excellent article!

Steve McConnell's book Code Complete cites an “Industry Average: about 15 – 50 errors per 1000 lines of delivered code.” And I think Steve was being optimistic.

Easy enough to do a find . -iname "*.cpp" -print0 | xargs -0 wc -l and divide the line count by 20 to get a sense of how many defects are in the codebase at the high end of Steve's range (a 200,000-line codebase would be carrying roughly 10,000 latent defects). And I think even the low end of the range is optimistic.

And this is why I never use the self-driving feature of my car. No way. I don't want to be the next Walter Huang.

Blaine Osepchuk

Thanks.

I wrote a post where I criticized Tesla's Autopilot. In the area of driver-assist features, I much prefer Toyota's Guardian approach of actually finding ways to help drivers drive more safely instead of asking them to give control to a half-baked AI.

Would you use something like Guardian if your car had it? Or do you distrust all safety-critical systems that rely on AI/ML?

Eljay-Adobe

I am okay with driver assist technologies. I'm not okay with fully autonomous self-driving cars. Some day, I imagine self-driving cars will be the mainstay. But in the meantime, it'll be a slog up that technology hill.

Motlib

Hi Blaine, thanks for this really great post! Sadly you're right: there are no magic tricks to quickly develop safe software. I develop safety-critical software in the automotive area (ISO 26262), and it really is mostly process, analysis, testing, and documentation. Coding is only a few percent of the work.

Blaine Osepchuk

Care to share a couple more details about your job?

What specifically are you working on? What SIL level? Team size? What do you like best/worst about your job?

Max Ong Zong Bao

First of all, I think it is good that you used NASA's standard as a way to talk about safety-critical systems. Common industry practice, though, is usually based on some variation of IEC 61508, because it is general in nature, and there are derived safety standards for particular industries.

Second, in terms of SIL levels, introducing AI is considered a breach of them, so the use of AI is labeled "experimental" - i.e., you use it at your own risk of death or injury. My professor, a safety-critical guy, always joked that he would never ride in a self-driving car, because you can never quantify or justify that the software will work as intended; its failures are unpredictable. What you can do is reason about the probability of failure in the hardware, for which there are well-understood curves of how it degrades and fails over time.

Thirdly, the whole purpose of a safety-critical system is to prevent loss of life or serious physical harm to a human (or the equivalent). The economics are often justified by valuing a human life at about 1 million USD, which is why the higher the SIL you must comply with, the more expensive the software becomes to build, and why you only do it when you want to enter or sell into a market or country that has adopted that safety standard. Depending on the nature of the industry, a higher failure rate may be allowed, as for some medical devices.

Blaine Osepchuk

All good points.

I used NASA's standard because anyone can look at it for free, which is not the case for IEC 61508 or ISO 26262.

I have no idea how these ML/AI systems are being installed, certified, and sold in cars as safe. My best friend was nearly killed twice in one day by one of these systems. In the first case his car veered into oncoming traffic. And in the second case the adaptive cruise control would have driven him right into the car in front of him had he not intervened and turned it off.

He pulled over and found snow had collected in front of the sensors. The car didn't warn him about the questionable sensor readings, or refuse to engage those features. It just executed its algorithm, assuming the sensors were correct. Not very safe behavior in my opinion.
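
A plausibility gate of the kind described here doesn't have to be elaborate; the point is to refuse to act on questionable data and to tell the driver. A hypothetical sketch (the sensor fields, thresholds, and states are all invented for illustration):

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical plausibility check for a driver-assist feature: refuse to
 * engage (and warn the driver) when the sensor data looks stale, degraded,
 * or absurd, instead of feeding it straight into the control algorithm.
 * All field names and thresholds are invented for illustration. */

struct radar_sample {
    double range_m;        /* reported distance to the lead vehicle      */
    double signal_quality; /* 0.0 (blocked/obscured) .. 1.0 (clean)      */
    unsigned age_ms;       /* time since the sample was taken            */
};

enum assist_state { ASSIST_ACTIVE, ASSIST_REFUSED_WARN_DRIVER };

static enum assist_state check_radar_plausibility(const struct radar_sample *s)
{
    const double min_quality = 0.6;
    const unsigned max_age_ms = 200;

    bool stale    = s->age_ms > max_age_ms;
    bool degraded = s->signal_quality < min_quality; /* e.g. snow on the sensor */
    bool absurd   = s->range_m < 0.0 || s->range_m > 300.0;

    if (stale || degraded || absurd)
        return ASSIST_REFUSED_WARN_DRIVER; /* fail safe and tell the driver */

    return ASSIST_ACTIVE;
}

int main(void)
{
    struct radar_sample snowy = { 45.0, 0.1, 50 }; /* sensor blocked by snow */
    if (check_radar_plausibility(&snowy) == ASSIST_REFUSED_WARN_DRIVER)
        printf("adaptive cruise refused to engage: check sensors\n");
    return 0;
}
```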

Good point about the cost of saving one life. I've read it's different from industry to industry and from country to country. I believe nuclear and avionics in the US put the highest value on a life in the data I saw.

Max Ong Zong Bao

Ahhh... now I understand why you chose it. One of my university modules was based on that particular standard. It's really a niche subject that my professor shared with the class; he has to travel to China or Singapore from time to time to teach it because of the shortage of people who know it, even though it's really important, especially if you're implementing systems in the areas you mentioned.

As much as I want a self-driving car to fetch me from point A to point B, for now I'm of the same mind as my professor: drive it myself or grab a taxi.

Bob Johnson • Edited

I enjoyed your well-written article. I plan on taking the time to look up some of the references you mentioned. I always got a kick out of the disclaimer on the retail box of Windoz long ago.

Blaine Osepchuk

I'm glad you enjoyed it.

I don't understand your reference. What does the disclaimer say?

Daniel Ziltener

I am surprised you didn't mention Ada in your article. Ada is great for safety-critical software, especially since programs written in a subset of it (SPARK) can be mathematically verified to be free of whole classes of bugs.

Blaine Osepchuk

Oh man, I'm a huge fan of Ada and SPARK. I just wrote and open sourced a sumobot written in SPARK.

My post just kept growing and growing in length so I decided to make some hard edit choices to prevent it from becoming a short book. But safer languages are definitely an option.