Sean Charles


Assume nothing; check everything. Fail safe.

Photo courtesy of Jackson Simmer

How many of us developers are guilty of not checking return values from things all the time? A quick look around the room says most of us. Yeah, sure, most of the time stuff works, nothing happens, and so what? It's only a printf(), why should I need to check that?

I'll tell you why: because one day that printf() might be sending a message across dual-port RAM to another processor, and you need to know it arrived. Because one day that fclose() will fail because the disk is full, the file you just wrote gets vaporised, and all hell breaks loose at 2AM as the first day of your holiday begins.

My first job ever, but a mere 36 years ago now, was working with embedded microprocessor systems in a fail-safe environment: railway signalling, and oil industry SCADA and telemetry systems. And, with 36 years of hindsight, even though it FELT anything but at the time, I now consider it to have been one of the most satisfying jobs I ever had. That and working with Neil Dyer. Hello Neil, wherever you are, thanks for showing me Squeak!

When you work in railway signalling, you have to be sure that things fail safe. What does that mean? It means that as soon as there is a problem, you shut down in a way that leaves the real world unharmed, so that nobody dies.

Just because a bit of code goes wobbly shouldn't mean two trains end up racing towards each other on the same section of track at 100 MPH. No sir, that just won't do.

So what's a good practice in that situation?

Check everything.

E V E R Y T H I N G.

Check it even if, in your wildest imagination, you just know this line of code can't possibly fail. If a function returns nothing and raises no kind of exception then, of course, there's not much you can check for that particular function.

Scenario 1 : Railway track points.

You may or may not know anything about railway signalling but let me tell you, it's beautiful. I am out of touch with it all now, though a SPAD (signal passed at danger) is still a SPAD and to be avoided at all times. Back in my time, and probably still today, when the control room requests that a set of points change from one track to another, it's dangerous if the move doesn't complete correctly. The word here is derailment. For the control system to be sure that the points are fully positioned, there are relays and sensors everywhere. It's not good enough to check that the sensors say the points moved from A; you have to check that they arrived at B as well, or they could be stuck in the middle. Think dead bird, dead bodies, fly tippers: anything could have caused those points to jam mid-travel. And as soon as that happens you want to know about it.
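The "left A" versus "arrived at B" distinction can be sketched in a few lines of C. This is only an illustration under my own assumptions: the `PointsSensors` struct, the `PointsState` enum and both function names are hypothetical, and a real interlocking would be far more involved. The idea is that a move to B counts as confirmed only when B is positively detected before the time limit, and everything else (still in transit, or sensors contradicting each other) is treated as a failure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the two detection sensors on a set of points. */
typedef struct {
    bool detected_at_a; /* blades detected in position A */
    bool detected_at_b; /* blades detected in position B */
} PointsSensors;

typedef enum {
    POINTS_AT_A,
    POINTS_AT_B,
    POINTS_IN_TRANSIT, /* neither end detected: moving, or jammed mid-travel */
    POINTS_FAULT       /* both ends detected: the sensors contradict each other */
} PointsState;

/* Classify one sensor reading. */
PointsState points_state(PointsSensors s) {
    if (s.detected_at_a && s.detected_at_b) return POINTS_FAULT;
    if (s.detected_at_a) return POINTS_AT_A;
    if (s.detected_at_b) return POINTS_AT_B;
    return POINTS_IN_TRANSIT;
}

/*
 * samples holds the successive readings taken up to the time limit.
 * Returns true only if position B was positively detected.
 * "Not at A any more" is never accepted as proof of arrival.
 */
bool reached_b(const PointsSensors *samples, size_t count) {
    for (size_t i = 0; i < count; i++) {
        PointsState st = points_state(samples[i]);
        if (st == POINTS_FAULT) return false; /* contradictory sensors */
        if (st == POINTS_AT_B) return true;   /* arrival confirmed */
    }
    return false; /* time ran out: still at A, or stuck in the middle */
}
```

A caller that gets `false` back would raise the alarm rather than assume the move will finish eventually.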

Scenario 2 : Unexpected disk full.

There is absolutely nothing like a disk full of log files to bring a system down. You use log rotation, right? How many of you C coders have ever done this:

FILE *fh = fopen(...);
if (fh) {
    // shizzle happens here
    fclose(fh); // return value ignored
}

Come on, be honest. I know I USED to, but not anymore. Not since that day. That day when all the alarms started going off and we all started running around wondering what the hell was going on. Box to box comms was failing, we thought maybe an ICBM had hit us... Finally it was traced to the fact that the disk was full of log files, because somebody had failed to test that log rotation was set up correctly.

But it SHOULD have been detected, and could have been, simply by doing this instead:

FILE *fh = fopen(...);
if (fh) {
    // shizzle happens here
    if (fclose(fh)) {
        // failed: tell somebody
    } else {
        // sweet.
    }
}

For the sake of an IF statement we all got ulcers for half a day. The file failed to close because other processes were also consuming disk space (this was about 1986, everything was expensive back then). When the time came to commit the log entry, there wasn't enough space left to flush the internal file buffer to disk, the close failed but nobody knew, and then... boom.
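Here is a minimal sketch of that habit applied to a whole log write, assuming a helper of my own invention called `append_log_line` (the name and the 0 / -1 return convention are hypothetical, not from the system described). Every stdio call that can fail is checked, and `fclose()` is always called and always checked, because that is where buffered data finally gets flushed to disk.

```c
#include <stdio.h>

/*
 * Append one line to a log file.
 * Returns 0 only if every step succeeded, -1 otherwise.
 */
int append_log_line(const char *path, const char *line) {
    FILE *fh = fopen(path, "a");
    if (fh == NULL) {
        return -1; /* could not even open the file */
    }

    int ok = 1;

    /* These may only fill the in-memory buffer, but they can still fail. */
    if (fputs(line, fh) == EOF || fputc('\n', fh) == EOF) {
        ok = 0;
    }

    /*
     * fclose() flushes whatever is still buffered. On a full disk this is
     * often the first call to report the problem, so check it every time.
     */
    if (fclose(fh) != 0) {
        ok = 0;
    }

    return ok ? 0 : -1;
}
```

What "tell somebody" means when this returns -1 is up to the system: raise an alarm, write to a second channel, or shut down. On Linux, pointing `path` at `/dev/full` (a device that reports "no space left" on every write) is a cheap way to exercise the failure path.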

Scenario 3 : System Suicide

I once had the pleasure of reviewing a few thousand lines of 8089 assembler (or was it 8051, can't remember) that was the firmware for a product in the category of 'single track block unit' working. Essentially it was the electronic replacement for what, in the old days, literally would have been a block of metal the signalman would give to the train driver to obtain access to the track. I don't know how that worked mechanically, but it was a real world semaphore that guaranteed that, for the entire time the block sat in the engine room of the train, no other train could come down the same piece of track. When the driver arrives at the other end, he hands it back to the signalman at that end. This mutually exclusive semaphore ensured no high speed head on collisions could happen.

Enter microprocessors. As I reviewed the code, I realised that the author had in fact implemented a very clever system of checks. The processor board was duplicated; on power up, two identical versions (you hope) of the software would exchange version information, halting if they differed. Assuming all was well, system boot was done and then the main loop was entered. Here's the cool bit: at EVERY single point where flow could branch, a unique token was exchanged via a GPIO (general purpose input output) port, and these had to match at every single exchange. Why? The dual processors were physically separated but driven by the SAME set of digital inputs and analog values, and so should be perfectly in sync at all times. If not, look out! Actually it was managed better than that... if a token comparison failed then a routine called hare_kari was called. The first thing I noticed in the comment was the bold letters THIS FUNCTION DOES NOT RETURN.

As I read the code and the comment I was immediately impressed by the gravity of the situation. The function did not return because it couldn't. Why? Because what it did was to switch a relay that would deliberately cause the main board fuse to blow in milliseconds. The board going off line was the trigger for high level monitoring and fault management systems to take over. That is so cool, I thought it was slick then to be honest.
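The original was assembler and I no longer have it, so what follows is only my own sketch in C of the general shape of such a checkpoint. Everything in it is an assumption for illustration: the `TokenPort` struct stands in for the GPIO link between the two boards, and `shut_down_board` stands in for the real routine, using `abort()` where the hardware would have thrown a relay.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the GPIO link between the two processor boards. */
typedef struct {
    uint8_t out; /* token this board presents to its peer */
    uint8_t in;  /* token the peer presents to this board */
} TokenPort;

/*
 * Hypothetical fail-safe action. THIS FUNCTION DOES NOT RETURN.
 * The real thing switched a relay to blow the board fuse; abort() is
 * only a placeholder so the sketch runs on a desktop machine.
 */
static void shut_down_board(void) {
    abort();
}

/* True when both boards are presenting the same token. */
bool tokens_agree(const TokenPort *port) {
    return port->out == port->in;
}

/*
 * Called at every branch point with a token unique to that point.
 * If the peer is not at the same place in the code, stop everything.
 */
void checkpoint(TokenPort *port, uint8_t token) {
    port->out = token;
    if (!tokens_agree(port)) {
        shut_down_board();
    }
}
```

The design choice worth copying is that a mismatch has exactly one outcome. There is no retry and no "log it and carry on": the board takes itself out of service and lets the higher level systems deal with its absence.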

So, check everything that MIGHT fail in case it DID FAIL. Don't ever do this, you Java guys:

try {
    // funky fashizzle
} catch(Exception e) {
    // omerta
}

The Legend of The Silent Exception. At least log it somewhere so it can be found.

Trust nothing. Fail safe. Be safe.

The Universe is Indifferent.
