DEV Community

topjer
topjer

Posted on

Facing the Impossible

The every day strive to be in control of the software we maintain seems futile. Maybe we work with legacy code that was build 10 years ago or we are using undocumented binaries that were implemented by someone who long left the company. With time our systems become so complex that we are unable to predict all the possible side effects.
The unknown seems to harbor a vast number of hidden bugs that linger under the surface only to appear in the most unfortunate moment of time.

But there is a type of bug that is way more devilish. Because it hides in plain sight. We see the potential shortcoming but choose to ignore it because it's trigger is ... impossible
In hindsight we clearly see, that it only seemed impossible to us.

The prime example would be:
Me: "Why should I ensure that a day has 24 hours?"
Daylight saving time: "'sup?"

Let me tell you a story of when the impossible became reality.

Background

Let me start by explaining the circumstances. It happened in one of our SAS jobs - even though the specific language might not be of importance.

The job in question queries a database at some point during its execution in order to check whether data is up to date.
If that is not the case, execution is halted for 5 minutes and the check is performed again.
If up to date data is available it triggers a subprocess.

The main job and the sub-process write individual logs that are timestamped with the starting time to second precision.

The mystery

We are currently testing the job in question, thus we are executing it more often than normal. When checking the logs today two very strange things happened:

  • The logs of two consecutive runs of the main job that started at different times were last modified at exactly the same minute.
  • There was only one log for a subprocess and it seemed a bit skewed. The log line format, usually strictly defined, was broken. Some lines contained artifacts of other lines and some lines were simply duplicated. As if two processes wrote in the same log.

Since the logs are timestamped down to the second, it would only be possible for two subprocesses to write to the same log if they were started in the same second.

But this is impossible, right?

The solution

I told you that both main jobs ran at different times, right? But I did not tell you that they ran 14 minutes and 53 seconds apart. Why is that important? Because it is roughly a multiple of the sleep time of the main process.

What happened was that the the first instance of the main job came to the data base check and did not find up-to-date data. So it went to sleep.
At the time the second process came to the data base check, up-to-date data was still not available and because of some slight differences in run time both jobs were only 2 seconds apart.

Now both main jobs waited 5 minutes multiple times and different response times of the data base - maybe due to result caching - synchronized both main jobs down to the second.

By the time the latest data was available both jobs were completely in sync and the data base provided them the information at the same time.
Thus two instances of the subprocess where initialized in the same second and both used the same log because it was only identified by the second.
Each job interupting the other whenever they flushed their log buffer.

The impossible

In hindsight it actually does not seem to impossible for two jobs to be started in the same second. We were aware of this possibility and chose to ignore it because we deemed it impossible while it only was very unlikely.

Yet, this story is not about the highly unlikely. It is about something that appears so impossible that we did not even dare to think about it.

See, there is this one detail I kept from you. Both subprocesses were not only started in the same second, they were started in the same millisecond.

The data base was able to sync the main jobs down to the thousanths of a second. I was flabbergasted when I compared the starting time stamps of the subprocesses in the main job files because they were exactly identical. But see for yourself:

Screenshot 2021-01-26 174527

Conclusion

Even now when I recall all of this, I find it hard to believe. Yet, I feel joyful about this discovery. It reminded me that there are still many things I don't understand and that I want to learn.

Technology as a whole is simply fascinating and capable of unthinkable deeds. From now on, I will be careful whenever I think that something is impossible.

Top comments (0)