Ovid

Posted on Mar 16, 2021

Fixing a 40-year-old Software Bug

#bug #excel #history #etl

Note: I have exact numbers for this because I originally wrote this on the day I found the bug.

I was working on an ETL system designed to reduce the cost of Phase III clinical trials. In doing so, I was reading some data and I processed 36,916 potential dates. Two of those 36,916 failed to validate. I wasn't concerned as these dates came from clients at big pharmaceutical companies handing us spreadsheets the often had only a vague match with our specification. When you work with clients much larger than you, you often just grin and bear it and die a little inside each day (which reminds me, I need to write up my hell with Yahoo's now-extinct IDIF format some day).

On that day, however, the pharmaceuticals were blameless. When I inspected the raw data, the failed dates were January 1^st, 2011 and January 1^st, 2007. I knew those dates. This wasn't sloppy data from a client. I had a bug in software I had just written, but this bug was first released in 1983.

For anyone who doesn't understand the software ecosystem, this may sound mystifying, but it makes sense. Because of a decision taken a long time ago to make another company money, my client lost money in paying me to fix a bug that one company accidentally introduced and another company deliberately introduced. But to explain it I need to talk about a third company that introduced a feature that eventually became a bug, and a few other historical tidbits that nonetheless contributed to the obscure bug I fixed that day.

History

System Clocks

In the good ol' days, Apple computers would sometimes spontaneously reset their date to January 1^st, 1904. The reason for this is simple. Back then, Apple computers used battery-powered "system clocks" to keep track of the date and time. What happened when the battery ran out? Apple computers tracked their dates as the number of seconds since the epoch. In this sense, an epoch is merely a reference date from which we start counting and for Macintosh computers, that epoch was January 1^st, 1904 and when the system clock battery died, that was your new date and it caused a curious problem.

Back then, Apple used 32 bits (ones and zeros) to store the number of seconds from their start date. One bit can hold one of two values, 0 or 1. Two bits can hold one of four values, 00, 01, 10, 11. Three bits can hold one of eight values, 000, 001, 010, 011, 100, 101, 110, 111, and so on. How much can 32 bits hold? 32 bits can hold one of 2³², or 4,294,967,296, values. 2³² seconds is just over 136 years, which is why older Macs couldn't handle dates after 2040 and if your system clock battery died, your date would reset to 0 seconds after the epoch and you'd have to keep manually resetting the date every time you turned on your computer (or until you bought a new battery for your system clock).

However, the Apple solution of storing dates as the number of seconds after the epoch means we couldn't handle dates before the epoch and that had far-reaching implications, as we'll see. This was a feature, not a bug, that Apple introduced. It meant, amongst other things, that the Macintosh operating system was immune to the Y2K bug. (Ironically, many Mac apps weren't immune because they would introduce their own date system to work around the Mac limitations.)

Lotus 1-2-3

Moving along, we have Lotus 1-2-3, IBM's "killer app" that helped to launch the PC revolution, though it was VisiCalc on the Apple that really launched the personal computer. It's fair to say that if 1-2-3 hadn't come along, PCs would likely have not taken off as quickly as they had and computer technology would have turned out considerably differently. However, Lotus 1-2-3 incorrectly reported 1900 as a leap year. (In literary terms, that sentence is what we call "foreshadowing")

When Microsoft released Multiplan, their first spreadsheet program, it didn't have much market penetration. So when they conceived of Excel, they decided to not only copy 1-2-3's row/column naming scheme, they made it bug-for-bug compatible, including treating 1900 as a leap year, a problem that remains to this day. This wasn't to be sneaky; they needed Excel to be able to import Lotus 1-2-3 spreadsheets. So for 1-2-3, this was a bug, but for Excel, it was a feature, even if meant sometimes getting dates wrong.

Epochs

When Microsoft wanted to release Excel for Apple's Macintosh computers, they had a problem. As mentioned, Macintosh didn't recognize dates prior to January 1^st, 1904. However, Excel used January 1^st, 1900 as its epoch. So Excel was modified to recognize what the epoch was and internally stored dates relative to these respective epochs. This Microsoft support article explains the problem fairly clearly. And that leads to my bug.

My Bug

My client received spreadsheets from many customers. Those spreadsheets may have been produced on Windows, but they may have been produced on a Mac. As a result, the "epoch" date for the spreadsheets might be January 1^st, 1900 or January 1^st, 1904. How do you know which one? Well, the Excel file format exposes this information, but the parser I was using did not and it expected you to know whether you have a 1900 or 1904-based spreadsheet. I suppose I could have spent a lot of time trying to figure out how to read the binary format of Excel and sent a patch to the maintainer of the parser, but I had many other things to do for my client and so I quickly wrote a heuristic to determine whether or not a given spreadsheet was 1900 or 1904. It was pretty simple.

In Excel, you may have a date of July 5, 1998, but it might be formatted as "07-05-98" (the useless US system), "Jul 5, 98", "July 5, 1998", "5-Jul-98" or any of a number of other useless formats (ironically, the one format my version of Excel didn't offer was the standard ISO 8601 format). Internally, however, the unformatted value is either "35981", for the 1900 date system, or "34519", for the 1904 system (these numbers represent the number of days after the epoch). So I used a robust date parser to extract the year from the formatted date, and then an Excel date parser to extract the year from the unformatted value. If they're four years apart, I know I'm using the 1904 date system.

So why didn't I simply use the formatted date? Because July 5, 1998 might be formatted as "July, 98", losing me the day of the month. We get our spreadsheets from so many companies and they create them in so many different ways that they expect us (meaning me, in this case) to figure it out. After all, Excel gets it right, I should, too!

That's when 39082 kicked me in the tail. Remember how Lotus 1-2-3 considered 1900 a leap year and how that was faithfully copied to Excel? Because it adds an extra day to 1900, many date calculation functions relying on this can easily be off by a day. That means that 39082 might be January 1^st, 2011 (on Macs), or it might be December 31^st, 2006 (on Windows). If my "year parser" extracts 2011 from the formatted value, well, that's great. But since the Excel parser doesn't know whether it's a 1900 or 1904 date system, it defaults to the common 1900 date system, returns 2006 as the year, my software saw that the years were five years apart, assumed an error, logged it, and returned the unformatted value.

To work around this, I wrote the following (pseudo-code):

difference = formatted_year - parsed_year
if 0 == difference:
    assume 1900 date system
if 4 == difference:
    assume 1904 date system
if 5 == difference && 12 == month && 31 == day:
    assume 1904 date system

And all 36,916 dates parsed correctly.

As an aside, according to an anecdote from Joel Spolsky, the Lotus 1-2-3 "bug" may have been a deliberate attempt to simplify the Lotus software.

Top comments (4)

Nathan Kallman • Mar 16 '21

Whenever I hear about building a city on Mars, all I can think is: "hope I'm not the one who has to make datetime conversions between planets..."

Ovid • Mar 16 '21

On Mars, the Sol (Martian day) is 24 hours, 39 minutes, and 35 seconds long. When Kim Stanley Robinson wrote the epic Red Mars Trilogy, he took that extra 39 and a half minutes and called it the "timeslip." Clocks on Mars stop at that time. There is effectively no time for just over half an hour.

When I first read that, I immediately thought that contractual service level agreements (SLAs) are going to be a legal nightmare.

You: Your honor, the SLA guarantees an alert within five minutes of an outage.
Judge: And what time did the outage occur?
You: During the timeslip.
Judge: Case dismissed.

My next thought was Kim Stanley Robinson has never been a programmer, or at least never worked with dates before.

Nathan Kallman • Mar 16 '21 • Edited

Amazing... reminds me of the "Leap smear" technique to get rid of leap seconds. I'm sure someone is going to try to smear away 39 minutes and 35 seconds at some point 😞.

Especially fun will be having to account for time dilation as the decades roll on.

Dan Greene • Mar 21 '21

I’ve had to fix bugs like this. I feel for you. Fixing a bug like this changes who you are as a person, haha. You come out a much stronger developer.

If you’re curious, mine was the result of having to fix a timezone bug problem. And yes, before anyone asks, we were passing the date around in UTC ISO. But due to impersonation, the wrong timezone was utilized until years later when I came around to fix it. Now I feel like I know everything there is to know about time zones... and I wish I didn’t have the battlescars!

DEV Community

Fixing a 40-year-old Software Bug

History

System Clocks

Lotus 1-2-3

Epochs

My Bug

Top comments (4)

Read next

[Translations] Implementing Animations with Throttle in JavaScript

🔥 Effect Stopped Reacting to Action, Have You Ever Faced This? 🔥 I recently learned a valuable lesson about NgRx Effects – specifically, where to place map and catchError when handling service calls.

Awesome overview for using CSS variables to implement themes with Puck! 🎨

Setting values in R6 classes, and testing with shiny::MockShinySession