👻 Do you have any horror stories to share? Spooky bugs, scary data leaks, horrifying code, etc. 🎃

Ben Halpern on October 31, 2017

 

Every time one installs or updates the JDK, there is the message:

"3 billion devices run Java".

 

Oh yeah, I remember many years ago when my CTO wanted to clean up a few rows in the DB and ran a DELETE without specifying a WHERE clause.

DELETE FROM users;

The team spent the next week recovering data from printouts ( Hey, it was 1999 :) )

...Opppsssss....

From that moment on, we made him run mysql with the --i-am-a-dummy option.
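
For anyone curious what that option buys you: as far as I know, --i-am-a-dummy is an alias for mysql's --safe-updates, which refuses UPDATE and DELETE statements that don't say which rows to touch. Here is a rough Python sketch of the same idea, using sqlite3 so it runs anywhere. The execute_guarded helper and the users table layout are made up for illustration, and the regex check is far cruder than what mysql actually does.

```python
import re
import sqlite3


class UnsafeStatementError(Exception):
    """Raised when a DELETE or UPDATE has no WHERE clause."""


def execute_guarded(conn, sql, params=()):
    # Crude text check (an assumption for this sketch): a real guard would
    # parse the SQL so comments, strings, and subqueries can't fool it.
    is_write = re.match(r"\s*(delete|update)\b", sql, re.IGNORECASE)
    has_where = re.search(r"\bwhere\b", sql, re.IGNORECASE)
    if is_write and not has_where:
        raise UnsafeStatementError(f"refusing to run without WHERE: {sql!r}")
    return conn.execute(sql, params)


# Hypothetical table standing in for the one from the story.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ann",), ("bob",)])
conn.commit()
```

With this in place, execute_guarded(conn, "DELETE FROM users;") raises instead of emptying the table, while a DELETE with a WHERE clause goes through as usual.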

 
 

Once upon a time, all my work from the previous night disappeared. git reflog showed a SHA but the line was blank and my work never came back. I was spooked and never used that computer again.

 

The "monotable".

I thought it was just a myth. An urban legend people made up for a funny story on The Daily WTF. But then I witnessed it. It happened on my team. Done by people I worked with!

We were working on a system based on a very annoying CMS. We decided to move certain aspects to a database instead of however we were storing those parts in the CMS before. I joined the team near the middle of the migration and helped work out some bugs on one component we were migrating to the database.

I then moved on to another aspect of the system, which had some problems with the SQL. I asked the guy on our team responsible for integrating the database in the first place to see the schemas for the tables. And that he did... he showed me the schema for the table. Yes, our entire database consisted of a single table, with columns upon columns added on for additional features.

Now, I've seen some database table monstrosities before, and heck, I've even made some when I was first learning how to work with databases, but this was coupling completely unrelated aspects of the system together. I'm talking about storing user input and system configuration in the SAME table!

So yes, that was pretty spooky.

 

Just the other day I accidentally committed a ~200 MB file to git, which GitHub doesn't allow. Removing it was much more complicated than just deleting the file, because it already existed in the git history.

The whole process of removing it was pretty scary as it required rewriting the git history in possibly destructive ways if done wrong. 😱

But it worked out fine. 😄

 

Or when you forget to create a .gitignore and you see your connection strings committed xD

 
 

I was working with a "REST" API today, and the error response was:

{errorCode:1, error:"Error message"}

For the people who haven't had coffee yet: the quotes around the keys errorCode and error are missing, so it isn't valid JSON. I think it actually took some effort to produce this instead of proper JSON.
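
A quick way to see the difference, sketched in Python with the standard json module. The parse_or_none helper is just something I made up for the demo; the broken string is the response from above, and the fixed one is what a strict parser expects.

```python
import json

# The response as received: keys are not quoted, so this is not JSON.
broken = '{errorCode:1, error:"Error message"}'
# The same payload with double-quoted keys, which is valid JSON.
fixed = '{"errorCode": 1, "error": "Error message"}'


def parse_or_none(text):
    """Return the parsed value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

parse_or_none(broken) gives None, while parse_or_none(fixed) gives a normal dict. Any strict JSON parser on the client side will choke on the original in a similar way.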

 
 

I am not sure what that does... I am writing the Android client (using Kotlin), the server code is written in PHP :)

Oh, that’s a valid object literal in JS, so if you were working in JS and passed it to eval (wrapped in parentheses), it’d return the object.

PLEASE FOR THE LOVE OF GOD NEVER USE EVAL. IT’S VERY UNSAFE.

 

I remember one time I was forced to debug a 7000-line obfuscated JavaScript file.

I think that says everything. :>

 
 

One time a friend of mine and colleague was "fixing" some purchase orders and he had to delete some of the rows. He opened Management Studio and got some queries running.

After confirming the rows he needed to delete, he started to write the DELETE query. He always commented out his DELETE statements to prevent accidental data loss, but that night, right after he had written just the table name part of the statement, he pressed F5 because he wanted to check that the SELECT conditions were correct.

Something like this:

SELECT .. 
FROM [TABLE HERE]
WHERE [LOTS OF CONDITIONS ]

DELETE FROM [SAME TABLE NAME HERE] 
[HE FORGOT TO PLACE THE CONDITIONS HERE]

The best part was when he pressed STOP to cancel the query and the cancel was not working. He panicked and unplugged the Ethernet cable from his computer. xD

To this day, some people say that some of those lost rows still appear on query results.
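
One habit that might have saved those rows is running the DELETE inside a transaction and only committing if the affected row count matches what the SELECT showed. Below is a minimal Python sketch of that idea using sqlite3 instead of SQL Server; the delete_expected helper and the purchase_orders table are hypothetical names for illustration, not anything from the actual system.

```python
import sqlite3


def delete_expected(conn, sql, params, expected_rows):
    """Run a DELETE and commit only if it touched exactly expected_rows rows."""
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        if cur.rowcount != expected_rows:
            raise RuntimeError(
                f"expected {expected_rows} rows, statement affected {cur.rowcount}"
            )
        conn.commit()
    except Exception:
        # Undo the DELETE if the count was off or anything else went wrong.
        conn.rollback()
        raise
    return cur.rowcount


# Hypothetical stand-in for the purchase orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase_orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO purchase_orders (status) VALUES (?)",
    [("ok",), ("broken",), ("ok",)],
)
conn.commit()
```

Here a bare "DELETE FROM purchase_orders" with expected_rows=1 raises and rolls back, leaving all three rows in place, while the same call with a WHERE clause matching one row commits normally.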

 
 

I once not only deployed on a Friday...I changed the hosting server and underlying software AND THEN deployed on a Friday. 👻

 

Did it work out, or did you learn why the conventional wisdom about that can be summed up with "don't"?

 

It did work out, but I was enabled and supported by the manager/director from technical operations. I'm not ignorant of the conventional wisdom and I'm stability-conscious. There are nuances to the situation which I didn't feel the need to divulge in a light-hearted post.

I wasn't trying to imply otherwise. :) I just wondered if it all worked out; I'm glad it did.

 

On one of my first days at work, I opened the project, browsed, read, and found two 5000-LOC God classes.

The good thing is that I later used them to practice all the refactoring techniques from a famous book and learned to recognize most of the OOP anti-patterns.

 
 

The word 'Drupal' will probably already send a shiver down many a developer's spine, but that's what I was working on today: updating a client site's core Drupal libraries and modules with drush. This is something I've gotten down to a mad, mechanical sort of a science, but it's not without its flaws. I went through the process only to be met with a styleless page, and what's worse, fixing any issues brought it to a screeching halt in the form of a 500 error! Thankfully this was only a scare as I had forgotten this was on a dev server, but never something to be taken lightly.

 

I didn't do anything to prevent the unnecessary re-rendering of a React component that used the Google Location API. Within a few hours of it going live, we hit our max amount of free API calls and the whole feature was basically useless for 24 hours 🙃

Thankfully this was when we were slowly reskinning / revamping our product, so clients could still use the old version and knew there wouldn't be a seamless transition until everything was finished.

And that's how I learned why performance-related lifecycle methods are important.
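
The React fix would live in the component itself, but the underlying idea (don't repeat a paid call when the input hasn't changed) can be sketched in a few lines of Python. Everything here is an assumption for illustration: fetch_location is a fake stand-in for the location API, and render plays the role of a component that gets re-rendered far too often.

```python
from functools import lru_cache

# Counts how many times the (fake) paid API is actually hit.
calls = {"count": 0}


def fetch_location(query):
    """Hypothetical stand-in for a metered location API call."""
    calls["count"] += 1
    return {"query": query, "lat": 0.0, "lng": 0.0}


@lru_cache(maxsize=256)
def cached_location(query):
    # Only reaches fetch_location the first time a given query is seen.
    return fetch_location(query)


def render(query):
    """Pretend component render: safe to call repeatedly with the same input."""
    return cached_location(query)


# Simulate a storm of unnecessary re-renders with unchanged input.
for _ in range(1000):
    render("1 Main St")
```

After the loop, calls["count"] is 1 rather than 1000, which is the difference between staying inside a free quota and burning through it in a few hours.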

 

For around a month I had a bug on my Mac where it would sometimes click random parts of the screen: switching to another tab in Slack, opening a tab in the browser, etc. Then I found out it was my coworker's Bluetooth mouse, which was connected to my laptop. I had used it once, so it was saved in my configuration.

 

CSS ID selectors...nested inside MORE CSS ID SELECTORS!

 

Definitely guilty of this in the past, due to some mess with jquery and widgets with dynamic content. Not enough of an excuse, I know 🤣

 

In a previous life, I was tasked with trying to scrape an automated migration task together on the DB. It was a fairly organic operation that required me to work in production. There was a lot of back and forth between my target and production and I had to drop the target DB frequently.

The hours got long, coffee ran out and just as I had wrapped up the task, I decided to clean up for one last test run. I started by deleting the target db. But it wasn't the target. It was production.

 

Also, here's a picture I share with my front-end developer friends to scare the crap out of them.

[Image: CSS selector executive order]

 
 

That time I spent a week deconstructing and documenting a super complex algorithm in C only to find the last line.... return 0.5.

Still angry.

 

Understandable. I'm surprised your computer escaped intact after something like that.

 
 

Many years back I was involved in double charging ~100 people's stored credit cards during a routine scheduled payment process. Turns out someone had inverted the logic in an if clause. Spent well into a Sunday night trying to track that down.

 

This was probably 15 years or so ago at this point (and deals with mainframes and COBOL; yes, I know I'm like a living "You're not connected" icon - and get off my lawn while you're at it).

On the UNISYS mainframe, they have a transaction process side that's way more performant than the traditional model (think nginx vs. Apache). But, to really get the throughput for commonly-used programs, you use a concept called RTPS (Resident Transaction Processing System) - basically, instead of your program hitting STOP RUN and terminating, you GO TO a point near the top that reinitializes all your variables, then waits for the next input. The advantage to RTPS is that the operating system doesn't have to actually load the executable from disk; since it's in memory, it just runs it.

Anyway, our current setup didn't need this. But, our "next" setup (my project) needed to clear 100 transactions a second, 100k per day. Loading what was (in effect) the security program 100x per second was crazy I/O; when most of these programs finished, they called a second to display their output, and that's doubling the I/O. So, the security program, the screen-based output program, and the paginated plain-text output programs were prime candidates for RTPS.

As part of a contract with UNISYS to make sure this all went smoothly, my employer actually sent me to work with them, and we established that we could make the security program run as part of RTPS, and it worked - it was really fast! I returned to my office, excited to put this code change on our development box so that everyone could start exercising it; this turned these programs into what we call today "long running processes," and I knew that exercise was crucial to getting a lot of the kinks worked out. When I got ready to activate them, I was really excited; I think my fingers were shaking as I typed the command to launch 5 copies of this program into RTPS. It worked! I listed them, and it showed 5 copies; I ran a transaction, and that worked too! Then, the dreaded

SESSION PATH CLOSED

arrived in my terminal. "Great - what an awesome time for the network to suck!" We gave it a few minutes (I had a small audience at this point), and I was able to get signed back in. I ran the command to put 5 copies of that program back in RTPS, and again, we lost connection within a few seconds.

Time to call the help desk. We dialed the number, and this is literally how they answered the phone...

"WHAT are you DOING?!?!"

Long story sh... er, not quite as long, there were two patches the mainframe needed that they had never bothered to load, because "no one uses that." We chastised them for not keeping us current on patches, and they obtained and loaded them. RTPS worked great after that. This kept the ghouls away, until the same organization provisioned us 25% of what we and UNISYS told them we'd need when we actually went enterprise-wide with this project...

(On the upside, I joined a rather elusive group of programmers who made the mainframe spontaneously reboot from something other than the reboot command (which they wisely never gave any of us) - an elite group known as "real programmers".)

 

It happened to me during the winter of 2016. It was really difficult to get any details around this case but finally I managed to convince some local developers to start talking, fear in their eyes. Turned out there was this old User Story mentioning some Zombie images coming back after deleting them from our system. Regardless of effort spent on removing them, they would always come back, again and again. This foggy day I got delegated to investigate it deeper. What happened next was nothing one could ever expect...

 

In the third week at my new job, we had to import a multi-gigabyte database into MySQL on our development server, and the /var partition was too small to hold it. “Fortunately,” the machine was set up with LVM, so we could resize it (and /home, to make room). To avoid making silly mistakes with the CLI, we chose to use the GUI tool. After we downsized /home and upsized /var, everything was corrupt. The GUI had resized the partitions, but not the volumes, so /var was overlapping /home.

(Here you have to wonder about the point of having a GUI if it makes even sillier mistakes than those you’d have made in the CLI.)

Fortunately, we could retrieve the exact previous block counts of both partitions from the logs, and resize everything back to how it was before. But wait, it gets better: this time, the GUI chose to resize partition AND volume, so everything was still corrupt, /home was now smaller than the data it contained. So we had to re-re-resize the volumes, and fortunately everything was back to normal and no interesting data was lost.

Did I mention it was my third week on the job that I almost nuked the team’s sole development machine?

From this day on, every time Linux suggests I partition and set up LVM, I smash the Nope button.

 

That one time when the test credit card server actually processed payments... Turns out you had to use the test credit card on the test payment server or it'd still try to charge the card. Imagine that. Luckily we caught it in time before the processing went through.

And that other time when some small bug (I can't remember what it was anymore) prevented (a small subset of) payments from being processed for a day or so.

Those are probably my two biggest blunders.

As far as scary code goes, I was asked to rewrite some modules that were written years ago by a third party out of the country. Some of the things I saw in there really made me scratch my head and literally facepalm at times.

 

I can never remember any horror stories, but I'm always a bit on edge whenever I touch production data... 🙃

 

I do enjoy the thrill of playing in production :)

 

We have an async job on Resque that expires passwords at a given interval. No one noticed that this job had stopped working for a long time, and it had about 8M jobs enqueued. Someone fixed the queue and sent all the jobs off to be executed. We expired passwords for about 40% of our clients, and some of them received more than a thousand password expiration emails in one day. Shit happens...
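
In hindsight, one way to soften a replay like that is to collapse the backlog before running it, so each client gets at most one expiration. This is only a sketch in plain Python, not Resque code: the job dictionaries, the client_id and enqueued_at fields, and the collapse_backlog helper are all assumptions made up for the example.

```python
def collapse_backlog(jobs):
    """Keep only the most recently enqueued job per client.

    Assumes each job is a dict with hypothetical 'client_id' and
    'enqueued_at' keys; later duplicates for a client replace earlier ones.
    """
    latest = {}
    for job in sorted(jobs, key=lambda j: j["enqueued_at"]):
        latest[job["client_id"]] = job
    return list(latest.values())


# Fake backlog: 3 clients, 1000 stale jobs each.
backlog = [
    {"client_id": client, "enqueued_at": tick}
    for tick in range(1000)
    for client in ("a", "b", "c")
]
to_run = collapse_backlog(backlog)
```

With the fake backlog above, 3000 queued jobs shrink to 3, one per client, each being the newest one that was enqueued.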

 

Hadoop 0.20.2 had a bug where fixing missing block replicas would not respect the rack-aware placement policy. Over the course of many years we had lost enough drives to start losing blocks whenever we lost a drive. Took us a while to figure out what was happening. Luckily the data in Hadoop was not the primary source, but we had to recopy data from origin, then copy all files in HDFS to a new HDFS cluster (almost every file had at least one block affected). Petabyte-scale copies don't happen quickly. :)

 

I once introduced a bug in our system, and for a whole day invoices on the website were not being created at all. It was d-day...

 

Not scary, but spooky good.

I once saw a self-contained function written in 10+ year-old Transact-SQL in a banking database, used to calculate if a date was Easter (literally IS_EASTER(DATE)...).

This was High Technomancy.

Per Wikipedia, Easter "falls on the Sunday following the full moon that follows the northern spring equinox". So, you can imagine the loops and mods and leap year and off-leap year calculating.

And, since the method of calculation depends on whether you're using the Gregorian or Julian calendar, of course there was a conditional "IF YEAR < 1752..." enclosing a whole other set of calculations.
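
I never saw how that T-SQL function did it internally, so the following is not a reconstruction of it. It is a small Python sketch of the well-known "anonymous Gregorian algorithm" for the date of Western Easter, just to give a feel for the arithmetic involved. It only covers the Gregorian calendar, so the pre-1752 Julian branch described above is deliberately left out, and the function names are my own.

```python
from datetime import date


def gregorian_easter(year: int) -> date:
    """Western Easter Sunday for a Gregorian year (anonymous Gregorian algorithm)."""
    a = year % 19                        # position in the 19-year lunar cycle
    b, c = divmod(year, 100)             # century and year within the century
    d, e = divmod(b, 4)                  # century leap-year corrections
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30   # locates the ecclesiastical full moon
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7  # days until the following Sunday
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def is_easter(d: date) -> bool:
    """True if the given Gregorian date is Easter Sunday of its year."""
    return d == gregorian_easter(d.year)
```

For example, is_easter(date(2017, 4, 16)) is True, and gregorian_easter(2024) gives March 31, 2024. A version handling Julian dates would need a second, different formula, which is presumably what the IF YEAR < 1752 branch was for.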

 

I once spent the entire 2003 Christmas break tracking down a bug that turned out to be one character in a printf format specifier.
