What have you crashed?


Over the years, when someone crashes something at work I (semi-jokingly) say:

If you don't crash something every so often, you aren't working hard enough.

And while that might not be 💯 percent correct, I feel it helps put the person at ease, especially if the person is new to the company. The truth is, it has and will happen to everyone.

So, have you 💥 crashed 💥 something important? Or brought the network to its knees? I'd ❤️ to hear about it.

DISCUSS (23)
 

In my early days, I learned (...eventually...) that complicated SQL queries with incomplete JOIN conditions on large data sets in production = non-responsive server = angry customers.
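For anyone who has not watched this happen, here is a minimal sketch of how one forgotten JOIN condition multiplies the row count. The `orders` and `shipments` tables and their contents are made up for illustration, and SQLite stands in for whatever production database was involved.

```python
import sqlite3


def count_join_rows(join_condition: str) -> int:
    """Count the rows a two-table join produces under the given ON clause."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT)")
    conn.execute("CREATE TABLE shipments (order_id INTEGER, region TEXT)")
    # Hypothetical data: 100 rows per table, split evenly across two regions.
    rows = [(i, "east" if i % 2 else "west") for i in range(100)]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.executemany("INSERT INTO shipments VALUES (?, ?)", rows)
    query = f"SELECT COUNT(*) FROM orders o JOIN shipments s ON {join_condition}"
    (count,) = conn.execute(query).fetchone()
    conn.close()
    return count


# Complete condition: each order matches exactly one shipment.
complete = count_join_rows("o.order_id = s.order_id AND o.region = s.region")
# Incomplete condition: the order_id match was forgotten.
incomplete = count_join_rows("o.region = s.region")
print(complete, incomplete)
```

With only 100 rows per table the incomplete join already returns 50 times as many rows as the complete one. The same mistake on millions of rows is what makes a server stop responding.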

Fast forward to today, I insist that the IT department provide a dev/staging server that mimics production, knowing full well that a runaway query (or never-ending loop, or recursive function with a misplaced terminal condition, or ...) can happen to anyone from time to time. Even the best of us! :)

 

In the early 70s, I thought I was a hotshot programmer: I had 5 years of experience and could do Assembly and COBOL. I did an optimization that impressed my supervisor and got permission to examine the computer. It was an IBM 370/148, not the small one that fit in a large room and was partitioned into 4, but a big one that took up the entire floor at the bank and was partitioned into 8. This was my first time on a big machine.

I examined F0, F1, F2, F4, F5, F6, then paused. Power was always on the last partition, F3, but on this big machine power must be on F7. You never look at power: the machine would have to scan every log file since it was built to fill the screen. So I went to F3. This big machine was actually a small machine that had been upgraded. First all the line printers quit, then all the tape drives quit, then there was nothing but the rattle of the giant HDD platters, then the phone rang, then every phone at the bank was ringing. 165 seconds after I pressed the wrong button, the screen filled up and everything went back to normal.

I got to meet the bank president, the entire board, every manager, and the Federal Reserve guy. I said I was trying to speed up the computer. It seems every terminal in every branch also went down for 165 seconds. The Federal Reserve banned me from any bank computer room, and they took away my keyboard and terminal. I had to program with a plastic template and graph paper, get someone to keypunch it for me, and then bring the stack of cards to the computer operator to be run after hours. Making a change in the program, compiling it, and seeing the results took about 60 hours. I would sweet talk the computer operator into fixing my code. I ended up married to her, and for years she kept fixing my code.

 

I ended up married to her, and for years she kept fixing my code.

Ha, sounds worth it!

 

This is fantastic! I was not expecting a ❤ story. Thanks for sharing.

 

My second week of an internship as a data analyst, I crashed the dev MongoDB database by discovering a (known) bug in the aggregation framework. Said bug errored once per document in a ~4-6mil document aggregate query, causing the log file to balloon to multiple gigabytes, filling the filesystem and crashing the cluster. This being dev, the logs were set to write to the same drive/partition as the root, so it also had the awesome side effect of preventing login to the system because the temporary files to allow login couldn't be created.

 

Filling the disk! Logging to a file in a tight loop will generate more output than you might think. I once logged to my home directory, which was network mounted. Next thing you know, the server was offline and no one could log in. Doh!
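One common guard against both stories above is a size-capped log. This is only a sketch using Python's standard `logging.handlers.RotatingFileHandler`; the cap values, the logger name, and the log message are assumptions for the demo, not what either system actually used.

```python
import logging
import logging.handlers
import os
import tempfile

MAX_BYTES = 10_000  # per-file cap (assumed value for the demo)
BACKUPS = 2         # number of rotated files to keep (assumed value)

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")

# The handler starts a new file before the current one reaches MAX_BYTES
# and deletes the oldest file once more than BACKUPS rotated files exist.
handler = logging.handlers.RotatingFileHandler(
    log_path, maxBytes=MAX_BYTES, backupCount=BACKUPS
)
logger = logging.getLogger("capped_demo")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep demo output out of the root logger
logger.addHandler(handler)

# Simulate one error per document in a tight loop.
for doc_id in range(5_000):
    logger.info("aggregation failed for document %d", doc_id)

handler.close()
logger.removeHandler(handler)

total_bytes = sum(
    os.path.getsize(os.path.join(log_dir, name)) for name in os.listdir(log_dir)
)
print(total_bytes, "bytes on disk; the cap is", MAX_BYTES * (BACKUPS + 1))
```

The trade-off is that old messages are discarded, so the first error in a flood may be gone by the time anyone looks. Writing logs to a partition separate from root helps with the "nobody can log in" part.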

 
 

Seeing as my primary area of coding is vulnerability scan automation, I tend to crash a lot of things. Central web services, DNS servers, document repos, data-taking systems. I've killed them all. Each time I do it, I have to spend a lot of time trying to figure out how to be kinder and more gentle to the systems on the network. My favorite is scanning for anonymously writable FTP daemons. I get a lot of complaints from users that their printers are spitting out lovely notes from me.

 

I once took down a Facebook game I worked on with over 4 million daily players. The combination of an off-by-one error and infinite retry with no backoff turned all the game clients into a DDoS attack on our own servers. Even the Akamai web console was unresponsive for about 10 minutes.
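For contrast, here is a rough sketch of the usual alternative to "infinite retry with no backoff": a bounded number of attempts with capped exponential backoff and random jitter. The function names, the default delays, and the `flaky` request are all hypothetical. This is not the game's actual client code.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_attempts: int, base: float = 0.5, cap: float = 30.0) -> List[float]:
    """Upper bound for the wait before each retry: base * 2**n, capped."""
    # max_attempts total tries means max_attempts - 1 waits between them.
    return [min(cap, base * 2 ** n) for n in range(max_attempts - 1)]


def call_with_retry(
    request: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying on ConnectionError a bounded number of times."""
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up instead of retrying forever
            # Random wait in [0, bound] so clients do not all retry in lockstep.
            sleep(random.uniform(0, delays[attempt]))
    raise AssertionError("unreachable")


# Usage: a hypothetical request that fails twice and then succeeds.
calls = []


def flaky() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("server busy")
    return "ok"


waits: List[float] = []
result = call_with_retry(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, len(calls), waits)
```

The cap on attempts is what keeps a client-side bug from turning into unbounded load, and the jitter spreads out the retries that do happen.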

 

Brought down a couple of warehouses for companies you've definitely heard of because our deployment strategy was to TeamViewer to customer site > Drop patched DLLs. Not much QA there.

 

Yeah... about 5 minutes before a demo at my company, I thought I could fix a simple SQL call that was causing some invalid data. It worked fine on my machine. But when I went to push it to the server, something went wrong and it broke the demo. The code was fine but the deployment failed. It took 15 minutes into the meeting to figure out what went wrong and fix it. Whoops. Haha.

 

about 5 minutes before a demo at my company

I know that temptation, trying to squeeze as much as possible into the demo.

 

We've got an old records system on an AS/400 that needed the data migrated. Short of writing a fully custom app to allow editing (due to the legal need to provide redaction capability forever), we needed some kind of front end. We have a robust document management system that provides all of the retention, security, search, redaction, and managed access features that we need. We looked at a ton of options, and this was the best fit.

So, I set about turning the data into a series of reports and feeding them into the system using their SDK. The first iteration just about killed the report server. At that point I learned how to generate reports locally on my batch machine, and how to do it in parallel. The new process resulted in the queues on the server being filled up, and the resources being saturated. Everything ground to a halt. My next lesson was in tuning the import process to not be so hard on the servers. The next hurdle was that, in importing a couple of hundred million files, the SAN hosting the application's storage ran out of inodes. Every server hitting that storage location had a fit. There wasn't a lot I could do about that, but after the operations team cleared the issue, I added that to the list of things I know to think about for the future.

I learned about a lot of neat things during that project. It gave me a ton of great lessons that I wouldn't have learned if stuff hadn't broken.

 

I am guilty of the old UPDATE without a "WHERE" clause on an accounting production database (there was no staging environment at all).
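One possible safety net for this, sketched with Python's built-in `sqlite3`: run the UPDATE inside a transaction and refuse to commit if it touched more rows than expected. The `guarded_update` helper, the `accounts` table, and the row limit are all hypothetical, and other databases and drivers handle transactions differently.

```python
import sqlite3


def guarded_update(conn: sqlite3.Connection, sql: str, params=(), max_rows: int = 1) -> int:
    """Run an UPDATE, but roll it back if it touches more than max_rows rows."""
    # With default settings, sqlite3 opens a transaction implicitly before an UPDATE.
    cur = conn.execute(sql, params)
    if cur.rowcount > max_rows:
        conn.rollback()
        raise RuntimeError(
            f"refusing to commit: {cur.rowcount} rows affected, expected at most {max_rows}"
        )
    conn.commit()
    return cur.rowcount


# Hypothetical accounting table with three rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 200), (3, 300)])
conn.commit()

# The classic mistake: no WHERE clause, so every row would be changed.
try:
    guarded_update(conn, "UPDATE accounts SET balance = 0")
    blocked = False
except RuntimeError:
    blocked = True

# The intended statement touches exactly one row and is committed.
changed = guarded_update(conn, "UPDATE accounts SET balance = 0 WHERE id = ?", (2,))
balances = [b for (b,) in conn.execute("SELECT balance FROM accounts ORDER BY id")]
print(blocked, changed, balances)
```

This does not replace a staging environment or backups. It only catches the case where the affected row count is obviously wrong.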

And when I previously worked in IT (more hardware stuff), I plugged a computer into the network, and somehow it caused an IP conflict between the computer and a software license server, which crashed the software on every computer in the network. That one caused lots of mayhem all through the building.

 

Many 💥, way too often.

Working directly in production databases is rare (and still a terrible idea) these days, but it used to be the status quo.

Let's just say that I still tend to write my "WHERE" clauses first when writing SQL.

:)

 

I was fixing a friend's laptop and bricked it... twice. I have done the same thing to my own computer (the power went out during an update, only for a second, but it still bricked it; it went out again when I was reinstalling Linux).

 

I once messed up the key used for a particular cache. The system worked fine with 1-2 users testing it. The first morning after going to production everything went haywire. The users were getting error messages on every screen, regardless of whether they were using the affected feature or not. Productivity ground to a halt. That was Friday. I was out of town Thursday through the following Wednesday.

 

I crashed a mini-server I made using a connection of multiple files by spelling "connectTo main" as "connectTo mian". This was in Python, and I was making a programming language. I should have added a try/except block...
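A rough idea of what that guard could look like. The real language and server are not shown in the comment, so `connect_to`, `run_line`, and the `modules` table are invented purely for illustration.

```python
from typing import Dict


def connect_to(target: str, modules: Dict[str, str]) -> str:
    """Look up a file by name; raises KeyError if the name is unknown."""
    return modules[target]


def run_line(line: str, modules: Dict[str, str]) -> str:
    """Interpret one line of the toy language without crashing on a typo."""
    command, _, target = line.partition(" ")
    target = target.strip()
    if command != "connectTo":
        return f"error: unknown command {command!r}"
    try:
        return connect_to(target, modules)
    except KeyError:
        # Report the bad name and keep the server running.
        return f"error: no file named {target!r}"


modules = {"main": "main loaded"}  # hypothetical table of known files
print(run_line("connectTo main", modules))
print(run_line("connectTo mian", modules))
```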

 

I saw this earlier and thought I was just going to read along and nod to some relatable content. Then Visual Studio hiccuped and I watched the entire bin folder start throwing errors because VS thought half the files were deleted. Somehow the project builds without them, but it won't publish, and I wasted enough time trying to fix it that I could have recreated the whole project.

 

I taught myself Entity Framework half a year ago because I really got tired of the way we did queries in my workplace's main product. Seems easy enough, but... Lots of fun when a seemingly innocent LINQ query refuses to do a simple join and instead fires off one query for every related item. And it doesn't slow the application down enough to be noticeable. Well, until our main product's peak time in the week... Server memory usage: 📈 Our app: 💥 Me that day: 😓

But the occasional EF weird-stuff beats having to deal with stored procedures and StringBuilder-made query strings.
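The "one query for every related item" pattern is often called the N+1 query problem. Below is a small Python and SQLite stand-in, not Entity Framework, that counts the queries each approach issues. The `customers` and `orders` tables and the `CountingDb` wrapper are made up for the demo. In EF itself, the usual counterpart to the single-query version is eager loading (for example with `Include`), though whether that fits depends on the model.

```python
import sqlite3
from typing import Dict


class CountingDb:
    """Thin wrapper that counts how many queries are sent to the database."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.queries = 0

    def query(self, sql: str, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def make_db(customers: int = 50) -> CountingDb:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount INTEGER)")
    for i in range(customers):
        conn.execute("INSERT INTO customers VALUES (?, ?)", (i, f"customer{i}"))
        # Two orders per customer (hypothetical data).
        conn.executemany("INSERT INTO orders VALUES (?, ?)", [(i, i), (i, i + 1)])
    conn.commit()
    return CountingDb(conn)


def totals_n_plus_one(db: CountingDb) -> Dict[str, int]:
    """One query for the customers, then one more query per customer."""
    totals = {}
    for cid, name in db.query("SELECT id, name FROM customers"):
        rows = db.query("SELECT amount FROM orders WHERE customer_id = ?", (cid,))
        totals[name] = sum(amount for (amount,) in rows)
    return totals


def totals_joined(db: CountingDb) -> Dict[str, int]:
    """The same result from a single query with a join."""
    rows = db.query(
        "SELECT c.name, SUM(o.amount) FROM customers c "
        "JOIN orders o ON o.customer_id = c.id GROUP BY c.id, c.name"
    )
    return dict(rows)


slow_db, fast_db = make_db(), make_db()
slow_totals = totals_n_plus_one(slow_db)
fast_totals = totals_joined(fast_db)
print(slow_db.queries, fast_db.queries, slow_totals == fast_totals)
```

Each extra query is cheap on its own, which is why the problem tends to stay invisible until peak load.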

 

I was trying to delete a set of scripts in a subdirectory, and typing faster than I thought led me to delete all of the scripts on our build server. I stood up immediately after realizing and walked over to the most senior developer on our team to tell him, and he said that it was no big deal. It was rsync'd to the secondary server and he could bring the scripts back with a vintage of just last night.

 

One of my early PRs at a shop doing mostly zero-downtime schema migrations turned out to... not be zero-downtime. Unexpectedly so. We got the first user email within seconds. Good times.

Once, while working as a sysadmin for my family's business, I wiped an important application server and re-installed the OS only to find out that I couldn't restore the db from a backup tape. I spent all day and night in the server room, freaking out, on long phone calls with Veritas and MS, while my uncle had to do the software's job on paper, without the benefit of the now-missing historical data.

Phone support was no help, but I eventually realized that the restore job might be crashing because, in an unrelated move, I'd also re-organized the domain's admin accounts, and the one that made the backup wasn't the same one being used to restore. (Likely a bug -- the error message was completely unrelated, and the new account had sufficient privileges.)

In the end, I was thankfully able to restore everything, but I looked like an idiot.

 

The service that Baylor hospital systems use to estimate cost of procedures, some storage arrays, the custom hardware that my company makes (frequently, but not in the field), a motorcycle.


Dave Follett
Husband, father of 👧 👧, and software developer living in the Metro Detroit area. Interested in continuous learning, tinkering, and exploring new technology. LEARN.always