DevOps Horror Stories to Slow Development and Freeze Operations

#devops #sre

Halloween is a scary time to be in abandoned buildings, cemeteries, and dark forests… and DevOps teams. Developers, operations engineers, and SREs told us some DevOps horror stories that have haunted them to this day. Light some candles, gather your courage, and read the spine-chilling tales of terrifying errors, bone-chilling data loss, and nightmarish lost weekends.

Relax, It’s Just the Complete Loss of All Data

During a routine attempt to gather information from our production MySQL DB, I ran a dual-purpose script: one section gathered information about the schemas, and another section handled database migration. The engineer running it was expected to comment out whichever section was not needed at the time.

I neglected to comment out the migration portion of the script before running it for discovery, which immediately dropped the entire production database. We had a read replica in another AZ in AWS, but by the time I recognized the error, the dropped tables had already replicated, resulting in the complete loss of all data. Compounding this, our CTO was out of town, so the incident had to be reported directly to our CEO, who promptly spent the next hour watching over my shoulder as I spun up a new RDS database and restored data from the most recent snapshot, taken approximately 30 minutes earlier. I was a newly promoted DevOps engineer at the time, having just been moved up from the desktop support team, when I made this colossal error.

I still consider it to be the most stressful/terrifying event in my DevOps career.

Petrified,

Inopportune Deployment Engineer
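
One way to avoid relying on commented-out sections is to make the destructive path something the operator has to ask for explicitly. The sketch below is only an illustration, not the script from the story: the `discover` and `migrate` modes, the `--confirm-target` flag, and the `plan` helper are all hypothetical names, and the function only decides which section would run rather than touching a database.

```python
import argparse
from typing import Sequence


def build_parser() -> argparse.ArgumentParser:
    """Two explicit modes instead of one script with commented-out sections."""
    parser = argparse.ArgumentParser(description="schema discovery / migration (illustrative)")
    sub = parser.add_subparsers(dest="mode", required=True)
    sub.add_parser("discover", help="read-only schema discovery")
    migrate = sub.add_parser("migrate", help="destructive migration")
    # Hypothetical safety flag: the operator must type the target database name.
    migrate.add_argument("--confirm-target", required=True)
    return parser


def plan(argv: Sequence[str], target_db: str) -> str:
    """Return which section would run; exit if a migration is not confirmed."""
    args = build_parser().parse_args(list(argv))
    if args.mode == "discover":
        return "discover"
    if args.confirm_target != target_db:
        raise SystemExit(f"refusing to migrate: --confirm-target does not match {target_db!r}")
    return "migrate"
```

With this shape, running the discovery mode can never reach the migration code, and a migration against `prod_db` only proceeds when invoked as `migrate --confirm-target prod_db`.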

These Credentials Don’t Work in Swedish

I was responsible for setting up the flow for a tourist company. The flow had one queue, which was protected by a simple username/password (a long time ago, this was normal). The credentials were shared with me in an email. The issue was: the guy who shared them was Swedish, and they were in the format: användare lösenord.

I didn’t speak Swedish at that time and just added the credentials. When we launched it on prod we started losing all the messages. After investigation we found out that the actual user and password were hidden in whitespace and became visible on select… It was a way to “secure” them in an email.

And those words just mean “user” and “password” in Swedish.

Terrified,

Senior Developer

A Not-So-Hot Fix in Production

In the heat of applying a hot fix in production, I accidentally deleted all k8s deployments that were in the non-default namespace with just one command. With a collaborative effort from development, we recovered quickly. But a single simple kubectl command can wipe out almost your whole cluster without any request for confirmation.

Paranoid,

Senior Site Reliability Engineer
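
Because kubectl does not ask before deleting, some teams wrap it in a small guard. The following is a minimal sketch under stated assumptions: the story does not say which command was run, so the list of "broad" flags is a guess at the risky cases, and `needs_confirmation` and `dry_run_args` are hypothetical helper names. The only kubectl features it relies on are the `--all`, `--all-namespaces`/`-A`, and `--dry-run=client` flags.

```python
from typing import List, Sequence

# Flags that make a single delete apply to many resources at once.
BROAD_FLAGS = {"--all", "--all-namespaces", "-A"}


def needs_confirmation(kubectl_args: Sequence[str]) -> bool:
    """True for a `delete` that also carries one of the broad flags."""
    args = list(kubectl_args)
    return "delete" in args and any(flag in args for flag in BROAD_FLAGS)


def dry_run_args(kubectl_args: Sequence[str]) -> List[str]:
    """Same arguments plus a client-side dry run, to preview what would go."""
    return list(kubectl_args) + ["--dry-run=client"]


risky = ["delete", "deployments", "--all", "-n", "staging"]
if needs_confirmation(risky):
    # A real wrapper might run this preview and then prompt the operator.
    preview = ["kubectl"] + dry_run_args(risky)
```

This simple membership check would miss spellings such as `--all=true`, so it is a starting point rather than a complete safeguard.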

A Weekend Ruined By Floppy Disks

A long time ago, when it was still a fairly common and feasible practice to put an entire app’s database on a few floppy disks, I made the mistake of fiddling with the .DBF files without first making a backup. Needless to say, I screwed something up and had to spend the rest of my weekend fixing the files using a C program I cobbled together to gather up all the old data into new tables.

Luckily, I had enough information from reference materials on hand to be able to figure out the file format and where all the data was on the disks (this was in pre-Internet times). It still wasn’t fun, and my supervisor rightfully chewed me out for not taking proper precautions.

Freaked out,

Consultant

A Case of Bad Timing

I once deployed an application ahead of time and scheduled a cron job to restart the webserver at 8am, but instead it ran every 8 minutes. #DevOoops!

With curdled blood,

Developer
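
The story does not give the actual crontab line, but a plausible version of this mix-up is putting the 8 in the minute field as a step (`*/8 * * * *`) instead of in the hour field (`0 8 * * *`). The sketch below counts how often each expression fires per day. It is a deliberately tiny parser that assumes only `*`, `*/n`, and single numbers appear, and `expand_field` and `runs_per_day` are hypothetical helpers, not part of any cron library.

```python
from typing import List


def expand_field(field: str, low: int, high: int) -> List[int]:
    """Expand a cron field limited to '*', '*/n', or a single number."""
    if field == "*":
        return list(range(low, high + 1))
    if field.startswith("*/"):
        return list(range(low, high + 1, int(field[2:])))
    return [int(field)]


def runs_per_day(expr: str) -> int:
    """Count daily runs from the minute and hour fields only."""
    minute, hour, *_ = expr.split()
    return len(expand_field(minute, 0, 59)) * len(expand_field(hour, 0, 23))


intended = runs_per_day("0 8 * * *")     # once a day, at 08:00
mistaken = runs_per_day("*/8 * * * *")   # minutes 0, 8, ..., 56 of every hour
print(intended, mistaken)                # prints "1 192"
```

Checking a schedule this way before deploying it makes the difference between one restart and 192 restarts a day hard to miss.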

The Road to Prod Is Paved with Good Intentions

One of my developers decided to “improve” a production deployment script. Against my advice, he started making changes directly in the production environment rather than in development, and management didn’t seem concerned. He left work at 5pm on the dot every day, and this day was no exception. The script changes were unfinished and untested, but live in production. All production deployments failed overnight, costing the company many tens of thousands of dollars.

I came into the office in the morning, was confronted by livid Operations staff (and their manager), and quickly reverted his code. This helped convince Development management to see that code changes needed to be done in dev first. The developer was “convinced to resign from the company,” and he did. Many years later I still run into developers who want to take shortcuts to production, and I tell them this story.

Alarmed,

Architect

Wedding Day Fiasco

Years ago, one of our developers added a new feature to our web application on Friday just before our release. The application was delivered to the customer that evening and deployed overnight. On Saturday morning, application users began to report that one of the main functions of the application was not responsive – essentially preventing them from doing their job. The incident was escalated, and I was called in early Saturday afternoon to help troubleshoot the problem.

It took a couple of hours to find the problem because we didn’t really have great application telemetry at the time or a great way to debug the deployed application. There was an SQL query that was inadvertently pulling hundreds of thousands of records from the customer database, triggered by every application user. The problem wasn’t seen in development because the developers were using a tiny dataset in comparison. Patching the SQL statement with a LIMIT clause restored the application to its normal speedy self.

Oh, and by the way, when I was called in to troubleshoot the problem, I was called away from a friend’s wedding.

Shrieking,

Senior Principal Software Developer
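
To illustrate why the problem only showed up against a production-sized dataset, here is a small sketch using Python's built-in `sqlite3` module. The `records` table, its columns, and the `fetch_page` helper are made up for the example; the original application's schema and database engine are not described in the story.

```python
import sqlite3

# In-memory stand-in for a large customer table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO records (payload) VALUES (?)",
    (("row",) for _ in range(100_000)),
)


def fetch_page(conn: sqlite3.Connection, page_size: int, offset: int = 0) -> list:
    """Bounded query: at most page_size rows come back per call."""
    cur = conn.execute(
        "SELECT id, payload FROM records ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return cur.fetchall()


unbounded = conn.execute("SELECT id, payload FROM records").fetchall()
page = fetch_page(conn, 50)
print(len(unbounded), len(page))  # prints "100000 50"
```

On a tiny development dataset the two queries return about the same number of rows, which is why the unbounded version can pass testing unnoticed.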

Apologies in advance for your sleepless night tonight. But if you're like us, and can't keep your eyes away, we have some more unfortunate tales where these came from, like the “Zero Width Space” character that broke a k8s deployment, and the fact that things break all the time when you’re an SRE.