Ben Halpern

Posted on Jan 12, 2018

What was the worst bug you've ever written?

#discuss

So I just wrote my worst bug ever.

Any stories to make me feel better?

Top comments (99)

David Muckle • Jan 12 '18

Already told this story as a tweet responding to the above, but longer stories are always fun.

I set up a Raspberry Pi Zero with a webcam to monitor my front door for potential visitors. I wanted it to text me any time someone was at the door. Originally I tried to use my provider's email-to-text gateway, but emails sent from the Pi just ended up looking like spam, and I didn't want to bother setting up an email relay, so I looked into Twilio. Thankfully they have a nice little script you can use to text a given number; you just have to supply your API key and get a Twilio number to text from. The software I was using, motion, also has a bunch of nifty hooks, including one for when it detects an "event" of motion, the definition of which you can tune. Awesome! So I plugged the script into that, restarted motion, and started waving my hand in front of the camera!

I then proceeded to get about 500 texts in the span of about a minute.

Eventually, after shutting everything down, I figured out what I had done wrong. I had put the script into the on_motion hook instead of the on_motion_event hook. The former fires every time a single frame of motion is detected, so wave your hand in front of the camera for a bit and you're bound to trigger it many times. So not only did I manage to flood my phone with texts, I also burned through a good chunk of my Twilio trial credits.
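Even with the right hook, a cooldown inside the notification script is a cheap second line of defense against a text storm. A minimal Python sketch: `send` stands in for whatever actually sends the SMS (for example, the Twilio script), and the class name and 60-second default are hypothetical choices, not part of motion or Twilio.

```python
import time


class CooldownNotifier:
    """Call `send` at most once per `cooldown` seconds; extra triggers are dropped."""

    def __init__(self, send, cooldown=60.0, clock=time.monotonic):
        self.send = send            # hypothetical: whatever delivers the text
        self.cooldown = cooldown    # seconds to stay quiet after a send
        self.clock = clock          # injectable so the behavior can be tested
        self._last_sent = None

    def trigger(self, message):
        now = self.clock()
        if self._last_sent is not None and now - self._last_sent < self.cooldown:
            return False  # still cooling down: swallow this trigger
        self._last_sent = now
        self.send(message)
        return True
```

With this in place, 500 per-frame triggers inside one minute produce a single text; the next trigger after the cooldown expires sends again.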

Doug Stull • Jan 13 '18 • Edited

I was once testing sudoers and happened to name the test account f*ckyou@companydomain.com. Well, I made a mistake in the sudoers setup, and it sent the traditional "f*ckyou@companydomain.com is not authorized" message to the support email address, which had about 60 people on it, ranging from business folks to Unix admins...

Ben Halpern • Jan 13 '18

This is pretty great

Rob Waller

@robdwaller

@ThePracticalDev @heroku That's nothing, I once wrote an email script to send 100,000 emails out. It sent one email to the first address, two to the second, three to the third, four to the fourth... I managed to turn the server off at about 1,000...

22:56 - 12 Jan 2018

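The tweet doesn't say what caused it, but one hypothetical way to get exactly this pattern is an inner loop whose bound grows with the outer index. The total then grows as the triangular number n(n+1)/2, which is why a full run over 100,000 addresses would have meant roughly five billion emails. A minimal sketch; the function names are made up for illustration:

```python
def buggy_send_all(addresses, send):
    # Hypothetical bug: the inner loop's bound grows with the outer index,
    # so the i-th address (1-based) is mailed i times.
    for i, address in enumerate(addresses):
        for _ in range(i + 1):
            send(address)


def fixed_send_all(addresses, send):
    # One send per address.
    for address in addresses:
        send(address)


def total_sends(n):
    # Total for the buggy version: 1 + 2 + ... + n = n(n + 1) / 2
    return n * (n + 1) // 2
```

For example, `total_sends(1000)` is 500,500 and `total_sends(100000)` is 5,000,050,000.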

Pietro Bongiovanni • Jan 18 '18

Had a similar one. I had just rewritten a bunch of cron jobs that send time sensitive emails.

There was one cron job that was supposed to run every minute and check the status of a bunch of records that were created 20 minutes earlier.

When I was testing this little script, I changed the time span in which the records needed to have been created from 1 minute to something like three days, and I forgot to change it back before deploying.

The deploy itself went fine, because the script wasn't supposed to run until around 9 AM the following morning.

Well, the morning after, I got to the office and everybody was running around in a panic because some customers had already received ~30 emails each. I ran to my computer and killed the container that was sending the emails before going through the source code to find the problem.

I sent around 10k emails in 10 minutes.

I have never written a cron job without a unit test since.
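Keeping the time window as a named setting and the record selection as a pure function makes this kind of leftover easy to catch in a test. A minimal sketch, assuming records are dicts with a `created_at` datetime; `records_due`, `DELAY`, `WINDOW` and the test names are hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical settings: the job runs every minute and looks at records
# created 20 minutes ago, in a window as wide as the schedule interval.
DELAY = timedelta(minutes=20)
WINDOW = timedelta(minutes=1)


def records_due(records, now, delay=DELAY, window=WINDOW):
    """Return records created in [now - delay - window, now - delay)."""
    end = now - delay
    start = end - window
    return [r for r in records if start <= r["created_at"] < end]


def test_window_matches_schedule():
    # Fails loudly if a widened testing window (e.g. three days) ships.
    assert WINDOW == timedelta(minutes=1)


def test_old_records_are_not_due():
    now = datetime(2018, 1, 18, 9, 0)
    records = [
        {"id": 1, "created_at": now - timedelta(minutes=20, seconds=30)},
        {"id": 2, "created_at": now - timedelta(days=3)},
    ]
    assert [r["id"] for r in records_due(records, now)] == [1]
```

The half-open window means consecutive runs one minute apart never pick up the same record twice.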

Shalvah • Mar 21 '18

Wait...cron jobs have unit tests?

sehe • Jul 4 '18

Yours don't?

Jake Casto • Jan 13 '18

I was messing with a (dumb) co-worker. I told him that 'rm -rf /' would fix all the issues in his code. As a joke, I added it to the top of the loader for our web app... and accidentally pushed it to a production server. Long story short, I lost my job and had to rewrite everything.

Jason C. McDonald • Jan 13 '18

Yikes! This is exactly why we tell people not to make light of death commands. :P You make that mistake once, and never again!

Max Cerrina • Jan 13 '18

Such a benign version, but I still remember how upset I was as a kid when someone told me Alt+F4 would refresh the page or something.

𝚓𝚘𝚕𝚕𝚢 | 𝚠𝚒𝚕𝚌𝚘 • Jan 13 '18

"there's this really cool Game Boy cheat code for Tetris, if you press 'Select, Start, B, and A' all at the same time" is a really cool/cruel joke kids of the early 90s would tell each other ...

𝚓𝚘𝚕𝚕𝚢 | 𝚠𝚒𝚕𝚌𝚘 • Jan 13 '18

more related to your benign example (and even more benign): when i was first really tinkering with Linux back in like 2001/2002, i couldn't figure out how to get out of this program i was using (you'd think it would be vi, but it was probably just a man page), and a co-worker suggested Alt-F4. while that did technically get me unstuck, it took me another day or so to figure out what i'd actually done

Jason C. McDonald • Jan 13 '18

Let me guess - erases scores?

𝚓𝚘𝚕𝚕𝚢 | 𝚠𝚒𝚕𝚌𝚘 • Jan 14 '18

it reboots the device!

Ben Sinclair • Apr 16 '18

On IRC it'd be "use /quit bugging me " to ignore someone.

Paula • Jan 13 '18

Damn, this is hardcore

Andy Zhao (he/him) • Jan 13 '18

Oh man... My heart really sank for you...

Jake Casto • Jan 13 '18

It did fix all code issues though :)

Ben Halpern • Jan 13 '18

Are you being serious? That is remarkable.

Jake Casto • Jan 13 '18

I’m dead serious.

𝚓𝚘𝚕𝚕𝚢 | 𝚠𝚒𝚕𝚌𝚘 • Jan 13 '18

that's the Chicago Fire method of fixing things-- burn it all down, and start over

Pietro Bongiovanni • Jan 18 '18

I once ran rm -rf / in a virtual machine just for fun, to see what would happen.

The VM had some disks shared with my actual OS, and it started deleting everything I had on my desktop and in some other important folders.

I was long overdue for a backup, so I ended up losing a crap ton of data, but now I will never forget to check for shared disks when I'm working in a VM.

Michael Peyper • Jan 13 '18

I think you win this round of "who's made the biggest mistake".

Phil Ashby • Jan 15 '18

To your credit you identified a number of workflow issues doing that:

peer review anyone?
the code only /exists/ on production (where's the source control)?
firing you /before/ asking you to rewrite seems pretty short-sighted.

I hope you've got a job somewhere more thoughtful now :)

Jake Casto • Jan 18 '18

Thanks for trying to give me credit! A few issues there, haha. I was the primary reviewer, and I had worked on everything so long that no one really bothered reviewing my code. We had two git repos, one for development and one for production. I pushed the commit to the production repo instead of the development one, haha.

Dan Fellini • Apr 21 '18

Epic. I love it.

Jason C. McDonald • Jan 13 '18

Many years ago, I tried to use some deeply nested for loops in ActionScript 3.0 code I wrote to find the right combination of values in an XML template to match the hash of a given file. (The goal was to determine what exactly had changed in the file.)

About 20 minutes into waiting for it to finish, I decided to figure out how long it was going to take.

I determined mathematically it would finish on June 18, 2598 at 8:42 AM.

I learned two things that day:

1) One-way hashes are called that for a reason.

2) Deeply nested for loops are the work of the Devil anyway.
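The projection itself is simple arithmetic: the search space is the product of the number of candidate values at each nesting level, so finish ≈ start + (n1 × n2 × ... × nk) / (combinations checked per second). A minimal sketch with hypothetical names, which also uses `itertools.product` in place of hand-written nested loops:

```python
from datetime import datetime, timedelta
from itertools import product
from math import prod


def estimate_finish(option_counts, tried, elapsed, started):
    """Project the finish time of an exhaustive search at a constant rate.

    option_counts: number of candidate values at each nesting level
    tried: combinations checked so far
    elapsed: time spent so far, as a timedelta
    started: when the search began, as a datetime
    """
    total = prod(option_counts)               # size of the whole search space
    rate = tried / elapsed.total_seconds()    # combinations checked per second
    return started + timedelta(seconds=total / rate)


def search(choices, matches):
    """Try every combination; product() replaces one nested loop per level."""
    for combo in product(*choices):
        if matches(combo):
            return combo
    return None
```

For instance, three levels of 10 options each (1,000 combinations) at 10 combinations per second finish 100 seconds after the start. Adding a level multiplies the total, which is how a search ends up scheduled for 2598.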

Ben Halpern • Jan 13 '18

Well how's the script coming along now? Any celebrations planned for 2598?

Jason C. McDonald • Jan 13 '18

I don't have that much patience, so I just aborted it instead. :P

Bruno Louzada • Jan 15 '18

I discovered multi-row INSERT the hard way. I tried to insert 250k rows into a remote database with one INSERT statement per row, and it was going to take 5 days, hahaha. With a single INSERT carrying all 250k rows, it took < 5 mins.
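The comment doesn't name the database, so here is the statement shape using Python's built-in `sqlite3` as a stand-in. With a remote server the gain is larger, because every statement costs a network round trip. A minimal sketch; the `items` table and the batch size are hypothetical:

```python
import sqlite3


def insert_rows(conn, rows, batch_size=400):
    """Insert (id, name) rows with multi-row INSERT statements.

    Each statement carries up to `batch_size` rows, so 250k rows take
    625 statements instead of 250k. 400 rows x 2 columns = 800 bound
    parameters, under SQLite's historical default limit of 999.
    """
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        placeholders = ", ".join(["(?, ?)"] * len(batch))
        params = [value for row in batch for value in row]  # flatten rows
        conn.execute(f"INSERT INTO items (id, name) VALUES {placeholders}", params)
    conn.commit()
```

Batching rather than sending all rows in one statement keeps each statement under the server's parameter and packet-size limits.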

Dian Fay • Jan 13 '18

I stopped the line at a major North American steel producer for the better part of a day.

marcellothearcane • Nov 22 '18

But how?

Dian Fay • Nov 22 '18

It's been over ten years and I don't remember the exact details. It probably had something to do with the fact that I was very much still learning basic SQL (at a co-op/internship kind of thing) and the company didn't have any kind of process controls for pushing to production until very shortly afterward.

Anthony Orona • Feb 22 '20

Haha that reminds me of my mess-up at my internship. I was working on my first big feature and iteratively pushing code into the master branch as I learned. My code was tagged for release and deployed to production with bugs like links not working properly etc. Not that big of a deal but really embarrassing for me.

Vaidehi Joshi • Jan 13 '18

I wrote a database migration to add a column and was as happy as a clam! I deployed it to production and thought everything would be fine.

Five minutes later, I realized that my migration also updated every single user record, the database was locked, no one could access prod, and OOPS 🙈
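The comment doesn't name the framework or database, and locking behavior differs between engines. A common mitigation is to keep the schema change separate from the data change and backfill in small batches, each in its own short transaction. A minimal sketch using Python's built-in `sqlite3` to show the shape; the `users`, `name` and `display_name` names are hypothetical:

```python
import sqlite3


def add_column_and_backfill(conn, batch_size=1000):
    """Add a column, then fill it in small batches instead of one giant UPDATE.

    Each batch commits on its own, so locks are held only briefly and
    other queries can run in between.
    """
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.commit()
    last_id = 0
    while True:
        # Next batch of ids, in order, after the last one processed.
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size))]
        if not ids:
            break
        conn.execute(
            "UPDATE users SET display_name = name WHERE id BETWEEN ? AND ?",
            (ids[0], ids[-1]))
        conn.commit()  # end the transaction so the lock is released
        last_id = ids[-1]
```

Because each batch is selected in id order with a limit, the BETWEEN range covers exactly the ids in that batch.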

Vaidehi Joshi • Jan 13 '18

PS. You all did a great job handling the downtime ❤️

Ben Halpern • Jan 13 '18

Shalvah • Mar 21 '18

Yasss! I did this too. 😂😂

Evan Plaice • Jan 13 '18

Network communication bug on a custom ping-pong handshake protocol caused by stale state in an array overflow.

For a little perspective: I was writing a graphical multi-touchscreen front end for an '80s-era commercial flight simulator. The sim host ran an early-'80s-era real-time AIX environment. Booting the damn thing took about 15 minutes and involved following a 'bootstrap loader' procedure in which you'd have to fat-finger the boot instructions into a keypad in hexadecimal. Ethernet networking didn't exist when it was built, so we had to have an Ethernet card custom built. On the software end, we were working with an AIX specialist from the UK who designed the protocol, implemented a bare-bones Ethernet driver, and provided a client/server implementation on the host end.

Our AIX guy had done this before on other similar machines, but it was my first journey into network programming. I wouldn't have physical access to the host until integration, so the AIX guy gave me a mini client/server simulator app written in C that I could test my client/server implementation against.

My end was written entirely in C#, where doing layer-2 networking is difficult enough. Either way, I found a PCAP wrapper (SharpPcap) to hack together a networking protocol, translated a floating-point format converter from C (the host didn't use IEEE 754), reverse engineered a raw dump of the symbol table and loaded it into a DB, etc...

Integration finally came, and I was a nervous wreck. We had 7 full days of downtime to make the changes, including ripping out the old hardware (ray-tracing displays and 100+ buttons) and replacing it with the new (3 touch screens). Flight simulators are the backbone of an airline: pilots who fall behind on FAA-required training are grounded until they catch up. Therefore, flight simulators typically operate 20 hours/day, 7 days/week. Downtime beyond our scheduled window was not an option.

Months of preparation and death-march sprinting was finally yielding results. Everything was working brilliantly to the point where we could even start identifying opportunities to make adjustments and performance improvements.

That was, until somebody noticed something strange happening on the interface. Instead of updating to show the current state of the host as expected, a small select number of labels were constantly toggling between the correct value and something else. The symptom only occurred on a few pages, and only when the pages were loaded in a specific order.

I racked my brain for hours, chugging coffee and poring over every detail of my networking code until about 4 AM. That's when I finally managed to catch the AIX specialist taking a break from his VT220 terminal. I picked his brain, going back over the specifics of the networking protocol spec. When that failed to yield results, I started asking him how he had implemented the protocol on the host. That's when I had a sudden 'lightbulb' moment.

It turns out that on '80s-era hardware running under real-time constraints, re-initializing the state array on his end for every update (roughly every few hundred milliseconds) is a very expensive operation. To avoid the performance penalty, he allocated the array once at a fixed maximum size and wrote the new values over the existing ones on each update.

On my end, running C# on modern Windows hardware, initializing a fresh array is the cheap and 'safe' approach, so that's exactly what I did. Since I wasn't sending a fixed-length set large enough to blank out the values beyond the new set, stale values could persist in the overflow (tail) of the host's array.

If an old label existed in the overflow (which wasn't set to update) and the same label was present in the new set of values (which was set to update), the UI would read either the old or the new value at random.

This only occurred under very specific conditions. The current set of labels had to be shorter than the previous set. The same label had to be present in both the current set and the overflow. And the overflow would persist only as long as each new set of labels was shorter than the set the old label had been part of.
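The failure mode can be reproduced in a few lines. A minimal Python sketch, assuming the host keeps a fixed-capacity list of requested labels and overwrites it from the front; the capacity, the label names and the function names are hypothetical, and the real exchange was a custom layer-2 protocol, not Python:

```python
CAPACITY = 6  # hypothetical fixed size of the host's state array


class HostStateArray:
    """Host side: allocated once at full size and overwritten in place."""

    def __init__(self):
        self.labels = [None] * CAPACITY

    def load(self, requested):
        if len(requested) > CAPACITY:
            raise ValueError("request exceeds host capacity")
        # Only the first len(requested) slots are overwritten; the tail
        # ("overflow") still holds labels from earlier, longer requests.
        self.labels[:len(requested)] = requested


def request_buggy(host, labels):
    host.load(labels)  # a shorter request leaves stale tail entries behind


def request_fixed(host, labels):
    # Pad to full capacity so every stale slot is blanked out.
    host.load(labels + [None] * (CAPACITY - len(labels)))
```

Requesting `["ALT", "HDG", "SPD", "VS"]` and then `["SPD", "ALT"]` with `request_buggy` leaves the host holding `["SPD", "ALT", "SPD", "VS", None, None]`: "SPD" is both live and stale, which is exactly the duplicate-label condition described above. With `request_fixed`, the second request leaves `["SPD", "ALT", None, None, None, None]`.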

Through sheer luck, I managed to get the code patched and working before I left for the hotel. Up until that point, I had worked mostly on the hardware side. I even planned and installed all the hardware for the update before switching back to code.

It's not like this was my first 'aha' troubleshooting moment, but it was my first full Boris-esque 'I am invincible!' moment. TBH, I've been hooked ever since.

Max Cerrina • Jan 13 '18

Honest to god, reading that was a hell of a rush. Write more! That was amazing.

Forest Hoffman • Jan 12 '18

There were a few that occurred in development, so no production environments were affected (that I can remember). However, I did once mistype a sudo rm -rf command...

I intended to type ./ for the current directory but failed to enter the period, so it was only after I hit Enter that I realized I had run it with a / in front of everything. I had several backups, so I only lost a bit of work.

I still gave myself a minor heart attack.

Leah • Jan 13 '18

That's why you just write . so you can't accidentally delete everything 👀

Christopher McClellan • Jan 13 '18

How about “I’ve done this too.”?

Fixed a bug in our build system. (Commits made directly on a branch, rather than merged, weren't being built.)
Our build system relied on itself to perform its own build.
Yup. Around 9000 deploys before we managed to shut it down and pin the build to an older version of itself...
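The comment doesn't spell out the mechanism, but a common way a self-hosted build ends up rebuilding itself is when its own output (for example, a version-bump commit pushed directly to a branch) counts as a new commit to build. Assuming that is roughly what happened, a guard like the sketch below breaks the cycle. The bot name and commit shape are hypothetical; the `[skip ci]` message marker is a convention several CI services honor, but this function is not any particular service's API.

```python
BOT_AUTHOR = "build-bot"   # hypothetical identity the build system commits as
SKIP_MARKER = "[skip ci]"  # marker the build system adds to its own commits


def should_build(commit):
    """Return False for commits the build system made itself.

    `commit` is assumed to be a dict with "author" and "message" keys.
    """
    if commit["author"] == BOT_AUTHOR:
        return False
    if SKIP_MARKER in commit["message"]:
        return False
    return True
```

Checking both the author and the message marker means either one alone is enough to stop the build from triggering itself.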
