dev.to staff for The DEV Team

Have You Ever Encountered A Ghost in the Machine? 👻

Ever felt like you've crossed paths with "Ghosts in the Machine"? Whether it's mysterious tech phenomena or real-world ghostly encounters, we want to hear your tales. What unexplained events have you witnessed, and how do you perceive the supernatural in our tech-infused lives? Join the conversation and share your stories! 👻📡


👻🎃 Join Our Halloween Costume Contest! 🎃👻

Embrace the eerie season, don your most creative costume, and snap a pic. Share it with our community for a chance to win badges. Check out our post for more info. 😈📸🏆

Follow the DEV Team for more discussions and online camaraderie!

Top comments (5)

Ben Sinclair (moopet)

I remember having an HP ProLiant server (one of the ones that could be configured as a tower or rack-mounted) back in the 90s that was hosting a database for something or other.

It had intermittent crashes in the database application (Lotus Domino!), so I took it off the rack and brought it to my office workbench so I could work on it. I couldn't replicate the issue. Refitted it to the rack, and the same problems started popping up.

So I took it back out, replaced the drive with a fresh one, restored a backup and tested it. Solid. Put it back in the rack. Failed within a day.

This went on, with me changing parts and tinkering with software. Nothing in the logs. It just corrupted the database a couple of times a day.

What was different about the server room?

Well, I eventually ran an extension power cord from somewhere else directly to the machine, bypassing the power distribution block in the rack, and that fixed it. There was something spiky in the power, and it was evidently just enough to make the machine unstable under heavy load, such as when the database was doing its verification/replication procedure. Replaced the power cabling and all was good.

But for a puzzling week or so it did feel like the server room was haunted.

André Buse (andrebuse)

It was a cold but calm Monday evening and I was about to leave the datacenter when the eerie silence was broken by my pager. "Not again." We'd had minor emergencies ever since we migrated to this datacenter, so I sighed and went back to my desk. The monitoring dashboard informed me that almost all Kubernetes nodes were unreachable.

I tried to open Rancher to inspect the cluster, but the browser stared back at me with an empty page that just would not load. "Great. This will be fun." You could hear each key echoing through the room as I typed the ssh command to take a direct look at one of our control-plane nodes.

[Enter]
Then silence.
Then a loud beeping sound informing me of the failed command: ssh: connect to host 10.66.66.66 port 22: Connection timed out [1]. I tried a few other nodes too, but each time the connection would just time out. And with each failed command, the beeping would get more and more aggressive.

Usually I'd make use of our remote KVM console for this, but in this new datacenter it wasn't fully working yet, so I would need to use the hardware console. The hallway leading to the servers was lit by flickering overhead lights, casting long shadows of my silhouette as I walked towards the main server hall. The closer I got, the louder I could hear the humming of our servers. But it wasn't the usual monotone humming I was used to. Something about it was different, more sinister. It was almost as if the servers were trying to communicate [2].

Now, I didn't give much thought to the noise. I should have, but there I was, still hoping that a simple restart and maybe deleting some rogue, resource-hungry k8s workloads would fix my problems. The restart at least worked, and the command prompt stared at me with a blinking cursor. I typed furiously, trying to debug the issues plaguing the cluster. It seemed like the worker nodes were stuck in a restart loop: just as they came online, they would immediately run out of memory and crash again. And not only the worker nodes: sure enough, the control node I was currently using also ran out of memory after a while. And the top command, which I would use to check memory usage per process, would just segfault [3]. This felt way more sinister than any other issue we'd had recently.

After a call with the rest of my team, we agreed that the cluster was beyond recovery, so we decided to shut all nodes down and do a cold boot. We had a fairly recent internal guideline for this process, since we had just migrated to this datacenter. I started by shutting down all the servers.

Silence.
Then I started the first control nodes.
Everything fine so far. The familiar humming of the servers was back.
Then we added the workers. The humming got louder. And not just a little bit: over the next few seconds the fans seemed to emit an electronic scream before the servers crashed again.

Silence.
Though the silence didn't last long: as configured, the servers would restart after a crash. And sure enough they did, repeating the same pattern of fans whirring louder and louder. It was as if their mechanical hearts had started beating in rhythm with their blinking LEDs. A constant, frightening restart-scream-silence rhythm, like the heartbeat of our datacenter. It continued for well over an hour while I sought shelter back at my desk. On a conference call with my team again, I could still hear the mechanical rhythm through my headphones [4].

As the clock struck midnight, the servers screamed one last time in agony.

Silence.
Then the usual, calm and monotone humming resumed. And with that, the monitoring dashboard started to show more and more nodes in healthy green. [5]


Disclaimer: The story may or may not be slightly exaggerated for dramatic purposes. No servers were harmed in the process. All humans that were in the datacenter recovered from this experience.

[1] Yes, we really had this IP address as an internal running joke
[2] The servers were trying to communicate quite literally, attempting to download data
[3] top crashing turned out to be a known issue: access.redhat.com/solutions/4343051
[4] Really crappy headphones without noise cancelling, also the "hallway" to the servers was only a few meters long
[5] The process that caused the OOM crashes was not intended for these servers. It was a data import process that ran on the first day of each month at 18:00. It was supposed to run on other servers with a lot more memory, but instead it ended up being installed on our Kubernetes nodes and brought them down almost instantly. Our systematic reboots were node-by-node, but by coincidence the time between bringing up nodes was equal to the time it took a node to crash, so after a while the nodes crashed in unison. Hence the rhythmic sound. As for why the script ended up on the k8s nodes? Well, when setting up the new datacenter we decided on a new scheme to assign IP address ranges, and the k8s nodes happened to get the IPs that were previously used for the data import processes. In our later investigation it became clear that the Ansible playbook provisioning the data import setup used the old inventory file; the MR changing it was still marked as draft. The person running the playbook was an admin with access rights to both groups of servers, which led to the playbook running just fine without many errors. The data import itself was configured to run if day_of_month == 1 and hours >= 18 and not_run_today. This caused the issue to magically disappear at midnight.
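
A minimal sketch of that time gate, assuming a simple Python check (the function and variable names are illustrative, not the actual import script), shows why the haunting ended exactly at midnight:

```python
from datetime import datetime

def should_run_import(now: datetime, already_ran_today: bool) -> bool:
    # day_of_month == 1 and hours >= 18 and not_run_today:
    # only on the 1st of the month, only from 18:00 onwards, and only once per day.
    # At midnight the date rolls over to the 2nd, the condition goes false,
    # and the misplaced import stops eating the nodes' memory.
    return now.day == 1 and now.hour >= 18 and not already_ran_today

# 1st of the month at 23:59 -> True; one minute later (the 2nd, 00:00) -> False
print(should_run_import(datetime(2023, 10, 1, 23, 59), already_ran_today=False))  # True
print(should_run_import(datetime(2023, 10, 2, 0, 0), already_ran_today=False))    # False
```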

Adam AUTUORI (askeridos)

Nope, but the album is great.

Griff Polk (thatcomputerguy)

Who hasn’t? Longer post later.

Hasan Elsherbiny (hasanelsherbiny)

Is it sponsored by "Norton Symantec Ghost"? 😂😂😂