Seangles

Posted on Mar 16

The 30-Second Death: A Memoir

#mongodb #linux #devops #database

A true story about a database that died on a schedule, a developer who tried everything, and a CPU security feature nobody told anyone about.

My database kept dying. Every. Thirty. Seconds.

Not crashing with an error. Not leaving a note. Just... gone. Exit code 139. No explanation. No apology.

I checked the logs. MongoDB had started successfully. Recovered from the previous unclean shutdown. Accepted connections. Everything looked fine.

Then it died.

Act I: Blame the Obvious Thing

Me: Why is the DB crashing?

MongoDB: dies

Me: Okay, unclean shutdown, probably just needs to recover—

MongoDB: recovers successfully, then dies again

I did what any reasonable engineer does. I blamed the obvious thing.

"It's mongo:latest," I said confidently. "Never trust latest." I pinned the version. mongo:8.0.14-noble. Stable. Specific. Professional.

image: mongo:8.0.14-noble

MongoDB: dies at exactly 30 seconds

Act II: Disable Everything

Fine. The diagnostics collector — FTDC. It reads /proc files in the background. Kernel 6.19 is brand new. Maybe something changed.

command: mongod --setParameter diagnosticDataCollectionEnabled=false

MongoDB: dies at exactly 30 seconds

Seccomp, then. Docker's syscall filter. A classic gotcha. I ran the container with --security-opt seccomp=unconfined, essentially handing it a skeleton key to the entire kernel.

MongoDB: dies at exactly 30 seconds

The audacity.

Act III: The Nuclear Option

At this point I had one move left: mongo:7. Downgrade. Accept defeat. Move on with my life.

image: mongo:7

MongoDB 7: reads the data files MongoDB 8 wrote

MongoDB 7: these are not my files

MongoDB 7: exits with code 62

Two databases. Zero working. Outstanding.

Act IV: The Internet Knows

I went searching. I found a GitHub issue titled:

"[arm64] MongoDB 8.x crashes ~30s after startup on 6.19.0-sky1-latest.r5"

ARM64. I am on x86_64. I kept reading anyway.

Then, buried in a bug report, the answer:

"SIGSEGV exactly 30s after start on AMD Zen 5 due to hardware Shadow Stacks (user_shstk) clashing with coroutines."

The Actual Explanation (I Promise It Makes Sense)

Let me walk you through what actually happened here, because I need you to appreciate the full chain of events.

Intel invented a security feature called Control-flow Enforcement Technology (CET). The idea: add a second, hardware-enforced call stack that runs in parallel with the normal one. Every time a function returns, the CPU checks that the return address matches what the shadow stack recorded. Exploits that hijack return addresses — like ROP chains — get caught at the hardware level before they can do anything. It's genuinely clever engineering.

AMD implemented the shadow stack part of it too, under the name Shadow Stacks.

Linux kernel 6.19 decided to enable it by default for user processes.

Now. MongoDB 8.0 uses coroutines. Coroutines do context switching by manually swapping stack pointers — they save the current stack state and jump to a different one. This is a completely normal thing coroutines do.

The shadow stack looked at this manual stack swap and said:

"That return address does not match what I recorded. This is a security violation."

Then it fired a SIGSEGV. Signal 11. Exit code 139.

After exactly 30 seconds — the interval at which MongoDB's coroutine scheduler runs a particular background task.

Every time. Like clockwork.
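If the mismatch is hard to picture, here is a toy model in Python. It is only a sketch of the idea described above, not MongoDB's code and not how the CPU is actually programmed: `ToyCPU`, `switch_stack`, and the fake addresses are all made up for illustration. The one assumption that matters is that the coroutine switch swaps the normal stack and leaves the shadow stack alone.

```python
class ShadowStackViolation(Exception):
    """Stands in for the hardware fault that ends up as SIGSEGV."""


class ToyCPU:
    """Toy model: a normal stack that software can swap, and a shadow stack it cannot."""

    def __init__(self):
        self.stack = []         # normal call stack (return addresses only)
        self.shadow_stack = []  # hardware-maintained copy of the return addresses

    def call(self, return_address):
        # A call pushes the return address onto both stacks.
        self.stack.append(return_address)
        self.shadow_stack.append(return_address)

    def ret(self):
        # A return pops both stacks and compares the two addresses.
        actual = self.stack.pop()
        expected = self.shadow_stack.pop()
        if actual != expected:
            raise ShadowStackViolation(
                f"returning to {actual!r}, shadow stack recorded {expected!r}"
            )
        return actual

    def switch_stack(self, other_stack):
        # Coroutine-style context switch: only the normal stack is swapped.
        previous, self.stack = self.stack, other_stack
        return previous


def run_demo():
    cpu = ToyCPU()
    cpu.call("main+0x10")               # main calls into the scheduler
    coroutine_stack = ["task+0x20"]     # saved stack of a suspended coroutine
    cpu.switch_stack(coroutine_stack)   # manual stack swap
    try:
        cpu.ret()                       # tries to resume the coroutine
    except ShadowStackViolation as exc:
        return str(exc)
    return "no violation"
```

Calling `run_demo()` returns the violation message, because the return address on the swapped-in stack is not the one the shadow stack recorded. Ordinary call and return pairs on the same `ToyCPU` go through without complaint.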

The Fix

One environment variable:

environment:
  GLIBC_TUNABLES: glibc.cpu.hwcaps=-SHSTK

This tells glibc to disable Shadow Stacks for the process before it starts. MongoDB's coroutines do their stack swaps unmolested. The database lives.

Status: running, ExitCode: 0, Restarts: 0
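For anyone comparing that status line with the 139 from earlier: by the usual shell and Docker convention, an exit code above 128 means the process was killed by signal number `code - 128`. The helper below is a small sketch of that arithmetic; `describe_exit_code` is a name I made up, and an application is free to pick codes above 128 itself, so treat the result as a strong hint and not as proof.

```python
import signal


def describe_exit_code(code):
    """Interpret a container exit code using the 128 + signal convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number this platform knows about.
            return f"killed by unknown signal {signum}"
        return f"killed by {name} (signal {signum})"
    return f"exited with application error code {code}"
```

With this, `describe_exit_code(139)` reports SIGSEGV (signal 11), while the 62 from the MongoDB 7 detour comes back as a plain application error code.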

Post-Mortem


| Item | Detail |
| --- | --- |
| Time spent | too long |
| Lines of code changed | 1 |
| Red herrings investigated | `mongo:latest` tag, FTDC, Docker seccomp, data file incompatibility |
| Actual cause | AMD Zen 5 + Linux 6.19 + CET Shadow Stacks + MongoDB coroutines |
| What I should have searched first | literally anything other than what I searched |

Takeaway

If you're running MongoDB 8.0 in Docker on a Linux 6.19+ kernel with a newer AMD CPU (Zen 5 in my case) and your container dies silently every ~30 seconds with exit code 139, this is your fix:

environment:
  GLIBC_TUNABLES: glibc.cpu.hwcaps=-SHSTK

You're welcome. I suffered so you don't have to.
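If you want to check whether a host matches the combination above before you lose an afternoon, here is a rough Python sketch. It assumes the `user_shstk` flag from the bug report shows up on the `flags` line of `/proc/cpuinfo`, and it takes the 6.19 threshold straight from my own experience, not from any official changelog. The function names are mine. A `True` only means "worth trying the fix", nothing more.

```python
import re


def cpu_has_user_shstk(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo text lists user_shstk."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "user_shstk" in value.split():
            return True
    return False


def kernel_at_least(release, major, minor):
    """Compare the leading 'X.Y' of a kernel release string against (major, minor)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        return False
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


def likely_affected(cpuinfo_text, kernel_release):
    """Heuristic: shadow-stack-capable CPU plus a 6.19 or newer kernel."""
    return cpu_has_user_shstk(cpuinfo_text) and kernel_at_least(kernel_release, 6, 19)
```

On a real machine you would pass in the contents of `/proc/cpuinfo` and the string from `platform.release()`.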

Dedicated to everyone who has ever watched a container die on a 30-second interval and felt their personality change. Yes, this article was written with the help of an LLM. I had already spent too much time on this, and it's genuinely one of the weirdest issues I faced this month :p
