The Server That Kept Crashing at 2:17 AM Every Tuesday

#ai #machinelearning #webdev #programming

The Problem We Were Actually Solving

It started with a pager alarm at 2:17 AM on a Tuesday. Not just any Tuesday—the second Tuesday of the month, right after our major game patch dropped. The alert was blunt: heap usage on the Treasure Hunt Engine had spiked to 89% in under seven minutes, then recovered by 2:24 AM. No stack traces, no errors in logs, just memory climbing like a rocket before settling back to baseline. We thought it was a one-off, but then it happened again the next month. Same time. Same silent ramp. Same ghostly recovery.

What we were actually solving wasn't just a memory leak; it was time-of-day latency camouflage. Players didn't care about heap graphs; they cared about their treasure hunt slowing down at exactly 2:17 AM. When we dug deeper, we found the anomaly wasn't in the game code. It was in the Veltrix Operator's configuration file. The server was configured to run a full index rebuild every second Tuesday at 2:00 AM. That rebuild wasn't atomic. It wasn't throttled. It was a fork bomb disguised as housekeeping.

What We Tried First (And Why It Failed)

We started with the obvious: upping the heap from 4 GB to 8 GB. That bought us one patch cycle, until the next Tuesday, when the spike reached 92% instead of 89%. The problem had scaled with the heap, not gone away.

Next, we tried disabling the index rebuild entirely. The game went live without it. Within 48 hours, treasure searches returned empty results for half the global player base. Index rebuilds weren't optional; they were required to keep the search index under 200 ms latency. Without them, the engine fell back to a secondary index that was 3x slower and 2x more expensive to query.

Then we tried staggering the rebuild across regions. That worked for about three weeks—until the rebuild drifted into player peak hours in one region and triggered a cascade of timeouts. Players in Europe started reporting 3-second waits during lunch breaks. Our in-game economy depends on sub-500-ms responses. Anything above that drops engagement by 7% per 250 ms. We could not afford that regression.
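To put a rough number on that regression, here is a back-of-the-envelope sketch of the rule above. The post does not say whether the 7% penalty adds up linearly or compounds per step, so both readings are shown as assumptions; the function name is ours, not part of any tooling we run.

```python
BUDGET_MS = 500        # response-time budget from the post
STEP_MS = 250          # each extra 250 ms over budget...
DROP_PER_STEP = 0.07   # ...costs 7% engagement


def engagement_drop(latency_ms, compounding=False):
    """Estimated fraction of engagement lost at a given latency.

    Assumption: the penalty applies only to time above the 500 ms budget.
    """
    steps = max(0.0, (latency_ms - BUDGET_MS) / STEP_MS)
    if compounding:
        # Each step removes 7% of whatever engagement is left.
        return 1 - (1 - DROP_PER_STEP) ** steps
    # Linear reading: 7 percentage points per step, capped at 100%.
    return min(1.0, DROP_PER_STEP * steps)
```

A 3-second wait is ten steps over budget, so the linear reading gives a 70% drop and the compounding reading roughly 52%. Either way, it is far outside anything we could absorb.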

The Architecture Decision

We finally isolated the issue to the Veltrix Operator's rebuild-schedule.json. The default config used a hard-coded cron expression: 0 2 2 * *. In standard five-field cron syntax, that fires at 02:00 on the 2nd day of every month. No jitter. No backoff. No awareness of regional peak load.
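For readers who don't read cron fluently, here is a minimal sketch of how a five-field expression like that one is evaluated. It is a deliberately simplified matcher of our own (only `*` and single integers, no ranges, lists, or steps), not the Veltrix Operator's parser.

```python
from datetime import datetime


def cron_matches(expr, when):
    """Return True if `when` satisfies a simplified five-field cron expression.

    Fields: minute, hour, day-of-month, month, day-of-week (0 = Sunday).
    Only '*' and plain integers are supported in this sketch.
    """
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected five cron fields")
    actual = (
        when.minute,
        when.hour,
        when.day,
        when.month,
        when.isoweekday() % 7,  # map Python's Sunday=7 to cron's Sunday=0
    )
    return all(f == "*" or int(f) == a for f, a in zip(fields, actual))
```

Note that the third field is a day of the month, not a weekday: `0 2 2 * *` matches 02:00 on the 2nd regardless of which day of the week that is, and it matches at the same wall-clock instant for every server that shares the config.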

The fix wasn't in changing the schedule. It was in architecting the rebuild as a queue-driven, weighted, region-aware job.

We introduced a new component: a regional scheduler that ingests player traffic curves from the last 30 days. It calculates a rebuild window that is at least 6 hours after the regional median play peak. The scheduler then emits a job with a random 30-minute jitter inside that window. The job itself is chunked—no single rebuild exceeds 128 MB of heap footprint. Each chunk is rate-limited to 50 queries per second to avoid cascading latency spikes.
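The window calculation can be sketched in a few lines. This is an illustration under assumptions, not our production scheduler: `pick_rebuild_start` and its inputs are hypothetical names, the traffic curve is reduced to one peak hour per day, and taking a plain median of hours ignores wraparound at midnight.

```python
import random
import statistics
from datetime import date, datetime, timedelta

MIN_GAP_AFTER_PEAK = timedelta(hours=6)   # from the post
MAX_JITTER = timedelta(minutes=30)        # from the post


def pick_rebuild_start(daily_peak_hours, day, rng=random):
    """Pick a rebuild start time for one region.

    daily_peak_hours: local hour (0-23) of peak traffic for each of the
        last 30 days (assumed input shape).
    day: the calendar date the peak is measured on.
    rng: anything with a uniform(a, b) method, injectable for testing.
    """
    median_peak = statistics.median(daily_peak_hours)
    midnight = datetime(day.year, day.month, day.day)
    # The window opens six hours after the median peak; this may roll
    # over into the next calendar day.
    window_open = midnight + timedelta(hours=median_peak) + MIN_GAP_AFTER_PEAK
    # Spread regions out inside the first 30 minutes of the window.
    jitter = timedelta(seconds=rng.uniform(0, MAX_JITTER.total_seconds()))
    return window_open + jitter
```

For a region whose median peak is 20:00, the rebuild starts somewhere between 02:00 and 02:30 the next morning, and two regions with the same peak are unlikely to start at the same second.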

We also decoupled the index rebuild from the main game process. It now runs in a dedicated sidecar container with its own JVM and Metaspace limits. The sidecar signals completion via gRPC, and the main process only flips a pointer when the entire index is consistent. This adds 80 ms of cold-start latency on rebuild completion, but it guarantees no GC pauses bleed into player requests.
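The pointer flip is the part that keeps readers from ever seeing a half-built index. A minimal sketch of the idea, assuming the index can be modeled as a dictionary and using a made-up size check as the consistency test (our real service runs on the JVM and receives the completion signal over gRPC, none of which is shown here):

```python
import threading


class IndexHandle:
    """Holds the active index; readers never see a partially built one."""

    def __init__(self, initial):
        self._active = initial
        self._publish_lock = threading.Lock()  # serializes writers only

    def search(self, key):
        # Grab the reference once so a concurrent swap cannot change the
        # index underneath this query.
        index = self._active
        return index.get(key, [])

    def publish(self, rebuilt, expected_size):
        """Swap in a rebuilt index only if it passes the consistency check.

        expected_size is a hypothetical stand-in for whatever validation
        the rebuild job reports on completion.
        """
        if len(rebuilt) != expected_size:
            return False  # keep serving the old index
        with self._publish_lock:
            self._active = rebuilt  # the single reference flip
        return True
```

The rebuild does all its allocation off to the side, and the only work on the serving path is one reference assignment.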

What The Numbers Said After

After rolling the change, the 2:17 AM spikes vanished. The new scheduler's average rebuild completion time increased from 4 minutes to 6 minutes and 42 seconds, still within the 10-minute SLA but no longer overlapping with peak hours. The 99th percentile latency for treasure searches remained under 200 ms globally, even during rebuild windows.

Heap usage now fluctuates between 45% and 60% during rebuilds, instead of flirting with 90%. The pager stopped firing at 2:17 AM. That single Tuesday became a footnote in our postmortem—not a recurring war room meeting.

What I Would Do Differently

I would never have trusted the default Veltrix Operator config. The documentation calls it rebuild-schedule.json, but it behaves like a denial-of-service generator in disguise. We should have profiled the actual impact of the rebuild during load testing before ever pushing it to production.

I would also have instrumented the rebuild job with a synthetic load generator that mimics real player queries. Our pre-prod tests used curl scripts against a static index—useless. A rebuild that passes static tests can still melt under 50k concurrent queries because of JVM GC pressure, not CPU.
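A synthetic load generator does not have to be elaborate to beat curl against a static index. Here is a minimal sketch of the shape I have in mind, assuming the search call can be wrapped in a Python callable; `run_load` and `p99` are illustrative names, and a thread pool like this only approximates real player concurrency.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(search, queries, concurrency=32):
    """Fire queries at `search` concurrently and return latencies in ms."""

    def timed(query):
        start = time.perf_counter()
        search(query)
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, queries))


def p99(latencies_ms):
    """99th percentile latency; needs at least two samples."""
    return statistics.quantiles(latencies_ms, n=100)[98]
```

The useful run is the one where this is hammering the service while a rebuild is in progress, with the p99 compared against the same run on an idle index.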

Finally, I would have added a feature flag to disable the scheduler entirely during unexpected traffic spikes. Instead, we baked the scheduler into the base image. When our marketing team accidentally triggered a 300k-player event at 1:55 AM, we couldn't pause the rebuild, and the sidecar ran out of heap in 90 seconds. We had to do an emergency hot patch to change the image. That was avoidable.
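The kill switch I wish we had is small. A sketch, assuming the rebuild is already chunked and that the flag is exposed as a callable (in practice it might read a flag service or a mounted config file; that part is left out and `run_rebuild` is a hypothetical name):

```python
def run_rebuild(chunks, process_chunk, is_enabled):
    """Process rebuild chunks, re-checking the kill switch before each one.

    chunks: iterable of work units.
    process_chunk: callable that rebuilds one chunk.
    is_enabled: zero-argument callable returning the current flag value.
    Returns the number of chunks completed.
    """
    completed = 0
    for chunk in chunks:
        if not is_enabled():
            # Stop between chunks; the old index stays active because the
            # pointer is only flipped after a full, consistent rebuild.
            break
        process_chunk(chunk)
        completed += 1
    return completed
```

Because the flag is polled between chunks rather than read once at startup, flipping it pauses a rebuild that is already running, which is exactly what we could not do that night.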

The Veltrix Operator isn't wrong; it's just theatrical. It gives you a configuration file named after a schedule, but it doesn't give you the damping, backoff, or observability needed at scale. We spent three patch cycles learning that lesson. Next time, we'll instrument before we integrate.
