Steven Leggett

Originally published at youbrokeprod.com

How I Turned My Worst On-Call Nightmare Into a Browser Game

It's 3:14 AM. Your phone is screaming.

You silence the PagerDuty alert, sit up in the dark, and try to remember what you're on call for. Right. The analytics platform. You open your laptop, blink at the brightness, and read the alert title: "No space left on device."

Oh no.

You SSH in. The disk is 100% full. The database has crashed because it can't write WAL files. The API is returning 500s. You have three engineers waking up in your Slack channel asking "what's wrong?" and a support queue filling with "why is everything down?"

You start digging. df -h. The /var partition is gone. du -sh /var/*. The log directory is 78GB. du -sh /var/log/*. The analytics worker has 64 gigabytes of logs. You stare at the number for a second.

git log --oneline -n 10. There it is. Eight days ago, commit 9b4d7e3:

debug: temporarily enable debug logging for analytics issue

TODO comment in the body: "REMEMBER TO TURN THIS OFF AFTER DEBUGGING!!!"

Narrator: She did not remember.

At debug level, that analytics service was logging every single event - user ID, event type, timestamp, cache hit rates, query durations - thousands of entries per minute, 24 hours a day, for eight days straight. No log rotation. No disk monitoring. No alert at 85% capacity. Nothing to catch it before it hit 100%.

You fix it in 20 minutes. You write the postmortem at 4 AM. You add disk monitoring. You go back to sleep and you think: I have been on the other side of that TODO comment. Every engineer I know has been on the other side of that TODO comment.

And that was the moment I decided to build a game.

The Situation Room - a 3D war room with live architecture, cascading alerts, and a ticking clock


The Idea: A Flight Simulator for On-Call Engineers

Here's the thing about learning incident response: you learn it by breaking things in production, on a clock, while people are angry. There is no training simulator. You get one real scenario at a time, with real consequences, and you either figure it out or you don't.

Flight simulators exist so pilots can experience a landing gear failure, a stall at 30,000 feet, a hydraulics problem - without dying. Musicians practice scales before performing. Athletes train before competing. Surgeons use simulation before cutting into a real person.

But SREs? You get dropped into a live incident and told to figure it out. Maybe your company has a "Wheel of Misfortune" session occasionally, but realistically you just wait for the pager to go off and hope the scenario isn't too bad.

I wanted to build something different. A place where you could practice the muscle memory of incident response without the 3 AM wake-up call. Where you could learn what a connection pool exhaustion actually looks like in the logs before you encounter it at 2 AM with a VP watching over your shoulder.

The tagline came quickly: Wordle meets Hack The Box for DevOps.

  • Wordle because it needed to be quick, daily, shareable, and satisfying
  • Hack The Box because it needed to be genuinely technical and respect your intelligence
  • For DevOps because SREs, platform engineers, and backend developers are the audience

I wanted the game to feel like you were actually sitting at a terminal in a crisis. Not a quiz. Not a multiple choice test. A real investigation.


Building the Thing

Picking the Stack

I wanted to go from zero to playable as fast as possible, and I wanted the infrastructure costs to scale from "$0 at launch" to "reasonable at 10,000 users." That meant serverless-first.

Next.js 16 was a given. App Router, React Server Components, edge runtime, first-class Vercel support. I've used it for years and I can move fast in it.

Turso for the database. This was the interesting choice. Turso is SQLite at the edge - it gives you sub-10ms reads globally, scales to zero when nobody is playing, and the free tier covers a lot of users. I was already using Drizzle ORM and had a schema ready. The alternative was Supabase's Postgres, but for a read-heavy app (leaderboards, profile lookups) Turso made more sense and the economics were better at launch.

Supabase Auth for authentication. Free up to 50,000 monthly active users. GitHub OAuth and Google OAuth both built in. It handles email verification, password resets, and OAuth flows out of the box. When you're building solo, you don't want to think about auth.

Tailwind v4 for styling. The v4 release with its CSS-first configuration cleaned up a lot of the config overhead I used to fight with.

Zustand for client state. The game state during an active incident - your command history, your elapsed time, your discovered clues, your score - lives in a Zustand store. Simple, no boilerplate.

Anime.js v4 for animations. This was the fun one.

The Game Engine

Clicking a service node in the architecture map reveals alerts and investigation options

The core of the whole thing is IncidentEngine.ts - a TypeScript class that runs each scenario.

Every incident is defined as a TypeScript object: it has metadata (title, difficulty, time limit), a scenario description, an environment with simulated logs and metrics, a list of commands the player can run, diagnosis options, and a solution with hints.

The engine does a few things:

  1. Manages the game timer and emits events at warning thresholds (30% time remaining, 10% time remaining, time expired)
  2. Processes commands by pattern-matching against the incident's defined commands, then checking shared easter egg commands
  3. Tracks which clues have been discovered as the player runs investigative commands
  4. Handles diagnosis submission - correct gets you to the "fixing" phase, incorrect costs you points
  5. Validates fix commands with flexible matching so you don't have to type a 200-character shell command perfectly
  6. Calculates a final score based on time taken vs par time, whether you diagnosed correctly on the first try, and how many commands you used vs the optimal path
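To make the first item concrete, here is a minimal sketch of how the timer thresholds could be turned into events. It is not the actual IncidentEngine code: the event names and the `timerEventsBetween` helper are assumptions for illustration.

```typescript
// Hypothetical event names; the real engine's names are not shown in this post.
type TimerEvent = "warning_30" | "warning_10" | "expired";

// Each threshold is the fraction of time remaining at which its event fires.
const THRESHOLDS: ReadonlyArray<{ remaining: number; event: TimerEvent }> = [
  { remaining: 0.3, event: "warning_30" },
  { remaining: 0.1, event: "warning_10" },
  { remaining: 0, event: "expired" },
];

// Returns the events crossed when the clock advances from prevElapsed to
// nextElapsed (both in seconds). Each event fires only on the tick where its
// threshold is first reached, so a later tick never repeats it.
export function timerEventsBetween(
  prevElapsed: number,
  nextElapsed: number,
  limitSeconds: number
): TimerEvent[] {
  const remainingFraction = (elapsed: number): number =>
    Math.max(0, limitSeconds - elapsed) / limitSeconds;

  const before = remainingFraction(prevElapsed);
  const after = remainingFraction(nextElapsed);

  return THRESHOLDS
    .filter((t) => before > t.remaining && after <= t.remaining)
    .map((t) => t.event);
}
```

Comparing the previous and current tick, rather than checking only the current one, means a slow tick that jumps past two thresholds still reports both.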

The command system was one of the more interesting design challenges. Real terminal commands have arguments, flags, and variations. I couldn't enumerate every possible way someone might run df -h vs df -H vs df --human-readable. The solution was pattern matching with some tolerance - commands match if the input starts with the pattern, or if it matches a regex, or if it contains enough of the key terms.
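A simplified sketch of that matching order, under stated assumptions: the `regex` and `keyTerms` fields, the `matchCommand` name, and the 75% key-term tolerance are all hypothetical, since only `pattern` and `output` appear in the scenario excerpt.

```typescript
// Assumed shape: `regex` and `keyTerms` are illustrative fields for the
// second and third matching modes described above.
interface CommandDef {
  pattern: string;
  regex?: RegExp; // avoid the `g` flag here, since it makes test() stateful
  keyTerms?: string[];
  output: string;
}

// Collapse case and whitespace so "DF   -h" and "df -h" compare equal.
const normalize = (input: string): string =>
  input.trim().toLowerCase().replace(/\s+/g, " ");

export function matchCommand(
  input: string,
  commands: CommandDef[],
  easterEggs: CommandDef[] = [],
  minTermRatio = 0.75 // assumed tolerance: share of key terms that must appear
): CommandDef | undefined {
  const text = normalize(input);
  const tokens = new Set(text.split(" "));

  const matches = (cmd: CommandDef): boolean => {
    // 1. Prefix match against the normalized pattern.
    if (text.startsWith(normalize(cmd.pattern))) return true;
    // 2. Regex match for spelled-out variations.
    if (cmd.regex?.test(text)) return true;
    // 3. Key-term match: enough of the important tokens are present.
    if (cmd.keyTerms && cmd.keyTerms.length > 0) {
      const hits = cmd.keyTerms.filter((t) => tokens.has(t.toLowerCase())).length;
      return hits / cmd.keyTerms.length >= minTermRatio;
    }
    return false;
  };

  // Incident-specific commands win; shared easter eggs are the fallback.
  return commands.find(matches) ?? easterEggs.find(matches);
}
```

Lowercasing the input is what lets `df -H` land on the `df -h` pattern. That trades strict shell semantics for forgiveness, which suits a game more than a real terminal.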

The clue discovery system is what makes the game feel like an investigation rather than a quiz. Running df -h reveals the clue "var-full." Running du -sh /var/log/* reveals "analytics-logs." Each clue unlocked drives you deeper into the diagnosis. The engine emits a clue_discovered event that triggers a satisfying UI animation.
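As a rough illustration of that flow (not the real engine code), clue tracking can be a set plus a listener list. The `ClueTracker` class and its method names are hypothetical. Only the `revealsClue` field and the `clue_discovered` event name come from the game itself.

```typescript
type ClueListener = (clueId: string) => void;

// Minimal sketch of clue tracking with a single event type.
export class ClueTracker {
  private discovered = new Set<string>();
  private listeners: ClueListener[] = [];

  // Register a handler for clue_discovered (for example, to start a UI animation).
  onClueDiscovered(listener: ClueListener): void {
    this.listeners.push(listener);
  }

  // Call after a command runs, passing the matched command's `revealsClue`.
  // Returns true only the first time a clue is revealed, so re-running
  // `df -h` does not fire the event again.
  reveal(revealsClue: string | undefined): boolean {
    if (!revealsClue || this.discovered.has(revealsClue)) return false;
    this.discovered.add(revealsClue);
    this.listeners.forEach((listener) => listener(revealsClue));
    return true;
  }

  // Clues in the order they were discovered.
  get clues(): string[] {
    return Array.from(this.discovered);
  }
}
```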

Here's the simplified structure of what defines each scenario:

export const diskFullIncident: IncidentDefinition = {
  id: "disk-full-001",
  title: "No Space Left on Device",
  difficulty: "beginner",
  parTimeSeconds: 240,

  scenario: {
    service: "datavault-platform",
    context: "It's 3 AM and everything is broken...",
    symptoms: [
      "Database refusing writes with 'No space left on device'",
      "Application crashes when trying to log errors",
    ],
  },

  environment: {
    logs: [ /* realistic log entries */ ],
    metrics: [ /* time-series metrics data */ ],
    commands: [
      {
        pattern: "df -h",
        isOnPath: true,
        revealsClue: "var-full",
        output: `Filesystem      Size  Used Avail Use%
/dev/sda2       100G   97G     0  100% /var
🚨 CRITICAL: /var is 100% full!`,
      },
      // ...
    ],
  },

  solution: {
    rootCause: "Debug logging left enabled in production for 8 days...",
    diagnosticPath: ["df -h", "du -sh /var/log/*", "..."],
    fixCommand: "systemctl stop analytics-worker && sed -i ...",
    hints: [ /* progressive hints */ ],
  },

  diagnosisOptions: [
    { id: "debug-logs", text: "Debug logging enabled without log rotation", isCorrect: true },
    { id: "db-corruption", text: "Database corruption from hardware failure", isCorrect: false },
    // ...
  ],
};

The real disk-full incident file is about 1,000 lines of TypeScript, with 30+ commands, easter eggs, realistic log entries, and even a git log that reveals the exact commit where someone enabled debug logging "temporarily."

The Animation Layer

Anime.js v4 brought a significant API change - scoped animations tied to a React ref. That means animations clean themselves up automatically when components unmount. No leaked intervals, no DOM elements getting animated after the component is gone.

I built a useAnime hook around the v4 createScope API:

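import { useEffect, useRef, type DependencyList, type RefObject } from "react";
import { createScope } from "animejs";

// The return type is not shown in the original snippet; this shape is assumed from usage.
interface UseAnimeReturn {
  root: RefObject<HTMLDivElement | null>;
}

export function useAnime(
  setupFn: () => void | (() => void),
  deps: DependencyList = []
): UseAnimeReturn {
  // Attach `root` to the container element whose descendants get animated.
  const root = useRef<HTMLDivElement>(null);

  useEffect(() => {
    if (!root.current) return;
    // Animations created inside the scope are limited to descendants of root.
    const scope = createScope({ root: root.current });
    scope.add(setupFn);
    // Reverting on unmount (or when deps change) cleans up the scope's animations.
    return () => { scope.revert(); };
  }, deps);

  return { root };
}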

And a library of named animation presets for the common patterns:

  • timerPulse - pulses the countdown timer every second, shifts from green to yellow to red
  • screenShake - translates the body element when something goes wrong
  • glitch - skews and flickers elements for error states (three parallel tracks on a looping timeline)
  • scoreIncrement - spring physics pop when points are awarded
  • staggerIn - staggered list entry animations for leaderboard entries
  • pulseAlert - repeating box-shadow pulse for critical alerts

The most satisfying one to build was glitch:

glitch: (element: string | Element) => {
  const tl = createTimeline({ loop: 3 });
  tl.add(element, {
    skewX: [0, 20, -20, 0],
    duration: 100,
    ease: "steps(1)",
  })
  .add(element, {
    x: [0, -5, 5, -3, 3, 0],
    duration: 150,
    ease: "steps(1)",
  }, 0)
  .add(element, {
    opacity: [1, 0, 1, 0, 1],
    duration: 200,
    ease: "steps(1)",
  }, 0);
  return tl;
},

Three animations running in parallel at offset 0: skew, horizontal jitter, opacity flicker. The steps(1) easing makes each frame snap rather than interpolate, which is what gives it that CRT-malfunction feel. This plays when you run a command that makes things worse, or when you misdiagnose the incident.

The Easter Egg Problem

One of the best decisions I made was leaning into easter egg commands hard. Every scenario has 10-15 of them baked in.

blame sarah in the disk-full scenario gives you a mock blame analysis with mitigating factors ("she left a clear TODO comment," "it was a Sunday evening"). stackoverflow returns a fake search result where the top answer is "rm -rf /var/log/*" with 1.2K upvotes and a comment chain of people saying it made the incident worse. sudo reboot gets blocked with an explanation of why rebooting won't free disk space.

The slack command shows a mock Slack channel with "#incidents: 89 unread" and a thread where the culprit realizes what happened, someone asks if they should reboot the server, and everyone replies "NO!" in chorus.

These were genuinely fun to write. And they do something useful beyond being funny - they teach you what not to do. Every "wrong" move has an explanation attached.

There's also a history command that shows the command history of someone panicking at 3 AM:

847  why is everything broken
848  df -h
849  oh no
850  OH NO
851  du -sh /var/*
852  SIXTY FOUR GIGABYTES OF LOGS?!
853  who enabled debug logging

That landed well in playtesting.

The Database Design

Eleven tables in Turso, all defined with Drizzle ORM. The key ones:

  • users - XP, rank, streak, last activity
  • incidents - scenario definitions (or at least their metadata; the actual scenario data lives in the TypeScript files for now)
  • attempts - every game played, with time, score, commands used, investigation path as JSON
  • badges and user_badges - achievement system
  • daily_challenges - which incident is today's challenge, plus leaderboard data

The scoring formula rewards speed (time vs par time gets you up to 2x), accuracy (first-try correct diagnosis is +50%), and efficiency (fewer commands than the optimal path is +25%). Streak multipliers will stack on top of that.
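One way those three bonuses could combine is as multipliers on a base score. This is a sketch under assumptions, not the production formula: the base of 1000 points, the lower clamp on the speed multiplier, multiplying the bonuses rather than adding them, and treating "at or under the optimal command count" as efficient are all guesses.

```typescript
interface AttemptResult {
  timeTakenSeconds: number;
  parTimeSeconds: number;
  firstTryCorrect: boolean;
  commandsUsed: number;
  optimalCommands: number;
}

const BASE_POINTS = 1000; // assumed; no base score is stated in this post

export function calculateScore(a: AttemptResult): number {
  // Speed: par time over actual time, capped at 2x. The floor of 1x is an assumption.
  const speed = Math.min(2, Math.max(1, a.parTimeSeconds / a.timeTakenSeconds));
  // Accuracy: +50% for a correct diagnosis on the first try.
  const accuracy = a.firstTryCorrect ? 1.5 : 1;
  // Efficiency: +25% when the command count does not exceed the optimal path.
  const efficiency = a.commandsUsed <= a.optimalCommands ? 1.25 : 1;
  return Math.round(BASE_POINTS * speed * accuracy * efficiency);
}
```

With these assumptions, finishing the disk-full scenario in 120 seconds against its 240-second par, with a first-try diagnosis and an efficient path, hits the cap: 1000 x 2 x 1.5 x 1.25 = 3750.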

Leaderboard queries are simple SQL against Turso. For v1, polling via TanStack Query every 30 seconds is fast enough. Redis sorted sets are the plan when leaderboard queries need to get faster.


What I Learned Building It

The balance between educational and fun is harder than it sounds.

Every scenario has to be realistic enough that a working engineer recognizes the pattern from their own experience. But it also has to be solvable in a few minutes by someone who hasn't been doing this for ten years. Beginner scenarios need to feel approachable. Advanced ones need to feel like a genuine challenge.

I spent more time on scenario design than on any other part of the project. The disk-full scenario alone went through several iterations before it felt right. The key insight was: start with real postmortems. The disk-full debug-logging story is a thing that actually happens. The DB connection pool exhaustion that has a leak in the error handling path - real. The K8s crash loop from a process trying to bind to port 80 as a non-root user - real. The DNS failure that's half because of propagation and half because of TTL misconfiguration - real.

Scenarios based on real patterns teach real skills. Made-up scenarios feel like trivia.

Scoped animations in React are worth the setup cost.

Anime.js v4's scoped animation system was unfamiliar at first. The v3 API was simpler. But the v4 approach where you attach animations to a DOM ref and call scope.revert() on cleanup eliminated an entire class of bugs I was hitting - animations targeting elements that no longer existed in the DOM, intervals that kept firing after a game ended. The useAnime hook is 20 lines of code and everything downstream just works.

TypeScript for game data is underrated.

Defining incidents as typed TypeScript objects gives you autocomplete on every field, type errors if you forget a required property, and no JSON parsing issues. The alternative would be a database of scenario definitions or a YAML/JSON format. Those have their place, but for this project where the scenario logic is complex and needs to be right, having the TypeScript compiler check your work is worth it.
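A trimmed sketch of what such a type might look like. The field names come from the scenario excerpt earlier in this post, but which fields are optional is an assumption, and the `scenario`, `solution`, logs, and metrics fields are left out for brevity.

```typescript
// Difficulty tiers match the three groups of scenarios listed in this post.
type Difficulty = "beginner" | "intermediate" | "advanced";

interface IncidentCommand {
  pattern: string;
  output: string;
  isOnPath?: boolean;   // assumed optional
  revealsClue?: string; // assumed optional
}

interface DiagnosisOption {
  id: string;
  text: string;
  isCorrect: boolean;
}

interface IncidentDefinition {
  id: string;
  title: string;
  difficulty: Difficulty;
  parTimeSeconds: number;
  environment: { commands: IncidentCommand[] };
  diagnosisOptions: DiagnosisOption[];
}

const example: IncidentDefinition = {
  id: "disk-full-001",
  title: "No Space Left on Device",
  difficulty: "beginner", // a typo such as "begginer" fails to compile
  parTimeSeconds: 240,    // deleting this line fails to compile
  environment: {
    commands: [{ pattern: "df -h", output: "(output omitted)", revealsClue: "var-full" }],
  },
  diagnosisOptions: [
    { id: "debug-logs", text: "Debug logging enabled without log rotation", isCorrect: true },
    { id: "db-corruption", text: "Database corruption from hardware failure", isCorrect: false },
  ],
};
```

The compiler catches shape errors only. Rules such as "exactly one diagnosis option is correct" would still need a runtime check or a test.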

Write the easter eggs.

I almost cut them because they felt like scope creep. I'm glad I didn't. The game without easter eggs is a quiz. With them, it's something you want to share. The chatgpt easter egg - where ChatGPT confidently gives you unhelpful advice and then asks if maybe you're using MongoDB - has gotten more reactions in playtesting than any feature I actually planned.

SQLite at the edge is genuinely good now.

Turso was a gamble as the database choice. SQLite has a reputation for being a toy. But Turso's implementation is production-ready. The Drizzle integration is clean. The free tier is generous. For a read-heavy app with global users, edge replication means reads that are actually fast regardless of where you are. I haven't had a single database-related issue since launch.


The Situation Room

The full war room mid-incident - architecture map, live metrics, alert feed, and terminal all running simultaneously

After launching the terminal version and getting 106 signups on day one, I realized something was missing. The terminal captures the commands, but not the cognitive load. Real incidents feel like ten things are wrong at once and you have to decide what to look at first.

So I built the Situation Room - a 3D interactive war room using React Three Fiber. Live architecture visualization with traffic flows and service dependencies rendered in real time. Cascading alerts. Streaming logs. A countdown clock as revenue drops by the second. Countdown music that kicks in at 45 seconds and genuinely stresses me out even though I wrote the thing.

The first Situation Room scenario is a Cyber Monday DNS failure. Your payment provider becomes unreachable at peak traffic. You have the full service map in front of you - every node clickable, every service inspectable. There are four difficulty modes, from Guided (5 services, extra hints) to Hardcore, which adds permadeath and a 30-service enterprise architecture full of red herrings.

Nobody has earned the "Nerves of Steel" badge yet. That's the one for beating the Situation Room on Hardcore.

The Result

10 scenarios are live at youbrokeprod.com, free to play, no credit card required.

Beginner:

  • Disk full from debug logs left on in production (the 3 AM story above)
  • Expired SSL certificate (everyone's had this one)
  • DB connection pool exhaustion from a connection leak in the error path

Intermediate:

  • K8s CrashLoopBackOff from a port binding permission error
  • Mobile API breaking change that someone forgot to version
  • DNS failure in the Situation Room - the new 3D war room mode

Advanced:

  • Memory leak that only reproduces under production traffic patterns
  • Redis thundering herd when all connections in the pool expire simultaneously
  • Lambda cold start cascade turning a routine deploy into a 10-minute outage
  • The terraform destroy scenario based on the real DataTalksClub incident (685K+ views, 9% win rate)

Each scenario has a scoring system (time, accuracy, efficiency), progressive hints, easter eggs, and a result card you can share. The Situation Room adds live architecture visualization, post-game postmortems with real-world incidents (Dyn 2016, Fastly 2021, Cloudflare 2020), and difficulty modes that scale the architecture complexity.

The game is free. Pro unlocks harder difficulty modes and deep-dive postmortems with actual monitoring configs and prevention playbooks you can take back to work. For now I just want engineers to play it, get better at incident response, and maybe think about setting up disk space monitoring before they need it at 3 AM.


If you've ever sat at a terminal at 2 AM and typed "why is everything broken" into a command prompt as if the computer would answer you directly - this game is for you.

Traffic flowing through the architecture before the incident hits

Play it at youbrokeprod.com. Try the Situation Room if you want the full 3D war room experience, or start with the disk-full scenario if you want to ease in. Either way, the countdown is ticking.

And please, for the love of everything - set up log rotation.


Built with Next.js 16, React Three Fiber, Tailwind v4, Anime.js v4, Zustand, TanStack Query, Turso, Supabase Auth, and Drizzle ORM. Deployed on Vercel.