Steven Leggett

Posted on • Originally published at youbrokeprod.com

Debug Real Production Incidents in Your Browser Before They Happen to You

The Situation Room - a 3D war room with live architecture, cascading alerts, and a ticking clock


The Idea: A Flight Simulator for On-Call Engineers

Here's the thing about learning incident response: you learn it by breaking things in production, on a clock, while people are angry. There is no training simulator. You get one real scenario at a time, with real consequences, and you either figure it out or you don't.

Flight simulators exist so pilots can experience a landing gear failure, a stall at 30,000 feet, a hydraulics problem - without dying. Musicians practice scales before performing. Athletes train before competing. Surgeons use simulation before cutting into a real person.

But SREs? You get dropped into a live incident and told to figure it out. Maybe your company has a "Wheel of Misfortune" session occasionally, but realistically you just wait for the pager to go off and hope the scenario isn't too bad.

I wanted to build something different. A place where you could practice the muscle memory of incident response without the 3 AM wake-up call. Where you could learn what a connection pool exhaustion actually looks like in the logs before you encounter it at 2 AM with a VP watching over your shoulder.

The tagline came quickly: Wordle meets Hack The Box for DevOps.

  • Wordle because it needed to be quick, daily, shareable, and satisfying
  • Hack The Box because it needed to be genuinely technical and respect your intelligence
  • For DevOps because SREs, platform engineers, and backend developers are the audience

I wanted the game to feel like you were actually sitting at a terminal in a crisis. Not a quiz. Not a multiple choice test. A real investigation.


Building the Thing

Picking the Stack

I wanted to go from zero to playable as fast as possible, and I wanted the infrastructure costs to scale from "$0 at launch" to "reasonable at 10,000 users." That meant serverless-first.

Next.js 16 was a given. App Router, React Server Components, edge runtime, first-class Vercel support. I've used it for years and I can move fast in it.

Turso for the database. This was the interesting choice. Turso is SQLite at the edge - it gives you sub-10ms reads globally, scales to zero when nobody is playing, and the free tier covers a lot of users. I was already using Drizzle ORM and had a schema ready. The alternative was Supabase's Postgres, but for a read-heavy app (leaderboards, profile lookups) Turso made more sense and the economics were better at launch.

Supabase Auth for authentication. Free up to 50,000 monthly active users. GitHub OAuth and Google OAuth both built in. It handles email verification, password resets, and OAuth flows out of the box. When you're building solo, you don't want to think about auth.

Tailwind v4 for styling. The v4 release with its CSS-first configuration cleaned up a lot of the config overhead I used to fight with.

Zustand for client state. The game state during an active incident - your command history, your elapsed time, your discovered clues, your score - lives in a Zustand store. Simple, no boilerplate.

Anime.js v4 for animations. This was the fun one.

The Game Engine

Clicking a service node in the architecture map reveals alerts and investigation options

The core of the whole thing is IncidentEngine.ts - a TypeScript class that runs each scenario.

Every incident is defined as a TypeScript object: it has metadata (title, difficulty, time limit), a scenario description, an environment with simulated logs and metrics, a list of commands the player can run, diagnosis options, and a solution with hints.
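
As a rough illustration of what such a definition could look like, here is a minimal sketch. The type names, field names, sample log line, and fix command are all assumptions made for this example, not the actual YouBrokeProd schema.

```typescript
// Hypothetical incident shape. Every field name below is an illustrative
// assumption, not the real YouBrokeProd schema.
type Difficulty = "beginner" | "intermediate" | "advanced";

interface IncidentCommand {
  pattern: string;       // what the player types
  output: string;        // simulated terminal output (made up here)
  revealsClue?: string;  // clue id unlocked by running the command
}

interface DiagnosisOption {
  id: string;
  label: string;
  correct: boolean;
}

interface Incident {
  meta: { title: string; difficulty: Difficulty; timeLimitSec: number };
  scenario: string;
  environment: { logs: string[]; metrics: Record<string, number> };
  commands: IncidentCommand[];
  diagnosisOptions: DiagnosisOption[];
  solution: { fixCommand: string; hints: string[] };
}

// The compiler rejects this object if a required field is missing
// or if difficulty is not one of the three allowed strings.
const diskFull: Incident = {
  meta: { title: "Disk full", difficulty: "beginner", timeLimitSec: 300 },
  scenario: "The API is returning 500s and writes are failing.",
  environment: {
    logs: ["write /var/log/app.log: no space left on device"],
    metrics: { diskUsedPercent: 100 },
  },
  commands: [
    { pattern: "df -h", output: "/var is at 100%", revealsClue: "var-full" },
    {
      pattern: "du -sh /var/log/*",
      output: "analytics logs dominate /var/log",
      revealsClue: "analytics-logs",
    },
  ],
  diagnosisOptions: [
    { id: "disk", label: "Debug logging filled the disk", correct: true },
    { id: "db", label: "Database is down", correct: false },
  ],
  solution: {
    fixCommand: "truncate -s 0 /var/log/analytics/debug.log", // hypothetical path
    hints: ["Check disk usage first."],
  },
};
```

Because the object is annotated with the Incident type, leaving out solution or misspelling a difficulty fails at compile time rather than mid-game.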

The engine does a few things:

  1. Manages the game timer and emits events at warning thresholds (30% time remaining, 10% time remaining, time expired)
  2. Processes commands by pattern-matching against the incident's defined commands, then checking shared easter egg commands
  3. Tracks which clues have been discovered as the player runs investigative commands
  4. Handles diagnosis submission - correct gets you to the "fixing" phase, incorrect costs you points
  5. Validates fix commands with flexible matching so you don't have to type a 200-character shell command perfectly
  6. Calculates a final score based on time taken vs par time, whether you diagnosed correctly on the first try, and how many commands you used vs the optimal path
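
Two of the steps above, the timer thresholds and the final score, can be sketched as pure functions. This is only an illustration under stated assumptions: the event names, the 50/30/20 weighting, and the 10-point penalty per wrong diagnosis are invented for the example and are not the engine's real numbers.

```typescript
type TimerEvent = "warning_30" | "warning_10" | "expired";

// Returns the events that fire when the remaining time drops from
// prevRemaining to nextRemaining (both in seconds).
function timerEvents(
  limitSec: number,
  prevRemaining: number,
  nextRemaining: number
): TimerEvent[] {
  const events: TimerEvent[] = [];
  // A threshold fires once: when we move from above it to at-or-below it.
  const crossed = (threshold: number): boolean =>
    prevRemaining > threshold && nextRemaining <= threshold;
  if (crossed((limitSec * 30) / 100)) events.push("warning_30");
  if (crossed((limitSec * 10) / 100)) events.push("warning_10");
  if (crossed(0)) events.push("expired");
  return events;
}

interface RunStats {
  elapsedSec: number;
  parSec: number;
  wrongDiagnoses: number;
  commandsUsed: number;
  optimalCommands: number;
}

// Assumed weighting: 50 points for time, 30 for accuracy, 20 for efficiency.
function scoreRun(s: RunStats): number {
  const time = 50 * Math.min(1, s.parSec / Math.max(s.elapsedSec, 1));
  const accuracy = Math.max(0, 30 - 10 * s.wrongDiagnoses);
  const efficiency =
    20 * Math.min(1, s.optimalCommands / Math.max(s.commandsUsed, 1));
  return Math.round(time + accuracy + efficiency);
}
```

Keeping both as pure functions means the thresholds and weights can be unit-tested without starting a real timer.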

The command system was one of the more interesting design challenges. Real terminal commands have arguments, flags, and variations. I couldn't enumerate every possible way someone might run df -h vs df -H vs df --human-readable. The solution was pattern matching with some tolerance - commands match if the input starts with the pattern, or if it matches a regex, or if it contains enough of the key terms.
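
A minimal sketch of that three-way matching, assuming a hypothetical CommandPattern shape (the prefix, regex, keyTerms, and minTerms fields are names invented for this example):

```typescript
// Hypothetical pattern shape: a command matches if ANY configured rule hits.
interface CommandPattern {
  prefix?: string;      // input starts with this string
  regex?: RegExp;       // or input matches this expression
  keyTerms?: string[];  // or input contains enough of these tokens
  minTerms?: number;    // how many key terms count as "enough"
}

// Trim and collapse repeated whitespace so "df    -h " behaves like "df -h".
function normalize(input: string): string {
  return input.trim().replace(/\s+/g, " ");
}

function matchesCommand(input: string, p: CommandPattern): boolean {
  const cmd = normalize(input);
  if (p.prefix !== undefined && cmd.startsWith(p.prefix)) return true;
  if (p.regex !== undefined && p.regex.test(cmd)) return true;
  if (p.keyTerms !== undefined && p.keyTerms.length > 0) {
    const tokens = new Set(cmd.split(" "));
    const hits = p.keyTerms.filter((t) => tokens.has(t)).length;
    // Default to requiring every key term when minTerms is not set.
    return hits >= (p.minTerms ?? p.keyTerms.length);
  }
  return false;
}

// Accepts df -h via the prefix, and df -H or df --human-readable via the regex.
const diskUsage: CommandPattern = {
  prefix: "df -h",
  regex: /^df\s+(-H|--human-readable)\b/,
};

// Accepts any input that contains at least two of the three key terms.
const serviceLogs: CommandPattern = {
  keyTerms: ["kubectl", "logs", "payments"],
  minTerms: 2,
};
```

The regex deliberately has no g flag, so repeated calls to test() carry no state between inputs.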

The clue discovery system is what makes the game feel like an investigation rather than a quiz. Running df -h reveals the clue "var-full." Running du -sh /var/log/* reveals "analytics-logs." Each clue unlocked drives you deeper into the diagnosis. The engine emits a clue_discovered event that triggers a satisfying UI animation.
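
The discovery flow can be sketched as a small tracker that fires a callback the first time a clue is unlocked. The ClueTracker class and its method names are hypothetical; only the commands and clue ids come from the description above.

```typescript
type ClueListener = (clueId: string) => void;

// Hypothetical tracker: maps commands to clue ids and notifies listeners
// the first time each clue is discovered.
class ClueTracker {
  private clueByCommand: Map<string, string>;
  private discovered = new Set<string>();
  private listeners: ClueListener[] = [];

  constructor(clueByCommand: Map<string, string>) {
    this.clueByCommand = clueByCommand;
  }

  // Stands in for subscribing to a clue_discovered event.
  onClueDiscovered(listener: ClueListener): void {
    this.listeners.push(listener);
  }

  run(command: string): void {
    const clue = this.clueByCommand.get(command.trim());
    // Ignore commands without a clue and clues that were already found.
    if (clue === undefined || this.discovered.has(clue)) return;
    this.discovered.add(clue);
    for (const listener of this.listeners) listener(clue);
  }

  get clues(): string[] {
    return Array.from(this.discovered);
  }
}

const tracker = new ClueTracker(
  new Map([
    ["df -h", "var-full"],
    ["du -sh /var/log/*", "analytics-logs"],
  ])
);

const seen: string[] = [];
tracker.onClueDiscovered((clueId) => seen.push(clueId)); // the UI would animate here

tracker.run("df -h");
tracker.run("df -h"); // repeat: no second event
tracker.run("du -sh /var/log/*");
```

Deduplicating in the tracker keeps the UI animation from replaying when a player reruns the same command.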


What I Learned Building It

The balance between educational and fun is harder than it sounds.

Every scenario has to be realistic enough that a working engineer recognizes the pattern from their own experience. But it also has to be solvable in a few minutes by someone who hasn't been doing this for ten years. Beginner scenarios need to feel approachable. Advanced ones need to feel like a genuine challenge.

I spent more time on scenario design than on any other part of the project. The disk-full scenario alone went through several iterations before it felt right. The key insight was: start with real postmortems. The disk-full debug-logging story is a thing that actually happens. The DB connection pool exhaustion that has a leak in the error handling path - real. The K8s crash loop from a process trying to bind to port 80 as a non-root user - real. The DNS failure that's half because of propagation and half because of TTL misconfiguration - real.

Scenarios based on real patterns teach real skills. Made-up scenarios feel like trivia.

Scoped animations in React are worth the setup cost.

Anime.js v4's scoped animation system was unfamiliar at first. The v3 API was simpler. But the v4 approach where you attach animations to a DOM ref and call scope.revert() on cleanup eliminated an entire class of bugs I was hitting - animations targeting elements that no longer existed in the DOM, intervals that kept firing after a game ended. The useAnime hook is 20 lines of code and everything downstream just works.

TypeScript for game data is underrated.

Defining incidents as typed TypeScript objects gives you autocomplete on every field, type errors if you forget a required property, and no JSON parsing issues. The alternative would be a database of scenario definitions or a YAML/JSON format. Those have their place, but for this project where the scenario logic is complex and needs to be right, having the TypeScript compiler check your work is worth it.

Write the easter eggs.

I almost cut them because they felt like scope creep. I'm glad I didn't. The game without easter eggs is a quiz. With them, it's something you want to share. The chatgpt easter egg - where ChatGPT confidently gives you unhelpful advice and then asks if maybe you're using MongoDB - has gotten more reactions in playtesting than any feature I actually planned.

SQLite at the edge is genuinely good now.

Turso was a gamble as the database choice. SQLite has a reputation for being a toy. But Turso's implementation is production-ready. The Drizzle integration is clean. The free tier is generous. For a read-heavy app with global users, edge replication means reads that are actually fast regardless of where you are. I haven't had a single database-related issue since launch.


The Situation Room

The full war room mid-incident - architecture map, live metrics, alert feed, and terminal all running simultaneously

The Situation Room is a 3D interactive war room. Live architecture visualization with traffic flows and service dependencies rendered in real time. Cascading alerts. Streaming logs. A countdown clock as revenue drops by the second. Countdown music that kicks in at 45 seconds and genuinely stresses me out even though I wrote the thing.

The first Situation Room scenario is a Cyber Monday DNS failure. Your payment provider becomes unreachable at peak traffic. You have the full service map in front of you - every node clickable, every service inspectable. There are four difficulty modes, from Guided (5 services, extra hints) to Hardcore, which adds permadeath and a 30-service enterprise architecture full of red herrings.

Nobody has earned the "Nerves of Steel" badge yet. That's the one for beating the Situation Room on Hardcore.

The Result

10+ scenarios are live at youbrokeprod.com, free to play, no credit card required.

Beginner:

  • Disk full from debug logs left on in production (the classic 3 AM story)
  • Expired SSL certificate (everyone's had this one)
  • DB connection pool exhaustion from a connection leak in the error path

Intermediate:

  • K8s CrashLoopBackOff from a port binding permission error
  • Mobile API breaking change that someone forgot to version
  • DNS failure in the Situation Room - the new 3D war room mode

Advanced:

  • Memory leak that only reproduces under production traffic patterns
  • Redis thundering herd when all connections in the pool expire simultaneously
  • Lambda cold start cascade turning a routine deploy into a 10-minute outage
  • The terraform destroy scenario based on the real DataTalksClub incident (685K+ views, 9% win rate)

Each scenario has a scoring system (time, accuracy, efficiency), progressive hints, easter eggs, and a result card you can share. The Situation Room adds live architecture visualization, post-game postmortems with real-world incidents (Dyn 2016, Fastly 2021, Cloudflare 2020), and difficulty modes that scale the architecture complexity.

The game is free. Pro unlocks harder difficulty modes and deep-dive postmortems with actual monitoring configs and prevention playbooks you can take back to work. For now I just want engineers to play it, get better at incident response, and maybe think about setting up disk space monitoring before they need it at 3 AM.


If you've ever sat at a terminal at 2 AM and typed "why is everything broken" into a command prompt as if the computer would answer you directly - this game is for you.

Traffic flowing through the architecture before the incident hits

Play it at youbrokeprod.com. Try the Situation Room if you want the full 3D war room experience, or start with the disk-full scenario if you want to ease in. Either way, the countdown is ticking.

