Om Patel

Errloom: A Platform to Practice Debugging Real Production Outages (and It's Open Source)

TL;DR: I built Errloom, a browser-based playground with 15 (soon 100+) scenarios based on real outages at companies like Reddit, GitLab, and Discord. Practice debugging without breaking prod. GitHub repo: github.com/OSP06/errloom.

The Problem I Was Trying to Solve
I'm wrapping up my MS in CS, and while I can build features and write clean code, there's this whole domain of knowledge I felt I was missing: production debugging.
You know what I mean - the stuff that happens at 2 AM:

"Why is the database suddenly slow?"
"Where are these 504s coming from?"
"Did someone just leak credentials?"

Here's the thing: you can't really practice this stuff.
❌ Can't touch production as a junior dev
❌ Staging never has the same chaos
❌ Tutorials teach you how to build, not how to fix
❌ Incident reports are interesting but passive reading
So I thought: what if you could practice debugging in a safe sandbox, using actual incidents from real companies?

What I Built
Errloom is a hands-on debugging playground. Each scenario drops you into a production incident with:

Real logs (sanitized, obviously)
System metrics and dashboards
Incident timeline
The actual tools engineers use

You investigate, form hypotheses, and work toward the root cause. Then compare your approach to what actually happened.

Current Scenarios (15 total):
🗄️ Database Disasters

Postgres deadlocks bringing down checkout
MySQL connection pool exhaustion
Replication lag causing stale reads

🌐 Infrastructure Chaos

CDN cache poisoning
DNS propagation gone wrong
Load balancer session affinity breaking

⚡ Application Nightmares

Memory leaks in Node.js services
Race conditions in concurrent requests
N+1 queries killing response times

🔐 Security Incidents

Accidentally committed AWS keys
Misconfigured S3 bucket permissions
Session token leakage

The Tech Stack
Since this needed to be 100% browser-based with no backend setup:
Frontend: React + TypeScript + Vite
UI: Tailwind CSS + shadcn/ui
Terminal: Custom component with syntax highlighting
State: Zustand for scenario progress
Deployment: Vercel
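
For a sense of how small the state layer is, here's a minimal sketch of what the Zustand progress store could look like; the store and field names (`useProgressStore`, `completedIds`, and so on) are illustrative, not the exact ones in the repo:

```typescript
import { create } from 'zustand';
import { persist } from 'zustand/middleware';

// Illustrative sketch of a scenario-progress store; names are hypothetical.
interface ProgressState {
  completedIds: string[];            // scenarios the user has already solved
  currentScenarioId: string | null;  // scenario currently open
  markCompleted: (id: string) => void;
  openScenario: (id: string) => void;
}

export const useProgressStore = create<ProgressState>()(
  persist(
    (set) => ({
      completedIds: [],
      currentScenarioId: null,
      markCompleted: (id) =>
        set((state) => ({
          completedIds: state.completedIds.includes(id)
            ? state.completedIds
            : [...state.completedIds, id],
        })),
      openScenario: (id) => set({ currentScenarioId: id }),
    }),
    { name: 'errloom-progress' } // persisted to localStorage, so no backend needed
  )
);
```

Persisting to localStorage is what lets progress survive a refresh without any server.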

Each scenario is a JSON config that defines:
Initial logs
Available commands
Validation logic
Hints and explanations
Links to real postmortems

```typescript
interface Scenario {
  id: string;
  title: string;
  company: string;
  difficulty: 'novice' | 'intermediate' | 'expert';
  logs: LogEntry[];
  commands: Command[];
  rootCause: string;
  learningPoints: string[];
}
```
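
To make that concrete, here's a hypothetical config for the Postgres deadlock scenario. The values and the `LogEntry`/`Command` shapes below are illustrative, not copied from the repo:

```typescript
// Hypothetical helper types and scenario values, shown only to illustrate
// the shape of a config; the real scenario files will differ.
interface LogEntry {
  timestamp: string;
  level: 'INFO' | 'WARN' | 'ERROR';
  service: string;
  message: string;
}

interface Command {
  input: string;  // what the user types in the terminal
  output: string; // canned response the scenario replays
}

const postgresDeadlock: Scenario = {
  id: 'postgres-deadlock-checkout',
  title: 'Postgres deadlocks bringing down checkout',
  company: 'Example E-commerce Co.',
  difficulty: 'intermediate',
  logs: [
    {
      timestamp: '2024-03-01T02:13:45Z',
      level: 'ERROR',
      service: 'checkout-api',
      message: 'deadlock detected: process 4521 waits for ShareLock on transaction 99817',
    },
  ],
  commands: [
    {
      input: "SELECT * FROM pg_stat_activity WHERE state = 'active';",
      output: '42 rows | most sessions waiting on row-level locks on the orders table',
    },
  ],
  rootCause:
    'Two code paths updated orders and inventory rows in opposite order, producing cyclic lock waits under load.',
  learningPoints: [
    'Acquire row locks in a consistent order across transactions.',
    'Set a lock_timeout so stuck requests fail fast instead of piling up.',
  ],
};
```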

How It Works

1. Pick a scenario - Choose your difficulty level
2. Investigate - Browse logs, check metrics, run commands
3. Form a hypothesis - What do you think went wrong?
4. Find the root cause - Work through the incident
5. Learn - Compare with the actual RCA

The UI intentionally feels terminal-inspired because that's where debugging happens:
```bash
$ grep "ERROR" app.log | tail -50
$ SELECT * FROM pg_stat_activity WHERE state = 'active';
$ curl -I https://api.example.com/health
```
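
Because each scenario ships its own command list, the terminal doesn't need a real shell; it matches what you type against the scenario's `commands` and replays the canned output. Here's a minimal sketch of that lookup, reusing the illustrative types above (the real matching logic is likely more forgiving):

```typescript
// Minimal command lookup: exact match against the scenario's predefined
// commands, with a fallback message for anything unrecognized.
function runCommand(scenario: Scenario, input: string): string {
  const normalized = input.trim();
  const match = scenario.commands.find((cmd) => cmd.input === normalized);
  return match
    ? match.output
    : `command not available in this scenario: ${normalized}`;
}

// Example: runCommand(postgresDeadlock, "SELECT * FROM pg_stat_activity WHERE state = 'active';")
// would return the canned pg_stat_activity output defined above.
```

That's also what keeps everything 100% browser-based: the "system" you're debugging is canned state the UI replays, not live infrastructure.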

Why Open Source?
I built 15 scenarios by reading public incident reports from companies that share their postmortems. But here's the thing:
Experienced engineers have way better material.

If you've debugged a gnarly production issue, you have knowledge that could help thousands of developers. The scenario template makes it easy to contribute:

```markdown
## Scenario Template

Title: [Your scenario title]
Difficulty: novice / intermediate / expert

### Initial Symptoms
- What users/monitoring saw first

### Available Logs
- Application logs
- System metrics
- Error traces

### Root Cause
- What actually happened

### Learning Points
- Key takeaways
```

Goal: 100+ community-contributed scenarios covering every type of production failure imaginable.

What I Learned Building This:

  1. Real incidents are educational gold
    Reading through Reddit's 2023 outage or GitLab's database deletion taught me more about distributed systems than any tutorial.

  2. Debugging is pattern recognition
    After building 15 scenarios, I started seeing patterns: cache invalidation issues, connection pool exhaustion, race conditions. Real production debugging is recognizing these patterns fast.

  3. Good logs are everything
    Half the scenarios could've been solved in 5 minutes with better logging. The other half needed actual investigative work. That's the difference between debuggable systems and nightmares.

  4. UI/UX matters for learning tools
    Early versions had too much text. Current version is interactive, scannable, and progressive. People learn better when they're actively investigating vs passively reading.

Try It Out
🌐 Live: errloom.dev
⭐ GitHub: github.com/OSP06/errloom
📝 Contribute: Check out CONTRIBUTING.md
Perfect for:

Junior/mid-level engineers leveling up
Interview prep (especially SRE/DevOps)
Team training on incident response
Anyone who learns by doing

Roadmap:
100+ scenarios (need community help!)
Multi-step incidents with cascading failures
Time-pressure mode (simulate real urgency)
Team collaboration features
Integrations with real observability tools
Mobile-friendly version

Looking for Contributors
If you've got production war stories, I'd love your help:

Scenario authors - Share incidents you've debugged
Code contributors - React/TypeScript improvements
UX feedback - How can the learning experience be better?
Testers - Try scenarios and report issues

Even just sharing the project helps more devs discover it.

Final Thoughts:
Production debugging is one of those skills you can't really learn until you're thrown into it. Errloom won't replace real experience, but it might help you fail faster, learn patterns, and build confidence before the stakes are real.

And hey, if this helps even one person debug their first production incident more confidently, it's worth it.

What do you think? What production nightmares have you faced that would make good learning scenarios?

Drop a comment or check out the repo! 🚀
