David Cruz

A Clean-Room Kubernetes CrashLoopBackOff Incident Exercise for SRE/DevOps Learners

I have been building a small learning project called SRE Incident Practice Labs.

The idea is simple:

Most junior DevOps/SRE learners can read Kubernetes commands, follow tutorials, and watch incident-response videos, but they rarely get to practice the judgment part of on-call work:

  • What is the alert actually saying?
  • What is affected?
  • What evidence supports the current hypothesis?
  • What is still unknown?
  • Is this a full outage or a degraded workflow?
  • What should be communicated now?
  • What should wait until there is more evidence?
  • How do you write the postmortem afterward?

So I built two free interactive browser labs where learners can practice first-pass incident triage using real terminal commands in a clean-room training environment.

No real company systems.
No private logs.
No employer incidents.
No customer data.
No proprietary runbooks.

Everything is fictional and synthetic.

The free labs

There are currently two free interactive labs on Killercoda.

Lab 1: Kubernetes CrashLoopBackOff Triage

https://killercoda.com/josuecross/scenario/kubernetes-crashloopbackoff-triage

In this lab, you are the on-call responder for a fictional application called TaskFlow Demo.

A Kubernetes deployment left api-service in CrashLoopBackOff, and your job is to make the first triage pass.

You use real Kubernetes commands like:

  • kubectl get pods
  • kubectl describe pod
  • kubectl logs
  • kubectl get deployment
  • kubectl apply
  • kubectl rollout status

You inspect pod status, review events and logs, compare configuration expectations, apply a safe fix-forward, verify recovery, and write a short first incident update.
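As a rough illustration of the "inspect pod status" step, here is a minimal Python sketch that scans the JSON form of `kubectl get pods -o json` for containers waiting in CrashLoopBackOff. It is not part of the lab. The sample JSON is synthetic and trimmed, the pod name suffixes and the `web-frontend` pod are made up for the example, and the helper name `find_crashlooping` is hypothetical. Only the field names (`containerStatuses`, `restartCount`, `state.waiting.reason`) come from the Kubernetes Pod status schema.

```python
import json

# Hypothetical, trimmed output of `kubectl get pods -o json`.
# Every value here is synthetic training data, not output from the lab.
SAMPLE_PODS_JSON = """
{
  "items": [
    {
      "metadata": {"name": "api-service-7d9f8-abcde"},
      "status": {
        "containerStatuses": [
          {
            "name": "api-service",
            "restartCount": 6,
            "state": {"waiting": {"reason": "CrashLoopBackOff"}}
          }
        ]
      }
    },
    {
      "metadata": {"name": "web-frontend-5c6b7-fghij"},
      "status": {
        "containerStatuses": [
          {
            "name": "web-frontend",
            "restartCount": 0,
            "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}}
          }
        ]
      }
    }
  ]
}
"""


def find_crashlooping(pods_json: str) -> list[tuple[str, str, int]]:
    """Return (pod, container, restart_count) for containers in CrashLoopBackOff."""
    findings = []
    for pod in json.loads(pods_json).get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # A running container has no "waiting" entry, so fall back to {}.
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                findings.append((pod_name, cs["name"], cs["restartCount"]))
    return findings


if __name__ == "__main__":
    for pod, container, restarts in find_crashlooping(SAMPLE_PODS_JSON):
        print(f"{pod}/{container}: CrashLoopBackOff, {restarts} restarts")
```

A script like this only tells you which container is crash-looping and how often it has restarted. The reason still has to come from the events and logs, which is the part the lab asks you to reason about.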

The goal is not to memorize one CrashLoopBackOff fix.

The goal is to practice the incident-response habit:

  • observe the symptom
  • inspect the evidence
  • avoid jumping to conclusions too early
  • make a bounded hypothesis
  • apply a safe training fix
  • verify recovery
  • communicate clearly

Lab 2: SRE On-Call Triage: API Error Rate Alert

https://killercoda.com/josuecross/scenario/sre-api-error-rate-triage

In this lab, you investigate an API 5xx error-rate alert.

You work with a running training API and use real commands like:

  • curl
  • docker logs
  • grep
  • awk
  • nano
  • cat

The scenario focuses on an elevated error rate affecting a fictional task-create workflow.

You compare read and write paths, reproduce intermittent failures, inspect logs, estimate impact, classify severity, and draft the first stakeholder update.
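To make "compare read and write paths" and "estimate impact" concrete, here is a minimal Python sketch that computes the 5xx rate per route from access-log lines. The log format (`METHOD PATH STATUS`), the `/tasks` path, and the sample lines are all assumptions made up for this example; the lab's real log format may differ, and the helper name `error_rate_by_route` is hypothetical.

```python
from collections import Counter

# Synthetic access-log lines in an assumed "METHOD PATH STATUS" format.
SAMPLE_LOG = """\
GET /tasks 200
POST /tasks 201
GET /tasks 200
POST /tasks 500
GET /tasks 200
POST /tasks 201
POST /tasks 503
GET /tasks 200
"""


def error_rate_by_route(log_text: str) -> dict[str, float]:
    """Return the fraction of 5xx responses for each 'METHOD PATH' route."""
    totals = Counter()
    errors = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skip lines that do not match the assumed format
        method, path, status = parts
        route = f"{method} {path}"
        totals[route] += 1
        if 500 <= int(status) <= 599:
            errors[route] += 1
    return {route: errors[route] / totals[route] for route in totals}


if __name__ == "__main__":
    for route, rate in sorted(error_rate_by_route(SAMPLE_LOG).items()):
        print(f"{route}: {rate:.0%} 5xx")
```

With the sample above, reads show no 5xx responses while half of the writes fail. That is the shape of evidence behind a "degraded write workflow, not a full outage" statement, although eight log lines would be far too small a sample to claim that in a real update.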

This lab is less about Kubernetes and more about SRE/on-call judgment:

  • Is the service down or degraded?
  • Which workflow is affected?
  • What is the primary signal?
  • Is latency the main issue or supporting evidence?
  • What can be communicated now?
  • What should not be claimed yet?

Why I made them interactive

I originally started with written incident kits.

Those are still useful, but I realized something important:

If I say this is an incident-practice product, the free experience should feel like incident practice.

A Markdown preview is not enough.

The learner should be able to open a browser lab, run commands, inspect output, make decisions, and write notes.

That is why both core labs now have free Killercoda scenarios.

The paid material does not replace the labs. It is the companion layer for people who want to go deeper after trying them.

What clean-room means here

Clean-room is important to me because SRE and DevOps learning material can easily expose confidential details if it is based on real company work.

These labs do not use:

  • real incidents
  • real logs
  • real dashboards
  • real tickets
  • real customer names
  • private Slack messages
  • employer systems
  • private runbooks
  • proprietary architecture

The fictional company is called TaskFlow Demo.

The service names, alerts, logs, metrics, timelines, root causes, manifests, and postmortems are all synthetic training material.

That makes the labs safer for public learning, portfolio practice, and interview discussion.

What the paid companion pack adds

The free labs are designed for first-pass practice.

They intentionally do not reveal everything.

I also created a paid SRE Incident Practice Labs — Companion Pack for learners who want the deeper study material after attempting the labs.

Companion Pack:

https://cruzer480.gumroad.com/l/cwepcj

It includes:

  • full written incident briefs
  • synthetic evidence packs
  • investigation runbooks
  • troubleshooting worksheets
  • severity and SLO reasoning guides
  • stakeholder update examples
  • answer keys
  • completed postmortems
  • portfolio guides
  • one optional local Kubernetes lab for the CrashLoopBackOff scenario

The paid pack is meant to help you compare your reasoning, study the expected investigation path, read completed postmortem examples, and turn the exercises into portfolio-safe learning material.

Free vs paid

The free Killercoda labs give you:

  • browser-based hands-on practice
  • real terminal commands
  • running training systems where useful
  • guided steps
  • progressive hints
  • first-pass investigation practice

The paid companion pack gives you:

  • answer keys
  • completed postmortems
  • worksheets
  • written runbooks
  • portfolio guidance
  • deeper incident analysis
  • downloadable study material

You can use the free labs without buying anything.

The paid companion pack is for people who want to compare their work against the deeper answer material.

Who this is for

This is mainly for:

  • junior DevOps/SRE learners
  • backend developers moving toward on-call work
  • Kubernetes learners who want incident-response practice
  • people building portfolio projects
  • people preparing for SRE/DevOps interviews
  • learners who want to practice stakeholder updates and postmortems

It is not a certification.

It is not a job guarantee.

It is not production guidance.

It is a clean-room practice environment for building better incident-response judgment.

What I am trying to validate

This is also a product experiment.

I want to know whether learners find value in this format:

  • free interactive incident labs
  • paid companion materials
  • clean-room scenarios
  • answer keys and postmortems
  • portfolio-safe explanations

If people find it useful, I want to keep adding scenarios.

Possible future labs:

  • queue backlog / worker saturation
  • noisy alert / false positive
  • deployment rollback decision
  • weak postmortem action items
  • latency spike
  • SLO burn alert

Links

Free Kubernetes CrashLoopBackOff lab:

https://killercoda.com/josuecross/scenario/kubernetes-crashloopbackoff-triage

Free API Error Rate lab:

https://killercoda.com/josuecross/scenario/sre-api-error-rate-triage

Public scenarios repo:

https://github.com/josuecross/killercoda-sre-oncall-triage

Public CrashLoopBackOff sample repo:

https://github.com/josuecross/sre-crashloopbackoff-incident-kit

Paid Companion Pack:

https://cruzer480.gumroad.com/l/cwepcj