A Clean-Room Kubernetes CrashLoopBackOff Incident Exercise for SRE/DevOps Learners

David Cruz — Fri, 05 Jun 2026 00:00:33 +0000

I have been building a small learning project called SRE Incident Practice Labs.

The idea is simple:

Most junior DevOps/SRE learners can read Kubernetes commands, run tutorials, or watch incident-response videos, but they rarely get to practice the judgment part of on-call work:

What is the alert actually saying?
What is affected?
What evidence supports the current hypothesis?
What is still unknown?
Is this a full outage or a degraded workflow?
What should be communicated now?
What should wait until there is more evidence?
How do you write the postmortem afterward?

So I built two free interactive browser labs where learners can practice first-pass incident triage using real terminal commands in a clean-room training environment.

No real company systems.
No private logs.
No employer incidents.
No customer data.
No proprietary runbooks.

Everything is fictional and synthetic.

The free labs

There are currently two free interactive labs on Killercoda.

Lab 1: Kubernetes CrashLoopBackOff Triage

https://killercoda.com/josuecross/scenario/kubernetes-crashloopbackoff-triage

In this lab, you are the on-call responder for a fictional application called TaskFlow Demo.

A Kubernetes deployment left api-service in CrashLoopBackOff, and your job is to make the first triage pass.

You use real Kubernetes commands like:

kubectl get pods
kubectl describe pod
kubectl logs
kubectl get deployment
kubectl apply
kubectl rollout status

You inspect pod status, review events and logs, compare configuration expectations, apply a safe fix-forward, verify recovery, and write a short first incident update.
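As a rough sketch of that flow, and not the lab's actual solution, the commands above could be chained in the order below. The deployment name api-service comes from the scenario; the namespace, pod name, and manifest path are hypothetical arguments you would supply yourself. Everything is wrapped in a function so nothing runs until you call it.

```shell
#!/usr/bin/env bash
# First-pass CrashLoopBackOff triage, wrapped in a function so nothing runs
# until it is called. namespace, pod, and manifest are hypothetical arguments.
triage_crashloop() {
  local namespace="$1" pod="$2" manifest="$3"

  # 1. Observe the symptom: pod status and restart count.
  kubectl get pods -n "$namespace"

  # 2. Inspect the evidence: events, last state, exit code.
  kubectl describe pod "$pod" -n "$namespace"

  # 3. Read logs from the previous (crashed) container instance.
  kubectl logs "$pod" -n "$namespace" --previous

  # 4. Compare the live deployment spec with what the app expects.
  kubectl get deployment api-service -n "$namespace" -o yaml

  # 5. Fix forward with a corrected manifest, then verify the rollout.
  kubectl apply -f "$manifest" -n "$namespace"
  kubectl rollout status deployment/api-service -n "$namespace" --timeout=120s
}
```

In practice you would stop after each step and read the output before moving on; the function only shows the order, not the judgment between steps.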

The goal is not to memorize one CrashLoopBackOff fix.

The goal is to practice the incident-response habit:

observe the symptom
inspect the evidence
avoid jumping too early
make a bounded hypothesis
apply a safe training fix
verify recovery
communicate clearly

Lab 2: SRE On-Call Triage: API Error Rate Alert

https://killercoda.com/josuecross/scenario/sre-api-error-rate-triage

In this lab, you investigate an API 5xx error-rate alert.

You work with a running training API and use real commands like:

curl
docker logs
grep
awk
nano
cat

The scenario focuses on an elevated error rate affecting a fictional task-create workflow.

You compare read and write paths, reproduce intermittent failures, inspect logs, estimate impact, classify severity, and draft the first stakeholder update.
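To illustrate the "compare read and write paths" and "estimate impact" steps, here is a minimal sketch that counts 5xx responses per route with awk. The log line format, the /tasks path, and the sample data are all assumptions made up for this example; the lab's real log format may differ.

```shell
#!/usr/bin/env bash
# Summarize the 5xx error rate per method and path from access-log lines.
# Assumed (hypothetical) line format: TIMESTAMP METHOD PATH STATUS LATENCY_MS
error_rate_by_route() {
  LC_ALL=C awk '
    { route = $2 " " $3; total[route]++ }
    $4 >= 500 && $4 <= 599 { errors[route]++ }
    END {
      for (route in total)
        printf "%s total=%d errors=%d rate=%.1f%%\n",
               route, total[route], errors[route], 100 * errors[route] / total[route]
    }
  ' | LC_ALL=C sort
}

# Synthetic sample: reads succeed, some task-create writes fail.
sample_log() {
  cat <<'EOF'
2026-01-01T00:00:01Z GET /tasks 200 41
2026-01-01T00:00:02Z POST /tasks 201 95
2026-01-01T00:00:03Z POST /tasks 500 812
2026-01-01T00:00:04Z GET /tasks 200 38
2026-01-01T00:00:05Z POST /tasks 500 790
2026-01-01T00:00:06Z POST /tasks 201 102
EOF
}

sample_log | error_rate_by_route
# GET /tasks total=2 errors=0 rate=0.0%
# POST /tasks total=4 errors=2 rate=50.0%
```

Against a running container you could feed it real output instead, for example by piping docker logs for your container into error_rate_by_route, provided the fields line up with the assumed format.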

This lab is less about Kubernetes and more about SRE/on-call judgment:

Is the service down or degraded?
Which workflow is affected?
What is the primary signal?
Is latency the main issue or supporting evidence?
What can be communicated now?
What should not be claimed yet?
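One way to keep those questions honest is to give the first update a fixed shape that separates confirmed facts from open questions. The sketch below is a hypothetical template, not the lab's answer key; the field names and example values are my own illustration.

```shell
#!/usr/bin/env bash
# Print a first stakeholder update that keeps confirmed facts apart from
# open questions. Field names and example values are hypothetical.
first_update() {
  local status="$1" affected="$2" evidence="$3" unknown="$4" next_update="$5"
  cat <<EOF
STATUS: $status
AFFECTED: $affected
CONFIRMED: $evidence
NOT YET KNOWN: $unknown
NEXT UPDATE: $next_update
EOF
}

first_update \
  "Degraded, not down" \
  "Task creation (writes); reads look healthy" \
  "Elevated 5xx on task-create requests in API logs" \
  "Root cause and exact number of affected users" \
  "In 30 minutes or on material change"
```

Forcing a "NOT YET KNOWN" line makes it harder to slip an unverified root cause into the first message.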

Why I made them interactive

I originally started with written incident kits.

Those are still useful, but I realized something important:

If I say this is an incident-practice product, the free experience should feel like incident practice.

A Markdown preview is not enough.

The learner should be able to open a browser lab, run commands, inspect output, make decisions, and write notes.

That is why both core labs now have free Killercoda scenarios.

The paid material does not replace the labs. It is a companion layer for people who want to go deeper after trying them.

What clean-room means here

Clean-room is important to me because SRE and DevOps learning material can easily become unsafe if it is based on real company work.

These labs do not use:

real incidents
real logs
real dashboards
real tickets
real customer names
private Slack messages
employer systems
private runbooks
proprietary architecture

The fictional company is called TaskFlow Demo.

The service names, alerts, logs, metrics, timelines, root causes, manifests, and postmortems are all synthetic training material.

That makes the labs safer for public learning, portfolio practice, and interview discussion.

What the paid companion pack adds

The free labs are designed for first-pass practice.

They intentionally do not reveal everything.

I also created a paid SRE Incident Practice Labs — Companion Pack for learners who want the deeper study material after attempting the labs.

Companion Pack:

https://cruzer480.gumroad.com/l/cwepcj

It includes:

full written incident briefs
synthetic evidence packs
investigation runbooks
troubleshooting worksheets
severity and SLO reasoning guides
stakeholder update examples
answer keys
completed postmortems
portfolio guides
one optional local Kubernetes lab for the CrashLoopBackOff scenario

The paid pack is meant to help you compare your reasoning, study the expected investigation path, read completed postmortem examples, and turn the exercises into portfolio-safe learning material.

Free vs paid

The free Killercoda labs give you:

browser-based hands-on practice
real terminal commands
running training systems where useful
guided steps
progressive hints
first-pass investigation practice

The paid companion pack gives you:

answer keys
completed postmortems
worksheets
written runbooks
portfolio guidance
deeper incident analysis
downloadable study material

You can use the free labs without buying anything.

The paid companion pack is for people who want to compare their work against the deeper answer material.

Who this is for

This is mainly for:

junior DevOps/SRE learners
backend developers moving toward on-call work
Kubernetes learners who want incident-response practice
people building portfolio projects
people preparing for SRE/DevOps interviews
learners who want to practice stakeholder updates and postmortems

It is not a certification.

It is not a job guarantee.

It is not production guidance.

It is a clean-room practice environment for building better incident-response judgment.

What I am trying to validate

This is also a product experiment.

I want to know whether learners find value in this format:

free interactive incident labs
paid companion materials
clean-room scenarios
answer keys and postmortems
portfolio-safe explanations

If people find it useful, I want to keep adding scenarios.

Possible future labs:

queue backlog / worker saturation
noisy alert / false positive
deployment rollback decision
weak postmortem action items
latency spike
SLO burn alert

Links

Free Kubernetes CrashLoopBackOff lab:

https://killercoda.com/josuecross/scenario/kubernetes-crashloopbackoff-triage

Free API Error Rate lab:

https://killercoda.com/josuecross/scenario/sre-api-error-rate-triage

Public scenarios repo:

https://github.com/josuecross/killercoda-sre-oncall-triage

Public CrashLoopBackOff sample repo:

https://github.com/josuecross/sre-crashloopbackoff-incident-kit

Paid Companion Pack: