David Cruz

Posted on Jun 5 • Edited on Jun 7

A Clean-Room Kubernetes CrashLoopBackOff Incident Exercise for SRE/DevOps Learners

#kubernetes #devops #sre #learning

I have been building a small learning project called SRE Incident Practice Labs.

The motivation is simple. Most junior DevOps/SRE learners can read Kubernetes commands, follow tutorials, or watch incident-response videos, but they rarely get to practice the judgment part of on-call work:

What is the alert actually saying?
What is affected?
What evidence supports the current hypothesis?
What is still unknown?
Is this a full outage or a degraded workflow?
What should be communicated now?
What should wait until there is more evidence?
How do you write the postmortem afterward?

So I built two free interactive browser labs where learners can practice first-pass incident triage using real terminal commands in a clean-room training environment.

No real company systems.
No private logs.
No employer incidents.
No customer data.
No proprietary runbooks.

Everything is fictional and synthetic.

The free labs

There are currently two free interactive labs on Killercoda.

Lab 1: Kubernetes CrashLoopBackOff Triage

https://killercoda.com/josuecross/scenario/kubernetes-crashloopbackoff-triage

In this lab, you are the on-call responder for a fictional application called TaskFlow Demo.

After a Kubernetes deployment, api-service is stuck in CrashLoopBackOff, and your job is to make the first triage pass.

You use real Kubernetes commands like:

kubectl get pods
kubectl describe pod
kubectl logs
kubectl get deployment
kubectl apply
kubectl rollout status

You inspect pod status, review events and logs, compare configuration expectations, apply a safe fix-forward, verify recovery, and write a short first incident update.
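As a rough illustration, that first pass might map onto a command sequence like the one below. This is a sketch under assumptions, not the lab's actual solution: the namespace `taskflow`, the label `app=api-service`, the deployment name `api-service`, and the manifest file `api-service-fixed.yaml` are hypothetical names chosen for the example. By default the script only prints each command; set `DRY_RUN=0` to run them against a cluster.

```shell
#!/usr/bin/env bash
# Sketch of a first triage pass for a workload stuck in CrashLoopBackOff.
# Hypothetical names (assumptions, not taken from the lab):
# namespace, label selector, deployment name, and manifest file.
NAMESPACE="${NAMESPACE:-taskflow}"
SELECTOR="${SELECTOR:-app=api-service}"
FIX_MANIFEST="${FIX_MANIFEST:-api-service-fixed.yaml}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands, 0 = also execute them

run() {
  # Print the command; execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

triage_first_pass() {
  # 1. Observe the symptom: pod status and restart counts.
  run kubectl get pods -n "$NAMESPACE" -l "$SELECTOR"
  # 2. Inspect the evidence: events, last state, exit code.
  run kubectl describe pod -n "$NAMESPACE" -l "$SELECTOR"
  # 3. Read logs from the previous (crashed) container instance.
  run kubectl logs -n "$NAMESPACE" -l "$SELECTOR" --previous --tail=50
  # 4. Compare the live deployment spec with what the application expects.
  run kubectl get deployment api-service -n "$NAMESPACE" -o yaml
  # 5. Apply the fix-forward, then verify that the rollout recovers.
  run kubectl apply -n "$NAMESPACE" -f "$FIX_MANIFEST"
  run kubectl rollout status deployment/api-service -n "$NAMESPACE" --timeout=120s
}

triage_first_pass
```

Printing the commands before running them is a deliberate choice here: it lets you read the plan in order (observe, inspect, fix, verify) before anything touches the cluster.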

The goal is not to memorize one CrashLoopBackOff fix.

The goal is to practice the incident-response habit:

observe the symptom
inspect the evidence
avoid jumping to conclusions too early
make a bounded hypothesis
apply a safe training fix
verify recovery
communicate clearly

Lab 2: SRE On-Call Triage: API Error Rate Alert

https://killercoda.com/josuecross/scenario/sre-api-error-rate-triage

In this lab, you investigate an API 5xx error-rate alert.

You work with a running training API and use real commands like:

curl
docker logs
grep
awk
nano
cat

The scenario focuses on an elevated error rate affecting a fictional task-create workflow.

You compare read and write paths, reproduce intermittent failures, inspect logs, estimate impact, classify severity, and draft the first stakeholder update.
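To make the read-versus-write comparison concrete, here is a minimal sketch of how a per-route 5xx rate could be estimated with `awk`. The log format (`METHOD PATH STATUS`), the `/tasks` route, and the sample lines are invented for illustration; they are not the lab's actual output, and real logs would need the field positions adjusted.

```shell
#!/usr/bin/env bash
# Synthetic access log: METHOD PATH STATUS.
# The format and every value here are invented for this example.
sample_log() {
  cat <<'EOF'
GET /tasks 200
GET /tasks 200
POST /tasks 201
POST /tasks 500
GET /tasks 200
POST /tasks 201
POST /tasks 503
GET /tasks 200
POST /tasks 201
POST /tasks 201
EOF
}

# Count total and 5xx responses per "METHOD PATH" and print the error percentage.
error_rate_by_route() {
  awk '
    {
      route = $1 " " $2
      total[route]++
      if ($3 >= 500 && $3 <= 599) errors[route]++
    }
    END {
      for (route in total)
        printf "%s total=%d errors=%d rate=%.1f%%\n", route, total[route], errors[route], 100 * errors[route] / total[route]
    }
  ' | sort
}

sample_log | error_rate_by_route
```

On this sample the read path shows 0 errors out of 4 requests and the write path shows 2 out of 6 (33.3%), which is the kind of evidence that supports calling the incident a degraded write workflow rather than a full outage.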

This lab is less about Kubernetes and more about SRE/on-call judgment:

Is the service down or degraded?
Which workflow is affected?
What is the primary signal?
Is latency the main issue or supporting evidence?
What can be communicated now?
What should not be claimed yet?

Why I made them interactive

I originally started with written incident kits.

Those are still useful, but I realized something important:

If I say this is an incident-practice product, the free experience should feel like incident practice.

A Markdown preview is not enough.

The learner should be able to open a browser lab, run commands, inspect output, make decisions, and write notes.

That is why both core labs now have free Killercoda scenarios.

The paid material does not replace the labs. It is a companion layer for people who want to go deeper after trying them.

What clean-room means here

Clean-room is important to me because SRE and DevOps learning material can easily become unsafe if it is based on real company work.

These labs do not use:

real incidents
real logs
real dashboards
real tickets
real customer names
private Slack messages
employer systems
private runbooks
proprietary architecture

The fictional company, like its application, is called TaskFlow Demo.

The service names, alerts, logs, metrics, timelines, root causes, manifests, and postmortems are all synthetic training material.

That makes the labs safer for public learning, portfolio practice, and interview discussion.

What the paid companion pack adds

The free labs are designed for first-pass practice.

They intentionally do not reveal everything.

I also created a paid SRE Incident Practice Labs — Companion Pack for learners who want the deeper study material after attempting the labs.

Companion Pack:

https://cruzer480.gumroad.com/l/cwepcj

It includes:

full written incident briefs
synthetic evidence packs
investigation runbooks
troubleshooting worksheets
severity and SLO reasoning guides
stakeholder update examples
answer keys
completed postmortems
portfolio guides
one optional local Kubernetes lab for the CrashLoopBackOff scenario

The paid pack is meant to help you compare your reasoning, study the expected investigation path, read completed postmortem examples, and turn the exercises into portfolio-safe learning material.