I have been building a small learning project called SRE Incident Practice Labs.
The idea is simple:
Most junior DevOps/SRE learners can read Kubernetes commands, run tutorials, or watch incident-response videos, but they rarely get to practice the judgment part of on-call work:
- What is the alert actually saying?
- What is affected?
- What evidence supports the current hypothesis?
- What is still unknown?
- Is this a full outage or a degraded workflow?
- What should be communicated now?
- What should wait until there is more evidence?
- How do you write the postmortem afterward?
So I built two free interactive browser labs where learners can practice first-pass incident triage using real terminal commands in a clean-room training environment.
No real company systems.
No private logs.
No employer incidents.
No customer data.
No proprietary runbooks.
Everything is fictional and synthetic.
The free labs
There are currently two free interactive labs on Killercoda.
Lab 1: Kubernetes CrashLoopBackOff Triage
https://killercoda.com/josuecross/scenario/kubernetes-crashloopbackoff-triage
In this lab, you are the on-call responder for a fictional application called TaskFlow Demo.
A Kubernetes deployment left api-service in CrashLoopBackOff, and your job is to make the first triage pass.
You use real Kubernetes commands like:
- kubectl get pods
- kubectl describe pod
- kubectl logs
- kubectl get deployment
- kubectl apply
- kubectl rollout status
You inspect pod status, review events and logs, compare configuration expectations, apply a safe fix-forward, verify recovery, and write a short first incident update.
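As a rough illustration of the "inspect pod status" step, here is a minimal sketch that filters pod listings for CrashLoopBackOff. It runs against a synthetic sample of `kubectl get pods` output so it works without a cluster. The pod names, restart counts, and the helper names `sample_pods` and `crashlooping_pods` are assumptions made up for this example, not taken from the lab.

```shell
#!/bin/sh
# Synthetic stand-in for `kubectl get pods` output (hypothetical pod names).
# In a real session you would pipe the actual command instead of this function.
sample_pods() {
cat <<'EOF'
NAME                           READY   STATUS             RESTARTS   AGE
api-service-7d9c5b6f4-x2k8p    0/1     CrashLoopBackOff   6          9m
web-frontend-5f7b8c9d6-q4w7z   1/1     Running            0          2h
worker-6c8d7e5f3-m9n1b         1/1     Running            0          2h
EOF
}

# Skip the header row, then print the name and restart count
# of every pod whose STATUS column is CrashLoopBackOff.
crashlooping_pods() {
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1, $4 }'
}

sample_pods | crashlooping_pods
```

With a live cluster, the same filter could be fed by `kubectl get pods | crashlooping_pods`, and the pod name it prints is what you would pass to `kubectl describe pod` and `kubectl logs` for the next round of evidence.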
The goal is not to memorize one CrashLoopBackOff fix.
The goal is to practice the incident-response habit:
- observe the symptom
- inspect the evidence
- avoid jumping too early
- make a bounded hypothesis
- apply a safe training fix
- verify recovery
- communicate clearly
Lab 2: SRE On-Call Triage: API Error Rate Alert
https://killercoda.com/josuecross/scenario/sre-api-error-rate-triage
In this lab, you investigate an API 5xx error-rate alert.
You work with a running training API and use real commands like:
- curl
- docker logs
- grep
- awk
- nano
- cat
The scenario focuses on an elevated error rate affecting a fictional task-create workflow.
You compare read and write paths, reproduce intermittent failures, inspect logs, estimate impact, classify severity, and draft the first stakeholder update.
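To make the "compare read and write paths" and "estimate impact" steps concrete, here is a small sketch that computes a 5xx rate per route with awk. The log format (METHOD PATH STATUS LATENCY_MS), the `/tasks` route, the sample numbers, and the helper names `sample_log` and `error_rate_by_route` are all assumptions for illustration. The lab's real logs may look different, so treat the field numbers as placeholders.

```shell
#!/bin/sh
# Synthetic log lines in an assumed format: METHOD PATH STATUS LATENCY_MS.
# Adjust the field numbers below if the real log format differs.
sample_log() {
cat <<'EOF'
GET /tasks 200 41
POST /tasks 201 88
POST /tasks 500 912
GET /tasks 200 39
POST /tasks 500 1004
POST /tasks 201 95
GET /tasks 200 44
POST /tasks 201 90
EOF
}

# Group requests by method and path, count totals and 5xx responses,
# then print the error rate for each group.
error_rate_by_route() {
  awk '
    { key = $1 " " $2; total[key]++ }
    $3 >= 500 && $3 <= 599 { errors[key]++ }
    END {
      for (key in total)
        printf "%s total=%d 5xx=%d rate=%.0f%%\n",
          key, total[key], errors[key] + 0, 100 * (errors[key] + 0) / total[key]
    }
  ' | sort
}

sample_log | error_rate_by_route
```

On this sample, reads show a 0% error rate while writes show 40% (2 of 5 requests). That pattern points toward a degraded write workflow rather than a full outage, which is the kind of distinction the first stakeholder update needs to make.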
This lab is less about Kubernetes and more about SRE/on-call judgment:
- Is the service down or degraded?
- Which workflow is affected?
- What is the primary signal?
- Is latency the main issue or supporting evidence?
- What can be communicated now?
- What should not be claimed yet?
Why I made them interactive
I originally started with written incident kits.
Those are still useful, but I realized something important:
If I say this is an incident-practice product, the free experience should feel like incident practice.
A Markdown preview is not enough.
The learner should be able to open a browser lab, run commands, inspect output, make decisions, and write notes.
That is why both core labs now have free Killercoda scenarios.
The paid material does not replace the labs. It is a companion layer for people who want to go deeper after trying them.
What clean-room means here
Clean-room is important to me because SRE and DevOps learning material can easily become unsafe if it is based on real company work.
These labs do not use:
- real incidents
- real logs
- real dashboards
- real tickets
- real customer names
- private Slack messages
- employer systems
- private runbooks
- proprietary architecture
The fictional company is called TaskFlow Demo.
The service names, alerts, logs, metrics, timelines, root causes, manifests, and postmortems are all synthetic training material.
That makes the labs safer for public learning, portfolio practice, and interview discussion.
What the paid companion pack adds
The free labs are designed for first-pass practice.
They intentionally do not reveal everything.
I also created a paid SRE Incident Practice Labs — Companion Pack for learners who want the deeper study material after attempting the labs.
Companion Pack:
https://cruzer480.gumroad.com/l/cwepcj
It includes:
- full written incident briefs
- synthetic evidence packs
- investigation runbooks
- troubleshooting worksheets
- severity and SLO reasoning guides
- stakeholder update examples
- answer keys
- completed postmortems
- portfolio guides
- one optional local Kubernetes lab for the CrashLoopBackOff scenario
The paid pack is meant to help you compare your reasoning, study the expected investigation path, read completed postmortem examples, and turn the exercises into portfolio-safe learning material.
Free vs paid
The free Killercoda labs give you:
- browser-based hands-on practice
- real terminal commands
- running training systems where useful
- guided steps
- progressive hints
- first-pass investigation practice
The paid companion pack gives you:
- answer keys
- completed postmortems
- worksheets
- written runbooks
- portfolio guidance
- deeper incident analysis
- downloadable study material
You can use the free labs without buying anything.
The paid companion pack is for people who want to compare their work against the full answer keys and completed postmortems.
Who this is for
This is mainly for:
- junior DevOps/SRE learners
- backend developers moving toward on-call work
- Kubernetes learners who want incident-response practice
- people building portfolio projects
- people preparing for SRE/DevOps interviews
- learners who want to practice stakeholder updates and postmortems
It is not a certification.
It is not a job guarantee.
It is not production guidance.
It is a clean-room practice environment for building better incident-response judgment.
What I am trying to validate
This is also a product experiment.
I want to know whether learners find value in this format:
- free interactive incident labs
- paid companion materials
- clean-room scenarios
- answer keys and postmortems
- portfolio-safe explanations
If people find it useful, I want to keep adding scenarios.
Possible future labs:
- queue backlog / worker saturation
- noisy alert / false positive
- deployment rollback decision
- weak postmortem action items
- latency spike
- SLO burn alert
Links
Free Kubernetes CrashLoopBackOff lab:
https://killercoda.com/josuecross/scenario/kubernetes-crashloopbackoff-triage
Free API Error Rate lab:
https://killercoda.com/josuecross/scenario/sre-api-error-rate-triage
Public scenarios repo:
https://github.com/josuecross/killercoda-sre-oncall-triage
Public CrashLoopBackOff sample repo:
https://github.com/josuecross/sre-crashloopbackoff-incident-kit
Paid Companion Pack:
https://cruzer480.gumroad.com/l/cwepcj