<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David Cruz</title>
    <description>The latest articles on DEV Community by David Cruz (@josuecross).</description>
    <link>https://dev.to/josuecross</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3968913%2F52538ae3-94b8-4eb4-b4a1-4d16a4ba4d06.png</url>
      <title>DEV Community: David Cruz</title>
      <link>https://dev.to/josuecross</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/josuecross"/>
    <language>en</language>
    <item>
      <title>A Clean-Room Kubernetes CrashLoopBackOff Incident Exercise for SRE/DevOps Learners</title>
      <dc:creator>David Cruz</dc:creator>
      <pubDate>Fri, 05 Jun 2026 00:00:33 +0000</pubDate>
      <link>https://dev.to/josuecross/a-clean-room-kubernetes-crashloopbackoff-incident-exercise-for-sredevops-learners-25j8</link>
      <guid>https://dev.to/josuecross/a-clean-room-kubernetes-crashloopbackoff-incident-exercise-for-sredevops-learners-25j8</guid>
      <description>&lt;p&gt;I have been building a small learning project called &lt;strong&gt;SRE Incident Practice Labs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The motivation is simple:&lt;/p&gt;

&lt;p&gt;Most junior DevOps/SRE learners can read Kubernetes commands, run tutorials, or watch incident-response videos, but they rarely get to practice the judgment part of on-call work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the alert actually saying?&lt;/li&gt;
&lt;li&gt;What is affected?&lt;/li&gt;
&lt;li&gt;What evidence supports the current hypothesis?&lt;/li&gt;
&lt;li&gt;What is still unknown?&lt;/li&gt;
&lt;li&gt;Is this a full outage or a degraded workflow?&lt;/li&gt;
&lt;li&gt;What should be communicated now?&lt;/li&gt;
&lt;li&gt;What should wait until there is more evidence?&lt;/li&gt;
&lt;li&gt;How do you write the postmortem afterward?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I built two free interactive browser labs where learners can practice first-pass incident triage using real terminal commands in a clean-room training environment.&lt;/p&gt;

&lt;p&gt;No real company systems.&lt;br&gt;
No private logs.&lt;br&gt;
No employer incidents.&lt;br&gt;
No customer data.&lt;br&gt;
No proprietary runbooks.&lt;/p&gt;

&lt;p&gt;Everything is fictional and synthetic.&lt;/p&gt;

&lt;h2&gt;The free labs&lt;/h2&gt;

&lt;p&gt;There are currently two free interactive labs on Killercoda.&lt;/p&gt;

&lt;h3&gt;Lab 1: Kubernetes CrashLoopBackOff Triage&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://killercoda.com/josuecross/scenario/kubernetes-crashloopbackoff-triage" rel="noopener noreferrer"&gt;https://killercoda.com/josuecross/scenario/kubernetes-crashloopbackoff-triage&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this lab, you are the on-call responder for a fictional application called TaskFlow Demo.&lt;/p&gt;

&lt;p&gt;A Kubernetes deployment left &lt;code&gt;api-service&lt;/code&gt; in &lt;code&gt;CrashLoopBackOff&lt;/code&gt;, and your job is to make the first triage pass.&lt;/p&gt;

&lt;p&gt;You use real Kubernetes commands like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kubectl get pods&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubectl describe pod&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubectl logs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubectl get deployment&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubectl apply&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubectl rollout status&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You inspect pod status, review events and logs, compare configuration expectations, apply a safe fix-forward, verify recovery, and write a short first incident update.&lt;/p&gt;

&lt;p&gt;The goal is not to memorize one CrashLoopBackOff fix.&lt;/p&gt;

&lt;p&gt;The goal is to practice the incident-response habit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;observe the symptom&lt;/li&gt;
&lt;li&gt;inspect the evidence&lt;/li&gt;
&lt;li&gt;avoid jumping to conclusions too early&lt;/li&gt;
&lt;li&gt;make a bounded hypothesis&lt;/li&gt;
&lt;li&gt;apply a safe training fix&lt;/li&gt;
&lt;li&gt;verify recovery&lt;/li&gt;
&lt;li&gt;communicate clearly&lt;/li&gt;
&lt;/ul&gt;
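
&lt;p&gt;As a rough illustration of the observe-then-inspect steps above, here is a minimal Python sketch that reads a table shaped like &lt;code&gt;kubectl get pods&lt;/code&gt; output and flags pods that are not running. The sample table, the pod names, and the helper names (&lt;code&gt;parse_pods&lt;/code&gt;, &lt;code&gt;unhealthy&lt;/code&gt;, &lt;code&gt;first_note&lt;/code&gt;) are made up for this post and are not taken from the lab itself.&lt;/p&gt;

```python
# Synthetic sample shaped like `kubectl get pods` output.
# Pod names, restart counts, and ages are hypothetical training data.
SAMPLE_GET_PODS = """\
NAME                           READY   STATUS             RESTARTS   AGE
api-service-7c9d8f6b5d-abcde   0/1     CrashLoopBackOff   6          9m
web-frontend-5f7b9c4d8-fghij   1/1     Running            0          2d
"""


def parse_pods(table):
    """Turn the text table into a list of dicts keyed by lowercase column name.

    Assumes every cell is a single whitespace-free token, which holds for
    this sample but not for every real kubectl output.
    """
    lines = table.strip().splitlines()
    header = lines[0].lower().split()
    return [dict(zip(header, row.split())) for row in lines[1:]]


def unhealthy(pods):
    """Keep pods whose STATUS column is neither Running nor Completed."""
    healthy = {"Running", "Completed"}
    return [pod for pod in pods if pod["status"] not in healthy]


def first_note(pod):
    """Write down the symptom and the next evidence to collect, not a cause."""
    return (
        f"Observed: {pod['name']} is {pod['status']} "
        f"with {pod['restarts']} restarts. "
        "Next: review events and logs before changing anything."
    )


if __name__ == "__main__":
    for pod in unhealthy(parse_pods(SAMPLE_GET_PODS)):
        print(first_note(pod))
```

&lt;p&gt;The sketch only reports what it sees and points at the next evidence to collect. It does not guess a cause, which is the same discipline the lab asks for.&lt;/p&gt;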

&lt;h3&gt;Lab 2: SRE On-Call Triage: API Error Rate Alert&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://killercoda.com/josuecross/scenario/sre-api-error-rate-triage" rel="noopener noreferrer"&gt;https://killercoda.com/josuecross/scenario/sre-api-error-rate-triage&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this lab, you investigate an API 5xx error-rate alert.&lt;/p&gt;

&lt;p&gt;You work with a running training API and use real commands like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;curl&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docker logs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grep&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;awk&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nano&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cat&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scenario focuses on an elevated error rate affecting a fictional &lt;code&gt;task-create&lt;/code&gt; workflow.&lt;/p&gt;

&lt;p&gt;You compare read and write paths, reproduce intermittent failures, inspect logs, estimate impact, classify severity, and draft the first stakeholder update.&lt;/p&gt;

&lt;p&gt;This lab is less about Kubernetes and more about SRE/on-call judgment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the service down or degraded?&lt;/li&gt;
&lt;li&gt;Which workflow is affected?&lt;/li&gt;
&lt;li&gt;What is the primary signal?&lt;/li&gt;
&lt;li&gt;Is latency the main issue or supporting evidence?&lt;/li&gt;
&lt;li&gt;What can be communicated now?&lt;/li&gt;
&lt;li&gt;What should not be claimed yet?&lt;/li&gt;
&lt;/ul&gt;
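
&lt;p&gt;To make the down-versus-degraded question concrete, here is a small Python sketch that computes the 5xx rate per method and path from a few synthetic log lines. The log format, the &lt;code&gt;/tasks&lt;/code&gt; path, and the classification rule are assumptions for illustration, not the lab's actual data or answer key.&lt;/p&gt;

```python
from collections import Counter

# Synthetic access-log lines in a hypothetical "METHOD PATH STATUS" format.
SAMPLE_LOG = """\
GET /tasks 200
POST /tasks 201
POST /tasks 500
GET /tasks 200
POST /tasks 503
POST /tasks 201
GET /tasks 200
POST /tasks 201
"""


def error_rates(log_text):
    """Return {(method, path): fraction of requests answered with a 5xx}."""
    total = Counter()
    errors = Counter()
    for line in log_text.strip().splitlines():
        method, path, status = line.split()
        key = (method, path)
        total[key] += 1
        if int(status) // 100 == 5:  # any status from 500 to 599
            errors[key] += 1
    return {key: errors[key] / total[key] for key in total}


def classify(rates):
    """Illustrative rule only: 'down' if every workflow fails every request,
    'degraded' if at least one workflow shows errors, otherwise 'healthy'."""
    if rates and all(rate == 1.0 for rate in rates.values()):
        return "down"
    if any(rate > 0 for rate in rates.values()):
        return "degraded"
    return "healthy"


if __name__ == "__main__":
    rates = error_rates(SAMPLE_LOG)
    for (method, path), rate in sorted(rates.items()):
        print(f"{method} {path}: {rate:.0%} of requests returned 5xx")
    print("classification:", classify(rates))
```

&lt;p&gt;In this sample the read path is clean while the write path fails intermittently, so the sketch reports a degraded workflow rather than a full outage. A real severity call would also weigh traffic volume, duration, and the SLO, which this sketch leaves out.&lt;/p&gt;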

&lt;h2&gt;Why I made them interactive&lt;/h2&gt;

&lt;p&gt;I originally started with written incident kits.&lt;/p&gt;

&lt;p&gt;Those are still useful, but I realized something important:&lt;/p&gt;

&lt;p&gt;If I say this is an incident-practice product, the free experience should feel like incident practice.&lt;/p&gt;

&lt;p&gt;A Markdown preview is not enough.&lt;/p&gt;

&lt;p&gt;The learner should be able to open a browser lab, run commands, inspect output, make decisions, and write notes.&lt;/p&gt;

&lt;p&gt;That is why both core labs now have free Killercoda scenarios.&lt;/p&gt;

&lt;p&gt;The paid material is not replacing the labs. It is the companion layer for people who want to go deeper after trying them.&lt;/p&gt;

&lt;h2&gt;What clean-room means here&lt;/h2&gt;

&lt;p&gt;Clean-room is important to me because SRE and DevOps learning material can easily become unsafe if it is based on real company work.&lt;/p&gt;

&lt;p&gt;These labs do not use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;real incidents&lt;/li&gt;
&lt;li&gt;real logs&lt;/li&gt;
&lt;li&gt;real dashboards&lt;/li&gt;
&lt;li&gt;real tickets&lt;/li&gt;
&lt;li&gt;real customer names&lt;/li&gt;
&lt;li&gt;private Slack messages&lt;/li&gt;
&lt;li&gt;employer systems&lt;/li&gt;
&lt;li&gt;private runbooks&lt;/li&gt;
&lt;li&gt;proprietary architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fictional company is called TaskFlow Demo.&lt;/p&gt;

&lt;p&gt;The service names, alerts, logs, metrics, timelines, root causes, manifests, and postmortems are all synthetic training material.&lt;/p&gt;

&lt;p&gt;That makes the labs safer for public learning, portfolio practice, and interview discussion.&lt;/p&gt;

&lt;h2&gt;What the paid companion pack adds&lt;/h2&gt;

&lt;p&gt;The free labs are designed for first-pass practice.&lt;/p&gt;

&lt;p&gt;They intentionally do not reveal everything.&lt;/p&gt;

&lt;p&gt;I also created a paid &lt;strong&gt;SRE Incident Practice Labs — Companion Pack&lt;/strong&gt; for learners who want the deeper study material after attempting the labs.&lt;/p&gt;

&lt;p&gt;Companion Pack:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cruzer480.gumroad.com/l/cwepcj" rel="noopener noreferrer"&gt;https://cruzer480.gumroad.com/l/cwepcj&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;full written incident briefs&lt;/li&gt;
&lt;li&gt;synthetic evidence packs&lt;/li&gt;
&lt;li&gt;investigation runbooks&lt;/li&gt;
&lt;li&gt;troubleshooting worksheets&lt;/li&gt;
&lt;li&gt;severity and SLO reasoning guides&lt;/li&gt;
&lt;li&gt;stakeholder update examples&lt;/li&gt;
&lt;li&gt;answer keys&lt;/li&gt;
&lt;li&gt;completed postmortems&lt;/li&gt;
&lt;li&gt;portfolio guides&lt;/li&gt;
&lt;li&gt;one optional local Kubernetes lab for the CrashLoopBackOff scenario&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The paid pack is meant to help you compare your reasoning, study the expected investigation path, read completed postmortem examples, and turn the exercises into portfolio-safe learning material.&lt;/p&gt;

&lt;h2&gt;Free vs paid&lt;/h2&gt;

&lt;p&gt;The free Killercoda labs give you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;browser-based hands-on practice&lt;/li&gt;
&lt;li&gt;real terminal commands&lt;/li&gt;
&lt;li&gt;running training systems where useful&lt;/li&gt;
&lt;li&gt;guided steps&lt;/li&gt;
&lt;li&gt;progressive hints&lt;/li&gt;
&lt;li&gt;first-pass investigation practice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The paid companion pack gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;answer keys&lt;/li&gt;
&lt;li&gt;completed postmortems&lt;/li&gt;
&lt;li&gt;worksheets&lt;/li&gt;
&lt;li&gt;written runbooks&lt;/li&gt;
&lt;li&gt;portfolio guidance&lt;/li&gt;
&lt;li&gt;deeper incident analysis&lt;/li&gt;
&lt;li&gt;downloadable study material&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can use the free labs without buying anything.&lt;/p&gt;

&lt;p&gt;The paid companion pack is for people who want to compare their work against the deeper answer material.&lt;/p&gt;

&lt;h2&gt;Who this is for&lt;/h2&gt;

&lt;p&gt;This is mainly for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;junior DevOps/SRE learners&lt;/li&gt;
&lt;li&gt;backend developers moving toward on-call work&lt;/li&gt;
&lt;li&gt;Kubernetes learners who want incident-response practice&lt;/li&gt;
&lt;li&gt;people building portfolio projects&lt;/li&gt;
&lt;li&gt;people preparing for SRE/DevOps interviews&lt;/li&gt;
&lt;li&gt;learners who want to practice stakeholder updates and postmortems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is not a certification.&lt;/p&gt;

&lt;p&gt;It is not a job guarantee.&lt;/p&gt;

&lt;p&gt;It is not production guidance.&lt;/p&gt;

&lt;p&gt;It is a clean-room practice environment for building better incident-response judgment.&lt;/p&gt;

&lt;h2&gt;What I am trying to validate&lt;/h2&gt;

&lt;p&gt;This is also a product experiment.&lt;/p&gt;

&lt;p&gt;I want to know whether learners find value in this format:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;free interactive incident labs&lt;/li&gt;
&lt;li&gt;paid companion materials&lt;/li&gt;
&lt;li&gt;clean-room scenarios&lt;/li&gt;
&lt;li&gt;answer keys and postmortems&lt;/li&gt;
&lt;li&gt;portfolio-safe explanations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If people find it useful, I want to keep adding scenarios.&lt;/p&gt;

&lt;p&gt;Possible future labs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queue backlog / worker saturation&lt;/li&gt;
&lt;li&gt;noisy alert / false positive&lt;/li&gt;
&lt;li&gt;deployment rollback decision&lt;/li&gt;
&lt;li&gt;weak postmortem action items&lt;/li&gt;
&lt;li&gt;latency spike&lt;/li&gt;
&lt;li&gt;SLO burn alert&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Links&lt;/h2&gt;

&lt;p&gt;Free Kubernetes CrashLoopBackOff lab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://killercoda.com/josuecross/scenario/kubernetes-crashloopbackoff-triage" rel="noopener noreferrer"&gt;https://killercoda.com/josuecross/scenario/kubernetes-crashloopbackoff-triage&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Free API Error Rate lab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://killercoda.com/josuecross/scenario/sre-api-error-rate-triage" rel="noopener noreferrer"&gt;https://killercoda.com/josuecross/scenario/sre-api-error-rate-triage&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Public scenarios repo:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/josuecross/killercoda-sre-oncall-triage" rel="noopener noreferrer"&gt;https://github.com/josuecross/killercoda-sre-oncall-triage&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Public CrashLoopBackOff sample repo:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/josuecross/sre-crashloopbackoff-incident-kit" rel="noopener noreferrer"&gt;https://github.com/josuecross/sre-crashloopbackoff-incident-kit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paid Companion Pack:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cruzer480.gumroad.com/l/cwepcj" rel="noopener noreferrer"&gt;https://cruzer480.gumroad.com/l/cwepcj&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>sre</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
