<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David Cruz</title>
    <description>The latest articles on DEV Community by David Cruz (@josuecross).</description>
    <link>https://dev.to/josuecross</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3968913%2F52538ae3-94b8-4eb4-b4a1-4d16a4ba4d06.png</url>
      <title>DEV Community: David Cruz</title>
      <link>https://dev.to/josuecross</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/josuecross"/>
    <language>en</language>
    <item>
      <title>A Clean-Room Kubernetes CrashLoopBackOff Incident Exercise for SRE/DevOps Learners</title>
      <dc:creator>David Cruz</dc:creator>
      <pubDate>Fri, 05 Jun 2026 00:00:33 +0000</pubDate>
      <link>https://dev.to/josuecross/a-clean-room-kubernetes-crashloopbackoff-incident-exercise-for-sredevops-learners-25j8</link>
      <guid>https://dev.to/josuecross/a-clean-room-kubernetes-crashloopbackoff-incident-exercise-for-sredevops-learners-25j8</guid>
      <description>&lt;p&gt;&lt;code&gt;CrashLoopBackOff&lt;/code&gt; is one of those Kubernetes states that many learners recognize, but fewer people practice investigating in a structured incident-response flow.&lt;/p&gt;

&lt;p&gt;It is tempting to treat &lt;code&gt;CrashLoopBackOff&lt;/code&gt; as the root cause.&lt;/p&gt;

&lt;p&gt;It usually is not.&lt;/p&gt;

&lt;p&gt;It is a symptom: the kubelet keeps restarting a container that keeps exiting, and it waits longer between attempts each time. The actual cause might be missing configuration, a bad command, a dependency the app wrongly assumes is ready, faulty startup logic, a failed migration, a missing secret, or something else.&lt;/p&gt;

&lt;p&gt;I wanted a small clean-room exercise for practicing the reasoning flow around this kind of incident without using any real company system, private incident, internal runbook, or proprietary material.&lt;/p&gt;

&lt;p&gt;So I built a fictional Kubernetes incident exercise around a fake SaaS app called &lt;strong&gt;TaskFlow Demo&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;The scenario&lt;/h2&gt;

&lt;p&gt;The fictional setup is intentionally small:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;app: &lt;code&gt;TaskFlow Demo&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;namespace: &lt;code&gt;taskflow-demo&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;affected component: &lt;code&gt;api-service&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;symptom: &lt;code&gt;CrashLoopBackOff&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;context: a new version was deployed shortly before the failure&lt;/li&gt;
&lt;li&gt;learner role: on-call responder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The free sample does not reveal the full answer key. It gives enough context to practice the first investigation pass.&lt;/p&gt;

&lt;p&gt;Example synthetic pod status:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; taskflow-demo
&lt;span class="go"&gt;NAME                           READY   STATUS             RESTARTS   AGE
api-service-6f7d8c9b7c-px42q   0/1     CrashLoopBackOff   5          9m
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example synthetic event excerpt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  9m                   default-scheduler  Successfully assigned taskflow-demo/api-service-6f7d8c9b7c-px42q
  Normal   Pulled     8m                   kubelet            Container image already present on machine
  Normal   Started    8m                   kubelet            Started container api-service
  Warning  BackOff    2m                   kubelet            Back-off restarting failed container api-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The point is not to memorize commands.&lt;/p&gt;

&lt;p&gt;The point is to practice moving from symptom to evidence.&lt;/p&gt;

&lt;h2&gt;The investigation flow&lt;/h2&gt;

&lt;p&gt;A useful beginner-friendly flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Confirm the failing component.&lt;/li&gt;
&lt;li&gt;Confirm the namespace.&lt;/li&gt;
&lt;li&gt;Check whether the restart count is increasing.&lt;/li&gt;
&lt;li&gt;Look at recent pod events.&lt;/li&gt;
&lt;li&gt;Review startup logs.&lt;/li&gt;
&lt;li&gt;Check whether a recent deployment lines up with the failure.&lt;/li&gt;
&lt;li&gt;Separate facts from assumptions.&lt;/li&gt;
&lt;li&gt;Decide whether rollback or fix-forward is safer.&lt;/li&gt;
&lt;li&gt;Verify recovery before calling the incident resolved.&lt;/li&gt;
&lt;li&gt;Write a concise postmortem.&lt;/li&gt;
&lt;/ol&gt;
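
&lt;p&gt;As a rough illustration, the evidence-gathering steps in that list (roughly steps 1 to 6) map onto a handful of read-only &lt;code&gt;kubectl&lt;/code&gt; commands. The sketch below only builds and prints the command strings; it does not run them. The helper name &lt;code&gt;first_pass_commands&lt;/code&gt; is hypothetical, and the Deployment name &lt;code&gt;api-service&lt;/code&gt; is an assumption inferred from the fictional pod name, not something the kit confirms.&lt;/p&gt;

```python
# Hypothetical helper: build the read-only kubectl commands for the
# first evidence-gathering steps. Nothing is executed against a cluster.

def first_pass_commands(namespace, pod, deployment):
    """Return (step, command) pairs for the first investigation pass."""
    return [
        ("Confirm the failing pod and its restart count",
         f"kubectl get pods -n {namespace}"),
        ("Look at recent pod events",
         f"kubectl describe pod {pod} -n {namespace}"),
        ("Review logs from the previous (crashed) container",
         f"kubectl logs {pod} -n {namespace} --previous"),
        ("Check whether a recent rollout lines up with the failure",
         f"kubectl rollout history deployment/{deployment} -n {namespace}"),
    ]


if __name__ == "__main__":
    # Names come from the fictional scenario; the Deployment name is assumed.
    steps = first_pass_commands(
        "taskflow-demo", "api-service-6f7d8c9b7c-px42q", "api-service")
    for number, (step, command) in enumerate(steps, start=1):
        print(f"{number}. {step}\n   $ {command}")
```

&lt;p&gt;Running &lt;code&gt;kubectl get pods&lt;/code&gt; twice a minute or so apart is one simple way to see whether the restart count is still climbing. The later steps (facts versus assumptions, rollback versus fix-forward, verification, postmortem) are judgment calls and are deliberately left out of the sketch.&lt;/p&gt;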

&lt;p&gt;For example, in the events above the pod is scheduled, the image is already present on the node, and the container starts before it exits and goes into back-off. That points away from scheduling and image-pull problems and toward application startup behavior.&lt;/p&gt;

&lt;p&gt;That does not prove the root cause yet.&lt;/p&gt;

&lt;p&gt;It only narrows the search.&lt;/p&gt;
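
&lt;p&gt;That narrowing step can be written down as a tiny decision function. This is a deliberately simplified sketch, not a diagnostic tool: the function name &lt;code&gt;narrow_search&lt;/code&gt; is hypothetical, it only looks at the event reasons shown in the synthetic excerpt, and it ignores event messages and ordering, which a real investigation would also read.&lt;/p&gt;

```python
# Simplified sketch: use the pod event reasons that were observed to decide
# which area to investigate next. It narrows the search; it proves nothing.

def narrow_search(event_reasons):
    """Map observed pod event reasons to a rough search area."""
    seen = set(event_reasons)
    if "Scheduled" not in seen:
        return "scheduling"
    if "Pulled" not in seen:
        return "image pull"
    if "Started" not in seen:
        return "container creation or start"
    if "BackOff" in seen:
        # Scheduled, pulled, and started, yet restarting: look at the app.
        return "application startup"
    return "no failure evidence in events"


if __name__ == "__main__":
    # Reasons taken from the synthetic event excerpt in this article.
    reasons = ["Scheduled", "Pulled", "Started", "BackOff"]
    print(narrow_search(reasons))  # prints: application startup
```

&lt;p&gt;The result is a place to look next, such as the previous container's logs, not a root cause.&lt;/p&gt;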

&lt;h2&gt;Why I made it clean-room&lt;/h2&gt;

&lt;p&gt;I wanted the exercise to be safe to publish, discuss, and use as a learning artifact.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no real company systems&lt;/li&gt;
&lt;li&gt;no internal service names&lt;/li&gt;
&lt;li&gt;no real logs&lt;/li&gt;
&lt;li&gt;no real dashboards&lt;/li&gt;
&lt;li&gt;no private runbooks&lt;/li&gt;
&lt;li&gt;no customer data&lt;/li&gt;
&lt;li&gt;no copied incident timelines&lt;/li&gt;
&lt;li&gt;no employer-specific architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything in the scenario is fictional and synthetic.&lt;/p&gt;

&lt;p&gt;This also makes it easier for learners to talk about the exercise in a portfolio or interview without pretending it was real production experience.&lt;/p&gt;

&lt;h2&gt;What the free sample includes&lt;/h2&gt;

&lt;p&gt;The public GitHub repo includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a short architecture overview&lt;/li&gt;
&lt;li&gt;a synthetic incident preview&lt;/li&gt;
&lt;li&gt;a partial investigation runbook&lt;/li&gt;
&lt;li&gt;a preview postmortem template&lt;/li&gt;
&lt;li&gt;a clean-room policy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Free sample:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/josuecross/sre-crashloopbackoff-incident-kit" rel="noopener noreferrer"&gt;https://github.com/josuecross/sre-crashloopbackoff-incident-kit&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What I added in the paid v0.2 kit&lt;/h2&gt;

&lt;p&gt;I also made a small paid version for people who want the full exercise.&lt;/p&gt;

&lt;p&gt;It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;full incident brief&lt;/li&gt;
&lt;li&gt;incident commander checklist&lt;/li&gt;
&lt;li&gt;severity matrix&lt;/li&gt;
&lt;li&gt;complete CrashLoopBackOff investigation runbook&lt;/li&gt;
&lt;li&gt;troubleshooting worksheet&lt;/li&gt;
&lt;li&gt;stakeholder update examples&lt;/li&gt;
&lt;li&gt;postmortem template&lt;/li&gt;
&lt;li&gt;completed example postmortem&lt;/li&gt;
&lt;li&gt;answer key / expected investigation path&lt;/li&gt;
&lt;li&gt;portfolio guide&lt;/li&gt;
&lt;li&gt;optional local Kubernetes CrashLoopBackOff lab&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The local lab is intentionally minimal. It uses a fictional &lt;code&gt;api-service&lt;/code&gt; and two synthetic Kubernetes manifests so the learner can reproduce the failure and apply the fixed version on a disposable local Kind or Minikube cluster.&lt;/p&gt;

&lt;p&gt;It does not include a real app, real database, Dockerfile, cloud infrastructure, Grafana, Prometheus, EKS, Terraform, or production guidance.&lt;/p&gt;

&lt;p&gt;Paid v0.2 kit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cruzer480.gumroad.com/l/sre-crashloopbackoff-kit" rel="noopener noreferrer"&gt;https://cruzer480.gumroad.com/l/sre-crashloopbackoff-kit&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What I am trying to validate&lt;/h2&gt;

&lt;p&gt;I am testing whether this format is useful for junior DevOps/SRE learners:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;written incident scenario&lt;/li&gt;
&lt;li&gt;guided investigation&lt;/li&gt;
&lt;li&gt;answer key&lt;/li&gt;
&lt;li&gt;postmortem practice&lt;/li&gt;
&lt;li&gt;optional local lab&lt;/li&gt;
&lt;li&gt;portfolio-friendly explanation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I am especially interested in whether learners would prefer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;more written incident scenarios,&lt;/li&gt;
&lt;li&gt;more local Kubernetes labs,&lt;/li&gt;
&lt;li&gt;a guided local runner,&lt;/li&gt;
&lt;li&gt;monitoring/Grafana-style follow-up labs,&lt;/li&gt;
&lt;li&gt;or a different incident type entirely.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Question&lt;/h2&gt;

&lt;p&gt;Would this kind of clean-room incident exercise be useful for people learning Kubernetes/SRE before they get real on-call experience?&lt;/p&gt;

&lt;p&gt;And if you were learning from it, what scenario should come next?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bad readiness probe&lt;/li&gt;
&lt;li&gt;image pull failure&lt;/li&gt;
&lt;li&gt;failing migration&lt;/li&gt;
&lt;li&gt;OOMKilled&lt;/li&gt;
&lt;li&gt;service routing issue&lt;/li&gt;
&lt;li&gt;noisy alert / false positive&lt;/li&gt;
&lt;li&gt;deployment rollback practice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Disclosure: I used AI assistance to draft and edit this article, and reviewed the final content for clean-room safety and accuracy.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>sre</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
