The first time a pod crashed in production, I ran kubectl logs and got nothing. It was empty, clean, and had no errors.
I didn't know the container had already restarted, nor did I know about --previous. I was staring at a blank screen while the app was down, and I had no idea why.
The command exists, but I just didn't know to reach for it when it mattered. So I kept digging in the wrong place: restarting pods, re-running requests, but nothing changed.
Meanwhile, the actual error had already disappeared with the previous container.
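For the record, the flag I was missing is a one-liner (the pod name is a placeholder):

```shell
# Current container's logs - empty if it just restarted clean
kubectl logs <pod-name>

# The PREVIOUS container's logs - where the crash evidence lives
kubectl logs <pod-name> --previous
```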
That's the part no tutorial really prepares you for. Not the command, but knowing when to use it.
That only comes from things breaking in your own lab. From hitting errors, misreading them, and going back until it clicks.
The more you build, the more it breaks. The more it breaks, the more you learn, if you slow down to understand what actually happened.
This article walks through six of those failures. What they look like, how to debug them, and how to document them so the lesson sticks.
Errors You'll Hit in Your Home Lab, and What to Do When You Do
- Backend crashes on startup because the database isn't ready yet
- Pods stuck in Pending while nothing schedules
- ImagePullBackOff: the cluster can't find the image you just built
- Ingress returns a clean 404 even though everything looks fine
- Grafana shows "No data" on every panel
- ERR_TOO_MANY_REDIRECTS when two layers both try to own HTTPS
Each one comes with the exact commands, what you're looking for, and a five-line incident report template, so every error becomes an experience you can talk about.
Why Building in a Home Lab Gives You Real Experience
Tutorials show you what to do when everything works. Your home lab shows you what to do when it doesn't.
When you build a real app in your lab, deploy it, wire up monitoring, and connect it to a database, things break. Containers crash, pipelines fail, services stop talking to each other, and Terraform and your actual infra fall out of sync.
Most people hit these errors, Google the fix, copy-paste it, and move on. The error is gone. Nothing was learned.
The engineers who get hired and trusted are the ones who stop when something breaks, read what the system was telling them, fix it based on what they found, and write down what happened.
That's the difference between using a home lab and learning from one.
What You Need to Set This Up
A laptop with 8GB RAM (16GB recommended) and 50GB free disk space. No cloud account. No Raspberry Pi.
Never built a DevOps lab before? Start here first; it walks you through Docker, Kubernetes, Vagrant, Terraform, and Ansible from scratch on your laptop:
→ How to Set Up a DevOps Lab on Your Laptop at Zero Cost.
Come back when your lab is running.
Step 1: Install local Kubernetes. Pick one:
- kind - lightest, runs on any OS
- MicroK8s - closer to production behavior, best on Ubuntu
- K3s - good inside VMs
# kind - fastest start
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.22.0/kind-linux-amd64
chmod +x ./kind && sudo mv ./kind /usr/local/bin/kind
kind create cluster --name devops-lab
kubectl get nodes
# NAME STATUS ROLES AGE
# devops-lab-control-plane Ready control-plane 30s
Step 2: Deploy a real app. You need something with actual moving parts, an API, a database, and services talking to each other. That's where real errors come from. A single hello-world container won't teach you anything.
Clone this free repo and follow its setup steps to get it running in your cluster: → 🔗 DevOps Home-Lab 2026
It builds a full multi-service app (API, Postgres, Redis) from Docker Compose through to Kubernetes.
Follow the setup steps in the repo.
Don't move to Step 3 until your pods are healthy:
kubectl get pods -A
# Your app's pods should show STATUS: Running
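If you want the terminal to block until that's true, kubectl wait can do the polling for you (a sketch; raise the timeout on slower machines):

```shell
# Block until every pod in every namespace reports Ready (max 5 minutes)
kubectl wait --for=condition=Ready pods --all -A --timeout=300s
```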
Step 3: Wire up observability. Install Prometheus and Grafana now that your app is deployed, and there's something real to monitor.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace
# Access Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Login: admin / prom-operator
Open Grafana → "Kubernetes / Compute Resources / Pod" dashboard → confirm your app's pods are visible.
That's your baseline. When something breaks as you build, this is where the evidence shows up.
What Turns Errors Into Experience
Most people who build home labs end up doing the same thing: installing tools, following a tutorial, and deleting the cluster. No experience gained.
The ones who come out of it with real skills do something different every time something breaks. They follow this loop. Especially the last step.
- Spot it: something isn't working. A pod won't start, a deploy failed, or two services can't reach each other. This is the beginning of a lesson.
- Read it: before you Google anything, read what the system is telling you. Logs, events, metrics. The answer is almost always there. Your job is to learn how to see it.
- Fix it: solve it from what you observed.
- Document it: write the incident report immediately, while it's fresh.
The incident report template. Copy this. Use it every time something breaks.
INCIDENT: [error type] - [service name]
ROOT CAUSE: [one sentence - what actually caused it]
DETECTION: [which command or metric showed you the problem]
FIX: [exactly what you did to resolve it]
LESSON: [what you now know that you didn't before]
Every error you hit and document becomes an experience you can speak about. Not "I've read about CrashLoopBackOff." But: "I hit this at 11 pm, building my lab, here's what the logs showed, here's what fixed it."
That's what interviewers are actually asking for.
6 Errors You'll Hit as You Build, and How to Debug Each One
These are real errors from the DevOps Home-Lab 2026 repo - pulled from actual lab sessions, not invented for a tutorial. You will hit most of them.
Throughout this section, replace <your-deployment> and <pod-name> with your actual names. Run kubectl get pods -A to see what's running.
1. Database Connection Refused on Startup
Stack: Docker Compose.
What you'll learn: depends_on doesn't mean "wait until ready." It means "start after." Those are not the same thing.
How it happens:
You run docker compose up. Both containers start. But the backend throws an error immediately and dies:
Error: connect ECONNREFUSED 127.0.0.1:5432
What happened: your backend container started before Postgres finished its initialization sequence. Docker Compose started them roughly in parallel, depends_on only waits for the container to exist, not for the database inside it to be ready to accept connections.
How to read it:
# What did the backend actually see?
docker logs <your-backend-container>
# Did Postgres finish starting?
docker logs postgres
# Look for: "database system is ready to accept connections"
# Are both containers up?
docker ps
How to fix it:
# Quick fix: restart the backend after Postgres is ready
docker restart <your-backend-container>
# Permanent fix: add a healthcheck-based dependency in docker-compose.yml
depends_on:
  postgres:
    condition: service_healthy
Then add a healthcheck block to your Postgres service that runs pg_isready. Docker Compose will wait for the health check to pass before starting dependent containers.
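A minimal sketch of that healthcheck, assuming the default postgres user and image tag (adjust to your compose file):

```yaml
services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5
  backend:
    depends_on:
      postgres:
        condition: service_healthy  # waits for the healthcheck, not just the container
```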
Write your incident report:
INCIDENT: Backend container crashed on startup - ECONNREFUSED
ROOT CAUSE: Backend started before Postgres finished initializing.
depends_on controls start order, not service readiness.
DETECTION: docker logs backend showed ECONNREFUSED to port 5432.
docker logs postgres confirmed it hadn't finished booting yet.
FIX: Added healthcheck to postgres service. Set depends_on condition
to service_healthy. Backend now waits until Postgres is ready.
LESSON: Always use healthchecks for stateful services.
This same pattern applies in Kubernetes via readiness probes.
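The Kubernetes equivalent is a readiness probe on the Postgres pod, so nothing routes traffic to it until pg_isready succeeds (container name, image, and user are assumptions here):

```yaml
containers:
- name: postgres
  image: postgres:16
  readinessProbe:
    exec:
      command: ["pg_isready", "-U", "postgres"]
    initialDelaySeconds: 5
    periodSeconds: 5
```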
2. Pods Stuck in Pending: Nothing Is Scheduling
Stack: Kubernetes.
What you'll learn: Pending means the scheduler wants to place the pod, but can't. kubectl describe tells you exactly why.
How it happens:
You deploy your app to Kubernetes. The pods sit in Pending forever. No errors in the logs, because the container never started. This happens when your cluster doesn't have enough memory to satisfy the pod's resource requests. Running a full stack (Postgres, Redis, backend, frontend, Prometheus, Grafana) on a single-node cluster with Docker Desktop memory limited to 4GB will hit this fast.
kubectl get pods -A
# NAME READY STATUS RESTARTS
# backend-7d9f-xk4m 0/1 Pending 0
How to read it:
kubectl describe pod <pod-name> -n <namespace>
# Read the Events section at the bottom:
# "0/1 nodes are available: 1 Insufficient memory."
# What's the cluster actually using right now?
kubectl top nodes
# Do nodes exist at all?
kubectl get nodes -o wide
The Events block in kubectl describe is the most useful thing in Kubernetes for this error. It tells you exactly what the scheduler is thinking.
How to fix it:
# Option A: recreate the cluster with more capacity (k3d shown; kind is similar)
k3d cluster delete devops-lab
k3d cluster create devops-lab --agents 2
# Option B: lower resource requests in your deployment YAML
resources:
  requests:
    memory: "128Mi"  # reduce from whatever you had
    cpu: "50m"
Write your incident report:
INCIDENT: Pods stuck in Pending - backend and frontend
ROOT CAUSE: Cluster had insufficient memory to schedule pods.
Resource requests exceeded available node capacity.
DETECTION: kubectl describe pod showed "0/1 nodes available: Insufficient memory"
in Events. kubectl top nodes confirmed node was at capacity.
FIX: Reduced memory requests in deployment YAML. Pods scheduled immediately.
LESSON: Kubernetes won't silently lower your resource requests.
Pending forever = scheduler can't fit the pod. Describe it first.
3. ImagePullBackOff: The Cluster Can't Find Your Local Image
Stack: Kubernetes / k3d.
What you'll learn: Building an image locally and deploying it to a cluster are two separate steps. The cluster has its own image context.
How it happens:
You build your Docker image locally, write a deployment YAML that references it, and apply it to your k3d cluster. The pod fails immediately:
kubectl get pods -n <namespace>
# NAME READY STATUS RESTARTS
# backend-abc123 0/1 ImagePullBackOff 0
What happened: k3d runs inside Docker but keeps its own image store on each node. When you ran docker build -t my-backend:latest ., the image landed in Docker's local daemon; k3d nodes can't see it. The cluster tries to pull from Docker Hub, fails, and gives up.
How to read it:
kubectl describe pod <pod-name> -n <namespace>
# Events will say:
# "Failed to pull image: rpc error...
# repository does not exist or may require authentication"
# The image IS here:
docker images | grep my-backend
# But NOT inside the cluster node:
docker exec k3d-devops-lab-server-0 crictl images | grep my-backend
How to fix it:
# Import your local image into the k3d cluster
k3d image import my-backend:latest -c devops-lab
# Also set imagePullPolicy: Never in your deployment YAML
# so Kubernetes doesn't attempt to pull from Docker Hub
spec:
  containers:
  - name: backend
    image: my-backend:latest
    imagePullPolicy: Never
Every time you rebuild the image, you need to re-import it.
Write your incident report:
INCIDENT: ImagePullBackOff - backend deployment
ROOT CAUSE: Local Docker image not imported into k3d cluster.
k3d has its own image context - it can't see Docker daemon images.
DETECTION: kubectl describe pod showed "repository does not exist" in Events.
docker images confirmed the image existed locally; listing images inside
the k3d node confirmed it was absent from the cluster.
FIX: Ran k3d image import. Set imagePullPolicy: Never in deployment YAML.
LESSON: Build → import → deploy. Every rebuild needs a re-import.
Or set up a local registry to automate this.
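One way to set that registry up, assuming k3d v5 (the registry name and port here are arbitrary choices):

```shell
# Create a cluster with a built-in local registry
k3d cluster create devops-lab --registry-create lab-registry:0.0.0.0:5050

# On every rebuild, tag and push instead of re-importing
docker tag my-backend:latest localhost:5050/my-backend:latest
docker push localhost:5050/my-backend:latest
```

Then point your deployment's image at the registry-hosted tag; the exact hostname visible from inside the cluster depends on your k3d setup, so check k3d's registry docs.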
4. Ingress Returns 404: Service Selector Mismatch
Stack: Kubernetes / Ingress.
What you'll learn: 404 from an Ingress rarely means the Ingress is broken. It means the service behind it has no endpoints.
How it happens:
You set up an Ingress. You visit the URL. NGINX returns a clean 404. The Ingress looks fine. The service exists. But something in the chain is broken, usually a label mismatch between the service selector and the pod labels, or a wrong port number.
curl http://yourapp.local
# <html>404 Not Found</html>
How to read it:
# Check if the service actually has any endpoints
kubectl get endpoints -n <namespace>
# NAME ENDPOINTS AGE
# frontend <none> 5m ← this is the problem
# What labels are your pods actually using?
kubectl get pods -n <namespace> --show-labels
# What is the service selecting for?
kubectl describe svc <service-name> -n <namespace>
# Look at the Selector field
Zero endpoints means Kubernetes never connected the service to any pod. That's always a label mismatch or port mismatch, not an Ingress problem.
How to fix it:
The service selector must exactly match the pod labels:
# In your service:
selector:
  app: my-frontend   # this must match exactly

# In your deployment pod template:
labels:
  app: my-frontend   # must be identical - case, spelling, everything
Also verify targetPort in the service matches the containerPort in the deployment.
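The port chain is easy to get wrong too; a sketch of how the three fields line up (the numbers are illustrative):

```yaml
# Service
ports:
- port: 80          # what callers inside the cluster use
  targetPort: 3000  # must equal the containerPort below
---
# Deployment pod template
containers:
- name: frontend
  ports:
  - containerPort: 3000  # the port the app actually listens on
```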
Write your incident report:
INCIDENT: Ingress returning 404 - frontend unreachable
ROOT CAUSE: Service selector label didn't match pod labels.
Service had zero endpoints - never connected to any pod.
DETECTION: kubectl get endpoints showed <none> for frontend service.
kubectl get pods --show-labels revealed the label mismatch.
FIX: Updated service selector to match actual pod labels. Endpoints
populated immediately. Ingress routed correctly.
LESSON: Check endpoints first, not the Ingress YAML. Zero endpoints
= label selector or port mismatch, not an Ingress problem.
5. Grafana Shows "No Data": Metrics Not Reaching Prometheus
Stack: Observability / Prometheus / Grafana
What you'll learn: "No data" in Grafana means something in the chain between your app and Grafana is broken. Walk it backwards.
How it happens:
You open Grafana. Your dashboards show "No data" on every panel. You didn't change anything recently. This happens when Prometheus isn't scraping your app, either because the app never exposed a /metrics endpoint, or the ServiceMonitor is missing or misconfigured, or the label selectors don't match what Prometheus is watching for.
How to read it:
# Step 1: Is Prometheus even trying to scrape your app?
kubectl port-forward svc/prometheus -n monitoring 9090:9090
# Open: http://localhost:9090/targets
# Check the Status column - is your app listed? Is it UP or DOWN?
# Step 2: Does your app actually expose metrics?
kubectl port-forward svc/<your-backend> -n <namespace> 3001:3001
curl http://localhost:3001/metrics
# Should return Prometheus text format. If 404 - the endpoint isn't wired up.
# Step 3: Does the ServiceMonitor exist?
kubectl get servicemonitor -n monitoring
Walk the chain: Grafana → Prometheus data source → Prometheus targets → app /metrics endpoint. The failure is always at one of those four links.
How to fix it:
// Fix A: expose a /metrics endpoint in your backend (Node.js example)
const express = require('express');
const client = require('prom-client');

const app = express();
client.collectDefaultMetrics();

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3001);
# Fix B: ensure the ServiceMonitor selector matches your app's Service
spec:
  selector:
    matchLabels:
      app: my-backend   # must match the labels on your Service object
  namespaceSelector:
    matchNames: [your-namespace]
Write your incident report:
INCIDENT: Grafana showing "No data" - all panels blank
ROOT CAUSE: Backend had no /metrics endpoint. Prometheus had nothing to scrape.
DETECTION: Prometheus targets page showed app as absent. curl to /metrics
returned 404 - endpoint was never implemented.
FIX: Added prom-client middleware to Express app. Exposed /metrics route.
Prometheus began scraping within one scrape interval. Grafana populated.
LESSON: Grafana "No data" is not a Grafana problem. Walk the chain backwards.
Prometheus targets page tells you exactly where the chain breaks.
6. ERR_TOO_MANY_REDIRECTS: When Two Systems Both Try to Enforce HTTPS
Stack: Ingress / Cloudflare / ArgoCD
What you'll learn: Redirect loops happen when two layers both try to handle HTTPS. You fix it by deciding which layer owns TLS termination and disabling it everywhere else.
How it happens:
You expose your app through Cloudflare Tunnel. It works. Then you add ArgoCD or enable ssl-redirect on your NGINX Ingress. Suddenly, your browser returns ERR_TOO_MANY_REDIRECTS on every request.
What's happening: Cloudflare Tunnel terminates HTTPS at the edge and forwards plain HTTP into your cluster. NGINX Ingress sees HTTP traffic and immediately 301-redirects to HTTPS. Cloudflare sends that HTTPS back through the tunnel, where it gets redirected again. Infinite loop.
How to read it:
# Is ssl-redirect enabled on the Ingress?
kubectl get ingress -n <namespace> -o yaml | grep ssl
# Is ArgoCD running with HTTPS enforcement?
kubectl get deployment argocd-server -n argocd -o yaml | grep insecure
# Watch the redirect chain
curl -I http://yourapp.domain.com
# You'll see 301 → 301 → 301 repeating
How to fix it:
Rule: terminate TLS at ONE layer only. If Cloudflare handles TLS at the edge, everything inside the cluster runs plain HTTP.
# Fix A: disable ssl-redirect on the Ingress
annotations:
  nginx.ingress.kubernetes.io/ssl-redirect: "false"
  nginx.ingress.kubernetes.io/force-ssl-redirect: "false"
# Fix B: run ArgoCD in insecure mode (behind Cloudflare, it's fine)
# In argocd-server deployment args:
args: ["--insecure"]
Write your incident report:
INCIDENT: ERR_TOO_MANY_REDIRECTS - app unreachable after adding Cloudflare Tunnel
ROOT CAUSE: Cloudflare terminates HTTPS at edge, forwards HTTP to cluster.
NGINX Ingress had ssl-redirect enabled - redirected HTTP back to HTTPS.
Cloudflare re-sent HTTPS, NGINX redirected again. Infinite loop.
DETECTION: curl -I showed 301 → 301 chain. kubectl get ingress yaml showed
ssl-redirect: true. Cloudflare logs showed HTTP being forwarded.
FIX: Set ssl-redirect: false on Ingress. Cloudflare owns TLS. Cluster
runs HTTP internally.
LESSON: Two systems enforcing HTTPS termination = redirect war.
Decide one layer owns TLS. Disable it everywhere else.
Keep Going: Free Labs to Build On
The six errors above are the ones you'll hit most often when you're starting. But there's a lot more to encounter. These free resources give you more environments to build in, and more things to break, debug, and learn from.
The full structured path from foundations to cloud: 🔗 List of DevOps Projects - five phases, all free. Pick a specific problem or follow the whole thing.
Quick Reference: When Something Breaks, Start Here
- Backend crashes on startup: docker logs → look for ECONNREFUSED. A dependency isn't ready yet.
- Pod stuck in Pending: kubectl describe pod → look for Events: "Insufficient memory" or "Insufficient CPU".
- ImagePullBackOff: kubectl describe pod → look for Events: "repository does not exist". Image not imported into the cluster.
- Ingress returning 404: kubectl get endpoints -n <namespace> → <none> = label selector mismatch.
- Grafana showing "No data": curl http://localhost:<port>/metrics → 404 = app never exposed a /metrics endpoint.
- ERR_TOO_MANY_REDIRECTS: kubectl get ingress -o yaml | grep ssl → ssl-redirect: true + Cloudflare = redirect loop.
What You Have After All Six
Most people will read this and move on, but please, build something in your lab. Hit one of these errors. Come back to the right section, read what the system is telling you, fix it, and write the incident report.
Do that enough times, and the next time someone asks you to walk through a debugging incident, you won't be narrating something you read. You'll be recalling something you fixed.
That's the difference.
The six errors above are a starting point. If you want a structured system for working through failures you haven't seen before (the STOP framework, intentional break scenarios, production checklists), that's what The Kubernetes Detective is built around.
For building the full lab environment from scratch: Build Your Own DevOps Lab (V3.5)
Let's connect on LinkedIn
Every week, I share what I learned in my Newsletter. Case studies from real companies, the tactics that saved money, and the honest moments where everything broke.
Subscribe if that sounds useful.
Job hunting? Grab my Free DevOps resume template that's helped 300+ people land interviews.
