My K3s Pi Cluster Died After a Reboot: A Troubleshooting War Story

The "Oh No" Moment

I have a Raspberry Pi homelab running k3s, all managed perfectly with FluxCD and SOPS for secrets. It had been stable for weeks.

Then, I had to reboot my router.

When it came back up, my Pi was assigned a new IP address (it went from 192.168.1.9 to 192.168.1.10). Suddenly, my entire cluster was gone.

Running kubectl get nodes from my laptop gave me the dreaded:

The connection to the server 192.168.1.9:6443 was refused...

"Ah," I thought. "Easy fix." I updated my ~/.kube/config to point to the new IP, 192.168.1.10.
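If you'd rather not edit the file by hand, kubectl can rewrite the server field for you. A minimal sketch, assuming your kubeconfig's cluster entry is named default (which is what a stock k3s-generated kubeconfig uses):

# Point the existing cluster entry at the Pi's new address
kubectl config set-cluster default --server=https://192.168.1.10:6443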

I ran kubectl get nodes again... and got the same error.

The connection to the server 192.168.1.10:6443 was refused...

This meant the k3s service itself wasn't running on the Pi. This post is the story of the troubleshooting journey that followed, and the three major "fixes" it took to get it all back online.


Part 1: The Crash Loop (On the Pi)

I SSH'd into the Pi to see what was wrong.

ssh shankarpi@192.168.1.10

First, I checked the service status. This is the #1 thing to do.

sudo systemctl status k3s.service

The service was in a permanent crash loop:

● k3s.service - Lightweight Kubernetes
     Active: activating (auto-restart) (Result: exit-code) ...

This means k3s is starting, failing, and systemd is trying to restart it over and over. Time to check the logs.

sudo journalctl -u k3s.service -f

And there it was. The first smoking gun:

level=fatal msg="Failed to start networking: unable to initialize network policy controller: error getting node subnet: failed to find interface with specified node ip"

This was a double-whammy:

K3s was starting before the wlan0 (Wi-Fi) interface had time to connect and get its 192.168.1.10 IP. This is a classic race condition on reboot.

K3s was still configured internally to use the old IP.
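A quick way to check both halves of that theory (a rough sanity check, assuming wlan0 is the interface k3s should be using) is to compare what the interface has right now with what k3s saw when it first tried to start on this boot:

# What address does wlan0 actually have right now?
ip -4 addr show wlan0

# What did k3s log the first time it tried to start on this boot?
sudo journalctl -b -u k3s.service --no-pager | head -n 20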


Part 2: The Service File Fixes

The fix was to edit the systemd service file to (1) wait for the network and (2) force k3s to use the new IP for everything.

# On the Pi
sudo nano /etc/systemd/system/k3s.service

I made four critical changes to the [Service] section:

Added ExecStartPre: This line forces systemd to wait until the wlan0 interface actually has the IP address 192.168.1.10 before trying to start k3s.

Added --node-ip: Tells k3s what IP to use internally.

Added --node-external-ip: Tells k3s what IP to advertise externally (this was the fix for the IP conflict).

Added --flannel-iface: Tells the flannel CNI which network interface to use.

The relevant part of the [Service] section now looked like this:

[Service]
...
Restart=always
RestartSec=5s

# FIX 1: Wait for the wlan0 interface to have the correct IP
ExecStartPre=/bin/sh -c 'while ! ip addr show wlan0 | grep -q "inet 192.168.1.10"; do sleep 1; done'

# FIX 2: Hard-code the new IP and interface for k3s
ExecStart=/usr/local/bin/k3s \
    server \
    --node-ip=192.168.1.10 \
    --node-external-ip=192.168.1.10 \
    --flannel-iface=wlan0
...

I reloaded systemd and restarted the service, full of confidence.

sudo systemctl daemon-reload
sudo systemctl restart k3s.service

...and it still went into a crash loop.


Part 3: The "Aha!" Moment (The Corrupted Database)

I was stumped. The service file was perfect. The IP was correct. The Pi was waiting for the network. Why was it still crashing?

I watched the logs again (sudo journalctl -u k3s.service -f) and saw something I'd missed.

The service would start, run for about 15 seconds... and then crash. In that 15-second window, I saw this:

I1030 20:37:56 ... "Successfully retrieved node IP(s)" IPs=["192.168.1.9"]

It was still finding the old IP!

This was the "Aha!" moment. The flags in k3s.service were correct, but k3s was loading its old database, which was still full of references to the old .9 address (things like the Traefik load balancer's IP). It was seeing a conflict between its new config (.10) and its old database (.9), and crashing.

The database was corrupted with stale data.
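If you want to convince yourself of that before deleting anything, the old address really is sitting in the embedded datastore. A read-only sanity check, assuming the default embedded SQLite (kine) datastore that k3s keeps at /var/lib/rancher/k3s/server/db/state.db:

# Check whether (and roughly how often) the old IP still appears in the datastore file
sudo grep -ac "192.168.1.9" /var/lib/rancher/k3s/server/db/state.db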

The Real Fix: Nuke the database and let k3s rebuild it from scratch using the new, correct config.

# On the Pi

# 1. Stop k3s
sudo systemctl stop k3s.service

# 2. Delete the old, corrupted database
sudo rm -rf /var/lib/rancher/k3s/server/db/

# 3. Start k3s
sudo systemctl start k3s.service

I checked the status one more time, and...

● k3s.service - Lightweight Kubernetes
     Active: active (running) since Thu 2025-10-30 20:55:03 IST; 1min 9s ago

It was stable. It worked. The cluster was back.
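It's also worth confirming, from the Pi itself, that the node registered with the new address. k3s bundles its own kubectl, and -o wide includes the node's internal IP:

# The INTERNAL-IP column should now show 192.168.1.10
sudo k3s kubectl get nodes -o wide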


Part 4: The GitOps Restoration (Flux is Gone!)

I went back to my Mac. kubectl get nodes worked!

NAME        STATUS   ROLES                  AGE   VERSION
shankarpi   Ready    control-plane,master   1m    v1.33.5+k3s1

But when I ran flux get kustomizations, I got a new error:

✗ unable to retrieve the complete list of server APIs: kustomize.toolkit.fluxcd.io/v1: no matches for ...

Of course. When I deleted the database, I deleted everything—including the FluxCD installation and all its API definitions (CRDs).

The cluster was healthy, but empty.

Luckily, with a GitOps setup, this is the easiest fix in the world. I just had to re-bootstrap Flux.

# On my Mac

# 1. Set my GitHub Token
export GITHUB_TOKEN="ghp_..."

# 2. Re-run the bootstrap command
flux bootstrap github \
  --owner=tiwari91 \
  --repository=pi-cluster \
  --branch=main \
  --path=./clusters/staging \
  --personal

This re-installed Flux, and it immediately started trying to deploy my apps.
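If you want to verify the bootstrap before moving on, flux check confirms the controllers and CRDs are back in the cluster:

# Verify the Flux controllers and their CRDs are installed and healthy
flux check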


Part 5: The Final "Gotcha" (The Missing SOPS Secret)

I was so close. I ran flux get kustomizations one last time. This is what I saw:

NAME                READY   MESSAGE
apps                False   decryption failed for 'tunnel-credentials': ...
flux-system         True    Applied revision: main@sha1:784af83f
infra...            False   decryption failed for 'renovate-container-env': ...
monitoring-configs  False   decryption failed for 'grafana-tls-secret': ...

My flux-system was running, but all my other apps were failing with decryption failed. Why?

When I reset the cluster, I also deleted the sops-age secret that Flux uses to decrypt my files.

The solution was to put that secret back.

On my Mac, I deleted the (possibly stale) secret just in case.

kubectl delete secret sops-age -n flux-system

I re-created the secret from my local private key file. (Mine was named age.agekey)

cat age.agekey | kubectl create secret generic sops-age \
  --namespace=flux-system \
  --from-file=age.agekey=/dev/stdin

I told Flux to try one last time.

flux reconcile kustomization apps --with-source

Success! Flux found the key, decrypted the manifests, and all my namespaces and pods (linkding, audiobookshelf, monitoring) started spinning up.
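For a final sanity check that everything actually reconciled:

# All kustomizations should now report Ready=True
flux get kustomizations

# And the application pods should be scheduling
kubectl get pods -A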

TL;DR: The 3-Step Fix for a Dead k3s Pi
If your k3s Pi cluster dies after an IP change:

Fix k3s.service: SSH into the Pi. Edit /etc/systemd/system/k3s.service to add the ExecStartPre line to wait for your network and add the --node-ip, --node-external-ip, and --flannel-iface flags with your new static IP.

Reset the Database: The old IP is still in the database. Stop k3s, delete the DB, and restart:

sudo systemctl stop k3s.service
sudo rm -rf /var/lib/rancher/k3s/server/db/
sudo systemctl daemon-reload
sudo systemctl start k3s.service

Restore GitOps: Your cluster is now empty.

Run flux bootstrap ... again to re-install Flux.

Re-create your sops-age secret: cat age.agekey | kubectl create secret generic sops-age -n flux-system ...

Force a reconcile: flux reconcile kustomization apps --with-source

And just like that, my cluster was back from the dead.
