Postmortem: K3s 1.28 Network Issue Caused 20 Edge Nodes to Go Offline
Date: 2024-03-15 | Incident Duration: 47 minutes | Severity: Critical
Executive Summary
On March 12, 2024, 20 edge-deployed K3s nodes running K3s 1.28.0 lost all cluster and external network connectivity following a scheduled upgrade from K3s 1.27.4. The incident lasted 47 minutes, disrupting IoT data ingestion pipelines, local caching services, and edge-based API gateways serving 12,000+ end users. Root cause was identified as a regression in the K3s-embedded Flannel CNI’s VXLAN MTU calculation logic introduced in K3s 1.28.0, which failed to detect custom host interface MTUs common in edge network environments.
Incident Timeline (UTC)
- 08:00 – Scheduled upgrade of 20 edge K3s nodes from 1.27.4 to 1.28.0 begins via the k3s-upgrade controller.
- 08:12 – First node reports `NotReady` status in the management cluster; kubelet logs show failed connections to the cluster apiserver.
- 08:15 – All 20 edge nodes transition to `NotReady`; all edge-hosted workloads become unreachable.
- 08:20 – On-call site reliability engineer (SRE) receives PagerDuty alert for cluster health degradation.
- 08:25 – Initial triage confirms nodes are reachable via out-of-band (OOB) serial consoles, but pod-to-pod, pod-to-apiserver, and pod-to-internet traffic fails.
- 08:30 – SRE identifies that the Flannel-managed `flannel.1` VXLAN interface is configured with an MTU of 1500, while the host's public interface (`enp1s0`) has an MTU of 1450.
- 08:35 – Manual MTU correction applied to a single test node: Flannel config updated to force MTU 1400 (1450 host MTU minus 50 bytes VXLAN overhead). Node connectivity restored immediately.
- 08:40 – Automated remediation DaemonSet deployed to patch Flannel configuration across all 20 nodes.
- 08:47 – All edge nodes return to `Ready` status; all workloads resume normal operation.
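The triage check at 08:30 can be sketched as follows. This is an illustrative reconstruction, not the exact commands run; the hardcoded MTU values stand in for what `ip -o link show <iface>` would report on a live node:

```shell
#!/bin/sh
# Triage sketch: does an encapsulated VXLAN packet still fit through the
# host's public interface? (Values hardcoded for illustration; on a node
# they would be read from `ip -o link show enp1s0` / `flannel.1`.)
host_mtu=1450     # enp1s0, as observed during the incident
flannel_mtu=1500  # flannel.1, as misconfigured by the regression
vxlan_overhead=50 # outer IP + UDP + VXLAN headers added by encapsulation

# The inner packet plus the VXLAN overhead must not exceed the host MTU;
# here 1500 + 50 = 1550 > 1450, so every full-size packet fragments.
if [ $((flannel_mtu + vxlan_overhead)) -gt "$host_mtu" ]; then
  echo "MTU mismatch: flannel.1 ($flannel_mtu) + $vxlan_overhead > enp1s0 ($host_mtu)"
fi
```

With the edge firewall dropping fragments, this single arithmetic mismatch was enough to black-hole all pod and kubelet traffic.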
Root Cause Analysis
K3s 1.28.0 shipped with an updated Flannel CNI version (0.22.1, up from 0.21.3 in 1.27.4) that included a revised MTU auto-detection algorithm. The new logic only honored the detected default-route interface when it was named `eth0`, a hardcoded assumption that did not account for the non-standard interface naming common on edge bare-metal and cellular-connected nodes.
Our edge nodes use enp1s0 as the public-facing interface, with an MTU of 1450 due to ISP-level VLAN tagging that adds 50 bytes of overhead. Because Flannel 0.22.1 did not detect enp1s0 as the default route interface, it fell back to the default VXLAN MTU of 1500. This resulted in encapsulated VXLAN packets exceeding the host interface’s 1450 MTU, causing fragmentation. Our edge firewall’s security policy dropped all fragmented packets, leading to total network connectivity loss for the node’s pods and kubelet.
Further validation confirmed the issue was reproducible only on nodes with non-eth0 default interfaces and MTUs lower than 1500, a configuration present on 100% of our edge fleet.
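The behavioral difference can be sketched as a small simulation. This is an illustrative reconstruction of the logic described above, not Flannel's actual source; the interface name and MTU are hardcoded where a node would detect them from its routing table:

```shell
#!/bin/sh
# Illustrative simulation of the regression: the 1.28.0 auto-detection only
# derived the VXLAN MTU from the default-route interface when that interface
# was named "eth0"; otherwise it fell back to the 1500 default.
default_iface="enp1s0"  # on a node: detected from the default route
host_mtu=1450           # enp1s0 MTU (ISP VLAN tagging consumes 50 bytes)
vxlan_overhead=50       # VXLAN encapsulation headers

# 1.28.0 (buggy): non-eth0 interface -> fall back to the 1500 default
if [ "$default_iface" = "eth0" ]; then
  buggy_mtu=$((host_mtu - vxlan_overhead))
else
  buggy_mtu=1500
fi

# 1.28.2 (fixed): always derive from the detected interface's MTU
fixed_mtu=$((host_mtu - vxlan_overhead))

echo "1.28.0 VXLAN MTU: $buggy_mtu"  # 1500 -> encapsulated packets fragment
echo "1.28.2 VXLAN MTU: $fixed_mtu"  # 1400 -> fits within the 1450 host MTU
```

Because staging nodes used `eth0` with a 1500 MTU, both branches produced a working configuration there, which is why the regression only surfaced in production.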
Remediation Steps
Immediate Mitigation (08:30 – 08:47 UTC)
- Manually patched the `kube-flannel` ConfigMap to add the `--iface-mtu=1400` flag to the flanneld container arguments.
- Deleted all Flannel pods to trigger a restart with the updated configuration; validated connectivity on a single node before rolling out to the fleet.
- Deployed a temporary DaemonSet to automate ConfigMap patching and Flannel pod restarts across all 20 nodes, reducing manual toil.
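The per-node step the remediation DaemonSet automated can be sketched as below. This is a hedged illustration, not the actual manifest or script; on a live node the `ip link` call would require `NET_ADMIN` and host networking:

```shell
#!/bin/sh
# Sketch of the per-node remediation (illustrative): force the Flannel VXLAN
# interface down to the safe MTU while the ConfigMap patch propagates.
iface="flannel.1"
target_mtu=1400   # 1450 host MTU minus 50 bytes VXLAN overhead

# On a live node (privileged, host network):
#   ip link set "$iface" mtu "$target_mtu"
echo "patched $iface to mtu $target_mtu"
```

Automating this as a DaemonSet meant the fix converged on all 20 nodes within minutes instead of requiring 20 serial-console sessions.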
Long-Term Fix
- Upgraded to K3s 1.28.2, which reverted the Flannel MTU auto-detection logic to the 1.27 behavior and added support for custom interface naming in MTU checks.
- Pinned Flannel version to 0.21.3 in our K3s upgrade pipeline until validated versions are confirmed safe for edge environments.
Lessons Learned
- Validate CNI changes in edge-specific staging environments: Our staging cluster used `eth0` interfaces with a 1500 MTU, so the Flannel regression was not caught before production rollout. Future staging environments will mirror production edge network configurations.
- Add MTU mismatch alerting: We now alert whenever the Flannel VXLAN interface MTU plus the 50-byte VXLAN overhead exceeds the host interface MTU, the exact condition that caused fragmentation in this incident.
- Maintain out-of-band access for edge nodes: OOB serial consoles were critical for triage, as all in-band network access was lost. We are expanding OOB coverage to 100% of our edge fleet.
- Pre-upgrade CNI validation: Our K3s upgrade pipeline now includes a pre-flight check that validates Flannel MTU configuration against host interface settings for all nodes.
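The pre-flight gate can be sketched as a single check with a non-zero exit code blocking the upgrade. Illustrative values are hardcoded; in the pipeline both MTUs would be read per node via `ip -o link show`:

```shell
#!/bin/sh
# Sketch of the pre-flight MTU gate (illustrative values): fail the upgrade
# if encapsulated VXLAN packets would exceed the host interface MTU.
host_mtu=1450     # read from the node's public interface in the pipeline
flannel_mtu=1400  # read from flannel.1 (or the pending Flannel config)
vxlan_overhead=50 # VXLAN encapsulation headers

if [ $((flannel_mtu + vxlan_overhead)) -le "$host_mtu" ]; then
  echo "OK: VXLAN packets fit the host interface"
else
  echo "FAIL: VXLAN packets will fragment; blocking upgrade" >&2
  exit 1
fi
```

Running this against every node before the rolling upgrade would have failed fast on the first 1.28.0 node instead of taking the whole fleet offline.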
Conclusion
This incident highlighted the unique challenges of running Kubernetes at the edge, where non-standard network configurations and limited in-band access can amplify the impact of minor CNI regressions. By improving pre-upgrade validation, expanding edge-specific testing, and adding targeted alerting, we have reduced the risk of similar incidents in future K3s upgrades.