What You'll Learn
- Why manually applied Kubernetes resources can disappear after pod evictions
- How NGINX Ingress admission webhooks validate resources
- The difference between controller-managed and manually applied resources
- Why Helm-managed resources survive node disruptions
The Problem That Started This Journey
It was a regular Monday morning until the alerts fired: Grafana was unreachable.
When GKE performed automatic node upgrades, our monitoring dashboard disappeared. The investigation that followed revealed a fascinating chain of dependencies—and ultimately led to understanding the elegant hierarchical supervision model that keeps Kubernetes running.
But first, let's solve the immediate problem.
The Incident: Why Ingress Disappeared
What Happened
The sequence of events:
- GKE automatically upgraded nodes (routine security patches)
- Nodes were drained, causing pod evictions
- NGINX Ingress Controller pod was evicted and restarted on a new node
- Grafana ingress resource disappeared
- Service became inaccessible
The puzzling part: why would an Ingress resource disappear when only pods were evicted? Ingress is a Kubernetes object stored in etcd—it shouldn't just vanish.
The Investigation
# Check if the ingress exists
kubectl get ingress -n monitoring
# No resources found
# Check the NGINX controller logs
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller | grep -i error
The logs revealed admission webhook failures during the controller restart.
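If you want to confirm the webhook is in play, the ValidatingWebhookConfiguration that ingress-nginx registers can be inspected directly. The object name below is the default from the upstream install and may differ in your cluster:
# List validating webhooks registered in the cluster
kubectl get validatingwebhookconfigurations
# Inspect the ingress-nginx webhook (default name; adjust if your install differs)
kubectl describe validatingwebhookconfiguration ingress-nginx-admission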
Root Cause Discovery
The ingress disappeared because of a perfect storm of issues:
The chain of failures:
TLS Secret was missing — It was manually copied to the cluster months ago, not managed by any controller. When the namespace was recreated during troubleshooting, the secret didn't come back.
NGINX Admission Webhook — The NGINX Ingress Controller includes a validating webhook that checks ingress resources on creation and updates.
Validation Failed — Without the TLS secret referenced in the ingress spec, the webhook rejected the ingress as invalid.
No Reconciliation — The ingress was created via kubectl apply (not Helm or an operator), so nothing knew to recreate it.
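Each link in that chain can be checked from the CLI. A quick sanity check, assuming the resource names used in this post:
# Is the TLS secret the ingress references actually present?
kubectl get secret grafana-tls -n monitoring
# Which secret name does the ingress expect? (only useful while the ingress still exists)
kubectl get ingress grafana -n monitoring -o jsonpath='{.spec.tls[*].secretName}'
# Recent namespace events often capture the webhook rejection
kubectl get events -n monitoring --sort-by=.lastTimestamp | tail -20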
The "Aha" Moment
The real issue wasn't the node upgrade—it was our resource management approach:
# Our original ingress (manually applied)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  # No owner reference
  # No Helm labels
  # No operator management
spec:
  tls:
    - hosts:
        - grafana.prod.example.com
      secretName: grafana-tls  # This secret was also manually created!
  rules:
    - host: grafana.prod.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 80
When this ingress needed to be recreated, nothing knew it should exist.
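Had we run a check like this before the incident, the gap would have been obvious. The two standard signals are ownerReferences (for controller-managed objects) and the labels and annotations Helm 3 stamps on everything it manages:
# Empty output means no parent controller will recreate this object
kubectl get ingress grafana -n monitoring -o jsonpath='{.metadata.ownerReferences}'
# Helm 3 marks managed resources with this label and annotation
kubectl get ingress grafana -n monitoring \
  -o jsonpath='{.metadata.labels.app\.kubernetes\.io/managed-by}'
kubectl get ingress grafana -n monitoring \
  -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}'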
The Solution: Helm-Managed Resources
We solved this by migrating to Helm charts with native ingress support:
# Before: manually applied resources scattered across yaml files
kubectl apply -f grafana-ingress.yaml
kubectl apply -f grafana-tls-secret.yaml
# After: Helm manages everything as a single release
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.ingress.enabled=true \
  --set grafana.ingress.hosts[0]=grafana.prod.example.com \
  --set grafana.ingress.tls[0].secretName=grafana-tls \
  --set grafana.ingress.tls[0].hosts[0]=grafana.prod.example.com
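The same settings can live in a values file instead of a string of --set flags, which is easier to review and keep in Git. A sketch, assuming the same chart and value keys as the command above:
# Write the ingress settings to a values file (same keys as the --set flags above)
cat > monitoring-values.yaml <<'EOF'
grafana:
  ingress:
    enabled: true
    hosts:
      - grafana.prod.example.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.prod.example.com
EOF
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f monitoring-values.yaml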
Why This Works
Helm stores release state in Kubernetes secrets:
kubectl get secrets -n monitoring -l owner=helm
# NAME                               TYPE                 DATA
# sh.helm.release.v1.monitoring.v1   helm.sh/release.v1   1
This means:
- ✅ Helm knows what resources should exist
- ✅ helm upgrade recreates missing resources
- ✅ Resources are versioned and can be rolled back
- ✅ Dependencies (like TLS secrets) are managed together
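Because the release secret stores the rendered manifests, you can diff what Helm thinks should exist against the live cluster, and a re-run of the upgrade puts back anything that is missing. A rough check, assuming Helm 3 and the release name above:
# Compare the stored release manifest with live cluster state
helm get manifest monitoring -n monitoring | kubectl diff -f -
# Re-running the upgrade recreates any resources that have gone missing
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --reuse-values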
For the TLS Secret
We also moved TLS management to cert-manager:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: grafana-tls
  namespace: monitoring
spec:
  secretName: grafana-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - grafana.prod.example.com
Now cert-manager (an operator) ensures the TLS secret always exists and stays renewed.
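Once the Certificate is applied, cert-manager's progress is easy to watch (these commands assume the cert-manager CRDs are installed):
# READY flips to True once issuance succeeds
kubectl get certificate grafana-tls -n monitoring
# Events here show why issuance is stuck (DNS, HTTP-01 challenges, rate limits)
kubectl describe certificate grafana-tls -n monitoring
# The secret cert-manager creates and keeps renewed for the ingress
kubectl get secret grafana-tls -n monitoring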
Key Takeaways
What Survives Pod Evictions
| Resource Type | Survives? | Why |
|---|---|---|
| Helm-managed resources | ✅ | State stored in release secrets |
| Operator-managed CRs | ✅ | Operator reconciles continuously |
| Resources with owner references | ✅ | Parent controller recreates them |
| Manually kubectl apply'd resources | ⚠️ | Survives in etcd, but won't be recreated if deleted |
| Resources referencing missing dependencies | ❌ | Validation webhooks may reject them |
Best Practices
- Never manually apply production resources — Use Helm, Kustomize, or GitOps tools
- Manage secrets with operators — External Secrets, cert-manager, Sealed Secrets
- Understand admission webhooks — They validate resources on every create/update
- Test node disruptions — Use kubectl drain in staging regularly
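A minimal staging drill, with an illustrative node name; note that --delete-emptydir-data wipes emptyDir volumes on the drained node:
# Simulate the eviction GKE performs during node upgrades
kubectl get nodes
kubectl drain <staging-node> --ignore-daemonsets --delete-emptydir-data
# Check that ingress resources (and anything else you depend on) come back, then restore the node
kubectl get ingress -A
kubectl uncordon <staging-node>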
The Deeper Question
This incident was resolved, but it raised a fundamental question:
How do controllers like Helm, NGINX Ingress, and cert-manager survive pod evictions themselves? What ensures THEY come back?
The answer involves a beautiful hierarchical supervision model that goes all the way down to Linux PID 1.
In Part 2, we'll explore the complete Kubernetes persistence chain—from Linux systemd to application controllers—and understand why Kubernetes is designed to assume failure is normal.
Have you experienced similar "ghost" resources disappearing in Kubernetes? Share your war stories in the comments!
Next in this series: Part 2: The Foundation — From systemd to Control Plane