Vincent Du

Kubernetes Persistence Series Part 1: When Our Ingress Vanished After a Node Upgrade

What You'll Learn

  • Why manually-applied Kubernetes resources can disappear after pod evictions
  • How NGINX Ingress admission webhooks validate resources
  • The difference between controller-managed and manually-applied resources
  • Why Helm-managed resources survive node disruptions

The Problem That Started This Journey

It was a regular Monday morning until the alerts fired: Grafana was unreachable.

When GKE performed automatic node upgrades, our monitoring dashboard disappeared. The investigation that followed revealed a fascinating chain of dependencies—and ultimately led to understanding the elegant hierarchical supervision model that keeps Kubernetes running.

But first, let's solve the immediate problem.

The Incident: Why Ingress Disappeared

What Happened

The sequence of events:

  1. GKE automatically upgraded nodes (routine security patches)
  2. Nodes were drained, causing pod evictions
  3. NGINX Ingress Controller pod was evicted and restarted on a new node
  4. Grafana ingress resource disappeared
  5. Service became inaccessible

The puzzling part: why would an Ingress resource disappear when only pods were evicted? Ingress is a Kubernetes object stored in etcd—it shouldn't just vanish.

The Investigation

# Check if the ingress exists
kubectl get ingress -n monitoring
# No resources found

# Check the NGINX controller logs
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller | grep -i error

The logs revealed admission webhook failures during the controller restart.
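
The webhook doing the rejecting is registered cluster-wide as a ValidatingWebhookConfiguration, so it can be inspected directly. A quick sanity check (the exact object name depends on how the controller was installed; ingress-nginx-admission is a common default and is assumed here):

# List validating webhooks and find the one registered by ingress-nginx
kubectl get validatingwebhookconfigurations

# Inspect its rules and failure policy (name assumed from a default install)
kubectl describe validatingwebhookconfiguration ingress-nginx-admission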

Root Cause Discovery

The ingress disappeared because of a perfect storm of issues:
[Image: root cause flowchart]

The chain of failures:

  1. TLS Secret was missing — It was manually copied to the cluster months ago, not managed by any controller. When the namespace was recreated during troubleshooting, the secret didn't come back.

  2. NGINX Admission Webhook — The NGINX Ingress Controller includes a validating webhook that checks ingress resources on creation and updates.

  3. Validation Failed — Without the TLS secret referenced in the ingress spec, the webhook rejected the ingress as invalid.

  4. No Reconciliation — The ingress was created via kubectl apply (not Helm or an operator), so nothing knew to recreate it.
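
A quick way to tell whether anything would ever recreate a resource is to look at its metadata. The sketch below, using the names from this incident, checks for owner references and Helm's managed-by label; empty output on both means the resource is an orphan that exists only because someone once applied it:

# Does anything own this ingress? (empty output = no parent controller)
kubectl get ingress grafana -n monitoring -o jsonpath='{.metadata.ownerReferences}'

# Is it part of a Helm release? (look for app.kubernetes.io/managed-by=Helm)
kubectl get ingress grafana -n monitoring --show-labels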

The "Aha" Moment

The real issue wasn't the node upgrade—it was our resource management approach:

# Our original ingress (manually applied)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  # No owner reference
  # No Helm labels
  # No operator management
spec:
  tls:
  - hosts:
    - grafana.prod.example.com
    secretName: grafana-tls  # This secret was also manually created!
  rules:
  - host: grafana.prod.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grafana
            port:
              number: 80

When this ingress needed to be recreated, nothing knew it should exist.

The Solution: Helm-Managed Resources

We solved this by migrating to Helm charts with native ingress support:

# Before: manually applied resources scattered across yaml files
kubectl apply -f grafana-ingress.yaml
kubectl apply -f grafana-tls-secret.yaml

# After: Helm manages everything as a single release
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
    --namespace monitoring \
    --set grafana.ingress.enabled=true \
    --set grafana.ingress.hosts[0]=grafana.prod.example.com \
    --set grafana.ingress.tls[0].secretName=grafana-tls \
    --set grafana.ingress.tls[0].hosts[0]=grafana.prod.example.com

Why This Works

Helm stores release state in Kubernetes secrets:

kubectl get secrets -n monitoring -l owner=helm
# NAME                                    TYPE                 DATA
# sh.helm.release.v1.monitoring.v1       helm.sh/release.v1   1

This means:

  • ✅ Helm knows what resources should exist
  • ✅ helm upgrade recreates missing resources
  • ✅ Resources are versioned and can be rolled back
  • ✅ Dependencies (like TLS secrets) are managed together
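
You can verify this yourself: Helm can print the full desired manifest for a release, and re-running the upgrade restores anything that has gone missing. A minimal sketch, assuming the release and namespace names used above (the generated ingress name depends on the chart):

# Show every resource Helm believes should exist for this release
helm get manifest monitoring -n monitoring

# Simulate the break: delete the ingress, then let Helm put it back
kubectl delete ingress monitoring-grafana -n monitoring   # ingress name is chart-dependent
helm upgrade monitoring prometheus-community/kube-prometheus-stack \
    --namespace monitoring --reuse-values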

For the TLS Secret

We also moved TLS management to cert-manager:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: grafana-tls
  namespace: monitoring
spec:
  secretName: grafana-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - grafana.prod.example.com

Now cert-manager (an operator) ensures the TLS secret always exists and renews the certificate before it expires.
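
Checking that the operator is doing its job takes a few commands; cert-manager reports a Ready condition on the Certificate and maintains the backing Secret. The commands below assume the names from the manifest above:

# Certificate should show READY=True once issuance succeeds
kubectl get certificate grafana-tls -n monitoring

# Events on the Certificate explain any issuance or renewal problems
kubectl describe certificate grafana-tls -n monitoring

# The Secret cert-manager maintains on our behalf
kubectl get secret grafana-tls -n monitoring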

Key Takeaways

What Survives Pod Evictions

| Resource Type | Survives? | Why |
| --- | --- | --- |
| Helm-managed resources | ✅ | State stored in release secrets |
| Operator-managed CRs | ✅ | Operator reconciles continuously |
| Resources with owner references | ✅ | Parent controller recreates them |
| Manually kubectl apply'd resources | ⚠️ | Survives in etcd, but won't be recreated if deleted |
| Resources referencing missing dependencies | ❌ | Validation webhooks may reject them on recreation |

Best Practices

  1. Never manually apply production resources — Use Helm, Kustomize, or GitOps tools
  2. Manage secrets with operators — External Secrets, cert-manager, Sealed Secrets
  3. Understand admission webhooks — They validate resources on every create/update
  4. Test node disruptions — Use kubectl drain in staging regularly
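
Practicing point 4 doesn't require waiting for a real upgrade. A rough staging drill, assuming the cluster has more than one node so evicted workloads have somewhere to go:

# Pick a node running the ingress controller or Grafana
kubectl get pods -n ingress-nginx -o wide

# Evict everything from it, the same way a node upgrade would
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Confirm the ingress, certificate, and TLS secret are all still present
kubectl get ingress,certificate,secret -n monitoring

# Put the node back into service
kubectl uncordon <node-name>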

The Deeper Question

This incident was resolved, but it raised a fundamental question:

How do controllers like Helm, NGINX Ingress, and cert-manager survive pod evictions themselves? What ensures THEY come back?

The answer involves a beautiful hierarchical supervision model that goes all the way down to Linux PID 1.

In Part 2, we'll explore the complete Kubernetes persistence chain—from Linux systemd to application controllers—and understand why Kubernetes is designed to assume failure is normal.


Have you experienced similar "ghost" resources disappearing in Kubernetes? Share your war stories in the comments!


Next in this series: Part 2: The Foundation — From systemd to Control Plane
