Background
While deploying applications on Azure Kubernetes Service (AKS) behind Azure Application Gateway Ingress Controller (AGIC), I repeatedly faced brief but noticeable downtime during deployments.
The issue was not with Kubernetes itself, but with how AGIC updates backend pool IPs when pods are replaced during a deployment.
By default, during a rollout:
- Old pods are terminated
- New pods come up with new IPs
- AGIC takes around 30 seconds (sometimes more) to detect and update these new pod IPs in Application Gateway
During this window, Application Gateway may still route traffic to terminating pods, resulting in 5xx errors or brief downtime.
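To see the symptom for yourself, you can poll the application through Application Gateway while a rollout is running (the hostname below is a placeholder for your own endpoint); without the tuning described later, this typically shows a short burst of 502/504 responses.

# Print one HTTP status code per second while a deployment is in progress
# (replace the hostname with your Application Gateway endpoint)
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" https://my-app.example.com/
  sleep 1
done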
The Root Cause
Let’s understand what was happening internally:
- Kubernetes starts terminating old pods
- Pods are removed from Endpoints
- AGIC needs time to:
  - Detect endpoint changes
  - Update Application Gateway backend pool
  - Push config updates
- Meanwhile, traffic is still flowing
If pods terminate too fast, Application Gateway temporarily has no healthy backend, causing downtime.
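You can watch this race directly. Here is a quick sketch, assuming the Deployment and its Service are both named my-app (adjust the names to your setup): trigger a rollout and watch the Endpoints object while it happens; the Application Gateway backend pool lags behind these updates.

# Trigger a fresh rollout of the Deployment (name is a placeholder)
kubectl rollout restart deployment/my-app

# Watch old pod IPs disappear and new ones appear in the Endpoints object;
# AGIC reacts to these changes only after a delay
kubectl get endpoints my-app -w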
The Strategy That Fixed It
After digging into this and testing a few approaches in production, I implemented three small but very effective changes in my Deployment YAML:
1. RollingUpdate Strategy
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
Why this matters:
- Keeps the full replica count available throughout the rollout (maxUnavailable: 0)
- A new pod comes up and becomes ready before an old pod is removed (maxSurge: 1)
2. Increased Termination Grace Period
spec:
  terminationGracePeriodSeconds: 300
Why this matters:
- Gives Kubernetes 5 minutes before force‑killing the pod
- Allows existing connections to complete
- Buys time for AGIC to update backend pool
3. preStop Hook to Delay Pod Shutdown
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 280"]
Why this matters:
- Pod stays in the Terminating state for ~280 seconds
- Application Gateway continues routing traffic safely during this window
- AGIC finishes updating the new pod IPs before traffic to the old pod stops
- Keep the sleep shorter than terminationGracePeriodSeconds (280 < 300 here), otherwise the kubelet force-kills the pod mid-sleep
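Putting the three pieces together, here is a minimal sketch of a complete Deployment manifest. The app name my-app, the image tag, and the container port are placeholders, and the preStop hook assumes the image ships /bin/sh; adjust replicas, image, and port to your workload.

# Minimal sketch combining all three settings (names and ports are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # bring one new pod up first
      maxUnavailable: 0    # never drop below the desired replica count
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 300   # must stay longer than the preStop sleep
      containers:
        - name: my-app
          image: my-app:1.0.0
          ports:
            - containerPort: 8080
          lifecycle:
            preStop:
              exec:
                # hold the pod in Terminating while AGIC updates the backend pool
                command: ["/bin/sh", "-c", "sleep 280"]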
What Changed After This?
After applying this configuration in production, the difference was immediately noticeable:
✅ No more deployment‑time downtime
✅ Zero 502 / 504 errors from Application Gateway
✅ Smooth traffic transition between old and new pods
✅ AGIC gets enough time to sync backend pool changes
Even during peak traffic, deployments became completely seamless.
Why This Is Important for AGIC Users
AGIC is not instant. It operates asynchronously and depends on:
- Kubernetes endpoint updates
- ARM / Application Gateway config propagation
So fast pod termination = broken traffic.
This pattern ensures:
“Never kill a pod until the load balancer has fully learned about the new one.”
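If you want to verify the sync on the Azure side, one option is to list the backend pool addresses during a rollout and confirm the new pod IPs appear before the old pods are killed. The resource group and gateway names below are placeholders, and the --query path assumes the default CLI output shape.

# List the backend pool IPs that AGIC has pushed to Application Gateway
# (replace the resource group and gateway name with your own)
az network application-gateway address-pool list \
  --resource-group my-rg \
  --gateway-name my-appgw \
  --query "[].backendAddresses[].ipAddress"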
Final Thoughts (From Real Experience)
If you are using:
- AKS
- Azure Application Gateway
- AGIC
and facing deployment‑time downtime, this approach is mandatory, not optional.
Kubernetes already gives us the right tools — we just need to tune them correctly based on how cloud load balancers like Application Gateway behave in the real world.
Happy deploying 🚀
Let me know in the comments if you want a follow-up covering:
- Diagrams
- AGIC internals
- Comparison with NGINX ingress
- Real production metrics