POTHURAJU JAYAKRISHNA YADAV

Posted on Jan 23

Zero‑Downtime Deployments on AKS with Azure Application Gateway Ingress Controller (AGIC)

#azure #aks #devops #agic

Background

While deploying applications on Azure Kubernetes Service (AKS) behind Azure Application Gateway Ingress Controller (AGIC), I repeatedly faced brief but noticeable downtime during deployments.

The issue was not with Kubernetes itself, but with how AGIC updates backend pool IPs when pods are replaced during a deployment.

By default, during a rollout:

Old pods are terminated
New pods come up with new IPs
AGIC takes around 30 seconds (sometimes more) to detect and update these new pod IPs in Application Gateway

During this window, Application Gateway may still route traffic to terminating pods, resulting in 5xx errors or downtime.

The Root Cause

Let’s understand what was happening internally:

Kubernetes starts terminating old pods
Pods are removed from Endpoints
AGIC needs time to:

Detect endpoint changes
Update Application Gateway backend pool
Push config updates
Meanwhile, traffic is still flowing

If the old pods terminate before AGIC has pushed the new pod IPs, Application Gateway is briefly left with no working backend, causing downtime.

The Strategy That Fixed It

After digging into this and testing a few approaches in production, I implemented three small but very effective changes in my Deployment YAML:

1. RollingUpdate Strategy

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

Why this matters:

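With maxUnavailable: 0, the number of available pods never drops below the desired replica count
With maxSurge: 1, one extra pod is created, and it must become Ready before an old pod is removed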
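The rollout only waits for a new pod if Kubernetes can tell when that pod is Ready, so this strategy is usually paired with a readiness probe. A minimal sketch, assuming an HTTP health endpoint; the container name, image, path, port, and timing values below are hypothetical placeholders, not values from my setup:

```yaml
# Fragment of a Deployment spec. Names, path, port, and timings are hypothetical.
spec:
  minReadySeconds: 10            # a new pod must stay Ready this long before it counts as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      containers:
        - name: web                               # hypothetical container name
          image: example.azurecr.io/web:1.0.0     # hypothetical image
          readinessProbe:
            httpGet:
              path: /healthz                      # hypothetical health endpoint
              port: 8080                          # hypothetical port
            periodSeconds: 5
            failureThreshold: 3
```

Without a readiness probe, a pod counts as Ready as soon as its containers start, which may be before the application can actually serve requests.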

2. Increased Termination Grace Period

# Pod spec, i.e. spec.template.spec in the Deployment
spec:
  terminationGracePeriodSeconds: 300

Why this matters:

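Gives the pod up to 5 minutes to shut down before Kubernetes force‑kills it
Allows existing connections to complete
Buys time for AGIC to update the backend pool
Must be longer than the preStop delay below, because the preStop hook runs inside this grace period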

3. preStop Hook to Delay Pod Shutdown

# Goes under the container entry in spec.template.spec.containers
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 280"]

Why this matters:

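Pod stays in Terminating state for ~280 seconds while the container keeps running and can still serve requests
Application Gateway continues routing traffic safely
AGIC finishes registering the new pod IPs before the old pod stops serving
Leaves about 20 seconds of the 300‑second grace period for the application to handle SIGTERM after the sleep ends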
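Putting the three changes together, a minimal sketch of a full Deployment could look like this. The name, labels, image, replica count, and port are placeholders, not values from a real cluster; only the strategy, grace period, and preStop settings come from the steps above. The preStop command also assumes the container image ships /bin/sh and sleep.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                        # hypothetical name
spec:
  replicas: 2                      # hypothetical replica count
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                  # start one new pod first
      maxUnavailable: 0            # never drop below the desired replica count
  template:
    metadata:
      labels:
        app: web
    spec:
      # Total shutdown budget; the preStop sleep below counts against it.
      terminationGracePeriodSeconds: 300
      containers:
        - name: web
          image: example.azurecr.io/web:1.0.0   # hypothetical image
          ports:
            - containerPort: 8080               # hypothetical port
          lifecycle:
            preStop:
              exec:
                # Keep the old pod serving while AGIC updates the backend pool.
                command: ["/bin/sh", "-c", "sleep 280"]
```

The sleep value is deliberately shorter than the grace period, so the application still gets a short window to shut down cleanly once SIGTERM arrives.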

What Changed After This?

After applying this configuration in production, the difference was immediately noticeable:

✅ No more deployment‑time downtime

✅ Zero 502 / 504 errors from Application Gateway

✅ Smooth traffic transition between old and new pods

✅ AGIC gets enough time to sync backend pool changes

Even during peak traffic, deployments became completely seamless.

Why This Is Important for AGIC Users

AGIC is not instant. It operates asynchronously and depends on:

Kubernetes endpoint updates
ARM / Application Gateway config propagation

So if pods terminate faster than AGIC can sync, traffic breaks.

This pattern ensures:

“Never kill a pod until the load balancer has fully learned about the new one.”

Final Thoughts (From Real Experience)

If you are using:

AKS
Azure Application Gateway
AGIC

and facing deployment‑time downtime, this approach is mandatory, not optional.

Kubernetes already gives us the right tools; we just need to tune them correctly based on how cloud load balancers like Application Gateway behave in the real world.

Happy deploying 🚀

Let me know if you want a version with:

Diagrams
AGIC internals
Comparison with NGINX ingress
Real production metrics
