
POTHURAJU JAYAKRISHNA YADAV

Zero‑Downtime Deployments on AKS with Azure Application Gateway Ingress Controller (AGIC)

Background

While deploying applications on Azure Kubernetes Service (AKS) behind Azure Application Gateway, managed by the Application Gateway Ingress Controller (AGIC), I repeatedly saw brief but noticeable downtime during deployments.

The issue was not with Kubernetes itself, but with how AGIC updates backend pool IPs when pods are replaced during a deployment.

By default, during a rollout:

  • Old pods are terminated
  • New pods come up with new IPs
  • AGIC takes around 30 seconds (sometimes more) to detect and update these new pod IPs in Application Gateway

During this window, Application Gateway may still route traffic to pods that are terminating or already gone, which results in 5xx errors or downtime.


The Root Cause

Let’s understand what was happening internally:

  1. Kubernetes starts terminating old pods
  2. Pods are removed from Endpoints
  3. AGIC needs time to:
     • Detect endpoint changes
     • Update Application Gateway backend pool
     • Push config updates
  4. Meanwhile, traffic is still flowing

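If pods terminate faster than AGIC can sync, the backend pool still lists IPs of pods that no longer exist. With only a few replicas, Application Gateway can temporarily have no healthy backend at all, causing downtime.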


The Strategy That Fixed It

After digging into this and testing a few approaches in production, I implemented three small but very effective changes in my Deployment YAML:

1. RollingUpdate Strategy

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

Why this matters:

  • maxUnavailable: 0 keeps the number of available pods from dropping below the desired replica count
  • maxSurge: 1 lets one new pod come up and become Ready before an old pod is removed
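
The "become Ready" part only means something if the container defines a readiness probe; without one, Kubernetes treats the pod as Ready as soon as its containers start. A minimal sketch of what that could look like, assuming an HTTP health endpoint. The container name, image, /healthz path, port 8080, and the 30-second minReadySeconds value are illustrative assumptions, not settings from my production manifest:

```yaml
# Sketch only: name, image, path, port, and timings are placeholder values.
spec:
  minReadySeconds: 30            # optional: a new pod must stay Ready this long
                                 # before it counts as available in the rollout
  template:
    spec:
      containers:
        - name: web
          image: example.azurecr.io/web:1.0.0
          readinessProbe:
            httpGet:
              path: /healthz     # hypothetical health endpoint
              port: 8080
            periodSeconds: 5     # probe every 5 seconds
            failureThreshold: 3  # mark NotReady after 3 consecutive failures
```

With maxUnavailable: 0, the rollout does not remove an old pod until the new one has passed this probe, and has stayed Ready for minReadySeconds if that is set.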

2. Increased Termination Grace Period

spec:                                  # pod spec, i.e. spec.template.spec in the Deployment
  terminationGracePeriodSeconds: 300

Why this matters:

  • Gives the pod up to 5 minutes to shut down before Kubernetes force‑kills it
  • Allows existing connections to complete
  • Buys time for AGIC to update backend pool

3. preStop Hook to Delay Pod Shutdown

lifecycle:          # set on the container, under spec.template.spec.containers[]
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 280"]

Why this matters:

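  • Pod stays in Terminating state for ~280 seconds, and the container does not receive SIGTERM until the hook finishes
  • The pod keeps serving any requests that Application Gateway still routes to its IP
  • AGIC finishes updating the backend pool with the new pod IPs before the old pod actually stops
  • The sleep counts against terminationGracePeriodSeconds, so it must stay below it: 280 of the 300 seconds leaves about 20 seconds for the application's own shutdown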
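
To show where each of the three settings sits in a Deployment, here is a minimal combined sketch. The strategy, grace period, and preStop values are the ones from this post; the name "web", the labels, the image, the replica count, and the port are placeholder assumptions:

```yaml
# Sketch only: "web", the image, replicas, and the port are placeholder values.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # start one extra pod before removing an old one
      maxUnavailable: 0    # never drop below the desired replica count
  template:
    metadata:
      labels:
        app: web
    spec:
      # Total shutdown budget; the preStop sleep below is counted against it.
      terminationGracePeriodSeconds: 300
      containers:
        - name: web
          image: example.azurecr.io/web:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          lifecycle:
            preStop:
              exec:
                # Requires /bin/sh and sleep in the image.
                # Keep the sleep shorter than the grace period.
                command: ["/bin/sh", "-c", "sleep 280"]
```

The 280-second sleep matches what worked for me; a shorter value may be enough if AGIC syncs faster in your cluster, since each replaced pod lingers for the full sleep and slows the rollout accordingly.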

What Changed After This?

After applying this configuration in production, the difference was immediately noticeable:

✅ No more deployment‑time downtime

✅ Zero 502 / 504 errors from Application Gateway

✅ Smooth traffic transition between old and new pods

✅ AGIC gets enough time to sync backend pool changes

Even during peak traffic, deployments became completely seamless.


Why This Is Important for AGIC Users

AGIC is not instant. It operates asynchronously and depends on:

  • Kubernetes endpoint updates
  • ARM / Application Gateway config propagation

So fast pod termination = broken traffic.

This pattern ensures:

“Never kill a pod until the load balancer has fully learned about the new one.”
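
A related setting on the Application Gateway side is connection draining, which AGIC exposes through Ingress annotations. This was not one of the three changes described above and I have not measured its effect here, so treat the following as an untested sketch. The Ingress name, the backend service name and port, and the 30-second timeout are placeholder assumptions:

```yaml
# Sketch only: names, port, and the 30-second timeout are placeholder values.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    # Let in-flight requests finish on backends being removed from the pool.
    appgw.ingress.kubernetes.io/connection-draining: "true"
    appgw.ingress.kubernetes.io/connection-draining-timeout: "30"
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web        # hypothetical Service in front of the pods
                port:
                  number: 80
```

Draining only helps once the gateway knows a backend is going away, so it complements the preStop delay rather than replacing it.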


Final Thoughts (From Real Experience)

If you are using:

  • AKS
  • Azure Application Gateway
  • AGIC

and facing deployment‑time downtime, this approach is mandatory, not optional.

Kubernetes already gives us the right tools; we just need to tune them correctly based on how cloud load balancers like Application Gateway behave in the real world.


Happy deploying 🚀

Let me know if you want a version with:

  • Diagrams
  • AGIC internals
  • Comparison with NGINX ingress
  • Real production metrics
