AKS vs VMs: what I learned scaling a multi-region AI platform on Azure

#kubernetes #cloud #devops #azure

When I started designing the infrastructure for Clynto AI, an AI-first Customer Success platform launching in both the US and India at the same time, the first real decision wasn't which cloud to use. It was simpler than that, and also harder: do we run this on VMs, or do we put it on Kubernetes from day one?

There's a version of this post that just says "Kubernetes wins, obviously." That's not useful to anyone actually making the call. So instead, here's what the VM path would have actually cost us, what AKS bought us, and where I think VMs would still have been the right answer if Clynto's requirements were different.

The full architecture, CI/CD pipeline, and implementation write-up for this build are in the Clynto AI case study repo if you want the complete picture beyond this comparison.

The setup

Clynto needed two regions, East US and Central India, each with its own production and staging environment, each carrying different compliance obligations (GDPR in one, PCI-DSS in the other), all behind three public domains routed through a single edge. It's a small platform team's worth of complexity for what's still an early-stage company.

Azure Front Door routes by domain and tenant to regional Application Gateways, which front isolated prod/staging AKS environments in East US and Central India.

The team-size part matters. A lot of the AKS-vs-VM debate online assumes you're either a five-person startup or a thousand-engineer org, and the answer to "use Kubernetes?" looks different at each end. Clynto sat in an uncomfortable middle: compliance and multi-region requirements of a bigger company, headcount of a much smaller one. That tension is what actually decided this, not a general preference for containers.

Where VMs would have struggled

If I'd gone with VM-based deployment (say, VM Scale Sets behind a Load Balancer, which is the natural Azure equivalent), here's where it would have gotten painful:

Replicating two regions by hand. With VMs, "mirror this environment in another region" usually means re-running a setup script, or maintaining two slightly-diverging sets of ARM templates / Ansible playbooks. With AKS, the cluster is just another Terraform module with a different region variable. I provisioned both regions' clusters, and both environments (prod/staging) within them, off the same module set. The VM version of this would have meant four mostly-similar-but-not-identical environments to keep in sync by hand.
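To make the "same module, different region variable" idea concrete, here is a minimal Python sketch of the pattern, not the actual Terraform. The `EnvironmentSpec` type, the cluster naming scheme, and the node counts are hypothetical and only there for illustration; the two regions and two environments are the ones described above.

```python
from dataclasses import dataclass
from itertools import product

# Regions and environments from the setup described above.
REGIONS = ["eastus", "centralindia"]
ENVIRONMENTS = ["prod", "staging"]


@dataclass(frozen=True)
class EnvironmentSpec:
    """Hypothetical inputs for one instance of a shared cluster module."""
    region: str
    environment: str
    cluster_name: str
    node_count: int


def build_specs(regions, environments):
    """Derive every environment from one template so none is hand-edited."""
    specs = []
    for region, env in product(regions, environments):
        specs.append(
            EnvironmentSpec(
                region=region,
                environment=env,
                cluster_name=f"aks-{env}-{region}",    # assumed naming scheme
                node_count=3 if env == "prod" else 2,  # assumed sizes
            )
        )
    return specs


specs = build_specs(REGIONS, ENVIRONMENTS)
for spec in specs:
    print(spec.cluster_name, spec.node_count)
```

The point of the shape is that the only things that vary between the four environments are explicit inputs. Anything not listed as an input is identical by construction, which is what removes the drift.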

Enforcing policy below the application layer. A chunk of Clynto's compliance requirement isn't "is the app secure," it's "can we prove the platform itself enforces controls." On VMs, that means OS hardening baselines, config management tooling, and manual audits of who can SSH into what. On AKS, Kyverno enforces policy at the Kubernetes API level (no privileged containers, no images from outside the approved registry), and that's enforced automatically on every deploy, not checked quarterly.
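For a feel of what those two rules check, here is a rough Python sketch of the logic applied to a pod spec represented as a plain dict. This is not Kyverno's policy syntax or API, just the equivalent decision written out; the registry name `example.azurecr.io/` and the `violations` helper are hypothetical.

```python
APPROVED_REGISTRY = "example.azurecr.io/"  # hypothetical registry prefix


def violations(pod_spec):
    """Return a list of rule violations for a pod spec given as a dict."""
    problems = []
    # Check init containers too, since they run with the same privileges.
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")
        security = container.get("securityContext") or {}
        if security.get("privileged", False):
            problems.append(f"{name}: privileged containers are not allowed")
        if not image.startswith(APPROVED_REGISTRY):
            problems.append(f"{name}: image {image!r} is outside the approved registry")
    return problems


good = {"containers": [{"name": "api", "image": "example.azurecr.io/api:1.4.2"}]}
bad = {
    "containers": [
        {
            "name": "debug",
            "image": "docker.io/library/busybox:latest",
            "securityContext": {"privileged": True},
        }
    ]
}

print(violations(good))
print(violations(bad))
```

An admission controller runs a check of this kind on every request to the API server and rejects the request when the list is non-empty, which is why the control holds on every deploy and not only at audit time.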

Zero-touch deployment. GitOps with Argo CD means there's no SSH step in the release process at all. On a VM fleet, even a well-built CI/CD pipeline usually ends in something touching the box directly: a deployment script, a config push, a service restart. That's a smaller surface area difference than people think, but it's a real one when you're trying to make "no manual production access" a literal architectural guarantee instead of a policy on paper.

Bin-packing under unpredictable AI workload. Clynto's workloads sit next to Azure OpenAI calls through Microsoft Foundry, and load isn't steady: it spikes with usage patterns that don't map cleanly to a fixed VM count. I used the Vertical Pod Autoscaler to right-size pod resource requests and tuned Horizontal Pod Autoscaling against real Azure Monitor metrics. Getting equivalent elasticity out of a VM Scale Set means scaling whole machines up and down, which is coarser and slower, and it's harder to right-size a VM than a pod.
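For context on why pod-level scaling is finer-grained, the Horizontal Pod Autoscaler's core rule is documented as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). Below is a simplified Python sketch of that rule. The real controller also handles stabilization windows, missing metrics, and not-ready pods, and the replica bounds and metric values used here are illustrative assumptions, not Clynto's settings.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA rule: scale in proportion to metric / target."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90, 60))   # load 50% over target
print(desired_replicas(4, 62, 60))   # within tolerance
print(desired_replicas(4, 300, 60))  # large spike, capped by max_replicas
```

With these inputs the three calls return 6, 4, and 10. The unit of change is one pod, so a 50% overshoot adds two pods, where a VM Scale Set would have to add whole machines.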

Where VMs would have actually been fine, or better

I want to be fair to the other side of this, because most "AKS is better" posts skip it.

If Clynto were single-region, single-environment, low-compliance-burden (say, a single-tenant internal tool), a couple of VMs behind a load balancer would have been less infrastructure to operate, less to learn, and genuinely simpler to reason about. Kubernetes has a real operational tax: cluster upgrades, node pool management, learning Argo CD and Kyverno in the first place. If you don't need the multi-region or policy-enforcement benefits, you're paying that tax for nothing.

If the workload were a single monolith with predictable, steady load, autoscaling pods buys you very little over a couple of right-sized VMs with a basic autoscale rule. The complexity of containerizing an app that doesn't need to scale dynamically is sometimes just complexity.

Debugging is genuinely harder on Kubernetes when something goes wrong at 2am. A VM you can SSH into and look at logs directly. A pod that's crash-looping inside a cluster with network policies and admission controllers has more layers between you and the problem. I don't think this is a reason to avoid AKS for Clynto's case, but it's a real cost, not a myth.

What actually tipped it for Clynto

Two regions, two compliance regimes, a staging environment that needed to be a faithful mirror of prod, and a one-person platform team trying to keep all of that consistent without drowning in manual config drift. That combination is where Kubernetes (specifically AKS with Terraform, Argo CD, and Kyverno) earns its complexity. Not because containers are inherently superior to VMs, but because the thing I actually needed wasn't "run my app," it was "run four nearly-identical, policy-enforced, audit-ready environments without doing the same setup four times by hand."

If you're making this call for your own platform, the question I'd actually ask isn't "Kubernetes or VMs." It's "how many environments do I need to keep consistent, and how much of my compliance story needs to be enforced by the platform instead of by a person checking a box?"

This post is based on an independent freelance engagement as the Azure Cloud Architect for Clynto AI. Architecture and implementation details reflect the actual platform delivered.

Top comments (1)

Max Quimby • Jul 4

The "uncomfortable middle" framing is the honest part most of these posts skip, so thank you for that. The one cost I'd weight a little heavier for a small team is day-2. Two VMs behind a load balancer have a nearly flat operational surface after launch; an AKS footprint carries a standing upgrade treadmill — control-plane version churn, CVEs in the operator/driver/ingress stack, cert rotation, Kyverno policy drift as the API evolves. That's real recurring work that never shows up in the day-1 "which is cleaner?" comparison, and it's felt hardest at exactly your headcount. Your compliance and multi-region argument still wins the call — Terraform-module-per-region and API-level policy enforcement genuinely are painful to replicate on VMs. The one I'd double-check is bin-packing against Azure OpenAI: VPA/HPA on Azure Monitor metrics tends to lag bursty LLM load because the signal arrives after the spike. If the traffic is request- or queue-shaped, scaling on queue depth (KEDA) reacts sooner than CPU/memory-derived metrics. Solid write-up either way — the honesty about where VMs still win is what makes it trustworthy.