
Modernizing Legacy Workloads: KubeVirt on AKS with Azure Arc Identity

TL;DR: A production-grade blueprint for running Virtual Machines on Azure Kubernetes Service (AKS). This project demonstrates how to unify container and VM operations while solving the "Identity Gap" using Azure Arc—enabling true Azure AD SSH authentication with zero manual key management.

View the Complete Project on GitHub


Table of Contents

  1. The Problem: Operational Fragmentation
  2. What is KubeVirt?
  3. Architecture Overview
  4. The Identity Challenge: No IMDS
  5. Multi-Tenancy & Security
  6. Implementation Deep Dive
  7. Deployment Guide
  8. Technologies & Skills Demonstrated

The Problem: Operational Fragmentation {#the-problem}

The Reality of Enterprise IT

Here's a truth nobody talks about at cloud conferences: most enterprises aren't running everything in containers. They're not even close.

While we celebrate microservices and Kubernetes, the reality on the ground looks different. Organizations still depend heavily on Virtual Machines for their most critical workloads:

  • Legacy Databases like Oracle and SQL Server that would require months of refactoring to containerize properly
  • Proprietary Software with licensing tied to specific OS configurations
  • Compliance-bound Workloads that regulators insist must run in isolated VMs
  • Lift-and-Shift Migrations that moved to the cloud but never got modernized

This isn't a failure—it's pragmatism. These VMs run the systems that actually make money.

The "Two-Stack Problem"

But here's where things get messy. Organizations end up managing two completely separate infrastructure stacks:

| Aspect | Container Stack | VM Stack |
| --- | --- | --- |
| Orchestration | Kubernetes | vSphere, Hyper-V, Azure VMs |
| CI/CD Pipeline | ArgoCD, Flux, Jenkins | Separate scripts, manual deployment |
| Monitoring | Prometheus, Grafana | vRealize, SCOM, Azure Monitor |
| Networking | CNI (Calico, Cilium) | NSX, Azure VNet |
| Access Control | Kubernetes RBAC | AD Groups, SSH Keys |

The Hidden Costs of Two-Stack Operations
  • Double the tooling costs in licenses, training, and maintenance
  • Context switching that tanks developer productivity
  • Security gaps where the two stacks meet
  • Siloed teams who don't share knowledge or practices
  • Inconsistent policies that create compliance headaches

The Solution: Unified Operations with KubeVirt

What if you could run your VMs on the same platform as your containers?

This is exactly what KubeVirt enables. By treating VMs as Kubernetes objects, you collapse two stacks into one:

  • One Pipeline: Deploy VMs with the same GitOps workflows as your microservices
  • One Monitoring Stack: Prometheus and Grafana for everything
  • One Access Model: Kubernetes RBAC governs who can create, start, and stop VMs
  • One Team: Platform engineers manage the whole thing

What is KubeVirt? VMs as Kubernetes Objects {#what-is-kubevirt}

The Core Idea

KubeVirt is a Kubernetes add-on that lets you run traditional Virtual Machines alongside containers. It extends the Kubernetes API with VM-specific resources like VirtualMachine, VirtualMachineInstance, and DataVolume.

Important distinction: KubeVirt doesn't emulate or containerize your VM. It runs a real KVM/QEMU hypervisor inside a Kubernetes Pod. The guest OS is a full, unmodified Linux or Windows installation.
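Once the operators are installed, you can see those API extensions directly with kubectl. A quick sanity check, using the standard KubeVirt and CDI API groups:

```bash
# List the custom resources that KubeVirt and CDI register with the API server
kubectl api-resources --api-group=kubevirt.io
kubectl api-resources --api-group=cdi.kubevirt.io

# VMs, their running instances, and disk imports across all namespaces
kubectl get virtualmachines,virtualmachineinstances --all-namespaces
kubectl get datavolumes --all-namespaces
```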

How It Works Under the Hood

Component Breakdown
| Component | Role |
| --- | --- |
| virt-api | Extends the Kubernetes API to handle VirtualMachine resources |
| virt-controller | Manages VM lifecycle (create, start, stop, migrate) |
| virt-handler | DaemonSet on each node; interfaces with libvirt/QEMU |
| virt-launcher | Pod that hosts the actual VM; one per running VM |
| CDI (Containerized Data Importer) | Handles VM disk image imports from HTTP, S3, or registries |
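To make the CDI row concrete, here is a minimal sketch of a DataVolume that imports a cloud image over HTTP into a PVC the VM can boot from. The name matches the root disk referenced in the manifest below; the image URL and disk size are illustrative, not the project's actual values.

```bash
kubectl apply -f - <<'EOF'
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: my-ubuntu-vm-rootdisk
  namespace: student-labs
spec:
  source:
    http:
      # CDI downloads this image and writes it into the PVC below
      url: https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 20Gi
EOF
```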

VM Lifecycle in Kubernetes

A KubeVirt VM follows a familiar Kubernetes pattern:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: my-ubuntu-vm
  namespace: student-labs
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 2
        memory:
          guest: 4Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        - name: rootdisk
          dataVolume:
            name: my-ubuntu-vm-rootdisk

The running: true field is the desired state—the controller makes sure reality matches. DataVolumes handle disk provisioning, and the VM gets scheduled just like any other Pod, respecting taints, tolerations, and affinity rules.
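Because running is just desired state, stopping and starting a VM amounts to flipping that field. A small sketch, assuming the virtctl krew plugin that the project also uses later for console access:

```bash
# virtctl (installed via krew as "kubectl virt") toggles spec.running for you
kubectl virt stop my-ubuntu-vm -n student-labs
kubectl virt start my-ubuntu-vm -n student-labs

# Equivalent to patching the field directly
kubectl patch vm my-ubuntu-vm -n student-labs --type=merge -p '{"spec":{"running":false}}'
```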


Architecture Overview {#architecture}

What We're Building

This project implements a multi-tenant university lab platform with three user types:

  • Faculty from the Computer Science department running research VMs with generous resources
  • Students running lab VMs with strict quotas to prevent abuse
  • IT Administrators with full platform control

Node Pools Configuration

| Pool | VM Size | Purpose | Special Config |
| --- | --- | --- | --- |
| System | Standard_D2s_v3 | Run operators, CoreDNS | Tainted for critical add-ons only |
| KubeVirt | Standard_D4s_v3 | Run guest VMs | Taint: kubevirt.io/dedicated, Label: workload=kubevirt |

A word of caution: the KubeVirt node pool must use VM sizes that support nested virtualization, which means the Dv3, Dv4, Dv5, Ev3, Ev4, or Ev5 series. B-series and pre-v3 D-series sizes won't work; I learned this the hard way.
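For reference, creating that dedicated pool with the Azure CLI looks roughly like this; the project provisions it in Terraform, so treat the names and node count as placeholders.

```bash
# Dedicated pool for guest VMs: a v3+ size for nested virtualization,
# plus the taint and label that the VM templates select on
az aks nodepool add \
  --resource-group rg-uni-kubevirt \
  --cluster-name aks-uni-platform \
  --name kubevirt \
  --node-vm-size Standard_D4s_v3 \
  --node-count 2 \
  --labels workload=kubevirt \
  --node-taints "kubevirt.io/dedicated=true:NoSchedule"
```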

Storage Classes

| Class | SKU | Reclaim Policy | Use Case |
| --- | --- | --- | --- |
| kv-premium-retain | Premium_LRS | Retain | Production VM disks (data survives VM deletion) |
| kv-standard | StandardSSD_LRS | Delete | Ephemeral and test VMs |
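A minimal sketch of the retain class as a manifest (the project defines it in storage.tf; the parameters mirror the table, and disk.csi.azure.com is the Azure Disk CSI driver AKS ships with):

```bash
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kv-premium-retain
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
reclaimPolicy: Retain              # PV (and the Azure disk) survives PVC deletion
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
```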

The Identity Challenge: Solving the IMDS Gap {#identity-challenge}

This is where things get interesting—and where I spent most of my debugging time.

The Problem

Every Azure VM can reach the Instance Metadata Service (IMDS) at 169.254.169.254. This service hands out managed identity tokens, instance metadata, and scheduled event notifications. Azure extensions like the AD SSH Login extension depend on it.

But KubeVirt VMs are nested inside an AKS node. When your guest VM tries to reach that link-local address, the request gets blocked by the pod's NAT layer.

The result? Your nested VM has no Azure identity. Standard Azure extensions fail silently.
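If you want to see the gap for yourself, this is the standard IMDS token request; it succeeds from any native Azure VM and times out from inside a KubeVirt guest:

```bash
# Managed identity token request against the link-local metadata endpoint.
# From a nested VM the packet never makes it past the virt-launcher pod's NAT.
curl -s -H "Metadata: true" \
  "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://management.azure.com/"
```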

The Solution: Azure Arc

Azure Arc lets you project non-Azure machines into Azure Resource Manager. That includes on-premises servers, VMs in other clouds, and—crucially for us—nested VMs that can't reach IMDS.

With Arc, your KubeVirt VM gets:

  • An Azure Resource Identity (a real resource ID in ARM)
  • Managed Identity Equivalent for authenticating to Azure services
  • Extension Support including the AADSSHLoginForLinux extension we need

The Registration Flow

Here's how it comes together:

The magic happens during cloud-init. The VM waits for network stability (KubeVirt NAT needs a moment), downloads the Arc agent, and registers itself using a service principal we created in Terraform. Once Arc confirms the connection, Terraform installs the SSH extension.
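For reference, installing the SSH extension by hand looks roughly like this (the project drives the same step from Terraform; the location value is a placeholder):

```bash
# Requires the connectedmachine CLI extension: az extension add --name connectedmachine
az connectedmachine extension create \
  --machine-name lab-vm \
  --resource-group rg-uni-kubevirt \
  --location eastus \
  --name AADSSHLoginForLinux \
  --publisher Microsoft.Azure.ActiveDirectory \
  --type AADSSHLoginForLinux
```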

RBAC for SSH Access

Access control uses standard Azure roles:

| Role | Permissions | Assigned To |
| --- | --- | --- |
| Virtual Machine Administrator Login | SSH + sudo | Faculty, IT Admins |
| Virtual Machine User Login | SSH only (no sudo) | Students |
| Azure Connected Machine Onboarding | Register new Arc machines | Arc Service Principal |
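Assigning those roles is plain Azure RBAC, scoped to the Arc machine resource. A hedged example for the student group, where the group object ID and subscription ID are placeholders:

```bash
az role assignment create \
  --assignee-object-id "<student-group-object-id>" \
  --assignee-principal-type Group \
  --role "Virtual Machine User Login" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/rg-uni-kubevirt/providers/Microsoft.HybridCompute/machines/lab-vm"
```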

Multi-Tenancy & Security Model {#multi-tenancy}

Namespace-Based Isolation

We use Kubernetes Namespaces as the primary isolation boundary. Each tenant gets their own namespace with dedicated quotas, network policies, and RBAC bindings.

Security Controls

| Control | Implementation | Purpose |
| --- | --- | --- |
| ResourceQuota | Per-namespace CPU/Memory/PVC limits | Prevent resource exhaustion |
| LimitRange | Per-VM resource caps | Stop one VM from eating all the quota |
| NetworkPolicy | Ingress/Egress rules | Network isolation between tenants |
| RBAC (K8s) | RoleBindings to Azure AD groups | Control who can manage VMs |
| RBAC (Azure) | VM Login roles | Control who can SSH into VMs |
| Node Taints | kubevirt.io/dedicated | Keep VMs on dedicated nodes |

Example: Student Namespace Security Configuration
apiVersion: v1
kind: ResourceQuota
metadata:
  name: student-lab-quota
  namespace: student-labs
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    persistentvolumeclaims: "5"

---
apiVersion: v1
kind: LimitRange
metadata:
  name: student-vm-limits
  namespace: student-labs
spec:
  limits:
    - type: Container
      max:
        cpu: "2"
        memory: 4Gi
      default:
        cpu: "1"
        memory: 2Gi
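The quota and limit range cap compute; the network side is a NetworkPolicy per tenant. A minimal ingress-isolation sketch (egress is deliberately left open here, since the VM still has to reach the Azure Arc endpoints during registration):

```bash
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: student-lab-ingress-isolation
  namespace: student-labs
spec:
  podSelector: {}          # applies to every pod, including virt-launcher pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # only traffic from within student-labs is allowed
EOF
```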


Implementation Deep Dive {#implementation}

Terraform Structure

The infrastructure breaks down into logical files:

terraform/
├── main.tf              # AKS cluster and node pools
├── providers.tf         # Azure, Kubernetes, kubectl providers
├── variables.tf         # Input variables with validation
├── outputs.tf           # Connection strings and useful outputs
├── identity.tf          # Azure AD groups, RBAC assignments
├── arc.tf               # Azure Arc SP, roles, extension installer
├── platform.tf          # KubeVirt and CDI operator deployment
├── tenancy.tf           # Namespace, quota, network policy per tenant
├── storage.tf           # StorageClass definitions
├── networking.tf        # Egress network policies for operators
├── images.tf            # VM image storage (Azure Blob)
├── virtualmachines.tf   # Demo VM definition
└── templates/
    ├── cloud-init-arc.tftpl   # Cloud-init for Arc-enabled VMs
    └── cloud-init-lab.tftpl   # Cloud-init for basic VMs

The Critical Piece: Cloud-Init

The cloud-init script handles Arc registration and needs to be robust. It must deal with:

  1. Network delays while KubeVirt NAT stabilizes
  2. DNS resolution for Azure endpoints
  3. Transient API failures during registration

Key Cloud-Init Logic
wait_for_network() {
    for i in $(seq 1 60); do
        if curl -s --connect-timeout 5 https://management.azure.com > /dev/null 2>&1; then
            echo "[Arc] Network ready"
            return 0
        fi
        echo "[Arc] Waiting for network... ($i/60)"
        sleep 5
    done
    return 1
}

register_with_arc() {
    local max_retries=5
    local retry_delay=30

    for i in $(seq 1 $max_retries); do
        if azcmagent connect \
            --service-principal-id "$SP_ID" \
            --service-principal-secret "$SP_SECRET" \
            --tenant-id "$TENANT_ID" \
            --subscription-id "$SUB_ID" \
            --resource-group "$RG_NAME" \
            --location "$LOCATION" \
            --resource-name "$(hostname)"; then
            echo "[Arc] Registration successful"
            return 0
        fi
        echo "[Arc] Retrying in ${retry_delay}s... ($i/$max_retries)"
        sleep $retry_delay
        retry_delay=$((retry_delay * 2))
    done
    return 1
}

Terraform Patterns Worth Noting

| Pattern | Implementation | Why It Matters |
| --- | --- | --- |
| Trigger-based Recreation | triggers in null_resource | Recreate VM when cloud-init changes |
| Dependency Management | Explicit depends_on chains | Correct deployment order |
| Sensitive Values | sensitive = true on SP secrets | Keep secrets out of logs |

Deployment Guide {#deployment}

What You'll Need

| Tool | Version | Purpose |
| --- | --- | --- |
| Azure CLI | 2.40+ | Azure authentication and management |
| Terraform | 1.3+ | Infrastructure provisioning |
| kubectl | 1.24+ | Kubernetes interaction |
| Azure Subscription | n/a | Owner role required for RBAC |

Step-by-Step

# Clone and configure
git clone https://github.com/ykbytes/aks-kubevirt-arc-unilab.git
cd aks-kubevirt-arc-unilab
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your settings

# Deploy (takes 15-20 minutes)
az login
terraform init
terraform plan
terraform apply

# Get credentials and verify
az aks get-credentials --resource-group rg-uni-kubevirt --name aks-uni-platform
kubectl get kubevirt -n kubevirt      # Should show: Deployed
kubectl get vm -n student-labs         # Should show: lab-vm Running

# Connect with Azure AD
az ssh vm --name lab-vm --resource-group rg-uni-kubevirt

What to Expect During Deployment

Arc registration takes about 5-7 minutes. You'll see output like this:

null_resource.arc_aad_ssh_extension[0] (local-exec): [Arc] Waiting for lab-vm to connect...
null_resource.arc_aad_ssh_extension[0] (local-exec): [Arc] Status:  (attempt 1/90)
...
null_resource.arc_aad_ssh_extension[0] (local-exec): [Arc] Status:  (attempt 27/90)
null_resource.arc_aad_ssh_extension[0] (local-exec): [Arc] Machine connected

The empty status values in the first few minutes are normal—the VM is still booting and running cloud-init.
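If the status stays empty well past that window, the agent itself is the best place to look. From inside the guest (the VM console command in the next section gets you there), two commands show what azcmagent is doing:

```bash
# Run inside the guest VM
sudo azcmagent show                      # agent status, resource name, tenant
sudo tail -n 50 /var/log/azcmagent.log   # registration attempts and errors
```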


Verification Commands

# Check Arc registration
az connectedmachine show --name lab-vm --resource-group rg-uni-kubevirt \
    --query "{Name:name, Status:status}" -o table

# Check extension status
az connectedmachine extension list --machine-name lab-vm \
    --resource-group rg-uni-kubevirt \
    --query "[].{Name:name, Status:provisioningState}" -o table

# Alternative: VM console access
kubectl virt console lab-vm -n student-labs

What Success Looks Like

See a Successful Connection
$ az ssh vm --name lab-vm --resource-group rg-uni-kubevirt

Welcome to Ubuntu 22.04.5 LTS

═══════════════════════════════════════════════════════════════
 KubeVirt Lab VM - Azure Arc Enabled
═══════════════════════════════════════════════════════════════

 Azure AD Authentication:
    az ssh vm --name lab-vm --resource-group rg-uni-kubevirt

 Required RBAC Roles:
    • Virtual Machine Administrator Login - for sudo access
    • Virtual Machine User Login - for standard user access

═══════════════════════════════════════════════════════════════

user@example.com@lab-vm:~$ whoami
user@example.com

Notice that whoami returns your Azure AD email, not a local username. No SSH keys were exchanged—Azure AD generated an ephemeral certificate automatically.

This is what makes the Arc approach worthwhile: a nested VM with no direct Azure identity becomes accessible via Azure AD credentials, just like a native Azure VM.


Technologies & Skills Demonstrated {#technologies}

Cloud & Infrastructure

| Technology | Usage |
| --- | --- |
| Azure Kubernetes Service (AKS) | Managed Kubernetes with workload identity |
| Azure Arc | Hybrid identity for nested VMs |
| Azure Blob Storage | VM image repository |
| Azure Managed Disks | Persistent storage for VM disks |
| Azure RBAC | Fine-grained SSH access control |

Kubernetes & Virtualization

| Technology | Usage |
| --- | --- |
| KubeVirt | VM orchestration on Kubernetes |
| CDI (Containerized Data Importer) | VM disk image management |
| Kubernetes RBAC | Namespace-level access control |
| NetworkPolicies | Tenant network isolation |
| ResourceQuotas & LimitRanges | Multi-tenant resource governance |

DevOps & Automation

| Technology | Usage |
| --- | --- |
| Terraform | Infrastructure as Code |
| Cloud-Init | VM bootstrap automation |
| Azure CLI | Scripted Azure operations |

What This Project Demonstrates

  • Cloud Architecture: A scalable, multi-tenant platform on Azure
  • Kubernetes Depth: KubeVirt, CDI, RBAC, NetworkPolicies working together
  • Security Engineering: Zero-trust identity with Azure Arc
  • Infrastructure as Code: Production-quality Terraform with proper patterns
  • Problem Solving: A creative solution to the IMDS identity gap

Potential Extensions

| Extension | Description |
| --- | --- |
| GitOps Integration | Deploy VMs via ArgoCD or Flux |
| GPU Passthrough | Enable NVIDIA GPUs for AI/ML workloads |
| Live Migration | Move VMs between nodes without downtime |
| Backup/DR | Integrate Velero for VM backup |
| Cost Management | Azure Cost Management tags and budgets |

About the Author

I'm a Cloud Platform Engineer focused on bridging legacy infrastructure and modern cloud-native operations. This project reflects my approach to real-world problems:

  • Designing complex cloud architectures that actually work
  • Solving identity and security challenges without overengineering
  • Writing Terraform that other people can maintain
  • Automating the tedious parts so humans can focus on interesting problems
