
Modernizing Legacy Workloads: KubeVirt on AKS with Azure Arc Identity

TL;DR: A production-grade blueprint for running Virtual Machines on Azure Kubernetes Service (AKS). This project demonstrates how to unify container and VM operations while solving the "Identity Gap" using Azure Arc—enabling true Azure AD SSH authentication with zero manual key management.

View the Complete Project on GitHub


Table of Contents

  1. The Problem: Operational Fragmentation
  2. What is KubeVirt?
  3. Architecture Overview
  4. The Identity Challenge: No IMDS
  5. Multi-Tenancy & Security
  6. Implementation Deep Dive
  7. Deployment Guide
  8. Technologies & Skills Demonstrated

The Problem: Operational Fragmentation {#the-problem}

The Reality of Enterprise IT

Here's a truth nobody talks about at cloud conferences: most enterprises aren't running everything in containers. They're not even close.

While we celebrate microservices and Kubernetes, the reality on the ground looks different. Organizations still depend heavily on Virtual Machines for their most critical workloads:

  • Legacy Databases like Oracle and SQL Server that would require months of refactoring to containerize properly
  • Proprietary Software with licensing tied to specific OS configurations
  • Compliance-bound Workloads that regulators insist must run in isolated VMs
  • Lift-and-Shift Migrations that moved to the cloud but never got modernized

This isn't a failure—it's pragmatism. These VMs run the systems that actually make money.

The "Two-Stack Problem"

But here's where things get messy. Organizations end up managing two completely separate infrastructure stacks:

| Aspect | Container Stack | VM Stack |
| --- | --- | --- |
| Orchestration | Kubernetes | vSphere, Hyper-V, Azure VMs |
| CI/CD Pipeline | ArgoCD, Flux, Jenkins | Separate scripts, manual deployment |
| Monitoring | Prometheus, Grafana | vRealize, SCOM, Azure Monitor |
| Networking | CNI (Calico, Cilium) | NSX, Azure VNet |
| Access Control | Kubernetes RBAC | AD Groups, SSH Keys |

The Hidden Costs of Two-Stack Operations
  • Double the tooling costs in licenses, training, and maintenance
  • Context switching that tanks developer productivity
  • Security gaps where the two stacks meet
  • Siloed teams who don't share knowledge or practices
  • Inconsistent policies that create compliance headaches

The Solution: Unified Operations with KubeVirt

What if you could run your VMs on the same platform as your containers?

This is exactly what KubeVirt enables. By treating VMs as Kubernetes objects, you collapse two stacks into one:

  • One Pipeline: Deploy VMs with the same GitOps workflows as your microservices
  • One Monitoring Stack: Prometheus and Grafana for everything
  • One Access Model: Kubernetes RBAC governs who can create, start, and stop VMs
  • One Team: Platform engineers manage the whole thing

What is KubeVirt? VMs as Kubernetes Objects {#what-is-kubevirt}

The Core Idea

KubeVirt is a Kubernetes add-on that lets you run traditional Virtual Machines alongside containers. It extends the Kubernetes API with VM-specific resources like VirtualMachine, VirtualMachineInstance, and DataVolume.

Important distinction: KubeVirt doesn't emulate or containerize your VM. It runs a real KVM/QEMU hypervisor inside a Kubernetes Pod. The guest OS is a full, unmodified Linux or Windows installation.
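Once the operators are installed, you can see those API extensions directly with kubectl. A quick sanity check, using the standard KubeVirt and CDI API groups:

```bash
# List the custom resources that KubeVirt and CDI register with the API server
kubectl api-resources --api-group=kubevirt.io
kubectl api-resources --api-group=cdi.kubevirt.io

# VMs, their running instances, and disk imports across all namespaces
kubectl get virtualmachines,virtualmachineinstances --all-namespaces
kubectl get datavolumes --all-namespaces
```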

How It Works Under the Hood

Component Breakdown
| Component | Role |
| --- | --- |
| virt-api | Extends the Kubernetes API to handle VirtualMachine resources |
| virt-controller | Manages VM lifecycle (create, start, stop, migrate) |
| virt-handler | DaemonSet on each node; interfaces with libvirt/QEMU |
| virt-launcher | Pod that hosts the actual VM; one per running VM |
| CDI (Containerized Data Importer) | Handles VM disk image imports from HTTP, S3, or registries |
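To make the CDI row concrete, here is a minimal sketch of a DataVolume that imports a cloud image over HTTP into a PVC the VM can boot from. The name matches the root disk referenced in the manifest below; the image URL and disk size are illustrative, not the project's actual values.

```bash
kubectl apply -f - <<'EOF'
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: my-ubuntu-vm-rootdisk
  namespace: student-labs
spec:
  source:
    http:
      # CDI downloads this image and writes it into the PVC below
      url: https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 20Gi
EOF
```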

VM Lifecycle in Kubernetes

A KubeVirt VM follows a familiar Kubernetes pattern:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: my-ubuntu-vm
  namespace: student-labs
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 2
        memory:
          guest: 4Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        - name: rootdisk
          dataVolume:
            name: my-ubuntu-vm-rootdisk

The running: true field is the desired state—the controller makes sure reality matches. DataVolumes handle disk provisioning, and the VM gets scheduled just like any other Pod, respecting taints, tolerations, and affinity rules.
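Because running is just desired state, stopping and starting a VM amounts to flipping that field. A small sketch, assuming the virtctl krew plugin that the project also uses later for console access:

```bash
# virtctl (installed via krew as "kubectl virt") toggles spec.running for you
kubectl virt stop my-ubuntu-vm -n student-labs
kubectl virt start my-ubuntu-vm -n student-labs

# Equivalent to patching the field directly
kubectl patch vm my-ubuntu-vm -n student-labs --type=merge -p '{"spec":{"running":false}}'
```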


Architecture Overview {#architecture}

What We're Building

This project implements a multi-tenant university lab platform with three user types:

  • Faculty from the Computer Science department running research VMs with generous resources
  • Students running lab VMs with strict quotas to prevent abuse
  • IT Administrators with full platform control

Node Pools Configuration

| Pool | VM Size | Purpose | Special Config |
| --- | --- | --- | --- |
| System | Standard_D2s_v3 | Run operators, CoreDNS | Tainted for critical add-ons only |
| KubeVirt | Standard_D4s_v3 | Run guest VMs | Taint: kubevirt.io/dedicated, Label: workload=kubevirt |

A word of caution: the KubeVirt node pool must use VM sizes that support nested virtualization, which means the Dv3, Dv4, Dv5, Ev3, Ev4, or Ev5 series. B-series and pre-v3 D-series sizes won't work; I learned this the hard way.
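For reference, creating that dedicated pool with the Azure CLI looks roughly like this; the project provisions it in Terraform, so treat the names and node count as placeholders.

```bash
# Dedicated pool for guest VMs: a v3+ size for nested virtualization,
# plus the taint and label that the VM templates select on
az aks nodepool add \
  --resource-group rg-uni-kubevirt \
  --cluster-name aks-uni-platform \
  --name kubevirt \
  --node-vm-size Standard_D4s_v3 \
  --node-count 2 \
  --labels workload=kubevirt \
  --node-taints "kubevirt.io/dedicated=true:NoSchedule"
```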

Storage Classes

| Class | SKU | Reclaim Policy | Use Case |
| --- | --- | --- | --- |
| kv-premium-retain | Premium_LRS | Retain | Production VM disks (data survives VM deletion) |
| kv-standard | StandardSSD_LRS | Delete | Ephemeral and test VMs |
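A minimal sketch of the retain class as a manifest (the project defines it in storage.tf; the parameters mirror the table, and disk.csi.azure.com is the Azure Disk CSI driver AKS ships with):

```bash
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kv-premium-retain
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
reclaimPolicy: Retain              # PV (and the Azure disk) survives PVC deletion
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
```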

The Identity Challenge: Solving the IMDS Gap {#identity-challenge}

This is where things get interesting—and where I spent most of my debugging time.

The Problem

Every Azure VM can reach the Instance Metadata Service (IMDS) at 169.254.169.254. This service hands out managed identity tokens, instance metadata, and scheduled event notifications. Azure extensions like the AD SSH Login extension depend on it.

But KubeVirt VMs are nested inside an AKS node. When your guest VM tries to reach that link-local address, the request gets blocked by the pod's NAT layer.

The result? Your nested VM has no Azure identity. Standard Azure extensions fail silently.
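If you want to see the gap for yourself, this is the standard IMDS token request; it succeeds from any native Azure VM and times out from inside a KubeVirt guest:

```bash
# Managed identity token request against the link-local metadata endpoint.
# From a nested VM the packet never makes it past the virt-launcher pod's NAT.
curl -s -H "Metadata: true" \
  "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://management.azure.com/"
```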

The Solution: Azure Arc

Azure Arc lets you project non-Azure machines into Azure Resource Manager. That includes on-premises servers, VMs in other clouds, and—crucially for us—nested VMs that can't reach IMDS.

With Arc, your KubeVirt VM gets:

  • An Azure Resource Identity (a real resource ID in ARM)
  • Managed Identity Equivalent for authenticating to Azure services
  • Extension Support including the AADSSHLoginForLinux extension we need

The Registration Flow

Here's how it comes together:

The magic happens during cloud-init. The VM waits for network stability (KubeVirt NAT needs a moment), downloads the Arc agent, and registers itself using a service principal we created in Terraform. Once Arc confirms the connection, Terraform installs the SSH extension.
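For reference, installing the SSH extension by hand looks roughly like this (the project drives the same step from Terraform; the location value is a placeholder):

```bash
# Requires the connectedmachine CLI extension: az extension add --name connectedmachine
az connectedmachine extension create \
  --machine-name lab-vm \
  --resource-group rg-uni-kubevirt \
  --location eastus \
  --name AADSSHLoginForLinux \
  --publisher Microsoft.Azure.ActiveDirectory \
  --type AADSSHLoginForLinux
```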

RBAC for SSH Access

Access control uses standard Azure roles:

| Role | Permissions | Assigned To |
| --- | --- | --- |
| Virtual Machine Administrator Login | SSH + sudo | Faculty, IT Admins |
| Virtual Machine User Login | SSH only (no sudo) | Students |
| Azure Connected Machine Onboarding | Register new Arc machines | Arc Service Principal |
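Assigning those roles is plain Azure RBAC, scoped to the Arc machine resource. A hedged example for the student group, where the group object ID and subscription ID are placeholders:

```bash
az role assignment create \
  --assignee-object-id "<student-group-object-id>" \
  --assignee-principal-type Group \
  --role "Virtual Machine User Login" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/rg-uni-kubevirt/providers/Microsoft.HybridCompute/machines/lab-vm"
```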

Multi-Tenancy & Security Model {#multi-tenancy}

Namespace-Based Isolation

We use Kubernetes Namespaces as the primary isolation boundary. Each tenant gets their own namespace with dedicated quotas, network policies, and RBAC bindings.

Security Controls

| Control | Implementation | Purpose |
| --- | --- | --- |
| ResourceQuota | Per-namespace CPU/Memory/PVC limits | Prevent resource exhaustion |
| LimitRange | Per-VM resource caps | Stop one VM from eating all the quota |
| NetworkPolicy | Ingress/Egress rules | Network isolation between tenants |
| RBAC (K8s) | RoleBindings to Azure AD groups | Control who can manage VMs |
| RBAC (Azure) | VM Login roles | Control who can SSH into VMs |
| Node Taints | kubevirt.io/dedicated | Keep VMs on dedicated nodes |

Example: Student Namespace Security Configuration
apiVersion: v1
kind: ResourceQuota
metadata:
  name: student-lab-quota
  namespace: student-labs
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    persistentvolumeclaims: "5"

---
apiVersion: v1
kind: LimitRange
metadata:
  name: student-vm-limits
  namespace: student-labs
spec:
  limits:
    - type: Container
      max:
        cpu: "2"
        memory: 4Gi
      default:
        cpu: "1"
        memory: 2Gi
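The quota and limit range cap compute; the network side is a NetworkPolicy per tenant. A minimal ingress-isolation sketch (egress is deliberately left open here, since the VM still has to reach the Azure Arc endpoints during registration):

```bash
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: student-lab-ingress-isolation
  namespace: student-labs
spec:
  podSelector: {}          # applies to every pod, including virt-launcher pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # only traffic from within student-labs is allowed
EOF
```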


Implementation Deep Dive {#implementation}

Terraform Structure

The infrastructure breaks down into logical files:

terraform/
├── main.tf              # AKS cluster and node pools
├── providers.tf         # Azure, Kubernetes, kubectl providers
├── variables.tf         # Input variables with validation
├── outputs.tf           # Connection strings and useful outputs
├── identity.tf          # Azure AD groups, RBAC assignments
├── arc.tf               # Azure Arc SP, roles, extension installer
├── platform.tf          # KubeVirt and CDI operator deployment
├── tenancy.tf           # Namespace, quota, network policy per tenant
├── storage.tf           # StorageClass definitions
├── networking.tf        # Egress network policies for operators
├── images.tf            # VM image storage (Azure Blob)
├── virtualmachines.tf   # Demo VM definition
└── templates/
    ├── cloud-init-arc.tftpl   # Cloud-init for Arc-enabled VMs
    └── cloud-init-lab.tftpl   # Cloud-init for basic VMs

The Critical Piece: Cloud-Init

The cloud-init script handles Arc registration and needs to be robust. It must deal with:

  1. Network delays while KubeVirt NAT stabilizes
  2. DNS resolution for Azure endpoints
  3. Transient API failures during registration

Key Cloud-Init Logic
wait_for_network() {
    for i in $(seq 1 60); do
        if curl -s --connect-timeout 5 https://management.azure.com > /dev/null 2>&1; then
            echo "[Arc] Network ready"
            return 0
        fi
        echo "[Arc] Waiting for network... ($i/60)"
        sleep 5
    done
    return 1
}

register_with_arc() {
    local max_retries=5
    local retry_delay=30

    for i in $(seq 1 $max_retries); do
        if azcmagent connect \
            --service-principal-id "$SP_ID" \
            --service-principal-secret "$SP_SECRET" \
            --tenant-id "$TENANT_ID" \
            --subscription-id "$SUB_ID" \
            --resource-group "$RG_NAME" \
            --location "$LOCATION" \
            --resource-name "$(hostname)"; then
            echo "[Arc] Registration successful"
            return 0
        fi
        echo "[Arc] Retrying in ${retry_delay}s... ($i/$max_retries)"
        sleep $retry_delay
        retry_delay=$((retry_delay * 2))
    done
    return 1
}

Terraform Patterns Worth Noting

| Pattern | Implementation | Why It Matters |
| --- | --- | --- |
| Trigger-based Recreation | triggers in null_resource | Recreate VM when cloud-init changes |
| Dependency Management | Explicit depends_on chains | Correct deployment order |
| Sensitive Values | sensitive = true on SP secrets | Keep secrets out of logs |

Deployment Guide {#deployment}

What You'll Need

| Tool | Version | Purpose |
| --- | --- | --- |
| Azure CLI | 2.40+ | Azure authentication and management |
| Terraform | 1.3+ | Infrastructure provisioning |
| kubectl | 1.24+ | Kubernetes interaction |
| Azure Subscription | n/a | Owner role required for RBAC |

Step-by-Step

# Clone and configure
git clone https://github.com/ykbytes/aks-kubevirt-arc-unilab.git
cd aks-kubevirt-arc-unilab
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your settings

# Deploy (takes 15-20 minutes)
az login
terraform init
terraform plan
terraform apply

# Get credentials and verify
az aks get-credentials --resource-group rg-uni-kubevirt --name aks-uni-platform
kubectl get kubevirt -n kubevirt      # Should show: Deployed
kubectl get vm -n student-labs         # Should show: lab-vm Running

# Connect with Azure AD
az ssh vm --name lab-vm --resource-group rg-uni-kubevirt

What to Expect During Deployment

Arc registration takes about 5-7 minutes. You'll see output like this:

null_resource.arc_aad_ssh_extension[0] (local-exec): [Arc] Waiting for lab-vm to connect...
null_resource.arc_aad_ssh_extension[0] (local-exec): [Arc] Status:  (attempt 1/90)
...
null_resource.arc_aad_ssh_extension[0] (local-exec): [Arc] Status:  (attempt 27/90)
null_resource.arc_aad_ssh_extension[0] (local-exec): [Arc] Machine connected

The empty status values in the first few minutes are normal—the VM is still booting and running cloud-init.
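If the status stays empty well past that window, the agent itself is the best place to look. From inside the guest (the VM console command in the next section gets you there), two commands show what azcmagent is doing:

```bash
# Run inside the guest VM
sudo azcmagent show                      # agent status, resource name, tenant
sudo tail -n 50 /var/log/azcmagent.log   # registration attempts and errors
```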


Verification Commands

# Check Arc registration
az connectedmachine show --name lab-vm --resource-group rg-uni-kubevirt \
    --query "{Name:name, Status:status}" -o table

# Check extension status
az connectedmachine extension list --machine-name lab-vm \
    --resource-group rg-uni-kubevirt \
    --query "[].{Name:name, Status:provisioningState}" -o table

# Alternative: VM console access
kubectl virt console lab-vm -n student-labs

What Success Looks Like

See a Successful Connection
$ az ssh vm --name lab-vm --resource-group rg-uni-kubevirt

Welcome to Ubuntu 22.04.5 LTS

═══════════════════════════════════════════════════════════════
 KubeVirt Lab VM - Azure Arc Enabled
═══════════════════════════════════════════════════════════════

 Azure AD Authentication:
    az ssh vm --name lab-vm --resource-group rg-uni-kubevirt

 Required RBAC Roles:
    • Virtual Machine Administrator Login - for sudo access
    • Virtual Machine User Login - for standard user access

═══════════════════════════════════════════════════════════════

user@example.com@lab-vm:~$ whoami
user@example.com

Notice that whoami returns your Azure AD email, not a local username. No SSH keys were exchanged—Azure AD generated an ephemeral certificate automatically.

This is what makes the Arc approach worthwhile: a nested VM with no direct Azure identity becomes accessible via Azure AD credentials, just like a native Azure VM.


Technologies & Skills Demonstrated {#technologies}

Cloud & Infrastructure

| Technology | Usage |
| --- | --- |
| Azure Kubernetes Service (AKS) | Managed Kubernetes with workload identity |
| Azure Arc | Hybrid identity for nested VMs |
| Azure Blob Storage | VM image repository |
| Azure Managed Disks | Persistent storage for VM disks |
| Azure RBAC | Fine-grained SSH access control |

Kubernetes & Virtualization

| Technology | Usage |
| --- | --- |
| KubeVirt | VM orchestration on Kubernetes |
| CDI (Containerized Data Importer) | VM disk image management |
| Kubernetes RBAC | Namespace-level access control |
| NetworkPolicies | Tenant network isolation |
| ResourceQuotas & LimitRanges | Multi-tenant resource governance |

DevOps & Automation

| Technology | Usage |
| --- | --- |
| Terraform | Infrastructure as Code |
| Cloud-Init | VM bootstrap automation |
| Azure CLI | Scripted Azure operations |

What This Project Demonstrates

  • Cloud Architecture: A scalable, multi-tenant platform on Azure
  • Kubernetes Depth: KubeVirt, CDI, RBAC, NetworkPolicies working together
  • Security Engineering: Zero-trust identity with Azure Arc
  • Infrastructure as Code: Production-quality Terraform with proper patterns
  • Problem Solving: A creative solution to the IMDS identity gap

Potential Extensions

| Extension | Description |
| --- | --- |
| GitOps Integration | Deploy VMs via ArgoCD or Flux |
| GPU Passthrough | Enable NVIDIA GPUs for AI/ML workloads |
| Live Migration | Move VMs between nodes without downtime |
| Backup/DR | Integrate Velero for VM backup |
| Cost Management | Azure Cost Management tags and budgets |

About the Author

I'm a Cloud Platform Engineer focused on bridging legacy infrastructure and modern cloud-native operations. This project reflects my approach to real-world problems:

  • Designing complex cloud architectures that actually work
  • Solving identity and security challenges without overengineering
  • Writing Terraform that other people can maintain
  • Automating the tedious parts so humans can focus on interesting problems
