TL;DR: A production-grade blueprint for running Virtual Machines on Azure Kubernetes Service (AKS). This project demonstrates how to unify container and VM operations while solving the "Identity Gap" using Azure Arc—enabling true Azure AD SSH authentication with zero manual key management.
View the Complete Project on GitHub
Table of Contents
- The Problem: Operational Fragmentation
- What is KubeVirt?
- Architecture Overview
- The Identity Challenge: No IMDS
- Multi-Tenancy & Security
- Implementation Deep Dive
- Deployment Guide
- Technologies & Skills Demonstrated
The Problem: Operational Fragmentation {#the-problem}
The Reality of Enterprise IT
Here's a truth nobody talks about at cloud conferences: most enterprises aren't running everything in containers. They're not even close.
While we celebrate microservices and Kubernetes, the reality on the ground looks different. Organizations still depend heavily on Virtual Machines for their most critical workloads:
- Legacy Databases like Oracle and SQL Server that would require months of refactoring to containerize properly
- Proprietary Software with licensing tied to specific OS configurations
- Compliance-bound Workloads that regulators insist must run in isolated VMs
- Lift-and-Shift Migrations that moved to the cloud but never got modernized
This isn't a failure—it's pragmatism. These VMs run the systems that actually make money.
The "Two-Stack Problem"
But here's where things get messy. Organizations end up managing two completely separate infrastructure stacks:
| Aspect | Container Stack | VM Stack |
|---|---|---|
| Orchestration | Kubernetes | vSphere, Hyper-V, Azure VMs |
| CI/CD Pipeline | ArgoCD, Flux, Jenkins | Separate scripts, manual deployment |
| Monitoring | Prometheus, Grafana | vRealize, SCOM, Azure Monitor |
| Networking | CNI (Calico, Cilium) | NSX, Azure VNet |
| Access Control | Kubernetes RBAC | AD Groups, SSH Keys |
The Hidden Costs of Two-Stack Operations
The Solution: Unified Operations with KubeVirt
What if you could run your VMs on the same platform as your containers?
This is exactly what KubeVirt enables. By treating VMs as Kubernetes objects, you collapse two stacks into one:
- One Pipeline: Deploy VMs with the same GitOps workflows as your microservices
- One Monitoring Stack: Prometheus and Grafana for everything
- One Access Model: Kubernetes RBAC governs who can create, start, and stop VMs (see the Role sketch after this list)
- One Team: Platform engineers manage the whole thing
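To make the access-model point concrete, here is a minimal sketch of a namespaced Role that lets a tenant group manage VMs. This is not the repo's actual manifest: the role name and namespace are illustrative, and start/stop/restart go through KubeVirt's subresource API.

```yaml
# Hypothetical Role: lets members of a tenant group manage VMs in their namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: vm-operator          # illustrative name
  namespace: student-labs
rules:
  # CRUD on VirtualMachine and VirtualMachineInstance objects
  - apiGroups: ["kubevirt.io"]
    resources: ["virtualmachines", "virtualmachineinstances"]
    verbs: ["get", "list", "watch", "create", "delete"]
  # Start/stop/restart are exposed via KubeVirt's subresource API group
  - apiGroups: ["subresources.kubevirt.io"]
    resources: ["virtualmachines/start", "virtualmachines/stop", "virtualmachines/restart"]
    verbs: ["update"]
```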
What is KubeVirt? VMs as Kubernetes Objects {#what-is-kubevirt}
The Core Idea
KubeVirt is a Kubernetes add-on that lets you run traditional Virtual Machines alongside containers. It extends the Kubernetes API with VM-specific resources like VirtualMachine, VirtualMachineInstance, and DataVolume.
Important distinction: KubeVirt doesn't emulate or containerize your VM. It runs a real KVM/QEMU hypervisor inside a Kubernetes Pod. The guest OS is a full, unmodified Linux or Windows installation.
How It Works Under the Hood
Component Breakdown
| Component | Role |
|---|---|
| virt-api | Extends the Kubernetes API to handle VirtualMachine resources |
| virt-controller | Manages VM lifecycle (create, start, stop, migrate) |
| virt-handler | DaemonSet on each node; interfaces with libvirt/QEMU |
| virt-launcher | Pod that hosts the actual VM; one per running VM |
| CDI (Containerized Data Importer) | Handles VM disk image imports from HTTP, S3, or registries |
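If you want to see these pieces on a live cluster, a couple of kubectl queries are enough. This assumes KubeVirt sits in the conventional kubevirt namespace, as it does in this project:

```bash
# List the KubeVirt control-plane pods; expect virt-api, virt-controller,
# virt-handler (one per node), and the virt-operator that manages them.
kubectl get pods -n kubevirt -o wide

# virt-launcher pods live in the tenant namespaces, one per running VM.
kubectl get pods -n student-labs -l kubevirt.io=virt-launcher
```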
VM Lifecycle in Kubernetes
A KubeVirt VM follows a familiar Kubernetes pattern:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: my-ubuntu-vm
  namespace: student-labs
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 2
        memory:
          guest: 4Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        - name: rootdisk
          dataVolume:
            name: my-ubuntu-vm-rootdisk
The running: true field is the desired state—the controller makes sure reality matches. DataVolumes handle disk provisioning, and the VM gets scheduled just like any other Pod, respecting taints, tolerations, and affinity rules.
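For completeness, here is roughly what the referenced DataVolume could look like. Treat it as a sketch: the image URL, disk size, and storage class are placeholders, not the repo's actual values.

```yaml
# Hypothetical DataVolume: CDI downloads the image and writes it into a new PVC.
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: my-ubuntu-vm-rootdisk
  namespace: student-labs
spec:
  source:
    http:
      url: "https://example.com/images/ubuntu-22.04.img"  # placeholder URL
  pvc:
    storageClassName: kv-standard      # storage class defined later in this post
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi                  # placeholder size
```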
Architecture Overview {#architecture}
What We're Building
This project implements a multi-tenant university lab platform with three user types:
- Faculty from the Computer Science department running research VMs with generous resources
- Students running lab VMs with strict quotas to prevent abuse
- IT Administrators with full platform control
Node Pools Configuration
| Pool | VM Size | Purpose | Special Config |
|---|---|---|---|
| System | Standard_D2s_v3 | Run operators, CoreDNS | Tainted for critical add-ons only |
| KubeVirt | Standard_D4s_v3 | Run guest VMs | Taint: kubevirt.io/dedicated, Label: workload=kubevirt |
A word of caution: The KubeVirt node pool must use VM sizes that support nested virtualization. That means Dv3, Dv4, Dv5, Ev3, Ev4, or Ev5 series. B-series and pre-v3 D-series sizes won't work; I learned this the hard way.
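Scheduling onto that pool is the VM's job: its template needs a node selector for the workload=kubevirt label and a toleration for the taint. Here is a minimal fragment of a VirtualMachine's spec.template.spec; the taint effect is an assumption rather than a value taken from the repo:

```yaml
# Hypothetical scheduling fragment (goes under spec.template.spec of the VM).
nodeSelector:
  workload: kubevirt                # label applied to the KubeVirt node pool
tolerations:
  - key: "kubevirt.io/dedicated"    # taint applied to the pool
    operator: "Exists"
    effect: "NoSchedule"            # assumed effect; adjust to match the pool
```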
Storage Classes
| Class | SKU | Reclaim Policy | Use Case |
|---|---|---|---|
| kv-premium-retain | Premium_LRS | Retain | Production VM disks (data survives VM deletion) |
| kv-standard | StandardSSD_LRS | Delete | Ephemeral and test VMs |
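For reference, a class like kv-premium-retain boils down to a StorageClass backed by the Azure Disk CSI driver. This is a sketch; the exact definition in storage.tf may differ:

```yaml
# Hypothetical StorageClass roughly matching kv-premium-retain.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kv-premium-retain
provisioner: disk.csi.azure.com       # Azure Disk CSI driver
parameters:
  skuName: Premium_LRS                # premium SSD managed disks
reclaimPolicy: Retain                 # keep the disk when the PVC is deleted
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```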
The Identity Challenge: Solving the IMDS Gap {#identity-challenge}
This is where things get interesting—and where I spent most of my debugging time.
The Problem
Every Azure VM can reach the Instance Metadata Service (IMDS) at 169.254.169.254. This service hands out managed identity tokens, instance metadata, and scheduled event notifications. Azure extensions like the AD SSH Login extension depend on it.
But KubeVirt VMs are nested inside an AKS node. When your guest VM tries to reach that link-local address, the request gets blocked by the pod's NAT layer.
The result? Your nested VM has no Azure identity. Standard Azure extensions fail silently.
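You can verify the gap from inside a guest. The probe below uses the standard IMDS endpoint and header; on a native Azure VM it returns instance metadata, while from a nested KubeVirt VM it times out (the timeout value is arbitrary):

```bash
# Standard IMDS probe: works on a native Azure VM, fails in a nested guest.
curl -s --connect-timeout 5 -H "Metadata: true" \
  "http://169.254.169.254/metadata/instance?api-version=2021-02-01" \
  && echo "IMDS reachable" || echo "IMDS unreachable - no native Azure identity"
```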
The Solution: Azure Arc
Azure Arc lets you project non-Azure machines into Azure Resource Manager. That includes on-premises servers, VMs in other clouds, and, crucially for us, nested VMs that can't reach IMDS.
With Arc, your KubeVirt VM gets:
- An Azure Resource Identity (a real resource ID in ARM)
- Managed Identity Equivalent for authenticating to Azure services
- Extension Support including the AADSSHLoginForLinux extension we need
The Registration Flow
Here's how it comes together:
The magic happens during cloud-init. The VM waits for network stability (KubeVirt NAT needs a moment), downloads the Arc agent, and registers itself using a service principal we created in Terraform. Once Arc confirms the connection, Terraform installs the SSH extension.
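For context, the extension install that Terraform performs is roughly equivalent to the Azure CLI call below. Treat the flags as a sketch, and note that <location> is a placeholder:

```bash
# Install the Azure AD SSH login extension on the Arc-connected machine.
az connectedmachine extension create \
  --machine-name lab-vm \
  --resource-group rg-uni-kubevirt \
  --location "<location>" \
  --name AADSSHLoginForLinux \
  --publisher Microsoft.Azure.ActiveDirectory \
  --type AADSSHLoginForLinux
```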
RBAC for SSH Access
Access control uses standard Azure roles:
| Role | Permissions | Assigned To |
|---|---|---|
| Virtual Machine Administrator Login | SSH + sudo | Faculty, IT Admins |
| Virtual Machine User Login | SSH only (no sudo) | Students |
| Azure Connected Machine Onboarding | Register new Arc machines | Arc Service Principal |
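Assigning these roles is ordinary Azure RBAC scoped to the Arc machine resource. A sketch, with the group object ID left as a placeholder:

```bash
# Look up the Arc machine's resource ID, then grant students SSH-only access.
MACHINE_ID=$(az connectedmachine show \
  --name lab-vm --resource-group rg-uni-kubevirt --query id -o tsv)

az role assignment create \
  --assignee "<students-group-object-id>" \
  --role "Virtual Machine User Login" \
  --scope "$MACHINE_ID"
```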
Multi-Tenancy & Security Model {#multi-tenancy}
Namespace-Based Isolation
We use Kubernetes Namespaces as the primary isolation boundary. Each tenant gets their own namespace with dedicated quotas, network policies, and RBAC bindings.
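Because AKS is integrated with Azure AD, those RBAC bindings can reference Azure AD group object IDs directly. A minimal sketch for the student namespace; the group ID is a placeholder and it reuses the hypothetical vm-operator Role sketched earlier:

```yaml
# Hypothetical RoleBinding: members of the student Azure AD group get the
# vm-operator Role inside their own namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: student-vm-operators
  namespace: student-labs
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: "<students-aad-group-object-id>"   # Azure AD group object ID
roleRef:
  kind: Role
  name: vm-operator                          # Role sketched earlier in this post
  apiGroup: rbac.authorization.k8s.io
```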
Security Controls
| Control | Implementation | Purpose |
|---|---|---|
| ResourceQuota | Per-namespace CPU/Memory/PVC limits | Prevent resource exhaustion |
| LimitRange | Per-VM resource caps | Stop one VM from eating all the quota |
| NetworkPolicy | Ingress/Egress rules | Network isolation between tenants |
| RBAC (K8s) | RoleBindings to Azure AD groups | Control who can manage VMs |
| RBAC (Azure) | VM Login roles | Control who can SSH into VMs |
| Node Taints | kubevirt.io/dedicated | Keep VMs on dedicated nodes |
Example: Student Namespace Security Configuration
apiVersion: v1
kind: ResourceQuota
metadata:
  name: student-lab-quota
  namespace: student-labs
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    persistentvolumeclaims: "5"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: student-vm-limits
  namespace: student-labs
spec:
  limits:
    - type: Container
      max:
        cpu: "2"
        memory: 4Gi
      default:
        cpu: "1"
        memory: 2Gi
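Quotas and limits govern resources; the network isolation row in the table above comes from NetworkPolicies. Here is a default-deny-ingress sketch for the same namespace (the repo's actual policies in tenancy.tf and networking.tf may be more granular):

```yaml
# Hypothetical NetworkPolicy: block all ingress into the student namespace by
# default; egress stays open so the Arc agent can still reach Azure endpoints.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: student-default-deny-ingress
  namespace: student-labs
spec:
  podSelector: {}        # applies to every pod, including virt-launcher pods
  policyTypes:
    - Ingress            # no ingress rules listed, so all ingress is denied
```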
Implementation Deep Dive {#implementation}
Terraform Structure
The infrastructure breaks down into logical files:
terraform/
├── main.tf # AKS cluster and node pools
├── providers.tf # Azure, Kubernetes, kubectl providers
├── variables.tf # Input variables with validation
├── outputs.tf # Connection strings and useful outputs
├── identity.tf # Azure AD groups, RBAC assignments
├── arc.tf # Azure Arc SP, roles, extension installer
├── platform.tf # KubeVirt and CDI operator deployment
├── tenancy.tf # Namespace, quota, network policy per tenant
├── storage.tf # StorageClass definitions
├── networking.tf # Egress network policies for operators
├── images.tf # VM image storage (Azure Blob)
├── virtualmachines.tf # Demo VM definition
└── templates/
    ├── cloud-init-arc.tftpl # Cloud-init for Arc-enabled VMs
    └── cloud-init-lab.tftpl # Cloud-init for basic VMs
The Critical Piece: Cloud-Init
The cloud-init script handles Arc registration and needs to be robust. It must deal with:
- Network delays while KubeVirt NAT stabilizes
- DNS resolution for Azure endpoints
- Transient API failures during registration
Key Cloud-Init Logic
wait_for_network() {
  for i in $(seq 1 60); do
    if curl -s --connect-timeout 5 https://management.azure.com > /dev/null 2>&1; then
      echo "[Arc] Network ready"
      return 0
    fi
    echo "[Arc] Waiting for network... ($i/60)"
    sleep 5
  done
  return 1
}

register_with_arc() {
  local max_retries=5
  local retry_delay=30
  for i in $(seq 1 $max_retries); do
    if azcmagent connect \
      --service-principal-id "$SP_ID" \
      --service-principal-secret "$SP_SECRET" \
      --tenant-id "$TENANT_ID" \
      --subscription-id "$SUB_ID" \
      --resource-group "$RG_NAME" \
      --location "$LOCATION" \
      --resource-name "$(hostname)"; then
      echo "[Arc] Registration successful"
      return 0
    fi
    echo "[Arc] Retrying in ${retry_delay}s... ($i/$max_retries)"
    sleep $retry_delay
    retry_delay=$((retry_delay * 2))
  done
  return 1
}
Terraform Patterns Worth Noting
| Pattern | Implementation | Why It Matters |
|---|---|---|
| Trigger-based Recreation | triggers in null_resource | Recreate VM when cloud-init changes |
| Dependency Management | Explicit depends_on chains | Correct deployment order |
| Sensitive Values | sensitive = true on SP secrets | Keep secrets out of logs |
Deployment Guide {#deployment}
What You'll Need
| Tool | Version | Purpose |
|---|---|---|
| Azure CLI | 2.40+ | Azure authentication and management |
| Terraform | 1.3+ | Infrastructure provisioning |
| kubectl | 1.24+ | Kubernetes interaction |
| Azure Subscription | — | Owner role required for RBAC |
Step-by-Step
# Clone and configure
git clone https://github.com/ykbytes/aks-kubevirt-arc-unilab.git
cd aks-kubevirt-arc-unilab
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your settings
# Deploy (takes 15-20 minutes)
az login
terraform init
terraform plan
terraform apply
# Get credentials and verify
az aks get-credentials --resource-group rg-uni-kubevirt --name aks-uni-platform
kubectl get kubevirt -n kubevirt # Should show: Deployed
kubectl get vm -n student-labs # Should show: lab-vm Running
# Connect with Azure AD
az ssh vm --name lab-vm --resource-group rg-uni-kubevirt
What to Expect During Deployment
Arc registration takes about 5-7 minutes. You'll see output like this (the empty status values in the first few minutes are normal; the VM is still booting and running cloud-init):
null_resource.arc_aad_ssh_extension[0] (local-exec): [Arc] Waiting for lab-vm to connect...
null_resource.arc_aad_ssh_extension[0] (local-exec): [Arc] Status: (attempt 1/90)
...
null_resource.arc_aad_ssh_extension[0] (local-exec): [Arc] Status: (attempt 27/90)
null_resource.arc_aad_ssh_extension[0] (local-exec): [Arc] Machine connected
Verification Commands
# Check Arc registration
az connectedmachine show --name lab-vm --resource-group rg-uni-kubevirt \
--query "{Name:name, Status:status}" -o table
# Check extension status
az connectedmachine extension list --machine-name lab-vm \
--resource-group rg-uni-kubevirt \
--query "[].{Name:name, Status:provisioningState}" -o table
# Alternative: VM console access
kubectl virt console lab-vm -n student-labs
What Success Looks Like
Connecting with az ssh vm drops you straight into an Azure AD session on the guest:
$ az ssh vm --name lab-vm --resource-group rg-uni-kubevirt
Welcome to Ubuntu 22.04.5 LTS
═══════════════════════════════════════════════════════════════
KubeVirt Lab VM - Azure Arc Enabled
═══════════════════════════════════════════════════════════════
Azure AD Authentication:
az ssh vm --name lab-vm --resource-group rg-uni-kubevirt
Required RBAC Roles:
• Virtual Machine Administrator Login - for sudo access
• Virtual Machine User Login - for standard user access
═══════════════════════════════════════════════════════════════
user@example.com@lab-vm:~$ whoami
user@example.com
Notice that whoami returns your Azure AD email, not a local username. No SSH keys were exchanged; Azure AD generated an ephemeral certificate automatically.
This is what makes the Arc approach worthwhile: a nested VM with no direct Azure identity becomes accessible via Azure AD credentials, just like a native Azure VM.
Technologies & Skills Demonstrated {#technologies}
Cloud & Infrastructure
| Technology | Usage |
|---|---|
| Azure Kubernetes Service (AKS) | Managed Kubernetes with workload identity |
| Azure Arc | Hybrid identity for nested VMs |
| Azure Blob Storage | VM image repository |
| Azure Managed Disks | Persistent storage for VM disks |
| Azure RBAC | Fine-grained SSH access control |
Kubernetes & Virtualization
| Technology | Usage |
|---|---|
| KubeVirt | VM orchestration on Kubernetes |
| CDI (Containerized Data Importer) | VM disk image management |
| Kubernetes RBAC | Namespace-level access control |
| NetworkPolicies | Tenant network isolation |
| ResourceQuotas & LimitRanges | Multi-tenant resource governance |
DevOps & Automation
| Technology | Usage |
|---|---|
| Terraform | Infrastructure as Code |
| Cloud-Init | VM bootstrap automation |
| Azure CLI | Scripted Azure operations |
What This Project Demonstrates
- Cloud Architecture: A scalable, multi-tenant platform on Azure
- Kubernetes Depth: KubeVirt, CDI, RBAC, NetworkPolicies working together
- Security Engineering: Zero-trust identity with Azure Arc
- Infrastructure as Code: Production-quality Terraform with proper patterns
- Problem Solving: A creative solution to the IMDS identity gap
Potential Extensions
| Extension | Description |
|---|---|
| GitOps Integration | Deploy VMs via ArgoCD or Flux |
| GPU Passthrough | Enable NVIDIA GPU for AI/ML workloads |
| Live Migration | Move VMs between nodes without downtime |
| Backup/DR | Integrate Velero for VM backup |
| Cost Management | Azure Cost Management tags and budgets |
About the Author
I'm a Cloud Platform Engineer focused on bridging legacy infrastructure and modern cloud-native operations. This project reflects my approach to real-world problems:
- Designing complex cloud architectures that actually work
- Solving identity and security challenges without overengineering
- Writing Terraform that other people can maintain
- Automating the tedious parts so humans can focus on interesting problems