Oluwagbade Odimayo

Posted on Jun 4

From EKS to AKS: I Rebuilt My AWS Pipeline on Azure in 6 Hours. Here is What Actually Happened.

#kubernetes #azure #terraform #devops

I have been building on AWS for a while. EKS, ECR, IAM, GitHub Actions -- I know that world reasonably well. But almost every DevOps role I looked at in the UK specified Azure. Not AWS. Azure DevOps. AKS. Key Vault. Terraform against the azurerm provider.

So I gave myself a weekend afternoon and a target: build a production-grade deployment platform on Azure from scratch -- infrastructure, CI/CD, secrets management, observability -- and document everything honestly. Not a tutorial. Not a sanitised walkthrough. Everything that broke, everything that surprised me, and what I would do differently.

The result is MediFlow -- a clinical data ingestion API running on AKS, deployed through a three-stage Azure DevOps pipeline, with secrets pulled from Key Vault at runtime and Prometheus scraping both pods. Six hours, start to finish.

Here is what actually happened.

The application (five minutes, intentionally)

MediFlow is a FastAPI service. Three endpoints: POST a clinical trial submission, GET it back by UUID, GET /health for the readiness probe. Pydantic validation, in-memory store, 15 pytest test cases. I spent five minutes on the application because the application is not the point. The infrastructure is the point.

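import uuid

# store is the in-memory record store; RecordIn and RecordOut are the Pydantic models
@app.post("/records", response_model=RecordOut, status_code=201)
def create_record(payload: RecordIn):
    record_id = str(uuid.uuid4())  # server-generated ID, returned to the caller
    record = RecordOut(id=record_id, **payload.model_dump())
    store[record_id] = record
    return record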

The validation model enforces real clinical data constraints -- site IDs alphanumeric only, trial phase a strict enum (I, II, III, IV), patient count between 1 and 10,000. Realistic enough to be interesting, simple enough that it does not distract from the infrastructure work.
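
To make those three rules concrete, here is a minimal, dependency-free sketch of the same constraints using only the standard library. It is not the MediFlow model: the real service uses Pydantic, and the field names site_id, trial_phase, and patient_count are assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from enum import Enum


class TrialPhase(str, Enum):
    """The four phases named in the post; anything else is rejected."""
    I = "I"
    II = "II"
    III = "III"
    IV = "IV"


SITE_ID_PATTERN = re.compile(r"[A-Za-z0-9]+")


@dataclass(frozen=True)
class RecordIn:
    # Hypothetical field names; the real Pydantic model may differ.
    site_id: str
    trial_phase: TrialPhase
    patient_count: int

    def __post_init__(self):
        if not SITE_ID_PATTERN.fullmatch(self.site_id):
            raise ValueError("site_id must be alphanumeric")
        # Coerce plain strings such as "II" into the enum; unknown phases raise ValueError.
        object.__setattr__(self, "trial_phase", TrialPhase(self.trial_phase))
        if not 1 <= self.patient_count <= 10_000:
            raise ValueError("patient_count must be between 1 and 10,000")


record = RecordIn(site_id="UK01", trial_phase="II", patient_count=250)
print(record.trial_phase.value, record.patient_count)
```

With Pydantic the same rules would normally live in field constraints rather than in __post_init__, which also gives FastAPI the 422 responses for free.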

Terraform: familiar structure, unfamiliar details

I write Terraform regularly. Switching from the aws provider to azurerm felt like switching keyboards -- same layout, some keys in different places.

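The first thing that caught me: Azure splits this work across two providers. azurerm covers infrastructure, including RBAC role assignments. azuread covers Microsoft Entra ID objects such as applications, service principals, and groups. I declared both so the identity wiring had everything it might need.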

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
    azuread = {
      source  = "hashicorp/azuread"
      version = "~> 3.0"
    }
  }
}

Then I hit two problems before a single resource was created.

Kubernetes version. I specified kubernetes_version = "1.30". Azure rejected it. In UK South, version 1.30 is LTS-only and requires the Premium cluster tier. The standard tier versions have full patch numbers -- 1.35.4, not 1.35. Running az aks get-versions --location uksouth shows the difference. Once I specified the full patch version it worked immediately.
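
Picking the version by eye works, but it can also be scripted. The sketch below chooses the newest full patch version that is not LTS-only. It assumes the JSON shape I believe az aks get-versions --output json returns (a values list with version, capabilities.supportPlan, isPreview, and patchVersions); check that against your CLI version before relying on it. The sample data is made up.

```python
def version_key(version: str) -> tuple:
    """Turn '1.35.4' into (1, 35, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_standard_patch(versions_doc: dict) -> str:
    """Return the newest full patch version that is neither LTS-only nor preview."""
    candidates = []
    for minor in versions_doc.get("values", []):
        plans = (minor.get("capabilities") or {}).get("supportPlan") or []
        if "KubernetesOfficial" not in plans:
            continue  # only offered under long-term support
        if minor.get("isPreview"):
            continue
        candidates.extend(minor.get("patchVersions") or {})
    if not candidates:
        raise ValueError("no standard-tier Kubernetes version found")
    return max(candidates, key=version_key)


# Made-up sample in the assumed shape; real output will list different versions.
SAMPLE = {
    "values": [
        {
            "version": "1.30",
            "isPreview": None,
            "capabilities": {"supportPlan": ["AKSLongTermSupport"]},
            "patchVersions": {"1.30.14": {"upgrades": []}},
        },
        {
            "version": "1.34",
            "isPreview": None,
            "capabilities": {"supportPlan": ["KubernetesOfficial"]},
            "patchVersions": {"1.34.2": {"upgrades": ["1.35.4"]}},
        },
        {
            "version": "1.35",
            "isPreview": None,
            "capabilities": {"supportPlan": ["KubernetesOfficial"]},
            "patchVersions": {"1.35.4": {"upgrades": []}},
        },
    ]
}

print(latest_standard_patch(SAMPLE))
```

In practice the dictionary would come from json.load on the CLI output, and the result would be passed to Terraform as the kubernetes_version value.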

VM size availability. I used Standard_B2s -- the kind of small general-purpose size I would pick by default on AWS. Azure rejected it too. That SKU is not available in my subscription in UK South. The error message returned the full list of available SKUs, which was actually helpful. I picked Standard_D2ps_v6: 2 vCPUs, ARM-based, cost-efficient, available. That ARM decision came back to bite me later in an interesting way.

Once those were sorted, terraform apply provisioned 12 resources: resource group, VNet with two subnets, AKS cluster, ACR, Key Vault, Log Analytics workspace, and the role assignments wiring them together.

On ACR vs ECR: The pull access model is different. On AWS you attach an IAM policy to the node group role. On Azure you create an AcrPull role assignment scoped to the registry, assigned to the AKS kubelet identity. Admin access on the registry stays disabled -- that is the production-correct approach.

resource "azurerm_role_assignment" "aks_acr_pull" {
  principal_id         = azurerm_kubernetes_cluster.main.kubelet_identity[0].object_id
  role_definition_name = "AcrPull"
  scope                = azurerm_container_registry.main.id
}

Clean. No stored credentials. The kubelet identity pulls images by virtue of its role, nothing else.

Azure DevOps: the pipeline that refused to run

I have used GitHub Actions extensively. Azure DevOps Pipelines is structurally similar -- YAML, stages, jobs, steps -- but the surrounding ecosystem is different. You work within organisations and projects, you create service connections to external resources, and you reference those connections by name in the pipeline YAML.

The pipeline I wrote has three stages:

stages:
  - stage: Test
    jobs:
      - job: Pytest
        steps:
          - script: python3 -m pytest tests/ -v --tb=short

  - stage: Build
    dependsOn: Test
    condition: succeeded()
    jobs:
      - job: BuildPush
        steps:
          - task: Docker@2
            inputs:
              command: buildAndPush
              containerRegistry: acr-mediflow

  - stage: Deploy
    dependsOn: Build
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: HelmDeploy
        environment: dev
        strategy:
          runOnce:
            deploy:
              steps:
                - task: HelmDeploy@1

Straightforward enough. Except the pipeline would not run.

The agent parallelism gate. Microsoft-hosted agents require purchased parallel jobs for private projects in new Azure DevOps organisations -- even on pay-as-you-go Azure subscriptions. The free grant is no longer issued automatically to new organisations; it has to be requested from Microsoft and approved. My pipeline sat in the queue indefinitely.

The fix: a self-hosted agent running in Docker on my local machine. I built a custom Ubuntu 22.04 image with Docker CLI, Helm, kubectl, and the Azure CLI pre-installed, then mounted the Docker socket so the agent could build images:

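# <pat> is a placeholder for a personal access token; keep the real one out of scripts and shell history
docker run \
  -e AZP_URL="https://dev.azure.com/myorg" \
  -e AZP_TOKEN="<pat>" \
  -e AGENT_ALLOW_RUNASROOT=1 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v "$HOME/.kube":/root/.kube \
  azdevops-agent:local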

AGENT_ALLOW_RUNASROOT=1 is required because Docker containers run as root by default and the Azure DevOps agent refuses to start as root without it. Once the agent was running and registered to the Default pool, the pipeline picked up jobs immediately.

The ARM64 problem I should have seen coming

First deployment. Two minutes of watching the pipeline go green. Then this:

Failed to pull image: no match for platform in manifest

My Docker image was built for linux/amd64. My AKS nodes are Standard_D2ps_v6 -- ARM64. The architectures did not match. I had made the ARM64 node choice for cost reasons and then promptly forgotten about it when writing the Dockerfile.

The fix is docker buildx with multi-platform support. Instead of a standard docker build, the pipeline runs:

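# On a long-lived self-hosted agent the builder may already exist, so fall back to reusing it
docker buildx create --use --name multiarch || docker buildx use multiarch
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag acrmediflowdev.azurecr.io/mediflow:${BUILD_ID} \
  --push .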

This produces a manifest list -- one image tag that resolves to the correct architecture-specific image for whatever platform pulls it. AKS ARM64 nodes get the arm64 image. Any amd64 machine gets the amd64 image. This is the correct production approach regardless of your current node architecture. If your node pool ever changes, the image handles it automatically.
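
The selection a container runtime performs against a manifest list can be sketched in a few lines. This is a simplified model, not the logic of any particular runtime: it ignores architecture variants and assumes the OCI image index layout (a manifests array whose entries carry a platform object). The digests are placeholders.

```python
def select_manifest(image_index: dict, os_name: str, architecture: str) -> str:
    """Return the digest of the entry matching the puller's platform."""
    for entry in image_index.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == architecture:
            return entry["digest"]
    raise LookupError(f"no match for platform {os_name}/{architecture} in manifest")


# Placeholder digests; a real index holds sha256 digests of per-platform manifests.
AMD64_ONLY = {
    "manifests": [
        {"digest": "sha256:amd64-placeholder",
         "platform": {"os": "linux", "architecture": "amd64"}},
    ]
}
MULTI_ARCH = {
    "manifests": AMD64_ONLY["manifests"] + [
        {"digest": "sha256:arm64-placeholder",
         "platform": {"os": "linux", "architecture": "arm64"}},
    ]
}

print(select_manifest(MULTI_ARCH, "linux", "arm64"))
try:
    select_manifest(AMD64_ONLY, "linux", "arm64")
except LookupError as err:
    print(err)
```

The second call fails in the same way my first deployment did: the tag exists, but nothing behind it matches an arm64 node.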

Build time increased from about 45 seconds to around 90 seconds. Worth it.

Key Vault secret injection: the part that actually required thinking

This is where I spent most of my debugging time. The concept is simple: store a secret in Azure Key Vault, have the CSI driver pull it into the pod at runtime, expose it as an environment variable. No secrets in code, Helm values, environment files, or pipeline variables.

The implementation involves a SecretProviderClass that tells the CSI driver what to fetch:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: mediflow-keyvault
spec:
  provider: azure
  secretObjects:
    - secretName: mediflow-secrets
      type: Opaque
      data:
        - objectName: mediflow-api-key
          key: API_KEY
  parameters:
    clientID: "<csi-driver-client-id>"
    keyvaultName: "kv-mediflow-dev"
    tenantId: "<tenant-id>"
    objects: |
      array:
        - |
          objectName: mediflow-api-key
          objectType: secret

The deployment mounts the volume and reads the synced secret:

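# Container level: read the synced Kubernetes Secret and mount the CSI volume
env:
  - name: API_KEY
    valueFrom:
      secretKeyRef:
        name: mediflow-secrets
        key: API_KEY
volumeMounts:
  - name: secrets-store
    mountPath: "/mnt/secrets"
    readOnly: true
# Pod level: the CSI volume that references the SecretProviderClass above
volumes:
  - name: secrets-store
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: mediflow-keyvault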

This looks correct. The pod stayed in ContainerCreating for ten minutes.

Failure 1: no federated identity credentials. The error was AADSTS70025. When workload identity is enabled on AKS, service accounts authenticate via OIDC token exchange. The CSI driver's managed identity needs a federated credential wired to the cluster's OIDC issuer. Without it, every authentication attempt returns 401.

az identity federated-credential create \
  --name mediflow-csi-federated \
  --identity-name azurekeyvaultsecretsprovider-aks-mediflow-dev \
  --resource-group MC_rg-mediflow-dev_aks-mediflow-dev_uksouth \
  --issuer "<oidc-issuer-url>" \
  --subject system:serviceaccount:mediflow:default \
  --audience api://AzureADTokenExchange

Failure 2: wrong subject. I initially set the subject to system:serviceaccount:kube-system:azurekeyvaultsecretsprovider -- the CSI driver's own service account. The error told me the actual subject being presented was system:serviceaccount:mediflow:default -- the pod's service account in the mediflow namespace. The federated credential subject must match exactly what the pod presents when it requests a token. One character off and the whole chain breaks.
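
When the subject is in doubt, it helps to look at the claims in a service account token directly rather than guess. The sketch below decodes the payload of a JWT and compares iss, sub, and aud against what the federated credential expects. It deliberately does not verify the signature, so it is a debugging aid only, and the function names are my own rather than part of any Azure or Kubernetes SDK.

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def service_account_subject(namespace: str, name: str) -> str:
    """Subject format Kubernetes uses for service account tokens."""
    return f"system:serviceaccount:{namespace}:{name}"


def find_mismatches(token: str, expected_issuer: str, expected_subject: str,
                    expected_audience: str = "api://AzureADTokenExchange") -> list:
    """List every claim that differs from the federated credential settings."""
    claims = jwt_claims(token)
    audience = claims.get("aud")
    audiences = audience if isinstance(audience, list) else [audience]
    problems = []
    if claims.get("iss") != expected_issuer:
        problems.append(f"iss is {claims.get('iss')!r}, expected {expected_issuer!r}")
    if claims.get("sub") != expected_subject:
        problems.append(f"sub is {claims.get('sub')!r}, expected {expected_subject!r}")
    if expected_audience not in audiences:
        problems.append(f"aud is {audiences!r}, expected {expected_audience!r}")
    return problems


print(service_account_subject("mediflow", "default"))
```

An empty list from find_mismatches means the three values line up; in my case the sub line would have pointed straight at the kube-system subject I had configured by mistake.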

After fixing the subject and granting the CSI identity Key Vault Secrets User on the vault, the pod started and the secret was live:

kubectl exec -n mediflow deploy/mediflow -- env | grep API_KEY
API_KEY=placeholder-rotate-before-production

Failure 3: Helm timeout. Even after everything worked, the pipeline was timing out at 5 minutes because the CSI mount adds startup latency to the pod. --timeout 10m on the Helm upgrade fixed it.

Three separate failures for one feature. Each one had a clear fix once you understood the error. The Azure documentation on workload identity federation is good but assumes you already know the mental model. If you are coming from AWS IRSA, the concepts are similar but the specific steps differ enough to trip you up.

Observability: the easy part

After Key Vault, Prometheus felt straightforward.

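# The chart repository has to be registered before the install
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword="<password>" \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --wait --timeout 10m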

The serviceMonitorSelectorNilUsesHelmValues=false flag matters -- without it, Prometheus only scrapes ServiceMonitors with a specific Helm release label. Setting it to false means it picks up all ServiceMonitors across all namespaces.
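
The effect of that flag is easier to see as a label match. The sketch below models my understanding of the chart's behaviour when no explicit serviceMonitorSelector is set: with the flag true the selector becomes release=<release name>, with it false the selector is empty and matches everything. The function names are illustrative, not part of Helm or the Prometheus operator.

```python
def effective_selector(nil_uses_helm_values: bool, release_name: str) -> dict:
    """Selector the chart renders when serviceMonitorSelector is left unset."""
    return {"release": release_name} if nil_uses_helm_values else {}


def selector_matches(match_labels: dict, labels: dict) -> bool:
    """An empty matchLabels selector matches every object."""
    return all(labels.get(key) == value for key, value in match_labels.items())


labelled = {"release": "monitoring", "app": "mediflow"}
unlabelled = {"app": "some-other-service"}

for flag in (True, False):
    selector = effective_selector(flag, "monitoring")
    print(flag, selector_matches(selector, labelled), selector_matches(selector, unlabelled))
```

With the flag left at true, a ServiceMonitor without the release label is silently ignored, which is a confusing failure mode because nothing reports an error.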

A ServiceMonitor in the mediflow namespace registers the application:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mediflow
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: mediflow
  endpoints:
    - port: http
      path: /health
      interval: 30s

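Both pods appeared as Prometheus targets within 30 seconds. 20 active targets total across the cluster. One caveat: a listed target only proves that discovery works. Prometheus expects the text exposition format, so if /health returns plain JSON the scrape itself will fail to parse, and a dedicated /metrics endpoint is the natural next step. Grafana accessible at localhost:3000. Done.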

Azure vs AWS: what actually differs

Having built similar infrastructure on both platforms, here is my honest read:

Identity is more explicit on Azure. AWS IAM is powerful but the mental model requires you to think in terms of policies, roles, and trust relationships across account boundaries. Azure managed identities with RBAC role assignments put the scope front and centre -- every assignment names a principal, a role, and the exact resource, resource group, or subscription it applies to. The workload identity federation model is the Azure equivalent of AWS IRSA, and once you understand the OIDC token exchange pattern, it clicks. Getting to that understanding takes longer on Azure because there are more moving parts to wire up.

Networking is comparable. Azure CNI gives pods real VNet IP addresses, same as AWS VPC CNI. The Terraform configuration is more explicit -- you specify subnet IDs, service CIDRs, and DNS IPs separately rather than having defaults inferred. More control, more things to get right.

CI/CD tooling reflects enterprise vs developer priorities. GitHub Actions is faster to set up and better for individual projects. Azure DevOps is built for teams inside the Microsoft ecosystem -- the service connection model, environment approvals, and audit trail are a strong fit for regulated environments. For a pharma or financial services context, Azure DevOps is a defensible default.

Managed Kubernetes is managed Kubernetes. Once you have a kubeconfig, kubectl works identically. The control plane differences are invisible day to day.

What six hours taught me

The infrastructure itself was not complicated -- Terraform, Helm, and Azure DevOps follow patterns I already knew. What took time was learning where Azure puts things differently: the two-provider Terraform setup, the full Kubernetes version strings, the VM SKU availability, the federated identity subject matching, the agent parallelism gate.

None of those are difficult problems. They are friction points that exist because the Azure ecosystem has different conventions from AWS, and those conventions are not always obvious from the documentation alone.

The most important thing I learned is that the Key Vault CSI driver workload identity chain -- OIDC issuer, federated credential, service account, subject -- is unforgiving. Every link must be correct. But once it is correct, it is genuinely elegant. No credentials stored anywhere. The cluster proves its identity cryptographically and the vault responds with the secret. That is how secrets management should work.

Push a commit. Four minutes later: tests pass, multi-arch image in ACR, two pods running on AKS ARM64 nodes, Key Vault secret injected, Prometheus scraping both pods, Grafana ready.

That is what production-grade looks like on Azure.

Repo: github.com/gbadedata/mediflow

Previously: From Minikube to AWS EKS: How I Built a Zero-Downtime Blue-Green Deployment Pipeline for ShopSwift