Introduction
By default, people don't like to think too much about security while developing a software project or testing a cool new infrastructure tool.
There's nothing wrong with that while you are developing that personal project you think will change your future and humanity's, but when it's time to put it into production and face the real world, that mindset must change.
In my current job, our team always tries to focus on the "what could go wrong" approach and add as much security as we can. One of the things we have been using for a good while is access control with Service Principals (the Client ID/Client Secret pair), and this could be considered good enough for the vast majority of cases, right?
In our case, it started to become a nightmare, since:
- We manage everything as code (thanks HashiCorp for the almighty Terraform)
- Our secrets are rotated every 90 days, and
- Some applications rely on the client secret at runtime to authenticate to other managed services (e.g. DocumentDBs, etc.)
So, every time we need to rotate these credentials, we need to coordinate the action with the engineering teams, to ensure the applications receive a fresh secret and continue to work as intended.
How can we solve it, you may ask? Well, here are some approaches we considered:
- Instrument the applications with a `/refresh` endpoint, so that whenever a secret gets rotated this endpoint can be called; or configure a "cron" which calls this endpoint from time to time, to ensure we always have the current version of the secret available. The problem is that this would introduce overhead in the application and, besides that, if we use a client secret to authenticate against the Key Vault, those permissions vanish too when it gets rotated.
- Use Workload Identities: this brings us two advantages in comparison with client_id/client_secret pairs:
  - First, we don't need to have the current version of the client secret anymore (yes, we still use client secrets in other parts of our infra, but those are not the topic here)
  - Second, we don't need to store a sensitive value as a secret on the cluster anymore. When the application starts, it requests a token and uses this token to authenticate against all the services it needs to work; if the user identity has permissions on those services, all good: access granted, application up and running.
In a nutshell, it works like this:
Source of image: https://azure.github.io/AKS-DevSecOps-Workshop/assets/images/module1/diagram.png
Or in a different view, like this:
Source of image: https://learn.microsoft.com/en-us/azure/aks/media/workload-identity-overview/aks-workload-identity-model.png#lightbox
Or like this, in a more detailed view "inside" the cluster:
Source of image: https://azure.github.io/azure-workload-identity/docs/images/flow-diagram.png
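Concretely, once a pod carries the opt-in label (more on that below), the mutating admission webhook injects the environment variables and the projected service account token the Azure SDK credential chain looks for. A simplified view of what the mutated container ends up with:

# Simplified view of what the webhook injects into a labeled pod
env:
  - name: AZURE_CLIENT_ID            # taken from the service account annotation
    value: <client id of the user assigned identity>
  - name: AZURE_TENANT_ID
    value: <your tenant id>
  - name: AZURE_FEDERATED_TOKEN_FILE # the projected, short-lived token
    value: /var/run/secrets/azure/tokens/azure-identity-token
  - name: AZURE_AUTHORITY_HOST
    value: https://login.microsoftonline.com/
volumeMounts:
  - name: azure-identity-token
    mountPath: /var/run/secrets/azure/tokens
    readOnly: true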
While looking for documentation, I started to (naively) think "Wow, this will be a walk in the park", since I found a guide which looks very complete and also a lab guide, part of an AKS DevSecOps workshop, both provided by Microsoft.
I also found this post from an Azure MVP and started to think "how difficult could it be, since it's so well documented?" and, oh boy, I couldn't be more wrong. Buckle up and follow me!
What they don't tell you
I had a hard time making this work while ensuring I did not break my existing cluster in the process (yes, it's a QA cluster, but I care about it anyway). So, I decided to compile a list of things the guides do not mention as prerequisites, to help you avoid the same issues I had while testing this thing. Here it goes:
Basic information:
- I'm using Terraform to deploy everything I used in this article, so I will show only the tf code to deploy what is new here.
- You can deploy everything here using just `az` commands and helm; in that case I would say to check the links from the original articles to get the commands, and follow this post to know what was not mentioned there :)
- The cluster is running Kubernetes v1.29.4
- I'm using service principals
Other prerequisites (these are mentioned in the guides linked before):
- Your AKS cluster must have OIDC and Workload Identities enabled
- If your cluster was created without those, it can be updated to gain these capabilities: issue an `az aks update --resource-group <YOUR RG NAME> --name <YOUR AKS CLUSTER NAME> --enable-oidc-issuer --enable-workload-identity` and you should be good to go after it completes.
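Once the update finishes, you can confirm the OIDC issuer is enabled and grab its URL (you will need it later for the federated credential):

# Confirm OIDC is enabled and fetch the issuer URL
az aks show --resource-group <YOUR RG NAME> --name <YOUR AKS CLUSTER NAME> \
  --query "oidcIssuerProfile.issuerUrl" -o tsv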
What is not mentioned, and I discovered the hard way during my implementation:
- The user identities which will be federated to your service principal/app registration must be created in the same resource group as your AKS cluster (at least, this was the only way I managed to make this thing work, after days of research and tests).
- The user identities will be the ones holding the permissions on the target resources (e.g. a Key Vault or a database) from now on, instead of your Service Principal.
- You MUST install this helm chart in order to use Workload Identities.
- And here I need to make a personal note: remember I mentioned we are trying to do things in the most secure way possible? Well, we are running Kyverno on our clusters, and one of the policies we use prevents a service account from being created if it doesn't have the `automountServiceAccountToken` parameter set to `false`. With this policy in place, every time you need to use a service account in your workloads and consume its credentials (token/certificates) mounted on the pods, you explicitly need to set this very same `automountServiceAccountToken` parameter to `true` inside your deployments (see the policy sketch right after this list).
- The official helm chart linked above does not cover this scenario, so I pushed a PR to extend the helm chart to have it in place; let's see if the good folks from MS review and approve it :)
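For reference, a Kyverno policy along these lines is what forces the explicit opt-in. This is a minimal sketch, not our exact production policy (the name and message are illustrative):

# Minimal sketch of a policy rejecting service accounts that don't
# explicitly disable token automount
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-automount-disabled
spec:
  validationFailureAction: Enforce
  rules:
    - name: serviceaccount-automount-false
      match:
        any:
          - resources:
              kinds:
                - ServiceAccount
      validate:
        message: "ServiceAccounts must set automountServiceAccountToken: false"
        pattern:
          automountServiceAccountToken: false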
How we deployed it
In this part, I will only show the TF code I added on top of our existing code, so I'm assuming you already have your infrastructure up and running (at least at a basic level).
Here goes the TF code I used to deploy this thing on our infrastructure.
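One note before the resources: the snippets below reference a few Terraform data sources from our existing code. Their names (and the lookup values) are assumptions you should adapt to your own setup; a minimal sketch:

# Data sources referenced by the snippets below (names and values are
# illustrative; adjust to your own resource group, cluster, vault, etc.)
data "azurerm_resource_group" "myqacluster" {
  name = "my-qa-cluster-rg"
}

data "azurerm_kubernetes_cluster" "myqacluster" {
  name                = "my-qa-cluster"
  resource_group_name = data.azurerm_resource_group.myqacluster.name
}

data "azurerm_key_vault" "mykeyvault" {
  name                = "my-keyvault"
  resource_group_name = data.azurerm_resource_group.myqacluster.name
}

data "azurerm_subscription" "mysubscription" {}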
- Workload Identity Webhook
# Install the mutating webhook for the Azure Workload Identity
resource "helm_release" "aad_workload_identity_webhook" {
  name             = "workload-identity-webhook"
  chart            = "./helm_charts/azure-workload-identity-webhook"
  namespace        = "azure-workload-identity-system"
  create_namespace = true

  set {
    name  = "azureTenantID"
    value = var.tenant_id
  }
}
Remember, in my case I had to make changes to the helm chart in order to install it while complying with our admission control policies. If you want to use the default helm chart, your code would be something like this:
# Install the mutating webhook for the Azure Workload Identity
resource "helm_release" "aad_workload_identity_webhook" {
  name             = "workload-identity-webhook"
  repository       = "https://azure.github.io/azure-workload-identity/charts"
  chart            = "workload-identity-webhook"
  namespace        = "azure-workload-identity-system"
  create_namespace = true

  set {
    name  = "azureTenantID"
    value = var.tenant_id
  }
}
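If you prefer plain Helm over Terraform, the equivalent install (using the chart's published repository) would be roughly:

# Plain Helm equivalent of the Terraform release above
helm repo add azure-workload-identity https://azure.github.io/azure-workload-identity/charts
helm repo update
helm install workload-identity-webhook azure-workload-identity/workload-identity-webhook \
  --namespace azure-workload-identity-system \
  --create-namespace \
  --set azureTenantID="<YOUR TENANT ID>"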
- User assigned identity
resource "azurerm_user_assigned_identity" "user_identity_qa_cluster" {
location = data.azurerm_resource_group.myqacluster.location
name = "qa-cluster-aad-user-identity-for-applications"
resource_group_name = data.azurerm_resource_group.myqacluster.name
}
- Service account
resource "kubernetes_service_account" "service_account_qa_cluster" {
metadata {
name = "qa-cluster-app1-sa"
namespace = "test-namespace"
annotations = {
"azure.workload.identity/client-id" = azurerm_user_assigned_identity.user_identity_qa_cluster.client_id # Here we need to set the client ID of our User Assigned Identity created above
}
labels = {
"azure.workload.identity/use" = "true"
}
}
automount_service_account_token = false # Do not mount the token in the pods if not explicit set
depends_on = [
azurerm_user_assigned_identity.user_identity_qa_cluster
]
}
- Federated identity credential
resource "azurerm_federated_identity_credential" "federated_credential_app1_qa_cluster" {
name = "qa-cluster-app1-federated-identity"
resource_group_name = data.azurerm_resource_group.myqacluster.name
audience = ["api://AzureADTokenExchange"]
issuer = data.azurerm_kubernetes_cluster.myqacluster.oidc_issuer_url
parent_id = azurerm_user_assigned_identity.user_identity_qa_cluster.id
subject = "system:serviceaccount:test-namespace:qa-cluster-app1-sa"
depends_on = [
azurerm_user_assigned_identity.user_identity_qa_cluster,
kubernetes_service_account.service_account_qa_cluster
]
}
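You can sanity-check the federation afterwards with the CLI (names taken from the resources above):

# List the federated credentials attached to the user assigned identity
az identity federated-credential list \
  --identity-name qa-cluster-aad-user-identity-for-applications \
  --resource-group <YOUR RG NAME>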
This covers the basics; however, we still need to grant permissions on the resources the user identity will connect to. In this example, let's give it permission to fetch secrets from an Azure Key Vault:
# Permissions for the User Assigned Identity to access the KeyVault
resource "azurerm_key_vault_access_policy" "aks_permissions_app1_qa_cluster" {
  key_vault_id = data.azurerm_key_vault.mykeyvault.id
  tenant_id    = data.azurerm_subscription.mysubscription.tenant_id
  object_id    = azurerm_user_assigned_identity.user_identity_qa_cluster.principal_id

  secret_permissions = [
    "Get",
    "List"
  ]

  depends_on = [
    azurerm_user_assigned_identity.user_identity_qa_cluster
  ]
}
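If your Key Vault uses the newer RBAC permission model instead of access policies, the equivalent would be a role assignment; a sketch, assuming the built-in "Key Vault Secrets User" role covers your needs:

# RBAC alternative: grant the identity read access to the vault's secrets
resource "azurerm_role_assignment" "kv_secrets_user_app1_qa_cluster" {
  scope                = data.azurerm_key_vault.mykeyvault.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_user_assigned_identity.user_identity_qa_cluster.principal_id
}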
Testing it
To test it, you can deploy an example application, like the one shown in the Lab guide linked at the beginning of this article, something like:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: quick-start-test-workload-identities
  namespace: test-namespace
  labels:
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: qa-cluster-app1-sa
  automountServiceAccountToken: true
  containers:
    - image: ghcr.io/azure/azure-workload-identity/msal-net
      name: oidc
      env:
        - name: KEYVAULT_URL
          value: https://my-keyvault-address.vault.azure.net/
        - name: SECRET_NAME
          value: testsecret
EOF
When the pod starts, you should see log output confirming the secret was fetched from the Key Vault.
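You can follow it with:

# Follow the test pod logs; a successful run prints the fetched secret
kubectl logs -n test-namespace quick-start-test-workload-identities -f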
What about production? How it should work
Well, in my case here, we are replacing the triad:
- Tenant ID
- Client ID
- Client Secret
stored in a secret and consumed as env vars (like `CLIENT_ID`) with the Workload Identity. To do so, you just need to:
- Add the `azure.workload.identity/use: "true"` label to the pod `spec` section.
- Add the service account you want to use, like `serviceAccountName: qa-cluster-app1-sa`, and
- If you are tightening the controls for service accounts with admission policies like us, include `automountServiceAccountToken: true` alongside the `serviceAccountName` definition (see the snippet right after this list).
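Put together, the relevant part of a Deployment would look roughly like this (a sketch reusing the names from the examples above):

# Relevant part of a Deployment pod template
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"   # opt the pods into the webhook mutation
    spec:
      serviceAccountName: qa-cluster-app1-sa  # the federated service account
      automountServiceAccountToken: true      # required if your policies default it to false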
Spring Boot applications using the Azure SDK will, by default, try all authentication methods before failing the startup. With these changes in place, you can delete your old secret (and the references that mount it in your `Deployment` definition), and the new ReplicaSet should show you something similar to this in the logs:
{"message":"Azure Identity => Attempted credential EnvironmentCredential is unavailable.","timestamp":1722516170180,"thread.name":"cosmos-parallel-1","log.level":"INFO","logger.name":"com.azure.identity.ChainedTokenCredential","class.name":"com.azure.core.util.logging.ClientLogger","method.name":"performLogging","line.number":517,"entity.guid":"MzEzNjQyMXxBUE18QVBQTElDQ7576714NjI3ODIyNjY1OA==","hostname":"my-application-7d6bd64ccd-q4k42","span.id":"","trace.id":"","entity.type":"SERVICE","entity.name":"my-application"}
{"message":"Azure Identity => Attempted credential WorkloadIdentityCredential returns a token","timestamp":1722516170276,"thread.name":"az-identity-1","log.level":"INFO","logger.name":"com.azure.identity.ChainedTokenCredential","class.name":"com.azure.core.util.logging.ClientLogger","method.name":"performLogging","line.number":517,"entity.guid":"MzEzNjQyMXxBUE18QVBQTElDQ7576714NjI3ODIyNjY1OA==","hostname":"my-application-7d6bd64ccd-q4k42","span.id":"","trace.id":"","entity.type":"SERVICE","entity.name":"my-application"}
{"message":"{\"az.sdk.message\":\"Acquired a new access token\"}","timestamp":1722516170278,"thread.name":"az-identity-1","log.level":"INFO","logger.name":"com.azure.core.credential.SimpleTokenCache","class.name":"com.azure.core.util.logging.LoggingEventBuilder","method.name":"performLogging","line.number":371,"entity.guid":"MzEzNjQyMXxBUE18QVBQTElDQ7576714NjI3ODIyNjY1OA==","hostname":"my-application-7d6bd64ccd-q4k42","span.id":"","trace.id":"","entity.type":"SERVICE","entity.name":"my-application"}
Uncovered scenarios
The helm chart with the mutating webhook also includes the capability of injecting a sidecar container into the pod to retrieve the secrets. In our use case this was not needed, since automounting the service account token already did the trick, but if you need it, you have to add an annotation to your deployment spec, like: `azure.workload.identity/inject-proxy-sidecar: "true"`
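If you go that route, the annotation goes on the pod template metadata; a minimal sketch (note the value must be a quoted string in YAML):

# Opt-in annotation for the proxy sidecar injection
spec:
  template:
    metadata:
      annotations:
        azure.workload.identity/inject-proxy-sidecar: "true"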
Conclusion
Hopefully these aggregated steps will save you a day (or two) while deploying this thing!