Introduction
By default, people don't like to think too much about security while developing a software project or testing a cool new infrastructure tool.
There's nothing wrong with that while you are developing that personal project you think will change your future and humanity's, but when it's time to put it into production and face the real world, that mindset must change.
In my current job, our team always tries to focus on the "what could go wrong" approach and add as much security as we can. One of the things we have been using for a good while is access control with Service Principals (the Client ID/Client Secret pair), and this could be considered good enough for the vast majority of cases, right?
In our case, it started to become a nightmare, since:
- We manage everything as code (thanks HashiCorp for the almighty Terraform)
- Our secrets are rotated every 90 days, and
- Some applications rely on the client secret at runtime to authenticate to other managed services (e.g. DocumentDBs, etc.)
So, every time we need to rotate these credentials, we need to coordinate the action with the engineering teams, to ensure the applications receive a fresh secret and continue to work as intended.
How can we solve it, you may ask? Well, here are some approaches we considered:
- Instrument the applications with a `/refresh` endpoint, so that whenever a secret gets rotated this endpoint can be called; or configure a "cron" which calls this endpoint from time to time, to ensure we always have the current version of the secret available. The problem is that this would introduce overhead in the application and, besides that, if we use a client secret to authenticate against the Key Vault, those permissions vanish too when it gets rotated.
- Use Workload Identities: this brings us two advantages in comparison with client_id/client_secret pairs:
  - First, we don't need to have the current version of the client secret anymore (yes, we still use client secrets in other parts of our infra, but those are not the topic here)
  - Second, we don't need to store a sensitive value as a secret on the cluster anymore. When the application starts, it requests a token and uses this token to authenticate against all the services it needs to work; if the user identity has permissions on those services, all good: access granted, application up and running.
In a nutshell, it works like this:
Source of image: https://azure.github.io/AKS-DevSecOps-Workshop/assets/images/module1/diagram.png
Or in a different view, like this:
Source of image: https://learn.microsoft.com/en-us/azure/aks/media/workload-identity-overview/aks-workload-identity-model.png#lightbox
Or like this, in a more detailed view "inside" the cluster:
Source of image: https://azure.github.io/azure-workload-identity/docs/images/flow-diagram.png
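Concretely, once a pod carries the opt-in label (more on that below), the mutating admission webhook injects the environment variables and the projected service account token the Azure SDK credential chain looks for. A simplified view of what the mutated container ends up with:

# Simplified view of what the webhook injects into a labeled pod
env:
  - name: AZURE_CLIENT_ID            # taken from the service account annotation
    value: <client id of the user assigned identity>
  - name: AZURE_TENANT_ID
    value: <your tenant id>
  - name: AZURE_FEDERATED_TOKEN_FILE # the projected, short-lived token
    value: /var/run/secrets/azure/tokens/azure-identity-token
  - name: AZURE_AUTHORITY_HOST
    value: https://login.microsoftonline.com/
volumeMounts:
  - name: azure-identity-token
    mountPath: /var/run/secrets/azure/tokens
    readOnly: true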
While looking for documentation, I started to (naively) think "Wow, this will be a walk in the park", since I found a guide which looks very complete and also a lab guide, part of an AKS DevSecOps workshop, both provided by Microsoft.
I also found this post from an Azure MVP and started to think "how difficult could it be, since it's so well documented?" and, oh boy, I couldn't be more wrong. Buckle up and follow me!
What they don't tell you
I had a hard time making this work while ensuring I did not break my existing cluster in the process (yes, it's a QA cluster, but I care about it anyway). So, I decided to compile a list of things the guides do not mention as prerequisites, to help you avoid the same issues I had while testing this thing. Here it goes:
Basic information:
- I'm using Terraform to deploy everything I used in this article, so I will show only the tf code to deploy what is new here.
- You can deploy everything here using just `az` commands and helm; in that case I would say to check the links from the original articles to get the commands, and follow this post to know what was not mentioned there :)
- The cluster is running Kubernetes v1.29.4
- I'm using service principals
Other prerequisites (these are mentioned in the guides linked before):
- Your AKS cluster must have OIDC and Workload Identities enabled
- If your cluster was created without those, it can be updated to gain these capabilities: issue an `az aks update --resource-group <YOUR RG NAME> --name <YOUR AKS CLUSTER NAME> --enable-oidc-issuer --enable-workload-identity` and you should be good to go after it completes.
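Once the update finishes, you can confirm the OIDC issuer is enabled and grab its URL (you will need it later for the federated credential):

# Confirm OIDC is enabled and fetch the issuer URL
az aks show --resource-group <YOUR RG NAME> --name <YOUR AKS CLUSTER NAME> \
  --query "oidcIssuerProfile.issuerUrl" -o tsv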
What is not mentioned, and I discovered the hard way during my implementation:
- The user identities which will be federated to your service principal/app registration must be created in the same resource group as your AKS cluster (at least, this was the only way I managed to make this thing work, after days of research and tests).
- The user identities will be the ones holding the permissions on the target resources (e.g. a Key Vault or a database) from now on, instead of your Service Principal.
- You MUST install this helm chart in order to use Workload Identities.
- And here I need to make a personal note: remember I mentioned we are trying to do things in the most secure way possible? Well, we are running Kyverno on our clusters, and one of the policies we use prevents a service account from being created if it doesn't have the `automountServiceAccountToken` parameter set to `false`. With this policy in place, every time you need to use a service account in your workloads and consume its credentials (token/certificates) mounted on the pods, you explicitly need to set this very same `automountServiceAccountToken` parameter to `true` inside your deployments (see the policy sketch right after this list).
- The official helm chart linked above does not cover this scenario, so I pushed a PR to extend the helm chart to have it in place; let's see if the good folks from MS review and approve it :)
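For reference, a Kyverno policy along these lines is what forces the explicit opt-in. This is a minimal sketch, not our exact production policy (the name and message are illustrative):

# Minimal sketch of a policy rejecting service accounts that don't
# explicitly disable token automount
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-automount-disabled
spec:
  validationFailureAction: Enforce
  rules:
    - name: serviceaccount-automount-false
      match:
        any:
          - resources:
              kinds:
                - ServiceAccount
      validate:
        message: "ServiceAccounts must set automountServiceAccountToken: false"
        pattern:
          automountServiceAccountToken: false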
How we deployed it
In this part, I will only show the TF code I added on top of our existing code, so I'm assuming you already have your infrastructure up and running (at least at a basic level).
Here goes the TF code I used to deploy this thing on our infrastructure.
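One note before the resources: the snippets below reference a few Terraform data sources from our existing code. Their names (and the lookup values) are assumptions you should adapt to your own setup; a minimal sketch:

# Data sources referenced by the snippets below (names and values are
# illustrative; adjust to your own resource group, cluster, vault, etc.)
data "azurerm_resource_group" "myqacluster" {
  name = "my-qa-cluster-rg"
}

data "azurerm_kubernetes_cluster" "myqacluster" {
  name                = "my-qa-cluster"
  resource_group_name = data.azurerm_resource_group.myqacluster.name
}

data "azurerm_key_vault" "mykeyvault" {
  name                = "my-keyvault"
  resource_group_name = data.azurerm_resource_group.myqacluster.name
}

data "azurerm_subscription" "mysubscription" {}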
- Workload Identity Webhook
# Install the mutating webhook for the Azure Workload Identity
resource "helm_release" "aad_workload_identity_webhook" {
  name             = "workload-identity-webhook"
  chart            = "./helm_charts/azure-workload-identity-webhook"
  namespace        = "azure-workload-identity-system"
  create_namespace = true

  set {
    name  = "azureTenantID"
    value = var.tenant_id
  }
}
Remember, in my case I had to make changes to the helm chart in order to install it while complying with our admission control policies. If you want to use the default helm chart, your code would be something like this:
# Install the mutating webhook for the Azure Workload Identity
resource "helm_release" "aad_workload_identity_webhook" {
  name             = "workload-identity-webhook"
  repository       = "https://azure.github.io/azure-workload-identity/charts"
  chart            = "workload-identity-webhook"
  namespace        = "azure-workload-identity-system"
  create_namespace = true

  set {
    name  = "azureTenantID"
    value = var.tenant_id
  }
}
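If you prefer plain Helm over Terraform, the equivalent install (using the chart's published repository) would be roughly:

# Plain Helm equivalent of the Terraform release above
helm repo add azure-workload-identity https://azure.github.io/azure-workload-identity/charts
helm repo update
helm install workload-identity-webhook azure-workload-identity/workload-identity-webhook \
  --namespace azure-workload-identity-system \
  --create-namespace \
  --set azureTenantID="<YOUR TENANT ID>"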
- User assigned identity
resource "azurerm_user_assigned_identity" "user_identity_qa_cluster" {
location = data.azurerm_resource_group.myqacluster.location
name = "qa-cluster-aad-user-identity-for-applications"
resource_group_name = data.azurerm_resource_group.myqacluster.name
}
- Service account
resource "kubernetes_service_account" "service_account_qa_cluster" {
metadata {
name = "qa-cluster-app1-sa"
namespace = "test-namespace"
annotations = {
"azure.workload.identity/client-id" = azurerm_user_assigned_identity.user_identity_qa_cluster.client_id # Here we need to set the client ID of our User Assigned Identity created above
}
labels = {
"azure.workload.identity/use" = "true"
}
}
automount_service_account_token = false # Do not mount the token in the pods if not explicit set
depends_on = [
azurerm_user_assigned_identity.user_identity_qa_cluster
]
}
- Federated identity credential
resource "azurerm_federated_identity_credential" "federated_credential_app1_qa_cluster" {
name = "qa-cluster-app1-federated-identity"
resource_group_name = data.azurerm_resource_group.myqacluster.name
audience = ["api://AzureADTokenExchange"]
issuer = data.azurerm_kubernetes_cluster.myqacluster.oidc_issuer_url
parent_id = azurerm_user_assigned_identity.user_identity_qa_cluster.id
subject = "system:serviceaccount:test-namespace:qa-cluster-app1-sa"
depends_on = [
azurerm_user_assigned_identity.user_identity_qa_cluster,
kubernetes_service_account.service_account_qa_cluster
]
}
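You can sanity-check the federation afterwards with the CLI (names taken from the resources above):

# List the federated credentials attached to the user assigned identity
az identity federated-credential list \
  --identity-name qa-cluster-aad-user-identity-for-applications \
  --resource-group <YOUR RG NAME>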
This covers the basics; however, we still need to grant permissions on the resources the user identity will connect to. In this example, let's give it permission to fetch secrets from an Azure Key Vault:
# Permissions for the User Assigned Identity to access the KeyVault
resource "azurerm_key_vault_access_policy" "aks_permissions_app1_qa_cluster" {
  key_vault_id = data.azurerm_key_vault.mykeyvault.id
  tenant_id    = data.azurerm_subscription.mysubscription.tenant_id
  object_id    = azurerm_user_assigned_identity.user_identity_qa_cluster.principal_id

  secret_permissions = [
    "Get",
    "List"
  ]

  depends_on = [
    azurerm_user_assigned_identity.user_identity_qa_cluster
  ]
}
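If your Key Vault uses the newer RBAC permission model instead of access policies, the equivalent would be a role assignment; a sketch, assuming the built-in "Key Vault Secrets User" role covers your needs:

# RBAC alternative: grant the identity read access to the vault's secrets
resource "azurerm_role_assignment" "kv_secrets_user_app1_qa_cluster" {
  scope                = data.azurerm_key_vault.mykeyvault.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_user_assigned_identity.user_identity_qa_cluster.principal_id
}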
Testing it
To test it, you can deploy an example application, like the one shown in the Lab guide linked at the beginning of this article, something like:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: quick-start-test-workload-identities
  namespace: test-namespace
  labels:
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: qa-cluster-app1-sa
  automountServiceAccountToken: true
  containers:
    - image: ghcr.io/azure/azure-workload-identity/msal-net
      name: oidc
      env:
        - name: KEYVAULT_URL
          value: https://my-keyvault-address.vault.azure.net/
        - name: SECRET_NAME
          value: testsecret
EOF
When the pod starts, you should see log output confirming the secret was fetched from the Key Vault.
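You can follow it with:

# Follow the test pod logs; a successful run prints the fetched secret
kubectl logs -n test-namespace quick-start-test-workload-identities -f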
What about production? How it should work
Well, in my case here, we are replacing the triad:
- Tenant ID
- Client ID
- Client Secret
stored in a secret and consumed as env vars (like `CLIENT_ID`) with the Workload Identity. To do so, you just need to:
- Add the `azure.workload.identity/use: "true"` label to the pod `spec` section.
- Add the service account you want to use, like `serviceAccountName: qa-cluster-app1-sa`, and
- If you are tightening the controls for service accounts with admission policies like us, include `automountServiceAccountToken: true` alongside the `serviceAccountName` definition (see the snippet right after this list).
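Put together, the relevant part of a Deployment would look roughly like this (a sketch reusing the names from the examples above):

# Relevant part of a Deployment pod template
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"   # opt the pods into the webhook mutation
    spec:
      serviceAccountName: qa-cluster-app1-sa  # the federated service account
      automountServiceAccountToken: true      # required if your policies default it to false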
Spring Boot applications using the Azure SDK will, by default, try all authentication methods before failing the startup. With these changes in place, you can delete your old secret (and the references that mount it in your `Deployment` definition), and the new ReplicaSet should show you something similar to this in the logs:
{"message":"Azure Identity => Attempted credential EnvironmentCredential is unavailable.","timestamp":1722516170180,"thread.name":"cosmos-parallel-1","log.level":"INFO","logger.name":"com.azure.identity.ChainedTokenCredential","class.name":"com.azure.core.util.logging.ClientLogger","method.name":"performLogging","line.number":517,"entity.guid":"MzEzNjQyMXxBUE18QVBQTElDQ7576714NjI3ODIyNjY1OA==","hostname":"my-application-7d6bd64ccd-q4k42","span.id":"","trace.id":"","entity.type":"SERVICE","entity.name":"my-application"}
{"message":"Azure Identity => Attempted credential WorkloadIdentityCredential returns a token","timestamp":1722516170276,"thread.name":"az-identity-1","log.level":"INFO","logger.name":"com.azure.identity.ChainedTokenCredential","class.name":"com.azure.core.util.logging.ClientLogger","method.name":"performLogging","line.number":517,"entity.guid":"MzEzNjQyMXxBUE18QVBQTElDQ7576714NjI3ODIyNjY1OA==","hostname":"my-application-7d6bd64ccd-q4k42","span.id":"","trace.id":"","entity.type":"SERVICE","entity.name":"my-application"}
{"message":"{\"az.sdk.message\":\"Acquired a new access token\"}","timestamp":1722516170278,"thread.name":"az-identity-1","log.level":"INFO","logger.name":"com.azure.core.credential.SimpleTokenCache","class.name":"com.azure.core.util.logging.LoggingEventBuilder","method.name":"performLogging","line.number":371,"entity.guid":"MzEzNjQyMXxBUE18QVBQTElDQ7576714NjI3ODIyNjY1OA==","hostname":"my-application-7d6bd64ccd-q4k42","span.id":"","trace.id":"","entity.type":"SERVICE","entity.name":"my-application"}
Uncovered scenarios
The helm chart with the mutating webhook also includes the capability of injecting a sidecar container into the pod to retrieve the secrets. In our use case this was not needed, since automounting the service account token already did the trick, but if you need it, you have to add an annotation to your deployment spec, like: `azure.workload.identity/inject-proxy-sidecar: "true"`
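If you go that route, the annotation goes on the pod template metadata; a minimal sketch (note the value must be a quoted string in YAML):

# Opt-in annotation for the proxy sidecar injection
spec:
  template:
    metadata:
      annotations:
        azure.workload.identity/inject-proxy-sidecar: "true"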
Conclusion
Hopefully these aggregated steps will save you a day (or two) while deploying this thing!