Alex
We Had Secrets in Kubernetes. Then We Got Audited.

For the first two years of running workloads on AKS we stored secrets the way most teams do when they're moving fast. We created Kubernetes secrets, base64 encoded the values, committed the manifests to a private repo and told ourselves we'd clean it up later. Then a security audit flagged it and we had four weeks to fix it.

This is the story of that migration and what we learned doing it under pressure.

The Problem With Kubernetes Secrets

Kubernetes secrets are not actually secret in any meaningful security sense. They are base64 encoded, which is encoding, not encryption. Anyone with read access to the namespace can decode them in seconds. If your etcd is not encrypted at rest and someone gets access to a snapshot, they have all your secrets in plaintext. And if a developer accidentally commits a secret manifest to a repo before the gitignore catches it, the value is in git history forever.
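The decode step really is a one-liner. The secret and key names below are placeholders, not anything from our clusters:

```shell
# Anyone with read access to the namespace can pull and decode a secret.
# "my-app-secrets" and "DB_PASSWORD" are placeholder names.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get secret my-app-secrets -n my-namespace \
    -o jsonpath='{.data.DB_PASSWORD}' | base64 -d
fi

# base64 is reversible encoding, not encryption:
echo -n 'cGFzc3dvcmQ=' | base64 -d   # prints "password"
```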

We knew all of this. We just hadn't prioritised fixing it until the audit made it urgent.

The audit finding was specific. We had database connection strings, third party API keys and internal service tokens stored as Kubernetes secrets across eight namespaces on three clusters. The auditor wanted to see secrets stored in a dedicated secrets manager with audit logging, rotation support and access policies tied to identities rather than cluster-level permissions.

Azure Key Vault was the obvious answer. The question was how to get secrets from Key Vault into our pods without rebuilding how our applications consumed them.

What We Evaluated

Our applications expected secrets either as environment variables or as files mounted into the container. We didn't want to change application code as part of this migration. That constraint ruled out a few approaches.

The first option was to have applications call the Key Vault SDK directly. This would work but it meant changing code in a dozen services and introducing a new dependency and failure mode into each one. Not something we wanted to do under a four week deadline.

The second option was to use the Azure Key Vault Provider for Secrets Store CSI Driver. This runs as a DaemonSet on your nodes and lets you define a SecretProviderClass resource that maps Key Vault secrets to files mounted into your pods. Optionally it can sync those secrets into Kubernetes secrets so applications that read environment variables still work without code changes. This was the right fit for us.

The Architecture

The setup has three components working together.

The Secrets Store CSI Driver handles mounting secrets as volumes into pods. The Azure Key Vault Provider is the plugin that knows how to talk to Key Vault specifically. Workload Identity is how the pod authenticates to Key Vault without any credentials stored anywhere in the cluster.

Workload Identity is worth pausing on. The way it works is that a Kubernetes service account is federated with an Azure Managed Identity. When a pod using that service account makes a request to Key Vault, Azure verifies the federation and grants access based on the Key Vault access policy attached to the Managed Identity. No secrets are involved in the authentication. No tokens to rotate. No credentials to leak.
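On the Kubernetes side this mostly comes down to an annotation and a label. A minimal sketch of the service account, using the placeholder names that appear later in this post:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-service-account
  namespace: my-namespace
  annotations:
    # Client ID of the Managed Identity this service account is federated with
    azure.workload.identity/client-id: <managed-identity-client-id>
```

Pods that should use the identity also need the azure.workload.identity/use: "true" label on the pod template, so the mutating webhook injects the projected service account token.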

Setting it up looks like this.

First you enable the CSI driver and Workload Identity on your cluster.

az aks update \
  --name my-cluster \
  --resource-group my-rg \
  --enable-oidc-issuer \
  --enable-workload-identity

az aks enable-addons \
  --name my-cluster \
  --resource-group my-rg \
  --addons azure-keyvault-secrets-provider

Then you create a Managed Identity and give it access to Key Vault.

az identity create \
  --name my-workload-identity \
  --resource-group my-rg

az keyvault set-policy \
  --name my-keyvault \
  --object-id <identity-principal-id> \
  --secret-permissions get list

Then you federate the Kubernetes service account with the Managed Identity.

az identity federated-credential create \
  --name my-fed-credential \
  --identity-name my-workload-identity \
  --resource-group my-rg \
  --issuer <oidc-issuer-url> \
  --subject system:serviceaccount:my-namespace:my-service-account

Then you define a SecretProviderClass that maps which Key Vault secrets you want and how they should appear in the pod.

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: my-app-secrets
  namespace: my-namespace
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    clientID: <managed-identity-client-id>
    keyvaultName: my-keyvault
    tenantID: <tenant-id>
    objects: |
      array:
        - |
          objectName: db-connection-string
          objectType: secret
        - |
          objectName: api-key-external-service
          objectType: secret
  secretObjects:
    - secretName: my-app-secrets
      type: Opaque
      data:
        - objectName: db-connection-string
          key: DB_CONNECTION_STRING
        - objectName: api-key-external-service
          key: EXTERNAL_API_KEY

The secretObjects block is what creates the Kubernetes secret from the Key Vault values. Your pod can then reference it as an environment variable the same way it always did. No application changes required.
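For completeness, here is roughly what the pod spec side looks like. This is a trimmed sketch using the placeholder names from the SecretProviderClass above, not our exact manifest:

```yaml
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: my-service-account
      containers:
        - name: my-app
          image: my-app:latest
          env:
            - name: DB_CONNECTION_STRING
              valueFrom:
                secretKeyRef:
                  name: my-app-secrets   # synced via secretObjects
                  key: DB_CONNECTION_STRING
          volumeMounts:
            # The volume mount is what triggers the CSI driver; the
            # synced Kubernetes secret only exists while a pod mounts it.
            - name: secrets-store
              mountPath: /mnt/secrets-store
              readOnly: true
      volumes:
        - name: secrets-store
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: my-app-secrets
```

One detail worth knowing: the synced Kubernetes secret is created only while at least one pod mounts the CSI volume, so you cannot use secretObjects without also mounting the volume somewhere.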

What Broke During Migration

We migrated eight namespaces across three clusters over ten days. Here is what went wrong.

The sync delay we didn't know about

The CSI driver syncs secret values from Key Vault on a polling interval. The default is two minutes. We discovered this when we rotated a Key Vault secret during testing and the running pods kept using the old value for two minutes after the rotation. This is expected behaviour but it meant our rotation procedure needed to account for this lag. If you have a hard dependency on immediate propagation after rotation you need to either reduce the sync interval or plan for a rolling restart of affected pods after rotation completes.

We settled on a rotation runbook that updates the Key Vault secret, waits for the sync interval and then triggers a rollout restart on the affected deployments. Not fully automated yet but reliable.
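Written down, the runbook is short. A sketch under our assumptions (placeholder names, the default two minute sync interval plus some slack; NEW_VALUE is supplied by whoever runs it):

```shell
#!/usr/bin/env bash
set -euo pipefail

VAULT=my-keyvault
SECRET=db-connection-string
DEPLOYMENT=my-app
NAMESPACE=my-namespace
SYNC_WAIT=150   # default 2 minute poll interval, plus slack

# 1. Update the secret in Key Vault.
az keyvault secret set --vault-name "$VAULT" \
  --name "$SECRET" --value "$NEW_VALUE"

# 2. Wait out the CSI driver's sync interval.
sleep "$SYNC_WAIT"

# 3. Restart so every replica picks up the new value, and wait for it.
kubectl rollout restart "deployment/$DEPLOYMENT" -n "$NAMESPACE"
kubectl rollout status "deployment/$DEPLOYMENT" -n "$NAMESPACE"
```

If the two minute lag is too long, the AKS add-on also lets you shorten the poll interval (the --rotation-poll-interval flag when enabling secret rotation, if your CLI version supports it), at the cost of more Key Vault traffic.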

Pods fail to start if Key Vault is unreachable

This one caught us in staging. The CSI driver mounts the secret as a volume at pod startup. If Key Vault is unreachable when the pod starts the mount fails and the pod does not start at all. This is different from the old behaviour where the Kubernetes secret was already in the cluster and pod startup had no external dependency.

During a brief Key Vault connectivity issue we had pods failing to restart after a node recycling event. The pods that were already running were fine. Any pod that needed to start fresh was stuck.

The mitigation is to make sure your AKS clusters access Key Vault over a Private Endpoint rather than the public endpoint so the network path is more reliable and doesn't traverse the public internet. We had meant to do this from the start but hadn't gotten to it. This incident moved it up the priority list.

Access policy gaps we found late

When you grant a Managed Identity access to Key Vault, the classic access policies we used apply vault-wide; scoping to individual secrets means Azure RBAC role assignments at the secret scope. We initially granted vault-wide access to move fast. The auditor came back and asked us to scope permissions to specific secrets per workload. Going back and tightening that across all our Key Vault instances without breaking running workloads was tedious. We should have done it right the first time.

The lesson is to treat Key Vault access policies like you treat Kubernetes RBAC. Least privilege from day one is much easier than retrofitting it later.

What the Audit Outcome Looked Like

Four weeks after the finding we had all secrets migrated to Key Vault, Workload Identity configured on all clusters, Private Endpoints in place for Key Vault access and audit logging enabled in Azure Monitor so we could show exactly which identity accessed which secret and when. The auditor closed the finding.

The audit log piece turned out to be more useful than expected. We surfaced a service account that was pulling a secret it had no business accessing because a developer had copy-pasted a service account name from another namespace. We wouldn't have caught that without the logs.

What I'd Tell Someone Starting This Today

Don't wait for an audit. Set this up before you have secrets in Kubernetes at all. It takes a day to configure properly on a fresh cluster and it saves you the pain of migrating running workloads later.

Enable Private Endpoints for Key Vault before you go to production. The public endpoint works but any network dependency at pod startup is a reliability risk you don't need.

Scope access policies to specific secrets per workload from the beginning. The wildcard shortcut costs you later.

Set up alerts on Key Vault diagnostic logs for denied access attempts. It's the fastest way to catch misconfigured identities and the occasional developer testing something they shouldn't be.

And document your rotation procedure before you need it. The worst time to figure out how rotation works end-to-end across your clusters is when you're rotating a secret because it leaked.
