Nerav Doshi

Posted on • Originally published at pipelineandprompts.com

Secrets Management Across Multi-Cloud Pipelines

🛠️ Pipelines in the Wild #3

Pipeline & Prompts | Byte size guides on DevOps, Cloud and AI


⚡ Byte Size Summary

  • Secret management failures are invisible until they cause a production incident, so start with RBAC and namespace isolation before the first workload goes live
  • Storing secrets in a central vault solves the sprawl problem but introduces a new failure mode: rotation lag between the vault and the namespace-level Kubernetes secret
  • The real unsolved problem is not technical: it is knowing who owns the approval and escalation path when a credential rotates at 2 AM across a multi-timezone team

The Story

The deployment had been running fine in dev for two days. Same manifests, same pipeline, same container images. We promoted to production and the pods went straight into ImagePullBackOff.

Not a misconfigured resource limit. Not a broken liveness probe. A pull secret that existed in the dev namespace and nowhere else.

The registry was internal. The credential was real. Nobody had checked whether the secret existed in the production namespace: it had been created ad hoc during initial testing, the credential was kept in a local notepad, and everyone assumed someone else had handled it for prod.

What followed was several hours of degraded production, a delayed platform release, and five or six people across multiple time zones working from memory and Slack threads with no runbook in sight. The fix, once identified, took minutes. Finding the fix took hours.

That incident was the starting point of a long education in secret management. The immediate problem was a pull secret that existed in the dev namespace and was missing from prod. The real problem ran deeper, and it took an audit, an enterprise approval process, a failed secret rotation, and one very sharp observation from a more experienced engineer to understand what it actually was.


The Problem

In the early stages of a Kubernetes adoption, secrets are almost always an afterthought. The team is focused on getting workloads running, learning the platform, and delivering against commitments. Secrets get created when something fails, stored wherever is convenient, and recreated from memory the next time something breaks.

This works until it doesn't.

The failure mode is not just operational (a wrong namespace, a stale credential, a missed rotation). The deeper failure is structural. Kubernetes base64 encoding is not encryption. Any service account with read access to Secrets in a namespace can retrieve every secret in that namespace and decode the values in seconds. Without restrictive RBAC, dev service accounts can read prod database credentials. Without namespace isolation, a misconfigured workload in one environment can inadvertently consume secrets intended for another.
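
To make the base64 point concrete, here is a minimal sketch. The credential is a made-up value that stands in for an entry under data: in a Secret manifest; nothing here talks to a cluster.

```shell
#!/bin/sh
# Hypothetical credential, used only for illustration.
plain='prod-db-password'

# What ends up in the Secret's data field: plain base64, with no key involved.
encoded="$(printf '%s' "$plain" | base64)"

# Anyone who can read the Secret object can reverse it with one command.
decoded="$(printf '%s' "$encoded" | base64 --decode)"

printf 'stored:  %s\n' "$encoded"
printf 'decoded: %s\n' "$decoded"
```

The decode step needs no password or key, which is why read access to the Secret object is the only real control.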

Platform engineers moving into multi-cloud environments compound this problem. Each cloud has its own native secrets service. Each pipeline has its own credential requirements. Each environment has its own namespace structure. Without a deliberate architecture, secrets sprawl across notepads, environment variables, ConfigMaps used as secret storage, and Git commits that are very hard to fully expunge once they are pushed.

The incident cost was one day's delay on a significant platform release, discovered manually by a human checking on a deployment that had been quietly failing for hours. There was no alert. No monitor. No automated detection. Just someone who happened to look.


Why Existing Approaches Fall Short

Ad hoc secret creation per namespace

The natural first step. Create the secret where you need it, when you need it. Fast to start, impossible to maintain. Secrets diverge between environments, rotation becomes manual per namespace, and the source of truth is whoever created the secret last.

Kubernetes Secrets without RBAC

Kubernetes Secrets are base64 encoded, and neither vanilla Kubernetes nor OpenShift 4.x encrypts them at rest in etcd by default; etcd encryption has to be enabled explicitly. Even with encryption at rest switched on, without restrictive RBAC any service account that can read Secrets in a namespace can still read every secret in that namespace. In a shared cluster with dev and prod namespaces side by side, this is not a theoretical risk. It is a standing exposure that an audit will find immediately.

Cluster separation as a security boundary

Separating prod and dev onto different clusters contains blast radius but does not fix the underlying problem. Ad hoc secrets still get created. Rotation is still manual. Tribal knowledge still owns the recovery path. The incident can no longer cross environments, but within each environment, the same exposure exists.

Cloud-native secrets managers without a sync strategy

Centralizing secrets in a cloud-native vault is the right architectural move. But it introduces a new failure mode that most documentation does not cover: the sync gap. When a secret rotates in the vault, the namespace-level Kubernetes Secret object is a separate artifact. If the sync between vault and namespace fails, or if the pod is not restarted after a successful sync, the running workload is using a stale credential. The vault shows the rotation succeeded. The pod disagrees.


The Architecture

[Figure: Secret Management Architecture, Trust Boundaries and Sync Flow]

The diagram makes one point: secret management is a routing problem with two distinct failure points, the trust boundary between namespaces and the sync gap between the central vault and the Kubernetes Secret object.

The architecture has three layers.

Layer 1: Central Secrets Store

A cloud-native or self-hosted secrets manager holds the canonical value for every credential. Access to this layer is controlled by service account tokens scoped per environment. No developer has direct write access to production secrets in the central store. The CI/CD pipeline has read-only access, scoped to the secrets it needs for the environment it is deploying to. Human write access to prod secrets requires a break-glass process outside of automated rotation.

Layer 2: Sync Operator

The External Secrets Operator (ESO) runs inside the cluster and polls the central store on each ExternalSecret's refreshInterval. When it finds a rotated value, ESO reconciles the namespace-level Kubernetes Secret objects. This is the critical seam. If the operator fails, is misconfigured, or has not yet reached its next refresh, the Kubernetes secret is stale even though the vault value is current. ESO must be monitored and alerted on: it is a critical path dependency, not background infrastructure.

Layer 3: Namespace Isolation with RBAC

Prod and dev namespaces are isolated with explicit RBAC. Service accounts are scoped to their namespace. The prod service account cannot read dev secrets. The dev service account cannot read prod secrets. This is enforced at the API server level, not by convention.

The rotation lag problem is architectural, not operational. A pod that receives a credential through an environment variable, or reads it once at startup, keeps using the value it started with. Volume-mounted Secrets are refreshed by the kubelet after a delay, but that only helps if the application re-reads the file. Restarting the pod after a confirmed sync is the dependable way to guarantee the running workload is using the current credential. Without a process that enforces this, rotation and running workload credential state are eventually consistent at best.


How It Works: Step by Step

Prerequisites

  • OpenShift 4.12+ or Kubernetes 1.26+
  • Helm 3.x installed locally
  • A central secrets manager: this article covers AWS Secrets Manager (IRSA via STS), Azure Key Vault (Workload Identity), and HashiCorp Vault (Kubernetes auth)
  • Cluster-admin access to install the ESO operator and configure RBAC

Step 1: Install the External Secrets Operator

# Add the External Secrets Operator Helm repository
helm repo add external-secrets https://charts.external-secrets.io
helm repo update

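# Install ESO into its own namespace
# [AUTHOR TO VALIDATE] - confirm the latest stable chart version before the repo build.
# The manifests in this article use apiVersion external-secrets.io/v1, while chart 0.10.x
# serves v1beta1, so either pin a chart release that serves v1 or change the apiVersion.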
helm install external-secrets \
  external-secrets/external-secrets \
  --namespace external-secrets \
  --create-namespace \
  --set installCRDs=true \
  --version 0.10.0

Verify the operator is running before proceeding:

oc get pods -n external-secrets
# All pods should show Running status before applying any SecretStore or ExternalSecret

Step 2: Create a SecretStore scoped to each namespace

A SecretStore is namespace-scoped. Prod and dev each get their own; they never share one. Choose the provider block that matches your environment.

AWS Secrets Manager: IRSA via STS

# prod-secretstore-aws.yaml
apiVersion: external-secrets.io/v1
kind: SecretStore
metadata:
  name: prod-secretstore
  namespace: prod
spec:
  provider:
    aws:
      service: SecretsManager
      region: eu-west-1  # [AUTHOR TO VALIDATE] - set your region
      auth:
        jwt:
          serviceAccountRef:
            name: prod-workload-sa
            # This SA must carry the IAM role annotation; see the annotate command below

Annotate the service account with the IAM role ARN:

oc annotate serviceaccount prod-workload-sa \
  -n prod \
  eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/prod-secrets-reader
  # [AUTHOR TO VALIDATE] - replace account ID and role name

The IAM role requires a trust policy scoped to the cluster OIDC provider and a permissions policy granting secretsmanager:GetSecretValue against specific secret ARNs, not *.
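
As a rough sketch of what "specific secret ARNs" can look like, the snippet below writes a permissions policy file for the role. The account ID, region, and secret name are the placeholders already used in this article, and the policy file name and inline policy name are assumptions made for the example. Secrets Manager appends a random six-character suffix to secret ARNs, which is why the resource ends in -??????.

```shell
#!/bin/sh
# Sketch: least-privilege permissions policy for the prod-secrets-reader role.
# All identifiers are placeholders; replace them before use.
cat > prod-secrets-reader-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:eu-west-1:123456789012:secret:prod/registry/pull-secret-??????"
    }
  ]
}
EOF

# Attaching it with the AWS CLI would look like this (not run here):
# aws iam put-role-policy --role-name prod-secrets-reader \
#   --policy-name read-registry-pull-secret \
#   --policy-document file://prod-secrets-reader-policy.json
```

One statement per secret keeps the blast radius of the role visible in code review; a broader pattern such as prod/* trades that visibility for convenience.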

Azure Key Vault: Workload Identity

# prod-secretstore-azure.yaml
apiVersion: external-secrets.io/v1
kind: SecretStore
metadata:
  name: prod-secretstore
  namespace: prod
spec:
  provider:
    azurekv:
      authType: WorkloadIdentity
      vaultUrl: "https://<YOUR-KEYVAULT-NAME>.vault.azure.net"
      # [AUTHOR TO VALIDATE] - replace with your Key Vault URL
      serviceAccountRef:
        name: prod-workload-sa
        # This SA must carry the Workload Identity annotation; see the annotate command below

Annotate the service account with the managed identity client ID:

oc annotate serviceaccount prod-workload-sa \
  -n prod \
  azure.workload.identity/client-id=<MANAGED_IDENTITY_CLIENT_ID>
  # [AUTHOR TO VALIDATE] - replace with your managed identity client ID

The managed identity needs the Key Vault Secrets User role scoped to the specific Key Vault, not the subscription. The pod spec also requires this label in the Deployment's pod template metadata:

labels:
  azure.workload.identity/use: "true"

HashiCorp Vault: Kubernetes Auth

Kubernetes auth is the recommended starting point for Vault in an OpenShift environment. It uses the pod's projected service account token to authenticate, so no static credentials are stored anywhere.

# prod-secretstore-vault.yaml
apiVersion: external-secrets.io/v1
kind: SecretStore
metadata:
  name: prod-secretstore
  namespace: prod
spec:
  provider:
    vault:
      server: "https://vault.internal:8200"
      # [AUTHOR TO VALIDATE] - replace with your Vault server URL
      path: "secret"
      version: "v2"  # KV v2 is the current default secrets engine
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "prod-secret-reader"
          # [AUTHOR TO VALIDATE] - replace with your Vault role name
          serviceAccountRef:
            name: prod-workload-sa

Configure the Kubernetes auth backend on Vault once per cluster:

# Run against your Vault instance, not inside OpenShift
vault auth enable kubernetes

vault write auth/kubernetes/config \
  kubernetes_host="https://<OPENSHIFT_API_SERVER>:6443"
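  # [AUTHOR TO VALIDATE] - replace with your OpenShift API server URL.
  # When Vault runs outside the cluster, this config typically also needs
  # kubernetes_ca_cert and token_reviewer_jwt so Vault can validate tokens.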

vault write auth/kubernetes/role/prod-secret-reader \
  bound_service_account_names=prod-workload-sa \
  bound_service_account_namespaces=prod \
  policies=prod-secrets-policy \
  ttl=1h

Create a minimal Vault policy scoped to the specific secret path. Never use wildcards in prod:

# prod-secrets-policy.hcl
path "secret/data/prod/registry/pull-secret" {
  capabilities = ["read"]
}
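
The policy file still has to be loaded into Vault under the name the role refers to. A minimal sketch, assuming VAULT_ADDR and a token with policy-write rights are already set in the environment; load_prod_policy is a helper name invented for this example.

```shell
#!/bin/sh
# Sketch: upload the policy file shown above and print the role for review.
load_prod_policy() {
  # Store prod-secrets-policy.hcl under the name the role expects.
  vault policy write prod-secrets-policy prod-secrets-policy.hcl || return 1
  # Show the role so a reviewer can check the bound namespace, service account, and policies.
  vault read auth/kubernetes/role/prod-secret-reader
}

# Usage: load_prod_policy
```

Reading the role back after writing the policy gives a quick visual check that the policy name on the role and the policy name in Vault actually match.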

Apply the SecretStore manifest for your provider:

oc apply -f prod-secretstore-aws.yaml    # if using AWS
oc apply -f prod-secretstore-azure.yaml  # if using Azure
oc apply -f prod-secretstore-vault.yaml  # if using Vault

Step 3: Define an ExternalSecret to sync the pull secret

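The ExternalSecret fetches individual credential fields from the vault and assembles them into a valid kubernetes.io/dockerconfigjson secret in the namespace. The template below has the same shape for all three providers, and because every SecretStore in Step 2 is named prod-secretstore, the secretStoreRef does not change either. What does vary is the remoteRef key format: Azure Key Vault, for example, does not allow slashes in secret names.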

# prod-pull-secret-external.yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: registry-pull-secret
  namespace: prod
spec:
  refreshInterval: 1h
  # Note: 1h means up to 60 minutes rotation lag before the
  # namespace Secret reflects a vault change. Reduce for
  # time-sensitive credentials. Minimum recommended: 15m.
  secretStoreRef:
    name: prod-secretstore   # matches whichever SecretStore you applied in Step 2
    kind: SecretStore
  target:
    name: registry-pull-secret
    creationPolicy: Owner
    # Owner means ESO controls the lifecycle of this Secret.
    # If this ExternalSecret is deleted, the Secret is deleted with it.
    # Do not delete ExternalSecrets without understanding this behavior.
    template:
      type: kubernetes.io/dockerconfigjson
      data:
        .dockerconfigjson: |
          {
            "auths": {
              "{{ .registryHost }}": {
                "username": "{{ .registryUsername }}",
                "password": "{{ .registryPassword }}",
                "auth": "{{ printf "%s:%s" .registryUsername .registryPassword | b64enc }}"
              }
            }
          }
  data:
    - secretKey: registryHost
      remoteRef:
        key: prod/registry/pull-secret    # [AUTHOR TO VALIDATE] - path to your secret in the store
        property: host                    # [AUTHOR TO VALIDATE] - field name for registry hostname
    - secretKey: registryUsername
      remoteRef:
        key: prod/registry/pull-secret
        property: username                # [AUTHOR TO VALIDATE] - field name for username
    - secretKey: registryPassword
      remoteRef:
        key: prod/registry/pull-secret
        property: password                # [AUTHOR TO VALIDATE] - field name for password

Apply the manifest:

oc apply -f prod-pull-secret-external.yaml
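
The auth field in the template above is the part people most often get wrong by hand. The sketch below reproduces what the printf and b64enc pipeline computes, using throwaway values; docker_auth_field is a helper name invented for this example and is not part of ESO.

```shell
#!/bin/sh
# Sketch: the "auth" value is base64 of "username:password", nothing more.
docker_auth_field() {
  # $1 = registry username, $2 = registry password
  # tr removes the line wrapping some base64 implementations add to long output.
  printf '%s:%s' "$1" "$2" | base64 | tr -d '\n'
}

# Example with throwaway values:
docker_auth_field 'user' 'pass'   # prints dXNlcjpwYXNz
```

If a registry rejects a synced pull secret, comparing the decoded auth field against the username and password stored in the vault is a quick way to rule out a template mistake.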

Verify the sync completed and the Secret was created:

oc get externalsecret registry-pull-secret -n prod
# STATUS column must show: SecretSynced
# READY column must show: True

# Confirm the Secret exists and is correctly typed
oc get secret registry-pull-secret -n prod -o jsonpath='{.type}'
# Expected output: kubernetes.io/dockerconfigjson

If STATUS shows SecretSyncedError, check the ESO operator logs:

oc logs -n external-secrets \
  -l app.kubernetes.io/name=external-secrets \
  --tail=50

Step 4: Apply RBAC to lock down namespace secret access

# prod-secret-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: secret-reader
  namespace: prod
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]
    resourceNames: ["registry-pull-secret"]
    # Scoped to the named secret only, not wildcard access to all secrets.
    # "list" is omitted: RBAC cannot limit a plain list request by resourceNames.
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workload-secret-reader
  namespace: prod
subjects:
  - kind: ServiceAccount
    name: prod-workload-sa
    namespace: prod
roleRef:
  kind: Role
  name: secret-reader
  apiGroup: rbac.authorization.k8s.io

Apply the manifest:

oc apply -f prod-secret-rbac.yaml

This scopes the prod service account to read only the specific named secret it needs. Apply the equivalent for the dev namespace, scoped to dev secrets only. Neither service account should have cross-namespace access.
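
The isolation claim is cheap to verify rather than assume. The sketch below uses oc auth can-i with impersonation; dev-workload-sa is a hypothetical name for the dev namespace's service account, and the caller needs permission to impersonate service accounts.

```shell
#!/bin/sh
# Sketch: confirm the prod SA can read its pull secret and the dev SA cannot.
check_secret_isolation() {
  # "oc auth can-i" prints yes or no; it exits non-zero on "no", hence the "|| true".
  own="$(oc auth can-i get secret/registry-pull-secret -n prod \
    --as=system:serviceaccount:prod:prod-workload-sa)" || true
  cross="$(oc auth can-i get secret/registry-pull-secret -n prod \
    --as=system:serviceaccount:dev:dev-workload-sa)" || true
  [ "$own" = "yes" ] && [ "$cross" = "no" ]
}

# Usage: check_secret_isolation && echo "isolation holds"
```

Running this as a pipeline step after every RBAC change turns "enforced at the API server level" into something that is tested, not just stated.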

Step 5: Reference the secret in your workload

# prod-deployment.yaml (relevant section)
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"  # include only if using Azure Workload Identity
    spec:
      imagePullSecrets:
        - name: registry-pull-secret
      serviceAccountName: prod-workload-sa
      containers:
        - name: app
          image: registry.internal/org/app:latest

Step 6: Handle rotation explicitly

When a credential rotates in the central store, the ExternalSecret will re-sync within the refreshInterval. The running pod will not reliably pick up the new credential: values injected as environment variables or read once at startup stay as they were when the pod started. A rollout restart is required after every confirmed sync.

# Confirm the sync has completed before restarting
oc get externalsecret registry-pull-secret -n prod
# Confirm: STATUS = SecretSynced and READY = True

# Restart the deployment to pick up the rotated credential
oc rollout restart deployment/app -n prod

# Verify the rollout completes cleanly
oc rollout status deployment/app -n prod

Add this as an explicit named step in your rotation runbook, not a footnote. It is not optional and it is not automatic.
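
One way to make the runbook step hard to skip is to wrap it in a single function. This is a sketch under assumptions: rotate_and_restart is a name invented here, the timeouts are arbitrary, and the force-sync annotation is the manual trigger as I understand it from the ESO documentation, so confirm it against your ESO version.

```shell
#!/bin/sh
# Sketch: force a sync, wait for ESO to report Ready, then restart and verify.
# Arguments: namespace, ExternalSecret name, deployment name.
rotate_and_restart() {
  ns="$1"; es="$2"; deploy="$3"
  # Ask ESO to reconcile now instead of waiting for refreshInterval.
  oc annotate externalsecret "$es" -n "$ns" force-sync="$(date +%s)" --overwrite || return 1
  # Stop here if ESO does not report the ExternalSecret as Ready within two minutes.
  oc wait "externalsecret/$es" -n "$ns" --for=condition=Ready --timeout=120s || return 1
  # Only restart once the sync is healthy, then wait for the rollout to finish.
  oc rollout restart "deployment/$deploy" -n "$ns" || return 1
  oc rollout status "deployment/$deploy" -n "$ns" --timeout=300s
}

# Usage: rotate_and_restart prod registry-pull-secret app
```

Because Ready may already be True from the previous sync, treat the wait as a guard against a failing sync rather than proof that the new value has landed; comparing the Secret's content before and after is the stronger check.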

Rollback consideration

If a rotation introduces a bad credential (wrong value, wrong format, access not yet propagated in the provider), roll back the deployment to the previous revision first, then investigate:

oc rollout undo deployment/app -n prod
oc rollout status deployment/app -n prod

Note that oc rollout undo rolls back the deployment configuration, not the secret value. If the vault value itself is wrong, rolling back the deployment buys time but does not fix the underlying problem. Correct the value in the vault first, wait for ESO to re-sync, then trigger a new rollout. Do not attempt to fix the secret in place while the deployment is actively failing.


Security and Operational Considerations

RBAC is the first thing to configure, not the last

Kubernetes Secrets are base64 encoded. Any service account with get or list access to secrets in a namespace can retrieve and decode every credential stored there. Neither vanilla Kubernetes nor OpenShift 4.x encrypts etcd data by default; on both, encryption at rest has to be switched on. Verify your cluster's encryption at rest configuration before assuming the storage layer is protected. Apply Role and RoleBinding before the first secret is created in any namespace, and scope them to named resources, not wildcard access.
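
On OpenShift, the encryption setting can be read from the cluster-scoped APIServer resource. The sketch below fails unless an encryption type is set; etcd_encryption_enabled is a name invented for this example. On vanilla Kubernetes the equivalent check is whether the kube-apiserver runs with an encryption provider configuration.

```shell
#!/bin/sh
# Sketch: report the etcd encryption type on OpenShift, and fail if none is configured.
etcd_encryption_enabled() {
  enc_type="$(oc get apiserver cluster -o jsonpath='{.spec.encryption.type}')"
  case "$enc_type" in
    aescbc|aesgcm)
      printf 'etcd encryption: %s\n' "$enc_type" ;;
    *)
      # Empty or "identity" both mean data is stored unencrypted.
      printf 'etcd encryption is not enabled (type: %s)\n' "${enc_type:-unset}" >&2
      return 1 ;;
  esac
}

# Usage: etcd_encryption_enabled || echo "enable encryption before storing prod secrets"
```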

The sync operator is a critical dependency, so treat it as one

Once ESO is part of your architecture it is a critical path component. Monitor it. Alert on sync failures. ESO exposes the externalsecret_sync_calls_error metric; wire this to your alerting platform. A silent sync failure means your workload is running with a stale credential and you will not know until something breaks.

# Check ESO sync status across all ExternalSecrets in a namespace
oc get externalsecret -n prod
# Any STATUS other than SecretSynced needs immediate investigation
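
The status check above can be turned into a pass or fail gate for a scheduled job or pipeline step. This is a sketch; unsynced_externalsecrets is a name invented here, and it relies only on the Ready condition that ESO sets on each ExternalSecret.

```shell
#!/bin/sh
# Sketch: print ExternalSecrets whose Ready condition is not True; exit non-zero if any exist.
unsynced_externalsecrets() {
  # $1 = namespace
  oc get externalsecret -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
    awk -F '\t' 'NF && $2 != "True" { print $1; bad = 1 } END { exit bad }'
}

# Usage: unsynced_externalsecrets prod || echo "investigate the names listed above"
```

This complements metric-based alerting; it does not replace it, because it only sees the state at the moment it runs.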

The central secrets store itself needs RBAC

If the engineering team has full read/write access to the secrets manager, the blast radius of a compromised account is the entire vault. Separate write access from read access. Human write access to prod secrets should require a break-glass process outside of automated rotation. Document who holds that access and review it quarterly.

creationPolicy: Owner has a destructive side effect

When ESO owns a Secret's lifecycle, deleting the ExternalSecret deletes the Secret with it. In a multi-team environment, a developer deleting what appears to be a stale or misconfigured ExternalSecret will drop the credential from the namespace immediately. Make sure your team understands this behavior before granting delete access to ExternalSecret resources.

Define the rotation approval path before you need it

This is the thing that documentation does not cover. When a credential rotates at 2 AM in a multi-cloud environment with a team spread across time zones, who has the authority to approve the rotation in the central store? Who runs the oc rollout restart? Who confirms the rollout completed cleanly and signs off that prod is healthy?

Write this down before it happens. Name the people, define the escalation path, and put it somewhere a new team member can find it without a Slack thread.

Audit logs need active review, not passive collection

Most secrets managers generate audit logs for every read and write operation. These logs are only useful if someone is reviewing them. Wire secret access events into your SIEM or log aggregator and create alerts for anomalous patterns: unexpected reads, access from unrecognized service accounts, bulk secret reads that do not match a known pipeline run.


What Breaks at Scale

Rotation lag multiplies across namespaces

With one namespace and one workload, a manual oc rollout restart after rotation is manageable. With ten namespaces, thirty deployments, and a rotation event that cascades across dependent credentials, it does not scale. You need a rotation event handler: a pipeline step or operator webhook that triggers a rolling restart of affected workloads automatically after a confirmed sync. This is not a day-one problem. It becomes one at day ninety when the first coordinated rotation happens and nobody has automated the downstream restart.
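
A rotation event handler does not have to be elaborate. The sketch below is one possible shape, not a reference implementation: it hashes the synced Secret and writes the hash into the deployment's pod template, so a rollout happens only when the content actually changed. The annotation key secret-hash and the function name are arbitrary choices for this example, and sha256sum assumes GNU coreutils (on macOS, shasum -a 256 is the usual substitute).

```shell
#!/bin/sh
# Sketch: roll a deployment only when the synced Secret's content has changed.
# Arguments: namespace, secret name, deployment name.
restart_if_secret_changed() {
  ns="$1"; secret="$2"; deploy="$3"
  # Refuse to continue if the Secret cannot be read, so a failed read never triggers a rollout.
  data="$(oc get secret "$secret" -n "$ns" -o jsonpath='{.data}')" || return 1
  [ -n "$data" ] || return 1
  hash="$(printf '%s' "$data" | sha256sum | cut -d' ' -f1)"
  # The pod template changes only when the hash differs, so an unchanged Secret
  # results in no rollout and a changed one triggers a rolling update.
  oc patch deployment "$deploy" -n "$ns" --type merge \
    -p "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"secret-hash\":\"$hash\"}}}}}"
}

# Usage, for example from a scheduled pipeline job:
# restart_if_secret_changed prod registry-pull-secret app
```

Because the patch is idempotent, the same job can run on a schedule across every namespace without causing restarts when nothing rotated.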

Cross-cloud secret identity is unsolved by most teams

In a true multi-cloud deployment, with workloads on AWS, Azure, and an on-premises OpenShift cluster all consuming secrets, each platform has its own identity model for authenticating to the central store. The pipeline service account on AWS uses an IAM role. Workloads on Azure use a managed identity through Workload Identity. The OpenShift cluster on-premises uses a service account token projected via OIDC. Keeping these identity bindings consistent, rotated, and auditable across three environments is an operational challenge that most tooling handles partially at best.

The 2 AM problem at scale

With one team and one cluster, Slack and tribal knowledge is expensive but survivable. With multiple teams, multiple clusters, and a secrets manager that is a shared dependency, a rotation failure at 2 AM is a cross-team incident. The human routing problem (who owns the approval, who runs the restart, who confirms health across environments) does not get easier with scale. It gets harder. The runbook is not optional at this point. It is the difference between a thirty-minute recovery and a three-hour incident bridge.

Regulated environments add approval gates to the rotation path

In financial services or healthcare environments, credential rotation often requires a change approval before the rotation runs, not just after. This means the automated rotation flow needs to integrate with your change management tooling: a ServiceNow ticket, a Jira issue, an approval gate in the pipeline. The technical implementation is straightforward. Getting it through the approval process for a new tooling integration is the actual work.


What I'd Do Differently

Start with encrypted Git secrets before the first workload enters a namespace. Not as the end state, but as the minimum bar that establishes the habit. Leaked Git history is incredibly difficult to clean completely. An encrypted Git secret is easy to upgrade to an enterprise vault later. And it builds a security-first mindset within the engineering team from day one, before there is an incident to justify it.

The harder lesson: define the rotation runbook before the first secret is created in prod, not after the first rotation failure. The technical architecture is the easy part. Knowing who clicks approve at 2 AM is what breaks in production, and no documentation covers it because it is a people and process problem, not a Kubernetes problem.


Quick Recap

  • RBAC first, secrets second: configure namespace-level RBAC before the first secret is created; base64 encoding is not access control, and etcd encryption at rest is not enabled by default on vanilla Kubernetes
  • The sync gap is the rotation failure: a successful rotation in your central vault does not mean running pods are using the new credential; an explicit rollout restart after a confirmed ESO sync is required and must be in the runbook
  • Secret management is a human routing problem: the technical architecture is solvable; who owns the 2 AM approval and the cross-timezone escalation path is what breaks in production

GitHub Repo

Full implementation with working manifests for all three providers, RBAC templates, and rotation runbook:

[PLACEHOLDER - repo content in progress: pipelineandprompts-labs/secrets-management-multi-cloud]


What's Next?

Secret management is one half of the pipeline security conversation. The other half is what happens when the pipeline itself is the attack surface: supply chain security, signed commits, and verifying that the image running in prod is exactly the image that passed your tests.

Next in Pipelines in the Wild: Pipeline Supply Chain Security - Signing, Provenance, and Why Your CI/CD Pipeline is a Target.


Written by Pipeline & Prompts | Byte size guides on DevOps, Cloud and AI

Found this useful? Share it with the engineer on your team who is still creating secrets manually, and forward it to whoever owns the rotation runbook. If there is no rotation runbook, this article is for them.
