
Amit Malhotra

GKE Security Theater: Why Your Secrets and Control Plane Access Are Probably Misconfigured

Most GKE clusters I audit have the same two problems: secrets that aren't actually secret, and control plane access policies that stopped making sense six months ago. Neither shows up as a critical finding until audit time. Both are fixable in a day — but the migration planning takes longer than the actual implementation.

The Real Problem Isn't Ignorance

Teams know environment variables aren't the right place for database credentials. They know IP-based allowlists for control plane access drift over time. The problem is that both configurations work fine right up until they don't.

A base64-encoded Kubernetes Secret functions correctly. Your CI/CD pipeline with a hardcoded IP allowlist deploys code. Nothing breaks — until someone rotates a secret and the pod fails to start, or your VPN provider changes IP ranges and suddenly your deployment pipeline can't reach the cluster.

These aren't security problems in the abstract. They're operational time bombs with audit implications.

In my experience working with SaaS companies across North America preparing for SOC 2, these two issues account for more remediation work than almost any other GKE finding. Not because they're difficult to fix, but because nobody planned for the migration.

What I Actually See in Production

Here's the pattern I encounter repeatedly:

Secrets stored as Kubernetes Secrets with no encryption configuration. The team assumes "Kubernetes Secrets are encrypted" because they're not plaintext. They're base64 encoded — which is encoding, not encryption. Without explicit CMEK (Customer-Managed Encryption Keys) configuration, those secrets sit in etcd with Google-managed encryption that you can't audit or control.
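You can prove the encoding-vs-encryption point in one line. This is a minimal demo; `aHVudGVyMg==` is just the base64 form of a made-up password:

```shell
# A Secret's data fields are plain base64. Anyone who can read the object
# (or the etcd backing store) recovers the value with no key material at all:
echo 'aHVudGVyMg==' | base64 -d   # prints: hunter2
```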

External Secrets Operator deployed but not monitored. ESO is a solid tool, but I've seen clusters where secret sync failures went undetected for weeks. The pod kept running on the old secret value cached in memory. When it restarted during a routine node upgrade, it couldn't pull the current secret and the application crashed. The incident report said "deployment failure" but the root cause was secret sync monitoring nobody configured.
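A basic health check takes one kubectl pipeline. This is a sketch assuming ESO's standard `ExternalSecret` status conditions; wire it (or ESO's Prometheus metrics) into whatever alerting you already run:

```bash
# List every ExternalSecret whose Ready condition is not True,
# i.e. syncs that are failing or in an unknown state right now.
kubectl get externalsecrets -A \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
  | awk '$3 != "True"'
```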

Control plane IP allowlists that grow but never shrink. A developer needed access from a new location, so someone added a CIDR block. The VPN provider changed, so someone added another. Six months later, the allowlist includes ranges the team can't even identify. The "fix" is usually making it more permissive because nobody wants to break production access.
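Auditing the current allowlist is one command. A sketch with placeholder cluster and region names; field paths come from the GKE API, so verify against your gcloud version:

```bash
# Dump every authorized CIDR alongside its display name so the
# unidentifiable ranges stand out. Review on a schedule, not at audit time.
gcloud container clusters describe CLUSTER --location=REGION \
  --flatten="masterAuthorizedNetworksConfig.cidrBlocks[]" \
  --format="table(masterAuthorizedNetworksConfig.cidrBlocks.displayName, masterAuthorizedNetworksConfig.cidrBlocks.cidrBlock)"
```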

GKE's native Secret Manager add-on exists but teams don't know about it. This went GA in 2024, but I still encounter teams running custom ESO setups for GCP-only secret management because nobody told them there's a simpler option now.

My Take: These Are Lifecycle Problems, Not Security Problems

The real issue here isn't that teams chose the wrong tool. It's that secrets management and control plane access are treated as one-time configuration decisions instead of operational systems that need ongoing governance.

In the SCALE framework I use for GCP platform architecture, this falls squarely in the Lifecycle Operations stage. Security-by-design handles the initial architecture — but if you don't build operational practices around secret rotation, sync monitoring, and access policy review, the security posture degrades over time.

Most teams configure secrets injection once and never revisit it. The same is true for control plane access. Both need periodic review cycles built into platform operations.

The Practical Path Forward

For secrets, you have three realistic options:

  1. Environment variables — avoid entirely. No encryption, visible in pod specs, logged in debug output.

  2. Kubernetes Secrets with CMEK — acceptable if you configure encryption properly and your threat model doesn't require external secret management.

  3. Secret Manager integration — preferred. Either via the native GKE add-on or External Secrets Operator.
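For option 2, application-layer secrets encryption is a single flag pointing at a Cloud KMS key. A sketch with placeholder resource names; the key must be in the cluster's region, and the GKE service agent needs encrypt/decrypt on it:

```bash
# Encrypt Kubernetes Secrets in etcd with a customer-managed KMS key
gcloud container clusters update CLUSTER \
  --location=REGION \
  --database-encryption-key=projects/PROJECT/locations/REGION/keyRings/RING/cryptoKeys/KEY
```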

For single-cloud GCP deployments, the native Secret Manager add-on is now the right choice:

```bash
gcloud container clusters update CLUSTER \
  --enable-secret-manager-addon \
  --location=REGION
```

The SecretProviderClass configuration is straightforward:

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-secrets
spec:
  provider: gcp
  parameters:
    secrets: |
      - resourceName: projects/PROJECT/secrets/db-password/versions/latest
        fileName: db-password
```
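Pods consume it through the Secrets Store CSI driver. A minimal sketch; pod, image, and mount names are illustrative, and the driver name here matches the managed GKE add-on (the OSS driver uses `secrets-store.csi.k8s.io` instead):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: gcr.io/PROJECT/app:latest
      volumeMounts:
        - name: secrets
          mountPath: /var/secrets   # db-password appears as a file here
          readOnly: true
  volumes:
    - name: secrets
      csi:
        driver: secrets-store-gke.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: app-secrets
```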

For control plane access, DNS-based endpoints with IAM replace the IP allowlist entirely:

```bash
gcloud container clusters update CLUSTER \
  --enable-dns-access \
  --location=REGION
```

The key IAM permission is container.clusters.connect — scope it to specific service accounts and user identities, not broad groups.
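A custom role keeps the grant narrow. A sketch with placeholder project, role, and service account IDs:

```bash
# Minimal role for reaching the control plane over the DNS endpoint
gcloud iam roles create gkeConnect --project=PROJECT \
  --title="GKE Connect" \
  --permissions=container.clusters.get,container.clusters.connect

# Bind it to a specific CI service account, not a broad group
gcloud projects add-iam-policy-binding PROJECT \
  --member="serviceAccount:ci-deployer@PROJECT.iam.gserviceaccount.com" \
  --role="projects/PROJECT/roles/gkeConnect"
```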

Trade-offs You Need to Plan For

Native add-on vs ESO: The native add-on is simpler and has fewer moving parts. ESO gives you multi-backend support and more flexibility for complex secret routing. For GCP-only shops, the native add-on wins on operational simplicity. Multi-cloud environments still need ESO.

DNS control plane access breaks existing kubeconfig files. This is the migration step teams skip. Every CI/CD pipeline, developer workstation, and operations runbook that uses the IP endpoint needs updating before you switch. I've seen teams enable DNS access without a migration plan and break their deployment pipeline during a release window.
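The kubeconfig regeneration itself is mechanical. A sketch assuming a recent gcloud with DNS endpoint support; run it everywhere a kubeconfig lives before cutting over:

```bash
# Rewrite the kubeconfig entry to target the DNS endpoint instead of the IP.
# Repeat on every CI runner, developer workstation, and ops host.
gcloud container clusters get-credentials CLUSTER \
  --location=REGION \
  --dns-endpoint
```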

Secret volume mounts don't auto-refresh in running pods. Rotated secrets require a pod restart to pick up new values. Either design your applications to handle restart-on-rotation, or implement a sidecar that detects secret changes and triggers graceful restarts.
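One low-tech version of the restart-on-rotation pattern: stamp the secret's current version into the pod template, so a rotation becomes an ordinary rolling update. A sketch with placeholder secret and deployment names:

```bash
# Look up the newest Secret Manager version, then patch it into the pod
# template annotations. Changing the template triggers a rolling restart,
# and the restarted pods mount the rotated value.
VERSION=$(gcloud secrets versions list db-password \
  --sort-by=~createTime --limit=1 --format="value(name)")
kubectl patch deployment app --type=merge -p \
  "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"secret-version\":\"$VERSION\"}}}}}"
```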

The Business Reality

Secrets in environment variables and stale IP allowlists are two of the most common findings in GKE security reviews. Neither creates an immediate breach — but both extend audit remediation timelines and increase operational risk during incidents.

The time to address them is before the audit, not during it. A planned migration takes a day of engineering work. An unplanned migration during audit remediation takes a week of firefighting while your compliance deadline slips.

I've seen teams defer these changes through multiple audit cycles because they seem low-priority. Then they hit a secret rotation failure during a production incident and suddenly it's an all-hands emergency.

Fix the configuration once. Build the operational review cycle. Move on to harder problems.


What secret management pattern has caused the most unexpected downtime in your GKE clusters?



Amit Malhotra, Principal GCP Architect, Buoyant Cloud Inc


Work with a GCP specialist — book a free discovery call: https://buoyantcloudtech.com
