isabelle dubuis

Posted on Jun 20 • Edited on Jun 29

When a Secrets Manager Becomes a Costly Bottleneck

#devops #kubernetes #security

On March 14th, 2023, we added Vault to a static-config microservice that never fetched new credentials at runtime. Every CI job picked up an extra 187 ms, and the rollout ended in an SLA breach that cost us $4,200 per month.

The Mis‑fit: Static Binaries with Hard‑Coded Secrets

Why the secret never changes

We built Service A as a thin wrapper around an existing MySQL database. The password was generated once, stored in a Helm values file, and baked into the Docker image at build time. No one ever rotated it because the DBA team considered the credential “permanent”. In other words, the secret’s lifecycle was static from day one.

Cost of pulling from Vault each start

When the ops team insisted on “centralised secret management”, we swapped the baked value for a Vault lookup. On every start, each pod now performed a TLS handshake, authenticated with its service token, and queried kv/data/app/service-a. The handshake added ~120 ms and the API call another ~67 ms, which together make up the 187 ms we measured in the CI job.
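A minimal sketch of how such a cold fetch could be timed. The KV path is the one named above; the Vault address and token are placeholders, and the helper names are hypothetical. Each call opens a fresh connection, so the measurement covers the handshake and the API call together rather than splitting them.

```python
import time
import urllib.request
from typing import Callable, Tuple


def fetch_from_vault(addr: str, token: str, path: str = "kv/data/app/service-a") -> bytes:
    """Read a secret over Vault's HTTP API. addr and token are placeholders."""
    req = urllib.request.Request(
        f"{addr}/v1/{path}",
        headers={"X-Vault-Token": token},
    )
    # A new connection per call means the TLS handshake is part of the timing.
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read()


def timed(fetch: Callable[[], bytes]) -> Tuple[bytes, float]:
    """Run fetch() once and return its result plus the elapsed milliseconds."""
    start = time.perf_counter()
    result = fetch()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

In a pod this might be called as timed(lambda: fetch_from_vault(addr, token)) and the elapsed value logged at startup.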

Data point: 92 % of the 47 microservices surveyed never regenerated their DB password after deployment.

The pattern was simple: static secret, dynamic fetch. The fetch added latency for no security benefit because the secret never changed.

Latency Penalties at Scale

Cold‑start impact

Kubernetes restarts pods for deployments, node drains, or autoscaling events. Each cold start now incurred the extra 187 ms before the container could answer its first request.

Aggregate delay across a fleet

Our cluster runs roughly 12,000 pod restarts per day across all environments. Multiply that by 187 ms and you get about 37 minutes of startup wait per day, or roughly 19 hours of lost compute time per month. Those are CPU cycles that could have been used for real work, not waiting on a secret fetch.

Data point: With 12,000 daily pod restarts, the extra 187 ms summed to about 37 minutes per day, or roughly 19 hours of lost compute time per month.
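The aggregate can be reproduced from the two inputs in the post. The 30-day month is an assumption, and the function name is only illustrative.

```python
RESTARTS_PER_DAY = 12_000  # from the post
FETCH_DELAY_MS = 187       # from the post
DAYS_PER_MONTH = 30        # assumption


def startup_delay(restarts_per_day: int, delay_ms: float, days: int = DAYS_PER_MONTH):
    """Return (minutes of added wait per day, hours of added wait per month)."""
    seconds_per_day = restarts_per_day * delay_ms / 1000
    minutes_per_day = seconds_per_day / 60
    hours_per_month = seconds_per_day * days / 3600
    return minutes_per_day, hours_per_month


minutes, hours = startup_delay(RESTARTS_PER_DAY, FETCH_DELAY_MS)
print(f"{minutes:.1f} min/day, {hours:.1f} h/month")  # 37.4 min/day, 18.7 h/month
```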

During a rolling update of the order‑processing service, the added latency caused a 3‑minute dip in request throughput, triggering an alert on our SLO dashboard. The alert was a false positive – the system was healthy, just slower to start.

Hidden Cloud‑Cost Surge

Per‑request pricing

We used AWS Secrets Manager as a fallback when Vault was unavailable, because the IAM role allowed both. AWS charges $0.0035 per 1,000 API calls.

Idle connection overhead

Even when the secret is static, each container opened a new connection to Vault (or Secrets Manager) on every restart. Over a month we logged 1,200,000 secret fetches.

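Data point: We logged 1,200,000 secret fetches in a month. At $0.0035 per 1,000 calls (AWS Secrets Manager pricing), that is about $4.20 in per-call charges, so API fees account for only a small part of the $4,200 extra monthly spend we traced to the rollout.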

A nightly batch job that processed logs queried Vault five times per container, even though the same password was used for the entire run. Those redundant calls doubled the cost for that job alone.
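One way to avoid such repeated lookups is to cache the value for the life of the process. This is a sketch, not the batch job's actual code: fetch_secret_from_manager is a hypothetical stand-in for the real Vault or Secrets Manager call, and the counter exists only to show how many round trips happen.

```python
from functools import lru_cache

FETCH_COUNT = 0  # counts simulated round trips to the secrets manager


def fetch_secret_from_manager(name: str) -> str:
    """Hypothetical stand-in for the real network call."""
    global FETCH_COUNT
    FETCH_COUNT += 1
    return f"password-for-{name}"


@lru_cache(maxsize=None)
def get_secret(name: str) -> str:
    """Fetch each named secret at most once per process."""
    return fetch_secret_from_manager(name)


# Five lookups in one run, as in the batch job, but only one round trip.
for _ in range(5):
    password = get_secret("db")

print(FETCH_COUNT)  # 1
```

A process-lifetime cache is only safe when the secret does not change during the run, which was the case here.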

Operational Complexity vs. Benefit

Policy churn

Vault forces you to define policies for each service principal. After the rollout, we opened 38 policy‑change tickets in the quarter, up from an average of 5. Most of those tickets were about tightening ACLs that inadvertently blocked sidecars or health‑checks.

Incident response friction

When a new log‑shipping sidecar was added, its token lacked read access to the kv path. Vault returned permission denied, the sidecar failed silently, and logs stopped flowing. We spent two hours rolling back the deployment and fixing the ACL.
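A sidecar that exits loudly on a denied read would have surfaced this sooner. The sketch below assumes the secret is read over Vault's HTTP API, where a permission-denied response arrives as HTTP 403; read_secret_or_exit and the fetch callable are hypothetical names.

```python
import sys
import urllib.error
from typing import Callable


def read_secret_or_exit(fetch: Callable[[], bytes], path: str) -> bytes:
    """Call fetch(); on HTTP 403, log the path to stderr and exit non-zero."""
    try:
        return fetch()
    except urllib.error.HTTPError as err:
        if err.code == 403:
            print(
                f"vault: permission denied reading {path}; check the token's policy",
                file=sys.stderr,
            )
            raise SystemExit(1)
        raise  # other HTTP errors propagate unchanged
```

With a non-zero exit the container shows up as failing or restarting in Kubernetes instead of running with no output.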

Data point: The team logged 38 policy‑change tickets in the quarter following Vault rollout, up from an average of 5 per quarter.

The extra steps to troubleshoot ACLs, token renewal, and lease expiry added noise to on‑call rotations. No security incident was averted by those policies; they simply created more work.

When Simpler Wins: Alternatives That Fit

Environment‑variable injection at build time

If a secret never rotates, bake it into the image or inject it via a CI step. The container starts with the value already in place, so there is no runtime lookup and no added latency.
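On the application side, this pattern reduces to reading a variable at startup. A minimal sketch, assuming the secret is injected as an environment variable; the name DB_PASSWORD is hypothetical.

```python
import os
from typing import Mapping


def load_db_password(env: Mapping[str, str] = os.environ, var: str = "DB_PASSWORD") -> str:
    """Return the injected password, or fail fast if it is missing or empty."""
    value = env.get(var)
    if not value:
        raise RuntimeError(f"{var} is not set; was it injected at build or deploy time?")
    return value
```

Failing at startup keeps a missing value from turning into a database authentication error later on.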

Encrypted config files in artifact storage

We moved Service B’s credentials into a GPG‑encrypted file stored in S3. CI decrypted it once, baked the plaintext into the Docker image, and then deleted the key from the build agent.
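The decrypt step in CI could look roughly like the sketch below. It assumes gpg is installed on the build agent and that the private key has already been imported. The file paths are placeholders, and the download from S3 is left out.

```python
import subprocess
from pathlib import Path
from typing import List


def gpg_decrypt_command(encrypted: Path, plaintext: Path) -> List[str]:
    """Build a non-interactive gpg command that decrypts one file."""
    return [
        "gpg", "--batch", "--yes",
        "--output", str(plaintext),
        "--decrypt", str(encrypted),
    ]


def decrypt_once(encrypted: Path, plaintext: Path) -> None:
    """Run the decryption and raise if gpg exits with an error."""
    subprocess.run(gpg_decrypt_command(encrypted, plaintext), check=True)
```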

Data point: Switching to encrypted Helm values reduced deployment time by 42 % and cut secret‑related tickets to 2 per quarter.

The approach eliminated all runtime secret fetches, removed the need for Vault policies, and cut the per‑call cost to zero.

Post‑mortem Checklist

Ask if secret rotates – If rotation is scheduled (e.g., every 30 days), a dynamic manager may be justified.
Measure latency impact – Run a quick benchmark: start a container with and without the secret fetch and record the delta.
Calculate per‑call cost – Multiply expected fetches per month by the provider’s pricing.
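The three checks can be combined into a small helper. This is a sketch: SecretProfile and recommend are hypothetical names, and the example inputs are the fetch count, latency, and per-call rate quoted in this post.

```python
from dataclasses import dataclass


@dataclass
class SecretProfile:
    rotates: bool                # is rotation actually scheduled?
    fetch_latency_ms: float      # measured startup delta with vs. without the fetch
    fetches_per_month: int       # expected secret reads per month
    price_per_1000_calls: float  # provider's per-call pricing, in USD


def monthly_call_cost(profile: SecretProfile) -> float:
    """Per-call charges only; excludes compute, storage, and operations time."""
    return profile.fetches_per_month / 1000 * profile.price_per_1000_calls


def recommend(profile: SecretProfile) -> str:
    """A rotating secret may justify a runtime fetch; a static one does not."""
    if profile.rotates:
        return "runtime fetch may be justified"
    return "static delivery"


service_a = SecretProfile(
    rotates=False,
    fetch_latency_ms=187,
    fetches_per_month=1_200_000,
    price_per_1000_calls=0.0035,
)
print(recommend(service_a))                   # static delivery
print(f"{monthly_call_cost(service_a):.2f}")  # 4.20
```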

Data point: Applying the checklist to 8 legacy services saved an estimated $31,600 annually.

After the review, we disabled Vault for five services, reverted to build‑time injection, and the SLA breach disappeared within a week.

Comparison of Secret‑Delivery Patterns

Pattern	Rotation Frequency	Avg. latency per fetch (ms)	Monthly cost (USD)	Ops overhead (tickets/quarter)
Vault dynamic (runtime)	Automatic (days)	187	$4,200	38
Build‑time injection	None	0	$0	2
Encrypted artifact (GPG)	Manual (release)	0 (baked)	$0	3

The table makes it clear: when rotation isn’t required, the latency and cost of a full‑blown secrets manager dominate the equation.

A real‑world footnote

We first tried the Vault integration on our voice platform at trust‑vault.com and ran into the same latency ceiling. It reinforced the lesson that the tool must match the problem, not the other way around.

If a secret never rotates, the hidden latency and per-call charges of a secrets manager can cost you thousands per month. Choose a static, encrypted delivery method instead.

DEV Community