Originally published on TechSaaS Cloud
Stop Putting Credentials in Environment Variables
Environment variables aren't secret management. They're secret broadcasting. Here's what production teams actually use.
The Env Var Illusion
Every "Getting Started" tutorial ends the same way:
export DATABASE_URL=postgres://admin:supersecret@db:5432/prod
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
docker-compose up
It works. It's simple. And it's a ticking time bomb.
Environment variables are visible to every process in the container. They show up in docker inspect. They appear in crash dumps. They get logged by overeager monitoring tools. They persist in shell history. They get committed to .env files that end up in git history.
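That first claim is easy to demonstrate. On Linux, any process running as the same user — which, inside most containers, means every process — can read another process's environment through /proc. A minimal sketch (pass the target PID as an argument):

import sys
from pathlib import Path

def read_environ(pid: int) -> dict[str, str]:
    # /proc/<pid>/environ is readable by any process with the same UID -
    # inside most containers, that's every process
    raw = Path(f"/proc/{pid}/environ").read_bytes()
    return dict(
        entry.split("=", 1)
        for entry in raw.decode(errors="replace").split("\0")
        if "=" in entry
    )

print(read_environ(int(sys.argv[1])))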
We run 84 containers in production. After a near-miss incident where a debug log accidentally captured an AWS key from os.environ, we rebuilt our entire secrets pipeline. Here's the production-grade approach that replaced env vars — and the incident that convinced us.
The Incident: 11 Seconds From Disaster
A developer added debug logging to trace a connection timeout:
logger.debug(f"Connection config: {dict(os.environ)}")
That log line captured every environment variable — including AWS_SECRET_ACCESS_KEY, DATABASE_URL with embedded credentials, and our Stripe API key. The logs shipped to our centralized logging stack (Loki), which is accessible to the entire engineering team.
Our secret scanner (trufflehog running as a pre-commit hook + a post-deploy log scanner) caught it in 11 seconds. The alert fired, and our automated rotation script revoked the AWS key and issued a new one before any human saw the log entry.
If we hadn't had that scanner? The credentials would have been sitting in Loki for anyone with dashboard access to find. And Loki retains logs for 30 days.
This is the fundamental problem with env vars: they're ambient. Any code running in the process can read them, and there's no audit trail of who accessed what.
The Secret Management Stack for Production
Layer 1: HashiCorp Vault (or Your Cloud Provider's Equivalent)
Vault is the source of truth for all secrets. Every credential lives in Vault. Nothing lives in env vars, .env files, or Kubernetes secrets (which are just base64-encoded, not encrypted).
Our setup:
# Vault policy for the API service
path "secret/data/api/*" {
  capabilities = ["read"]
}

path "secret/data/shared/database" {
  capabilities = ["read"]
}

# No access to other services' secrets
path "secret/data/billing/*" {
  capabilities = ["deny"]
}
Each service gets its own Vault policy with least-privilege access. The API service can read API secrets and shared database credentials. It cannot read billing secrets. This is impossible with env vars — there's no access control.
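From the application side, that policy looks like this through the hvac client — a sketch where the Vault address and secret paths are illustrative, and service_token stands in for whatever the service's auth method returns:

import hvac

# Token obtained via the service's auth method (Kubernetes, AppRole, ...),
# never from an environment variable
client = hvac.Client(url="https://vault.internal:8200", token=service_token)

# Allowed by the policy above (kv v2 mounted at "secret/")
db = client.secrets.kv.v2.read_secret_version(path="shared/database")

# Denied by the policy: raises hvac.exceptions.Forbidden,
# and the attempt is recorded in Vault's audit log
client.secrets.kv.v2.read_secret_version(path="billing/stripe")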
For teams not ready for Vault's operational overhead:
- AWS: Use Secrets Manager + IAM roles (not env vars, not Parameter Store for secrets)
- GCP: Use Secret Manager + Workload Identity
- Azure: Use Key Vault + Managed Identities
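Taking the AWS option as an example: with an IAM role attached to the instance or task, the read path is a couple of calls — a sketch with boto3, where the secret name is illustrative:

import json

import boto3

# boto3 picks up credentials from the IAM role attached to the
# instance or task - no keys in env vars or config files
client = boto3.client("secretsmanager", region_name="us-east-1")

resp = client.get_secret_value(SecretId="prod/api/database")
secret = json.loads(resp["SecretString"])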
Layer 2: Vault Agent Sidecar (Dynamic Injection)
Instead of injecting secrets at container startup, Vault Agent runs as a sidecar and writes secrets to a tmpfs volume that only the application can read:
# docker-compose.yml
services:
  api:
    image: registry.local/api:latest
    volumes:
      - secrets-vol:/run/secrets:ro
    depends_on:
      - vault-agent

  vault-agent:
    image: hashicorp/vault:latest
    command: vault agent -config=/etc/vault-agent.hcl
    volumes:
      - secrets-vol:/run/secrets

volumes:
  secrets-vol:
    driver: local
    driver_opts:
      type: tmpfs
      device: tmpfs
The application reads secrets from /run/secrets/database.json:
import json
from pathlib import Path

def get_db_url():
    secret = json.loads(Path("/run/secrets/database.json").read_text())
    return (
        f"postgres://{secret['username']}:{secret['password']}"
        f"@{secret['host']}:{secret['port']}/{secret['dbname']}"
    )
Why tmpfs? Secrets live only in memory. They're never written to disk. Container restart = secrets re-fetched from Vault. If the container is compromised, the attacker gets the current secret — but they can't persist it across restarts, and Vault's audit log shows the access.
Layer 3: Automatic Rotation
Static secrets are a liability. We rotate database credentials every 24 hours using Vault's database secrets engine:
# Vault database secrets engine config (Terraform)
resource "vault_database_secret_backend_role" "api_db" {
  name    = "api-readonly"
  backend = "database"
  creation_statements = [
    "CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';",
    "GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";"
  ]
  default_ttl = 86400  # 24 hours, in seconds
  max_ttl     = 172800 # 48 hours
}
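You can see what this role produces by minting a credential by hand with hvac — in production the Vault Agent does this for you (a sketch; service_token as before):

import hvac

client = hvac.Client(url="https://vault.internal:8200", token=service_token)

# Each call returns a brand-new PostgreSQL user that Vault
# revokes when the 24h lease expires
creds = client.secrets.database.generate_credentials(name="api-readonly")
username = creds["data"]["username"]
password = creds["data"]["password"]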
The Vault Agent sidecar detects when credentials are about to expire and fetches new ones. The application picks up the new credentials without restarting — we use a file watcher that reloads the database connection pool:
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class SecretReloader(FileSystemEventHandler):
    def on_modified(self, event):
        if event.src_path == "/run/secrets/database.json":
            self.reconnect_database()  # rebuild the pool with the new credentials

observer = Observer()
observer.schedule(SecretReloader(), "/run/secrets")
observer.start()
24-hour rotation means even if a credential leaks, it's useless within 24 hours. Compare this to env vars, where the same DATABASE_URL might live unchanged for months.
Layer 4: Secret Scanning (Defense in Depth)
Despite all the above, secrets still leak. A developer hardcodes a test credential. An error message includes a connection string. A log line captures more than intended.
We run detection at three levels:
- Pre-commit: trufflehog scans every commit before it's pushed
- CI: gitleaks runs on every PR
- Runtime: a log scanner watches Loki for patterns matching credentials
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/trufflesecurity/trufflehog
    rev: v3.63.0
    hooks:
      - id: trufflehog
        entry: trufflehog git file://. --only-verified --fail
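The runtime scanner is conceptually simple: tail the log stream and apply provider-specific patterns. A simplified sketch of the matching logic — these are the well-known AWS and Stripe key shapes, and the production version carries many more patterns:

import re

# Well-known credential shapes; extend per provider
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "stripe_secret_key": re.compile(r"sk_live_[0-9a-zA-Z]{24,}"),
    "postgres_url": re.compile(r"postgres(?:ql)?://[^:\s]+:[^@\s]+@"),
}

def scan_line(line: str) -> list[str]:
    return [name for name, pat in PATTERNS.items() if pat.search(line)]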
The runtime scanner is the last line of defense — and it's the one that caught our incident in 11 seconds.
Migration Path: Env Vars to Vault in 4 Steps
You don't have to migrate everything at once.
Week 1: Install Vault (single-node is fine to start). Migrate your 3 most sensitive secrets: database credentials, cloud provider keys, payment provider tokens.
Week 2: Set up Vault Agent sidecars for production services. Keep env vars as fallback — the application checks /run/secrets/ first and falls back to os.environ (see the sketch below).
Week 3: Enable dynamic database credentials. This is the biggest security win — every service gets unique, short-lived credentials.
Week 4: Remove env var fallback. Enable secret scanning in CI. Celebrate.
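That Week 2 fallback is only a few lines — a sketch, assuming secrets are rendered as JSON files with a single value key (your template shape may differ):

import json
import os
from pathlib import Path

def get_secret(name: str) -> str:
    # Prefer the Vault Agent-rendered file; fall back to the legacy
    # env var until the fallback is removed in Week 4
    secret_file = Path(f"/run/secrets/{name}.json")
    if secret_file.exists():
        return json.loads(secret_file.read_text())["value"]
    return os.environ[name.upper()]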
For teams in Mexico and the rest of Latin America, where the nearshoring boom means rapid team scaling, this migration path matters even more. New developers joining frequently means more opportunities for accidental credential exposure, and Vault's audit log gives you visibility that env vars never will.
Cost: Less Than You Think
- Vault OSS: Free. Runs on a single VM.
- Vault Enterprise (HA + namespaces): $0.03/hour per node
- AWS Secrets Manager: $0.40/secret/month + $0.05 per 10K API calls
- Our setup (Vault OSS + 1 VM): ~$20/month total
Compare that to the cost of a single credential breach. IBM's 2024 Cost of a Data Breach report puts the average at $4.88M. Even for a startup, a leaked AWS key can generate a $50K bill in hours from cryptomining.
$20/month for secret management vs $50K+ for a breach. The math works.
Common Mistakes During Migration
Teams migrating from env vars to Vault make predictable mistakes. Here are the ones we see most often.
Mistake 1: Big-bang migration. Trying to move all 50 secrets to Vault in one weekend. Something breaks, nobody can debug it because nobody knows Vault yet, and the team rolls back to env vars forever. Use the 4-week phased approach above. Start with 3 secrets. Build muscle memory.
Mistake 2: Vault as a single point of failure. Vault OSS runs as a single node by default. If it goes down, no service can fetch secrets. Solution: either run Vault in HA mode (3 nodes minimum) or implement a local cache. Vault Agent caches secrets locally — if the Vault server is temporarily unreachable, services continue using cached credentials until they expire.
# vault-agent cache configuration
cache {
  use_auto_auth_token = true
}

listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = true
}
Mistake 3: Not testing secret rotation under load. Rotation works perfectly in staging. In production, when 40 services simultaneously try to reconnect with new credentials, your database connection pool explodes. Test rotation during peak load, not during a quiet maintenance window. We discovered this the hard way at 2pm on a Tuesday.
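One standard mitigation is to add jitter, so reconnections spread across a window instead of landing at once — a minimal sketch of the idea:

import random
import time

def reconnect_with_jitter(reconnect, window_seconds=30):
    # Spread reconnects across a window so 40 services don't
    # rebuild their connection pools at the same instant
    time.sleep(random.uniform(0, window_seconds))
    reconnect()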
Mistake 4: Forgetting CI/CD pipelines. Your application services now use Vault, but your CI/CD pipeline still has secrets in GitHub Actions secrets or environment variables. CI secrets are a common blind spot — and they're especially dangerous because CI logs are often more widely accessible than production logs. Use Vault's AppRole auth or GitHub's OIDC integration to fetch CI secrets dynamically.
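With AppRole, for example, the pipeline holds only a role_id (in pipeline config) and a per-run secret_id injected by the CI platform, and trades them for a short-lived token — a sketch with hvac, where the role and secret paths are illustrative:

import hvac

client = hvac.Client(url="https://vault.internal:8200")

# role_id lives in pipeline config; secret_id is injected per-run
# by the CI platform and expires quickly
client.auth.approle.login(role_id=role_id, secret_id=secret_id)

deploy_key = client.secrets.kv.v2.read_secret_version(path="ci/deploy-key")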
Mistake 5: Not securing the Vault unsealing process. Vault starts sealed. Someone needs to unseal it after every restart. If you store unseal keys in a .txt file on the same server (we've seen this), you've replaced one insecure pattern with another. Use auto-unseal with a cloud KMS (AWS KMS, GCP Cloud KMS) or Shamir's Secret Sharing with keys distributed to 3+ team members.
Secrets in Multi-Cloud Environments
If you're running services across multiple cloud providers — a pattern we analyze in our multi-cloud pitfalls guide — secret management gets significantly harder.
Each cloud has its own secrets service with its own API, access control model, and rotation mechanism. Running Vault as a unified secrets layer across all clouds is one of the few genuinely good reasons to add a cloud-agnostic tool to your stack.
[AWS services] → Vault (central) ← [GCP services]
                      ↑
               [Azure services]
Vault authenticates each cloud's services using their native identity mechanisms (AWS IAM roles, GCP service accounts, Azure Managed Identities) and provides a single API for secret retrieval regardless of where the service runs.
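For instance, a service running on AWS can prove its identity to Vault with nothing but its IAM role — a sketch using hvac and boto3, where the Vault address and role name are illustrative:

import boto3
import hvac

# Sign the login request with the credentials of the instance/task
# IAM role - no static Vault token anywhere
session = boto3.Session()
creds = session.get_credentials()

client = hvac.Client(url="https://vault.internal:8200")
client.auth.aws.iam_login(
    creds.access_key, creds.secret_key, creds.token, role="api-service"
)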
This is one of the cases where the build vs buy framework clearly points to "buy" (or rather, "adopt open-source"): building a cross-cloud secrets layer is never core to your product, and the mature solution already exists.
Frequently Asked Questions
Q: Can't I just encrypt my .env files and call it secure?
Encrypted .env files are better than plaintext, but they still have fundamental problems: the decrypted values end up in memory as environment variables (back to square one), there's no access control (any process can read them), and there's no audit trail. It's a band-aid, not a solution.
Q: What about Docker secrets (docker secret create)?
Docker Swarm secrets are better than env vars — they're stored encrypted and mounted as files. But they're limited to Docker Swarm orchestration, they don't rotate automatically, and there's no access control granularity. If you're already on Swarm and not ready for Vault, they're a reasonable intermediate step. For Kubernetes, the native Secrets resource is base64-encoded (not encrypted at rest by default) — use the Vault CSI provider or sealed-secrets instead.
Q: We're a 3-person startup. Is Vault overkill?
For a 3-person team, yes — Vault's operational overhead isn't justified yet. Use your cloud provider's native secrets service (AWS Secrets Manager, GCP Secret Manager) with IAM-based access control. It's $0.40/secret/month, zero operational overhead, and leagues better than env vars. Graduate to Vault when you cross 20+ services or need cross-cloud support.
Related Reading
- CI/CD Pipeline Optimization — securing secrets in fast CI pipelines
- Multi-Cloud Strategy Pitfalls — why cross-cloud secret management is one of the hidden costs
- Self-Hosted LLMs vs API — securing API keys and model credentials at scale
We help DevOps teams audit their secret management practices and migrate from env vars to production-grade solutions.
Subscribe to our newsletter for weekly deep-dives into production security practices.