TheProdSDE

Posted on Jun 24 • Originally published at Medium

Zero Trust Security in Production: Identity, OPA, Vault, mTLS & Audit Logging — A Complete Reference

#devops #security #kubernetes #architecture

*This is **Part 5** of the Building a Zero-Trust Security Architecture series. Parts 1–4 covered secrets fundamentals, HashiCorp Vault, cloud secret managers (Azure Key Vault, AWS Secrets Manager), and why Kubernetes Secrets are not a full secret management platform. This part brings it all together into one production reference design.*

TL;DR — What This Part Actually Delivers

Why authentication, authorization, secret management, encryption, audit, and network control are six different jobs — and why collapsing them is the #1 architecture mistake
A full reference architecture connecting an IdP, OPA, a service mesh, Vault/cloud secret managers, and a SIEM
A corrected, OPA 1.0-compatible Rego policy example (with the rego.v1 + if fix most tutorials miss)
A current Istio mTLS + AuthorizationPolicy configuration using security.istio.io/v1
A security maturity model (Level 0 → Level 5) and decision matrix
A real incident-response runbook for a compromised service

The #1 Architecture Mistake

Most breaches don't happen because a team lacked some exotic tool.

They happen because basic separation of concerns broke down.

The JWT that tries to carry every permission. The database password that doubles as every service's identity. The "temporary" admin token nobody ever revoked.

Zero Trust isn't one system — it's six distinct responsibilities:

| Layer | Responsibility |
| --- | --- |
| 🔐 Authentication | Who are you? |
| ✅ Authorization | What are you allowed to do? |
| 🔑 Secret Management | How do services prove identity without static credentials? |
| 🔒 Encryption | Is data protected at rest and in transit? |
| 📋 Audit Logging | Can we prove what happened and by whom? |
| 🌐 Network Controls | How do we limit blast radius when something goes wrong? |

Collapse any two of these into one system and you've introduced a future incident.

Step 1 — User Authentication

Users authenticate through an identity provider: Keycloak, Okta, Auth0, or Microsoft Entra ID.

At this point, identity is established. Nothing more.

JWT validation is NOT authorization

A valid JWT proves who the user is. It does not prove what they are allowed to do.

That's a critical distinction — and where policy evaluation enters the design.

Step 2 — Authorization Through OPA

OPA (Open Policy Agent) becomes the centralized authorization engine. Applications query it for decisions instead of hardcoding complex rules scattered across services.

A Realistic Policy Example

A production policy should consider action, resource, ownership, and environment — not just a flat role string:

package authz

import rego.v1

default allow := false

allow if {
  input.user.role == "analyst"
  input.action == "read"
  input.resource.owner_team == input.user.team
  input.resource.environment != "production"
}

⚠️ OPA 1.0 note: OPA 1.0 and later require the if keyword before every rule body and treat rego.v1 semantics as the default, so the import above is optional there but keeps the policy valid on late 0.x releases too. The older bare allow { ... } style fails opa check on a 1.x install. If you're copying Rego from older blog posts, run opa fmt --rego-v1 on it before trusting it in CI.
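
To make "applications query it for decisions" concrete, here is a minimal Python sketch of a client for the policy above. It uses OPA's Data API, where package authz and rule allow map to the path /v1/data/authz/allow. The local address, the function names, and the fail-closed behavior are assumptions for illustration, not part of the original design.

```python
import json
import urllib.request

# Assumption: OPA runs as a sidecar or local daemon on its default port.
OPA_URL = "http://localhost:8181/v1/data/authz/allow"


def build_input(user, action, resource):
    """Wrap the request in the `input` document the authz policy reads."""
    return {"input": {"user": user, "action": action, "resource": resource}}


def parse_decision(response_body):
    """Return True only for an explicit `"result": true`.

    OPA omits `result` when a rule is undefined, so anything other
    than a literal true is treated as a deny.
    """
    document = json.loads(response_body)
    return isinstance(document, dict) and document.get("result") is True


def is_allowed(user, action, resource, timeout=2.0):
    """Ask OPA for a decision; deny on any transport or parsing error."""
    payload = json.dumps(build_input(user, action, resource)).encode("utf-8")
    request = urllib.request.Request(
        OPA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return parse_decision(response.read())
    except (OSError, ValueError):
        return False  # fail closed: no decision means no access
```

Keeping build_input and parse_decision separate from the HTTP call makes the request shape and the deny-by-default handling easy to unit test without a running OPA.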

Anti-pattern: JWT contains all permissions

Putting every permission in the token creates staleness and revocation problems — you can't un-issue a JWT that's already in someone's hands.

✅ Prefer: tokens for identity claims + OPA for dynamic authorization decisions.

Step 3 — Service Authentication (Workload Identity)

User authentication alone is not enough. Microservices also need strong identity.

Common anti-patterns to avoid:

Shared API keys
Shared service passwords
Long-lived tokens copied across services

Kubernetes Workload Identity — Two Distinct Paths

It's worth being precise here, since "OIDC provider" gets used loosely in many guides:

Path A — Vault Kubernetes Auth Method
Vault validates the pod's projected service-account token against the cluster itself, through the Kubernetes TokenReview API. No external OIDC provider is required.

Path B — Cloud-Native Federation
(AWS IRSA, GCP Workload Identity Federation, Microsoft Entra Workload ID, formerly Azure AD Workload Identity)
The cluster's OIDC issuer federates with a cloud IAM provider, letting a pod assume a cloud role without any static keys.

Both achieve the same outcome — no long-lived secrets sitting in a pod — through different plumbing. Know which one you're actually running.
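
For Path A, the exchange looks roughly like the sketch below: the pod reads its own projected token and posts it to Vault's Kubernetes auth login endpoint, getting a short-lived Vault token back. The mount name kubernetes is Vault's default and may differ in your setup; the Vault address and role name in the usage note are hypothetical.

```python
import json
import urllib.request

# Default location of the projected service-account token inside a pod.
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def build_login_request(vault_addr, role, jwt, mount="kubernetes"):
    """Build the URL and JSON body for Vault's Kubernetes auth login."""
    url = f"{vault_addr.rstrip('/')}/v1/auth/{mount}/login"
    return url, {"role": role, "jwt": jwt}


def extract_token(login_response):
    """Pull the client token and its TTL (seconds) out of the response."""
    auth = login_response["auth"]
    return auth["client_token"], auth["lease_duration"]


def login(vault_addr, role, token_path=SA_TOKEN_PATH):
    """Exchange the pod's service-account token for a Vault token."""
    with open(token_path, encoding="utf-8") as handle:
        jwt = handle.read().strip()
    url, body = build_login_request(vault_addr, role, jwt)
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return extract_token(json.load(response))
```

A call such as login("https://vault.example.internal:8200", "service-a") would only work inside a pod whose service account is bound to that Vault role. Nothing in the pod spec holds a static secret.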

Step 4 — Secret Retrieval and Dynamic Credentials

After authentication, services retrieve secrets or dynamic credentials on demand.

What changes in the modern model:

| Traditional | Modern |
| --- | --- |
| Hardcoded passwords | No hardcoded passwords |
| Shared, permanent credentials | Unique credentials per workload |
| Manual rotation | Short-lifetime, auto-rotated |
| Broad blast radius | Identity-scoped, limited exposure |

Each application receives its own identity and its own credentials, and those credentials are short-lived.
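
On the application side, short lifetimes mean the code has to expect credentials to change. A minimal sketch of that, assuming a response shaped like the one Vault's database secrets engine returns (lease_duration plus data.username and data.password): the LeasedCredentials class, the fetch callable, and the 80% refresh point are all hypothetical choices, not something the article prescribes.

```python
import time


class LeasedCredentials:
    """Cache dynamic credentials and refetch them before the lease expires."""

    def __init__(self, fetch, clock=time.monotonic, refresh_fraction=0.8):
        self._fetch = fetch                  # callable returning a response dict
        self._clock = clock                  # injectable so tests control time
        self._refresh_fraction = refresh_fraction
        self._creds = None
        self._refresh_at = 0.0

    def get(self):
        """Return (username, password), fetching fresh ones when due."""
        now = self._clock()
        if self._creds is None or now >= self._refresh_at:
            response = self._fetch()
            data = response["data"]
            self._creds = (data["username"], data["password"])
            # Refresh well before expiry so in-flight work never
            # holds a credential that has already been revoked.
            self._refresh_at = now + response["lease_duration"] * self._refresh_fraction
        return self._creds
```

Connection pools would call get() whenever they open a new connection, so a rotated credential is picked up without a redeploy.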

Step 5 — mTLS and Service Mesh

Service-to-service authentication becomes operationally manageable through a service mesh. Saying "we use mTLS" without an implementation path doesn't help anyone.

Istio Example (current stable API)

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: prod
spec:
  mtls:
    mode: STRICT

---
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: service-b-allow-service-a
  namespace: prod
spec:
  selector:
    matchLabels:
      app: service-b
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/prod/sa/service-a"]

Two real corrections baked in here:

security.istio.io/v1 is the current stable API version (v1beta1 is the legacy version still floating around in older tutorials)

action: ALLOW is written explicitly. Istio defaults to ALLOW when omitted, but spelling it out makes the policy's intent obvious in a code review six months from now — and obvious at 2 a.m. during an incident
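
The principals string in the policy is the caller's SPIFFE identity with the spiffe:// prefix dropped. The Python sketch below shows how the two relate. The helper names are hypothetical, and caller_allowed is a deliberately simplified model (exact match only) of what the sidecar enforces, since real Istio matching also supports prefix and suffix wildcards.

```python
SPIFFE_PREFIX = "spiffe://"


def spiffe_id(trust_domain, namespace, service_account):
    """Build the SPIFFE ID Istio issues to a workload's certificate."""
    return f"{SPIFFE_PREFIX}{trust_domain}/ns/{namespace}/sa/{service_account}"


def istio_principal(spiffe):
    """Convert a SPIFFE ID into the form used in `principals`."""
    if not spiffe.startswith(SPIFFE_PREFIX):
        raise ValueError(f"not a SPIFFE ID: {spiffe!r}")
    return spiffe[len(SPIFFE_PREFIX):]


def caller_allowed(peer_spiffe_id, allowed_principals):
    """Simplified ALLOW check: exact match against the policy's list."""
    return istio_principal(peer_spiffe_id) in allowed_principals


if __name__ == "__main__":
    peer = spiffe_id("cluster.local", "prod", "service-a")
    print(istio_principal(peer))
```

The identity comes from the service account, not the pod name or IP, which is why each service should run under its own service account.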

Does a service mesh replace Vault?

No. They solve different problems:

Mesh (Istio, Linkerd): service-to-service transport security and identity via mTLS
Secret manager (Vault, AWS Secrets Manager, Azure Key Vault): credential issuance, rotation, and storage for database passwords, API keys, etc.

Most mature architectures run both.

Step 6 — Encryption Architecture

A common mistake: letting applications own long-lived encryption keys. That creates key sprawl and inconsistent key handling.

Prefer a central cryptographic service (Vault Transit engine or a cloud KMS equivalent).

Keys remain protected because they never live inside the application process as ordinary static assets. The application calls the cryptographic service; the keys never leave it.
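
With Vault Transit, "the application calls the cryptographic service" comes down to two HTTP calls that carry base64 payloads. The sketch below only builds and decodes those payloads. The key name and Vault address are hypothetical, and sending the requests (with a Vault token in the X-Vault-Token header) is left to whatever HTTP client you already use.

```python
import base64


def build_encrypt_request(vault_addr, key_name, plaintext):
    """URL and body for Transit encrypt; plaintext must be base64-encoded."""
    url = f"{vault_addr.rstrip('/')}/v1/transit/encrypt/{key_name}"
    body = {"plaintext": base64.b64encode(plaintext).decode("ascii")}
    return url, body


def build_decrypt_request(vault_addr, key_name, ciphertext):
    """URL and body for Transit decrypt; ciphertext looks like vault:v1:..."""
    url = f"{vault_addr.rstrip('/')}/v1/transit/decrypt/{key_name}"
    return url, {"ciphertext": ciphertext}


def decode_plaintext(decrypt_response):
    """Transit returns base64 plaintext under data.plaintext."""
    return base64.b64decode(decrypt_response["data"]["plaintext"])
```

The application stores only the ciphertext string. Because the key version is embedded in that string, keys can be rotated in Vault without the application tracking which key encrypted which record.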

Step 7 — Network Policies

Zero Trust is incomplete if every pod can talk to every other pod by default.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-service-a-to-db
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: service-a
    ports:
    - protocol: TCP
      port: 5432

NetworkPolicy vs. service mesh:
NetworkPolicy controls reachability at the IP/port level. It doesn't provide encryption or cryptographic identity. Use both: NetworkPolicy to shrink the attack surface, mTLS/mesh to verify identity on the connections that are allowed.
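
One detail worth spelling out: a NetworkPolicy only isolates the pods it selects, so pods that no policy selects still accept traffic from anywhere. A common companion is a namespace-wide default-deny policy. The Python sketch below generates both manifests as JSON, which kubectl apply -f accepts. The function names and the default-deny-ingress policy name are assumptions for illustration.

```python
import json


def default_deny_ingress(namespace):
    """Select every pod in the namespace and allow no ingress at all."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        # Empty podSelector matches all pods; no ingress rules means deny.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def allow_ingress(namespace, name, target_app, source_app, port):
    """Allow TCP traffic from one app label to another on a single port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": target_app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": {"app": source_app}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    policies = [
        default_deny_ingress("prod"),
        allow_ingress("prod", "allow-service-a-to-db", "database", "service-a", 5432),
    ]
    for policy in policies:
        print(json.dumps(policy, indent=2))
```

Policies are additive, so the explicit allow rule still works alongside the default deny.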

Step 8 — Audit Architecture

Security without auditing is incomplete. Every meaningful decision should be traceable.

Questions your platform should answer quickly:

Who accessed data?
When?
Which policy allowed it?
Which secret or lease was used?
Which workload identity made the call?
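
One way to make those questions answerable is to emit one structured event per decision, with a field for each question. The sketch below is an assumption about what such a record could look like. The field names and sample values are hypothetical and should follow whatever schema your SIEM already expects.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    actor: str              # who accessed the data (user from the IdP)
    workload_identity: str  # which workload made the call
    action: str
    resource: str
    policy: str             # which policy produced the decision
    decision: str           # "allow" or "deny"
    lease_id: str           # which secret or lease was used
    timestamp: str          # when, in UTC ISO 8601


def new_event(actor, workload_identity, action, resource,
              policy, decision, lease_id, now=None):
    """Create an event stamped with the current UTC time."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return AuditEvent(actor, workload_identity, action, resource,
                      policy, decision, lease_id, stamp)


def to_json_line(event):
    """Serialize as one JSON object per line for log shippers."""
    return json.dumps(asdict(event), sort_keys=True)
```

Logging denies as well as allows matters: a run of denied requests from one identity is often the earliest sign that something is probing.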

Useful detections to build

A sudden spike in secret reads
Access to secrets outside normal namespace or service patterns
Lease revocation followed by repeated failed usage
Cross-region secret access anomalies
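
The first detection in that list can start out very simple. The sketch below counts secret reads per identity in the current window and flags any identity well above its usual rate. The 3x factor, the minimum of 10 reads, and the function name are hypothetical starting points, and a real SIEM rule would derive the baseline from history rather than take it as an argument.

```python
from collections import Counter


def find_read_spikes(identities, baseline, factor=3.0, minimum=10):
    """Flag identities whose secret reads spiked in the current window.

    identities: one entry per secret-read event (the reader's identity).
    baseline:   typical reads per window, keyed by identity.
    An identity is flagged when it made at least `minimum` reads and
    more than `factor` times its baseline. Unknown identities have a
    baseline of zero, so any burst from them is flagged.
    """
    counts = Counter(identities)
    flagged = [
        identity
        for identity, count in counts.items()
        if count >= minimum and count > factor * baseline.get(identity, 0)
    ]
    return sorted(flagged)
```

The minimum keeps low-volume services from alerting every time they go from one read to four.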

Incident Response Runbook — Compromised Service

Suppose Service A is compromised.

Traditional architecture often means: shared credentials, permanent access, weak attribution — a bad day.

Modern architecture enables this runbook:

Identify the workload identity, token accessor, or Vault lease tied to Service A
Revoke the affected Vault token or leases
Trigger dynamic credential rotation where needed
Query audit logs for every action performed by that identity
Tighten policy or disable the role
Redeploy the workload with corrected configuration
Review whether network policy and mesh policy limited the blast radius
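
The revocation step above maps onto two Vault endpoints: revoking a token by its accessor (which also revokes the leases created with that token) and revoking every lease under a path prefix. The sketch below scripts those two calls in Python. The send callable, the accessor value, and the lease prefix are hypothetical, and revoke-prefix needs a Vault policy with sudo capability on that path.

```python
def containment_calls(token_accessor, lease_prefix):
    """Return the Vault API calls for the revocation step, in order."""
    return [
        ("POST", "/v1/auth/token/revoke-accessor", {"accessor": token_accessor}),
        ("PUT", f"/v1/sys/leases/revoke-prefix/{lease_prefix.strip('/')}", None),
    ]


def run_containment(send, token_accessor, lease_prefix):
    """Execute every call through `send(method, path, body)`.

    A failure is recorded rather than raised, so one failed call does
    not stop the remaining revocations during an incident.
    """
    results = []
    for method, path, body in containment_calls(token_accessor, lease_prefix):
        try:
            send(method, path, body)
            results.append((path, "ok"))
        except Exception as exc:  # keep going; report at the end
            results.append((path, f"failed: {exc}"))
    return results
```

Having this written and rehearsed before an incident matters more than the code itself: the runbook only helps if someone has run it at least once in a test environment.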

This is where short-lived credentials and strong audit trails become operational advantages, not just design principles on a slide.

Security Maturity Model

| Level | Description |
| --- | --- |
| 0 | Passwords in source code |
| 1 | Environment variables and manually managed credentials |
| 2 | Kubernetes Secrets and basic secret segregation |
| 3 | Secret manager + scheduled rotation |
| 4 | Identity-based access, dynamic credentials, workload identity |
| 5 | Zero Trust, policy as code, continuous verification, strong audit analytics, mesh identity, reduced lateral movement |

The model exists so teams can choose the next practical step instead of trying to jump directly to Level 5 — which mostly produces a half-finished Vault deployment and a very tired platform team.

Decision Matrix

Small Startup

Recommended pattern: OIDC + OPA + Cloud secret manager + Namespace-scoped RBAC + Network policies
Benefits: Centralized authorization, better governance, better traceability

Large Enterprise

Recommended pattern: OIDC + OPA + Vault + PKI + mTLS + Dynamic credentials + SIEM + Formal runbooks
Benefits: Fine-grained authorization, stronger incident containment, better compliance posture, separation of duties

Common Anti-Patterns

| Anti-pattern | Why it fails |
| --- | --- |
| JWT contains every permission | Permissions become stale and hard to revoke |
| Database password shared across services | No accountability, large blast radius |
| Secrets stored in Git | Permanent exposure risk: history doesn't forget, and neither do forks |
| Long-lived cloud access keys | Credential theft risk; prefer managed identity / workload identity |
| Single shared Vault admin token across teams | Broken accountability and dangerous privilege concentration |

Quick Self-Check

Before reading further — how many of these apply to your stack right now?

[ ] JWTs contain all user permissions
[ ] A database password is shared across multiple services
[ ] Secrets are stored (or ever were) in Git
[ ] Long-lived cloud access keys are in use
[ ] A single Vault admin token is shared across teams

If you checked any box, this series was written for exactly where you are.

FAQ

Is OPA the same as Kubernetes RBAC?
No. Kubernetes RBAC controls access to the Kubernetes API itself (who can create a Deployment, read a Secret). OPA is a general-purpose policy engine that your applications call for their own authorization decisions. It can also evaluate Kubernetes admission requests via Gatekeeper — but its scope goes well beyond the cluster API.

Do I need Vault if I'm a small team?
Usually not on day one. A cloud-native secret manager paired with an OIDC provider covers most early-stage needs with far less operational overhead. Vault earns its complexity once you have multiple teams, multiple clouds, or compliance requirements that demand fine-grained, auditable access.

What's the single highest-leverage change at Level 1 or 2?
Move from static, shared credentials to short-lived, identity-bound ones — even before introducing OPA or a mesh. Almost every other improvement in this series compounds on top of that one change.

Read the Full Article

The complete reference architecture diagrams, Vault PKI + cert-manager integration details, compliance mapping (SOC 2, ISO 27001, PCI DSS, HIPAA, NIST CSF), and the full SIEM integration guide are in the original article:

👉 Building a Zero-Trust Security Architecture — Part 5

A Note of Thanks

This series is published in Towards AI — one of the leading publications for AI, ML, and engineering content. A huge thank you to the Towards AI team for their support in helping this work reach engineers and architects who are building these systems right now. If you're not following them yet, you should be.

Where Is Your Stack?

The series covers a maturity model from Level 0 to Level 5.

What level is your team actually at — and what's the biggest blocker to moving up?

Drop a comment below. If something in here matches — or doesn't match — what you're running in production, I'd genuinely like to hear about it. 👇

Part of the Building a Zero-Trust Security Architecture series.
Part 1 · Part 2 · Part 3 · Part 4 · **Part 5 — you are here**
