*This is **Part 5** of the Building a Zero-Trust Security Architecture series. Parts 1–4 covered secrets fundamentals, HashiCorp Vault, cloud secret managers (Azure Key Vault, AWS Secrets Manager), and why Kubernetes Secrets are not a full secret management platform. This part brings it all together into one production reference design.*
TL;DR — What This Part Actually Delivers
- Why authentication, authorization, secret management, encryption, audit, and network control are six different jobs — and why collapsing them is the #1 architecture mistake
- A full reference architecture connecting an IdP, OPA, a service mesh, Vault/cloud secret managers, and a SIEM
- A corrected, OPA 1.0-compatible Rego policy example (with the `rego.v1` + `if` fix most tutorials miss)
- A current Istio mTLS + AuthorizationPolicy configuration using `security.istio.io/v1`
- A security maturity model (Level 0 → Level 5) and decision matrix
- A real incident-response runbook for a compromised service
The #1 Architecture Mistake
Most breaches don't happen because a team lacked some exotic tool.
They happen because basic separation of concerns broke down.
The JWT that tries to carry every permission. The database password that doubles as every service's identity. The "temporary" admin token nobody ever revoked.
Zero Trust isn't one system — it's six distinct responsibilities:
| Layer | Responsibility |
|---|---|
| 🔐 Authentication | Who are you? |
| ✅ Authorization | What are you allowed to do? |
| 🔑 Secret Management | How do services prove identity without static credentials? |
| 🔒 Encryption | Is data protected at rest and in transit? |
| 📋 Audit Logging | Can we prove what happened and by whom? |
| 🌐 Network Controls | How do we limit blast radius when something goes wrong? |
Collapse any two of these into one system and you've introduced a future incident.
Step 1 — User Authentication
Users authenticate through an identity provider: Keycloak, Okta, Auth0, or Microsoft Entra ID.
At this point, identity is established. Nothing more.
JWT validation is NOT authorization
A valid JWT proves who the user is. It does not prove what they are allowed to do.
That's a critical distinction — and where policy evaluation enters the design.
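To make the split concrete, here is a minimal sketch of an authentication step that returns identity and nothing else. It assumes an HS256 token signed with a shared secret purely so the example stays within the standard library; a real IdP setup would more likely verify RS256 or ES256 signatures against the provider's published keys using a maintained JWT library. The `Identity` type, the `team` claim, and the `sign_token` helper are hypothetical names used for illustration.

```python
import base64
import hashlib
import hmac
import json
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Identity:
    """Who the caller is. Deliberately carries no permissions."""
    subject: str
    team: str


def _b64url_decode(segment: str) -> bytes:
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_token(claims: dict, secret: bytes) -> str:
    """Illustration helper: mint an HS256 token the way an IdP would."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signature = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(signature)}"


def authenticate(token: str, secret: bytes, now: Optional[float] = None) -> Identity:
    """Verify signature and expiry, then return identity claims only."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise PermissionError("malformed token") from None
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise PermissionError("unexpected algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise PermissionError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    current_time = time.time() if now is None else now
    if claims.get("exp", 0) <= current_time:
        raise PermissionError("token expired")
    # "team" is an assumed custom claim; the authorization decision happens elsewhere.
    return Identity(subject=claims["sub"], team=claims.get("team", ""))
```

The function answers "who is this?" and stops there. Whether that identity may read a given resource is a separate question for the policy engine.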
Step 2 — Authorization Through OPA
OPA (Open Policy Agent) becomes the centralized authorization engine. Applications query it for decisions instead of hardcoding complex rules scattered across services.
A Realistic Policy Example
A production policy should consider action, resource, ownership, and environment — not just a flat role string:
```rego
package authz

import rego.v1

default allow := false

allow if {
    input.user.role == "analyst"
    input.action == "read"
    input.resource.owner_team == input.user.team
    input.resource.environment != "production"
}
```
⚠️ OPA 1.0 note: Every current OPA release requires the `if` keyword before rule bodies and treats `rego.v1` semantics as standard. The older bare `allow { ... }` style will fail `opa check` on a modern install. If you're copying Rego from older blog posts, run `opa fmt --rego-v1` on it before trusting it in CI.
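On the application side, asking OPA for a decision on the policy above can be sketched roughly as follows. This assumes OPA runs as a sidecar reachable at `http://localhost:8181` and that the policy is loaded under the `authz` package, so the decision lives at `/v1/data/authz/allow` in OPA's Data API. The helper names and the fail-closed behavior are illustrative choices, not something the series prescribes.

```python
import json
import urllib.request

# Assumed sidecar address; adjust to wherever OPA actually listens.
OPA_URL = "http://localhost:8181/v1/data/authz/allow"


def build_input(user: dict, action: str, resource: dict) -> dict:
    """Shape the request body so it matches the fields the policy reads."""
    return {"input": {"user": user, "action": action, "resource": resource}}


def parse_decision(response_body: dict) -> bool:
    """Return True only for an explicit boolean true.

    OPA omits "result" when a rule is undefined, so a missing key is a deny.
    """
    return response_body.get("result") is True


def is_allowed(user: dict, action: str, resource: dict,
               url: str = OPA_URL, timeout: float = 2.0) -> bool:
    """Query OPA and fail closed if it is unreachable or replies with garbage."""
    payload = json.dumps(build_input(user, action, resource)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return parse_decision(json.load(response))
    except (OSError, ValueError):
        # Network errors and invalid JSON both end in a deny.
        return False
```

A call such as `is_allowed({"role": "analyst", "team": "risk"}, "read", {"owner_team": "risk", "environment": "staging"})` would then exercise every condition in the rule.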
Anti-pattern: JWT contains all permissions
Putting every permission in the token creates staleness and revocation problems — you can't un-issue a JWT that's already in someone's hands.
✅ Prefer: tokens for identity claims + OPA for dynamic authorization decisions.
Step 3 — Service Authentication (Workload Identity)
User authentication alone is not enough. Microservices also need strong identity.
Common anti-patterns to avoid:
- Shared API keys
- Shared service passwords
- Long-lived tokens copied across services
Kubernetes Workload Identity — Two Distinct Paths
It's worth being precise here, since "OIDC provider" gets used loosely in many guides:
Path A — Vault Kubernetes Auth Method
Vault validates the pod's projected service-account token against the cluster itself (the Kubernetes auth method calls the cluster's TokenReview API). No external OIDC provider is required.
Path B — Cloud-Native Federation
(AWS IRSA, GCP Workload Identity Federation, Azure AD Workload Identity)
The cluster's OIDC issuer federates with a cloud IAM provider, letting a pod assume a cloud role without any static keys.
Both achieve the same outcome — no long-lived secrets sitting in a pod — through different plumbing. Know which one you're actually running.
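For Path A, the login exchange can be sketched as below. It assumes Vault's Kubernetes auth method is mounted at the default `kubernetes` path and that a Vault role (here the hypothetical `service-a`) is bound to the pod's service account. The token path is the standard in-pod location for the service-account token; the function names are made up for this example.

```python
import json
import urllib.request
from pathlib import Path
from typing import Tuple

# Standard in-pod location of the service-account token.
SA_TOKEN_PATH = Path("/var/run/secrets/kubernetes.io/serviceaccount/token")


def build_login_request(vault_addr: str, role: str, jwt: str,
                        mount: str = "kubernetes") -> urllib.request.Request:
    """Build the POST to Vault's Kubernetes auth login endpoint."""
    body = json.dumps({"role": role, "jwt": jwt}).encode("utf-8")
    return urllib.request.Request(
        f"{vault_addr.rstrip('/')}/v1/auth/{mount}/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_login_response(body: dict) -> Tuple[str, int]:
    """Extract the short-lived Vault token and its TTL in seconds."""
    auth = body["auth"]
    return auth["client_token"], int(auth["lease_duration"])


def login(vault_addr: str, role: str) -> Tuple[str, int]:
    """Trade the pod's service-account token for a Vault token."""
    jwt = SA_TOKEN_PATH.read_text().strip()
    request = build_login_request(vault_addr, role, jwt)
    with urllib.request.urlopen(request, timeout=5) as response:
        return parse_login_response(json.load(response))
```

Nothing in this flow is a stored secret: the only input is the token Kubernetes already mounted into the pod, and the output expires on its own.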
Step 4 — Secret Retrieval and Dynamic Credentials
After authentication, services retrieve secrets or dynamic credentials on demand.
What changes in the modern model:
| Traditional | Modern |
|---|---|
| Hardcoded passwords | No hardcoded passwords |
| Shared, permanent credentials | Unique credentials per workload |
| Manual rotation | Short-lifetime, auto-rotated |
| Broad blast radius | Identity-scoped, limited exposure |
Each application receives unique credentials, a unique identity, and a short lifetime.
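A rough sketch of what "unique credentials with a short lifetime" looks like from the workload's side, assuming Vault's database secrets engine is mounted at `database` and a role exists for the service. The `DatabaseLease` type and the two-thirds refresh rule are illustrative assumptions; teams often delegate this to Vault Agent or a client library instead.

```python
import json
import time
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseLease:
    """One workload's short-lived database credential."""
    lease_id: str
    username: str
    password: str
    issued_at: float
    ttl_seconds: int

    def should_refresh(self, now: float, fraction: float = 2 / 3) -> bool:
        """True once the given fraction of the lease lifetime has elapsed."""
        return now >= self.issued_at + self.ttl_seconds * fraction


def parse_creds_response(body: dict, issued_at: float) -> DatabaseLease:
    """Map a Vault dynamic-credential response onto a DatabaseLease."""
    return DatabaseLease(
        lease_id=body["lease_id"],
        username=body["data"]["username"],
        password=body["data"]["password"],
        issued_at=issued_at,
        ttl_seconds=int(body["lease_duration"]),
    )


def fetch_database_lease(vault_addr: str, vault_token: str, role: str) -> DatabaseLease:
    """Ask Vault for a fresh credential scoped to this workload's role."""
    request = urllib.request.Request(
        f"{vault_addr.rstrip('/')}/v1/database/creds/{role}",
        headers={"X-Vault-Token": vault_token},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return parse_creds_response(json.load(response), issued_at=time.time())
```

Because each lease has its own `lease_id`, a single workload's access can later be revoked without touching anyone else's credentials.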
Step 5 — mTLS and Service Mesh
Service-to-service authentication becomes operationally manageable through a service mesh. Saying "we use mTLS" without an implementation path doesn't help anyone.
Istio Example (current stable API)
```yaml
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: prod
spec:
  mtls:
    mode: STRICT
---
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: service-b-allow-service-a
  namespace: prod
spec:
  selector:
    matchLabels:
      app: service-b
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/prod/sa/service-a"]
```
Two real corrections baked in here:
- `security.istio.io/v1` is the current stable API version (`v1beta1` is the legacy version still floating around in older tutorials)
- `action: ALLOW` is written explicitly. Istio defaults to ALLOW when omitted, but spelling it out makes the policy's intent obvious in a code review six months from now, and obvious at 2 a.m. during an incident
Does a service mesh replace Vault?
No. They solve different problems:
- Mesh (Istio, Linkerd): service-to-service transport security and identity via mTLS
- Secret manager (Vault, AWS Secrets Manager, Azure Key Vault): credential issuance, rotation, and storage for database passwords, API keys, etc.
Most mature architectures run both.
Step 6 — Encryption Architecture
A common mistake: letting applications own long-lived encryption keys. That creates key sprawl and inconsistent key handling.
Prefer a central cryptographic service (Vault Transit engine or a cloud KMS equivalent).
Keys remain protected because they never live inside the application process as ordinary static assets. The application calls the cryptographic service; the keys never leave it.
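As an illustration of "the application calls the cryptographic service", the sketch below wraps Vault's Transit encrypt and decrypt endpoints. It assumes the Transit engine is mounted at `transit` and that a named key already exists; the function names are hypothetical. Transit expects plaintext to be base64-encoded in the request and returns it base64-encoded on decrypt.

```python
import base64
import json
import urllib.request


def build_encrypt_payload(plaintext: bytes) -> dict:
    """Transit requires the plaintext to be base64-encoded."""
    return {"plaintext": base64.b64encode(plaintext).decode("ascii")}


def parse_encrypt_response(body: dict) -> str:
    """Return the opaque ciphertext string (for example "vault:v1:...")."""
    return body["data"]["ciphertext"]


def parse_decrypt_response(body: dict) -> bytes:
    """Decode the base64 plaintext that Transit returns."""
    return base64.b64decode(body["data"]["plaintext"])


def _post(url: str, token: str, payload: dict) -> dict:
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


def encrypt(vault_addr: str, token: str, key_name: str, plaintext: bytes) -> str:
    """Encrypt with a key that never leaves Vault."""
    url = f"{vault_addr.rstrip('/')}/v1/transit/encrypt/{key_name}"
    return parse_encrypt_response(_post(url, token, build_encrypt_payload(plaintext)))


def decrypt(vault_addr: str, token: str, key_name: str, ciphertext: str) -> bytes:
    """Send ciphertext back to Vault and receive the plaintext bytes."""
    url = f"{vault_addr.rstrip('/')}/v1/transit/decrypt/{key_name}"
    return parse_decrypt_response(_post(url, token, {"ciphertext": ciphertext}))
```

The application stores only the returned ciphertext string. It never sees key material, so rotating or revoking the key is a Vault operation rather than a redeploy.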
Step 7 — Network Policies
Zero Trust is incomplete if every pod can talk to every other pod by default.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-service-a-to-db
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: service-a
      ports:
        - protocol: TCP
          port: 5432
```
NetworkPolicy vs. service mesh:
NetworkPolicy controls reachability at the IP/port level. It doesn't provide encryption or cryptographic identity. Use both: NetworkPolicy to shrink the attack surface, mTLS/mesh to verify identity on the connections that are allowed.
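The layering can be modeled in a few lines. This is a deliberately simplified toy, not how either control is implemented: in a real cluster the network rule is enforced by the CNI plugin and the identity rule by the mesh proxies. The `Connection` type and function names are invented for the example; the principal string follows the `<trust-domain>/ns/<namespace>/sa/<service-account>` form used in Istio authorization policies.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Connection:
    """A simplified view of one inbound connection attempt."""
    source_labels: Dict[str, str]   # pod labels, seen by the network layer
    source_principal: str           # mTLS identity, seen by the mesh layer
    port: int


def istio_principal(namespace: str, service_account: str,
                    trust_domain: str = "cluster.local") -> str:
    """Build a principal string in the form Istio policies match on."""
    return f"{trust_domain}/ns/{namespace}/sa/{service_account}"


def network_policy_allows(conn: Connection, allowed_labels: Dict[str, str],
                          allowed_port: int) -> bool:
    """Reachability check: label selector plus port, no identity involved."""
    labels_match = all(
        conn.source_labels.get(key) == value for key, value in allowed_labels.items()
    )
    return labels_match and conn.port == allowed_port


def mesh_allows(conn: Connection, allowed_principals: Set[str]) -> bool:
    """Identity check: the caller's verified principal must be on the list."""
    return conn.source_principal in allowed_principals


def connection_permitted(conn: Connection, allowed_labels: Dict[str, str],
                         allowed_port: int, allowed_principals: Set[str]) -> bool:
    """Both layers must agree before traffic is accepted."""
    return (network_policy_allows(conn, allowed_labels, allowed_port)
            and mesh_allows(conn, allowed_principals))
```

A pod that carries the right label but presents a different workload identity passes the first check and fails the second, which is the reason for running both.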
Step 8 — Audit Architecture
Security without auditing is incomplete. Every meaningful decision should be traceable.
Questions your platform should answer quickly:
- Who accessed data?
- When?
- Which policy allowed it?
- Which secret or lease was used?
- Which workload identity made the call?
Useful detections to build
- A sudden spike in secret reads
- Access to secrets outside normal namespace or service patterns
- Lease revocation followed by repeated failed usage
- Cross-region secret access anomalies
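The first detection in the list, a sudden spike in secret reads, can be sketched as a sliding-window counter. The input here is an assumed, already-normalized stream of `(timestamp, identity)` pairs sorted by time; extracting those from Vault or cloud audit logs is left out, and the window and threshold values are placeholders. In practice this logic usually lives in the SIEM's own rule language.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List, Set, Tuple


def detect_read_spikes(events: Iterable[Tuple[float, str]],
                       window_seconds: float = 60.0,
                       threshold: int = 20) -> List[Tuple[str, float, int]]:
    """Flag identities whose secret reads exceed `threshold` within the window.

    `events` must be sorted by timestamp. Returns (identity, timestamp, count)
    once per spike, at the moment the threshold is first crossed.
    """
    recent: Dict[str, Deque[float]] = defaultdict(deque)
    alerting: Set[str] = set()
    alerts: List[Tuple[str, float, int]] = []

    for timestamp, identity in events:
        window = recent[identity]
        window.append(timestamp)
        # Drop reads that have aged out of the sliding window.
        while window and window[0] <= timestamp - window_seconds:
            window.popleft()

        if len(window) > threshold:
            if identity not in alerting:
                alerts.append((identity, timestamp, len(window)))
                alerting.add(identity)
        else:
            # Back under the threshold: allow a future spike to alert again.
            alerting.discard(identity)

    return alerts
```

Keying the window on workload identity rather than source IP only works because each service has its own identity, which is exactly what the earlier steps set up.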
Incident Response Runbook — Compromised Service
Suppose Service A is compromised.
Traditional architecture often means: shared credentials, permanent access, weak attribution — a bad day.
Modern architecture enables this runbook:
- Identify the workload identity, token accessor, or Vault lease tied to Service A
- Revoke the affected Vault token or leases
- Trigger dynamic credential rotation where needed
- Query audit logs for every action performed by that identity
- Tighten policy or disable the role
- Redeploy the workload with corrected configuration
- Review whether network policy and mesh policy limited the blast radius
This is where short-lived credentials and strong audit trails become operational advantages, not just design principles on a slide.
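The revocation steps of the runbook can be scripted ahead of time so nobody types API paths under pressure. The sketch below only builds and replays a plan; it assumes the compromised service's token accessor and lease prefix are already known from the audit trail, and that the operator's Vault token has the elevated permissions these endpoints need. `VaultCall`, `build_containment_plan`, and `execute` are hypothetical names, and `send` is whatever HTTP function the team already trusts.

```python
from typing import Callable, List, NamedTuple, Optional


class VaultCall(NamedTuple):
    """One HTTP call to make against Vault during containment."""
    method: str
    path: str
    body: Optional[dict]


def build_containment_plan(token_accessor: str, lease_prefix: str) -> List[VaultCall]:
    """Return the ordered revocation calls for one compromised workload."""
    return [
        # Revoke the workload's Vault token without needing the token itself.
        VaultCall("POST", "/v1/auth/token/revoke-accessor",
                  {"accessor": token_accessor}),
        # Revoke every lease issued under the workload's credential path.
        VaultCall("PUT",
                  f"/v1/sys/leases/revoke-prefix/{lease_prefix.strip('/')}",
                  None),
    ]


def execute(plan: List[VaultCall],
            send: Callable[[str, str, Optional[dict]], None]) -> None:
    """Replay the plan through a caller-supplied HTTP function, in order."""
    for call in plan:
        send(call.method, call.path, call.body)
```

Passing a `send` that only prints gives a dry run for tabletop exercises; swapping in a real HTTP client turns the same plan into the live response.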
Security Maturity Model
| Level | Description |
|---|---|
| 0 | Passwords in source code |
| 1 | Environment variables and manually managed credentials |
| 2 | Kubernetes Secrets and basic secret segregation |
| 3 | Secret manager + scheduled rotation |
| 4 | Identity-based access, dynamic credentials, workload identity |
| 5 | Zero Trust, policy as code, continuous verification, strong audit analytics, mesh identity, reduced lateral movement |
The model exists so teams can choose the next practical step instead of trying to jump directly to Level 5 — which mostly produces a half-finished Vault deployment and a very tired platform team.
Decision Matrix
Small Startup
Recommended pattern: OIDC + OPA + Cloud secret manager + Namespace-scoped RBAC + Network policies
Benefits: Centralized authorization, better governance, better traceability
Large Enterprise
Recommended pattern: OIDC + OPA + Vault + PKI + mTLS + Dynamic credentials + SIEM + Formal runbooks
Benefits: Fine-grained authorization, stronger incident containment, better compliance posture, separation of duties
Common Anti-Patterns
| Anti-pattern | Why it fails |
|---|---|
| JWT contains every permission | Permissions become stale and hard to revoke |
| Database password shared across services | No accountability, large blast radius |
| Secrets stored in Git | Permanent exposure risk — history doesn't forget, and neither do forks |
| Long-lived cloud access keys | Credential theft risk; prefer managed identity / workload identity |
| Single shared Vault admin token across teams | Broken accountability and dangerous privilege concentration |
Quick Self-Check
Before reading further — how many of these apply to your stack right now?
- [ ] JWTs contain all user permissions
- [ ] A database password is shared across multiple services
- [ ] Secrets are stored (or ever were) in Git
- [ ] Long-lived cloud access keys are in use
- [ ] A single Vault admin token is shared across teams
If you checked any box, this series was written for exactly where you are.
FAQ
Is OPA the same as Kubernetes RBAC?
No. Kubernetes RBAC controls access to the Kubernetes API itself (who can create a Deployment, read a Secret). OPA is a general-purpose policy engine that your applications call for their own authorization decisions. It can also evaluate Kubernetes admission requests via Gatekeeper — but its scope goes well beyond the cluster API.
Do I need Vault if I'm a small team?
Usually not on day one. A cloud-native secret manager paired with an OIDC provider covers most early-stage needs with far less operational overhead. Vault earns its complexity once you have multiple teams, multiple clouds, or compliance requirements that demand fine-grained, auditable access.
What's the single highest-leverage change at Level 1 or 2?
Move from static, shared credentials to short-lived, identity-bound ones — even before introducing OPA or a mesh. Almost every other improvement in this series compounds on top of that one change.
Read the Full Article
The complete reference architecture diagrams, Vault PKI + cert-manager integration details, compliance mapping (SOC 2, ISO 27001, PCI DSS, HIPAA, NIST CSF), and the full SIEM integration guide are in the original article:
👉 Building a Zero-Trust Security Architecture — Part 5
A Note of Thanks
This series is published in Towards AI — one of the leading publications for AI, ML, and engineering content. A huge thank you to the Towards AI team for their support in helping this work reach engineers and architects who are building these systems right now. If you're not following them yet, you should be.
Where Is Your Stack?
The series covers a maturity model from Level 0 to Level 5.
What level is your team actually at — and what's the biggest blocker to moving up?
Drop a comment below. If something in here matches — or doesn't match — what you're running in production, I'd genuinely like to hear about it. 👇
Part of the Building a Zero-Trust Security Architecture series.
Part 1 · Part 2 · Part 3 · Part 4 · **Part 5 — you are here**