Sibasish Mohanty

Posted on Sep 11

Don’t Trust, Just Verify: Auth, Faults, and Monitoring

#systemdesign #authjs #monitoring #interview

In part 6 of our System Design series, we’ll cover the pillars of reliability and security that keep services available and protected.

We’ll cover:

AuthN & AuthZ – OAuth2, JWT, RBAC/ABAC
Resilience – Circuit breakers, timeouts, retries
Observability – Logs, metrics, traces, SLI/SLO
Health Checks – Detect failures, auto-replace
Redundancy – Avoid single points of failure (SPOFs); multi-AZ/region

1. AuthN & AuthZ

TL;DR: Authentication proves identity; authorization checks permissions.

OAuth2: Delegated authorization, useful for third-party apps.
JWT (JSON Web Tokens): Stateless, signed tokens that carry identity claims; the server verifies the signature instead of looking up a session.
RBAC (Role-Based Access Control): Permissions based on roles.
ABAC (Attribute-Based Access Control): Fine-grained control based on attributes.
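To make the RBAC/ABAC difference concrete, here is a minimal sketch. The role table, the `User` and `Order` types, and the permission strings are all hypothetical names chosen for illustration, not part of any framework.

```python
from dataclasses import dataclass

# Hypothetical role-to-permission table used by the RBAC check.
ROLE_PERMISSIONS = {
    "admin": {"orders:read", "orders:write"},
    "viewer": {"orders:read"},
}


@dataclass(frozen=True)
class User:
    user_id: str
    roles: frozenset
    tenant_id: str


@dataclass(frozen=True)
class Order:
    order_id: str
    owner_id: str
    tenant_id: str


def rbac_allows(user: User, permission: str) -> bool:
    """RBAC: allowed if any of the user's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in user.roles)


def abac_allows(user: User, order: Order) -> bool:
    """ABAC: decide from attributes of both the user and the resource."""
    return user.tenant_id == order.tenant_id and user.user_id == order.owner_id
```

RBAC answers "may this kind of user do this kind of thing?", while ABAC can also answer "may this user touch this particular record?". In practice the two are often combined.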

👉 Example: User logs in → receives a JWT → Access to /orders endpoint is verified using claims in the JWT.

👉 Interview tie-in: "How would you design auth for a multi-tenant SaaS app?" — Use OAuth2 + scoped JWTs.

2. Resilience

TL;DR: Systems must fail gracefully.

Circuit Breakers: Stop calling a failing service to prevent cascading failure.
Timeouts: Don’t wait forever for a response.
Retries: Retry transient failures with exponential backoff (plus jitter), and only for operations that are safe to repeat.
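A minimal sketch of retries with exponential backoff and jitter follows. `TransientError`, the function name, and the default delays are assumptions for illustration; the `sleep` parameter is injectable only so the example is easy to test.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def retry_with_backoff(call, max_attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call `call()` until it succeeds or the attempts run out."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last failure
            # The cap doubles each attempt: 0.1s, 0.2s, 0.4s, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": sleep a random amount up to the cap so that
            # many clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

Only `TransientError` is retried here; anything else propagates immediately, which is usually what you want for validation errors or authorization failures.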

👉 Example: Payment service calls third-party gateway; use circuit breaker to avoid repeated failures.

👉 Interview tie-in: "What happens when a downstream service is down?" — Circuit breaker opens, fallback or error is returned.
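Here is a small circuit breaker sketch to show the closed, open, and half-open behaviour. The class name, the thresholds, and the injectable `clock` are assumptions; production libraries add per-error filtering, metrics, and thread safety.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Half-open: the timeout has elapsed, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would typically catch `CircuitOpenError` and return a fallback (cached data, a queued retry, or a clear error) instead of waiting on a service that is known to be down.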

3. Observability

TL;DR: If you can’t measure it, you can’t improve it.

Logs: For detailed debugging
Metrics: Quantitative measures (e.g., QPS, error rate)
Traces: End-to-end request flow (distributed tracing)
SLI (Service Level Indicator): The measured value (e.g., the percentage of requests served in < 200ms)
SLO (Service Level Objective): The target for that SLI (e.g., 99.9% of requests < 200ms)
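The SLI/SLO relationship is easy to see in a few lines. This sketch assumes you already have a list of request latencies in milliseconds; the function names and the 200ms / 99.9% defaults mirror the examples above and are otherwise arbitrary.

```python
def latency_sli(latencies_ms, threshold_ms=200.0):
    """SLI: fraction of requests that finished under the threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli, slo=0.999):
    """SLO: the target the measured SLI is compared against."""
    return sli >= slo


def error_budget_remaining(sli, slo=0.999):
    """Share of the error budget still unspent (negative once the SLO is missed)."""
    budget = 1.0 - slo    # allowed fraction of slow requests
    spent = 1.0 - sli     # observed fraction of slow requests
    return (budget - spent) / budget
```

For example, with 1,000 requests of which 2 are slow, the SLI is 0.998, which misses a 99.9% SLO: the budget allowed 1 slow request and 2 were observed.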

👉 Example: Grafana dashboard shows error rates and latency percentiles.

👉 Interview tie-in: "How do you monitor a microservice-based architecture?" — Combine logs, metrics, and tracing.

4. Health Checks

TL;DR: Detect problems early and recover automatically.

Liveness Check: Is the process alive?
Readiness Check: Can it handle requests?

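👉 Example: Kubernetes probes each container every few seconds; it restarts the container when the liveness probe keeps failing and stops routing traffic to the pod when the readiness probe fails.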

👉 Interview tie-in: "How do you prevent traffic to unhealthy services?" — Use readiness probes + service discovery.
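The liveness/readiness split can be sketched as a tiny probe handler. The `/healthz` and `/readyz` paths are a common convention rather than a requirement, and `ServiceState` with its two flags is a hypothetical stand-in for whatever your service actually depends on.

```python
from http import HTTPStatus


class ServiceState:
    """Hypothetical snapshot of what the service needs in order to serve traffic."""

    def __init__(self):
        self.db_connected = False
        self.shutting_down = False


def probe(path, state):
    """Return (status, body) for a probe request."""
    if path == "/healthz":
        # Liveness: the process is running and able to answer at all.
        return HTTPStatus.OK, "alive"
    if path == "/readyz":
        # Readiness: dependencies are up and we are not draining.
        if state.db_connected and not state.shutting_down:
            return HTTPStatus.OK, "ready"
        return HTTPStatus.SERVICE_UNAVAILABLE, "not ready"
    return HTTPStatus.NOT_FOUND, "unknown probe"
```

Keeping liveness cheap and dependency-free matters: if liveness also checked the database, a database outage would restart every instance instead of just taking them out of rotation.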

5. Redundancy

TL;DR: No SPOFs. Multi-AZ or Multi-Region deployments for resilience.

👉 Example: Primary DB in us-east-1, replica in us-west-1.

👉 Interview tie-in: "How do you handle data center failure?" — Multi-region replication, failover mechanisms.
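At its simplest, failover is "use the first healthy endpoint in priority order". The sketch below assumes hypothetical hostnames and a caller-supplied `is_healthy` check; it deliberately ignores the hard parts of real database failover, such as replication lag and promoting the replica to primary.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first healthy endpoint, trying them in priority order."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint available")


# Hypothetical hostnames: primary region first, replica region second.
DB_ENDPOINTS = [
    "db.us-east-1.example.internal",
    "db.us-west-1.example.internal",
]
```

If the `us-east-1` endpoint fails its health check, `pick_endpoint(DB_ENDPOINTS, is_healthy)` falls through to `us-west-1`; if both fail, the caller gets an explicit error rather than a hang.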

✅ Takeaways

Design auth flows with security and scalability in mind
Build resilience using circuit breakers, timeouts, retries
Implement observability: logs, metrics, traces, SLI/SLO
Health checks prevent serving requests from unhealthy instances
Use redundancy to avoid SPOFs and enable disaster recovery

💡 Practice Question:

"Design an auth system for an API gateway in a microservices architecture. How do you enforce per-service access control and trace requests end-to-end?"

DEV Community