DEV Community

Sibasish Mohanty

Don’t Trust, Just Verify: Auth, Faults, and Monitoring

In part 6 of our System Design series, we’ll tackle the pillars of reliability and security that keep services available and safe.

We’ll cover:

  1. AuthN & AuthZ – OAuth2, JWT, RBAC/ABAC
  2. Resilience – Circuit breakers, timeouts, retries
  3. Observability – Logs, metrics, traces, SLI/SLO
  4. Health Checks – Detect failures, auto-replace
  5. Redundancy – Avoid single points of failure (SPOFs); multi-AZ/region

1. AuthN & AuthZ

TL;DR: Authentication (AuthN) proves identity; authorization (AuthZ) checks permissions.

  • OAuth2: Delegated authorization; a user grants a third-party app limited access without sharing credentials.
  • JWT (JSON Web Tokens): Signed, self-contained tokens that carry claims; a service can verify them without a session lookup.
  • RBAC (Role-Based Access Control): Permissions based on roles.
  • ABAC (Attribute-Based Access Control): Fine-grained control based on attributes.
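
To make the RBAC vs. ABAC difference concrete, here is a minimal sketch. The role table, the `User` and `Order` types, and the permission strings are all hypothetical names chosen for illustration, not part of any particular framework.

```python
from dataclasses import dataclass

# Hypothetical role-to-permission table; names are illustrative only.
ROLE_PERMISSIONS = {
    "admin": {"orders:read", "orders:write"},
    "viewer": {"orders:read"},
}


@dataclass
class User:
    id: str
    roles: tuple
    tenant_id: str


@dataclass
class Order:
    id: str
    owner_id: str
    tenant_id: str


def rbac_allows(user: User, permission: str) -> bool:
    """RBAC: the decision depends only on the roles the user holds."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in user.roles)


def abac_allows(user: User, order: Order, action: str) -> bool:
    """ABAC: the decision also looks at attributes of the user and the resource."""
    if user.tenant_id != order.tenant_id:
        return False  # never cross tenant boundaries
    if action == "read":
        return True
    # Assumed policy for this sketch: only the owner may modify an order.
    return action == "write" and user.id == order.owner_id
```

RBAC is easier to audit; ABAC can express rules such as "only the owner, within the same tenant" that roles alone cannot.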

👉 Example: User logs in → receives a JWT → on each request to /orders, the service verifies the token’s signature and checks its claims before allowing access.
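
The flow above can be sketched with only the standard library. This is a simplified HS256 example, assuming a shared secret and a space-separated `scope` claim; the `SECRET` value and the function names are made up for illustration, and a production system would more likely rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # assumption: shared HMAC key, for illustration only


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> str:
    return _b64url(hmac.new(SECRET, signing_input.encode(), hashlib.sha256).digest())


def issue_token(claims: dict) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    return f"{header}.{payload}.{_sign(f'{header}.{payload}')}"


def verify_token(token: str) -> dict:
    """Return the claims if the signature and expiry are valid, else raise ValueError."""
    header, payload, signature = token.split(".")  # ValueError if malformed
    # Always verify with HS256; the algorithm named in the header is not trusted.
    expected = _sign(f"{header}.{payload}")
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload))
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    return claims


def can_access_orders(token: str) -> bool:
    """Authorization step: the token must be valid AND carry the right scope."""
    try:
        claims = verify_token(token)
    except ValueError:
        return False
    return "orders:read" in claims.get("scope", "").split()
```

Note the two separate steps: verifying the token is authentication, checking the `scope` claim is authorization.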

👉 Interview tie-in: "How would you design auth for a multi-tenant SaaS app?" — Use OAuth2 + scoped JWTs.


2. Resilience

TL;DR: Systems must fail gracefully.

  • Circuit Breakers: Stop calling a failing service to prevent cascading failure.
  • Timeouts: Don’t wait forever for a response.
  • Retries: Retry transient failures with exponential backoff (plus jitter), and only for operations that are safe to repeat.
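
As a rough sketch of the retry bullet, the helper below retries with exponential backoff and full jitter. `TransientError` and `retry_with_backoff` are hypothetical names; the sketch assumes `operation` enforces its own timeout and is safe to repeat.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call operation(); on TransientError wait and retry, up to max_attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # The delay cap doubles each attempt: 0.1s, 0.2s, 0.4s, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

The `sleep` parameter is injectable only so the helper can be exercised without real waiting.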

👉 Example: A payment service calls a third-party gateway; a circuit breaker stops it from hammering the gateway while the gateway is failing.

👉 Interview tie-in: "What happens when a downstream service is down?" — Circuit breaker opens, fallback or error is returned.
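
Here is one way the open/half-open/closed behavior could look in code. It is a deliberately small, single-threaded sketch; the class name, thresholds, and the injectable `clock` are assumptions for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would typically catch `CircuitOpenError` and return a fallback (a cached value, a queued request, or a clear error) instead of waiting on the failing dependency.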


3. Observability

TL;DR: If you can’t measure it, you can’t improve it.

  • Logs: For detailed debugging
  • Metrics: Quantitative measures (e.g., QPS, error rate)
  • Traces: End-to-end request flow (distributed tracing)
  • SLI (Service Level Indicator): The measured value (e.g., the percentage of requests served in under 200ms)
  • SLO (Service Level Objective): Target (e.g., 99.9% of requests < 200ms)
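
The SLI/SLO relationship is easy to see with a little arithmetic. In this sketch the function names and the sample latencies are made up; the 200ms threshold and 99.9% target come from the bullets above.

```python
def latency_sli(latencies_ms, threshold_ms=200):
    """SLI: fraction of requests served faster than the threshold."""
    if not latencies_ms:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def error_budget_remaining(sli, slo=0.999):
    """Share of the error budget still unspent; negative means the SLO is breached."""
    budget = 1.0 - slo          # allowed fraction of bad requests
    spent = 1.0 - sli           # observed fraction of bad requests
    return (budget - spent) / budget


# Hypothetical sample: 998 fast requests and 2 slow ones.
sample = [100] * 998 + [300] * 2
sli = latency_sli(sample)
print(f"SLI = {sli:.3f}, SLO met: {sli >= 0.999}")
```

With 2 slow requests out of 1,000 the SLI is 99.8%, which misses a 99.9% SLO: the budget allowed 1 slow request and 2 were observed.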

👉 Example: Grafana dashboard shows error rates and latency percentiles.

👉 Interview tie-in: "How do you monitor a microservice-based architecture?" — Combine logs, metrics, and tracing.
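
One way to tie logs and traces together is to pass a trace ID along with every request and include it in every log line. The sketch below assumes a hypothetical `X-Trace-Id` header and helper names; real deployments often use the W3C `traceparent` header and a tracing library instead.

```python
import json
import uuid

TRACE_HEADER = "X-Trace-Id"  # assumption: header name chosen for this sketch


def ensure_trace_id(headers: dict) -> str:
    """Reuse the caller's trace ID, or start a new trace at the edge."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex


def outgoing_headers(trace_id: str, extra: dict = None) -> dict:
    """Headers for a downstream call, carrying the same trace ID forward."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id
    return headers


def log_event(service: str, trace_id: str, message: str, **fields) -> str:
    """Emit one structured (JSON) log line that can be joined on trace_id."""
    record = {"service": service, "trace_id": trace_id, "message": message, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line
```

Searching logs for a single `trace_id` then shows every service a request touched, in order.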


4. Health Checks

TL;DR: Detect problems early and recover automatically.

  • Liveness Check: Is the process alive?
  • Readiness Check: Can it handle requests?

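👉 Example: Kubernetes probes each container every few seconds; it restarts a container that fails its liveness probe and stops routing traffic to a pod that fails its readiness probe.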

👉 Interview tie-in: "How do you prevent traffic to unhealthy services?" — Use readiness probes + service discovery.
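
The difference between the two checks matters for what happens next: a failed liveness check means restart, a failed readiness check means stop sending traffic. A toy model, with a hypothetical `Instance` type and made-up readiness conditions:

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    process_alive: bool = True
    dependencies_ok: bool = True  # e.g. database connection established
    warmed_up: bool = True        # e.g. caches loaded after startup


def liveness(instance: Instance) -> bool:
    """Is the process alive? If not, the only fix is a restart."""
    return instance.process_alive


def readiness(instance: Instance) -> bool:
    """Can it handle requests right now? If not, route around it but leave it running."""
    return instance.process_alive and instance.dependencies_ok and instance.warmed_up


def routable(instances) -> list:
    """Names of instances a load balancer should send traffic to."""
    return [i.name for i in instances if readiness(i)]


def needs_restart(instances) -> list:
    """Names of instances an orchestrator should restart."""
    return [i.name for i in instances if not liveness(i)]
```

An instance that is still warming up is alive but not ready, so it is skipped by `routable` without being restarted.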


5. Redundancy

TL;DR: No SPOFs. Multi-AZ or Multi-Region deployments for resilience.

👉 Example: Primary DB in us-east-1, replica in us-west-1.

👉 Interview tie-in: "How do you handle data center failure?" — Multi-region replication, failover mechanisms.
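
A failover decision can be sketched as "use the primary if it is healthy, otherwise pick the freshest healthy replica". The `Database` type, the lag limit, and `choose_writer` are assumptions for illustration; real failover also has to deal with split-brain and with data lost to replication lag.

```python
from dataclasses import dataclass


@dataclass
class Database:
    region: str
    role: str  # "primary" or "replica"
    healthy: bool = True
    replication_lag_s: float = 0.0


def choose_writer(databases, max_lag_s=5.0) -> Database:
    """Return the node that should take writes: the healthy primary, or the
    healthy replica with the least replication lag as a promotion candidate."""
    for db in databases:
        if db.role == "primary" and db.healthy:
            return db
    candidates = [
        db for db in databases
        if db.role == "replica" and db.healthy and db.replication_lag_s <= max_lag_s
    ]
    if not candidates:
        raise RuntimeError("no healthy node available for failover")
    return min(candidates, key=lambda db: db.replication_lag_s)
```

With a primary in us-east-1 and a replica in us-west-1, losing the primary's region makes the us-west-1 replica the candidate, provided its lag is within the limit.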


✅ Takeaways

  • Design auth flows with security and scalability in mind
  • Build resilience using circuit breakers, timeouts, retries
  • Implement observability: logs, metrics, traces, SLI/SLO
  • Health checks prevent serving requests from unhealthy instances
  • Use redundancy to avoid SPOFs and enable disaster recovery

💡 Practice Question:

"Design an auth system for an API gateway in a microservices architecture. How do you enforce per-service access control and trace requests end-to-end?"
