In part 6 of our System Design series, we’ll tackle the pillars of reliability and security that keep services available and protected.
We’ll cover:
- AuthN & AuthZ – OAuth2, JWT, RBAC/ABAC
- Resilience – Circuit breakers, timeouts, retries
- Observability – Logs, metrics, traces, SLI/SLO
- Health Checks – Detect failures, auto-replace
- Redundancy – Avoid single points of failure (SPOFs); multi-AZ/region
1. AuthN & AuthZ
TL;DR: Authentication proves identity, Authorization checks permissions.
- OAuth2: Delegated authorization, useful for third-party apps.
- JWT (JSON Web Tokens): Stateless, signed tokens for identity verification.
- RBAC (Role-Based Access Control): Permissions based on roles.
- ABAC (Attribute-Based Access Control): Fine-grained control based on attributes.
👉 Example: User logs in → receives a JWT → access to the /orders endpoint is verified using claims in the JWT.
👉 Interview tie-in: "How would you design auth for a multi-tenant SaaS app?" — Use OAuth2 + scoped JWTs.
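The login → JWT → /orders flow above can be sketched as follows. This is a minimal illustration using only the Python standard library, assuming HS256 (a shared secret) for signing; a production system would normally use a maintained JWT library and often asymmetric keys. The claim names `tenant` and `roles`, the `ROLE_PERMISSIONS` table, and the `orders:read` permission string are hypothetical, chosen only to show a role check (RBAC) combined with one attribute check (the tenant).

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_jwt(claims: dict, secret: bytes) -> str:
    """Issue a compact HS256 token: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_jwt(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if the signature and expiry are valid, else raise."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        raise PermissionError("malformed token") from None
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise PermissionError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise PermissionError("unexpected algorithm")
    claims = json.loads(_b64url_decode(payload_b64))
    # A token without an "exp" claim is treated as already expired.
    if claims.get("exp", 0) <= (time.time() if now is None else now):
        raise PermissionError("token expired")
    return claims


# Hypothetical role-to-permission table (RBAC).
ROLE_PERMISSIONS = {
    "customer": {"orders:read"},
    "support": {"orders:read", "orders:refund"},
}


def authorize(claims: dict, permission: str, tenant_id: str) -> bool:
    """Allow only if the token belongs to this tenant and a role grants the permission."""
    if claims.get("tenant") != tenant_id:
        return False
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in claims.get("roles", [])
    )
```

A handler for /orders would call `verify_jwt` first (authentication) and then `authorize(claims, "orders:read", tenant_id)` (authorization), returning 401 for the former failing and 403 for the latter.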
2. Resilience
TL;DR: Systems must fail gracefully.
- Circuit Breakers: Stop calling a failing service to prevent cascading failure.
- Timeouts: Don’t wait forever for a response.
- Retries: Retry transient failures with exponential backoff.
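As a rough sketch of retrying with exponential backoff, the helper below doubles the delay after each failed attempt, caps it, and adds jitter so that many clients do not retry in lockstep. `TransientError`, `backoff_delays`, and `call_with_retries` are hypothetical names; the sketch assumes `operation` enforces its own timeout and raises `TransientError` for failures that are safe to retry.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. a timeout or an HTTP 503)."""


def backoff_delays(attempts, base=0.1, cap=5.0):
    """Upper bounds for the waits between attempts: base * 2**n, capped at `cap`."""
    return [min(cap, base * (2 ** n)) for n in range(attempts - 1)]


def call_with_retries(operation, attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation` up to `attempts` times, backing off between tries."""
    delays = backoff_delays(attempts, base, cap)
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Jitter: sleep a random fraction of the capped exponential delay.
            sleep(delays[attempt] * rng())
```

Only retry operations that are idempotent or carry an idempotency key; retrying a non-idempotent payment call can charge a customer twice.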
👉 Example: A payment service calls a third-party gateway; a circuit breaker stops it from repeatedly calling the gateway while the gateway is failing.
👉 Interview tie-in: "What happens when a downstream service is down?" — Circuit breaker opens, fallback or error is returned.
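The open/fallback behavior described above can be sketched as a small state machine. This is a simplified, single-threaded illustration, not a production implementation: `CircuitBreaker`, its parameters, and the consecutive-failure policy are assumptions, and real libraries typically add locking and sliding-window failure rates.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # allow a trial call through
        return "open"

    def call(self, operation, fallback=None):
        if self.state == "open":
            # Fail fast instead of waiting on a service known to be failing.
            if fallback is not None:
                return fallback()
            raise CircuitOpenError("circuit is open; skipping call")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

For the payment example, the fallback might queue the charge for later processing or return a clear "try again shortly" error rather than letting requests pile up behind a dead gateway.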
3. Observability
TL;DR: If you can’t measure it, you can’t improve it.
- Logs: For detailed debugging
- Metrics: Quantitative measures (e.g., QPS, error rate)
- Traces: End-to-end request flow (distributed tracing)
- SLI (Service Level Indicator): What you actually measure (e.g., the percentage of requests served in under 200ms)
- SLO (Service Level Objective): The target for an SLI (e.g., 99.9% of requests < 200ms)
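To make the SLI/SLO distinction concrete, here is a small sketch that computes a latency SLI from raw samples and compares it with an SLO target. The function names and the error-budget helper are illustrative assumptions; in practice these numbers usually come from a metrics system rather than an in-memory list.

```python
def latency_sli(latencies_ms, threshold_ms=200.0):
    """SLI: fraction of requests that completed faster than the threshold."""
    if not latencies_ms:
        raise ValueError("no samples in the measurement window")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli, slo=0.999):
    """SLO check: is the measured SLI at or above the target?"""
    return sli >= slo


def error_budget_remaining(sli, slo=0.999):
    """Share of the error budget left: 1.0 is untouched, 0.0 is spent, negative is overspent."""
    budget = 1.0 - slo   # allowed fraction of slow requests
    spent = 1.0 - sli    # observed fraction of slow requests
    return 1.0 - spent / budget


# 1000 requests in the window, 2 of them slower than 200 ms.
samples = [100.0] * 998 + [250.0] * 2
sli = latency_sli(samples)
print(f"SLI={sli:.4f} meets_slo={meets_slo(sli)}")
```

With a 99.9% target, only 1 request in 1000 may be slow, so two slow requests in this window already overspend the budget.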
👉 Example: Grafana dashboard shows error rates and latency percentiles.
👉 Interview tie-in: "How do you monitor a microservice-based architecture?" — Combine logs, metrics, and tracing.
4. Health Checks
TL;DR: Detect problems early and recover automatically.
- Liveness Check: Is the process alive?
- Readiness Check: Can it handle requests?
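👉 Example: Kubernetes probes each container every few seconds; it restarts a container that fails its liveness probe and stops sending traffic to a pod that fails its readiness probe.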
👉 Interview tie-in: "How do you prevent traffic to unhealthy services?" — Use readiness probes + service discovery.
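A minimal sketch of the two checks, using Python's built-in `http.server`. The `/healthz` and `/readyz` paths, the `HealthState` class, and the `dependencies_ready` flag are assumptions for illustration; a real service would set the flag from actual dependency checks (database connection, cache warm-up, and so on).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthState:
    """Tracks whether this instance can serve traffic."""

    def __init__(self):
        self.dependencies_ready = False  # flipped once startup work is done

    def liveness(self):
        # If this code runs at all, the process is alive.
        return 200, "alive"

    def readiness(self):
        # Ready only when dependencies are reachable; 503 tells the
        # load balancer or orchestrator to keep traffic away for now.
        if self.dependencies_ready:
            return 200, "ready"
        return 503, "not ready"


STATE = HealthState()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        routes = {"/healthz": STATE.liveness, "/readyz": STATE.readiness}
        check = routes.get(self.path)
        status, body = check() if check else (404, "not found")
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocking helper: start the health endpoint server."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Keeping the two checks separate matters: an instance that is alive but still warming up should be left running and simply receive no traffic, whereas a hung process should be restarted.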
5. Redundancy
TL;DR: No SPOFs. Multi-AZ or Multi-Region deployments for resilience.
👉 Example: Primary DB in us-east-1, replica in us-west-1.
👉 Interview tie-in: "How do you handle data center failure?" — Multi-region replication, failover mechanisms.
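The failover idea can be sketched as a selection rule over database nodes. This is a toy model under stated assumptions: `Database`, `pick_writer`, and the node names are hypothetical, and the `healthy` flag stands in for real health checks. Actual failover also has to deal with replication lag and fencing the old primary to avoid split-brain.

```python
from dataclasses import dataclass


@dataclass
class Database:
    name: str
    region: str
    role: str            # "primary" or "replica"
    healthy: bool = True


def pick_writer(nodes):
    """Return the healthy primary, or promote the first healthy replica."""
    for node in nodes:
        if node.role == "primary" and node.healthy:
            return node
    for node in nodes:
        if node.role == "replica" and node.healthy:
            # Demote the failed primary so there is never more than one writer.
            for other in nodes:
                if other.role == "primary":
                    other.role = "replica"
            node.role = "primary"
            return node
    raise RuntimeError("no healthy database node available")


nodes = [
    Database("orders-db-a", "us-east-1", "primary"),
    Database("orders-db-b", "us-west-1", "replica"),
]
nodes[0].healthy = False          # simulate losing us-east-1
writer = pick_writer(nodes)
print(writer.name, writer.region, writer.role)
```

With asynchronous cross-region replication, the promoted replica may be missing the most recent writes, which is the trade-off to call out when discussing recovery point objectives.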
✅ Takeaways
- Design auth flows with security and scalability in mind
- Build resilience using circuit breakers, timeouts, retries
- Implement observability: logs, metrics, traces, SLI/SLO
- Health checks prevent serving requests from unhealthy instances
- Use redundancy to avoid SPOFs and enable disaster recovery
💡 Practice Question:
"Design an auth system for an API gateway in a microservices architecture. How do you enforce per-service access control and trace requests end-to-end?"