KALPESH
End-To-End DevOps + AIOps Project- 1

Why System Design Matters for DevOps

1. Distributed Systems

A distributed system splits workloads across multiple machines. Instead of one powerful server doing everything, many smaller services collaborate — each handling a piece of the work.

Why it matters: No single point of failure. If one node goes down, others keep running. This is the foundation of all modern cloud architecture.


2. Monolith vs Microservices

         MONOLITH                        MICROSERVICES
  ┌──────────────────────┐        ┌────────┐  ┌────────┐
  │                      │        │  Cart  │  │ Orders │
  │  UI + Auth + Cart +  │        └────┬───┘  └────┬───┘
  │  Orders + Payments   │             │            │
  │  + Notifications...  │        ┌────┴───┐  ┌────┴──────┐
  │                      │        │Payments│  │  Notifs   │
  └──────────────────────┘        └────────┘  └───────────┘
    One giant deployable             Each service deploys
    unit — scale all or nothing      & scales independently
| | Monolith | Microservices |
|---|---|---|
| Deploy | One unit | Independent services |
| Scale | Whole app | Per service |
| Failure | One bug = full outage | Isolated failures |
| Best for | Small teams, MVPs | Large, evolving systems |

3. API Communication

Services talk to each other via APIs. Three key patterns:

  • REST — Stateless HTTP calls, great for client-server communication
  • gRPC — High-performance binary protocol, ideal for internal service-to-service calls
  • Event-driven (Kafka/SQS) — Async messaging that decouples services and absorbs traffic spikes

> Rule of thumb: Use REST for external APIs, gRPC for internal performance-critical calls, and events for async workflows.
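The event-driven pattern is the least intuitive of the three, so here is a minimal sketch in plain Python. An in-memory `queue.Queue` stands in for a real broker (a Kafka topic or SQS queue), and the service names are hypothetical:

```python
import queue

# An in-memory queue stands in for a real broker (Kafka topic / SQS queue).
events = queue.Queue()

def order_service_publish(order_id: int) -> None:
    """Producer: emits an event and moves on without waiting for consumers."""
    events.put({"type": "order.created", "order_id": order_id})

def notification_service_consume() -> list:
    """Consumer: drains events at its own pace, fully decoupled from the producer."""
    handled = []
    while not events.empty():
        handled.append(events.get())
    return handled

# A burst of orders is absorbed by the queue, then processed asynchronously.
for oid in range(3):
    order_service_publish(oid)
processed = notification_service_consume()
```

The key property: the producer never blocks on the consumer, which is exactly how a broker absorbs traffic spikes.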

4. Service Discovery

  ┌─────────────┐     "Where is cart-service?"     ┌──────────────────┐
  │  Checkout   │ ────────────────────────────────► │  Service Registry│
  │  Service    │ ◄────────────────────────────────  │  (CoreDNS)       │
  └─────────────┘     "cart-service:3000"           └──────────────────┘
         │                                                    ▲
         │  connects to                              registers │
         ▼                                                    │
  ┌─────────────┐                                   ┌──────────────────┐
  │    Cart     │ ─────────────────────────────────► │  cart-service    │
  │   Service   │                                   │  pod (dynamic IP)│
  └─────────────┘                                   └──────────────────┘

When services scale dynamically, hardcoded IPs break. Service discovery lets services find each other automatically.

In Kubernetes, this happens natively via CoreDNS — every service gets a stable DNS name regardless of how many pods are running or where they live.
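The mechanism is ordinary DNS resolution at call time. A rough sketch (using `localhost` for the demo, since a name like `cart-service` only resolves inside a cluster where CoreDNS answers for it):

```python
import socket

def discover(service_dns_name: str, port: int) -> tuple:
    """Resolve a service's stable DNS name to a current IP at call time.
    Inside Kubernetes, CoreDNS answers for names like
    'cart-service.default.svc.cluster.local', so pods never hardcode IPs."""
    ip = socket.gethostbyname(service_dns_name)
    return ip, port

# Outside a cluster we can only demo with localhost; inside one you would
# pass the service name (e.g. "cart-service") instead.
addr = discover("localhost", 3000)
```

Because resolution happens per call (or per connection), pods can come and go with new IPs and callers never notice.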


5. Load Balancing

          Incoming Traffic
               │
               ▼
   ┌───────────────────────┐
   │      Load Balancer    │
   │  (AWS ALB / Ingress)  │
   └───┬───────┬───────┬───┘
       │       │       │
       ▼       ▼       ▼
  ┌────────┐  ┌────────┐  ┌────────┐
  │  Pod 1 │  │  Pod 2 │  │  Pod 3 │
  └────────┘  └────────┘  └────────┘
       Layer 4: routes by IP/port
       Layer 7: routes by path/headers

Load balancers distribute traffic across instances so no single server gets overwhelmed.

  • Layer 4 — Routes by IP/port (fast, lower overhead)
  • Layer 7 — Routes by HTTP path, headers, or cookies (smart, flexible)

On AWS EKS, the AWS Load Balancer Controller + Kubernetes Ingress handles this automatically.
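The simplest distribution strategy is round-robin, which most balancers offer as a default. A toy sketch (pod names are hypothetical; real L4/L7 balancers layer health checks, connection draining, and at L7 path/header rules on top of this):

```python
from itertools import cycle

# Hypothetical backend pods behind the load balancer.
pods = ["pod-1", "pod-2", "pod-3"]
rr = cycle(pods)  # round-robin: each request goes to the next pod in turn

def route(request_id: int) -> str:
    """Pick the next backend for this request."""
    return next(rr)

assignments = [route(i) for i in range(6)]
```

Six requests spread evenly: each pod serves exactly two, so no single server gets overwhelmed.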

6. High Availability

HA means the system stays up even when parts of it fail. Key techniques:

  • Multi-AZ deployments — Spread workloads across Availability Zones
  • Replication — Keep multiple copies of data and services
  • Circuit breakers — Stop cascading failures between services
  • Kubernetes self-healing — Failed pods restart automatically
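Circuit breakers are the least obvious technique on that list, so here is a minimal sketch of the idea (a real implementation, e.g. in a service mesh, adds timeouts and a half-open recovery state):

```python
class CircuitBreaker:
    """Minimal circuit-breaker sketch: after `threshold` consecutive
    failures the circuit opens and calls fail fast instead of piling up
    behind a dead downstream service."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
            self.failures = 0  # any success resets the counter
            return result
        except Exception:
            self.failures += 1
            raise

breaker = CircuitBreaker(threshold=2)

def flaky():
    raise ConnectionError("downstream timeout")

for _ in range(2):            # two failures trip the breaker
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

try:
    breaker.call(flaky)
    state = "closed"
except RuntimeError:
    state = "open"            # breaker now fails fast, protecting callers
```

Failing fast is what stops the cascade: callers get an immediate error instead of queuing up threads and connections against a service that cannot answer.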

7. Autoscaling

  CPU: 80% 🔺 (threshold: 70%)
         │
         ▼
  ┌─────────────┐     scale out      ┌──────────────────────────┐
  │     HPA     │ ─────────────────► │  Pod 1 │ Pod 2 │ Pod 3   │
  │ (autoscaler)│                    │        + Pod 4 + Pod 5   │
  └─────────────┘                    └──────────────────────────┘

  CPU: 20% 🔻 (below threshold)
         │
         ▼
  ┌─────────────┐     scale in       ┌────────────────┐
  │     HPA     │ ─────────────────► │ Pod 1 │ Pod 2  │
  └─────────────┘                    └────────────────┘

Manual scaling doesn't work in production. Kubernetes offers multiple autoscaling tools:

| Tool | What it does |
|---|---|
| HPA | Scales pod count based on CPU/memory |
| VPA | Adjusts resource requests per pod |
| KEDA | Event-driven scaling (e.g., queue depth) |
| Cluster Autoscaler | Adds/removes nodes from the cluster |
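The HPA's core scaling rule, as documented by Kubernetes, is a single formula: desired = ceil(currentReplicas × currentMetric / targetMetric). A quick sketch using the CPU numbers from the diagram above (the replica counts here follow the formula, not the diagram's illustrative pod counts):

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """Kubernetes HPA scaling rule:
    desired = ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# CPU at 80% against a 70% target with 3 pods -> scale out
scale_out = hpa_desired_replicas(3, 80, 70)   # 4 replicas
# CPU at 20% against the same target -> scale in
scale_in = hpa_desired_replicas(3, 20, 70)    # 1 replica
```

In practice min/max replica bounds and stabilization windows damp this, so the cluster doesn't flap between sizes on every metric sample.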

8. Reliability with Kubernetes

Kubernetes has built-in reliability primitives every DevOps engineer should know:

  • Liveness probes — Restart containers that hang or crash
  • Readiness probes — Remove unhealthy pods from the load balancer
  • Pod Disruption Budgets — Guarantee minimum replicas during rolling updates
  • Resource quotas — Prevent one service from starving others
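Liveness and readiness are easy to conflate, so here is a sketch of the two semantics as a hypothetical app's health endpoints would implement them:

```python
class Health:
    """Sketch of the two probe semantics: liveness answers 'is the process
    stuck?' while readiness answers 'can it take traffic right now?'."""
    def __init__(self):
        self.deadlocked = False   # hypothetical internal state flags
        self.warmed_up = False

    def liveness(self) -> int:
        # Repeated non-200s here make the kubelet restart the container.
        return 200 if not self.deadlocked else 500

    def readiness(self) -> int:
        # Non-200 here removes the pod from Service endpoints -- no restart.
        return 200 if self.warmed_up else 503

h = Health()
before = h.readiness()   # 503: still warming up, kept out of rotation
h.warmed_up = True
after = h.readiness()    # 200: now added behind the load balancer
```

The design point: failing readiness is cheap and reversible, while failing liveness is destructive, so slow startup or a full cache should never be reported through the liveness probe.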

9. Security by Design

Security isn't an afterthought — it's architecture. Core principles:

  • Least privilege — IAM roles + Kubernetes RBAC limit what each service can do
  • Network policies — Restrict pod-to-pod traffic
  • Secrets management — AWS Secrets Manager or Vault (never hardcode credentials)
  • Image scanning — Tools like Trivy scan containers before they deploy
  • mTLS — Encrypt all service-to-service traffic (via Istio or another service mesh)

10. Observability

  Your System
  ┌──────────────────────────────────────────────────┐
  │  Microservice A  ──►  Microservice B  ──►  DB    │
  └──────────┬──────────────────┬────────────────────┘
             │                  │
      ┌──────▼──────┐   ┌───────▼────────┐   ┌──────────────┐
      │    LOGS     │   │    METRICS     │   │   TRACES     │
      │  (what      │   │  (how much /   │   │  (where did  │
      │  happened)  │   │   how fast)    │   │  it go?)     │
      │  Loki /     │   │  Prometheus /  │   │  Jaeger /    │
      │  CloudWatch │   │  Grafana       │   │  X-Ray       │
      └─────────────┘   └────────────────┘   └──────────────┘
                  └──────────────┬──────────────┘
                                 ▼
                         AIOps Dashboard
                    (Anomaly Detection + Alerts)

You can't fix what you can't see. Observability is built on three pillars:

| Pillar | Tools | Purpose |
|---|---|---|
| Logs | Fluentd, CloudWatch, Loki | Detailed event records |
| Metrics | Prometheus, Grafana | Time-series measurements |
| Traces | Jaeger, AWS X-Ray | Request flow across services |
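One request should feed all three pillars. A stdlib-only sketch (event name and service name are hypothetical; real setups would ship these to the tools above rather than stdout):

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

def handle_request(trace_id: str) -> dict:
    """One request emits all three signals: a structured log line,
    a latency sample, and a trace id that follows the request downstream."""
    start = time.perf_counter()
    # ... business logic would run here ...
    latency_ms = (time.perf_counter() - start) * 1000
    record = {
        "event": "checkout.completed",          # LOG: what happened
        "latency_ms": round(latency_ms, 2),     # METRIC: how fast
        "trace_id": trace_id,                   # TRACE: where did it go
    }
    log.info(json.dumps(record))
    return record

rec = handle_request(trace_id=str(uuid.uuid4()))
```

Emitting structured JSON (rather than free-form text) is what lets a log pipeline like Loki or CloudWatch index these fields, and the shared `trace_id` is what stitches one request's log lines together across services.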

11. Deployment Strategies

  ROLLING UPDATE          BLUE / GREEN              CANARY
  ┌───┬───┬───┐          ┌───────────┐           ┌────────────┐
  │v1 │v1 │v1 │  step 1  │  BLUE(v1) │◄─ 100%    │   v1       │◄─ 90%
  └───┴───┴───┘   ──►    └───────────┘            └────────────┘
  ┌───┬───┬───┐          ┌───────────┐           ┌────────────┐
  │v2 │v1 │v1 │  step 2  │ GREEN(v2) │◄─  0%     │   v2       │◄─ 10%
  └───┴───┴───┘          └───────────┘  switch!   └────────────┘
  ┌───┬───┬───┐               ↕                    gradually shift
  │v2 │v2 │v2 │  done    flip traffic              to 100% if ok
  └───┴───┴───┘

Deploying safely means choosing the right strategy:

  • Rolling update — Gradually replace old pods with new ones (Kubernetes default)
  • Blue/Green — Two identical environments; switch traffic instantly with zero downtime
  • Canary — Route a small % of traffic to the new version first, then roll out fully
  • Feature flags — Toggle features without redeploying
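A canary split is just weighted routing. A quick simulation (in production the weights live in the router, e.g. Ingress or service-mesh rules, not in application code):

```python
import random

def canary_route(weight_v2: float = 0.10) -> str:
    """Send ~10% of requests to v2; the rest stay on v1.
    A real router applies the same weight rule per request."""
    return "v2" if random.random() < weight_v2 else "v1"

random.seed(42)  # deterministic for the demo
sample = [canary_route() for _ in range(1000)]
v2_share = sample.count("v2") / len(sample)
```

Over 1000 simulated requests the v2 share lands close to 10%, which is the blast radius you accept while watching the new version's error rate before shifting weight toward 100%.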

12. GitOps

  Developer
      │
      │  git push / pull request
      ▼
  ┌──────────────┐
  │   Git Repo   │  ◄── single source of truth
  │  (GitHub)    │
  └──────┬───────┘
         │  watches for changes
         ▼
  ┌──────────────┐
  │   ArgoCD /   │  detects drift between
  │    Flux      │  Git state ↔ cluster state
  └──────┬───────┘
         │  syncs automatically
         ▼
  ┌──────────────────────┐
  │   Kubernetes Cluster │
  │   (AWS EKS)          │
  └──────────────────────┘
  Rollback = revert a commit ↩

GitOps makes Git the single source of truth for infrastructure and application state.

  • All changes go through pull requests — reviewed, audited, version-controlled
  • An operator like ArgoCD or Flux continuously syncs the cluster to match what's in Git
  • Rollback = revert a commit

Benefits: Full audit trail, consistent environments, and deployments that are always reproducible.
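At its core the operator's drift detection is a diff between two states. A toy sketch with hypothetical service specs (ArgoCD and Flux of course diff full Kubernetes manifests, not flat dicts):

```python
# Hypothetical desired state (from Git) vs live state (from the cluster).
desired = {"cart":   {"image": "cart:v2",   "replicas": 3},
           "orders": {"image": "orders:v1", "replicas": 2}}
live =    {"cart":   {"image": "cart:v1",   "replicas": 3},
           "orders": {"image": "orders:v1", "replicas": 2}}

def detect_drift(desired: dict, live: dict) -> dict:
    """Core GitOps loop: diff Git state against cluster state; anything
    that differs is 'out of sync' and gets re-applied from Git."""
    return {name: spec for name, spec in desired.items()
            if live.get(name) != spec}

drift = detect_drift(desired, live)   # the operator would now sync these
```

Here only `cart` is out of sync (Git says `cart:v2`, the cluster runs `cart:v1`), so only `cart` gets re-applied, and reverting the Git commit would drive the same loop in the other direction.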

The Big Picture: How It All Connects

Developer pushes code
        ↓
  GitHub (GitOps)
        ↓
  CI/CD Pipeline
        ↓
  AWS EKS Cluster
  ┌─────────────────────────────┐
  │  Microservices (Kubernetes) │
  │  - Cart  - Orders           │
  │  - Checkout  - Payments     │
  └─────────────────────────────┘
        ↓
  Observability Stack
  (Prometheus + Grafana + Loki)
        ↓
  AIOps Layer
  (Anomaly Detection + Auto-remediation)

Key Takeaways

  • Distributed systems and microservices enable independent scaling and fault isolation
  • Kubernetes provides built-in resilience, self-healing, and safe deployments
  • Security and observability must be designed in — not added later
  • GitOps brings auditability and consistency to infrastructure changes
  • AIOps closes the loop: observability data drives intelligent automation
