DEV Community

S, Sanjay

I Built a 130+ File Kubernetes Repository — Here's Everything I Learned About Container Orchestration

There's a moment every engineer hits. You've deployed containers manually, you've docker run -d'd your way through development, and then someone says: "We need this in production. With auto-scaling. And zero downtime. And it needs to self-heal. Oh, and make it secure."

That's the moment you need Kubernetes.

But here's the problem — Kubernetes is enormous. The official documentation alone is thousands of pages. YouTube tutorials teach you pieces but never the whole picture. Blog posts cover single topics but miss how everything connects.

So I built something different: a 53-module, 130+ file repository that takes you from kubectl get pods to running production-grade, multi-cluster, GitOps-driven infrastructure — with every concept explained, every YAML annotated, and every decision justified.

This blog is a condensed tour of what's inside, what I learned building it, and the mental models that actually make Kubernetes click.


🧠 The Mental Model That Changes Everything

Before diving into commands and YAML, here's the single most important thing to understand about Kubernetes:

Kubernetes is a declarative state reconciliation engine.

You don't tell Kubernetes what to do. You tell it what you want, and it figures out how to get there. This is fundamentally different from writing shell scripts or Ansible playbooks.

Traditional (Imperative):          Kubernetes (Declarative):
─────────────────────────          ─────────────────────────
"Start 3 nginx containers"        "I want 3 nginx pods running"
"If one dies, start another"       ← K8s does this automatically
"Put them behind a load balancer"  "I want a Service of type LB"
"Update to v2 one at a time"      "Change image tag to v2"
                                    ← K8s rolls out gradually

Every controller in Kubernetes runs an infinite loop:

while true:
    desired = read_from_etcd()      # What the user declared
    actual  = observe_cluster()     # What's actually running
    if actual != desired:
        take_action(actual, desired) # Reconcile the difference
    sleep(interval)

Once this clicks, everything else — Deployments, StatefulSets, Operators, GitOps — is just variations of the same pattern.
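
Here is that loop as a runnable toy, with an in-memory dict standing in for etcd and replica counts standing in for real pods (all names are illustrative):

```python
# Toy reconciler: an in-memory dict stands in for etcd, and replica
# counts stand in for real pods. All names here are illustrative.

def reconcile(desired: dict, actual: dict) -> dict:
    """One pass of the control loop: nudge `actual` toward `desired`."""
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have < want:
            actual[name] = have + 1   # start one pod per pass (gradual, like a rollout)
        elif have > want:
            actual[name] = have - 1   # scale down one at a time
    for name in list(actual):         # garbage-collect workloads no longer declared
        if name not in desired:
            del actual[name]
    return actual

desired = {"nginx": 3}    # what the user declared
actual = {}               # what's "running"
while actual != desired:  # the controller's loop, here stopping once converged
    actual = reconcile(desired, actual)

print(actual)  # → {'nginx': 3}
```

Real controllers never exit: they watch the API Server and re-run this pass forever, which is why deleting a pod by hand just makes a new one appear.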


🏗️ Architecture: The 10,000-Foot View

Every Kubernetes cluster has two planes:

┌─────────────────────────────────────────────────────────────────┐
│                    KUBERNETES CLUSTER                            │
│                                                                  │
│  ┌────────────────────────────┐  ┌───────────────────────────┐  │
│  │     CONTROL PLANE          │  │      WORKER NODE           │  │
│  │                            │  │                            │  │
│  │  API Server ← the ONLY    │  │  kubelet ← talks to API   │  │
│  │    entry point for ALL     │  │    Server, manages pods    │  │
│  │    cluster operations      │  │                            │  │
│  │                            │  │  kube-proxy ← manages     │  │
│  │  etcd ← the "brain"       │  │    network rules for       │  │
│  │    stores ALL cluster      │  │    Service routing         │  │
│  │    state as key-value      │  │                            │  │
│  │    pairs                   │  │  Container Runtime         │  │
│  │                            │  │    (containerd) ← runs     │  │
│  │  Scheduler ← decides      │  │    actual containers       │  │
│  │    WHERE pods run          │  │                            │  │
│  │                            │  │  ┌──────┐ ┌──────┐        │  │
│  │  Controller Manager ←     │  │  │ Pod  │ │ Pod  │        │  │
│  │    runs reconciliation    │  │  │  A   │ │  B   │        │  │
│  │    loops                   │  │  └──────┘ └──────┘        │  │
│  └────────────────────────────┘  └───────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

The one thing people get wrong: They think kubectl talks directly to nodes. It doesn't. Every single command goes through the API Server, which authenticates you, authorizes the request, runs admission controllers, and then stores the result in etcd. The kubelet on each node watches etcd (via the API Server) and makes reality match the desired state.

etcd: The Most Important Component Nobody Talks About

If your API Server goes down, you can't make changes — but existing workloads keep running. If etcd goes down and you have no backup, your entire cluster state is gone. Every pod definition, every secret, every service configuration — everything.

# etcd backup — do this or regret it later
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

Production rule: Back up etcd every hour. Test the restore process quarterly. Store backups off-cluster in encrypted object storage.


📦 Pods: The Atomic Unit (But You'll Rarely Create Them Directly)

A Pod is the smallest deployable unit — but here's what beginners miss: you almost never create Pods directly. You create Deployments (which create ReplicaSets, which create Pods). Direct Pod creation means no self-healing, no scaling, and no rolling updates.

That said, understanding Pods is critical because everything builds on them:

apiVersion: v1
kind: Pod
metadata:
  name: production-pod
  labels:
    app: api-server
spec:
  # Init containers run FIRST, in order, one at a time.
  # The main containers don't start until ALL init containers succeed.
  initContainers:
    - name: wait-for-database
      image: busybox:1.36
      command: ['sh', '-c', 
        'until nc -z db-service 5432; do echo waiting for db; sleep 2; done']
      # WHY: Prevents the app from starting before its database is ready.
      # Without this, you'd get connection errors on startup.

  containers:
    - name: api
      image: myapp:v2.1
      ports:
        - containerPort: 8080

      # Resource requests = SCHEDULER uses these to pick a node
      # Resource limits = KUBELET enforces these at runtime  
      resources:
        requests:           # "I need at least this much"
          cpu: 100m         # 100 millicores = 0.1 CPU core
          memory: 128Mi     # 128 mebibytes
        limits:             # "Never let me use more than this"
          cpu: 500m
          memory: 512Mi     # Exceed this → OOMKilled

      # Probes tell Kubernetes about your app's health
      livenessProbe:        # "Is my app alive?" — fails → container restart
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10

      readinessProbe:       # "Can my app serve traffic?" — fails → removed from Service
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5

      startupProbe:         # "Has my app finished starting?" — fails → container restart
        httpGet:            # Disables liveness/readiness until it passes
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 10   # Gives up to 300s (30×10) for slow-starting apps

The Multi-Container Patterns You'll Actually Use

Sidecar Pattern:                    Ambassador Pattern:
┌──────────────────────┐            ┌──────────────────────┐
│        Pod           │            │        Pod           │
│ ┌──────┐  ┌────────┐ │            │ ┌──────┐  ┌────────┐ │
│ │ App  │  │Log     │ │            │ │ App  │  │ Proxy  │ │
│ │      │→ │Shipper │ │            │ │      │→ │(to DB) │ │
│ └──────┘  └────────┘ │            │ └──────┘  └────────┘ │
│   writes    reads    │            │ localhost   handles   │
│   to shared volume   │            │ :5432       auth/TLS  │
└──────────────────────┘            └──────────────────────┘

Adapter Pattern:
┌───────────────────────┐
│          Pod          │
│ ┌────────┐ ┌────────┐ │
│ │  App   │ │Adapter │ │    → Prometheus
│ │(custom │→│(format │ │      scrapes
│ │metrics)│ │convert)│ │      /metrics
│ └────────┘ └────────┘ │
└───────────────────────┘

🔄 Deployments: Where the Magic Happens

Deployments are the workhorse of Kubernetes. Here's what happens when you update an image:

kubectl set image deployment/web nginx=nginx:1.26

Timeline:
─────────────────────────────────────────────────────────

t=0s   [v1] [v1] [v1]                    3 old pods
t=5s   [v1] [v1] [v1] [v2]               1 new pod starting
t=15s  [v1] [v1] [v2✓] ← ready           new pod passes readiness
t=16s  [v1] [v1] [v2✓]                   1 old pod terminating
t=30s  [v1] [v2✓] [v2✓]                  2nd new pod ready
t=45s  [v2✓] [v2✓] [v2✓]                 rollout complete!

And if something goes wrong:

# Instant rollback — Kubernetes keeps revision history
kubectl rollout undo deployment/web

# Check rollout status
kubectl rollout status deployment/web

# See revision history
kubectl rollout history deployment/web

The two deployment strategies you need to know:

| Strategy | How it Works | When to Use |
|---|---|---|
| RollingUpdate | Gradually replaces old pods with new ones | Default. Works for 95% of cases |
| Recreate | Kills ALL old pods, then starts new ones | When old & new versions can't coexist (DB schema changes) |
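
For reference, the pace of a rolling update is tuned with two knobs on the Deployment spec; a minimal sketch (names and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most 1 extra pod during the rollout
      maxUnavailable: 0    # never drop below 3 ready pods: true zero downtime
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
```

With `maxUnavailable: 0`, Kubernetes only terminates an old pod once its replacement passes its readiness probe, which is exactly the timeline shown above.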

🌐 Networking: The Part Everyone Struggles With

Kubernetes networking has three fundamental rules:

  1. Every pod gets its own IP address
  2. Pods can communicate with any other pod without NAT
  3. Agents on a node can communicate with all pods on that node

Here's what actually happens when Pod A talks to a Service:

Pod A (10.244.1.5)
    │
    │ DNS lookup: "api-service" → 10.96.45.12 (ClusterIP)
    ▼
kube-proxy (iptables/IPVS rules on the node)
    │
    │ Load balances to one of the endpoint pods
    ▼
Pod B (10.244.2.8) ← one of the pods behind the Service

Services Demystified

# ClusterIP (default) — internal only
apiVersion: v1
kind: Service
metadata:
  name: api-internal
spec:
  type: ClusterIP        # Only reachable inside the cluster
  selector:
    app: api
  ports:
    - port: 80            # Service port (what clients use)
      targetPort: 8080    # Container port (where your app listens)

---
# NodePort — exposes on every node's IP
apiVersion: v1
kind: Service
metadata:
  name: api-nodeport
spec:
  type: NodePort          # Accessible at <NodeIP>:30080
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080     # Range: 30000-32767

---
# LoadBalancer — cloud provider creates an actual LB
apiVersion: v1
kind: Service
metadata:
  name: api-public
spec:
  type: LoadBalancer      # Cloud provider provisions an LB (AWS ELB/NLB, Azure LB, GCP LB)
  selector:
    app: api
  ports:
    - port: 443
      targetPort: 8080

Network Policies: The Firewall You're Probably Not Using

By default, every pod can talk to every other pod. This is terrifying in production. Network Policies fix this:

# Step 1: Default deny ALL traffic in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: production
spec:
  podSelector: {}         # {} = applies to ALL pods
  policyTypes:
    - Ingress
    - Egress

---
# Step 2: Allow only what's needed
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api            # This policy protects the API pods
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # ONLY frontend pods can reach API
      ports:
        - port: 8080
          protocol: TCP

⚠️ Important: Network Policies require a CNI that supports them. Calico and Cilium work. Flannel does not.
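
One more caveat with a default-deny egress policy like the one above: it also blocks DNS, so pods can no longer resolve Service names. A hedged sketch of the usual companion policy, allowing port 53 to kube-dns:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}           # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
```

The `kubernetes.io/metadata.name` label is set automatically on every namespace in recent Kubernetes versions; on older clusters you may need to label kube-system yourself.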


🔒 Security: The Chapter You Can't Skip

RBAC (Role-Based Access Control) is the gatekeeper of Kubernetes. Every API request goes through:

Request → Authentication → Authorization (RBAC) → Admission Control → etcd
          "Who are you?"   "Can you do this?"      "Should you do this?"

Here's RBAC in practice:

# Step 1: Define what actions are allowed
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: development
rules:
  - apiGroups: [""]           # "" = core API group (pods, services, etc.)
    resources: ["pods"]
    verbs: ["get", "list", "watch"]   # Read-only access

---
# Step 2: Bind the role to a user/group/service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods-binding
  namespace: development
subjects:
  - kind: User
    name: jane@company.com
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

The golden rule: Start with zero permissions and add what's needed. Never use cluster-admin for workloads.
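
In-cluster workloads authenticate as ServiceAccounts rather than Users; binding the same pod-reader Role to one is a small variation (the account name here is illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-reader
  namespace: development
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-read-pods
  namespace: development
subjects:
  - kind: ServiceAccount
    name: ci-reader
    namespace: development
roleRef:
  kind: Role
  name: pod-reader          # the Role defined above
  apiGroup: rbac.authorization.k8s.io
```

You can verify a grant without switching credentials: `kubectl auth can-i get pods --as=system:serviceaccount:development:ci-reader -n development`.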

Pod Security: Defense in Depth

# A hardened container — this passes the "restricted" Pod Security Standard
securityContext:
  runAsNonRoot: true          # Don't run as root
  runAsUser: 1000             # Specific non-root user
  readOnlyRootFilesystem: true # Prevent writing to container filesystem
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]             # Drop ALL Linux capabilities
  seccompProfile:
    type: RuntimeDefault      # Use default seccomp profile

📊 Observability: You Can't Fix What You Can't See

Production Kubernetes needs three pillars:

┌─────────────────────────────────────────────────────────┐
│              THE THREE PILLARS                           │
│                                                          │
│   📈 METRICS          📝 LOGS           🔗 TRACES       │
│   "What's happening"  "What happened"   "Why is it slow"│
│                                                          │
│   Prometheus          Loki / EFK        Jaeger /         │
│   + Grafana           Fluentd           OpenTelemetry    │
│                                                          │
│   CPU, memory,        Application       Request flow     │
│   request rates,      errors,           across           │
│   error ratios,       audit trails,     microservices    │
│   latency P99         debug output      with timing      │
└─────────────────────────────────────────────────────────┘

The most important Kubernetes metrics to alert on:

| Metric | Alert Threshold | Why |
|---|---|---|
| kube_pod_container_status_restarts_total | > 5 in 1h | CrashLoopBackOff — something is broken |
| node_memory_MemAvailable_bytes | < 10% | Node is about to OOM-kill pods |
| kube_deployment_status_replicas_unavailable | > 0 for 5m | Deployment health issue |
| kubelet_volume_stats_available_bytes | < 15% | PVC running out of disk space |
| etcd_server_has_leader | == 0 | Critical — cluster may be headless |
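
As one concrete sketch, the first row of the table above expressed as a Prometheus alerting rule (the group name, severity label, and 5m hold are assumptions to tune):

```yaml
groups:
  - name: kubernetes-workloads
    rules:
      - alert: PodRestartingFrequently
        # kube-state-metrics exposes a monotonic restart counter;
        # increase() turns it into "restarts in the last hour"
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 5 times in 1h"
```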

⚡ Autoscaling: The Three Dimensions

Kubernetes can scale in three ways, and they work best together:

                    ┌─────────────────┐
                    │ Cluster          │
                    │ Autoscaler       │
                    │ (more NODES)     │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              │              ▼
    ┌─────────────┐          │    ┌──────────────┐
    │     HPA     │          │    │     VPA      │
    │ (more PODS) │          │    │ (bigger PODS)│
    └─────────────┘          │    └──────────────┘
                             │
                    ┌────────┴────────┐
                    │      KEDA       │
                    │ Event-driven    │ ← Kafka lag,
                    │ (custom         │   SQS depth,
                    │  triggers)      │   cron schedules
                    └─────────────────┘
# HPA: Scale based on CPU + custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:                         # to prevent flapping
        - type: Percent
          value: 25                    # Remove max 25% of pods per period
          periodSeconds: 120
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70       # Target 70% CPU usage

Pro tip: Never use HPA and VPA on the same metric (e.g., both on CPU). They'll fight each other. Use HPA for CPU/memory and VPA for right-sizing requests on separate workloads.


🚀 GitOps: The Deployment Model That Changed Everything

GitOps is the idea that your Git repository is the single source of truth for your entire infrastructure. No more kubectl apply from laptops. No more "who changed that in production?"

Traditional CI/CD:                    GitOps:
──────────────────                    ──────
CI builds image                       CI builds image
CI pushes to registry                 CI pushes to registry
CI runs kubectl apply ← PUSH model    CI updates Git manifest ← PR + review
                                      ArgoCD detects change   ← PULL model
                                      ArgoCD syncs to cluster
                                      ArgoCD detects drift & self-heals

With ArgoCD, deploying to Kubernetes looks like this:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/k8s-manifests.git
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true        # Delete resources removed from Git
      selfHeal: true     # Revert manual changes to match Git
    syncOptions:
      - CreateNamespace=true

Every change goes through a Pull Request. Every PR gets reviewed. Every merge is an audit log entry. If something breaks, git revert is your rollback.


🔥 Service Mesh: When Microservices Get Serious

Imagine 20 microservices. Each needs encrypted communication, retries, circuit breaking, and observability. Without a service mesh, you'd bake all that logic into every service in every language.

A service mesh moves that logic to the infrastructure:

WITHOUT Mesh:                         WITH Mesh (Istio):
┌───────────┐                         ┌──────────────────┐
│ Service A │                         │      Pod A       │
│ (has TLS  │   HTTP (unencrypted)    │ ┌──────┐┌──────┐ │  mTLS
│  code,    │  ──────────────────→    │ │ App  ││Envoy │ │ ════════→
│  retry    │                         │ │(just ││proxy │ │  automatic
│  logic)   │                         │ │ biz  ││      │ │  encryption,
└───────────┘                         │ │logic)││      │ │  retries,
                                      │ └──────┘└──────┘ │  metrics
                                      └──────────────────┘

Istio gives you traffic management superpowers:

# Canary deployment: Send 5% of traffic to v2
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 95
        - destination:
            host: reviews
            subset: v2
          weight: 5
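
Note that the v1 and v2 subsets referenced above are not defined by the VirtualService itself; they come from a companion DestinationRule that maps subset names to pod labels. A sketch:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
    - name: v1
      labels:
        version: v1        # routes to pods labeled version=v1
    - name: v2
      labels:
        version: v2        # routes to pods labeled version=v2
```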

When to use a service mesh:

  • ✅ 10+ microservices that need mTLS
  • ✅ Complex traffic routing (canary, A/B, fault injection)
  • ✅ Need L7 observability without code changes
  • ❌ Monolith or < 5 services (overhead not justified)
  • ❌ Performance-critical workloads with sub-millisecond latency requirements

💰 Cost Optimization: The Chapter That Pays for Itself

Most Kubernetes clusters waste 30-50% of their compute budget. Here's why and how to fix it:

Typical Cluster Resource Usage:

Requested:      ████████████████████████████░░░░░░░░░░░░  70%
Actually used:  ██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░  35%

                ←  35% WASTE (you're paying for this)  →

The Quick Wins

1. Right-size your requests:

# See actual vs requested resources
kubectl top pods -n production --sort-by=cpu

# What you'll find:
# NAME          CPU(cores)   MEMORY(bytes)
# api-server    50m          120Mi        ← requests: 500m/512Mi = 10× over-provisioned!

2. Use Spot/Preemptible instances for fault-tolerant workloads:

# Node affinity for spot instances
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/lifecycle   # label key varies by cloud/provisioner
              operator: In
              values: ["spot"]

3. Set LimitRanges so no developer accidentally requests 32Gi of memory:

apiVersion: v1
kind: LimitRange
metadata:
  name: sensible-defaults
  namespace: development
spec:
  limits:
    - type: Container
      default:          # Applied if no limits specified
        cpu: 500m
        memory: 256Mi
      defaultRequest:   # Applied if no requests specified
        cpu: 100m
        memory: 128Mi
      max:              # Hard ceiling
        cpu: "2"
        memory: 2Gi
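
A LimitRange caps individual containers; to cap a namespace as a whole, pair it with a ResourceQuota (all numbers here are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: development
spec:
  hard:
    requests.cpu: "10"       # sum of all CPU requests in the namespace
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"               # hard cap on pod count
```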

4. Schedule idle workloads to zero with KEDA:

  • Dev environments that scale to 0 pods outside business hours
  • Queue processors that scale to 0 when no messages exist
  • Staging clusters that shut down overnight
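
The first bullet, scaling dev environments to zero outside business hours, looks roughly like this with KEDA's cron scaler (the schedule, timezone, and target name are assumptions):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: dev-office-hours
  namespace: development
spec:
  scaleTargetRef:
    name: web-app            # Deployment to scale
  minReplicaCount: 0         # fully off outside the window
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: 0 8 * * 1-5   # scale up at 08:00, Mon-Fri
        end: 0 18 * * 1-5    # scale down at 18:00
        desiredReplicas: "2"
```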

Real savings: Teams implementing these four practices typically see 25-40% cost reduction within the first month.


🔧 Troubleshooting: The Skill That Separates Juniors from Seniors

When something breaks — and it will — follow this framework:

OBSERVE → DESCRIBE → LOGS → EVENTS → EXEC → NETWORK → RESOURCES → CONTROL PLANE

Here's the debugging cheat sheet I keep open at all times:

# Pod stuck in Pending?
kubectl describe pod <name> -n <ns>
# Look at Events → "Insufficient cpu/memory" = need bigger nodes or lower requests
# Look at Events → "no nodes match" = check nodeSelector/affinity/taints

# CrashLoopBackOff?
kubectl logs <name> -n <ns> --previous    # Logs from the CRASHED container
# Common causes: missing env vars, wrong command, config file not found

# ImagePullBackOff?
kubectl describe pod <name> -n <ns>
# Check: image name typo? Private registry auth? Image tag exists?

# Service not working?
kubectl get endpoints <service-name> -n <ns>
# Empty endpoints? → Labels don't match between Service selector and Pod labels

# DNS not resolving?
kubectl exec -it debug-pod -- nslookup my-service.my-namespace.svc.cluster.local
# If this fails → check CoreDNS pods: kubectl get pods -n kube-system -l k8s-app=kube-dns

# OOMKilled?
kubectl describe pod <name> -n <ns> | grep -A5 "Last State"
# Solution: Increase memory limits or fix the memory leak in your app

The Most Common Mistake Table

| Symptom | Rookie Move | What to Actually Do |
|---|---|---|
| Pod CrashLooping | Increase restartPolicy | Read the logs with --previous flag |
| Service has no endpoints | Delete and recreate | Check that selector labels match pod labels exactly |
| Node NotReady | Panic and drain | Check kubelet logs: journalctl -u kubelet -f |
| PVC stuck in Pending | Delete and retry | Check if StorageClass exists and has a provisioner |
| Deployment rollout stuck | Force rollout restart | Check pod events with kubectl describe first |

📋 The Production Readiness Checklist (Condensed)

Before going to production, verify every category:

Cluster Architecture

  • [ ] 3+ control plane nodes (HA)
  • [ ] Nodes spread across availability zones
  • [ ] etcd backup automated (hourly) and tested (quarterly)
  • [ ] CNI deployed (Calico or Cilium, not Flannel for production)

Security

  • [ ] RBAC enabled with least-privilege roles
  • [ ] Pod Security Standards enforced (restricted profile)
  • [ ] Network Policies deny-all + whitelist
  • [ ] Secrets encrypted at rest (--encryption-provider-config)
  • [ ] Container images scanned in CI pipeline
  • [ ] No containers running as root

Reliability

  • [ ] Resource requests AND limits set on every container
  • [ ] Liveness, readiness, and startup probes configured
  • [ ] PodDisruptionBudgets for critical workloads
  • [ ] Pod topology spread constraints for HA
  • [ ] Graceful shutdown handled (preStop hooks, SIGTERM)
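
That last item deserves a concrete sketch: a preStop sleep gives load balancers time to stop routing to the pod before SIGTERM arrives. A Pod spec fragment (the 10s figure is an assumption to tune per environment):

```yaml
spec:
  terminationGracePeriodSeconds: 60     # total budget before SIGKILL
  containers:
    - name: api
      image: myapp:v2.1
      lifecycle:
        preStop:
          exec:
            # Pause before SIGTERM so endpoint removal propagates
            # and in-flight requests drain
            command: ["sh", "-c", "sleep 10"]
```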

Observability

  • [ ] Prometheus + Grafana for metrics
  • [ ] Centralized logging (Loki or EFK)
  • [ ] Alerting rules with escalation paths
  • [ ] SLOs defined for critical services

Operational

  • [ ] GitOps workflow (ArgoCD or Flux)
  • [ ] Cluster upgrade runbook documented and tested
  • [ ] Disaster recovery plan with tested RTO/RPO
  • [ ] Cost monitoring with ResourceQuotas per namespace

🛠️ What's in the Repository

Everything above barely scratches the surface. The full repository is organized into 13 sections, 53 modules, and 3 capstone projects:

| Section | Modules | What You'll Learn |
|---|---|---|
| Foundations | 01-05 | What K8s is, architecture, installation, kubectl, YAML |
| Core Concepts | 06-10 | Pods, Deployments, Services, ConfigMaps, Namespaces |
| Workloads | 11-15 | DaemonSets, StatefulSets, Jobs, Scheduling, Resources |
| Networking | 16-20 | CNI, Ingress, Network Policies, Gateway API, CoreDNS |
| Storage | 21-24 | Volumes, PV/PVC, CSI Drivers, Backup with Velero |
| Security | 25-29 | RBAC, Pod Security, Secrets Management, Supply Chain, Audit |
| Observability | 30-33 | Prometheus, Logging, Tracing, Alerting & SLOs |
| Advanced | 34-40 | Helm, Kustomize, Service Mesh, Autoscaling, Operators, GitOps, Multi-Cluster |
| Cluster Mgmt | 41-44 | kOps, Rancher, kubeadm, Managed K8s (EKS/AKS/GKE) |
| CI/CD | 45-47 | Pipelines, Container Best Practices, Progressive Delivery |
| Production | 48-50 | Production Checklist, Cost Optimization, Disaster Recovery |
| Troubleshooting | 51-53 | Debugging Guide, Cheatsheet, CKA/CKAD/CKS Exam Prep |
| Projects | P1-P3 | E-Commerce Microservices, Monitoring Stack, Multi-Tenant SaaS |

Every module includes:

  • README with concepts explained from first principles
  • ASCII diagrams showing architecture and data flow
  • Annotated YAML files you can apply directly
  • Troubleshooting tables for common issues
  • Hands-on exercises to cement understanding

🎯 Four Learning Paths

Not everyone starts from the same place. Pick your path:

| Path | Duration | Modules | You'll Be Able To |
|---|---|---|---|
| Beginner | 2-3 weeks | 01-10 | Deploy apps, understand core K8s concepts |
| Intermediate | 3-4 weeks | 11-13, 16-17, 21-22, 25, 30-31, 34-35, 44 | Handle real workloads with monitoring & storage |
| Advanced | 4-6 weeks | 14-15, 18-20, 23, 26-27, 36-39, 41-42 | Build production clusters with service mesh & GitOps |
| Expert / CKA+CKS | 2-3 weeks | 28-29, 40, 43, 47-50, 51, 53 + Projects | Enterprise multi-cluster architecture, pass certifications |

🧪 Try It Right Now

# 1. Install Kind (takes 30 seconds)
go install sigs.k8s.io/kind@latest
# OR: brew install kind

# 2. Create a cluster
kind create cluster --name learn-k8s

# 3. Verify
kubectl cluster-info
kubectl get nodes

# 4. Deploy your first app
kubectl create deployment hello --image=nginx:1.25
kubectl expose deployment hello --port=80 --type=NodePort
kubectl get svc hello

# 5. You're running Kubernetes. Now go deeper. 🚀

The One Thing I'd Tell My Past Self

If I could go back and tell myself one thing before starting this Kubernetes journey, it would be:

Stop trying to learn Kubernetes by memorizing YAML.

Instead, understand the why behind every resource:

  • A Deployment exists because you need rollbacks and scaling — a naked Pod gives you neither.
  • A Service exists because Pod IPs are ephemeral — they change on every restart.
  • A PersistentVolumeClaim exists because containers are ephemeral — their filesystem dies with them.
  • RBAC exists because "everyone is admin" doesn't survive the first security audit.
  • Network Policies exist because "all pods can talk to everything" is a lateral movement dream for attackers.

Every Kubernetes concept solves a specific problem. Learn the problem first, and the YAML writes itself.


If this guide helped you, the full repository has 130+ files of this depth. Star it, fork it, and start building.

🔗 GitHub Repository: Kubernetes Mastery — From Zero to Production Hero


What Kubernetes concept gave you the most trouble? Drop it in the comments — I'll explain it like you're five. 👇
