DEV Community

S, Sanjay

I Built a 130+ File Kubernetes Repository — Here's Everything I Learned About Container Orchestration

There's a moment every engineer hits. You've deployed containers manually, you've docker run -d'd your way through development, and then someone says: "We need this in production. With auto-scaling. And zero downtime. And it needs to self-heal. Oh, and make it secure."

That's the moment you need Kubernetes.

But here's the problem — Kubernetes is enormous. The official documentation alone is thousands of pages. YouTube tutorials teach you pieces but never the whole picture. Blog posts cover single topics but miss how everything connects.

So I built something different: a 53-module, 130+ file repository that takes you from kubectl get pods to running production-grade, multi-cluster, GitOps-driven infrastructure — with every concept explained, every YAML annotated, and every decision justified.

This blog is a condensed tour of what's inside, what I learned building it, and the mental models that actually make Kubernetes click.


🧠 The Mental Model That Changes Everything

Before diving into commands and YAML, here's the single most important thing to understand about Kubernetes:

Kubernetes is a declarative state reconciliation engine.

You don't tell Kubernetes what to do. You tell it what you want, and it figures out how to get there. This is fundamentally different from writing shell scripts or Ansible playbooks.

Traditional (Imperative):          Kubernetes (Declarative):
─────────────────────────          ─────────────────────────
"Start 3 nginx containers"        "I want 3 nginx pods running"
"If one dies, start another"       ← K8s does this automatically
"Put them behind a load balancer"  "I want a Service of type LB"
"Update to v2 one at a time"      "Change image tag to v2"
                                    ← K8s rolls out gradually

Every controller in Kubernetes runs an infinite loop:

while true:
    desired = read_from_etcd()      # What the user declared
    actual  = observe_cluster()     # What's actually running
    if actual != desired:
        take_action(actual, desired) # Reconcile the difference
    sleep(interval)

Once this clicks, everything else — Deployments, StatefulSets, Operators, GitOps — is just variations of the same pattern.
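
Here is that loop as a runnable toy, with an in-memory dict standing in for etcd and replica counts standing in for real pods (all names are illustrative):

```python
# Toy reconciler: an in-memory dict stands in for etcd, and replica
# counts stand in for real pods. All names here are illustrative.

def reconcile(desired: dict, actual: dict) -> dict:
    """One pass of the control loop: nudge `actual` toward `desired`."""
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have < want:
            actual[name] = have + 1   # start one pod per pass (gradual, like a rollout)
        elif have > want:
            actual[name] = have - 1   # scale down one at a time
    for name in list(actual):         # garbage-collect workloads no longer declared
        if name not in desired:
            del actual[name]
    return actual

desired = {"nginx": 3}    # what the user declared
actual = {}               # what's "running"
while actual != desired:  # the controller's loop, here stopping once converged
    actual = reconcile(desired, actual)

print(actual)  # → {'nginx': 3}
```

Real controllers never exit: they watch the API Server and re-run this pass forever, which is why deleting a pod by hand just makes a new one appear.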


🏗️ Architecture: The 10,000-Foot View

Every Kubernetes cluster has two planes:

┌─────────────────────────────────────────────────────────────────┐
│                    KUBERNETES CLUSTER                            │
│                                                                  │
│  ┌────────────────────────────┐  ┌───────────────────────────┐  │
│  │     CONTROL PLANE          │  │      WORKER NODE           │  │
│  │                            │  │                            │  │
│  │  API Server ← the ONLY    │  │  kubelet ← talks to API   │  │
│  │    entry point for ALL     │  │    Server, manages pods    │  │
│  │    cluster operations      │  │                            │  │
│  │                            │  │  kube-proxy ← manages     │  │
│  │  etcd ← the "brain"       │  │    network rules for       │  │
│  │    stores ALL cluster      │  │    Service routing         │  │
│  │    state as key-value      │  │                            │  │
│  │    pairs                   │  │  Container Runtime         │  │
│  │                            │  │    (containerd) ← runs     │  │
│  │  Scheduler ← decides      │  │    actual containers       │  │
│  │    WHERE pods run          │  │                            │  │
│  │                            │  │  ┌──────┐ ┌──────┐        │  │
│  │  Controller Manager ←     │  │  │ Pod  │ │ Pod  │        │  │
│  │    runs reconciliation    │  │  │  A   │ │  B   │        │  │
│  │    loops                   │  │  └──────┘ └──────┘        │  │
│  └────────────────────────────┘  └───────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

The one thing people get wrong: They think kubectl talks directly to nodes. It doesn't. Every single command goes through the API Server, which authenticates you, authorizes the request, runs admission controllers, and then stores the result in etcd. The kubelet on each node watches etcd (via the API Server) and makes reality match the desired state.

etcd: The Most Important Component Nobody Talks About

If your API Server goes down, you can't make changes — but existing workloads keep running. If etcd goes down and you have no backup, your entire cluster state is gone. Every pod definition, every secret, every service configuration — everything.

# etcd backup — do this or regret it later
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

Production rule: Back up etcd every hour. Test the restore process quarterly. Store backups off-cluster in encrypted object storage.


📦 Pods: The Atomic Unit (But You'll Rarely Create Them Directly)

A Pod is the smallest deployable unit — but here's what beginners miss: you almost never create Pods directly. You create Deployments (which create ReplicaSets, which create Pods). Direct Pod creation means no self-healing, no scaling, and no rolling updates.

That said, understanding Pods is critical because everything builds on them:

apiVersion: v1
kind: Pod
metadata:
  name: production-pod
  labels:
    app: api-server
spec:
  # Init containers run FIRST, in order, one at a time.
  # The main containers don't start until ALL init containers succeed.
  initContainers:
    - name: wait-for-database
      image: busybox:1.36
      command: ['sh', '-c', 
        'until nc -z db-service 5432; do echo waiting for db; sleep 2; done']
      # WHY: Prevents the app from starting before its database is ready.
      # Without this, you'd get connection errors on startup.

  containers:
    - name: api
      image: myapp:v2.1
      ports:
        - containerPort: 8080

      # Resource requests = SCHEDULER uses these to pick a node
      # Resource limits = KUBELET enforces these at runtime  
      resources:
        requests:           # "I need at least this much"
          cpu: 100m         # 100 millicores = 0.1 CPU core
          memory: 128Mi     # 128 mebibytes
        limits:             # "Never let me use more than this"
          cpu: 500m
          memory: 512Mi     # Exceed this → OOMKilled

      # Probes tell Kubernetes about your app's health
      livenessProbe:        # "Is my app alive?" — fails → container restart
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10

      readinessProbe:       # "Can my app serve traffic?" — fails → removed from Service
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5

      startupProbe:         # "Has my app finished starting?" — fails → container restart
        httpGet:            # Disables liveness/readiness until it passes
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 10   # Gives up to 300s (30×10) for slow-starting apps

The Multi-Container Patterns You'll Actually Use

Sidecar Pattern:                    Ambassador Pattern:
┌──────────────────────┐            ┌──────────────────────┐
│        Pod           │            │        Pod           │
│ ┌──────┐  ┌────────┐ │            │ ┌──────┐  ┌────────┐ │
│ │ App  │  │Log     │ │            │ │ App  │  │ Proxy  │ │
│ │      │→ │Shipper │ │            │ │      │→ │(to DB) │ │
│ └──────┘  └────────┘ │            │ └──────┘  └────────┘ │
│   writes    reads    │            │ localhost   handles   │
│   to shared volume   │            │ :5432       auth/TLS  │
└──────────────────────┘            └──────────────────────┘

Adapter Pattern:
┌───────────────────────┐
│          Pod          │
│ ┌────────┐ ┌────────┐ │
│ │  App   │ │Adapter │ │    → Prometheus
│ │(custom │→│(format │ │      scrapes
│ │metrics)│ │convert)│ │      /metrics
│ └────────┘ └────────┘ │
└───────────────────────┘

🔄 Deployments: Where the Magic Happens

Deployments are the workhorse of Kubernetes. Here's what happens when you update an image:

kubectl set image deployment/web nginx=nginx:1.26

Timeline:
─────────────────────────────────────────────────────────

t=0s   [v1] [v1] [v1]                    3 old pods
t=5s   [v1] [v1] [v1] [v2]               1 new pod starting
t=15s  [v1] [v1] [v2✓] ← ready           new pod passes readiness
t=16s  [v1] [v1] [v2✓]                   1 old pod terminating
t=30s  [v1] [v2✓] [v2✓]                  2nd new pod ready
t=45s  [v2✓] [v2✓] [v2✓]                 rollout complete!

And if something goes wrong:

# Instant rollback — Kubernetes keeps revision history
kubectl rollout undo deployment/web

# Check rollout status
kubectl rollout status deployment/web

# See revision history
kubectl rollout history deployment/web

The two deployment strategies you need to know:

| Strategy | How it Works | When to Use |
|---|---|---|
| RollingUpdate | Gradually replaces old pods with new ones | Default. Works for 95% of cases |
| Recreate | Kills ALL old pods, then starts new ones | When old & new versions can't coexist (DB schema changes) |
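
For reference, the pace of a rolling update is tuned with two knobs on the Deployment spec; a minimal sketch (names and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most 1 extra pod during the rollout
      maxUnavailable: 0    # never drop below 3 ready pods: true zero downtime
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
```

With `maxUnavailable: 0`, Kubernetes only terminates an old pod once its replacement passes its readiness probe, which is exactly the timeline shown above.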

🌐 Networking: The Part Everyone Struggles With

Kubernetes networking has three fundamental rules:

  1. Every pod gets its own IP address
  2. Pods can communicate with any other pod without NAT
  3. Agents on a node can communicate with all pods on that node

Here's what actually happens when Pod A talks to a Service:

Pod A (10.244.1.5)
    │
    │ DNS lookup: "api-service" → 10.96.45.12 (ClusterIP)
    ▼
kube-proxy (iptables/IPVS rules on the node)
    │
    │ Load balances to one of the endpoint pods
    ▼
Pod B (10.244.2.8) ← one of the pods behind the Service

Services Demystified

# ClusterIP (default) — internal only
apiVersion: v1
kind: Service
metadata:
  name: api-internal
spec:
  type: ClusterIP        # Only reachable inside the cluster
  selector:
    app: api
  ports:
    - port: 80            # Service port (what clients use)
      targetPort: 8080    # Container port (where your app listens)

---
# NodePort — exposes on every node's IP
apiVersion: v1
kind: Service
metadata:
  name: api-nodeport
spec:
  type: NodePort          # Accessible at <NodeIP>:30080
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080     # Range: 30000-32767

---
# LoadBalancer — cloud provider creates an actual LB
apiVersion: v1
kind: Service
metadata:
  name: api-public
spec:
  type: LoadBalancer      # Cloud provider provisions an LB (AWS ELB/NLB, Azure LB, GCP LB)
  selector:
    app: api
  ports:
    - port: 443
      targetPort: 8080

Network Policies: The Firewall You're Probably Not Using

By default, every pod can talk to every other pod. This is terrifying in production. Network Policies fix this:

# Step 1: Default deny ALL traffic in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: production
spec:
  podSelector: {}         # {} = applies to ALL pods
  policyTypes:
    - Ingress
    - Egress

---
# Step 2: Allow only what's needed
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api            # This policy protects the API pods
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # ONLY frontend pods can reach API
      ports:
        - port: 8080
          protocol: TCP

⚠️ Important: Network Policies require a CNI that supports them. Calico and Cilium work. Flannel does not.
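
One more caveat with a default-deny egress policy like the one above: it also blocks DNS, so pods can no longer resolve Service names. A hedged sketch of the usual companion policy, allowing port 53 to kube-dns:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}           # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
```

The `kubernetes.io/metadata.name` label is set automatically on every namespace in recent Kubernetes versions; on older clusters you may need to label kube-system yourself.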


🔒 Security: The Chapter You Can't Skip

RBAC (Role-Based Access Control) is the gatekeeper of Kubernetes. Every API request goes through:

Request → Authentication → Authorization (RBAC) → Admission Control → etcd
          "Who are you?"   "Can you do this?"      "Should you do this?"

Here's RBAC in practice:

# Step 1: Define what actions are allowed
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: development
rules:
  - apiGroups: [""]           # "" = core API group (pods, services, etc.)
    resources: ["pods"]
    verbs: ["get", "list", "watch"]   # Read-only access

---
# Step 2: Bind the role to a user/group/service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods-binding
  namespace: development
subjects:
  - kind: User
    name: jane@company.com
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

The golden rule: Start with zero permissions and add what's needed. Never use cluster-admin for workloads.
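
In-cluster workloads authenticate as ServiceAccounts rather than Users; binding the same pod-reader Role to one is a small variation (the account name here is illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-reader
  namespace: development
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-read-pods
  namespace: development
subjects:
  - kind: ServiceAccount
    name: ci-reader
    namespace: development
roleRef:
  kind: Role
  name: pod-reader          # the Role defined above
  apiGroup: rbac.authorization.k8s.io
```

You can verify a grant without switching credentials: `kubectl auth can-i get pods --as=system:serviceaccount:development:ci-reader -n development`.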

Pod Security: Defense in Depth

# A hardened container — this passes the "restricted" Pod Security Standard
securityContext:
  runAsNonRoot: true          # Don't run as root
  runAsUser: 1000             # Specific non-root user
  readOnlyRootFilesystem: true # Prevent writing to container filesystem
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]             # Drop ALL Linux capabilities
  seccompProfile:
    type: RuntimeDefault      # Use default seccomp profile

📊 Observability: You Can't Fix What You Can't See

Production Kubernetes needs three pillars:

┌─────────────────────────────────────────────────────────┐
│              THE THREE PILLARS                           │
│                                                          │
│   📈 METRICS          📝 LOGS           🔗 TRACES       │
│   "What's happening"  "What happened"   "Why is it slow"│
│                                                          │
│   Prometheus          Loki / EFK        Jaeger /         │
│   + Grafana           Fluentd           OpenTelemetry    │
│                                                          │
│   CPU, memory,        Application       Request flow     │
│   request rates,      errors,           across           │
│   error ratios,       audit trails,     microservices    │
│   latency P99         debug output      with timing      │
└─────────────────────────────────────────────────────────┘

The most important Kubernetes metrics to alert on:

| Metric | Alert Threshold | Why |
|---|---|---|
| kube_pod_container_status_restarts_total | > 5 in 1h | CrashLoopBackOff — something is broken |
| node_memory_MemAvailable_bytes | < 10% | Node is about to OOM-kill pods |
| kube_deployment_status_replicas_unavailable | > 0 for 5m | Deployment health issue |
| kubelet_volume_stats_available_bytes | < 15% | PVC running out of disk space |
| etcd_server_has_leader | == 0 | Critical — cluster may be headless |
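
As one concrete sketch, the first row of the table above expressed as a Prometheus alerting rule (the group name, severity label, and 5m hold are assumptions to tune):

```yaml
groups:
  - name: kubernetes-workloads
    rules:
      - alert: PodRestartingFrequently
        # kube-state-metrics exposes a monotonic restart counter;
        # increase() turns it into "restarts in the last hour"
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 5 times in 1h"
```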

⚡ Autoscaling: The Three Dimensions

Kubernetes can scale in three ways, and they work best together:

                    ┌─────────────────┐
                    │ Cluster          │
                    │ Autoscaler       │
                    │ (more NODES)     │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              │              ▼
    ┌─────────────┐          │    ┌──────────────┐
    │     HPA     │          │    │     VPA      │
    │ (more PODS) │          │    │ (bigger PODS)│
    └─────────────┘          │    └──────────────┘
                             │
                    ┌────────┴────────┐
                    │      KEDA       │
                    │ Event-driven    │ ← Kafka lag,
                    │ (custom         │   SQS depth,
                    │  triggers)      │   cron schedules
                    └─────────────────┘
# HPA: Scale based on CPU + custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:                         # to prevent flapping
        - type: Percent
          value: 25                    # Remove max 25% of pods per period
          periodSeconds: 120
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70       # Target 70% CPU usage

Pro tip: Never use HPA and VPA on the same metric (e.g., both on CPU). They'll fight each other. Use HPA for CPU/memory and VPA for right-sizing requests on separate workloads.


🚀 GitOps: The Deployment Model That Changed Everything

GitOps is the idea that your Git repository is the single source of truth for your entire infrastructure. No more kubectl apply from laptops. No more "who changed that in production?"

Traditional CI/CD:                    GitOps:
──────────────────                    ──────
CI builds image                       CI builds image
CI pushes to registry                 CI pushes to registry
CI runs kubectl apply ← PUSH model    CI updates Git manifest ← PR + review
                                      ArgoCD detects change   ← PULL model
                                      ArgoCD syncs to cluster
                                      ArgoCD detects drift & self-heals

With ArgoCD, deploying to Kubernetes looks like this:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/k8s-manifests.git
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true        # Delete resources removed from Git
      selfHeal: true     # Revert manual changes to match Git
    syncOptions:
      - CreateNamespace=true

Every change goes through a Pull Request. Every PR gets reviewed. Every merge is an audit log entry. If something breaks, git revert is your rollback.


🔥 Service Mesh: When Microservices Get Serious

Imagine 20 microservices. Each needs encrypted communication, retries, circuit breaking, and observability. Without a service mesh, you'd bake all that logic into every service in every language.

A service mesh moves that logic to the infrastructure:

WITHOUT Mesh:                         WITH Mesh (Istio):
┌───────────┐                         ┌──────────────────┐
│ Service A │                         │      Pod A       │
│ (has TLS  │   HTTP (unencrypted)    │ ┌──────┐┌──────┐ │  mTLS
│  code,    │  ──────────────────→    │ │ App  ││Envoy │ │ ════════→
│  retry    │                         │ │(just ││proxy │ │  automatic
│  logic)   │                         │ │ biz  ││      │ │  encryption,
└───────────┘                         │ │logic)││      │ │  retries,
                                      │ └──────┘└──────┘ │  metrics
                                      └──────────────────┘

Istio gives you traffic management superpowers:

# Canary deployment: Send 5% of traffic to v2
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 95
        - destination:
            host: reviews
            subset: v2
          weight: 5
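
Note that the v1 and v2 subsets referenced above are not defined by the VirtualService itself; they come from a companion DestinationRule that maps subset names to pod labels. A sketch:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
    - name: v1
      labels:
        version: v1        # routes to pods labeled version=v1
    - name: v2
      labels:
        version: v2        # routes to pods labeled version=v2
```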

When to use a service mesh:

  • ✅ 10+ microservices that need mTLS
  • ✅ Complex traffic routing (canary, A/B, fault injection)
  • ✅ Need L7 observability without code changes
  • ❌ Monolith or < 5 services (overhead not justified)
  • ❌ Performance-critical workloads with sub-millisecond latency requirements

💰 Cost Optimization: The Chapter That Pays for Itself

Most Kubernetes clusters waste 30-50% of their compute budget. Here's why and how to fix it:

Typical Cluster Resource Usage:

Requested:      ████████████████████████████░░░░░░░░░░░░  70%
Actually used:  ██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░  35%

                ←  35% WASTE (you're paying for this)  →

The Quick Wins

1. Right-size your requests:

# See actual vs requested resources
kubectl top pods -n production --sort-by=cpu

# What you'll find:
# NAME          CPU(cores)   MEMORY(bytes)
# api-server    50m          120Mi        ← requests: 500m/512Mi = 10× over-provisioned!

2. Use Spot/Preemptible instances for fault-tolerant workloads:

# Node affinity for spot instances
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/lifecycle   # label key varies by cloud/provisioner
              operator: In
              values: ["spot"]

3. Set LimitRanges so no developer accidentally requests 32Gi of memory:

apiVersion: v1
kind: LimitRange
metadata:
  name: sensible-defaults
  namespace: development
spec:
  limits:
    - type: Container
      default:          # Applied if no limits specified
        cpu: 500m
        memory: 256Mi
      defaultRequest:   # Applied if no requests specified
        cpu: 100m
        memory: 128Mi
      max:              # Hard ceiling
        cpu: "2"
        memory: 2Gi
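
A LimitRange caps individual containers; to cap a namespace as a whole, pair it with a ResourceQuota (all numbers here are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: development
spec:
  hard:
    requests.cpu: "10"       # sum of all CPU requests in the namespace
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"               # hard cap on pod count
```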

4. Schedule idle workloads to zero with KEDA:

  • Dev environments that scale to 0 pods outside business hours
  • Queue processors that scale to 0 when no messages exist
  • Staging clusters that shut down overnight
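
The first bullet, scaling dev environments to zero outside business hours, looks roughly like this with KEDA's cron scaler (the schedule, timezone, and target name are assumptions):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: dev-office-hours
  namespace: development
spec:
  scaleTargetRef:
    name: web-app            # Deployment to scale
  minReplicaCount: 0         # fully off outside the window
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: 0 8 * * 1-5   # scale up at 08:00, Mon-Fri
        end: 0 18 * * 1-5    # scale down at 18:00
        desiredReplicas: "2"
```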

Real savings: Teams implementing these four practices typically see 25-40% cost reduction within the first month.


🔧 Troubleshooting: The Skill That Separates Juniors from Seniors

When something breaks — and it will — follow this framework:

OBSERVE → DESCRIBE → LOGS → EVENTS → EXEC → NETWORK → RESOURCES → CONTROL PLANE

Here's the debugging cheat sheet I keep open at all times:

# Pod stuck in Pending?
kubectl describe pod <name> -n <ns>
# Look at Events → "Insufficient cpu/memory" = need bigger nodes or lower requests
# Look at Events → "no nodes match" = check nodeSelector/affinity/taints

# CrashLoopBackOff?
kubectl logs <name> -n <ns> --previous    # Logs from the CRASHED container
# Common causes: missing env vars, wrong command, config file not found

# ImagePullBackOff?
kubectl describe pod <name> -n <ns>
# Check: image name typo? Private registry auth? Image tag exists?

# Service not working?
kubectl get endpoints <service-name> -n <ns>
# Empty endpoints? → Labels don't match between Service selector and Pod labels

# DNS not resolving?
kubectl exec -it debug-pod -- nslookup my-service.my-namespace.svc.cluster.local
# If this fails → check CoreDNS pods: kubectl get pods -n kube-system -l k8s-app=kube-dns

# OOMKilled?
kubectl describe pod <name> -n <ns> | grep -A5 "Last State"
# Solution: Increase memory limits or fix the memory leak in your app

The Most Common Mistake Table

| Symptom | Rookie Move | What to Actually Do |
|---|---|---|
| Pod CrashLooping | Increase restartPolicy | Read the logs with --previous flag |
| Service has no endpoints | Delete and recreate | Check that selector labels match pod labels exactly |
| Node NotReady | Panic and drain | Check kubelet logs: journalctl -u kubelet -f |
| PVC stuck in Pending | Delete and retry | Check if StorageClass exists and has a provisioner |
| Deployment rollout stuck | Force rollout restart | Check pod events with kubectl describe first |

📋 The Production Readiness Checklist (Condensed)

Before going to production, verify every category:

Cluster Architecture

  • [ ] 3+ control plane nodes (HA)
  • [ ] Nodes spread across availability zones
  • [ ] etcd backup automated (hourly) and tested (quarterly)
  • [ ] CNI deployed (Calico or Cilium, not Flannel for production)

Security

  • [ ] RBAC enabled with least-privilege roles
  • [ ] Pod Security Standards enforced (restricted profile)
  • [ ] Network Policies deny-all + whitelist
  • [ ] Secrets encrypted at rest (--encryption-provider-config)
  • [ ] Container images scanned in CI pipeline
  • [ ] No containers running as root

Reliability

  • [ ] Resource requests AND limits set on every container
  • [ ] Liveness, readiness, and startup probes configured
  • [ ] PodDisruptionBudgets for critical workloads
  • [ ] Pod topology spread constraints for HA
  • [ ] Graceful shutdown handled (preStop hooks, SIGTERM)
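
That last item deserves a concrete sketch: a preStop sleep gives load balancers time to stop routing to the pod before SIGTERM arrives. A Pod spec fragment (the 10s figure is an assumption to tune per environment):

```yaml
spec:
  terminationGracePeriodSeconds: 60     # total budget before SIGKILL
  containers:
    - name: api
      image: myapp:v2.1
      lifecycle:
        preStop:
          exec:
            # Pause before SIGTERM so endpoint removal propagates
            # and in-flight requests drain
            command: ["sh", "-c", "sleep 10"]
```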

Observability

  • [ ] Prometheus + Grafana for metrics
  • [ ] Centralized logging (Loki or EFK)
  • [ ] Alerting rules with escalation paths
  • [ ] SLOs defined for critical services

Operational

  • [ ] GitOps workflow (ArgoCD or Flux)
  • [ ] Cluster upgrade runbook documented and tested
  • [ ] Disaster recovery plan with tested RTO/RPO
  • [ ] Cost monitoring with ResourceQuotas per namespace

🛠️ What's in the Repository

Everything above barely scratches the surface. The full repository is organized into 13 sections, 53 modules, and 3 capstone projects:

| Section | Modules | What You'll Learn |
|---|---|---|
| Foundations | 01-05 | What K8s is, architecture, installation, kubectl, YAML |
| Core Concepts | 06-10 | Pods, Deployments, Services, ConfigMaps, Namespaces |
| Workloads | 11-15 | DaemonSets, StatefulSets, Jobs, Scheduling, Resources |
| Networking | 16-20 | CNI, Ingress, Network Policies, Gateway API, CoreDNS |
| Storage | 21-24 | Volumes, PV/PVC, CSI Drivers, Backup with Velero |
| Security | 25-29 | RBAC, Pod Security, Secrets Management, Supply Chain, Audit |
| Observability | 30-33 | Prometheus, Logging, Tracing, Alerting & SLOs |
| Advanced | 34-40 | Helm, Kustomize, Service Mesh, Autoscaling, Operators, GitOps, Multi-Cluster |
| Cluster Mgmt | 41-44 | kOps, Rancher, kubeadm, Managed K8s (EKS/AKS/GKE) |
| CI/CD | 45-47 | Pipelines, Container Best Practices, Progressive Delivery |
| Production | 48-50 | Production Checklist, Cost Optimization, Disaster Recovery |
| Troubleshooting | 51-53 | Debugging Guide, Cheatsheet, CKA/CKAD/CKS Exam Prep |
| Projects | P1-P3 | E-Commerce Microservices, Monitoring Stack, Multi-Tenant SaaS |

Every module includes:

  • README with concepts explained from first principles
  • ASCII diagrams showing architecture and data flow
  • Annotated YAML files you can apply directly
  • Troubleshooting tables for common issues
  • Hands-on exercises to cement understanding

🎯 Four Learning Paths

Not everyone starts from the same place. Pick your path:

| Path | Duration | Modules | You'll Be Able To |
|---|---|---|---|
| Beginner | 2-3 weeks | 01-10 | Deploy apps, understand core K8s concepts |
| Intermediate | 3-4 weeks | 11-13, 16-17, 21-22, 25, 30-31, 34-35, 44 | Handle real workloads with monitoring & storage |
| Advanced | 4-6 weeks | 14-15, 18-20, 23, 26-27, 36-39, 41-42 | Build production clusters with service mesh & GitOps |
| Expert / CKA+CKS | 2-3 weeks | 28-29, 40, 43, 47-50, 51, 53 + Projects | Enterprise multi-cluster architecture, pass certifications |

🧪 Try It Right Now

# 1. Install Kind (takes 30 seconds)
go install sigs.k8s.io/kind@latest
# OR: brew install kind

# 2. Create a cluster
kind create cluster --name learn-k8s

# 3. Verify
kubectl cluster-info
kubectl get nodes

# 4. Deploy your first app
kubectl create deployment hello --image=nginx:1.25
kubectl expose deployment hello --port=80 --type=NodePort
kubectl get svc hello

# 5. You're running Kubernetes. Now go deeper. 🚀

The One Thing I'd Tell My Past Self

If I could go back and tell myself one thing before starting this Kubernetes journey, it would be:

Stop trying to learn Kubernetes by memorizing YAML.

Instead, understand the why behind every resource:

  • A Deployment exists because you need rollbacks and scaling — a naked Pod gives you neither.
  • A Service exists because Pod IPs are ephemeral — they change on every restart.
  • A PersistentVolumeClaim exists because containers are ephemeral — their filesystem dies with them.
  • RBAC exists because "everyone is admin" doesn't survive the first security audit.
  • Network Policies exist because "all pods can talk to everything" is a lateral movement dream for attackers.

Every Kubernetes concept solves a specific problem. Learn the problem first, and the YAML writes itself.


If this guide helped you, the full repository has 130+ files of this depth. Star it, fork it, and start building.

🔗 GitHub Repository: Kubernetes Mastery — From Zero to Production Hero


What Kubernetes concept gave you the most trouble? Drop it in the comments — I'll explain it like you're five. 👇
