Not just "what's different" — but WHY it's different, HOW each component works under the hood, and WHEN to choose which.
🧠 Why This Post Exists
Every "K3s vs K8s" article you've read probably gave you a table with checkmarks and said "K3s is lightweight." That's true — but why is it lightweight? What did Rancher actually strip out, merge, or replace? What are the architectural trade-offs you inherit when you deploy K3s in production?
This post tears open both control planes component by component. We'll go deep into what each piece actually does internally, then see how K3s reimagines it.
🏗️ The Kubernetes Control Plane: A Ground-Up Look
Before comparing, let's build a mental model of each standard Kubernetes control plane component. Not the 30-second version — the real one.
1. 🔵 kube-apiserver — The Brain's Frontal Lobe
What It Actually Does
The API server is not just a REST endpoint. It is the only component in Kubernetes that talks directly to etcd. Every other component — scheduler, controller-manager, kubelet — communicates exclusively through the API server. This is a deliberate architectural decision called the hub-and-spoke pattern.
When you run kubectl apply -f deployment.yaml, here's what actually happens:
kubectl → HTTPS → kube-apiserver
│
├── 1. Authentication (Who are you?)
│ └── x509 certs / Bearer tokens / OIDC / Webhook
│
├── 2. Authorization (Can you do this?)
│ └── RBAC / ABAC / Node / Webhook evaluators
│
├── 3. Mutating Admission (Should this be changed?)
│ └── Mutating Webhooks ← can MODIFY the object
│
├── 4. Schema Validation
│ └── OpenAPI v3 schema enforcement per GVK
│
├── 5. Validating Admission (Should this be allowed?)
│ └── Validating Webhooks ← can REJECT the object
│
└── 6. Persist to etcd
    └── /registry/deployments/default/my-app
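To make the request pipeline concrete, here is a minimal Python sketch of the same stages as plain function calls. Everything in it is a stand-in chosen for illustration: `TOKENS`, `RBAC`, `STORE`, the `created-by` label, and the replica limit are hypothetical, not anything the real API server ships. It follows upstream ordering, where schema validation sits between the mutating and validating admission phases.

```python
class Rejected(Exception):
    """Raised by whichever stage refuses the request."""

    def __init__(self, status, reason):
        super().__init__(f"{status} {reason}")
        self.status = status


TOKENS = {"token-alice": "alice"}            # hypothetical bearer tokens
RBAC = {("alice", "create", "deployments")}  # hypothetical allow rules
STORE = {}                                   # stands in for etcd


def handle(token, verb, resource, obj):
    # Authentication: who are you?
    user = TOKENS.get(token)
    if user is None:
        raise Rejected(401, "Unauthorized")

    # Authorization: can you do this?
    if (user, verb, resource) not in RBAC:
        raise Rejected(403, "Forbidden")

    # Mutating admission: allowed to change the object
    labels = {**obj.get("labels", {}), "created-by": user}
    obj = {**obj, "labels": labels}

    # Schema validation: reject malformed objects
    if not isinstance(obj.get("name"), str):
        raise Rejected(422, "name must be a string")

    # Validating admission: may only accept or reject
    if obj.get("replicas", 1) > 100:
        raise Rejected(403, "replica limit exceeded")

    # Persist under the /registry/ prefix
    namespace = obj.get("namespace", "default")
    key = f"/registry/{resource}/{namespace}/{obj['name']}"
    STORE[key] = obj
    return key
```

The point of the shape: each stage can short-circuit the request, and only an object that survives every stage is ever written to the store.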
The Watch Mechanism — The Heartbeat of Kubernetes
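The API server implements a streaming watch mechanism: a client opens a long-lived HTTP request (a chunked response on HTTP/1.1, a stream on HTTP/2) and the server pushes change events down it as they happen. This is what makes Kubernetes reactive rather than polling-based.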
# You can see this yourself
kubectl get pods --watch -v=9
# Watch the raw HTTP stream — it's a chunked HTTP response that stays open
Every controller, scheduler, and kubelet maintains a persistent informer: a local cache kept current by a watch stream from the API server. The informer pattern:
- Does an initial LIST to populate the local cache
- Starts a WATCH from the resourceVersion of that LIST
- On disconnect, re-watches from the last known resourceVersion
- The API server buffers events in an in-memory watchCache (configurable with --watch-cache-sizes)
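The LIST-then-WATCH handshake is easier to see in code. The sketch below is a toy, assuming an in-memory `FakeAPIServer` with a single global resourceVersion counter; both class names and the event tuple layout are made up for illustration and are not the client-go API.

```python
class FakeAPIServer:
    """In-memory stand-in: current objects plus an ordered event log."""

    def __init__(self):
        self.rv = 0
        self.objects = {}
        self.events = []  # (resourceVersion, type, name, obj)

    def apply(self, name, obj):
        self.rv += 1
        kind = "MODIFIED" if name in self.objects else "ADDED"
        self.objects[name] = obj
        self.events.append((self.rv, kind, name, obj))

    def delete(self, name):
        self.rv += 1
        obj = self.objects.pop(name)
        self.events.append((self.rv, "DELETED", name, obj))

    def list(self):
        # LIST returns a snapshot and the resourceVersion it is valid at
        return dict(self.objects), self.rv

    def watch(self, since_rv):
        # WATCH returns every event strictly newer than since_rv
        return [e for e in self.events if e[0] > since_rv]


class Informer:
    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.last_rv = 0

    def start(self):
        # Initial LIST populates the cache and pins the starting version
        self.cache, self.last_rv = self.server.list()

    def sync(self):
        # One watch round. Calling it again after a "disconnect"
        # resumes from last_rv, so no event is missed or replayed.
        for rv, kind, name, obj in self.server.watch(self.last_rv):
            if kind == "DELETED":
                self.cache.pop(name, None)
            else:
                self.cache[name] = obj
            self.last_rv = rv
```

Because the informer only remembers `last_rv`, reconnecting is cheap: it never has to re-LIST unless the server has already discarded the events it needs.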
┌─────────────────────────┐
│ kube-apiserver │
│ │
│ ┌─────────────────┐ │
│ │ etcd watch │ │
│ └────────┬────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ watchCache │ │ ← In-memory ring buffer
│ └────────┬────────┘ │
│ │ │
└───────────┼─────────────┘
│
┌─────────────────┼──────────────────┐
│ │ │
┌────▼────┐ ┌─────▼─────┐ ┌─────▼─────┐
│Scheduler│ │Controller │ │ kubelet │
│Informer │ │ Informer │ │ Informer │
└─────────┘ └───────────┘ └───────────┘
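The ring-buffer behavior of the watchCache explains why a client that falls too far behind has to start over. A rough sketch, assuming contiguous resource versions (real ones are not contiguous per resource) and a hypothetical `WatchCache` class:

```python
from collections import deque


class ResourceVersionTooOld(Exception):
    """The requested version has already fallen out of the buffer."""


class WatchCache:
    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest event when full
        self.events = deque(maxlen=capacity)

    def add(self, rv, event):
        self.events.append((rv, event))

    def since(self, rv):
        # We can only serve the watch if nothing between rv and the
        # oldest buffered event has been dropped.
        if self.events and rv < self.events[0][0] - 1:
            raise ResourceVersionTooOld(f"version {rv} is too old")
        return [event for r, event in self.events if r > rv]
```

In the real API server the equivalent failure surfaces to clients as a 410 Gone "too old resource version" response, and the informer reacts by doing a fresh LIST.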
Aggregation Layer & CRDs
The API server can extend itself via two mechanisms:
- CRDs (Custom Resource Definitions): Schema is stored in etcd, handled natively by the API server itself
- Aggregation Layer (AA): Proxy traffic to an external API server (used by metrics-server, KEDA, etc.)
# CRD: API server owns the storage
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com
---
# AA: API server proxies to external server
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  service:
    name: metrics-server
    namespace: kube-system
Production Tuning Knobs
# --max-requests-inflight: max concurrent non-mutating requests
# --max-mutating-requests-inflight: max concurrent mutating requests
# --watch-cache-sizes: per-resource watch cache sizes
kube-apiserver \
  --max-requests-inflight=400 \
  --max-mutating-requests-inflight=200 \
  --watch-cache-sizes=pods#1000 \
  --enable-admission-plugins=NodeRestriction,PodSecurity \
  --audit-log-path=/var/log/audit.log \
  --audit-policy-file=/etc/k8s/audit-policy.yaml
2. 🟣 etcd — The Distributed Brain's Memory
What etcd Actually Is
etcd is a distributed key-value store built on the Raft consensus algorithm. It's not a database in the traditional sense — it's a fault-tolerant state machine where every write must be agreed upon by a quorum of nodes before it's committed.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ etcd-0 │ │ etcd-1 │ │ etcd-2 │
│ (LEADER) │◄────│ (FOLLOWER) │ │ (FOLLOWER) │
│ │────►│ │ │ │
└──────┬──────┘ └─────────────┘ └──────▲──────┘
│ │
└───────────────────────────────────────┘
Raft Heartbeats
Raft in Plain English
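- Leader Election: One node becomes leader and sends periodic heartbeats. If a follower hears no heartbeat within its election timeout, it becomes a candidate and requests votes; a candidate that wins a majority becomes the new leader.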
- Log Replication: Every write goes to the leader. Leader appends it to its log and replicates it to followers. Once a majority acknowledges, the write is committed.
- Quorum Math: floor(n/2) + 1 nodes must agree. For 3 nodes: 2. For 5 nodes: 3.
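The quorum arithmetic is small enough to write down directly, and doing so shows why etcd clusters are sized with odd numbers. The two helper names below are just for this example.

```python
def quorum(n):
    """Smallest number of members that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many members can be lost while a majority still exists."""
    return n - quorum(n)
```

Going from 3 to 4 members raises the quorum from 2 to 3 but still tolerates only one failure, so the fourth node adds replication cost without adding resilience.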
etcd write path:
Client → Leader APPEND entry to log
Leader SEND AppendEntries RPC to all followers
Followers ACKNOWLEDGE
Leader COMMITS when the majority ack
Leader RESPONDS to client
Leader NOTIFIES followers of the commit
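As a rough illustration of that write path, the sketch below models a leader that appends locally, replicates to followers, and commits only when a majority has the entry. It is a deliberate simplification: there are no terms, no retries, and no log repair, and `Node`, `Leader`, and `append_entries` are names invented for this example rather than etcd's API.

```python
class Node:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.log = []

    def append_entries(self, entry):
        # A follower acknowledges only if it actually received the entry
        if not self.reachable:
            return False
        self.log.append(entry)
        return True


class Leader(Node):
    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers
        self.commit_index = 0

    def write(self, entry):
        self.log.append(entry)  # leader appends to its own log first
        acks = 1                # and counts itself toward the majority
        acks += sum(f.append_entries(entry) for f in self.followers)
        cluster_size = len(self.followers) + 1
        if acks >= cluster_size // 2 + 1:
            self.commit_index = len(self.log)  # majority reached: commit
            return True
        return False            # no quorum: the client sees a failed write
```

With one follower down in a three-node cluster the write still commits; with both down it does not, which is the "cluster stops accepting writes" case.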
How Kubernetes Data Lives in etcd
All Kubernetes objects are stored under /registry/ with the structure:
/registry/{resource-type}/{namespace}/{name}
Examples:
/registry/pods/default/nginx-7d8b9f-xyz
/registry/deployments/kube-system/coredns
/registry/secrets/default/my-secret
/registry/events/default/pod-scheduled-event
Built-in resources are serialized using protobuf (not JSON!) for efficiency; custom resources are stored as JSON. You can inspect a stored object:
# Decode an etcd value
etcdctl get /registry/pods/default/nginx \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
| auger decode # github.com/jpbetz/auger
MVCC — Multi-Version Concurrency Control
etcd uses MVCC, meaning it keeps multiple historical versions of every key. Each write increments a global revision counter. The API server surfaces that revision as each object's resourceVersion and relies on it for watch ordering and conflict detection.
# See the revision
etcdctl get /registry/pods/default/nginx -w json | jq .header.revision
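etcd does not keep that history forever. The API server asks etcd to compact old revisions on a timer (every 5 minutes by default, via --etcd-compaction-interval), and compacted revisions are deleted. This is why a watch that resumes from a very old resourceVersion can fail with a "compacted" error. The often-quoted 2GB figure is a different limit: etcd's default backend quota (--quota-backend-bytes). Once the database exceeds it, etcd raises a NOSPACE alarm and rejects writes until you compact and defragment.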
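A toy MVCC store makes the revision and compaction behavior tangible. This is a sketch under simplifying assumptions (one global counter, no deletes, no leases); `MVCCStore` and `Compacted` are illustrative names, not etcd types.

```python
class Compacted(Exception):
    """The requested revision has been compacted away."""


class MVCCStore:
    def __init__(self):
        self.revision = 0    # global counter, bumped on every write
        self.history = {}    # key -> list of (revision, value)
        self.compacted = 0   # reads below this revision fail

    def put(self, key, value):
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def get(self, key, at=None):
        at = self.revision if at is None else at
        if at < self.compacted:
            raise Compacted(f"revision {at} has been compacted")
        versions = [v for r, v in self.history.get(key, []) if r <= at]
        return versions[-1] if versions else None

    def compact(self, rev):
        self.compacted = rev
        for key, versions in self.history.items():
            older = [x for x in versions if x[0] <= rev]
            newer = [x for x in versions if x[0] > rev]
            # Keep the value that was current at `rev`, drop what it superseded
            self.history[key] = older[-1:] + newer
```

Reading at an old revision works until that revision is compacted, after which the only option is to re-read the current state, which is exactly what an informer does when its watch is rejected.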
etcd Failure Modes You Must Know
| Scenario | What Happens |
|---|---|
| 1 node fails (3-node cluster) | Cluster continues. Writes still work. |
| 2 nodes fail (3-node cluster) | CLUSTER STOPS ACCEPTING WRITES. API server returns 503. |
| Leader fails | Election happens. Writes stall for roughly one election timeout (1000ms by default in etcd) while a new leader is elected. |
| Network partition | Minority partition loses quorum and rejects writes and linearizable reads. Majority continues. |
| etcd OOM | API server loses state store. Catastrophic. |
⚠️ This is the critical difference with K3s. If you're running K3s with embedded SQLite, you get zero HA for the datastore by default.
3. 🟡 kube-scheduler — The CPU-Time Auctioneer
What It Actually Does
The scheduler watches for Pods with no nodeName assigned (they show up as Pending) and decides which Node each one should run on. It does NOT start the pod. It records its decision by posting a Binding to the API server, which sets spec.nodeName on the Pod. The kubelet on that node then sees a Pod assigned to it and starts the containers.
Pod created (nodeName: "") → Scheduler sees it via watch
→ Runs filtering + scoring
→ Writes nodeName to Pod
→ kubelet on that node sees the Pod
→ kubelet pulls image + starts container
The Scheduling Framework — Two-Phase Deep Dive
Scheduling happens in two phases: Filtering and Scoring.
Phase 1: Filtering (Hard Constraints — binary pass/fail)
All Nodes
│
▼
┌─────────────────────────────────────────────────────┐
│ Filter Plugins (run in parallel, any fail = remove) │
│ │
│ • NodeUnschedulable — node.spec.unschedulable? │
│ • NodeAffinity — matchLabels on node? │
│ • TaintToleration — pod tolerates node taints? │
│ • PodTopologySpread — spread constraints met? │
│ • VolumeBinding — PVC can bind to this node? │
│ • NodeResourcesFit — enough CPU/mem/GPU? │
│ • NodePorts — hostPort conflicts? │
└─────────────────────────────────────────────────────┘
│
▼
Feasible Nodes (subset)
Phase 2: Scoring (Soft Preferences — 0-100 score)
Feasible Nodes
│
▼
┌─────────────────────────────────────────────────────┐
│ Score Plugins (weighted sum) │
│ │
│ • LeastAllocated — prefer less loaded nodes │
│ • NodeAffinity — preferred affinities │
│ • InterPodAffinity — co-locate or spread pods │
│ • ImageLocality — prefer nodes with image │
│ • TaintToleration — fewer preferred taints │
│ • TopologySpreadConstraint — balance spread │
└─────────────────────────────────────────────────────┘
│
▼
Highest Score Node → Binding (nodeName written)
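The two phases reduce to "drop what cannot work, then rank what is left". Here is a minimal sketch with three filters and two score functions. The node and pod dictionaries, their field names, and the equal weights are assumptions made for the example; the real plugins are considerably richer.

```python
def schedule(pod, nodes):
    """Return the name of the best node for pod, or None if none is feasible."""
    # Phase 1: filtering. Any failed hard constraint removes the node.
    feasible = [
        n for n in nodes
        if not n["unschedulable"]
        and n["free_cpu"] >= pod["cpu"]
        and all(t in pod["tolerations"] for t in n["taints"])
    ]
    if not feasible:
        return None

    # Phase 2: scoring. Each function returns 0-100; the results are summed.
    def score(n):
        # Least-allocated style: more CPU left after placement is better
        least_allocated = 100 * (n["free_cpu"] - pod["cpu"]) / n["cpu"]
        # Image locality: prefer nodes that already have the image
        image_locality = 100 if pod["image"] in n["images"] else 0
        return least_allocated + image_locality

    return max(feasible, key=score)["name"]
```

Note that a node carrying an untolerated taint never reaches scoring at all, no matter how empty it is.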
Preemption — What Happens When No Node Passes Filtering
If no node can fit the Pod, the scheduler checks if lower priority pods can be evicted to make room:
- Find nodes where evicting lower-priority pods creates enough room
- Pick the node that requires evicting the fewest/lowest-priority pods
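- Delete the victim pods (honoring their graceful termination period), record the chosen node in the pending pod's status.nominatedNodeName, and schedule the pod once the room is free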
# Priority classes matter here
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
---
# Built-in classes: system-cluster-critical is 2000000000,
# system-node-critical is 2000001000.
# Pods using them will preempt your workloads if nodes are tight
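Victim selection during preemption can be sketched as a small search: for each node, evict the lowest-priority pods until the pending pod fits, then prefer the node that needs the fewest evictions. This is a simplification of the real logic, which also weighs PodDisruptionBudgets and victim priorities; the function name and dictionary fields are assumptions for the example.

```python
def pick_preemption_node(pod, nodes):
    """Return (node_name, victim_names) or None if preemption cannot help."""
    best = None
    for node in nodes:
        free = node["free_cpu"]
        victims = []
        # Consider the lowest-priority pods first
        for running in sorted(node["pods"], key=lambda p: p["priority"]):
            if free >= pod["cpu"]:
                break
            # Never evict a pod of equal or higher priority
            if running["priority"] >= pod["priority"]:
                break
            victims.append(running["name"])
            free += running["cpu"]
        if free < pod["cpu"]:
            continue  # even evicting everything eligible is not enough
        if best is None or len(victims) < len(best[1]):
            best = (node["name"], victims)
    return best
```

A pending pod with priority 0 can never preempt anything under this rule, which is why the priority value on your workloads matters when nodes are tight.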
The Binding Cache — Optimistic Concurrency
The scheduler maintains an assumed pod cache. After scoring but before the API server confirms the bind, the scheduler optimistically assumes the pod is placed and accounts for that node's capacity. This prevents scheduling thrash in high-throughput clusters.
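The assumed-pod idea boils down to reserving capacity in a local cache before the API server has confirmed the bind, and giving it back if the bind fails. A minimal sketch, with a hypothetical `SchedulerCache` that tracks CPU only:

```python
class SchedulerCache:
    def __init__(self, free_cpu):
        self.free_cpu = dict(free_cpu)  # node name -> free CPU
        self.assumed = {}               # pod name -> (node, cpu) awaiting bind

    def assume(self, pod, node, cpu):
        # Reserve capacity immediately so the next scheduling cycle sees it
        if self.free_cpu[node] < cpu:
            raise ValueError(f"{node} cannot fit {pod}")
        self.free_cpu[node] -= cpu
        self.assumed[pod] = (node, cpu)

    def confirm(self, pod):
        # Bind succeeded: the reservation becomes permanent
        self.assumed.pop(pod)

    def forget(self, pod):
        # Bind failed: release the reservation
        node, cpu = self.assumed.pop(pod)
        self.free_cpu[node] += cpu
```

Without the reservation step, two pods scheduled back to back could both be placed on a node that only has room for one.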
4. 🟢 kube-controller-manager — The Reconciliation Engine
What It Actually Is
The controller manager is a single binary that runs ~30+ independent control loops as goroutines. Each controller watches specific resource types and reconciles desired state vs actual state.
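// The reconciliation loop in pseudocode (every controller)
for {
  desired := get_desired_state_from_api_server()
  actual := get_actual_state_from_world()
  if desired != actual {
    take_action_to_make_actual_match_desired()
  }
  // Real controllers block on watch events rather than sleeping;
  // the periodic resync is only a safety net (--min-resync-period defaults to 12h)
  wait_for_event_or_resync()
}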
Key Controllers and What They Actually Do
ReplicaSet Controller
Watches: ReplicaSets, Pods
Loop:
current_pods = list pods with matching selector
delta = replicaset.spec.replicas - len(current_pods)
if delta > 0: create `delta` pods
if delta < 0: delete abs(delta) pods (by priority: unscheduled first)
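The ReplicaSet loop above translates almost line for line into a pure function. The sketch below returns the actions instead of performing them, and uses a simplified deletion ranking (unscheduled first, then not-ready); the pod dictionary fields are assumptions for the example.

```python
def reconcile_replicaset(desired_replicas, pods):
    """Return (pods_to_create, names_to_delete) for one reconcile pass."""
    delta = desired_replicas - len(pods)
    if delta > 0:
        return delta, []
    # Rank deletion candidates: unscheduled pods first, then not-ready ones.
    # False sorts before True, so the "cheapest" pods come first.
    ranked = sorted(pods, key=lambda p: (p["node"] is not None, p["ready"]))
    # When delta is 0 the slice is empty, so nothing is deleted
    return 0, [p["name"] for p in ranked[:-delta]]
```

Because the function looks only at the current count and the desired count, running it twice in a row is harmless, which is the property that makes controllers safe to retry.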
Deployment Controller
Watches: Deployments, ReplicaSets
Loop:
desired_rs = compute_hash(deployment.spec.template)
if no RS with that hash: create new RS
scale up new RS, scale down old RS (by strategy: RollingUpdate or Recreate)
update deployment.status (readyReplicas, conditions, etc.)
Node Controller — This one is critical to understand
Watches: Nodes
Loop:
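for each node:
  if no heartbeat for node-monitor-grace-period (default 40s):
    set NodeReady=Unknown
    taint node with node.kubernetes.io/unreachable:NoExecute
  pods are evicted once their toleration for that taint runs out
  (the DefaultTolerationSeconds admission plugin adds tolerationSeconds=300,
  which is where the familiar 5-minute delay comes from; the older
  --pod-eviction-timeout flag no longer drives this)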
EndpointSlice Controller — How Services actually work
Watches: Services, Pods
Loop:
for each service:
pods = list pods matching service.spec.selector where pod.status.ready=true
build EndpointSlices (groups of 100 endpoints each)
write EndpointSlices to API server
(kube-proxy watches EndpointSlices and updates iptables/ipvs rules)
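The core of that loop is a filter plus a chunking step. A small sketch, assuming pods are plain dictionaries with `ip`, `ready`, and `labels` fields (names chosen for the example):

```python
def build_endpoint_slices(selector, pods, max_per_slice=100):
    """Group the IPs of ready, selector-matching pods into slices."""
    ready_ips = [
        p["ip"] for p in pods
        # dict.items() views are set-like, so <= means "is a subset of"
        if p["ready"] and selector.items() <= p["labels"].items()
    ]
    return [
        ready_ips[i:i + max_per_slice]
        for i in range(0, len(ready_ips), max_per_slice)
    ]
```

Splitting into fixed-size slices means a single pod change rewrites one small object rather than one object listing every endpoint of the Service.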
Informer + WorkQueue Architecture
Every controller is built on the same pattern:
API Server Watch
│
▼
Informer
(local cache)
│
▼ (on change event)
WorkQueue ←──── rate-limited, deduplicated
│
▼
Worker goroutines (usually 1-5)
│
▼
Reconcile function
│
├── Success → remove from queue
└── Failure → re-queue with exponential backoff
This pattern means controllers are eventually consistent — they don't act on every single event, they converge to the desired state over time.
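Two properties of the work queue do most of the heavy lifting: duplicate keys collapse into one entry, and failures are retried with growing delays. The sketch below shows both. The 5ms base and 1000s cap are chosen to resemble client-go's default per-item rate limiter, but treat them, and the `WorkQueue` and `process_one` names, as assumptions for illustration.

```python
class WorkQueue:
    def __init__(self, base_delay=0.005, max_delay=1000.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.items = []      # FIFO of object keys such as "default/web"
        self.queued = set()  # keys currently waiting, for deduplication
        self.failures = {}   # key -> consecutive failure count

    def add(self, key):
        if key not in self.queued:  # ten events for one object = one entry
            self.queued.add(key)
            self.items.append(key)

    def get(self):
        key = self.items.pop(0)
        self.queued.discard(key)
        return key

    def backoff(self, key):
        n = self.failures.get(key, 0)
        self.failures[key] = n + 1
        return min(self.base_delay * 2 ** n, self.max_delay)

    def forget(self, key):
        self.failures.pop(key, None)


def process_one(queue, reconcile):
    """Run one item. Returns the retry delay on failure, None on success."""
    key = queue.get()
    try:
        reconcile(key)
    except Exception:
        delay = queue.backoff(key)
        queue.add(key)  # a real queue would re-add only after `delay`
        return delay
    queue.forget(key)
    return None
```

The queue stores keys, not events, so a worker always reconciles against the latest cached state rather than replaying a stale change.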
5. 🔴 cloud-controller-manager — The Cloud API Bridge
What It Actually Does
The CCM was split out of kube-controller-manager (alpha in Kubernetes 1.6, beta in 1.11) specifically to decouple Kubernetes from cloud provider APIs. It runs cloud-specific control loops:
Node Controller (cloud variant)
On new Node joining:
1. Fetch instance metadata from cloud API
(AWS EC2 DescribeInstances / GCP ComputeInstances)
2. Apply cloud provider labels:
- topology.kubernetes.io/zone = us-east-1a
- node.kubernetes.io/instance-type = m5.xlarge
3. Set node addresses (internal/external IP from cloud metadata)
4. Check if instance still exists periodically
→ If terminated in cloud: delete the Node object
Route Controller (AWS/GCP specific)
For each node:
ensure cloud routing table has route:
pod-cidr (e.g., 10.244.1.0/24) → node instance-id
This is how pod-to-pod routing works across nodes
WITHOUT an overlay network on supported clouds
Service Controller — The LoadBalancer Magic
Watch Services with type=LoadBalancer:
on CREATE: call cloud API → create load balancer
update service.status.loadBalancer.ingress with external IP
on UPDATE: update LB listener rules / health checks
on DELETE: delete cloud load balancer
This is why kubectl get svc shows <pending> for LoadBalancer services until the cloud LB is provisioned.
⚡ The K3s Control Plane: Architectural Reimagination
Now let's look at what K3s does differently — not just "it's smaller" but architecturally why.
K3s Single Binary Philosophy
K3s ships as a single ~70MB binary (k3s) that embeds:
k3s binary
├── k3s-server (control plane)
│ ├── kube-apiserver
│ ├── kube-controller-manager
│ ├── kube-scheduler
│ ├── kubelet
│ ├── kube-proxy
│ ├── embedded containerd
│ ├── embedded CoreDNS
│ ├── embedded Flannel (CNI)
│ ├── embedded Traefik (ingress)
│ ├── embedded ServiceLB (load balancer)
│ └── embedded local-path-provisioner (storage)
└── k3s-agent (worker)
├── kubelet
├── kube-proxy
└── embedded containerd
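The core Kubernetes components (API server, controller manager, scheduler, kubelet, kube-proxy) are not containerized: they are linked as Go packages and run inside a single process. containerd runs as a child process of k3s, and add-ons such as CoreDNS, Traefik, and local-path-provisioner are deployed as ordinary pods from manifests bundled with the binary. Startup goes from ~3 minutes (typical K8s) to under 30 seconds.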
1. K3s API Server — Same Core, Slimmer Defaults
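The K3s API server is still the upstream kube-apiserver, but K3s wraps it with slimmer defaults.
Removed/Disabled by Default:
- Alpha feature gates stay off unless you enable them (the same default as upstream)
- In-tree cloud provider plugins are removed; K3s ships a minimal embedded cloud controller in place of a vendor CCM (pass --disable-cloud-controller to run an external one)
- Several admission plugins that assume cloud infra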
The K3s Tunnel Proxy: Reversing the API Server to Kubelet Connection
K3s introduces a reverse tunnel from agent → server. In standard K8s, the API server connects to the kubelet for exec/logs/port-forward. In K3s:
Standard K8s:
kube-apiserver → kubelet:10250 (API server initiates)
Requires API server to reach all nodes directly
K3s:
k3s-agent → k3s-server:6443 (agent initiates)
┌────────────────────────────────────────────────┐
│ WebSocket tunnel maintained by agent │
│ All kubelet traffic flows THROUGH this tunnel │
└────────────────────────────────────────────────┘
This is why K3s works behind NAT without special networking — agents reach out, not the server. This is a fundamental architectural shift that enables edge/IoT deployments.
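The direction reversal is the whole trick, and it fits in a few lines of socket code. In the sketch below the "agent" dials out and the "server" then sends requests back down that same connection. It uses raw TCP with newline-delimited text on loopback as a stand-in for the real WebSocket tunnel; `run_agent`, `demo`, and the request string are all hypothetical.

```python
import socket
import threading


def run_agent(server_addr, kubelet_handler):
    """Agent side: dial OUT, then answer requests arriving on that connection."""
    with socket.create_connection(server_addr) as conn:
        with conn.makefile("r") as rfile, conn.makefile("w") as wfile:
            for line in rfile:  # requests pushed down the tunnel by the server
                wfile.write(kubelet_handler(line.strip()) + "\n")
                wfile.flush()


def demo():
    # Server side: it only listens. It never needs a route to the agent.
    listener = socket.create_server(("127.0.0.1", 0))
    addr = listener.getsockname()

    agent = threading.Thread(
        target=run_agent,
        args=(addr, lambda req: f"kubelet handled: {req}"),
        daemon=True,
    )
    agent.start()

    conn, _ = listener.accept()  # the tunnel was established by the agent
    with conn, conn.makefile("r") as rfile, conn.makefile("w") as wfile:
        wfile.write("GET /containerLogs/default/nginx\n")  # server to agent
        wfile.flush()
        reply = rfile.readline().strip()
    listener.close()
    agent.join(timeout=2)
    return reply
```

The only inbound port in this picture is on the server. The agent could sit behind any NAT that allows outbound connections and the exchange would work unchanged.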
2. SQLite / Kine — The etcd Abstraction Layer
This is the most significant architectural difference.
K3s introduces Kine (a recursive acronym for "Kine is not etcd"): a shim that translates etcd's gRPC API into SQL queries.
kube-apiserver
│
│ etcd gRPC v3 protocol (ListWatch, Txn, etc.)
▼
┌──────────────┐
│ Kine │ ← translation layer
│ (etcd shim) │
└──────┬───────┘
│ SQL queries
▼
┌──────────────┐
│ SQLite / │ ← actual datastore
│ PostgreSQL │
│ MySQL │
│ DQLite │
└──────────────┘
How Kine Implements the etcd Watch API:
etcd's watch is event-driven via gRPC streams. SQL databases don't natively support this. Kine implements it via:
-- Kine's core table
CREATE TABLE kine (
id INTEGER PRIMARY KEY AUTOINCREMENT, -- acts as etcd revision
name TEXT, -- the key (/registry/pods/...)
created INTEGER,
deleted INTEGER,
create_revision INTEGER,
prev_revision INTEGER,
lease INTEGER,
value BLOB, -- the protobuf-encoded object
old_value BLOB
);
-- Watch is implemented as polling:
-- SELECT * FROM kine WHERE id > last_seen_id ORDER BY id
-- Run every ~100ms — NOT event-driven like real etcd
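You can reproduce the polling watch with nothing but Python's built-in sqlite3 module. The sketch below uses a trimmed-down version of the table above (only the columns the example needs) and three hypothetical helpers, `open_store`, `put`, and `poll`. It is an illustration of the idea, not Kine's actual code.

```python
import sqlite3


def open_store():
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE kine (
               id INTEGER PRIMARY KEY AUTOINCREMENT,  -- acts as the revision
               name TEXT,
               created INTEGER,
               deleted INTEGER,
               value BLOB
           )"""
    )
    return db


def put(db, name, value, created=0, deleted=0):
    # Every change is a NEW row; nothing is updated in place
    cur = db.execute(
        "INSERT INTO kine (name, created, deleted, value) VALUES (?, ?, ?, ?)",
        (name, created, deleted, value),
    )
    return cur.lastrowid


def poll(db, last_seen_id):
    """One polling round: fetch every row newer than last_seen_id."""
    rows = db.execute(
        "SELECT id, name, deleted, value FROM kine WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_seen_id
    return rows, new_last
```

A watcher is then just a loop that calls `poll` on a timer and remembers the returned id, which is where the extra latency compared with an etcd gRPC stream comes from.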
The Implications:
- For small clusters: unnoticeable
- For large clusters: polling adds latency to watch events
- SQLite: single-writer, no HA (single node only)
- PostgreSQL/MySQL with Kine: HA possible but watch latency higher than etcd
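Embedded HA: DQLite (Retired) and Embedded etcd
For HA without an external DB, early K3s releases offered experimental DQLite, a distributed SQLite implementation using Raft (similar to etcd but built on SQLite). DQLite support was dropped in the v1.19 releases in favor of embedded etcd, which the API server talks to directly without going through Kine. Either way the datastore is embedded in the binary and doesn't require an external DB.
# K3s with embedded HA (embedded etcd on current releases)
k3s server --cluster-init # First server (bootstrap)
k3s server --server https://first-server:6443 --token <token> # Join as HA peer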
3. K3s Controller Manager — Pruned and Extended
K3s runs the upstream kube-controller-manager with several modifications:
Removed Controllers:
- cloud-node controller (no cloud metadata fetching)
- cloud-node-lifecycle controller
- route controller (no cloud routes)
- service controller (replaced by ServiceLB)
Added: ServiceLB (a.k.a. Klipper LoadBalancer)
Instead of calling a cloud API to provision a load balancer, K3s runs a DaemonSet-based solution:
Service type=LoadBalancer created
│
▼
ServiceLB Controller watches for it
│
▼
Creates a DaemonSet:
- Runs a pod on every node with hostPort matching service ports
- The pod does iptables DNAT → service ClusterIP
│
▼
Every node's IP becomes a valid entry point
(no external LB needed)
# Simplified sketch of what ServiceLB deploys under the hood
# (selector and pod labels omitted for brevity)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: svclb-my-service
  namespace: kube-system
spec:
  template:
    spec:
      hostNetwork: true
      containers:
      - name: lb-port-80
        image: rancher/klipper-lb:latest
        ports:
        - hostPort: 80 # binds on every node
          containerPort: 80
        env:
        - name: SRC_PORT
          value: "80"
        - name: DEST_PROTO
          value: TCP
        - name: DEST_IP
          value: "10.43.100.50" # ClusterIP (K3s default service CIDR is 10.43.0.0/16)
        - name: DEST_PORT
          value: "80"
4. K3s Scheduler — Unchanged but Co-located
The scheduler in K3s is the unmodified upstream kube-scheduler. However, it runs as a goroutine inside the k3s-server binary rather than as a separate process.
The key difference is operational:
- In K8s: the scheduler is its own process, so it can be independently upgraded or replaced (e.g., with Volcano, Yunikorn)
- In K3s: the scheduler is embedded, so replacing it means starting the server with --disable-scheduler and running your own scheduler alongside
5. The Flannel CNI — Embedded Networking
Standard K8s requires you to install a CNI (Calico, Cilium, Flannel, Weave) separately. K3s embeds Flannel with VXLAN as the default backend.
Pod on Node 1 (10.42.1.5) → Pod on Node 2 (10.42.2.7)
Standard K8s + Calico:
10.42.1.5 → BGP route → 10.42.2.7 (no encapsulation on supported networks)
K3s + Flannel VXLAN:
10.42.1.5 → VXLAN encapsulate → eth0:8472 → Node 2 → decapsulate → 10.42.2.7
(works everywhere, slight overhead from encapsulation)
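The "slight overhead" has a concrete shape: VXLAN prepends an 8-byte header carrying a 24-bit network identifier (VNI), and the result travels inside an outer UDP packet. The sketch below builds and parses just that 8-byte header per the VXLAN spec (RFC 7348). The function names are made up for the example, and VNI 1 is used because it is Flannel's usual default.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid


def vxlan_encapsulate(inner_frame, vni):
    """Prefix an inner Ethernet frame with an 8-byte VXLAN header."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: flags (8 bits) + reserved (24 bits)
    # Word 1: VNI (24 bits) + reserved (8 bits)
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + inner_frame


def vxlan_decapsulate(packet):
    """Return (vni, inner_frame) from a VXLAN payload."""
    word0, word1 = struct.unpack("!II", packet[:8])
    if not (word0 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return word1 >> 8, packet[8:]
```

Add the outer Ethernet, IP, and UDP headers and the total comes to about 50 bytes per packet, which is why overlay interfaces typically run with an MTU of 1450 on a 1500-byte network.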
K3s also supports swapping Flannel for Cilium or Calico if you disable the built-in:
k3s server --flannel-backend=none --disable-network-policy
# Then install Cilium/Calico manually
📊 Side-by-Side Deep Comparison
| Dimension | Standard Kubernetes | K3s |
|---|---|---|
| Deployment model | Separate processes (+ etcd cluster) | Single binary, all-in-one |
| API server | Full upstream, all features | Full upstream, conservative defaults |
| Datastore | etcd (Raft, event-driven watch) | SQLite via Kine (SQL polling) by default; embedded etcd or external SQL DB optional |
| Watch latency | ~10ms (event-driven) | ~100ms (polling on SQL backends) |
| HA datastore | etcd cluster (3/5 nodes) | External DB + Kine OR embedded etcd |
| Control plane HA | Multiple API server replicas | Multiple k3s-server nodes possible |
| Cloud integration | cloud-controller-manager | Minimal embedded cloud controller, ServiceLB + node-ip flags |
| LoadBalancer | Cloud LB (AWS ELB, GCP GLB) | ServiceLB DaemonSet (hostPort) |
| Ingress | Bring your own (nginx, traefik) | Traefik v2 embedded |
| CNI | Bring your own | Flannel (VXLAN) embedded |
| DNS | Bring your own CoreDNS | CoreDNS embedded |
| Storage | Bring your own CSI | local-path-provisioner embedded |
| Kubelet location | Separate binary on worker | Embedded in k3s binary |
| API server → kubelet | Direct connection (port 10250) | Reverse WebSocket tunnel |
| Memory (control plane) | ~2GB+ (separate processes) | ~512MB (single process) |
| Startup time | 2-5 minutes | 20-30 seconds |
| Alpha feature gates | Off by default, opt-in with --feature-gates | Off by default, opt-in with --kube-apiserver-arg=feature-gates=... |
| Admission webhooks | Full support | Full support |
| CRDs | Full support | Full support |
| RBAC | Full support | Full support |
| Audit logging | Configurable | Configurable |
| Scheduler extensibility | Scheduler profiles, plugins | Embedded; replace with external |
| Controller extensibility | Separate binary, hot-swap | Embedded goroutine |
| Upgrades | Independent component upgrades | Single binary upgrade |
| Edge/NAT traversal | Requires direct reachability | Native via reverse tunnel |
| ARM support | Separate builds | Native multi-arch in single release |
🔑 When to Choose What
Choose Standard Kubernetes When:
✅ 100+ node clusters
✅ Financial / regulated workloads requiring etcd for compliance
✅ You need independent control plane component upgrades
✅ You're using cloud-managed control planes (EKS, GKE, AKS)
✅ You need custom scheduler profiles (ML batch, GPU scheduling)
✅ Multi-tenancy with strong isolation requirements
✅ You need external etcd for ultra-high availability
✅ Team has K8s expertise and infra budget
Choose K3s When:
✅ Edge computing (retail, industrial, remote sites)
✅ IoT / ARM devices (Raspberry Pi clusters)
✅ CI/CD ephemeral clusters (fast startup is critical)
✅ Development environments (minimal resource usage)
✅ Single-node homelab or small on-prem clusters
✅ Clusters behind NAT (reverse tunnel is a killer feature)
✅ Teams that want "it just works" with less Ops overhead
✅ Bare metal without a cloud provider
✅ Air-gapped environments (single binary, easy to ship)
🔭 The Architecture Decision Tree
Do you need >50 nodes?
├── YES → Standard K8s (EKS/GKE/AKS or kubeadm)
└── NO
├── Are you on edge/IoT/ARM?
│ └── YES → K3s (purpose-built for this)
├── Do you need cloud LoadBalancer integration?
│ └── YES → Standard K8s with CCM
├── Is startup speed critical? (CI/CD, dev envs)
│ └── YES → K3s
├── Do you need etcd for compliance/audit?
│ └── YES → Standard K8s
└── Default recommendation for <20 nodes on-prem?
└── K3s (less to manage, same K8s API)
🎯 Closing Thoughts
K3s isn't "Kubernetes with stuff removed." It's a purpose-built reimagining of the control plane for constrained environments. Rancher made deliberate trade-offs:
- etcd → Kine/SQLite: Sacrificed watch latency and native HA for operational simplicity
- Separate binaries → Single binary: Sacrificed independent upgradeability for atomic deployments
- CCM → ServiceLB: Sacrificed cloud-native LB for zero-dependency load balancing
- Direct kubelet access → Reverse tunnel: Sacrificed simplicity for NAT traversal capability
The result is a distribution that runs the full Kubernetes API on a Raspberry Pi with 512MB of RAM, starts in 30 seconds, and works behind NAT — things standard K8s simply wasn't designed for.
Both are Kubernetes. Both run your workloads. The control plane is where the real difference lives.