<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Santosh Koti</title>
    <description>The latest articles on DEV Community by Santosh Koti (@santosh_koti).</description>
    <link>https://dev.to/santosh_koti</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3815422%2Fcc88c93d-db7f-4b18-ba5f-6f758fec34d1.png</url>
      <title>DEV Community: Santosh Koti</title>
      <link>https://dev.to/santosh_koti</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/santosh_koti"/>
    <language>en</language>
    <item>
      <title>Kubernetes Object Hierarchy: A Simple Mental Model</title>
      <dc:creator>Santosh Koti</dc:creator>
      <pubDate>Tue, 21 Apr 2026 05:00:54 +0000</pubDate>
      <link>https://dev.to/santosh_koti/kubernetes-object-hierarchy-a-simple-mental-model-min</link>
      <guid>https://dev.to/santosh_koti/kubernetes-object-hierarchy-a-simple-mental-model-min</guid>
      <description>&lt;p&gt;Kubernetes has a lot of objects. Don't think of them as a flat list. That's the wrong mental model. The right one is this: &lt;strong&gt;Kubernetes is a declarative control system.&lt;/strong&gt; You describe the state you want, and a set of objects — each with a narrow job — cooperate to make reality match. Once you see how those objects fit together, the rest of Kubernetes is mostly details.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; this post covers Kubernetes &lt;strong&gt;objects&lt;/strong&gt; — the resources you define in YAML. The &lt;strong&gt;control plane components&lt;/strong&gt; that make these objects work (API server, scheduler, controller manager, etcd) are a separate layer and not shown here.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Visual Hierarchy
&lt;/h2&gt;

&lt;p&gt;The first thing to get right is that Kubernetes partitions a cluster two different ways — physically (Nodes) and logically (Namespaces). These are &lt;strong&gt;orthogonal&lt;/strong&gt;, not nested. A Namespace spans many Nodes; a Node hosts Pods from many Namespaces. Pods are where the two views meet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[              Cluster              ]
       /                       \
[   Nodes   ]             [ Namespaces ]
 (physical)                  (logical)
       \                       /
        \                     /
         ▼                   ▼
         ┌─────────────────┐
         │      Pods       │  ← scheduled onto a Node, scoped to a Namespace
         └─────────────────┘

Inside each Namespace:

  [ Ingress ] ──► [ Service ] ──► [ Workload Controller ]
                                   (Deployment / StatefulSet /
                                    DaemonSet / Job)
                                        │
                                        └── [ ReplicaSet ]
                                                 │
                                                 └── [ Pod ]
                                                      ├── Container(s)
                                                      └── Volume(s)

  [ ConfigMap / Secret ] ───────► injected into Pods
  [ PVC ] ──► [ PV ] ───────────► mounted into Pods as persistent storage

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  1. The Infrastructure Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cluster.&lt;/strong&gt; The outer boundary — one control plane plus a pool of worker machines. Every other object lives inside it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node.&lt;/strong&gt; A single worker machine, physical or virtual, running the kubelet agent. Nodes provide the raw CPU, memory, and disk that Pods get scheduled onto; they don't care which Namespace a Pod belongs to.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The Logical Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Namespace.&lt;/strong&gt; A virtual partition for grouping and isolating resources — think environments (dev, staging, prod) or teams. Namespaces give you scoped names, per-team resource quotas, and RBAC boundaries, without needing separate clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Networking
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ingress.&lt;/strong&gt; The cluster's front door for HTTP/HTTPS. It routes external traffic to internal Services based on hostname or path, typically handling TLS termination along the way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service.&lt;/strong&gt; A stable IP and DNS name for a group of Pods. Pods are ephemeral — they get replaced, rescheduled, and change IPs — so the Service is the fixed address that Ingress and other Pods talk to.&lt;/p&gt;
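&lt;p&gt;For illustration, a minimal Service manifest; the name, labels, and ports here are placeholders, not from the post:&lt;/p&gt;

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web             # hypothetical; resolvable in-cluster as "web"
spec:
  selector:
    app: web            # matches any Pod carrying this label
  ports:
    - port: 80          # stable port that Ingress and other Pods talk to
      targetPort: 8080  # port the container actually listens on
```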

&lt;h2&gt;
  
  
  4. Workload Controllers
&lt;/h2&gt;

&lt;p&gt;You almost never create Pods directly. Instead, you pick the controller that matches how your app should behave, and it manages Pods for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment.&lt;/strong&gt; The default for stateless apps. Handles rolling updates, rollbacks, and scaling by managing a ReplicaSet underneath.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;StatefulSet.&lt;/strong&gt; For apps that need stable identity and durable storage — databases, queues, anything where &lt;code&gt;pod-0&lt;/code&gt; and &lt;code&gt;pod-1&lt;/code&gt; aren't interchangeable. Each Pod gets a predictable name and its own PVC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DaemonSet.&lt;/strong&gt; Runs one Pod per Node (or a chosen subset). Use it for node-level agents: log shippers, metrics collectors, CNI plugins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Job / CronJob.&lt;/strong&gt; A Job runs a Pod to completion for one-off work like migrations or batch processing. A CronJob is a Job on a schedule — the Kubernetes equivalent of &lt;code&gt;cron&lt;/code&gt;.&lt;/p&gt;
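&lt;p&gt;As a concrete sketch, a minimal Deployment for a stateless app; the image, labels, and replica count are placeholders:&lt;/p&gt;

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web            # hypothetical name
  namespace: prod      # the Namespace this workload is scoped to
spec:
  replicas: 3          # the ReplicaSet underneath keeps exactly 3 Pods running
  selector:
    matchLabels:
      app: web
  template:            # the Pod template the ReplicaSet stamps out
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example.com/web:1.0   # placeholder image
          ports:
            - containerPort: 8080
```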

&lt;h2&gt;
  
  
  5. The Execution Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ReplicaSet.&lt;/strong&gt; Keeps exactly N copies of a Pod running and replaces any that fail. You rarely manage these directly; a Deployment creates and updates them on your behalf.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pod.&lt;/strong&gt; The smallest deployable unit — one or more containers that share a network namespace (same IP) and a set of volumes. A Pod is the thing that actually gets scheduled onto a Node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container.&lt;/strong&gt; The running instance of your application image. Most Pods have one main container, sometimes with sidecars for logging, proxying, or similar helper tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume.&lt;/strong&gt; A directory available to the containers in a Pod, defined at the Pod level. It outlives individual container restarts and can be backed by anything from a temporary disk to a cloud volume.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Configuration and Storage
&lt;/h2&gt;

&lt;p&gt;These objects don't run code — they supply Pods with the data and storage they need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ConfigMap.&lt;/strong&gt; Non-sensitive configuration — env vars, config files, feature flags — kept outside the container image so the same image runs anywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secret.&lt;/strong&gt; Same shape as a ConfigMap, but for sensitive values: API keys, passwords, TLS certs. Stored with tighter access controls and optionally encrypted at rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PersistentVolumeClaim (PVC) and PersistentVolume (PV).&lt;/strong&gt; A PV is a real piece of storage in the cluster (a cloud disk, an NFS share). A PVC is a Pod's request for storage — "I need 20Gi of SSD" — which Kubernetes binds to a matching PV. This indirection lets developers ask for storage without knowing the underlying infrastructure.&lt;/p&gt;
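&lt;p&gt;A minimal sketch of the request side of that indirection, with placeholder names and a hypothetical storage class:&lt;/p&gt;

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc        # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce     # mountable read-write by a single Node
  storageClassName: ssd # placeholder class; the cluster maps it to real storage
  resources:
    requests:
      storage: 20Gi     # "I need 20Gi": Kubernetes binds this to a matching PV
```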

&lt;h2&gt;
  
  
  How It All Fits Together
&lt;/h2&gt;

&lt;p&gt;The payoff of this hierarchy is &lt;strong&gt;separation of concerns&lt;/strong&gt;, and it shows up clearly once you trace a request end to end:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Deployment&lt;/strong&gt; declares &lt;em&gt;what&lt;/em&gt; should run.&lt;/li&gt;
&lt;li&gt;Its &lt;strong&gt;ReplicaSet&lt;/strong&gt; ensures &lt;em&gt;how many&lt;/em&gt; are running.&lt;/li&gt;
&lt;li&gt;Pods are placed onto Nodes — that's &lt;em&gt;where&lt;/em&gt; they run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingress&lt;/strong&gt; and &lt;strong&gt;Service&lt;/strong&gt; decide &lt;em&gt;how&lt;/em&gt; traffic reaches them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ConfigMaps&lt;/strong&gt;, &lt;strong&gt;Secrets&lt;/strong&gt;, and &lt;strong&gt;PVCs&lt;/strong&gt; provide &lt;em&gt;what they need&lt;/em&gt; — turning a generic image into a configured, stateful instance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the real payoff: your application is just a container. Everything around it — how it scales, how traffic reaches it, what config it reads, what storage it mounts — is handled by separate objects. Swap those objects and the same image runs anywhere, unchanged.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One axis this post deliberately &lt;strong&gt;skips&lt;/strong&gt; is identity and access — ServiceAccounts, RBAC, and how cluster identity maps to cloud IAM. That would require a post of its own.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>kubernetes</category>
      <category>cloudnative</category>
      <category>distributedsystems</category>
      <category>go</category>
    </item>
    <item>
      <title>The Metadata Wall - Why Control Planes Break Before Data Planes Do</title>
      <dc:creator>Santosh Koti</dc:creator>
      <pubDate>Thu, 02 Apr 2026 01:50:51 +0000</pubDate>
      <link>https://dev.to/santosh_koti/the-metadata-wall-why-control-planes-break-before-data-planes-do-2mo9</link>
      <guid>https://dev.to/santosh_koti/the-metadata-wall-why-control-planes-break-before-data-planes-do-2mo9</guid>
      <description>&lt;p&gt;When building at &lt;strong&gt;massive&lt;/strong&gt; scale, "Data" is rarely the most complex part of the puzzle. Data is heavy, but it’s predictable. We have S3 for storage, NVMe for local speed, and bit-shoveling pipelines that can move petabytes.&lt;/p&gt;

&lt;p&gt;The real challenge is &lt;strong&gt;Metadata.&lt;/strong&gt; Metadata is the "Control Plane." It’s the routing table, the ownership lease, and the global state of the world. In a multi-datacenter (MDC) system, if your metadata layer hiccups, your data layer becomes a collection of expensive, unreachable zeros and ones.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Taxonomy: Map vs. Territory
&lt;/h2&gt;

&lt;p&gt;The simplest mental model is a &lt;strong&gt;Library.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data is the books.&lt;/strong&gt; They are huge, they sit on shelves, and you rarely move them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata is the card catalog.&lt;/strong&gt; It’s tiny, but it tells you exactly where every book is.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this level of architecture, you realize that &lt;strong&gt;Metadata is the scaling bottleneck.&lt;/strong&gt; You can always add more "shelves" (Data nodes), but you eventually hit a wall on how fast you can update the "catalog" (Metadata store) while keeping it consistent across Virginia, Dublin, and Singapore.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Latency Floor: Paying the Speed of Light Tax
&lt;/h2&gt;

&lt;p&gt;At global scale, the speed of light isn't just a physics trivia point—it is the hard floor of your system's tail latency (P99).&lt;/p&gt;

&lt;p&gt;Metadata usually requires &lt;strong&gt;Strong Consistency.&lt;/strong&gt; You cannot have two different datacenters believing they both "own" the same user’s record. To prevent this, we use consensus protocols like Raft or Paxos. This means a write to the metadata layer is physically bound by the Round Trip Time (RTT) of a majority of your nodes.&lt;/p&gt;

&lt;p&gt;The math is unforgiving. Information in fiber-optic cables travels at roughly 2/3 the speed of light.&lt;/p&gt;

&lt;p&gt;A round-trip from New York to London (~11,000 km total distance) has a hard physical floor of &lt;strong&gt;~55ms&lt;/strong&gt; for every metadata commit. While your &lt;strong&gt;Data Plane&lt;/strong&gt; can cheat by using asynchronous replication (writing locally and syncing later), your &lt;strong&gt;Control Plane&lt;/strong&gt; (Metadata) cannot. It must pay this "Consensus Tax" in real-time to maintain a valid global state.&lt;/p&gt;
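&lt;p&gt;That floor is easy to check with a back-of-the-envelope calculation, assuming (as above) that light in fiber travels at roughly 2/3 of c:&lt;/p&gt;

```python
C_VACUUM_KM_S = 299_792   # speed of light in vacuum, km/s
FIBER_FRACTION = 2 / 3    # typical propagation speed in fiber-optic cable

def rtt_floor_ms(round_trip_km: float) -> float:
    """Hard physical floor on round-trip time over fiber, in milliseconds."""
    return round_trip_km / (C_VACUUM_KM_S * FIBER_FRACTION) * 1000

# New York to London and back: roughly 11,000 km of fiber
print(round(rtt_floor_ms(11_000)))  # → 55
```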




&lt;h2&gt;
  
  
  3. The Architecture: Replicated Routing &amp;amp; The "Push" Model
&lt;/h2&gt;

&lt;p&gt;To scale effectively, we must decouple the &lt;strong&gt;Metadata Store&lt;/strong&gt; from &lt;strong&gt;Metadata Distribution.&lt;/strong&gt; If every data request had to query a central store like &lt;code&gt;etcd&lt;/code&gt; or &lt;code&gt;FoundationDB&lt;/code&gt; for a route, the coordination overhead would collapse the system. Instead, we use a "Push" model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;The Source of Truth:&lt;/strong&gt; A small, strongly consistent cluster (The "Gold Copy").&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Distribution Layer:&lt;/strong&gt; A sidecar or agent that "watches" the source of truth and streams updates to every router in the fleet via a gossip protocol or watch stream.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Local View:&lt;/strong&gt; Every router maintains an in-memory, $O(1)$ lookup table of the world.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This turns a "Global Network Trip" into a "Local Memory Lookup."&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling the "Stale View"
&lt;/h3&gt;

&lt;p&gt;The trade-off is that some routers will inevitably be a few milliseconds behind. High-availability designs handle this via &lt;strong&gt;Self-Healing Loops&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If a router uses a stale metadata version and hits the wrong node, the node rejects the request with a &lt;strong&gt;Version Mismatch&lt;/strong&gt; or &lt;strong&gt;Wrong Shard&lt;/strong&gt; error.&lt;/li&gt;
&lt;li&gt;The router immediately invalidates its local cache and fetches the latest "Map" from the Control Plane.&lt;/li&gt;
&lt;/ul&gt;
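&lt;p&gt;A minimal sketch of that self-healing loop; every class and method name below is hypothetical, not taken from any particular system:&lt;/p&gt;

```python
class WrongShard(Exception):
    """Raised by a data node when the caller's routing-map version is stale."""

class DataNode:
    def __init__(self, version):
        self.version = version
    def handle(self, request, map_version):
        if map_version != self.version:
            raise WrongShard()          # Version Mismatch / Wrong Shard error
        return f"ok:{request}"

class ControlPlane:
    """Source of truth; fetch_map returns the latest (version, node) view."""
    def __init__(self):
        self.version = 1
        self.node = DataNode(self.version)
    def fetch_map(self):
        return self.version, self.node
    def reshard(self):                  # ownership moves; old maps go stale
        self.version += 1
        self.node = DataNode(self.version)

class Router:
    def __init__(self, cp):
        self.cp = cp
        self.map_version, self.node = cp.fetch_map()   # local O(1) view
    def send(self, request):
        try:
            return self.node.handle(request, self.map_version)
        except WrongShard:
            # Self-healing loop: invalidate the cache, re-fetch, retry once.
            self.map_version, self.node = self.cp.fetch_map()
            return self.node.handle(request, self.map_version)
```

&lt;p&gt;The router never blocks on the control plane in the happy path; it pays the refresh cost only when a request actually hits a stale route.&lt;/p&gt;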




&lt;h2&gt;
  
  
  4. At a Glance: Metadata vs. Data
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature / Metric&lt;/th&gt;
&lt;th&gt;Metadata (Control Plane)&lt;/th&gt;
&lt;th&gt;Data (Data Plane)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Goal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Consistency (&lt;strong&gt;CP&lt;/strong&gt;)&lt;/td&gt;
&lt;td&gt;Availability (&lt;strong&gt;AP&lt;/strong&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scaling Dimension&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Coordination Overhead&lt;/td&gt;
&lt;td&gt;Throughput / Volume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access Pattern&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extremely High Frequency&lt;/td&gt;
&lt;td&gt;High Volume, Lower Frequency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Engine&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;etcd&lt;/code&gt;, &lt;code&gt;Zookeeper&lt;/code&gt;, &lt;code&gt;FoundationDB&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;S3, RocksDB, Cassandra&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure Result&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Global "Read-Only" or Outage&lt;/td&gt;
&lt;td&gt;Partial unavailability (Degraded)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Core Takeaway
&lt;/h2&gt;

&lt;p&gt;When designing multi-region systems, the primary question is: &lt;strong&gt;"Am I hitting the Metadata Wall?"&lt;/strong&gt; If coordination overhead is outgrowing data throughput, you don't need faster disks. You need to shard your control plane, push your routing tables to the edge, and use &lt;strong&gt;Hybrid Logical Clocks (HLCs)&lt;/strong&gt; to order events without waiting for global locks.&lt;/p&gt;




&lt;p&gt;Originally posted on: &lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://pomogomo.com/scaling_control_plane_metadata" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;pomogomo.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
      <category>distributedsystems</category>
      <category>cloud</category>
      <category>cloudstorage</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Distributed Systems - Algebraic Types for Better State Modeling</title>
      <dc:creator>Santosh Koti</dc:creator>
      <pubDate>Wed, 01 Apr 2026 18:44:24 +0000</pubDate>
      <link>https://dev.to/santosh_koti/distributed-systems-algebraic-types-for-better-state-modeling-7bd</link>
      <guid>https://dev.to/santosh_koti/distributed-systems-algebraic-types-for-better-state-modeling-7bd</guid>
      <description>&lt;p&gt;Distributed systems are notoriously hard. They force us to deal with constant uncertainty: machines fail, networks drop messages, requests time out, and nodes unexpectedly change roles.&lt;/p&gt;

&lt;p&gt;Because of this, the real challenge in distributed architecture is not just writing business logic—it is &lt;strong&gt;modeling state correctly&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example, a node in a cluster might be a leader, a follower, or in recovery. If we model these states poorly, subtle correctness bugs can appear quickly. This is exactly where &lt;strong&gt;algebraic types&lt;/strong&gt; shine.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are Algebraic Types?
&lt;/h2&gt;

&lt;p&gt;The term sounds academic, but the idea is highly practical: algebraic types give us a structured way to model data.&lt;/p&gt;

&lt;p&gt;In everyday programming, this usually comes down to two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Product types (structs):&lt;/strong&gt; data that belongs together&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sum types (enums):&lt;/strong&gt; values that can be &lt;em&gt;one&lt;/em&gt; of several valid states&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Does This Matter in Distributed Systems?
&lt;/h2&gt;

&lt;p&gt;Distributed systems have many valid states, and mixing them up leads to serious correctness problems.&lt;/p&gt;

&lt;p&gt;A node should never accidentally be both &lt;code&gt;Leader&lt;/code&gt; and &lt;code&gt;Recovering&lt;/code&gt; at the same time. If we model this with loose booleans like &lt;code&gt;is_leader = true&lt;/code&gt; and &lt;code&gt;is_recovering = true&lt;/code&gt;, invalid combinations can slip through. With a sum type like an enum, the system is forced to choose exactly one valid state.&lt;/p&gt;

&lt;p&gt;That makes the code easier to reason about and harder to misuse.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Example in Rust
&lt;/h2&gt;

&lt;p&gt;Here is how that looks in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// A product type: grouping related data&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;term&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// A sum type: defining mutually exclusive states&lt;/span&gt;
&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Leader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Follower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Recovering&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// match forces us to handle every possible Role&lt;/span&gt;
    &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="py"&gt;.role&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nn"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Leader&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Node {} is the leader"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="py"&gt;.id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nn"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Follower&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Node {} is a follower"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="py"&gt;.id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nn"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Recovering&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Node {} is recovering"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="py"&gt;.id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This example is small, but it shows real architectural value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;States are explicit:&lt;/strong&gt; there is no guessing what a node is doing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invalid combinations are impossible:&lt;/strong&gt; a node cannot have two roles at once&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The compiler helps enforce correctness:&lt;/strong&gt; every valid case must be handled&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Algebraic types matter in distributed systems because they help us model reality more clearly. In an environment that is already messy, clear state modeling is a foundation of correctness.&lt;/p&gt;

&lt;p&gt;Simply put: &lt;strong&gt;distributed systems are messy enough—your types should make your code clearer, not messier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This post was originally published on:&lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://pomogomo.com/distributed_systems_algebraic_types" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;pomogomo.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>distributedsystems</category>
      <category>rust</category>
      <category>statemachine</category>
    </item>
    <item>
      <title>Distributed Systems - Quorum vs. Raft vs. 2PC</title>
      <dc:creator>Santosh Koti</dc:creator>
      <pubDate>Wed, 01 Apr 2026 18:24:11 +0000</pubDate>
      <link>https://dev.to/santosh_koti/distributed-systems-quorum-vs-raft-vs-2pc-1847</link>
      <guid>https://dev.to/santosh_koti/distributed-systems-quorum-vs-raft-vs-2pc-1847</guid>
      <description>&lt;h2&gt;
  
  
  Quorum vs. Raft: The Hierarchy of Distributed Systems
&lt;/h2&gt;

&lt;p&gt;In distributed systems, we often confuse "how we store data" with "how we govern it." To build reliable systems, we must distinguish between a mathematical property (&lt;strong&gt;Quorum&lt;/strong&gt;), an orchestration protocol (&lt;strong&gt;Raft&lt;/strong&gt;), and a logical contract (&lt;strong&gt;ACID&lt;/strong&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Functional Split
&lt;/h2&gt;

&lt;p&gt;The primary difference lies in which "Plane" the logic occupies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quorum (The Data Plane Rule):&lt;/strong&gt; This is a mathematical requirement for &lt;strong&gt;Durability&lt;/strong&gt;. It defines how many nodes must acknowledge a piece of data to ensure it isn't lost ($W + R &amp;gt; N$). It is "blind"—it doesn't care who sends the data, only that it is stored safely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raft (The Full System Protocol):&lt;/strong&gt; Raft provides &lt;strong&gt;Ordering&lt;/strong&gt;. It tightly couples the &lt;strong&gt;Control Plane&lt;/strong&gt; (Who is the leader?) with the &lt;strong&gt;Data Plane&lt;/strong&gt; (Replicating the log). It ensures that every node sees the exact same sequence of events.&lt;/li&gt;
&lt;/ul&gt;
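&lt;p&gt;The quorum rule really is just arithmetic. A one-line sketch (the function name is mine, not from any library):&lt;/p&gt;

```python
def quorum_intersects(n: int, w: int, r: int) -> bool:
    """True if every read quorum overlaps every write quorum (W + R > N),
    so any read is guaranteed to see the latest acknowledged write."""
    return w + r > n

# Classic majority quorums on a 5-node cluster:
print(quorum_intersects(n=5, w=3, r=3))   # → True
# Sloppy settings that can serve stale reads:
print(quorum_intersects(n=5, w=2, r=2))   # → False
```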

&lt;h2&gt;
  
  
  2. The "Fencing" Problem
&lt;/h2&gt;

&lt;p&gt;These approaches differ most during a network failure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raft (Built-in Protection):&lt;/strong&gt; Because Raft handles leadership and data, it has an internal &lt;strong&gt;fencing mechanism&lt;/strong&gt;. If a leader is partitioned, the protocol's "Terms" prevent it from committing more data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quorum (External Dependency):&lt;/strong&gt; In a &lt;strong&gt;pure Quorum system&lt;/strong&gt; (like Amazon Aurora storage), the storage doesn't know who the leader is. You need an &lt;strong&gt;external Control Plane&lt;/strong&gt; (AWS, Kubernetes, or a human) to "fence" or kill an old writer before they accidentally overwrite new data.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. The Hierarchy of Guarantees
&lt;/h2&gt;

&lt;p&gt;We can visualize these concepts as a stack. Each layer solves a progressively harder problem:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Primary Guarantee&lt;/th&gt;
&lt;th&gt;What it Solves&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ACID (The Peak)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Correctness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Logical validity (e.g., a bank transfer debits one account and credits another atomically, never half-done).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consensus (The Middle)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Ordering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agreement on a single, linear sequence of events.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quorum (The Base)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Durability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Survival of data even if a minority of nodes fail.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  4. Scaling: Partitioned Consensus
&lt;/h2&gt;

&lt;p&gt;A single Raft group is a bottleneck. Modern systems like &lt;strong&gt;CockroachDB&lt;/strong&gt;, &lt;strong&gt;TiDB&lt;/strong&gt;, and &lt;strong&gt;Kafka&lt;/strong&gt; scale by breaking data into shards, where each shard runs its own independent consensus.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Raft:&lt;/strong&gt; Each shard of data (historically ~64MB ranges in systems like CockroachDB) is its own "City-State" with its own leader. This minimizes the "blast radius" of a failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two-Phase Commit (2PC):&lt;/strong&gt; When a transaction touches multiple shards, we use 2PC as the "Diplomatic Treaty." It ensures that Shard A and Shard B either commit together or fail together, providing &lt;strong&gt;Distributed ACID&lt;/strong&gt; on top of the Raft layer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5. Case Study: Kafka’s Hybrid Approach
&lt;/h2&gt;

&lt;p&gt;Kafka (KRaft mode) uses a clever "Brain vs. Muscle" distinction:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;The Metadata Quorum (The Brain):&lt;/strong&gt; A single Raft group for high-level cluster state.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Data Partitions (The Muscle):&lt;/strong&gt; A leaner protocol called &lt;strong&gt;ISR (In-Sync Replicas)&lt;/strong&gt; for the messages. This keeps the data plane fast by avoiding the "chatter" of thousands of independent Raft heartbeats.&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Summary Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Control Plane (The Brain)&lt;/th&gt;
&lt;th&gt;Data Plane (The Muscle)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CockroachDB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-Raft (Distributed)&lt;/td&gt;
&lt;td&gt;Multi-Raft (Distributed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kafka (KRaft)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Raft Quorum&lt;/td&gt;
&lt;td&gt;ISR Replication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS Aurora&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;External Cluster Manager&lt;/td&gt;
&lt;td&gt;Storage Quorum ($W + R &amp;gt; N$)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Bottom Line: The Franchise Model
&lt;/h3&gt;

&lt;p&gt;Think of a global business:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quorum&lt;/strong&gt; is the requirement for a sale (Customer pays + Register logs it).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raft&lt;/strong&gt; is the Store Manager. They ensure the local shop is running in order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitioning&lt;/strong&gt; is opening 1,000 different stores so no one shop is too crowded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACID&lt;/strong&gt; is Corporate HQ ensuring a "Gift Card" used in Store A is correctly deducted and accepted in Store B.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quorum&lt;/strong&gt; gives you durability, &lt;strong&gt;Consensus&lt;/strong&gt; gives you ordering, and &lt;strong&gt;ACID&lt;/strong&gt; gives you correctness.&lt;/p&gt;

&lt;p&gt;This post was originally published on: &lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://pomogomo.com/distributed_systems_quorum_raft" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;pomogomo.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
      <category>distributedsystems</category>
      <category>database</category>
      <category>computerscience</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Distributed Systems - Lamport Clock vs Hybrid Logical Clocks</title>
      <dc:creator>Santosh Koti</dc:creator>
      <pubDate>Wed, 01 Apr 2026 18:06:10 +0000</pubDate>
      <link>https://dev.to/santosh_koti/distributed-systems-lamport-clock-vs-hybrid-logical-clocks-aj1</link>
      <guid>https://dev.to/santosh_koti/distributed-systems-lamport-clock-vs-hybrid-logical-clocks-aj1</guid>
      <description>&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;In a distributed system — especially one spanning &lt;strong&gt;multiple regions and datacenters&lt;/strong&gt; — there's no single global clock. A node in Virginia and a node in Frankfurt each have their own clock, and they drift by milliseconds or more. Yet we need two guarantees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Causal consistency&lt;/strong&gt; — if event A caused event B, every node in every datacenter must see A before B. A user in Tokyo shouldn't see a reply before the original message, just because their nearest replica processed the reply first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total ordering&lt;/strong&gt; — even for &lt;em&gt;unrelated&lt;/em&gt; events happening simultaneously in different regions, we need a deterministic way to put them in a single, agreed-upon order. Every node across every datacenter must agree, and they need to do it &lt;strong&gt;without a central coordinator&lt;/strong&gt; — fully decentralized.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tricky part? Neither Lamport clocks nor HLCs can actually &lt;em&gt;detect&lt;/em&gt; concurrent events (Vector Clocks do that). But they don't need to — they solve a different problem. They assign every event a unique, sortable timestamp so you get a &lt;strong&gt;total order that respects causality&lt;/strong&gt;, even across nodes in different continents that never talk to each other directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lamport Clocks — Logical Order Only
&lt;/h2&gt;

&lt;p&gt;A Lamport clock is just an integer counter. Two rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Before any local event or sending a message, increment your counter.&lt;/li&gt;
&lt;li&gt;When receiving a message, set your counter to &lt;code&gt;max(yours, theirs) + 1&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
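
&lt;p&gt;The two rules above fit in a few lines of Python. This is a minimal sketch — the class and method names are illustrative, not from any particular library:&lt;/p&gt;

```python
class LamportClock:
    """Minimal Lamport clock: an integer counter plus a node ID for tie-breaks."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        # Rule 1: increment before any local event or send.
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, remote_counter):
        # Rule 2: jump past everything the sender has seen.
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)
```

&lt;p&gt;Replaying the example below: Virginia ticks to 1 and sends; Frankfurt, already at 3, receives and lands at &lt;code&gt;max(3, 1) + 1 = 4&lt;/code&gt;.&lt;/p&gt;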

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A node in Virginia (clock=1) sends a write to a node in Frankfurt (clock=3). Frankfurt sets its clock to &lt;code&gt;max(3, 1) + 1 = 4&lt;/code&gt;. Now we know Frankfurt's event at 4 &lt;em&gt;happened after&lt;/em&gt; Virginia's event at 1 — even though these nodes are 6,000 km apart and their wall clocks disagree.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tie-breaking for total order:&lt;/strong&gt; When two events across different datacenters have the same Lamport timestamp, they're concurrent — but we still need to order them. The standard trick is to &lt;strong&gt;append the node ID&lt;/strong&gt;: &lt;code&gt;(timestamp, node_id)&lt;/code&gt;. So &lt;code&gt;(5, frankfurt-2)&lt;/code&gt; always comes before &lt;code&gt;(5, virginia-1)&lt;/code&gt;, because node IDs compare lexicographically. Arbitrary, but deterministic and consistent across all regions.&lt;/p&gt;
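
&lt;p&gt;Because the timestamps are &lt;code&gt;(counter, node_id)&lt;/code&gt; pairs, plain tuple sorting produces the total order directly. A quick sketch with hypothetical node IDs:&lt;/p&gt;

```python
# Tuples compare element by element: counter first, then node ID as the
# deterministic tie-break for concurrent events.
events = [(5, "node-b"), (4, "node-b"), (5, "node-a")]
total_order = sorted(events)

# Equal counters fall back to lexicographic node-ID order.
assert total_order == [(4, "node-b"), (5, "node-a"), (5, "node-b")]
```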

&lt;p&gt;&lt;strong&gt;The catch:&lt;/strong&gt; The timestamp "5" tells you nothing about wall-clock time. A user debugging a cross-region issue can't look at Lamport timestamps and say "this happened at 2:03 PM." You get ordering, not timing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hybrid Logical Clocks (HLC) — Best of Both Worlds
&lt;/h2&gt;

&lt;p&gt;An HLC is a tuple: &lt;strong&gt;(physical_time, logical_counter, node_id)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It tracks causality like a Lamport clock &lt;em&gt;and&lt;/em&gt; stays close to real wall-clock time — critical when your nodes are spread across regions with varying NTP sync quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On a local event: use &lt;code&gt;max(your_physical, wall_clock)&lt;/code&gt;. If the physical part didn't advance, bump the logical counter. Otherwise reset it to 0.&lt;/li&gt;
&lt;li&gt;On receiving a message from another region: take &lt;code&gt;max(your_physical, their_physical, wall_clock)&lt;/code&gt;. If your own physical time supplied the max, bump your logical counter; if the sender's did, continue past theirs (&lt;code&gt;their_counter + 1&lt;/code&gt;); if the wall clock alone wins, reset the counter to 0.&lt;/li&gt;
&lt;/ol&gt;
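
&lt;p&gt;The update rules can be sketched in Python. This follows the standard HLC formulation, which splits the receive case by which component supplied the max; the class and method names are illustrative, and the wall clock is injected as a callable so the example stays deterministic:&lt;/p&gt;

```python
class HybridLogicalClock:
    """Minimal HLC sketch: (physical, logical, node_id) that never moves backward."""

    def __init__(self, node_id, wall_clock):
        self.node_id = node_id
        self.wall_clock = wall_clock  # callable; a real system reads system time
        self.physical = 0
        self.logical = 0

    def now(self):
        # Local event or send: the physical part never goes backward.
        new_physical = max(self.physical, self.wall_clock())
        if new_physical == self.physical:
            self.logical += 1          # physical part stalled: bump the counter
        else:
            self.physical = new_physical
            self.logical = 0           # physical part advanced: reset the counter
        return (self.physical, self.logical, self.node_id)

    def receive(self, remote_physical, remote_logical):
        new_physical = max(self.physical, remote_physical, self.wall_clock())
        if new_physical == self.physical and new_physical == remote_physical:
            self.logical = max(self.logical, remote_logical) + 1
        elif new_physical == self.physical:
            self.logical += 1                  # our physical won: continue ours
        elif new_physical == remote_physical:
            self.logical = remote_logical + 1  # sender's physical won: go past theirs
        else:
            self.logical = 0                   # wall clock alone won: counter resets
        self.physical = new_physical
        return (self.physical, self.logical, self.node_id)
```

&lt;p&gt;Replaying the example below with millisecond integers: Virginia at wall time 1000 stamps &lt;code&gt;(1000, 0)&lt;/code&gt;; Singapore, whose wall clock reads only 500, receives it and stamps &lt;code&gt;(1000, 1)&lt;/code&gt; — the higher physical time wins and the counter bumps past the sender's.&lt;/p&gt;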

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A node in Virginia sends at physical time &lt;code&gt;10:00:01.000&lt;/code&gt;, HLC = &lt;code&gt;(10:00:01.000, 0, virginia-1)&lt;/code&gt;. A node in Singapore receives it, but Singapore's wall clock reads &lt;code&gt;10:00:00.500&lt;/code&gt; (behind due to NTP drift). Instead of going backward, Singapore's HLC becomes &lt;code&gt;(10:00:01.000, 1, singapore-1)&lt;/code&gt; — it takes the higher physical time and bumps the counter. Causality preserved, clock never goes backward, and the timestamp still reflects roughly &lt;em&gt;when&lt;/em&gt; it happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HLC tie-breaking (three levels):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compare &lt;strong&gt;physical timestamps&lt;/strong&gt; — the later wall-clock time wins.&lt;/li&gt;
&lt;li&gt;If physical times are equal, compare &lt;strong&gt;logical counters&lt;/strong&gt; — higher counter means it saw more causal history at that timestamp.&lt;/li&gt;
&lt;li&gt;If both are equal, compare &lt;strong&gt;node IDs&lt;/strong&gt; — deterministic, arbitrary, but guarantees a total order even for truly simultaneous events in different datacenters.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This three-part comparison means every event globally — across every region — gets a unique, sortable position. No ambiguity, no central coordinator needed.&lt;/p&gt;
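
&lt;p&gt;Since an HLC timestamp is a &lt;code&gt;(physical, logical, node_id)&lt;/code&gt; tuple, tuple comparison applies the three levels in exactly that order. A quick sketch with hypothetical values:&lt;/p&gt;

```python
# Tuple comparison checks physical time, then the logical counter,
# then the node ID, matching the three tie-breaking levels.
events = [
    (1000, 1, "singapore-1"),
    (999, 5, "frankfurt-2"),
    (1000, 0, "virginia-1"),
    (1000, 1, "sydney-1"),
]
total_order = sorted(events)
assert total_order == [
    (999, 5, "frankfurt-2"),    # earliest physical time, regardless of counter
    (1000, 0, "virginia-1"),    # same physical time: lower counter first
    (1000, 1, "singapore-1"),   # same physical and counter: node ID decides
    (1000, 1, "sydney-1"),
]
```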

&lt;p&gt;&lt;strong&gt;Why production systems love this:&lt;/strong&gt; CockroachDB and YugabyteDB use HLCs across their geo-distributed clusters because they need causal ordering &lt;em&gt;without&lt;/em&gt; sacrificing the ability to reason about "when" something actually happened. When a user in London writes data replicated to nodes in São Paulo and Sydney, every replica agrees on the order — and an operator can still look at the timestamps and map them to real-world time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Lamport Clock&lt;/th&gt;
&lt;th&gt;Hybrid Clock&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tracks causality?&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detects concurrency?&lt;/td&gt;
&lt;td&gt;No (use Vector Clocks)&lt;/td&gt;
&lt;td&gt;No (use Vector Clocks)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reflects real time?&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Approximately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tie-breaker&lt;/td&gt;
&lt;td&gt;Node ID&lt;/td&gt;
&lt;td&gt;Physical time → counter → node ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total order?&lt;/td&gt;
&lt;td&gt;Yes (with node ID)&lt;/td&gt;
&lt;td&gt;Yes (built-in)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Centralized?&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works cross-region?&lt;/td&gt;
&lt;td&gt;Yes, but no time context&lt;/td&gt;
&lt;td&gt;Yes, with real-time context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size&lt;/td&gt;
&lt;td&gt;Integer + node ID&lt;/td&gt;
&lt;td&gt;Timestamp + counter + node ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use case&lt;/td&gt;
&lt;td&gt;Theory, simple systems&lt;/td&gt;
&lt;td&gt;Production geo-distributed databases&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Lamport clocks tell you &lt;em&gt;what depends on what&lt;/em&gt;. Hybrid clocks tell you that &lt;strong&gt;and&lt;/strong&gt; roughly &lt;em&gt;when&lt;/em&gt;. Neither detects concurrency — but both give you a fully decentralized total order that never violates causality, even across datacenters on opposite sides of the planet.&lt;/p&gt;

&lt;p&gt;This post was originally published on &lt;a href="https://pomogomo.com/distributed_systems_logical_clocks" rel="noopener noreferrer"&gt;pomogomo.com&lt;/a&gt;.&lt;/p&gt;


</description>
      <category>distributedsystems</category>
      <category>computerscience</category>
      <category>database</category>
      <category>cockroach</category>
    </item>
  </channel>
</rss>
