(Part of the Distributed Systems & Networking Series. If you missed the first chapter, check it out here:
👉 Distributed Systems & Networking: From 0️⃣ to 🦸)
Hey everyone! 👋
Welcome to one of the most comprehensive, practical, and deeply technical guides you’ll ever read on Distributed Systems, Networking, Consensus, Scheduling, Observability, and High‑Performance Datacenter Architecture.
If you’ve ever wondered:
- How do real distributed systems stay consistent?
- How do schedulers like Kubernetes or Slurm decide where jobs run?
- How do datacenters achieve microsecond latency?
- How do we observe, debug, and stress‑test massive clusters?
- How do protocols like RDMA, NVMe‑oF, MPI, ZeroMQ, Raft, and Paxos actually work?
- How do cloud providers like AWS, Azure, and Google Cloud design their internal systems?
…then buckle up — this guide takes you from hero 🦸 to king 🫅 of distributed systems.
Let’s dive in.
1. Foundations: Latency, Throughput, and System Architecture
Distributed systems are shaped by two fundamental forces:
- Latency — how long it takes for a message to travel
- Throughput — how much data you can push per second
Different workloads stress these differently:
| Workload | Latency sensitivity | Throughput demand |
|---|---|---|
| HPC (MPI) | Extremely high | Medium |
| Storage (NVMe‑oF) | High | Extremely high |
| Microservices | Medium | Low |
| Batch jobs | Low | Medium |
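To make the latency/throughput split concrete, here is a minimal sketch of the usual first-order model: transfer time is a fixed latency plus the time to push the bytes through the link. The function name and the link numbers (2 µs, 100 Gbps) are assumptions chosen for illustration, not measurements.

```python
def transfer_time_us(size_bytes: int, latency_us: float, gbps: float) -> float:
    """One-way transfer time = fixed latency + serialization time (size / bandwidth)."""
    bits = size_bytes * 8
    serialization_us = bits / (gbps * 1e9) * 1e6
    return latency_us + serialization_us

# Assumed link: 2 us latency, 100 Gbps bandwidth
small = transfer_time_us(64, 2.0, 100.0)        # a 64-byte message
large = transfer_time_us(1 << 30, 2.0, 100.0)   # a 1 GiB transfer

print(f"64 B:  {small:.3f} us")
print(f"1 GiB: {large / 1000:.1f} ms")
```

Under these assumptions the 64-byte message takes about 2.005 µs, almost all of it latency, while the 1 GiB transfer takes about 85.9 ms, almost all of it bandwidth. That is why small-message workloads like MPI care about latency and bulk storage traffic cares about throughput.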
A modern datacenter behaves like a single distributed supercomputer, and your job is to make communication:
✅ fast
✅ predictable
✅ scalable
✅ observable
✅ secure
2. RDMA: Remote Direct Memory Access
RDMA lets one machine read or write memory on another machine directly. For one-sided operations (RDMA read and RDMA write), the remote NIC serves the request without involving the remote CPU.
This gives:
- Ultra‑low latency (1–2 µs)
- Extremely high throughput (100–400 Gbps)
- Minimal CPU overhead
2.1 RDMA Architecture
+---------------------------+
| Application |
+---------------------------+
| RDMA Verbs / RDMA CM |
+---------------------------+
| RNIC (RDMA NIC) |
+---------------------------+
| InfiniBand / RoCE Fabric |
+---------------------------+
2.2 Core RDMA Concepts
- QP (Queue Pair) — send/recv queues
- CQ (Completion Queue) — event notifications
- MR (Memory Region) — pinned memory
- PD (Protection Domain) — isolation
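To see how these pieces fit together, here is a toy, in-process model in Python. It is not the verbs API and it moves no real data over a network: `ToyNIC`, `MemoryRegion`, and `QueuePair` are hypothetical names invented for this sketch, and the protection domain is left out to keep it short. It only shows the flow: register memory with a key, post a work request to a queue, and read the result from a completion queue.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class MemoryRegion:
    buf: bytearray
    rkey: int            # key a remote peer must present to touch this region
    remote_write: bool   # access flag chosen at registration time

@dataclass
class ToyNIC:
    """Stands in for the remote RNIC: it serves writes on its own, no host code runs."""
    regions: dict = field(default_factory=dict)   # rkey -> MemoryRegion

    def register(self, mr: MemoryRegion) -> None:
        self.regions[mr.rkey] = mr

    def rdma_write(self, rkey: int, offset: int, data: bytes) -> bool:
        mr = self.regions.get(rkey)
        if mr is None or not mr.remote_write:
            return False                           # unknown key or no permission
        if offset < 0 or offset + len(data) > len(mr.buf):
            return False                           # out of bounds
        mr.buf[offset:offset + len(data)] = data
        return True

@dataclass
class QueuePair:
    remote: ToyNIC
    send_queue: deque = field(default_factory=deque)
    completion_queue: deque = field(default_factory=deque)

    def post_write(self, wr_id: int, rkey: int, offset: int, data: bytes) -> None:
        # Posting only enqueues a work request; nothing is transferred yet
        self.send_queue.append((wr_id, rkey, offset, data))

    def process(self) -> None:
        # Drain the send queue and report one completion per work request
        while self.send_queue:
            wr_id, rkey, offset, data = self.send_queue.popleft()
            ok = self.remote.rdma_write(rkey, offset, data)
            status = "success" if ok else "remote_access_error"
            self.completion_queue.append((wr_id, status))

nic = ToyNIC()
mr = MemoryRegion(bytearray(16), rkey=0x1234, remote_write=True)
nic.register(mr)

qp = QueuePair(nic)
qp.post_write(1, 0x1234, 0, b"hello")   # valid key
qp.post_write(2, 0xBAD, 0, b"nope")     # wrong key, should be refused
qp.process()
print(list(qp.completion_queue), bytes(mr.buf[:5]))
```

The second write fails because its key matches no registered region, which mirrors why real RDMA peers must exchange the remote key and address before any one-sided operation.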
2.3 RDMA Example (Corrected C Code)
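```c
#include <infiniband/verbs.h>
#include <stdlib.h>

/* Fragment: runs inside a function where `dev` (struct ibv_device *, e.g. from
   ibv_get_device_list) and `size` (size_t, a multiple of 4096) are already defined.
   Error handling is omitted; each call below returns NULL on failure. */
struct ibv_context *ctx = ibv_open_device(dev);
struct ibv_pd *pd = ibv_alloc_pd(ctx);
struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

/* Page-aligned buffer, registered (pinned) for local and remote access */
char *buf = aligned_alloc(4096, size);
struct ibv_mr *mr = ibv_reg_mr(
    pd, buf, size,
    IBV_ACCESS_LOCAL_WRITE |
    IBV_ACCESS_REMOTE_WRITE |
    IBV_ACCESS_REMOTE_READ
);

/* Reliable Connected queue pair; send and receive share one CQ */
struct ibv_qp_init_attr qp_init = {
    .send_cq = cq,
    .recv_cq = cq,
    .cap = {
        .max_send_wr = 16,
        .max_recv_wr = 16,
        .max_send_sge = 1,
        .max_recv_sge = 1
    },
    .qp_type = IBV_QPT_RC
};
struct ibv_qp *qp = ibv_create_qp(pd, &qp_init);
```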
3. NVMe over Fabrics (NVMe‑oF)
NVMe‑oF exposes remote NVMe devices over a network fabric (RDMA, TCP, or Fibre Channel) with performance close to that of a locally attached drive.
3.1 Connect to NVMe‑oF Target
```bash
nvme discover -t rdma -a <target-ip> -s 4420
nvme connect -t rdma -n <nqn> -a <target-ip> -s 4420
```
3.2 Benchmark
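```bash
# WARNING: randrw writes to the raw device and destroys any data on it
fio --name=randrw --filename=/dev/nvme1n1 \
    --ioengine=libaio --direct=1 --rw=randrw --bs=4k \
    --iodepth=64 --numjobs=4 --runtime=60 --time_based --group_reporting
```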
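A quick way to sanity-check the fio numbers is Little's law: sustained IOPS is roughly the number of outstanding I/Os divided by the average completion latency. The sketch below applies it to the `--iodepth=64 --numjobs=4 --bs=4k` settings from the command above. The 100 µs latency is an assumed figure for illustration, not a measurement, and the function names are made up for this example.

```python
def expected_iops(iodepth: int, numjobs: int, avg_latency_us: float) -> float:
    """Little's law: throughput = outstanding requests / average latency."""
    outstanding = iodepth * numjobs
    return outstanding / (avg_latency_us * 1e-6)

def bandwidth_mib_s(iops: float, block_size_bytes: int) -> float:
    """Convert IOPS at a fixed block size into MiB/s."""
    return iops * block_size_bytes / (1024 * 1024)

# 64 x 4 = 256 I/Os in flight, assumed 100 us average completion latency
iops = expected_iops(64, 4, 100.0)
bw = bandwidth_mib_s(iops, 4096)
print(f"{iops:,.0f} IOPS, {bw:,.0f} MiB/s")
```

With these assumptions the model gives about 2.56 million IOPS, or roughly 10,000 MiB/s at 4 KiB blocks. Treat that as a ceiling: if fio reports far less, either the device or fabric is saturated or the real latency is higher than assumed.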
4. Cluster Scheduling: Kubernetes, Slurm, and Custom Schedulers
Schedulers decide:
- Which job runs where
- How resources are allocated
- How to maximize utilization
4.1 Job Model (C++)
```cpp
struct Job {
    int id;
    int cpu_req;
    int mem_req;
    int gpu_req;
};
```
4.2 Node Model (C++)
```cpp
struct Resources {
    int cpu;
    int mem;
    int gpu;
};

struct Node {
    int id;
    Resources total, used;

    bool canRun(const Job& j) const {
        return used.cpu + j.cpu_req <= total.cpu &&
               used.mem + j.mem_req <= total.mem &&
               used.gpu + j.gpu_req <= total.gpu;
    }

    void assign(const Job& j) {
        used.cpu += j.cpu_req;
        used.mem += j.mem_req;
        used.gpu += j.gpu_req;
    }
};
```
4.3 Simple Scheduler (C++)
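```cpp
#include <vector>

// First-fit: place the job on the first node with enough free resources.
// Returns the chosen node id, or -1 if no node can run the job.
int schedule(const Job& j, std::vector<Node>& nodes) {
    for (auto& n : nodes) {
        if (n.canRun(j)) {
            n.assign(j);
            return n.id;
        }
    }
    return -1;
}
```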
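The loop above is a first-fit policy: it stops at the first node with room. One common alternative for improving utilization is best-fit, which picks the feasible node that would have the least capacity left over. Here is a minimal Python sketch of that idea. The dict-based resource layout, `schedule_best_fit`, and the CPU-then-memory tie-break are assumptions made for this example, not how Kubernetes or Slurm score nodes.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: int
    total: dict   # e.g. {"cpu": 8, "mem": 32, "gpu": 0}
    used: dict = field(default_factory=lambda: {"cpu": 0, "mem": 0, "gpu": 0})

    def free(self, resource: str) -> int:
        return self.total[resource] - self.used[resource]

    def can_run(self, req: dict) -> bool:
        return all(self.free(r) >= req[r] for r in req)

    def assign(self, req: dict) -> None:
        for r in req:
            self.used[r] += req[r]

def schedule_best_fit(req: dict, nodes: list) -> int:
    """Pick the feasible node with the least leftover CPU, then memory; -1 if none fits."""
    candidates = [n for n in nodes if n.can_run(req)]
    if not candidates:
        return -1
    best = min(
        candidates,
        key=lambda n: (n.free("cpu") - req["cpu"], n.free("mem") - req["mem"]),
    )
    best.assign(req)
    return best.id

nodes = [
    Node(0, {"cpu": 16, "mem": 64, "gpu": 0}),
    Node(1, {"cpu": 4, "mem": 16, "gpu": 0}),
]
job = {"cpu": 4, "mem": 8, "gpu": 0}

first = schedule_best_fit(job, nodes)    # fills the small node exactly
second = schedule_best_fit(job, nodes)   # small node is full, falls back to the big one
third = schedule_best_fit({"cpu": 32, "mem": 8, "gpu": 0}, nodes)  # fits nowhere
print(first, second, third)
```

First-fit would have put the first job on node 0 and fragmented the large node. Best-fit keeps the large node free for large jobs, at the cost of scanning every node per decision.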
5. Messaging: MPI & ZeroMQ
5.1 MPI Example
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    printf("Hello from rank %d/%d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```
5.2 ZeroMQ PUB/SUB
Publisher
```python
import zmq, time

ctx = zmq.Context()
sock = ctx.socket(zmq.PUB)
sock.bind("tcp://*:5556")

while True:
    sock.send_string("topic1 Hello subscribers")
    time.sleep(1)
```
Subscriber
```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.SUB)
sock.connect("tcp://localhost:5556")
sock.setsockopt_string(zmq.SUBSCRIBE, "topic1")

while True:
    print(sock.recv_string())
```
6. Distributed Consensus: Raft, Paxos, and Beyond
Consensus ensures:
✅ consistency
✅ fault tolerance
✅ progress
6.1 Raft — Leader Election & Log Replication
Leader Election
- Followers wait for heartbeat
- Timeout → become candidate
- Request votes
- Majority → leader
Log Replication
- Leader appends entry
- Sends AppendEntries
- Majority ack → commit
Raft Pseudocode
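```text
on election timeout:
    increment term, become candidate
    vote for self, request votes
on votes from a majority:
    become leader, send heartbeats
on client write (leader only):
    append entry to local log
    replicate to followers via AppendEntries
    commit once a majority acknowledges
```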
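The two "majority" checks in the pseudocode above can be sketched in a few lines of Python. This is a simplified illustration, not a Raft implementation: the function names are made up, there is no networking or persistence, and `commit_index` skips the rule in the Raft paper that a leader only commits entries from its own term by counting replicas.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1

def wins_election(votes_granted: int, cluster_size: int) -> bool:
    """A candidate becomes leader once a majority (including itself) voted for it."""
    return votes_granted >= majority(cluster_size)

def commit_index(leader_last_index: int, match_index: list) -> int:
    """Highest log index stored on a majority of nodes.

    match_index holds, per follower, the last index known to be replicated there.
    The leader counts its own log as one of the replicas.
    """
    indexes = sorted(match_index + [leader_last_index], reverse=True)
    return indexes[majority(len(indexes)) - 1]

# 5-node cluster: leader at index 7, followers at 7, 5, 3, 2
print(wins_election(3, 5))               # 3 of 5 votes is a majority
print(commit_index(7, [7, 5, 3, 2]))     # index 5 is on three nodes
```

In the example, index 5 is the highest entry present on three of five nodes (the leader and two followers), so that is as far as the leader may commit.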
6.2 Paxos — Prepare / Accept
Basic Flow
Proposer → Prepare(n)
Acceptor → Promise
Proposer → Accept(n, value)
Acceptor → Accepted
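The four messages above can be modeled as a small in-process sketch of single-decree Paxos. Everything here is a simplification for illustration: the `Acceptor` class, the `propose` helper, and the tuple replies are invented for this example, all calls are synchronous, and there are no failures, retries, or learners.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised so far
        self.accepted_n = -1        # number of the proposal accepted, if any
        self.accepted_value = None

    def on_prepare(self, n):
        """Prepare(n) -> Promise: agree to ignore proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n, None)

    def on_accept(self, n, value):
        """Accept(n, value) -> Accepted, unless a higher number was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n)
        return ("reject", self.promised_n)

def propose(acceptors, n, value):
    """Run both phases; return the chosen value, or None if no quorum was reached."""
    quorum = len(acceptors) // 2 + 1

    # Phase 1: collect promises
    replies = [a.on_prepare(n) for a in acceptors]
    promises = [r for r in replies if r[0] == "promise"]
    if len(promises) < quorum:
        return None

    # If any acceptor already accepted a value, the proposer must adopt the one
    # with the highest proposal number instead of its own value
    prior = max(promises, key=lambda r: r[1])
    if prior[1] >= 0:
        value = prior[2]

    # Phase 2: ask acceptors to accept
    acks = [r for r in (a.on_accept(n, value) for a in acceptors) if r[0] == "accepted"]
    return value if len(acks) >= quorum else None

acceptors = [Acceptor() for _ in range(3)]
r1 = propose(acceptors, 1, "A")   # first proposal wins with its own value
r2 = propose(acceptors, 2, "B")   # higher number, but must adopt the chosen "A"
r3 = propose(acceptors, 1, "C")   # stale number, rejected in phase 1
print(r1, r2, r3)
```

The second call is the interesting one: even though the proposer wanted "B", it learns from the promises that "A" was already accepted and re-proposes "A". That adoption rule is what keeps a chosen value from ever changing.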
Multi‑Paxos
- One stable leader
- Continuous replication
- Similar to Raft in practice
7. Observability: eBPF, OpenTelemetry, Chaos Engineering
7.1 eBPF — Kernel‑Level Tracing
Trace TCP Connect
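```c
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>

/* Fires on every call to the kernel's tcp_connect() */
SEC("kprobe/tcp_connect")
int bpf_prog(struct pt_regs *ctx) {
    bpf_printk("TCP connect\n");
    return 0;
}
char LICENSE[] SEC("license") = "GPL";  /* bpf_printk is GPL-only */
```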
7.2 OpenTelemetry — Distributed Tracing
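```cpp
// `tracer` is assumed to be an already-initialized OpenTelemetry tracer
auto span = tracer->StartSpan("handle_request");
{
    // The span stays active for everything inside this scope
    auto scope = tracer->WithActiveSpan(span);
    // ... handle the request here ...
}
span->End();
```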
7.3 Chaos Engineering
Inject Latency
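```bash
tc qdisc add dev eth0 root netem delay 200ms   # add 200 ms egress delay (needs root)
tc qdisc del dev eth0 root netem               # remove it after the experiment
```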
8. Final Notes: You’re Now in the Top 1%
You’ve just learned:
✅ RDMA
✅ NVMe‑oF
✅ Scheduling
✅ MPI & ZeroMQ
✅ Consensus (Raft, Paxos)
✅ Observability (eBPF, OTel)
✅ Chaos Engineering
This is the knowledge that powers Google, Meta, AWS, Azure, NVIDIA, and HPC clusters worldwide.
Thanks for Reading! 🙌
If you enjoyed this deep dive:
✅ Follow me for more advanced systems content
✅ Leave a reaction — it helps a lot
✅ Bookmark this post and read it again
✅ Practice the examples
✅ Share it with your team
See you in the next deep dive! 🚀🫅