Javad

Posted on Jan 2

Distributed Systems & Networking: From 🦸 to 🫅 — Make your own AWS!

#tutorial #devops #aws #discuss

(Part of the Distributed Systems & Networking Series. If you missed the first chapter, check it out here:

👉 Distributed Systems & Networking: From 0️⃣ to 🦸)

Hey everyone! 👋

Welcome to one of the most comprehensive, practical, and deeply technical guides you’ll ever read on Distributed Systems, Networking, Consensus, Scheduling, Observability, and High‑Performance Datacenter Architecture.

If you’ve ever wondered:

How do real distributed systems stay consistent?
How do schedulers like Kubernetes or Slurm decide where jobs run?
How do datacenters achieve microsecond latency?
How do we observe, debug, and stress‑test massive clusters?
How do protocols like RDMA, NVMe‑oF, MPI, ZeroMQ, Raft, and Paxos actually work?
How do cloud providers like AWS, Azure, and Google Cloud design their internal systems?

…then buckle up — this guide takes you from hero 🦸 to king 🫅 of distributed systems.

Let’s dive in.

Foundations: Latency, Throughput, and System Architecture

Distributed systems are shaped by two fundamental forces:

Latency — how long it takes for a message to travel
Throughput — how much data you can push per second
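To make these two forces concrete, here is a minimal Python sketch of a first-order cost model: transfer time is roughly a fixed latency plus the payload size divided by bandwidth. The function name `transfer_time_s` and the link numbers (2 µs, 100 Gbit/s) are illustrative assumptions, not measurements.

```python
def transfer_time_s(size_bytes: int, latency_s: float, bandwidth_bps: float) -> float:
    """First-order estimate: fixed latency plus time to push the bits onto the wire."""
    return latency_s + (size_bytes * 8) / bandwidth_bps


# Assumed link: 2 microseconds of latency, 100 Gbit/s of bandwidth.
LATENCY_S = 2e-6
BANDWIDTH_BPS = 100e9

small = transfer_time_s(64, LATENCY_S, BANDWIDTH_BPS)       # 64-byte message
large = transfer_time_s(1 << 30, LATENCY_S, BANDWIDTH_BPS)  # 1 GiB transfer

print(f"64 B message:   {small * 1e6:.3f} us")
print(f"1 GiB transfer: {large * 1e3:.3f} ms")
```

Under these assumptions the small message is almost entirely latency, while the large transfer is almost entirely bandwidth. That is why small-message workloads care about latency and bulk-transfer workloads care about throughput.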

Different workloads stress these differently:

Workload	Latency sensitivity	Throughput demand
HPC (MPI)	Extremely high	Medium
Storage (NVMe‑oF)	High	Extremely high
Microservices	Medium	Low
Batch jobs	Low	Medium

A modern datacenter is a single distributed supercomputer, and your job is to make communication:

✅ fast

✅ predictable

✅ scalable

✅ observable

✅ secure

RDMA — Remote Direct Memory Access

RDMA allows one machine to read/write memory on another machine without involving the remote CPU.

This gives:

Ultra‑low latency (1–2 µs)
Extremely high throughput (100–400 Gbps)
Minimal CPU overhead

2.1 RDMA Architecture

+---------------------------+
|        Application        |
+---------------------------+
|   RDMA Verbs / RDMA CM    |
+---------------------------+
|      RNIC (RDMA NIC)      |
+---------------------------+
| InfiniBand / RoCE Fabric  |
+---------------------------+

2.2 Core RDMA Concepts

QP (Queue Pair) — send/recv queues
CQ (Completion Queue) — event notifications
MR (Memory Region) — pinned memory
PD (Protection Domain) — isolation
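If these four objects feel abstract, a toy in-process model can help. The sketch below is plain Python, not real verbs: the class names, the `post_write` and `process` methods, and the key values are all hypothetical, and real hardware enforces many more rules. It only shows how a work request goes onto a queue, gets checked against the memory region's key and access flag, and ends up as an entry on the completion queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ProtectionDomain:
    name: str


@dataclass
class MemoryRegion:
    pd: ProtectionDomain
    buf: bytearray
    rkey: int                   # key a remote peer must present
    remote_write: bool = False  # access flag chosen at registration time


@dataclass
class QueuePair:
    pd: ProtectionDomain
    send_queue: deque = field(default_factory=deque)
    completion_queue: deque = field(default_factory=deque)

    def post_write(self, wr_id, local, remote, rkey):
        """Queue a one-sided write of local.buf into remote.buf."""
        if local.pd is not self.pd:
            raise PermissionError("local MR belongs to a different PD")
        self.send_queue.append((wr_id, local, remote, rkey))

    def process(self):
        """Stand-in for the RNIC: run queued requests, then report completions."""
        while self.send_queue:
            wr_id, local, remote, rkey = self.send_queue.popleft()
            ok = (remote.remote_write and rkey == remote.rkey
                  and len(local.buf) <= len(remote.buf))
            if ok:
                remote.buf[:len(local.buf)] = local.buf
            self.completion_queue.append((wr_id, "ok" if ok else "error"))


# Two "hosts", each with its own protection domain.
pd_a, pd_b = ProtectionDomain("host-a"), ProtectionDomain("host-b")
src = MemoryRegion(pd_a, bytearray(b"hello"), rkey=0x11)
dst = MemoryRegion(pd_b, bytearray(8), rkey=0x22, remote_write=True)

qp = QueuePair(pd_a)
qp.post_write(1, src, dst, rkey=0x22)  # correct key: the write lands
qp.post_write(2, src, dst, rkey=0x99)  # wrong key: rejected
qp.process()
print(list(qp.completion_queue), bytes(dst.buf))
```

Notice that the application never touches the data path after posting: it posts work, then learns the outcome by reading the completion queue.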

2.3 RDMA Example (Corrected C Code)

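```c
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t size = 4096;  /* buffer size, a multiple of the 4096-byte alignment */

    /* Open the first RDMA device on this host. */
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA device found\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }

    /* Protection domain and a 16-entry completion queue. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    if (!pd || !cq) { fprintf(stderr, "PD/CQ setup failed\n"); return 1; }

    /* Register a page-aligned buffer for local and remote access. */
    char *buf = aligned_alloc(4096, size);
    if (!buf) { fprintf(stderr, "allocation failed\n"); return 1; }
    struct ibv_mr *mr = ibv_reg_mr(
        pd, buf, size,
        IBV_ACCESS_LOCAL_WRITE |
        IBV_ACCESS_REMOTE_WRITE |
        IBV_ACCESS_REMOTE_READ
    );
    if (!mr) { fprintf(stderr, "ibv_reg_mr failed\n"); return 1; }

    /* Reliable-connected queue pair; one CQ serves both send and receive. */
    struct ibv_qp_init_attr qp_init = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr = 16,
            .max_recv_wr = 16,
            .max_send_sge = 1,
            .max_recv_sge = 1
        },
        .qp_type = IBV_QPT_RC
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qp_init);
    if (!qp) { fprintf(stderr, "ibv_create_qp failed\n"); return 1; }

    /* Connecting the QP to a peer (INIT -> RTR -> RTS) is not shown. */

    ibv_destroy_qp(qp);
    ibv_dereg_mr(mr);
    free(buf);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```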

NVMe over Fabrics (NVMe‑oF)

NVMe‑oF exposes remote NVMe devices over a network fabric with local‑like performance.

3.1 Connect to NVMe‑oF Target

```bash
nvme discover -t rdma -a <target-ip> -s 4420
nvme connect -t rdma -n <nqn> -a <target-ip> -s 4420
```

3.2 Benchmark

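```bash
# WARNING: this writes to the raw device and destroys any data on it.
# --iodepth only takes effect with an asynchronous engine such as libaio.
fio --name=randrw --filename=/dev/nvme1n1 \
    --ioengine=libaio --direct=1 \
    --rw=randrw --bs=4k --iodepth=64 --numjobs=4 \
    --runtime=60 --time_based --group_reporting
```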
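When you read the fio report, two quick conversions are handy. The Python sketch below turns IOPS into bandwidth for a given block size, and uses Little's law to estimate average latency from the number of outstanding I/Os. The 500,000 IOPS figure is a made-up input for illustration, not a result from any device, and the latency estimate assumes the queues stay full for the whole run.

```python
def iops_to_mib_per_s(iops: float, block_size_bytes: int) -> float:
    """Bandwidth implied by an IOPS figure at a fixed block size."""
    return iops * block_size_bytes / (1024 * 1024)


def avg_latency_us(iops: float, iodepth: int, numjobs: int) -> float:
    """Little's law: latency = outstanding requests / completion rate."""
    outstanding = iodepth * numjobs
    return outstanding / iops * 1e6


# Hypothetical measurement with the job parameters used above.
iops = 500_000
print(f"{iops_to_mib_per_s(iops, 4096):.1f} MiB/s")
print(f"{avg_latency_us(iops, iodepth=64, numjobs=4):.0f} us average latency")
```

If the latency you compute this way is far above what the fabric should deliver, the queue depth is probably hiding the device's real per-I/O latency. Rerun with `--iodepth=1` to see it.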

Cluster Scheduling — Kubernetes, Slurm, and Custom Schedulers

Schedulers decide:

Which job runs where
How resources are allocated
How to maximize utilization

4.1 Job Model (C++)

```cpp
struct Job {
    int id;
    int cpu_req;
    int mem_req;
    int gpu_req;
};
```

4.2 Node Model (C++)

```cpp
struct Resources {
    int cpu;
    int mem;
    int gpu;
};

struct Node {
    int id;
    Resources total, used;

    // True if the job fits into the resources that are still free.
    bool canRun(const Job& j) const {
        return used.cpu + j.cpu_req <= total.cpu &&
               used.mem + j.mem_req <= total.mem &&
               used.gpu + j.gpu_req <= total.gpu;
    }

    // Reserve the job's resources on this node.
    void assign(const Job& j) {
        used.cpu += j.cpu_req;
        used.mem += j.mem_req;
        used.gpu += j.gpu_req;
    }
};
```

4.3 Simple Scheduler (C++)

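```cpp
#include <vector>

// First-fit: place the job on the first node with enough free resources.
// Returns the chosen node id, or -1 if no node can run the job.
int schedule(const Job& j, std::vector<Node>& nodes) {
    for (auto& n : nodes) {
        if (n.canRun(j)) {
            n.assign(j);
            return n.id;
        }
    }
    return -1;
}
```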
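The scheduler above takes the first node that fits. One common alternative when the goal is higher utilization is best-fit: pick the feasible node that the job fills most tightly, so large nodes stay free for large jobs. The Python sketch below is one possible way to express that idea. The `leftover` scoring function and the dictionary-based resource layout are my own assumptions for illustration, not how Kubernetes or Slurm score nodes.

```python
from dataclasses import dataclass, field

RESOURCES = ("cpu", "mem", "gpu")


@dataclass
class Job:
    id: int
    req: dict  # e.g. {"cpu": 4, "mem": 16, "gpu": 0}


@dataclass
class Node:
    id: int
    total: dict
    used: dict = field(default_factory=lambda: dict.fromkeys(RESOURCES, 0))

    def can_run(self, job):
        return all(self.used[r] + job.req[r] <= self.total[r] for r in RESOURCES)

    def assign(self, job):
        for r in RESOURCES:
            self.used[r] += job.req[r]

    def leftover(self, job):
        """Free fraction remaining after placing job, summed over resources."""
        return sum(
            (self.total[r] - self.used[r] - job.req[r]) / self.total[r]
            for r in RESOURCES if self.total[r] > 0
        )


def schedule_best_fit(job, nodes):
    """Place job on the feasible node it fills most tightly; -1 if none fits."""
    candidates = [n for n in nodes if n.can_run(job)]
    if not candidates:
        return -1
    best = min(candidates, key=lambda n: n.leftover(job))
    best.assign(job)
    return best.id


nodes = [
    Node(0, {"cpu": 16, "mem": 64, "gpu": 0}),
    Node(1, {"cpu": 4, "mem": 16, "gpu": 0}),
]
print(schedule_best_fit(Job(1, {"cpu": 4, "mem": 16, "gpu": 0}), nodes))
```

First-fit would have put this job on node 0. Best-fit puts it on node 1, which it fills exactly, and keeps the big node free for a bigger job.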

Messaging — MPI & ZeroMQ

5.1 MPI Example

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* rank = this process's id, size = total number of processes */
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d/%d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

5.2 ZeroMQ PUB/SUB

Publisher

```python
import zmq, time

ctx = zmq.Context()
sock = ctx.socket(zmq.PUB)
sock.bind("tcp://*:5556")

while True:
    # The leading word is the topic; subscribers filter on this prefix.
    sock.send_string("topic1 Hello subscribers")
    time.sleep(1)
```

Subscriber

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.SUB)
sock.connect("tcp://localhost:5556")
# Only messages that start with "topic1" are delivered.
sock.setsockopt_string(zmq.SUBSCRIBE, "topic1")

while True:
    print(sock.recv_string())
```

Distributed Consensus — Raft, Paxos, and Beyond

Consensus ensures:

✅ consistency

✅ fault tolerance

✅ progress

6.1 Raft — Leader Election & Log Replication

Leader Election

Followers wait for heartbeat
Timeout → become candidate
Request votes
Majority → leader
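The election steps above can be sketched in a few lines of Python. This is a simplified model following the rules in the Raft paper, not a complete implementation: there is no networking, no timers actually firing, and no persistence. The `Follower` class and function names are hypothetical, and the 150 to 300 ms timeout range is the one the Raft paper suggests as an example.

```python
import random
from dataclasses import dataclass
from typing import Optional


def election_timeout_ms(low: int = 150, high: int = 300) -> float:
    """Randomized timeout so followers rarely become candidates at the same moment."""
    return random.uniform(low, high)


def has_majority(votes: int, cluster_size: int) -> bool:
    return votes > cluster_size // 2


@dataclass
class Follower:
    current_term: int = 0
    voted_for: Optional[str] = None
    last_log_term: int = 0
    last_log_index: int = 0

    def handle_request_vote(self, term, candidate_id, cand_last_term, cand_last_index):
        if term < self.current_term:
            return False                  # stale candidate
        if term > self.current_term:
            self.current_term = term      # newer term: forget the old vote
            self.voted_for = None
        # The candidate's log must be at least as up to date as ours.
        log_ok = (cand_last_term, cand_last_index) >= (self.last_log_term, self.last_log_index)
        if log_ok and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


# A 5-node cluster: the candidate votes for itself, then asks 4 followers.
followers = [Follower() for _ in range(4)]
votes = 1
votes += sum(f.handle_request_vote(1, "node-a", 0, 0) for f in followers)
print(has_majority(votes, cluster_size=5))
```

Each follower grants at most one vote per term. That single rule is what guarantees there can never be two leaders in the same term.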

Log Replication

Leader appends entry
Sends AppendEntries
Majority ack → commit
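The "majority ack → commit" step is where most of the subtlety hides, so here is a hedged Python sketch of how a leader could compute its new commit index. The function name and argument layout are my own, and it follows the rule from the Raft paper that a leader only commits entries from its current term by counting replicas.

```python
def advance_commit_index(match_index, leader_last_index, log_terms,
                         current_term, commit_index):
    """Return the leader's new commit index.

    match_index: highest replicated log index per follower, e.g. {"b": 3, "c": 1}
    log_terms:   log_terms[i] is the term of log entry i (index 0 is a sentinel)
    """
    # Include the leader's own log, then sort from most to least replicated.
    replicated = sorted(list(match_index.values()) + [leader_last_index], reverse=True)
    # The value at this position is stored on a majority of servers.
    majority_index = replicated[len(replicated) // 2]
    # Only entries from the current term are committed by counting replicas.
    if majority_index > commit_index and log_terms[majority_index] == current_term:
        return majority_index
    return commit_index


# 5-node cluster: the leader has 5 entries, followers have 4, 4, 2 and 1.
match_index = {"b": 4, "c": 4, "d": 2, "e": 1}
log_terms = [0, 1, 1, 2, 2, 2]
print(advance_commit_index(match_index, 5, log_terms, current_term=2, commit_index=1))
```

Entry 4 is on three of five servers (the leader, b, and c), so it is the highest index that can be committed here.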

Raft Pseudocode

```text
on timeout:
    become candidate
    request votes

on majority:
    become leader

on client write:
    append log
    replicate to followers
```

6.2 Paxos — Prepare / Accept

Basic Flow

Proposer → Prepare(n)
Acceptor → Promise
Proposer → Accept(n, value)
Acceptor → Accepted
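The four messages in the basic flow map onto very little code. Below is a minimal single-decree Paxos sketch in Python, assuming reliable in-process calls instead of a network. The `Acceptor` class, the `propose` helper, and the tuple-shaped replies are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Acceptor:
    promised_n: int = -1
    accepted_n: int = -1
    accepted_value: Any = None

    def on_prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("nack", self.promised_n, None)

    def on_accept(self, n, value):
        """Phase 2: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n, value)
        return ("nack", self.promised_n, None)


def propose(acceptors, n, value):
    """Run one round with proposal number n; return the chosen value or None."""
    promises = [r for r in (a.on_prepare(n) for a in acceptors) if r[0] == "promise"]
    if len(promises) <= len(acceptors) // 2:
        return None
    # If any acceptor already accepted a value, re-propose the one with the
    # highest proposal number instead of our own.
    prior_n, prior_value = max(((p[1], p[2]) for p in promises), key=lambda t: t[0])
    if prior_n >= 0:
        value = prior_value
    acks = [a.on_accept(n, value) for a in acceptors]
    accepted = sum(1 for r in acks if r[0] == "accepted")
    return value if accepted > len(acceptors) // 2 else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "x"))  # first proposal wins
print(propose(acceptors, 2, "y"))  # later proposer must adopt the chosen value
```

The second call is the interesting one: even though the proposer wanted "y", the promises tell it that "x" was already accepted, so it has to carry "x" forward. That is how Paxos keeps a chosen value chosen.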

Multi‑Paxos

One stable leader
Continuous replication
Similar to Raft in practice

Observability — eBPF, OpenTelemetry, Chaos Engineering

7.1 eBPF — Kernel‑Level Tracing

Trace TCP Connect

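```c
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>

/* bpf_printk() relies on a GPL-only helper, so declare a GPL license. */
char LICENSE[] SEC("license") = "GPL";

SEC("kprobe/tcp_connect")
int bpf_prog(struct pt_regs *ctx)
{
    bpf_printk("TCP connect\n");
    return 0;
}
```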

7.2 OpenTelemetry — Distributed Tracing

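```cpp
// Assumes `tracer` was obtained from an already configured tracer provider.
auto span = tracer->StartSpan("handle_request");
{
    // While `scope` is alive, `span` is the active span on this thread.
    auto scope = tracer->WithActiveSpan(span);
    // ... handle the request here ...
}
span->End();
```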

7.3 Chaos Engineering

Inject Latency

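```bash
# Add 200 ms of delay to all traffic leaving eth0 (requires root)
tc qdisc add dev eth0 root netem delay 200ms

# Remove the delay when the experiment is over
tc qdisc del dev eth0 root
```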

Final Notes — You’re Now in the Top 1%

You’ve just learned:

✅ RDMA

✅ NVMe‑oF

✅ Scheduling

✅ MPI & ZeroMQ

✅ Consensus (Raft, Paxos)

✅ Observability (eBPF, OTel)

✅ Chaos Engineering

This is the knowledge that powers Google, Meta, AWS, Azure, NVIDIA, and HPC clusters worldwide.

Thanks for Reading! 🙌

If you enjoyed this deep dive:

✅ Follow me for more advanced systems content

✅ Leave a reaction — it helps a lot

✅ Bookmark this post and read it again

✅ Practice the examples

✅ Share it with your team

See you in the next deep dive! 🚀🫅