Distributed Systems & Networking: From 🦸 to 🫅 — Make your own AWS!

(Part of the Distributed Systems & Networking Series. If you missed the first chapter, check it out here:

👉 Distributed Systems & Networking: From 0️⃣ to 🦸)


Hey everyone! 👋

Welcome to one of the most comprehensive, practical, and deeply technical guides you’ll ever read on Distributed Systems, Networking, Consensus, Scheduling, Observability, and High‑Performance Datacenter Architecture.

If you’ve ever wondered:

  • How do real distributed systems stay consistent?
  • How do schedulers like Kubernetes or Slurm decide where jobs run?
  • How do datacenters achieve microsecond latency?
  • How do we observe, debug, and stress‑test massive clusters?
  • How do protocols like RDMA, NVMe‑oF, MPI, ZeroMQ, Raft, and Paxos actually work?
  • How do cloud providers like AWS, Azure, and Google Cloud design their internal systems?

…then buckle up — this guide takes you from hero 🦸 to king 🫅 of distributed systems.

Let’s dive in.


  1. Foundations: Latency, Throughput, and System Architecture

Distributed systems are shaped by two fundamental forces:

  • Latency — how long it takes for a message to travel
  • Throughput — how much data you can push per second

Different workloads stress these differently:

| Workload | Latency sensitivity | Throughput demand |
| --- | --- | --- |
| HPC (MPI) | Extremely high | Medium |
| Storage (NVMe‑oF) | High | Extremely high |
| Microservices | Medium | Low |
| Batch jobs | Low | Medium |
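
A quick back-of-the-envelope model ties these two forces together: transfer time ≈ latency + payload size / bandwidth. The Python sketch below makes that concrete; the link parameters are illustrative assumptions, not measurements.

```python
# Rough transfer-time model: time ≈ latency + size / bandwidth.
# The link parameters below are assumed values for illustration only.

def transfer_time(size_bytes: float, latency_s: float, bandwidth_bps: float) -> float:
    """Estimate one-way transfer time for a single message."""
    return latency_s + (size_bytes * 8) / bandwidth_bps

# 4 KiB RPC over an assumed 2 µs, 100 Gbps RDMA-class link
print(f"small RPC: {transfer_time(4096, 2e-6, 100e9) * 1e6:.2f} µs")

# 1 GiB bulk transfer over the same link
print(f"bulk copy: {transfer_time(2**30, 2e-6, 100e9):.3f} s")
```

For small messages latency dominates (which is why RDMA targets microseconds), while for bulk transfers bandwidth dominates (which is why storage fabrics chase hundreds of Gbps).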

A modern datacenter is a single distributed supercomputer, and your job is to make communication:

✅ fast

✅ predictable

✅ scalable

✅ observable

✅ secure


  2. RDMA — Remote Direct Memory Access

RDMA allows one machine to read/write memory on another machine without involving the remote CPU.

This gives:

  • Ultra‑low latency (1–2 µs)
  • Extremely high throughput (100–400 Gbps)
  • Minimal CPU overhead

2.1 RDMA Architecture


```text
+---------------------------+
|        Application        |
+---------------------------+
|   RDMA Verbs / RDMA CM    |
+---------------------------+
|      RNIC (RDMA NIC)      |
+---------------------------+
| InfiniBand / RoCE Fabric  |
+---------------------------+
```


2.2 Core RDMA Concepts

  • QP (Queue Pair) — send/recv queues
  • CQ (Completion Queue) — event notifications
  • MR (Memory Region) — pinned memory
  • PD (Protection Domain) — isolation

2.3 RDMA Example (Corrected C Code)

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_context *ctx = ibv_open_device(dev);
struct ibv_pd *pd = ibv_alloc_pd(ctx);
struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

/* Allocate a page-aligned buffer and register it as an RDMA memory region */
char *buf = aligned_alloc(4096, size);

struct ibv_mr *mr = ibv_reg_mr(
    pd, buf, size,
    IBV_ACCESS_LOCAL_WRITE |
    IBV_ACCESS_REMOTE_WRITE |
    IBV_ACCESS_REMOTE_READ
);

/* Reliable Connection (RC) queue pair sharing one CQ for send and receive */
struct ibv_qp_init_attr qp_init = {
    .send_cq = cq,
    .recv_cq = cq,
    .cap = {
        .max_send_wr  = 16,
        .max_recv_wr  = 16,
        .max_send_sge = 1,
        .max_recv_sge = 1
    },
    .qp_type = IBV_QPT_RC
};

struct ibv_qp *qp = ibv_create_qp(pd, &qp_init);
```


  3. NVMe over Fabrics (NVMe‑oF)

NVMe‑oF exposes remote NVMe devices over a network fabric with local‑like performance.


3.1 Connect to NVMe‑oF Target

```bash
nvme discover -t rdma -a <target-ip> -s 4420
nvme connect -t rdma -n <nqn> -a <target-ip> -s 4420
```


3.2 Benchmark

```bash
fio --name=randrw --filename=/dev/nvme1n1 \
    --rw=randrw --bs=4k --iodepth=64 --numjobs=4 --runtime=60
```


  4. Cluster Scheduling — Kubernetes, Slurm, and Custom Schedulers

Schedulers decide:

  • Which job runs where
  • How resources are allocated
  • How to maximize utilization

4.1 Job Model (C++)

```cpp
struct Job {
    int id;
    int cpu_req;
    int mem_req;
    int gpu_req;
};
```


4.2 Node Model (C++)

```cpp
struct Resources {
    int cpu;
    int mem;
    int gpu;
};

struct Node {
    int id;
    Resources total, used;

    // A job fits only if every resource dimension stays within capacity
    bool canRun(const Job& j) const {
        return used.cpu + j.cpu_req <= total.cpu &&
               used.mem + j.mem_req <= total.mem &&
               used.gpu + j.gpu_req <= total.gpu;
    }

    void assign(const Job& j) {
        used.cpu += j.cpu_req;
        used.mem += j.mem_req;
        used.gpu += j.gpu_req;
    }
};
```


4.3 Simple Scheduler (C++)

```cpp
#include <vector>

// First-fit scheduler: place the job on the first node with enough free resources
int schedule(const Job& j, std::vector<Node>& nodes) {
    for (auto& n : nodes) {
        if (n.canRun(j)) {
            n.assign(j);
            return n.id;   // node chosen
        }
    }
    return -1;             // no node can run the job
}
```


  5. Messaging — MPI & ZeroMQ

5.1 MPI Example

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d/%d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```


5.2 ZeroMQ PUB/SUB

Publisher

```python
import zmq, time

ctx = zmq.Context()
sock = ctx.socket(zmq.PUB)
sock.bind("tcp://*:5556")

while True:
    sock.send_string("topic1 Hello subscribers")
    time.sleep(1)
```

Subscriber

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.SUB)
sock.connect("tcp://localhost:5556")
sock.setsockopt_string(zmq.SUBSCRIBE, "topic1")

while True:
    print(sock.recv_string())
```


  6. Distributed Consensus — Raft, Paxos, and Beyond

Consensus ensures:

✅ consistency

✅ fault tolerance

✅ progress


6.1 Raft — Leader Election & Log Replication

Leader Election

  • Followers wait for heartbeat
  • Timeout → become candidate
  • Request votes
  • Majority → leader

Log Replication

  • Leader appends entry
  • Sends AppendEntries
  • Majority ack → commit

Raft Pseudocode

```text
on timeout:
    become candidate
    request votes

on majority:
    become leader

on client write:
    append log
    replicate to followers
```
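
To make the election rules concrete, here is a minimal single-process simulation sketch of Raft leader election in Python. It models only terms, randomized timeouts, and majority voting; real Raft also needs persistent state, log-matching checks, and RPCs over a network. The class and method names are illustrative, not taken from any particular library.

```python
import random

# Minimal, illustrative simulation of Raft leader election.
# Networking, persistence, and log replication are deliberately omitted.

class RaftNode:
    def __init__(self, node_id, cluster_size):
        self.node_id = node_id
        self.cluster_size = cluster_size
        self.current_term = 0
        self.voted_for = None
        self.state = "follower"

    def election_timeout(self):
        # Randomized timeout (ms) so nodes rarely become candidates at once
        return random.uniform(150, 300)

    def on_timeout(self, peers):
        # Follower heard no heartbeat: start an election for the next term
        self.state = "candidate"
        self.current_term += 1
        self.voted_for = self.node_id
        votes = 1  # vote for self
        for peer in peers:
            if peer.request_vote(self.current_term, self.node_id):
                votes += 1
        if votes > self.cluster_size // 2:   # majority of the cluster -> leader
            self.state = "leader"
        return self.state

    def request_vote(self, term, candidate_id):
        # Step down on newer terms; grant at most one vote per term
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.state = "follower"
        if term == self.current_term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

nodes = [RaftNode(i, 3) for i in range(3)]
print(nodes[0].on_timeout(nodes[1:]))  # prints "leader" (no competing candidates here)
```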


6.2 Paxos — Prepare / Accept

Basic Flow


```text
Proposer → Prepare(n)
Acceptor → Promise
Proposer → Accept(n, value)
Acceptor → Accepted
```
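
For comparison, here is a minimal single-decree Paxos sketch in Python that runs the prepare/accept phases against in-memory acceptors. The class and method names are illustrative; a real deployment also needs durable acceptor state, messaging, and retries when proposers compete.

```python
# Illustrative single-decree Paxos: one proposer, in-memory acceptors.
# Durable storage, failures, and competing proposers are not modeled.

class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest proposal number promised
        self.accepted_n = -1       # highest proposal number accepted
        self.accepted_value = None

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals numbered below n
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered promise was made since
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False

def propose(acceptors, n, value):
    # Phase 1a: send Prepare(n) to all acceptors and count promises
    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None  # no majority of promises
    # If any acceptor already accepted a value, that value must be proposed
    prior = max(granted, key=lambda t: t[0])
    if prior[1] is not None:
        value = prior[1]
    # Phase 2a: send Accept(n, value); the value is chosen once a majority accepts
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=1, value="config-v2"))  # -> "config-v2"
```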

Multi‑Paxos

  • One stable leader
  • Continuous replication
  • Similar to Raft in practice

  7. Observability — eBPF, OpenTelemetry, Chaos Engineering

7.1 eBPF — Kernel‑Level Tracing

Trace TCP Connect

```c
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

SEC("kprobe/tcp_connect")
int bpf_prog(struct pt_regs *ctx) {
    bpf_printk("TCP connect\n");
    return 0;
}
```


7.2 OpenTelemetry — Distributed Tracing

```cpp
// Create a span, make it active for the enclosed scope, then end it
auto span = tracer->StartSpan("handle_request");
{
    auto scope = tracer->WithActiveSpan(span);
    // ... handle the request while the span is active ...
}
span->End();
```


7.3 Chaos Engineering

Inject Latency

```bash
tc qdisc add dev eth0 root netem delay 200ms
```


  8. Final Notes — You’re Now in the Top 1%

You’ve just learned:

✅ RDMA

✅ NVMe‑oF

✅ Scheduling

✅ MPI & ZeroMQ

✅ Consensus (Raft, Paxos)

✅ Observability (eBPF, OTel)

✅ Chaos Engineering

This is the knowledge that powers Google, Meta, AWS, Azure, NVIDIA, and HPC clusters worldwide.


Thanks for Reading! 🙌

If you enjoyed this deep dive:

✅ Follow me for more advanced systems content

✅ Leave a reaction — it helps a lot

✅ Bookmark this post and read it again

✅ Practice the examples

✅ Share it with your team

See you in the next deep dive! 🚀🫅
