Distributed Systems & Networking: From 0️⃣ to 🦸 — A Complete, High‑Performance Datacenter Guide

Hey Dev Community!

Welcome!

Modern distributed systems are built on one fundamental truth: the network is the computer. Whether you're running HPC workloads, microservices, distributed storage, or large‑scale data pipelines, your system’s performance is ultimately shaped by latency, bandwidth, and coordination.

This guide is a complete, end‑to‑end deep dive into the technologies that power real datacenters:

  • RDMA (InfiniBand / RoCE)
  • NVMe over Fabrics (NVMe‑oF)
  • Cluster scheduling (Kubernetes, Slurm, Nomad)
  • Messaging layers (MPI, ZeroMQ)
  • Performance engineering
  • Observability
  • Security & deployment

You’ll get architecture diagrams, protocol explanations, tuning advice, and runnable examples. By the end, you’ll be able to design and operate production‑grade distributed systems.


  1. Foundations: Latency, Throughput, and System Model

Distributed systems are shaped by two constraints:

  • Latency — how long it takes for a message to travel
  • Throughput — how much data you can push per second

Different workloads stress these differently:

| Workload | Latency sensitivity | Throughput sensitivity |
|---|---|---|
| HPC (MPI) | Extremely high | Medium |
| Storage (NVMe‑oF) | High | Extremely high |
| Microservices | Medium | Low |
| Batch jobs | Low | Medium |
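
As a quick back‑of‑the‑envelope model, total transfer time is roughly base latency plus payload size divided by bandwidth. The sketch below uses illustrative numbers (2 µs latency, a 100 Gbps link), not measurements:

```bash
# Rough transfer-time estimate: latency + size / bandwidth (illustrative numbers only)
awk 'BEGIN {
  latency_us = 2;                 # one-way base latency in microseconds
  size_bits  = 1024 * 1024 * 8;   # 1 MiB payload expressed in bits
  bw_bps     = 100e9;             # 100 Gbps link
  printf "~%.1f us to deliver 1 MiB\n", latency_us + size_bits / bw_bps * 1e6;
}'
```

For small messages the latency term dominates (which is why RDMA matters); for large transfers the bandwidth term dominates (which is why NVMe‑oF cares about 100–400 Gbps fabrics).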

A modern datacenter is a single distributed supercomputer. Your job is to make communication between nodes:

  • fast
  • predictable
  • scalable
  • observable
  • secure

System model assumptions

This guide assumes a cluster with:

  • Linux hosts
  • High‑speed NICs (InfiniBand or RoCE)
  • NVMe SSDs
  • Kubernetes or Slurm for scheduling
  • Root/admin access to configure kernel modules and NICs

  2. RDMA — Remote Direct Memory Access

RDMA allows one machine to read/write memory on another machine without involving the remote CPU. Benefits:

  • Ultra‑low latency (1–2 µs)
  • Extremely high throughput (100–400 Gbps)
  • Minimal CPU overhead

2.1 RDMA architecture


```
+---------------------------+
|        Application        |
+---------------------------+
|    RDMA Verbs / RDMA CM   |
+---------------------------+
|      RNIC (RDMA NIC)      |
+---------------------------+
|  InfiniBand / RoCE Fabric |
+---------------------------+
```

2.2 Core RDMA concepts

  • Queue Pair (QP) — send queue and receive queue per endpoint
  • Completion Queue (CQ) — where completion events are posted
  • Memory Region (MR) — pinned memory registered with the RNIC
  • Protection Domain (PD) — isolation boundary for QPs and MRs

2.3 RDMA transport types

| Transport | Reliability | Ordering | Use case |
|---|---|---|---|
| RC | Reliable | Ordered | HPC, storage |
| UC | Unreliable | Ordered | Niche low‑latency |
| UD | Unreliable | Unordered | Broadcast, discovery |

2.4 InfiniBand vs RoCE vs iWARP

| Fabric | Latency | Complexity | Notes |
|---|---|---|---|
| InfiniBand | Best | Medium | HPC standard, dedicated fabric |
| RoCEv2 | Very good | High | RDMA over Converged Ethernet; requires PFC/DCQCN tuning |
| iWARP | Moderate | Low | RDMA over TCP; easier deployment but higher latency |

  3. RDMA Programming (libibverbs)

A minimal RDMA program lifecycle:

  1. Open device (ibv_open_device)
  2. Allocate protection domain (ibv_alloc_pd)
  3. Create completion queue (ibv_create_cq)
  4. Register memory region (ibv_reg_mr)
  5. Create queue pair (ibv_create_qp)
  6. Exchange QP connection info with peer (out‑of‑band)
  7. Transition QP to RTR then RTS
  8. Post send/recv work requests (ibv_post_send, ibv_post_recv)
  9. Poll completion queue (ibv_poll_cq)

3.1 Minimal RDMA example (conceptual, corrected)

Note: This is a conceptual snippet showing correct API names and structures. Real programs require error handling, out‑of‑band exchange (e.g., TCP) for QP attributes, and proper cleanup.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>

int main() {
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    struct ibv_device *dev = dev_list[0];
    struct ibv_context *ctx = ibv_open_device(dev);

    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    size_t size = 4096;
    void *buf = aligned_alloc(4096, size);

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);

    struct ibv_qp_init_attr qp_init = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr  = 16,
            .max_recv_wr  = 16,
            .max_send_sge = 1,
            .max_recv_sge = 1
        },
        .qp_type = IBV_QPT_RC
    };

    struct ibv_qp *qp = ibv_create_qp(pd, &qp_init);

    // Out-of-band: exchange qp->qp_num, lid/gid, mr->rkey, buf address with peer
    // Then modify QP to RTR and RTS using ibv_modify_qp
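    //
    // Illustrative sketch (not part of the original example): the three
    // ibv_modify_qp calls that bring an RC QP up. remote_qpn and remote_lid
    // would come from the out-of-band exchange; the timeout/retry values are
    // common defaults, not requirements.
    //
    // struct ibv_qp_attr a = { .qp_state = IBV_QPS_INIT, .pkey_index = 0,
    //     .port_num = 1,
    //     .qp_access_flags = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ };
    // ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
    //                       IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);
    //
    // a = (struct ibv_qp_attr){ .qp_state = IBV_QPS_RTR, .path_mtu = IBV_MTU_4096,
    //     .dest_qp_num = remote_qpn, .rq_psn = 0, .max_dest_rd_atomic = 1,
    //     .min_rnr_timer = 12, .ah_attr = { .dlid = remote_lid, .port_num = 1 } };
    // ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
    //     IBV_QP_DEST_QPN | IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC |
    //     IBV_QP_MIN_RNR_TIMER);
    //
    // a = (struct ibv_qp_attr){ .qp_state = IBV_QPS_RTS, .timeout = 14,
    //     .retry_cnt = 7, .rnr_retry = 7, .sq_psn = 0, .max_rd_atomic = 1 };
    // ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
    //     IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);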

    // Post a receive
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)size,
        .lkey   = mr->lkey
    };

    struct ibv_recv_wr rr = {
        .wr_id   = 1,
        .sg_list = &sge,
        .num_sge = 1,
        .next    = NULL
    };
    struct ibv_recv_wr *bad_rr;
    ibv_post_recv(qp, &rr, &bad_rr);

    // Poll completions (example)
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) {
        // busy-wait, or use an event-driven approach
    }

    // Cleanup omitted for brevity
    return 0;
}
```

3.2 Production tuning tips

  • Use hugepages for large pinned allocations
  • Pin CQ polling threads and set CPU affinity
  • Enable busy‑polling (net.core.busy_poll) for low latency where appropriate
  • Configure MTU (e.g., 4096 for InfiniBand or 9000 for Ethernet jumbo frames)
  • RoCE tuning: enable PFC, ECN, and tune DCQCN to avoid packet loss and head‑of‑line blocking
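
A hedged sketch of a few of these knobs on a Linux host. Interface names, core IDs, and values below are examples only; the right numbers depend on your NIC, fabric, and workload:

```bash
# Example values only; adapt to your NIC, fabric, and workload
sudo ip link set dev eth0 mtu 9000               # jumbo frames on an Ethernet/RoCE port
sudo sysctl -w net.core.busy_poll=50             # busy-poll sockets for up to 50 us
sudo sysctl -w net.core.busy_read=50
echo 1024 | sudo tee /proc/sys/vm/nr_hugepages   # reserve 1024 x 2 MiB hugepages
taskset -c 2 ./rdma_app                          # pin the CQ-polling process (rdma_app is a placeholder)
```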

  4. NVMe over Fabrics (NVMe‑oF)

NVMe‑oF exposes remote NVMe devices over a network fabric with local‑like performance. Transports include RDMA, TCP, and Fibre Channel.

4.1 NVMe‑oF architecture


```
+-------------------+
|    Application    |
+-------------------+
|    NVMe Driver    |
+-------------------+
| NVMe-oF Transport |
+-------------------+
|  RDMA / TCP / FC  |
+-------------------+
|  Network Fabric   |
+-------------------+
```

4.2 NVMe‑oF transports

| Transport | Latency | Deployment | Notes |
|---|---|---|---|
| RDMA | Best | Hard | Best performance for HPC/storage |
| TCP | Good | Easy | Widely deployable; good performance with kernel NVMe/TCP |
| FC | Good | Medium | Enterprise SAN environments |

4.3 Example: NVMe‑oF with RDMA (initiator)

```bash
# Discover NVMe-oF targets over RDMA
nvme discover -t rdma -a <ip> -s 4420

# Connect to a discovered target (replace <nqn> and <ip>)
nvme connect -t rdma -n <nqn> -a <ip> -s 4420
```
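
For completeness, here is a sketch of the corresponding target side using the Linux nvmet configfs interface. The NQN, IP address, and backing device are placeholders; run the commands as root:

```bash
# Illustrative target-side setup (placeholder NQN, IP, and device)
modprobe nvmet
modprobe nvmet-rdma
cd /sys/kernel/config/nvmet
mkdir subsystems/nqn.2024-01.io.example:target1
echo 1 > subsystems/nqn.2024-01.io.example:target1/attr_allow_any_host
mkdir subsystems/nqn.2024-01.io.example:target1/namespaces/1
echo /dev/nvme0n1 > subsystems/nqn.2024-01.io.example:target1/namespaces/1/device_path
echo 1 > subsystems/nqn.2024-01.io.example:target1/namespaces/1/enable
mkdir ports/1
echo rdma     > ports/1/addr_trtype
echo ipv4     > ports/1/addr_adrfam
echo 10.0.0.1 > ports/1/addr_traddr
echo 4420     > ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/nqn.2024-01.io.example:target1 ports/1/subsystems/
```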

4.4 Benchmark example with fio

```bash
# Direct I/O with an async engine so that --iodepth actually takes effect
fio --name=randrw --filename=/dev/nvme1n1 --ioengine=libaio --direct=1 \
    --rw=randrw --bs=4k --iodepth=64 --numjobs=4 --runtime=60 --time_based
```


  5. Cluster Scheduling & Resource Management

Schedulers decide where workloads run and how resources are allocated. Examples: Kubernetes, Slurm, Nomad.

5.1 Kubernetes key features

  • Pods, Deployments, StatefulSets
  • Resource requests and limits
  • Taints and tolerations
  • Device plugins (GPU, SR‑IOV, NVMe)
  • Topology Manager and CPU Manager for NUMA awareness
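
As an illustration of requests/limits and NUMA-aware pinning, here is a minimal sketch of a Guaranteed-QoS Pod (integer CPU request equal to the limit, so the static CPU Manager policy can pin cores). The image name and sizes are placeholders, and the node is assumed to expose 2 MiB hugepages:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: latency-critical
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: 8Gi
        hugepages-2Mi: 1Gi
      limits:
        cpu: "4"
        memory: 8Gi
        hugepages-2Mi: 1Gi
EOF
```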

5.2 Slurm (HPC) example job script

Note that SBATCH directives must start with #SBATCH and appear at the top of the script.

```bash
#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=01:00:00
#SBATCH --partition=compute

module load mpi
srun --mpi=pmix_v3 ./my_mpi_app
```

5.3 Scheduling algorithms and strategies

  • Bin packing for utilization
  • Gang scheduling for tightly coupled parallel jobs
  • Preemption for priority handling
  • NUMA‑aware placement to reduce cross‑socket memory access
  • Network‑aware placement to reduce cross‑rack traffic and latency
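
Two small examples of these ideas in practice (the job script name and binary are placeholders): Slurm's --switches flag asks for nodes under a common leaf switch, and numactl keeps a process and its memory on one NUMA node.

```bash
# Network-aware: ask for all 4 nodes under a single leaf switch,
# waiting up to 10 minutes for such a placement before falling back
sbatch --nodes=4 --switches=1@10:00 job.sh

# NUMA-aware: run the process and allocate its memory on NUMA node 0
numactl --cpunodebind=0 --membind=0 ./my_app
```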

  6. Messaging: MPI & ZeroMQ

6.1 MPI (Message Passing Interface)

MPI is the backbone of HPC. Correct minimal example:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d/%d\n", rank, size);

    if (rank == 0) {
        int data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int recv;
        MPI_Recv(&recv, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", recv);
    }

    MPI_Finalize();
    return 0;
}
```

Build and run

```bash
mpicc -o mpi_example mpi_example.c
mpirun -np 4 ./mpi_example
```

6.2 ZeroMQ PUB/SUB example

Publisher

```python
# publisher.py
import zmq
import time

ctx = zmq.Context()
sock = ctx.socket(zmq.PUB)
sock.bind("tcp://*:5556")

while True:
    sock.send_string("topic1 Hello subscribers")
    time.sleep(1)
```

Subscriber

```python
# subscriber.py
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.SUB)
sock.connect("tcp://localhost:5556")
sock.setsockopt_string(zmq.SUBSCRIBE, "topic1")

while True:
    msg = sock.recv_string()
    print("Received:", msg)
```

Run publisher and subscriber in separate terminals.


  7. Performance Engineering

7.1 Network

  • Enable jumbo frames where supported (MTU 9000)
  • Tune RSS/XPS to distribute interrupts across CPUs
  • Set IRQ affinity for RNIC queues
  • RoCE: configure PFC, ECN, and DCQCN to avoid packet loss
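
A sketch of the first three items; the interface name, channel count, IRQ number, and core below are examples:

```bash
sudo ip link set dev eth0 mtu 9000                   # jumbo frames
sudo ethtool -L eth0 combined 16                     # 16 RX/TX channel pairs
sudo ethtool -X eth0 equal 16                        # RSS: spread flows across the 16 queues
grep eth0 /proc/interrupts                           # find the NIC's IRQ numbers, then pin one:
echo 4 | sudo tee /proc/irq/120/smp_affinity_list    # IRQ 120 -> core 4 (example IRQ)
```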

7.2 CPU

  • Pin critical threads to dedicated cores
  • Use isolcpus kernel parameter for latency‑sensitive workloads
  • Disable frequency scaling (set performance governor) for predictable latency
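
For example (the binary name and core IDs are placeholders; the isolcpus change requires editing the kernel command line and rebooting):

```bash
sudo cpupower frequency-set -g performance      # performance governor on all cores
taskset -c 4-7 ./latency_critical_app           # pin the process to cores 4-7
# Kernel command line (e.g. via GRUB): isolcpus=4-7 nohz_full=4-7 rcu_nocbs=4-7
```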

7.3 Memory

  • Use hugepages for large allocations and to reduce TLB pressure
  • Align allocations to NUMA nodes and schedule accordingly
  • Tune vm.swappiness to avoid swapping
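
For example (the binary name is a placeholder):

```bash
numactl --cpunodebind=0 --membind=0 ./my_app    # keep CPU and memory on NUMA node 0
sudo sysctl -w vm.swappiness=1                  # strongly discourage swapping
numastat -p "$(pidof my_app)"                   # verify where the pages actually live
```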

7.4 Storage

  • Use appropriate I/O scheduler (e.g., none or mq-deadline for NVMe)
  • Tune queue depth and I/O parallelism for NVMe devices
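
For example, on an NVMe namespace (the device name is an example):

```bash
cat /sys/block/nvme0n1/queue/scheduler                   # show the current scheduler
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler  # NVMe usually wants "none"
fio --name=qd --filename=/dev/nvme0n1 --rw=randread --bs=4k \
    --ioengine=libaio --direct=1 --iodepth=128 --numjobs=4 \
    --runtime=30 --time_based                            # probe behavior at higher queue depth
```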

  8. Observability & Debugging

Tools and techniques to observe and debug distributed systems:

  • iperf3 for network throughput testing
  • ibv_devinfo, ibstat for RDMA device info
  • perf, bpftrace, bcc for profiling and tracing
  • tcpdump for packet captures
  • fio for storage benchmarking
  • OpenTelemetry for distributed tracing, metrics, and logs
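
A typical first-pass triage with these tools (the addresses and service name are placeholders):

```bash
iperf3 -s                                                  # on the server node
iperf3 -c 10.0.0.2 -P 8 -t 30                              # on the client: 8 parallel streams, 30 s
ibv_devinfo                                                # confirm RDMA device and port state
sudo perf record -g -p "$(pidof my_service)" -- sleep 10   # sample call stacks for 10 s
sudo perf report --stdio | head -50
```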

  9. Observability Deep Dive: eBPF, OpenTelemetry, Chaos Engineering

9.1 eBPF — kernel‑level observability

eBPF programs run safely in the kernel and can attach to kprobes, uprobes, tracepoints, and sockets.

Example: simple kprobe program (libbpf style)

```c
// trace_tcp_connect.c (libbpf skeleton omitted for brevity)
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>

SEC("kprobe/tcp_connect")
int kprobe_tcp_connect(struct pt_regs *ctx) {
    bpf_printk("tcp_connect called\n");
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

Load with bpftool or libbpf-based loader. Use bpftrace for quick experiments:

```bash
sudo bpftrace -e 'kprobe:tcp_connect { printf("tcp_connect %s\n", comm); }'
```

9.2 OpenTelemetry — distributed tracing

Instrument services to produce spans and traces. Example C++ pseudo‑usage:

```cpp
#include <opentelemetry/trace/provider.h>

auto provider = opentelemetry::trace::Provider::GetTracerProvider();
auto tracer = provider->GetTracer("service");
auto span = tracer->StartSpan("handle_request");
{
    auto scope = tracer->WithActiveSpan(span);
    // work...
}
span->End();
```

Export traces to a collector (e.g., OpenTelemetry Collector) and visualize in Jaeger, Zipkin, or commercial backends.

9.3 Chaos Engineering

Inject controlled failures to validate resilience.

Network latency injection

```bash
# Add 200 ms of latency on eth0
sudo tc qdisc add dev eth0 root netem delay 200ms

# Remove the rule
sudo tc qdisc del dev eth0 root
```

CPU stress

```bash
# Install stress-ng, then generate CPU load on 4 workers for 60 s
stress-ng --cpu 4 --timeout 60s
```

Use tools like Chaos Mesh, LitmusChaos, or Gremlin for orchestrated experiments in Kubernetes.


  10. Security & Deployment

  • VLAN/VRF isolation for tenant separation
  • Secure boot and signed kernel modules for host integrity
  • Access control for NVMe‑oF targets and RDMA resources
  • RDMA subnet manager and fabric authentication where applicable
  • Gradual rollout per rack and canary deployments for safe changes

  11. Appendix: Quick Commands and Corrected Examples

RDMA utilities

```bash
# Show RDMA devices
ibv_devinfo

# Show InfiniBand link status
ibstat

# Simple RDMA ping-pong test (from rdma-core examples)
ibv_rc_pingpong
```

NVMe‑oF

```bash
nvme discover -t rdma -a <ip> -s 4420
nvme connect -t rdma -n <nqn> -a <ip> -s 4420
```

MPI

```bash
mpicc app.c -o app
mpirun -np 64 ./app
```

ZeroMQ

```bash
python3 publisher.py & python3 subscriber.py
```


Thanks for reading!

Follow me for more advanced systems content

Leave a reaction — it helps a lot

Bookmark this post and read it again

Practice the examples and give feedback

See you in the next deep dive! 🚀🫅
