Hey Dev Community!
Welcome!
Modern distributed systems are built on one fundamental truth: the network is the computer. Whether you're running HPC workloads, microservices, distributed storage, or large‑scale data pipelines, your system’s performance is ultimately shaped by latency, bandwidth, and coordination.
This guide is a complete, end‑to‑end deep dive into the technologies that power real datacenters:
- RDMA (InfiniBand / RoCE)
- NVMe over Fabrics (NVMe‑oF)
- Cluster scheduling (Kubernetes, Slurm, Nomad)
- Messaging layers (MPI, ZeroMQ)
- Performance engineering
- Observability
- Security & deployment
You’ll get architecture diagrams, protocol explanations, tuning advice, and runnable examples. By the end, you’ll be able to design and operate production‑grade distributed systems.
1. Foundations: Latency, Throughput, and System Model
Distributed systems are shaped by two constraints:
- Latency — how long it takes for a message to travel
- Throughput — how much data you can push per second
Different workloads stress these differently:
| Workload | Latency Sensitivity | Throughput Sensitivity |
|---|---|---|
| HPC (MPI) | Extremely high | Medium |
| Storage (NVMe‑oF) | High | Extremely high |
| Microservices | Medium | Low |
| Batch jobs | Low | Medium |
A modern datacenter is a single distributed supercomputer. Your job is to make communication between nodes:
- fast
- predictable
- scalable
- observable
- secure
System model assumptions
This guide assumes a cluster with:
- Linux hosts
- High‑speed NICs (InfiniBand or RoCE)
- NVMe SSDs
- Kubernetes or Slurm for scheduling
- Root/admin access to configure kernel modules and NICs
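Before diving in, it's worth confirming these assumptions on each host. Here is a minimal check sketch; the module and device names are examples and will vary by vendor and distribution.

```bash
# RDMA kernel modules and devices (names vary by vendor)
lsmod | grep -E 'ib_core|rdma_cm|mlx'
ibv_devinfo | grep -E 'hca_id|link_layer'

# NVMe devices
nvme list

# Scheduler reachability (Kubernetes shown; use sinfo for Slurm)
kubectl get nodes -o wide
```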
2. RDMA — Remote Direct Memory Access
RDMA allows one machine to read/write memory on another machine without involving the remote CPU. Benefits:
- Ultra‑low latency (1–2 µs)
- Extremely high throughput (100–400 Gbps)
- Minimal CPU overhead
2.1 RDMA architecture
```
+---------------------------+
|        Application        |
+---------------------------+
|   RDMA Verbs / RDMA CM    |
+---------------------------+
|      RNIC (RDMA NIC)      |
+---------------------------+
| InfiniBand / RoCE Fabric  |
+---------------------------+
```
2.2 Core RDMA concepts
- Queue Pair (QP) — send queue and receive queue per endpoint
- Completion Queue (CQ) — where completion events are posted
- Memory Region (MR) — pinned memory registered with the RNIC
- Protection Domain (PD) — isolation boundary for QPs and MRs
2.3 RDMA transport types
| Transport | Reliability | Ordering | Use case |
|---|---|---|---|
| RC | Reliable | Ordered | HPC, storage |
| UC | Unreliable | Ordered | Niche low‑latency |
| UD | Unreliable | Unordered | Broadcast, discovery |
2.4 InfiniBand vs RoCE vs iWARP
| Fabric | Latency | Complexity | Notes |
|---|---|---|---|
| InfiniBand | Best | Medium | HPC standard, dedicated fabric |
| RoCEv2 | Very good | High | RDMA over Converged Ethernet; requires PFC/DCQCN tuning |
| iWARP | Moderate | Low | RDMA over TCP; easier deployment but higher latency |
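To see which fabric a host is actually using, inspect the RDMA devices and their link layer. A quick sketch (device, port, and GID index are examples):

```bash
# List RDMA links and their state (iproute2)
rdma link show

# Per-device details: link_layer is "InfiniBand" or "Ethernet" (RoCE)
ibv_devinfo -v | grep -E 'hca_id|link_layer|active_mtu'

# For RoCE, check whether a GID index is RoCE v1 or v2
cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/0
```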
3. RDMA Programming (libibverbs)
A minimal RDMA program lifecycle:
1. Open a device (ibv_open_device)
2. Allocate a protection domain (ibv_alloc_pd)
3. Create a completion queue (ibv_create_cq)
4. Register a memory region (ibv_reg_mr)
5. Create a queue pair (ibv_create_qp)
6. Exchange QP connection info with the peer (out‑of‑band)
7. Transition the QP to RTR, then RTS
8. Post send/receive work requests (ibv_post_send, ibv_post_recv)
9. Poll the completion queue (ibv_poll_cq)
3.1 Minimal RDMA example (conceptual, corrected)
Note: This is a conceptual snippet showing correct API names and structures. Real programs require error handling, out‑of‑band exchange (e.g., TCP) for QP attributes, and proper cleanup.
```c
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main() {
    // Open the first RDMA device
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    struct ibv_device *dev = dev_list[0];
    struct ibv_context *ctx = ibv_open_device(dev);

    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    // Register a page-aligned buffer for local and remote access
    size_t size = 4096;
    void *buf = aligned_alloc(4096, size);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);

    struct ibv_qp_init_attr qp_init = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr = 16,
            .max_recv_wr = 16,
            .max_send_sge = 1,
            .max_recv_sge = 1
        },
        .qp_type = IBV_QPT_RC
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qp_init);

    // Out-of-band: exchange qp->qp_num, lid/gid, mr->rkey, buf address with peer
    // Then modify the QP to INIT, RTR, and RTS using ibv_modify_qp

    // Post a receive
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf,
        .length = size,
        .lkey = mr->lkey
    };
    struct ibv_recv_wr rr = {
        .wr_id = 1,
        .sg_list = &sge,
        .num_sge = 1,
        .next = NULL
    };
    struct ibv_recv_wr *bad_rr;
    ibv_post_recv(qp, &rr, &bad_rr);

    // Poll completions (example)
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) {
        // busy-wait or use an event-driven approach
    }

    // Cleanup omitted for brevity
    return 0;
}
```
3.2 Production tuning tips
- Use hugepages for large pinned allocations
- Pin CQ polling threads and set CPU affinity
- Enable busy‑polling (net.core.busy_poll) for low latency where appropriate
- Configure MTU (e.g., 4096 for InfiniBand or 9000 for Ethernet jumbo frames)
- RoCE tuning: enable PFC, ECN, and tune DCQCN to avoid packet loss and head‑of‑line blocking
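A few of these knobs expressed as shell commands; the values and interface name are illustrative starting points, not universal recommendations:

```bash
# Reserve 2 MB hugepages for large pinned buffers
sudo sysctl -w vm.nr_hugepages=1024

# Enable socket busy-polling (microseconds) for latency-sensitive paths
sudo sysctl -w net.core.busy_poll=50
sudo sysctl -w net.core.busy_read=50

# Raise the MTU on the RDMA-facing Ethernet interface
sudo ip link set dev eth0 mtu 9000
```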
4. NVMe over Fabrics (NVMe‑oF)
NVMe‑oF exposes remote NVMe devices over a network fabric with local‑like performance. Transports include RDMA, TCP, and Fibre Channel.
4.1 NVMe‑oF architecture
```
+-------------------+
|    Application    |
+-------------------+
|    NVMe Driver    |
+-------------------+
| NVMe-oF Transport |
+-------------------+
|  RDMA / TCP / FC  |
+-------------------+
|  Network Fabric   |
+-------------------+
```
4.2 NVMe‑oF transports
| Transport | Latency | Deployment | Notes |
|---|---|---|---|
| RDMA | Best | Hard | Best performance for HPC/storage |
| TCP | Good | Easy | Widely deployable, good performance with kernel NVMe/TCP |
| FC | Enterprise | Medium | SAN environments |
4.3 Example: NVMe‑oF with RDMA (initiator)
```bash
# Discover NVMe-oF targets over RDMA
nvme discover -t rdma -a <ip> -s 4420

# Connect to a discovered target (replace <nqn> and <ip>)
nvme connect -t rdma -n <nqn> -a <ip> -s 4420
```
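The commands above assume a target is already exporting a subsystem. For reference, here is a rough sketch of configuring the Linux kernel target (nvmet) through configfs; the NQN, backing device, and <ip> are placeholders:

```bash
# Load the target modules
sudo modprobe nvmet nvmet-rdma

# Create a subsystem (NQN is a placeholder)
sudo mkdir /sys/kernel/config/nvmet/subsystems/nqn.2024-01.io.example:target1
cd /sys/kernel/config/nvmet/subsystems/nqn.2024-01.io.example:target1
echo 1 | sudo tee attr_allow_any_host   # restrict with allowed_hosts in production

# Add a namespace backed by a local NVMe device
sudo mkdir namespaces/1
echo /dev/nvme0n1 | sudo tee namespaces/1/device_path
echo 1 | sudo tee namespaces/1/enable

# Expose the subsystem on an RDMA port
sudo mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo <ip>  | sudo tee addr_traddr
echo rdma  | sudo tee addr_trtype
echo 4420  | sudo tee addr_trsvcid
echo ipv4  | sudo tee addr_adrfam
sudo ln -s /sys/kernel/config/nvmet/subsystems/nqn.2024-01.io.example:target1 subsystems/
```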
4.4 Benchmark example with fio
```bash
# WARNING: this writes directly to the raw device and will destroy data on it.
# --ioengine=libaio and --direct=1 are needed for iodepth to have any effect.
fio --name=randrw --filename=/dev/nvme1n1 \
    --rw=randrw --bs=4k --iodepth=64 --ioengine=libaio --direct=1 \
    --numjobs=4 --runtime=60 --time_based
```
5. Cluster Scheduling & Resource Management
Schedulers decide where workloads run and how resources are allocated. Examples: Kubernetes, Slurm, Nomad.
5.1 Kubernetes key features
- Pods, Deployments, StatefulSets
- Resource requests and limits
- Taints and tolerations
- Device plugins (GPU, SR‑IOV, NVMe)
- Topology Manager and CPU Manager for NUMA awareness
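As an example, a pod can request CPU and memory alongside an RDMA device exposed by a device plugin. A minimal sketch, assuming the Mellanox shared-device plugin is installed; the resource name rdma/hca_shared_devices_a and the container image are placeholders that depend on your plugin configuration:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test
spec:
  containers:
  - name: app
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "4"
        memory: 8Gi
        rdma/hca_shared_devices_a: 1
      limits:
        cpu: "4"
        memory: 8Gi
        rdma/hca_shared_devices_a: 1
EOF
```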
5.2 Slurm (HPC) example job script
SBATCH directives must start with #SBATCH and appear at the top of the script.
```bash
#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=01:00:00
#SBATCH --partition=compute

module load mpi
srun --mpi=pmix_v3 ./my_mpi_app
```
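Submit and monitor the job (assuming the script above is saved as mpi_job.sh):

```bash
sbatch mpi_job.sh          # submit; prints the job ID
squeue -u $USER            # list your pending/running jobs
scontrol show job <jobid>  # node placement and resource details
sacct -j <jobid>           # accounting info after the job finishes
```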
5.3 Scheduling algorithms and strategies
- Bin packing for utilization
- Gang scheduling for tightly coupled parallel jobs
- Preemption for priority handling
- NUMA‑aware placement to reduce cross‑socket memory access
- Network‑aware placement to reduce cross‑rack traffic and latency
6. Messaging: MPI & ZeroMQ
6.1 MPI (Message Passing Interface)
MPI is the backbone of HPC. Correct minimal example:
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d/%d\n", rank, size);

    if (rank == 0) {
        int data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int recv;
        MPI_Recv(&recv, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", recv);
    }

    MPI_Finalize();
    return 0;
}
```
Build and run
```bash
mpicc -o mpi_example mpi_example.c
mpirun -np 4 ./mpi_example
```
6.2 ZeroMQ PUB/SUB example
Publisher
```python
# publisher.py
import zmq
import time

ctx = zmq.Context()
sock = ctx.socket(zmq.PUB)
sock.bind("tcp://*:5556")

while True:
    sock.send_string("topic1 Hello subscribers")
    time.sleep(1)
```
Subscriber
```python
# subscriber.py
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.SUB)
sock.connect("tcp://localhost:5556")
sock.setsockopt_string(zmq.SUBSCRIBE, "topic1")

while True:
    msg = sock.recv_string()
    print("Received:", msg)
```
Run publisher and subscriber in separate terminals.
7. Performance Engineering
7.1 Network
- Enable jumbo frames where supported (MTU 9000)
- Tune RSS/XPS to distribute interrupts across CPUs
- Set IRQ affinity for RNIC queues
- RoCE: configure PFC, ECN, and DCQCN to avoid packet loss
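The corresponding commands look roughly like this; the interface name, queue count, and IRQ number are examples:

```bash
# Jumbo frames
sudo ip link set dev eth0 mtu 9000

# Spread NIC queues (RSS) across more CPUs
sudo ethtool -L eth0 combined 8

# Pin a NIC queue's IRQ to a specific core
echo 4 | sudo tee /proc/irq/120/smp_affinity_list
```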
7.2 CPU
- Pin critical threads to dedicated cores
- Use isolcpus kernel parameter for latency‑sensitive workloads
- Disable frequency scaling (set performance governor) for predictable latency
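A sketch of these settings; core IDs and the binary name are examples:

```bash
# Use the performance governor on all cores (cpupower from linux-tools)
sudo cpupower frequency-set -g performance

# Pin a latency-critical process to cores 2-3
taskset -c 2,3 ./latency_critical_app

# Kernel command line (set in the bootloader) to isolate cores 2-7:
#   isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7
```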
7.3 Memory
- Use hugepages for large allocations and to reduce TLB pressure
- Align allocations to NUMA nodes and schedule accordingly
- Tune vm.swappiness to avoid swapping
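For example:

```bash
# Run a process with CPU and memory bound to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./app

# Reduce swap pressure
sudo sysctl -w vm.swappiness=10

# Inspect hugepage usage
grep -i huge /proc/meminfo
```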
7.4 Storage
- Use appropriate I/O scheduler (e.g., none or mq-deadline for NVMe)
- Tune queue depth and I/O parallelism for NVMe devices
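For example (the device name is a placeholder):

```bash
# Check and set the I/O scheduler for an NVMe namespace
cat /sys/block/nvme1n1/queue/scheduler
echo none | sudo tee /sys/block/nvme1n1/queue/scheduler

# Inspect the request queue depth exposed by the block layer
cat /sys/block/nvme1n1/queue/nr_requests
```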
8. Observability & Debugging
Tools and techniques to observe and debug distributed systems:
- iperf3 for network throughput testing
- ibv_devinfo, ibstat for RDMA device info
- perf, bpftrace, bcc for profiling and tracing
- tcpdump for packet captures
- fio for storage benchmarking
- OpenTelemetry for distributed tracing, metrics, and logs
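A few representative invocations; hostnames and interface names are placeholders:

```bash
iperf3 -s                              # on the server
iperf3 -c <server_ip> -P 4             # client, 4 parallel streams

ibv_devinfo                            # RDMA device capabilities
sudo tcpdump -i eth0 -w capture.pcap   # raw packet capture for offline analysis
sudo perf top                          # live CPU profile
```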
9. Observability Deep Dive: eBPF, OpenTelemetry, Chaos Engineering
9.1 eBPF — kernel‑level observability
eBPF programs run safely in the kernel and can attach to kprobes, uprobes, tracepoints, and sockets.
Example: simple kprobe program (libbpf style)
```c
// trace_tcp_connect.c (libbpf skeleton omitted for brevity)
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("kprobe/tcp_connect")
int kprobe_tcp_connect(struct pt_regs *ctx) {
    bpf_printk("tcp_connect called");
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```
Load it with bpftool or a libbpf-based loader. Use bpftrace for quick experiments:
```bash
sudo bpftrace -e 'kprobe:tcp_connect { printf("tcp_connect %s\n", comm); }'
```
9.2 OpenTelemetry — distributed tracing
Instrument services to produce spans and traces. Example C++ pseudo‑usage:
```cpp
#include <opentelemetry/trace/provider.h>

auto provider = opentelemetry::trace::Provider::GetTracerProvider();
auto tracer = provider->GetTracer("service");
auto span = tracer->StartSpan("handle_request");
{
    auto scope = tracer->WithActiveSpan(span);
    // work...
}
span->End();
```
Export traces to a collector (e.g., OpenTelemetry Collector) and visualize in Jaeger, Zipkin, or commercial backends.
9.3 Chaos Engineering
Inject controlled failures to validate resilience.
Network latency injection
```bash
# Add 200 ms of latency on eth0
sudo tc qdisc add dev eth0 root netem delay 200ms

# Remove the rule
sudo tc qdisc del dev eth0 root
```
CPU stress
```bash
# Install stress-ng, then generate CPU load
stress-ng --cpu 4 --timeout 60s
```
Use tools like Chaos Mesh, LitmusChaos, or Gremlin for orchestrated experiments in Kubernetes.
10. Security & Deployment
- VLAN/VRF isolation for tenant separation
- Secure boot and signed kernel modules for host integrity
- Access control for NVMe‑oF targets and RDMA resources
- RDMA subnet manager and fabric authentication where applicable
- Gradual rollout per rack and canary deployments for safe changes
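As one concrete example of target-side access control, the kernel nvmet target can restrict a subsystem to specific host NQNs rather than allowing any host; the NQNs below are placeholders:

```bash
# Register the client's host NQN on the target
sudo mkdir /sys/kernel/config/nvmet/hosts/nqn.2024-01.io.example:host1

# Turn off allow-any-host on the subsystem and whitelist that host
cd /sys/kernel/config/nvmet/subsystems/nqn.2024-01.io.example:target1
echo 0 | sudo tee attr_allow_any_host
sudo ln -s /sys/kernel/config/nvmet/hosts/nqn.2024-01.io.example:host1 allowed_hosts/
```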
11. Appendix: Quick Commands and Corrected Examples
RDMA utilities
```bash
# Show RDMA devices
ibv_devinfo

# Show InfiniBand link status
ibstat

# Simple RDMA ping-pong test (from rdma-core examples)
ibv_rc_pingpong
```
NVMe‑oF
```bash
nvme discover -t rdma -a <ip> -s 4420
nvme connect -t rdma -n <nqn> -a <ip> -s 4420
```
MPI
```bash
mpicc app.c -o app
mpirun -np 64 ./app
```
ZeroMQ
```bash
python3 publisher.py & python3 subscriber.py
```
Thanks for reading!
Follow me for more advanced systems content
Leave a reaction — it helps a lot
Bookmark this post and read it again
Practice the examples and give feedback
See you in the next deep dive! 🚀🫅