Hey Dev Community!
Welcome!
Modern distributed systems are built on one fundamental truth: the network is the computer. Whether you're running HPC workloads, microservices, distributed storage, or large‑scale data pipelines, your system’s performance is ultimately shaped by latency, bandwidth, and coordination.
This guide is a complete, end‑to‑end deep dive into the technologies that power real datacenters:
- RDMA (InfiniBand / RoCE)
- NVMe over Fabrics (NVMe‑oF)
- Cluster scheduling (Kubernetes, Slurm, Nomad)
- Messaging layers (MPI, ZeroMQ)
- Performance engineering
- Observability
- Security & deployment
You’ll get architecture diagrams, protocol explanations, tuning advice, and runnable examples. By the end, you’ll be able to design and operate production‑grade distributed systems.
1. Foundations: Latency, Throughput, and System Model
Distributed systems are shaped by two constraints:
- Latency — how long it takes for a message to travel
- Throughput — how much data you can push per second
Different workloads stress these differently:
| Workload | Latency Sensitivity | Throughput Sensitivity |
|---|---|---|
| HPC (MPI) | Extremely high | Medium |
| Storage (NVMe‑oF) | High | Extremely high |
| Microservices | Medium | Low |
| Batch jobs | Low | Medium |
A modern datacenter is a single distributed supercomputer. Your job is to make communication between nodes:
- fast
- predictable
- scalable
- observable
- secure
System model assumptions
This guide assumes a cluster with:
- Linux hosts
- High‑speed NICs (InfiniBand or RoCE)
- NVMe SSDs
- Kubernetes or Slurm for scheduling
- Root/admin access to configure kernel modules and NICs
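Before diving in, it's worth confirming these assumptions on each host. Here is a minimal check sketch; the module and device names are examples and will vary by vendor and distribution.

```bash
# RDMA kernel modules and devices (names vary by vendor)
lsmod | grep -E 'ib_core|rdma_cm|mlx'
ibv_devinfo | grep -E 'hca_id|link_layer'

# NVMe devices
nvme list

# Scheduler reachability (Kubernetes shown; use sinfo for Slurm)
kubectl get nodes -o wide
```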
2. RDMA — Remote Direct Memory Access
RDMA allows one machine to read/write memory on another machine without involving the remote CPU. Benefits:
- Ultra‑low latency (1–2 µs)
- Extremely high throughput (100–400 Gbps)
- Minimal CPU overhead
2.1 RDMA architecture
```
+---------------------------+
|        Application        |
+---------------------------+
|   RDMA Verbs / RDMA CM    |
+---------------------------+
|      RNIC (RDMA NIC)      |
+---------------------------+
| InfiniBand / RoCE Fabric  |
+---------------------------+
```
2.2 Core RDMA concepts
- Queue Pair (QP) — send queue and receive queue per endpoint
- Completion Queue (CQ) — where completion events are posted
- Memory Region (MR) — pinned memory registered with the RNIC
- Protection Domain (PD) — isolation boundary for QPs and MRs
2.3 RDMA transport types
| Transport | Reliability | Ordering | Use case |
|---|---|---|---|
| RC | Reliable | Ordered | HPC, storage |
| UC | Unreliable | Ordered | Niche low‑latency |
| UD | Unreliable | Unordered | Broadcast, discovery |
2.4 InfiniBand vs RoCE vs iWARP
| Fabric | Latency | Complexity | Notes |
|---|---|---|---|
| InfiniBand | Best | Medium | HPC standard, dedicated fabric |
| RoCEv2 | Very good | High | RDMA over Converged Ethernet; requires PFC/DCQCN tuning |
| iWARP | Moderate | Low | RDMA over TCP; easier deployment but higher latency |
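To see which fabric a host is actually using, inspect the RDMA devices and their link layer. A quick sketch (device, port, and GID index are examples):

```bash
# List RDMA links and their state (iproute2)
rdma link show

# Per-device details: link_layer is "InfiniBand" or "Ethernet" (RoCE)
ibv_devinfo -v | grep -E 'hca_id|link_layer|active_mtu'

# For RoCE, check whether a GID index is RoCE v1 or v2
cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/0
```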
3. RDMA Programming (libibverbs)
A minimal RDMA program lifecycle:
1. Open a device (ibv_open_device)
2. Allocate a protection domain (ibv_alloc_pd)
3. Create a completion queue (ibv_create_cq)
4. Register a memory region (ibv_reg_mr)
5. Create a queue pair (ibv_create_qp)
6. Exchange QP connection info with the peer (out‑of‑band)
7. Transition the QP to RTR, then RTS
8. Post send/receive work requests (ibv_post_send, ibv_post_recv)
9. Poll the completion queue (ibv_poll_cq)
3.1 Minimal RDMA example (conceptual, corrected)
Note: This is a conceptual snippet showing correct API names and structures. Real programs require error handling, out‑of‑band exchange (e.g., TCP) for QP attributes, and proper cleanup.
```c
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main() {
    // Open the first RDMA device
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    struct ibv_device *dev = dev_list[0];
    struct ibv_context *ctx = ibv_open_device(dev);

    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    // Register a page-aligned buffer for local and remote access
    size_t size = 4096;
    void *buf = aligned_alloc(4096, size);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);

    struct ibv_qp_init_attr qp_init = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr = 16,
            .max_recv_wr = 16,
            .max_send_sge = 1,
            .max_recv_sge = 1
        },
        .qp_type = IBV_QPT_RC
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qp_init);

    // Out-of-band: exchange qp->qp_num, lid/gid, mr->rkey, buf address with peer
    // Then modify the QP to INIT, RTR, and RTS using ibv_modify_qp

    // Post a receive
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf,
        .length = size,
        .lkey = mr->lkey
    };
    struct ibv_recv_wr rr = {
        .wr_id = 1,
        .sg_list = &sge,
        .num_sge = 1,
        .next = NULL
    };
    struct ibv_recv_wr *bad_rr;
    ibv_post_recv(qp, &rr, &bad_rr);

    // Poll completions (example)
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) {
        // busy-wait or use an event-driven approach
    }

    // Cleanup omitted for brevity
    return 0;
}
```
3.2 Production tuning tips
- Use hugepages for large pinned allocations
- Pin CQ polling threads and set CPU affinity
- Enable busy‑polling (net.core.busy_poll) for low latency where appropriate
- Configure MTU (e.g., 4096 for InfiniBand or 9000 for Ethernet jumbo frames)
- RoCE tuning: enable PFC, ECN, and tune DCQCN to avoid packet loss and head‑of‑line blocking
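A few of these knobs expressed as shell commands; the values and interface name are illustrative starting points, not universal recommendations:

```bash
# Reserve 2 MB hugepages for large pinned buffers
sudo sysctl -w vm.nr_hugepages=1024

# Enable socket busy-polling (microseconds) for latency-sensitive paths
sudo sysctl -w net.core.busy_poll=50
sudo sysctl -w net.core.busy_read=50

# Raise the MTU on the RDMA-facing Ethernet interface
sudo ip link set dev eth0 mtu 9000
```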
4. NVMe over Fabrics (NVMe‑oF)
NVMe‑oF exposes remote NVMe devices over a network fabric with local‑like performance. Transports include RDMA, TCP, and Fibre Channel.
4.1 NVMe‑oF architecture
```
+-------------------+
|    Application    |
+-------------------+
|    NVMe Driver    |
+-------------------+
| NVMe-oF Transport |
+-------------------+
|  RDMA / TCP / FC  |
+-------------------+
|  Network Fabric   |
+-------------------+
```
4.2 NVMe‑oF transports
| Transport | Latency | Deployment | Notes |
|---|---|---|---|
| RDMA | Best | Hard | Best performance for HPC/storage |
| TCP | Good | Easy | Widely deployable, good performance with kernel NVMe/TCP |
| FC | Enterprise | Medium | SAN environments |
4.3 Example: NVMe‑oF with RDMA (initiator)
```bash
# Discover NVMe-oF targets over RDMA
nvme discover -t rdma -a <ip> -s 4420

# Connect to a discovered target (replace <nqn> and <ip>)
nvme connect -t rdma -n <nqn> -a <ip> -s 4420
```
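The commands above assume a target is already exporting a subsystem. For reference, here is a rough sketch of configuring the Linux kernel target (nvmet) through configfs; the NQN, backing device, and <ip> are placeholders:

```bash
# Load the target modules
sudo modprobe nvmet nvmet-rdma

# Create a subsystem (NQN is a placeholder)
sudo mkdir /sys/kernel/config/nvmet/subsystems/nqn.2024-01.io.example:target1
cd /sys/kernel/config/nvmet/subsystems/nqn.2024-01.io.example:target1
echo 1 | sudo tee attr_allow_any_host   # restrict with allowed_hosts in production

# Add a namespace backed by a local NVMe device
sudo mkdir namespaces/1
echo /dev/nvme0n1 | sudo tee namespaces/1/device_path
echo 1 | sudo tee namespaces/1/enable

# Expose the subsystem on an RDMA port
sudo mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo <ip>  | sudo tee addr_traddr
echo rdma  | sudo tee addr_trtype
echo 4420  | sudo tee addr_trsvcid
echo ipv4  | sudo tee addr_adrfam
sudo ln -s /sys/kernel/config/nvmet/subsystems/nqn.2024-01.io.example:target1 subsystems/
```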
4.4 Benchmark example with fio
```bash
# WARNING: this writes directly to the raw device and will destroy data on it.
# --ioengine=libaio and --direct=1 are needed for iodepth to have any effect.
fio --name=randrw --filename=/dev/nvme1n1 \
    --rw=randrw --bs=4k --iodepth=64 --ioengine=libaio --direct=1 \
    --numjobs=4 --runtime=60 --time_based
```
5. Cluster Scheduling & Resource Management
Schedulers decide where workloads run and how resources are allocated. Examples: Kubernetes, Slurm, Nomad.
5.1 Kubernetes key features
- Pods, Deployments, StatefulSets
- Resource requests and limits
- Taints and tolerations
- Device plugins (GPU, SR‑IOV, NVMe)
- Topology Manager and CPU Manager for NUMA awareness
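As an example, a pod can request CPU and memory alongside an RDMA device exposed by a device plugin. A minimal sketch, assuming the Mellanox shared-device plugin is installed; the resource name rdma/hca_shared_devices_a and the container image are placeholders that depend on your plugin configuration:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test
spec:
  containers:
  - name: app
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "4"
        memory: 8Gi
        rdma/hca_shared_devices_a: 1
      limits:
        cpu: "4"
        memory: 8Gi
        rdma/hca_shared_devices_a: 1
EOF
```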
5.2 Slurm (HPC) example job script
SBATCH directives must start with #SBATCH and appear at the top of the script.
```bash
#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=01:00:00
#SBATCH --partition=compute

module load mpi
srun --mpi=pmix_v3 ./my_mpi_app
```
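Submit and monitor the job (assuming the script above is saved as mpi_job.sh):

```bash
sbatch mpi_job.sh          # submit; prints the job ID
squeue -u $USER            # list your pending/running jobs
scontrol show job <jobid>  # node placement and resource details
sacct -j <jobid>           # accounting info after the job finishes
```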
5.3 Scheduling algorithms and strategies
- Bin packing for utilization
- Gang scheduling for tightly coupled parallel jobs
- Preemption for priority handling
- NUMA‑aware placement to reduce cross‑socket memory access
- Network‑aware placement to reduce cross‑rack traffic and latency
6. Messaging: MPI & ZeroMQ
6.1 MPI (Message Passing Interface)
MPI is the backbone of HPC. Correct minimal example:
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d/%d\n", rank, size);

    if (rank == 0) {
        int data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int recv;
        MPI_Recv(&recv, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", recv);
    }

    MPI_Finalize();
    return 0;
}
```
Build and run
```bash
mpicc -o mpi_example mpi_example.c
mpirun -np 4 ./mpi_example
```
6.2 ZeroMQ PUB/SUB example
Publisher
```python
# publisher.py
import zmq
import time

ctx = zmq.Context()
sock = ctx.socket(zmq.PUB)
sock.bind("tcp://*:5556")

while True:
    sock.send_string("topic1 Hello subscribers")
    time.sleep(1)
```
Subscriber
```python
# subscriber.py
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.SUB)
sock.connect("tcp://localhost:5556")
sock.setsockopt_string(zmq.SUBSCRIBE, "topic1")

while True:
    msg = sock.recv_string()
    print("Received:", msg)
```
Run publisher and subscriber in separate terminals.
7. Performance Engineering
7.1 Network
- Enable jumbo frames where supported (MTU 9000)
- Tune RSS/XPS to distribute interrupts across CPUs
- Set IRQ affinity for RNIC queues
- RoCE: configure PFC, ECN, and DCQCN to avoid packet loss
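The corresponding commands look roughly like this; the interface name, queue count, and IRQ number are examples:

```bash
# Jumbo frames
sudo ip link set dev eth0 mtu 9000

# Spread NIC queues (RSS) across more CPUs
sudo ethtool -L eth0 combined 8

# Pin a NIC queue's IRQ to a specific core
echo 4 | sudo tee /proc/irq/120/smp_affinity_list
```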
7.2 CPU
- Pin critical threads to dedicated cores
- Use isolcpus kernel parameter for latency‑sensitive workloads
- Disable frequency scaling (set performance governor) for predictable latency
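A sketch of these settings; core IDs and the binary name are examples:

```bash
# Use the performance governor on all cores (cpupower from linux-tools)
sudo cpupower frequency-set -g performance

# Pin a latency-critical process to cores 2-3
taskset -c 2,3 ./latency_critical_app

# Kernel command line (set in the bootloader) to isolate cores 2-7:
#   isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7
```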
7.3 Memory
- Use hugepages for large allocations and to reduce TLB pressure
- Align allocations to NUMA nodes and schedule accordingly
- Tune vm.swappiness to avoid swapping
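For example:

```bash
# Run a process with CPU and memory bound to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./app

# Reduce swap pressure
sudo sysctl -w vm.swappiness=10

# Inspect hugepage usage
grep -i huge /proc/meminfo
```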
7.4 Storage
- Use appropriate I/O scheduler (e.g., none or mq-deadline for NVMe)
- Tune queue depth and I/O parallelism for NVMe devices
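For example (the device name is a placeholder):

```bash
# Check and set the I/O scheduler for an NVMe namespace
cat /sys/block/nvme1n1/queue/scheduler
echo none | sudo tee /sys/block/nvme1n1/queue/scheduler

# Inspect the request queue depth exposed by the block layer
cat /sys/block/nvme1n1/queue/nr_requests
```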
8. Observability & Debugging
Tools and techniques to observe and debug distributed systems:
- iperf3 for network throughput testing
- ibv_devinfo, ibstat for RDMA device info
- perf, bpftrace, bcc for profiling and tracing
- tcpdump for packet captures
- fio for storage benchmarking
- OpenTelemetry for distributed tracing, metrics, and logs
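A few representative invocations; hostnames and interface names are placeholders:

```bash
iperf3 -s                              # on the server
iperf3 -c <server_ip> -P 4             # client, 4 parallel streams

ibv_devinfo                            # RDMA device capabilities
sudo tcpdump -i eth0 -w capture.pcap   # raw packet capture for offline analysis
sudo perf top                          # live CPU profile
```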
9. Observability Deep Dive: eBPF, OpenTelemetry, Chaos Engineering
9.1 eBPF — kernel‑level observability
eBPF programs run safely in the kernel and can attach to kprobes, uprobes, tracepoints, and sockets.
Example: simple kprobe program (libbpf style)
```c
// trace_tcp_connect.c (libbpf skeleton omitted for brevity)
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("kprobe/tcp_connect")
int kprobe_tcp_connect(struct pt_regs *ctx) {
    bpf_printk("tcp_connect called");
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```
Load it with bpftool or a libbpf-based loader. Use bpftrace for quick experiments:
```bash
sudo bpftrace -e 'kprobe:tcp_connect { printf("tcp_connect %s\n", comm); }'
```
9.2 OpenTelemetry — distributed tracing
Instrument services to produce spans and traces. Example C++ pseudo‑usage:
```cpp
#include <opentelemetry/trace/provider.h>

auto provider = opentelemetry::trace::Provider::GetTracerProvider();
auto tracer = provider->GetTracer("service");
auto span = tracer->StartSpan("handle_request");
{
    auto scope = tracer->WithActiveSpan(span);
    // work...
}
span->End();
```
Export traces to a collector (e.g., OpenTelemetry Collector) and visualize in Jaeger, Zipkin, or commercial backends.
9.3 Chaos Engineering
Inject controlled failures to validate resilience.
Network latency injection
```bash
# Add 200 ms of latency on eth0
sudo tc qdisc add dev eth0 root netem delay 200ms

# Remove the rule
sudo tc qdisc del dev eth0 root
```
CPU stress
```bash
# Install stress-ng, then generate CPU load
stress-ng --cpu 4 --timeout 60s
```
Use tools like Chaos Mesh, LitmusChaos, or Gremlin for orchestrated experiments in Kubernetes.
10. Security & Deployment
- VLAN/VRF isolation for tenant separation
- Secure boot and signed kernel modules for host integrity
- Access control for NVMe‑oF targets and RDMA resources
- RDMA subnet manager and fabric authentication where applicable
- Gradual rollout per rack and canary deployments for safe changes
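As one concrete example of target-side access control, the kernel nvmet target can restrict a subsystem to specific host NQNs rather than allowing any host; the NQNs below are placeholders:

```bash
# Register the client's host NQN on the target
sudo mkdir /sys/kernel/config/nvmet/hosts/nqn.2024-01.io.example:host1

# Turn off allow-any-host on the subsystem and whitelist that host
cd /sys/kernel/config/nvmet/subsystems/nqn.2024-01.io.example:target1
echo 0 | sudo tee attr_allow_any_host
sudo ln -s /sys/kernel/config/nvmet/hosts/nqn.2024-01.io.example:host1 allowed_hosts/
```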
11. Appendix: Quick Commands and Corrected Examples
RDMA utilities
```bash
# Show RDMA devices
ibv_devinfo

# Show InfiniBand link status
ibstat

# Simple RDMA ping-pong test (from rdma-core examples)
ibv_rc_pingpong
```
NVMe‑oF
```bash
nvme discover -t rdma -a <ip> -s 4420
nvme connect -t rdma -n <nqn> -a <ip> -s 4420
```
MPI
```bash
mpicc app.c -o app
mpirun -np 64 ./app
```
ZeroMQ
```bash
python3 publisher.py & python3 subscriber.py
```
Thanks for reading!
Follow me for more advanced systems content
Leave a reaction — it helps a lot
Bookmark this post and read it again
Practice the examples and give feedback
See you in the next deep dive! 🚀🫅