Muhammed Shafin P

Posted on Feb 14

NDM-TCP: The 100Gbps Ultra-Low Latency Build

#ai #discuss #hejhdiss #network

What's New in the Optimized Build (v2.0.0-100g)

The "Ultra Optimized" build of NDM-TCP represents a radical shift from the standard v1.0 logic. While the standard version prioritizes mathematical precision and readability, this 100Gbps target build prioritizes CPU cache locality and interrupt-context efficiency.

This version is designed specifically for high-throughput environments (100GbE/400GbE) where the CPU budget per packet is measured in nanoseconds.
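To put "measured in nanoseconds" into numbers, here is a rough back-of-the-envelope calculation. It assumes a standard 1500-byte MTU Ethernet frame (1538 bytes on the wire once the 14-byte header, 4-byte FCS, 8-byte preamble, and 12-byte inter-frame gap are counted) and, as the worst case, 64-byte minimum frames (84 bytes on the wire). These frame sizes are assumptions for illustration, not figures from the NDM-TCP source.

```latex
\[
t_{\text{full}} = \frac{1538 \times 8\ \text{bit}}{100 \times 10^{9}\ \text{bit/s}} \approx 123\ \text{ns},
\qquad
t_{\text{min}} = \frac{84 \times 8\ \text{bit}}{100 \times 10^{9}\ \text{bit/s}} \approx 6.7\ \text{ns}
\]
```

At 3 GHz, 123 ns is roughly 370 cycles for the whole stack, not just congestion control. In practice offloads such as TSO and GRO batch packets, so the congestion control hooks run less often than once per wire frame, but the order of magnitude explains why every cache miss matters.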

GitHub: hejdiss/lkm-ndm-tcp

Key Optimizations vs v1.0

1. Aggressive Quantization (s8/u8)

v1.0: Used s32 for inputs and s16 for weights.
100G Build: Converted the entire neural network pipeline to signed 8-bit integers (s8).

Impact: Inputs shrink from 4 bytes to 1 (a 75% cut in memory traffic) and weights from 2 bytes to 1 (a 50% cut). The entire weight matrix now fits in L1 cache, and vector operations can be performed using standard integer registers without complex casting.
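A minimal sketch of what an s8 pipeline step can look like, written as userspace C with `stdint.h` types standing in for the kernel's `s8`/`s32`. The names `clamp_s8`, `quantize_s8`, `neuron_s8`, the layer width `NDM_INPUTS`, and the shift-based scaling are all assumptions for illustration; the module's actual scheme may differ.

```c
#include <stdint.h>

/* Hypothetical layer width; the real module's dimensions are not stated. */
#define NDM_INPUTS 8

/* Saturate a 32-bit value into the signed 8-bit range. */
static inline int8_t clamp_s8(int32_t v)
{
    if (v > INT8_MAX)
        return INT8_MAX;
    if (v < INT8_MIN)
        return INT8_MIN;
    return (int8_t)v;
}

/*
 * Quantize an s32 input with a right shift instead of a division.
 * Assumes arithmetic right shift for negative values, which GCC and
 * Clang provide (the C standard leaves it implementation-defined).
 */
static inline int8_t quantize_s8(int32_t x, unsigned int shift)
{
    return clamp_s8(x >> shift);
}

/* One neuron: s8 x s8 products accumulated in s32, rescaled back to s8. */
static inline int8_t neuron_s8(const int8_t *in, const int8_t *w,
                               unsigned int out_shift)
{
    int32_t acc = 0;
    int i;

    for (i = 0; i < NDM_INPUTS; i++)
        acc += (int32_t)in[i] * (int32_t)w[i];

    return clamp_s8(acc >> out_shift);
}
```

The accumulator stays 32 bits wide so that eight products of up to 127 x 127 cannot overflow; only the stored inputs and weights are 8-bit.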

2. Single-Cache-Line Struct (40 Bytes)

v1.0: The ndm_tcp struct was packed to fit within ICSK_CA_PRIV_SIZE and used most of that budget. (The limit was 64 bytes on older kernels; recent kernels allow 104 bytes.)
100G Build: Compressed to exactly 40 bytes.

Impact: The state fits within a single 64-byte x86 cache line. As long as it does not straddle a line boundary, the CPU pulls the entire context (history, weights, flags) in one memory fetch, avoiding extra cache misses on the critical path.
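To illustrate how such a budget can be enforced at compile time, here is a sketch of one possible 40-byte layout. The struct name `ndm_tcp_sketch` and every field in it are hypothetical; only the 40-byte total and the history/weights/flags grouping come from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 4 + 4 + 16 + 12 + 4 x 1 = 40 bytes, no padding. */
struct ndm_tcp_sketch {
    uint32_t min_rtt_us;       /* lowest RTT seen so far            */
    uint32_t prev_cwnd;        /* last computed congestion window   */
    uint8_t  history[16];      /* quantized RTT history ring        */
    int8_t   weights[12];      /* s8 neural network weights         */
    uint8_t  hist_idx;         /* next write slot in history[]      */
    uint8_t  nn_skip_counter;  /* forward passes skipped in a row   */
    uint8_t  entropy;          /* last entropy estimate             */
    uint8_t  flags;            /* state bits                        */
};

/* Fail the build if the layout ever grows past the intended budget. */
static_assert(sizeof(struct ndm_tcp_sketch) == 40,
              "congestion state must stay at 40 bytes");
static_assert(sizeof(struct ndm_tcp_sketch) <= 64,
              "congestion state must fit one cache line");
```

Inside a kernel module the equivalent guard is `BUILD_BUG_ON(sizeof(struct ...) > ICSK_CA_PRIV_SIZE)`, the same check `tcp_cubic.c` performs when it registers. Ordering the fields from widest to narrowest is what keeps the compiler from inserting padding.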

3. Bitwise Entropy Calculation

v1.0: Used division and loops to calculate Shannon entropy.
100G Build: Replaced division with bitwise shifts chosen from the magnitude of the sample range. The loop is unrolled and operates on u8 history data, allowing the CPU to calculate entropy in fewer than 20 cycles.
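One way to get a division-free entropy estimate is sketched below. It assumes an 8-sample u8 history binned into 4 buckets, picks the bin shift from the size of the max-min range, and looks up c * log2(c) in a small fixed-point table. The names `entropy_q5` and `clog2c_q5`, the sizes, and the Q5 scaling are illustrative assumptions, not the module's actual code, and no cycle count is claimed for it.

```c
#include <stdint.h>

#define HIST_LEN 8   /* assumed history depth (power of two)   */
#define NUM_BINS 4   /* assumed histogram width (power of two) */

/* c * log2(c) in Q5 fixed point (scaled by 32) for c = 0..HIST_LEN. */
static const uint16_t clog2c_q5[HIST_LEN + 1] = {
    0, 0, 64, 152, 256, 372, 496, 629, 768
};

/* Shannon entropy of the binned history in Q5 bits: 0 (flat) to 64 (2 bits). */
static inline uint8_t entropy_q5(const uint8_t hist[HIST_LEN])
{
    uint8_t min = hist[0], max = hist[0];
    uint8_t count[NUM_BINS] = { 0 };
    unsigned int shift = 0, sum = 0;
    int i;

    for (i = 1; i < HIST_LEN; i++) {
        if (hist[i] < min)
            min = hist[i];
        if (hist[i] > max)
            max = hist[i];
    }

    /* Smallest shift that maps the range onto bins 0..NUM_BINS-1. */
    while (((unsigned int)(max - min) >> shift) >= NUM_BINS)
        shift++;

    /* Binning by shift replaces (value - min) * NUM_BINS / range. */
    for (i = 0; i < HIST_LEN; i++)
        count[(hist[i] - min) >> shift]++;

    for (i = 0; i < NUM_BINS; i++)
        sum += clog2c_q5[count[i]];

    /* H = log2(N) - (1/N) * sum(c * log2(c)); with N = 8, /N is >> 3. */
    return (uint8_t)((3 * HIST_LEN * 32 - sum) >> 3);
}
```

Because the history length and bin count are compile-time constants, a compiler at `-O2` can unroll all three loops, and the only run-time table is nine 16-bit entries.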

4. "Stable State" Neural Bypass

The module now includes an nn_skip_counter. If the network entropy is low (stable) and plasticity is high, the algorithm assumes the network state hasn't changed enough to warrant a full forward pass. It reuses the previous cwnd calculation for up to 16 packets, which removes most forward passes from the per-packet path during bulk data transfers.
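The bypass decision described above can be sketched as a small gate function. Only `nn_skip_counter` and the limit of 16 come from the text; the struct `nn_bypass_state`, the function `nn_can_skip`, and both thresholds are hypothetical values chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define NN_SKIP_MAX        16  /* from the text: reuse for up to 16 packets */
#define ENTROPY_STABLE_MAX 16  /* assumed "low entropy" threshold           */
#define PLASTICITY_MIN     96  /* assumed "high plasticity" threshold       */

struct nn_bypass_state {
    uint32_t cached_cwnd;      /* result of the last full forward pass */
    uint8_t  nn_skip_counter;  /* consecutive packets served from cache */
    uint8_t  entropy;
    uint8_t  plasticity;
};

/*
 * Returns true when the cached cwnd may be reused for this packet.
 * Returns false when a full forward pass is due, and resets the
 * counter so the next stable stretch starts a fresh run of skips.
 */
static inline bool nn_can_skip(struct nn_bypass_state *s)
{
    bool stable = s->entropy <= ENTROPY_STABLE_MAX &&
                  s->plasticity >= PLASTICITY_MIN;

    if (stable && s->nn_skip_counter < NN_SKIP_MAX) {
        s->nn_skip_counter++;
        return true;
    }

    s->nn_skip_counter = 0;
    return false;
}
```

A caller would return `cached_cwnd` when the gate returns true, and otherwise run the network and store the new result. Any rise in entropy ends the skip run immediately, so the cache is never reused across a detected change in conditions.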

Important Disclaimers

This optimized version is a specialized low-latency implementation.

Precision: The move to 8-bit quantization reduces the "resolution" of the neural network. While sufficient for TCP congestion control (which is inherently noisy), it effectively trades mathematical purity for raw speed.
Performance: You should expect a 50-60% reduction in CPU cycles per packet. Throughput gains will be most noticeable on CPU-bound senders driving 100Gbps links.

Compilation Instructions

The Linux kernel build system expects the source file name to match the module name defined in the Makefile. To compile this ultra-optimized version, copy it over the standard source file name after backing up the original.

Step 1: Backup standard version

mv ndm_tcp_lkm.c ndm_tcp_lkm.c.bak

Step 2: Copy the optimized source into place

cp ndm_tcp_optimized_ultra.c ndm_tcp_lkm.c

Step 3: Compile

make

Step 4: Load Module

make enable
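Once a congestion control module is loaded, an application can opt in per socket through the standard `TCP_CONGESTION` socket option. The sketch below wraps that call; the helper name `set_congestion_control` is hypothetical, and the algorithm name the module registers is not stated here, so `"ndm_tcp"` in the usage note is an assumption to be checked against `/proc/sys/net/ipv4/tcp_available_congestion_control`.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Select a congestion control algorithm for one TCP socket.
 * Returns 0 on success, -1 on failure with errno set by setsockopt()
 * (for example ENOENT when no algorithm of that name is registered).
 */
static int set_congestion_control(int fd, const char *name)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                      name, (socklen_t)strlen(name));
}
```

For example, `set_congestion_control(fd, "ndm_tcp")` on a freshly created `SOCK_STREAM` socket, assuming that is the registered name. This leaves the system-wide default untouched, which is a safer way to trial an experimental algorithm on a single flow.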

DEV Community