Introduction
The latest iteration of NDM-TCP (v3.0.4-hyper-entropy) represents a fundamental shift in high-speed congestion control. While previous optimized versions focused on vectorized SIMD (AVX) instructions, the Hyper-Embedded build prioritizes deterministic execution latency and zero-branch neural inference to support 100Gbps+ environments with sub-microsecond processing budgets.
GitHub: hejdiss/lkm-ndm-tcp
What's New in the Hyper Build
The transition from the "Optimized" v1.0/v2.0 builds to the "Hyper" v3.0 build focuses on eliminating CPU pipeline stalls and driving the per-ACK cycle count as low as possible.
1. Zero-Branch Neural Inference
Unlike earlier versions that relied on if/else logic for ReLU activation or manifold gradients, v3.0 uses bit-masking. It derives a mask from the RTT delta (mask = -(s32)(rtt > min_rtt)), which is all ones when rtt exceeds min_rtt and zero otherwise, so the CPU executes the neural forward pass in constant time and avoids branch mispredictions that cost up to 20 cycles per packet.
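The masking idea can be illustrated with a small user-space sketch. The helper name rtt_delta_relu is hypothetical and does not come from the module source; it only shows how the mask expression quoted above turns a comparison into a ReLU without an if/else. It assumes two's-complement integers and RTT differences that fit in 31 bits.

```c
#include <stdint.h>

/* Hypothetical helper (not from the module): branchless ReLU on the RTT delta.
 * (rtt > min_rtt) evaluates to 0 or 1, so negating it yields either
 * 0x00000000 or 0xFFFFFFFF, which is then used as an AND mask. */
static inline int32_t rtt_delta_relu(uint32_t rtt, uint32_t min_rtt)
{
    int32_t mask  = -(int32_t)(rtt > min_rtt); /* all ones if rtt > min_rtt */
    int32_t delta = (int32_t)(rtt - min_rtt);  /* wraps when rtt < min_rtt */

    return delta & mask;                       /* delta if positive, else 0 */
}
```

For example, rtt_delta_relu(150, 100) yields 50, while any rtt at or below min_rtt yields 0. Whether the compiler emits truly branch-free machine code still depends on the target and optimization level, so the generated assembly is worth checking.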
2. Multiplier-Free Math (Shift-Add)
Standard integer multiplications have been replaced with power-of-two shifts.
Old Optimized: Used vpmaddwd (AVX) or imul (Scalar).
Hyper v3.0: Uses SHL/SHR approximations. This allows the neural weights to be processed by the CPU's simple ALU ports, bypassing the latency of the hardware multiplier entirely.
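As a rough illustration of the shift-add technique, the sketch below approximates a few fixed weights using only shifts, additions, and subtractions. The function names and the particular weights (0.75, 1.125, 10) are assumptions chosen for the example; the actual weights used by the module are not given here.

```c
#include <stdint.h>

/* Hypothetical examples of multiplier-free scaling (weights are assumed).
 * Each weight is expressed as a sum or difference of powers of two. */

/* x * 0.75, computed as x - x/4 */
static inline uint32_t mul_three_quarters(uint32_t x)
{
    return x - (x >> 2);
}

/* x * 1.125, computed as x + x/8 */
static inline uint32_t mul_nine_eighths(uint32_t x)
{
    return x + (x >> 3);
}

/* x * 10, computed as 8x + 2x */
static inline uint32_t mul_ten(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```

Right shifts truncate, so the fractional weights are approximations: mul_three_quarters(10) returns 8 rather than 7.5. That loss of precision is the trade-off for staying on shift and add operations.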
3. Hyper-Fast Bit-Stream Entropy
The entropy calculation has been refactored from histogram binning into a Hamming Weight (Population Count) analysis of an 8-bit RTT trend window.
It uses the native CPU POPCNT instruction via hweight8().
This reduces the entropy calculation from ~40 cycles down to ~3 cycles.
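A minimal user-space sketch of the idea follows. It assumes each bit of the 8-bit window records whether an RTT sample rose relative to the previous one, and it assumes a simple proxy where a balanced mix of rises and falls counts as maximum disorder. Both the bit encoding and the scoring rule are assumptions, as are the function names; __builtin_popcount (GCC/Clang) stands in for the kernel's hweight8().

```c
#include <stdint.h>

/* Assumed encoding: shift the window left and record 1 if RTT rose. */
static inline uint8_t trend_push(uint8_t window, uint32_t rtt, uint32_t prev_rtt)
{
    return (uint8_t)((window << 1) | (rtt > prev_rtt));
}

/* User-space stand-in for the kernel's hweight8(). */
static inline unsigned int hweight8_sketch(uint8_t w)
{
    return (unsigned int)__builtin_popcount(w);
}

/* Assumed entropy proxy: 0 when all samples moved the same way,
 * 4 when rises and falls are evenly mixed (4 bits set out of 8). */
static inline unsigned int trend_entropy(uint8_t window)
{
    unsigned int ones = hweight8_sketch(window);

    return ones <= 4 ? ones : 8 - ones;
}
```

A population count ignores bit order, so under this proxy the windows 0xAA (alternating) and 0x0F (four rises after four falls) score the same. That is part of what makes the calculation so cheap compared with histogram binning.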
4. 1-600 Scaled Plasticity (CPA)
To replace floating-point sensitivity (0.1 to 6.0), the Hyper version implements a fixed-point integer scale of 1 to 600.
Base (1.0): Represented as 100.
Saturating Logic: Prevents 8-bit or 16-bit overflows during high-congestion learning phases.
Cooling Effect: A post-loss "cooldown" reduces plasticity by 25% per window to stabilize the manifold.
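The three points above can be sketched in a few lines of user-space C. The constant and function names are hypothetical, and the 16-bit storage type is an assumption; only the 1 to 600 range, the base value of 100, and the 25% cooldown come from the description above.

```c
#include <stdint.h>

#define PLASTICITY_MIN   1u    /* lowest allowed value on the 1-600 scale */
#define PLASTICITY_BASE  100u  /* represents 1.0 */
#define PLASTICITY_MAX   600u  /* represents 6.0 */

/* Saturating increase: widen to 32 bits before adding so the sum
 * cannot wrap, then clamp to the top of the scale. */
static inline uint16_t plasticity_add(uint16_t p, uint16_t step)
{
    uint32_t sum = (uint32_t)p + step;

    return (uint16_t)(sum > PLASTICITY_MAX ? PLASTICITY_MAX : sum);
}

/* Post-loss cooldown: subtract one quarter (25%) using a shift,
 * and never drop below the bottom of the scale. */
static inline uint16_t plasticity_cool(uint16_t p)
{
    uint16_t cooled = (uint16_t)(p - (p >> 2));

    return cooled < PLASTICITY_MIN ? PLASTICITY_MIN : cooled;
}
```

Starting from the maximum of 600, successive cooldown windows give 450, 338, 254, and so on, while plasticity_add(590, 50) stays pinned at 600 instead of overflowing.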
5. Extreme Memory Compression
The internal state struct (ndm_tcp) has been compressed to 16 bytes.
This is exactly one quarter of a standard 64-byte cache line.
Benefit: Four flows can fit their congestion state into a single L1 cache line fetch, significantly reducing memory bus contention on multi-core 100G servers.
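To make the size budget concrete, here is one way a 16-byte state could be laid out and checked at compile time. The struct name and every field in it are hypothetical; the real ndm_tcp layout is not shown in this post. The sketch only demonstrates how a compile-time assertion can guard the 16-byte target.

```c
#include <stdint.h>

/* Hypothetical layout, not the module's actual struct.
 * Fields are ordered largest first so no padding is inserted. */
struct ndm_state_sketch {
    uint32_t min_rtt_us;    /* 4 bytes: lowest RTT observed */
    uint32_t last_rtt_us;   /* 4 bytes: most recent RTT sample */
    uint16_t plasticity;    /* 2 bytes: value on the 1-600 scale */
    uint16_t flags;         /* 2 bytes: assumed state flags */
    uint8_t  trend_window;  /* 1 byte:  8-bit RTT trend window */
    uint8_t  entropy;       /* 1 byte:  last entropy score */
    uint8_t  weights[2];    /* 2 bytes: assumed packed shift amounts */
};

/* Fail the build if the struct grows past the 16-byte budget
 * or stops dividing a 64-byte cache line evenly. */
_Static_assert(sizeof(struct ndm_state_sketch) == 16,
               "state must stay at 16 bytes");
_Static_assert(64 % sizeof(struct ndm_state_sketch) == 0,
               "state must pack evenly into a 64-byte cache line");
```

Inside a kernel module, BUILD_BUG_ON() plays the same role as _Static_assert does here.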
Important Disclaimers
The Hyper-Embedded version is designed for production environments where CPU SoftIRQ overhead is the primary bottleneck. By reducing the cycle count by nearly 70% compared to the first optimized build, NDM-TCP v3.0 is intended to keep the congestion control logic from becoming the bottleneck for 100Gbps line-rate processing. The expected cost is around 25-30 cycles per ACK.
Compilation & Usage
The module is self-contained and does not require FPU/SIMD state saving:
# Rename
cp ndm_tcp_lkm_hyp.c ndm_tcp_lm.c
# Standard Build
make
# Load Module
make enable
The module automatically detects RTT manifold shifts and applies the 1-600 plasticity gain to the CWND growth curve, stabilized by the high-speed entropy noise-floor damping.