NDM-TCP Hyper-Embedded (v3.0): Technical Overview

#architecture #machinelearning #networking #performance

Introduction

The latest iteration of NDM-TCP (v3.0.4-hyper-entropy) represents a fundamental shift in high-speed congestion control. While previous optimized versions focused on vectorized SIMD (AVX) instructions, the Hyper-Embedded build prioritizes deterministic execution latency and zero-branch neural inference to support 100Gbps+ environments with sub-microsecond processing budgets.

GitHub: hejdiss/lkm-ndm-tcp

What's New in the Hyper Build

The transition from the "Optimized" v1.0/v2.0 to the "Hyper" v3.0 focuses on eliminating CPU pipeline stalls and driving the per-ACK cycle count as low as possible.

1. Zero-Branch Neural Inference

Unlike earlier versions that relied on if/else logic for ReLU activation or manifold gradients, v3.0 uses bit-masking. It derives an all-ones or all-zeros mask from the RTT comparison (mask = -(s32)(rtt > min_rtt)) and applies it with bitwise operations, so the neural forward pass executes in constant time. This avoids branch mispredictions and Branch Target Buffer (BTB) misses, which can cost up to 20 cycles per packet.
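The mask trick can be tried in userspace. The following is a minimal sketch, not the module's code: the function names are hypothetical, and the s32/u32 typedefs stand in for the kernel's own types. It builds the mask from the expression quoted above and uses it to compute a ReLU-style RTT excess.

```c
#include <assert.h>   /* static_assert */
#include <stdint.h>

/* Userspace stand-ins for the kernel's s32/u32 typedefs. */
typedef int32_t s32;
typedef uint32_t u32;

static_assert(sizeof(u32) == 4, "the mask trick below assumes 32-bit words");

/* Hypothetical helper: all-ones when rtt > min_rtt, all-zeros otherwise. */
static inline u32 rtt_mask(u32 rtt, u32 min_rtt)
{
    /* (rtt > min_rtt) is 0 or 1; negating it gives 0 or -1 (0xFFFFFFFF). */
    return (u32)-(s32)(rtt > min_rtt);
}

/*
 * Hypothetical ReLU on the RTT delta: returns rtt - min_rtt when
 * rtt > min_rtt and 0 otherwise, with no if/else in the source.
 */
static inline u32 rtt_excess_branchless(u32 rtt, u32 min_rtt)
{
    /* Unsigned subtraction may wrap; the mask zeroes the wrapped result. */
    return (rtt - min_rtt) & rtt_mask(rtt, min_rtt);
}
```

Whether the generated machine code is truly branch-free depends on the compiler and target, so the disassembly is worth checking before relying on constant-time behavior.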

2. Multiplier-Free Math (Shift-Add)

We have replaced standard integer multiplications with Power-of-Two Shifts.

Old Optimized: Used vpmaddwd (AVX) or imul (Scalar).
Hyper v3.0: Uses SHL/SHR approximations. This allows the neural weights to be processed by the CPU's simple ALU ports, bypassing the latency of the hardware multiplier entirely.
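As an illustration of the shift-add idea, the sketch below approximates a fractional weight as the sum of two negative powers of two and scales by small integers without a multiply. The helper names and the two-term weight encoding are assumptions made for this example; the source does not say how the module encodes its weights.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32; /* userspace stand-in for the kernel typedef */

/*
 * Hypothetical helper: approximates x * (2^-a + 2^-b) using only shifts
 * and one add. Example: a = 1, b = 2 gives a weight of 0.5 + 0.25 = 0.75.
 */
static inline u32 shift_add_weight(u32 x, unsigned int a, unsigned int b)
{
    /* Shifting a 32-bit value by 32 or more is undefined behavior in C. */
    assert(a < 32 && b < 32);
    return (x >> a) + (x >> b);
}

/* Hypothetical helper: x * 5 computed as x * 4 + x. */
static inline u32 mul_by_5(u32 x)
{
    return (x << 2) + x;
}
```

Each right shift truncates, so results can be slightly below the exact product (for instance, a weight of 0.75 applied to 10 yields 7 rather than 7.5). That rounding error is the price of avoiding the multiplier.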

3. Hyper-Fast Bit-Stream Entropy

The entropy calculation has been refactored from histogram binning into a Hamming Weight (Population Count) analysis of an 8-bit RTT trend window.

It uses the native CPU POPCNT instruction via hweight8().
This reduces the entropy calculation from ~40 cycles down to ~3 cycles.
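A userspace sketch of the idea follows. It assumes each bit of the window records whether an RTT sample rose relative to the previous one, and it uses the count of the minority bit as a 0-4 entropy proxy. Both choices, and all function names, are assumptions for illustration; the source does not specify how the module maps the population count to an entropy value. The portable bit-trick popcount here stands in for the kernel's hweight8().

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32; /* userspace stand-in for the kernel typedef */

/* Portable population count for 8 bits (stand-in for the kernel's hweight8()). */
static inline unsigned int hweight8_sketch(uint8_t w)
{
    unsigned int v = w;

    v = v - ((v >> 1) & 0x55u);            /* 2-bit partial sums */
    v = (v & 0x33u) + ((v >> 2) & 0x33u);  /* 4-bit partial sums */
    return (v + (v >> 4)) & 0x0Fu;         /* final count, 0..8  */
}

/* Hypothetical: shift in 1 if the RTT rose since the last sample, else 0. */
static inline uint8_t trend_push(uint8_t window, u32 rtt, u32 prev_rtt)
{
    return (uint8_t)((window << 1) | (rtt > prev_rtt));
}

/*
 * Hypothetical entropy proxy: 0 when the window is all rises or all falls,
 * 4 when rises and falls are evenly mixed.
 */
static inline unsigned int trend_entropy_proxy(uint8_t window)
{
    unsigned int ones = hweight8_sketch(window);
    unsigned int zeros;

    assert(ones <= 8); /* an 8-bit window cannot hold more than 8 set bits */
    zeros = 8 - ones;
    return ones < zeros ? ones : zeros;
}
```

A population count ignores bit order, so an alternating window (0xAA) and a single regime change (0x0F) score the same under this proxy.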

4. 1-600 Scaled Plasticity (CPA)

To replace floating-point sensitivity (0.1 to 6.0), the Hyper version implements a fixed-point integer scale of 1 to 600.

Base (1.0): Represented as 100.
Saturating Logic: Prevents 8-bit or 16-bit overflows during high-congestion learning phases.
Cooling Effect: A post-loss "cooldown" reduces plasticity by 25% per window to stabilize the manifold.
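The scale, saturation, and cooldown described above can be sketched as follows. The constant and function names are hypothetical, and the 25% cooldown is implemented here as p - (p >> 2), which is one plausible reading of the description rather than the module's confirmed arithmetic.

```c
#include <assert.h>   /* static_assert */
#include <stdint.h>

typedef uint32_t u32; /* userspace stand-in for the kernel typedef */

#define PLAST_MIN  1u    /* lowest value on the 1-600 scale  */
#define PLAST_BASE 100u  /* fixed-point representation of 1.0 */
#define PLAST_MAX  600u  /* highest value on the scale       */

static_assert(PLAST_MAX <= UINT16_MAX, "scale must fit a 16-bit field");

/* Clamp any intermediate value back onto the 1-600 scale. */
static inline uint16_t plast_clamp(u32 v)
{
    if (v < PLAST_MIN)
        return PLAST_MIN;
    if (v > PLAST_MAX)
        return PLAST_MAX;
    return (uint16_t)v;
}

/* Saturating increase: widen before adding so the sum cannot wrap. */
static inline uint16_t plast_add_sat(uint16_t p, uint16_t delta)
{
    return plast_clamp((u32)p + delta);
}

/* Post-loss cooldown: reduce by roughly 25% using a shift. */
static inline uint16_t plast_cooldown(uint16_t p)
{
    return plast_clamp((u32)p - ((u32)p >> 2));
}

/*
 * Apply the gain to a value. A plain multiply and divide are used here for
 * clarity only; the article describes shift-based approximations instead.
 */
static inline u32 plast_apply(u32 value, uint16_t p)
{
    return (u32)(((uint64_t)value * p) / PLAST_BASE);
}
```

For example, a plasticity of 250 scales a value of 10 to 25, and one cooldown step takes the maximum of 600 down to 450.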

5. Extreme Memory Compression

The internal state struct (ndm_tcp) has been compressed to 16 bytes.

This is exactly one quarter of a standard 64-byte cache line.
Benefit: Four flows can fit their congestion state into a single L1 cache line fetch, significantly reducing memory bus contention on multi-core 100G servers.
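To make the size arithmetic concrete, here is a sketch of a 16-byte layout with compile-time checks. The struct name and every field in it are hypothetical; the source gives only the total size, not the actual members of ndm_tcp.

```c
#include <assert.h>   /* static_assert */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* typical x86-64 L1 line size, as the text assumes */

/* Hypothetical field layout; only the 16-byte total comes from the article. */
struct ndm_tcp_sketch {
    uint32_t min_rtt_us;    /* 4 bytes: lowest RTT observed            */
    uint32_t last_rtt_us;   /* 4 bytes: previous RTT sample            */
    uint16_t plasticity;    /* 2 bytes: value on the 1-600 scale       */
    uint16_t weight_shifts; /* 2 bytes: packed shift amounts           */
    uint8_t  trend_window;  /* 1 byte:  8-bit RTT trend history        */
    uint8_t  entropy;       /* 1 byte:  cached entropy proxy           */
    uint8_t  cooldown;      /* 1 byte:  remaining post-loss windows    */
    uint8_t  flags;         /* 1 byte:  state flags                    */
};

/* Fail the build if padding or a new field changes the size. */
static_assert(sizeof(struct ndm_tcp_sketch) == 16, "state must stay 16 bytes");
static_assert(CACHE_LINE_BYTES % sizeof(struct ndm_tcp_sketch) == 0,
              "state must tile a cache line evenly");

/* Number of state structs that fit in one cache line when stored contiguously. */
static inline size_t flows_per_cache_line(void)
{
    return CACHE_LINE_BYTES / sizeof(struct ndm_tcp_sketch);
}
```

Ordering fields from widest to narrowest avoids compiler-inserted padding. The four-per-line figure applies only when the structs are stored contiguously and aligned, which this sketch assumes.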

Important Disclaimers

The Hyper-Embedded version is designed for production environments where CPU SoftIRQ overhead is the primary bottleneck. By reducing the cycle count by nearly 70% compared to the first optimized build, NDM-TCP v3.0 aims to keep the congestion control logic from limiting 100Gbps line-rate processing. The expected per-ACK cost is around 25-30 cycles.

Compilation & Usage

The module is self-contained and does not require FPU/SIMD state saving:

# Rename 
cp ndm_tcp_lkm_hyp.c ndm_tcp_lm.c

# Standard Build
make

# Load Module
make enable

The module automatically detects RTT manifold shifts and applies the 1-600 plasticity gain to the CWND growth curve, stabilized by the high-speed entropy noise-floor damping.
