Muhammed Shafin P
NDM-TCP: A Conservative Approach to High-Speed Congestion Control

GitHub Repository: https://github.com/hejhdiss/lkm-ndm-tcp

Introduction: Stability Over Aggressiveness

NDM-TCP represents a fundamentally conservative approach to TCP congestion control. Unlike aggressive algorithms that push throughput boundaries at the risk of instability, NDM-TCP prioritizes predictable behavior and network stability through its ML-based neural network architecture. This design philosophy becomes evident across all four released versions: the standard implementation, the Optimized build, the Ultra-Optimized variant, and the Hyper-Embedded build.

Important Clarification on Testing Status: All four versions (standard, optimized, ultra-optimized, and hyper-embedded) have been tested at the build and load level—they compile successfully and load as Linux Kernel Modules without errors. However, only the standard version (v1.0) has undergone comprehensive performance testing in a simulated network environment with detailed results showing its behavior under various conditions.

This is NOT a Standard Network TCP Congestion Control: It's crucial to understand that this implementation has some performance benefits based on its architecture and design approach, but it is not a standardized or formally validated TCP congestion control algorithm. The benefits come from the specific architectural choices made during development.

Comparison with Real NDM-TCP: This LKM (Linux Kernel Module) version represents a significantly reduced implementation compared to the real NDM-TCP (non-LKM version). Many sophisticated components have been removed or simplified to reduce CPU cycle usage:

  • No real differential manifold calculations—only conceptual manifold detection using simplified approximations
  • Manifold complexity reduced from sophisticated mathematical operations to basic pattern detection
  • The neural network style is closer to sophisticated manifolds than to the full differential manifold approach
  • Even the standard version includes major reductions to keep CPU cycles manageable

Critical Expectation Management: Because the standard version itself is already heavily simplified from the real NDM-TCP, users should NOT expect standard-level performance from the highly optimized versions (ultra-optimized and hyper-embedded). These versions include changes that significantly break the character and adaptive capabilities of the algorithm. They represent extreme trade-offs that may compromise the core intelligence of the system.

Community Collaboration Needed: This is an open call to the networking community. If you have access to high-performance networking equipment (100Gbps+ links), testing infrastructure, or research facilities, your collaboration in validating these optimized versions would be invaluable. The implementations are ready for testing, and the community can build upon them to develop further optimizations or validate their real-world behavior.

The results and analysis presented here for the standard version reflect actual testing in simulated network environments. For the optimized variants, the discussion focuses on design intentions and expected behavior based on implementation details.

The Standard Build (v1.0): Proven Stability in Simulated Environment

The original NDM-TCP v1.0 is the only version that has undergone comprehensive performance testing in a simulated network environment. As an ML-based congestion control algorithm, NDM-TCP uses neural network decision-making to adapt to patterns in network conditions even without prior training—exhibiting zero-training behavior that allows it to function effectively from the moment of deployment.

Important Context: Even this "standard" version is a significantly simplified implementation compared to the real NDM-TCP (non-LKM version). To reduce CPU cycle usage, this LKM implementation:

  • Uses conceptual manifold detection rather than real differential manifold calculations
  • Employs simplified pattern detection instead of sophisticated mathematical manifold operations
  • The neural network style is closer to sophisticated manifolds than to the full differential manifold approach
  • Sacrifices mathematical sophistication for kernel-space efficiency

Despite these simplifications, the standard build employs 32-bit precision for inputs and 16-bit weights, prioritizing mathematical accuracy over raw speed within its simplified framework. The algorithm learns from observed conditions to balance three critical factors: throughput, stability, and retransmissions. Unlike purely reactive algorithms, NDM-TCP's neural network can identify patterns in RTT variations, packet loss events, and network state transitions, allowing it to anticipate rather than merely respond to congestion.
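To make the "32-bit inputs, 16-bit weights" figure concrete, here is a minimal userspace sketch of one neuron of a fixed-point forward pass. The input count matches the six active inputs quoted later in this article, but the `ndm_neuron` name, the Q8.8 weight format, and the accumulation strategy are illustrative assumptions, not the module's actual code:

```c
#include <stdint.h>

#define NDM_INPUTS 6   /* active inputs, per the article */

/* One neuron of a fixed-point forward pass: 32-bit inputs, 16-bit
 * weights in Q8.8 (256 == 1.0), accumulated in 64 bits to avoid
 * overflow, then shifted back down to drop the fractional bits. */
static int32_t ndm_neuron(const int32_t in[NDM_INPUTS],
                          const int16_t w[NDM_INPUTS], int16_t bias)
{
    int64_t acc = bias;
    for (int i = 0; i < NDM_INPUTS; i++)
        acc += (int64_t)in[i] * w[i];
    return (int32_t)(acc >> 8);
}
```

With unit weights (256 in Q8.8) the neuron reduces to a plain sum, which makes the arithmetic easy to verify by hand.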

This version's conservatism manifests in its RTT manifold detection system. Rather than immediately adjusting the congestion window upon detecting RTT variations, the algorithm accumulates evidence over a 16-slot history window. Only when patterns emerge consistently does the system modify its behavior. This approach prevents overreaction to transient network events—a critical consideration for maintaining stable throughput. The adaptive nature means NDM-TCP continuously refines its understanding of the network's behavior, adjusting its responses based on what it learns rather than following rigid, pre-programmed rules.
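The evidence-accumulation idea can be sketched as a 16-slot ring buffer plus a consistency check. The `rtt_hist` struct, `rtt_push`, and the `need` threshold are hypothetical names chosen for illustration, not the module's API:

```c
#include <stdint.h>

#define RTT_SLOTS 16  /* history window size, per the article */

struct rtt_hist {
    uint32_t rtt_us[RTT_SLOTS];  /* recent RTT samples (microseconds) */
    uint8_t  head;               /* next slot to overwrite */
};

static void rtt_push(struct rtt_hist *h, uint32_t rtt_us)
{
    h->rtt_us[h->head] = rtt_us;
    h->head = (h->head + 1) % RTT_SLOTS;
}

/* Returns 1 only when at least `need` of the stored samples exceed the
 * baseline -- a single transient spike does not trip the detector. */
static int rtt_trend_confirmed(const struct rtt_hist *h,
                               uint32_t baseline_us, int need)
{
    int above = 0;
    for (int i = 0; i < RTT_SLOTS; i++)
        if (h->rtt_us[i] > baseline_us)
            above++;
    return above >= need;
}
```

The point of the `need` threshold is exactly the conservatism described above: one elevated sample is ignored, while a sustained pattern across most of the window triggers a response.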

Tested Results (Simulated Environment): Comprehensive testing was performed in simulated network environments using tc (traffic control) and iperf3 on localhost. Most importantly, the testing revealed stable and adaptive results with notably low retransmission rates—a key indicator of conservative, intelligent congestion control.

Testing Environment Details: The testing was conducted on Xubuntu 24.04, VMware 17, Linux kernel 6.11.0. Since this kernel version does not have BBR (Bottleneck Bandwidth and RTT) installed by default, no comparison testing against BBR was performed. Additionally, testing was not conducted on real hardware. For comprehensive validation including BBR comparison and real hardware testing, community support is needed.

Performance Numbers: In the simulated environment with tc and iperf3 on localhost, the standard version achieved an average throughput of 56 Gbps. In pure localhost testing without any tc or delay configurations, the maximum observed throughput was roughly 61.1 Gbps (from memory). The implementation shows some performance benefits from its architectural and design approach, though it is not a standardized or formally validated TCP congestion control algorithm.

For Readers Needing Context - Expected Overview of BBR v1 vs NDM-TCP Character:

BBR (Bottleneck Bandwidth and RTT):

  • Type: Model-based TCP congestion control
  • Mechanism: Measures bottleneck bandwidth and RTT continuously
  • Decision Making: Uses a deterministic algorithm to pace packets; no learning from past network patterns
  • Adaptation: Reactive/predictive based on measured network parameters (bandwidth, RTT), not neural networks
  • Goal: Maximize throughput and minimize queueing delay

NDM-TCP (This LKM Implementation):

  • Type: ML-based congestion control (neural network)
  • Mechanism: Uses entropy calculation and 6 active inputs (inputs 7 and 8 are reserved placeholders in the standard version)
  • Decision Making: Neural network that can adapt to patterns in network conditions with zero-training behavior
  • Adaptation: Both predictive and reactive—it learns from observed conditions to balance throughput, stability, and retransmissions. Due to its design and architecture, it can potentially detect congestion before it occurs (a predictive capability based on pattern recognition)
  • Goal: Stability-focused, conservative, adaptive to varying network conditions
  • Note: Inputs 7 and 8 are reserved for future use—anyone can edit and remove them if not needed, or add real inputs there for more accurate predictions if desired

Entropy Use Cases in NDM-TCP:

| Entropy Level | Interpretation | Congestion Control Response | Rationale |
|---|---|---|---|
| High entropy (>0.7) | Network noise, random variations, transient fluctuations | Be aggressive: increase cwnd faster, treat as a non-congestion event | High entropy indicates no clear pattern → likely random network jitter, not real congestion |
| Low entropy (<0.7) | Real congestion pattern, consistent RTT increases | Be conservative: slow cwnd growth, treat as actual congestion | Low entropy indicates a clear pattern → reliable signal of network congestion |
| Loss + high entropy | Packet loss with a noisy network | Reduce less aggressively: cwnd reduced by 1/3 instead of 1/2 | Loss might be due to noise, not congestion |
| Loss + low entropy | Packet loss with a clear pattern | Standard reduction: cwnd reduced by 1/2 | Clear congestion signal, use the standard TCP response |

The entropy threshold of 0.7 acts as the decision boundary: above this threshold, the algorithm assumes network variations are primarily noise and responds less conservatively; below this threshold, it interprets the patterns as genuine congestion signals and responds with appropriate caution.
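The loss-response rows of the table above can be sketched in integer-only C (as kernel code must be). The 1000x entropy scaling (so 0.7 becomes 700) and the function name are illustrative assumptions:

```c
#include <stdint.h>

#define ENTROPY_THRESH 700  /* 0.7 scaled by 1000, integer-only */

/* On packet loss: halve cwnd when entropy is low (a clear congestion
 * pattern), but reduce by only one third when entropy is high (the
 * loss coincides with a noisy network and may not mean congestion). */
static uint32_t ndm_cwnd_on_loss(uint32_t cwnd, uint32_t entropy_m)
{
    if (entropy_m > ENTROPY_THRESH)
        return cwnd - cwnd / 3;  /* noisy: gentler reduction */
    return cwnd / 2;             /* clear pattern: standard halving */
}
```

For example, a cwnd of 90 packets drops to 60 under high entropy but to 45 under low entropy, matching the 1/3-versus-1/2 rule in the table.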

The CPU overhead remains moderate, making it suitable for general-purpose servers where congestion control is one of many competing workloads. Each ACK requires a full neural network forward pass and entropy calculation, consuming approximately 300-400 CPU cycles per packet (a rounded estimate covering only the neural network computation and entropy calculation, not the full cycle count for all TCP processing). This computational cost is the price of adaptive intelligence within the simplified framework.

NDM-TCP Optimized (v2.0): Theoretical Improvements (Untested)

Testing Status: This version builds and loads successfully as an LKM but has not undergone performance testing in simulated or real network environments.

Important: Each optimization level involves deeper changes at both the architectural and design levels to achieve performance gains in CPU cycle usage and RAM efficiency, NOT to improve prediction performance. These changes prioritize computational efficiency over the algorithm's adaptive capabilities.

The Optimized build introduces the first wave of performance enhancements while attempting to maintain algorithmic integrity. This version leverages AVX/SIMD intrinsics for vectorized neural network computation, allowing parallel processing of multiply-accumulate operations. The RTT history window is compressed from 16 to 8 slots, and the hidden layer is reduced from 8 to 4 neurons, shrinking the total struct size to exactly 64 bytes.

The design intent preserves the conservative nature through intelligent choices. Pre-computed lookup tables replace runtime calculations for activation functions like tanh and sigmoid, theoretically eliminating floating-point variability that could lead to unpredictable congestion responses. The LUT-based approach should ensure deterministic behavior—the same network state always producing the same congestion window adjustment.
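A pre-computed activation LUT might look like the following sketch: tanh sampled at integer points and stored in Q8 fixed point (value × 256, rounded). The 9-entry table is deliberately coarse for readability; a real table would be far denser, and nothing here is taken from the module's source:

```c
#include <stdint.h>

/* tanh(x) for x = -4..4, in Q8 fixed point (256 == 1.0).
 * Values were rounded offline; the lookup itself is branch-light
 * and deterministic: the same input always yields the same output. */
static const int16_t tanh_q8[9] = {
    -256, -255, -247, -195, 0, 195, 247, 255, 256
};

/* Clamp x to [-4, 4] and look up tanh(x) in Q8. */
static int16_t tanh_lut(int x)
{
    if (x < -4) x = -4;
    if (x >  4) x =  4;
    return tanh_q8[x + 4];
}
```

Determinism is the design goal here: unlike a floating-point `tanh()`, the table gives bit-identical results on every machine, so identical network states produce identical cwnd adjustments.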

Expected performance projections suggest a 56-62% reduction in CPU cycles per packet compared to the standard build. The computation caching mechanism is designed to reuse previous cwnd deltas for up to 8 consecutive ACKs when network conditions remain stable. However, the reduced history window means the algorithm would have less historical context for decision-making, which could potentially result in slower adaptation to genuinely changing network conditions.
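The delta-reuse mechanism could be sketched like this, with a state hash standing in for "network conditions remain stable"; all names and the hash-comparison strategy are assumptions for illustration:

```c
#include <stdint.h>

/* Cache of the last computed cwnd delta, reused for up to 8
 * consecutive ACKs while a hash of the network state is unchanged. */
struct delta_cache {
    uint32_t state_hash;
    int32_t  delta;
    uint8_t  reuse;
};

static int calls;  /* counts real forward passes, for demonstration */

static int32_t fake_forward(uint32_t hash)
{
    calls++;
    return (int32_t)(hash & 0xF);  /* stand-in for the neural pass */
}

static int32_t cached_delta(struct delta_cache *c, uint32_t state_hash,
                            int32_t (*compute)(uint32_t))
{
    if (c->state_hash == state_hash && c->reuse < 8) {
        c->reuse++;
        return c->delta;            /* skip the expensive computation */
    }
    c->state_hash = state_hash;     /* state changed or budget spent */
    c->delta = compute(state_hash);
    c->reuse = 0;
    return c->delta;
}
```

During a stable bulk transfer, only one forward pass runs per 9 ACKs; any change in the state hash immediately forces a fresh computation.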

Important: Without empirical testing, it's unknown whether this version maintains the proven stability and adaptive capabilities of the standard build. The reduced neural network capacity (4 neurons vs 8) and shorter history window (8 vs 16 slots) represent architectural changes made for CPU/RAM efficiency, not prediction accuracy. Users should not assume this version will deliver the same balance of throughput, stability, and retransmission control that characterizes the tested standard implementation.

NDM-TCP Ultra-Optimized (v2.0.0-100g): Extreme Quantization (Character-Breaking Changes)

Testing Status: This version builds and loads successfully as an LKM but has not undergone performance testing in simulated or real network environments.

Critical Warning: Each optimization level involves deeper changes at both the architectural and design levels to achieve performance gains in CPU cycle usage and RAM efficiency, NOT to improve prediction performance. These changes prioritize computational efficiency over the algorithm's adaptive intelligence.

The Ultra-Optimized build represents a radical implementation strategy shift, targeting 100GbE and 400GbE environments where CPU budget per packet is measured in nanoseconds. The entire neural network pipeline is converted to signed 8-bit integers (s8), reducing memory bandwidth requirements by 75%. The struct is compressed to exactly 40 bytes, fitting within a single x86 cache line.

The design philosophy attempts to maintain some level of conservatism through aggressive quantization. By reducing precision from 32-bit to 8-bit, the system theoretically filters out minor network fluctuations that fall below the quantization threshold. However, this drastic reduction fundamentally compromises the algorithm's learning capability.
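The quantization-as-filtering effect is easy to demonstrate: two inputs that differ by less than one quantization step map to the same s8 value, so small fluctuations simply vanish. The scale factor below is an illustrative assumption:

```c
#include <stdint.h>

/* Scale a 32-bit input down into the signed 8-bit range [-127, 127].
 * Anything smaller than one quantization step (`scale`) is lost --
 * which filters noise, but also discards fine-grained signal. */
static int8_t quantize_s8(int32_t x, int32_t scale)
{
    int32_t q = x / scale;
    if (q >  127) q =  127;
    if (q < -127) q = -127;
    return (int8_t)q;
}
```

With a scale of 100, inputs of 1000 and 1049 both quantize to 10: the 49-unit fluctuation is invisible to the 8-bit pipeline. That is also precisely the precision the learning machinery loses.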

The bitwise entropy calculation replaces division with shifts, operating on u8 history data to compute entropy in fewer than 20 cycles. The "stable state neural bypass" is designed to skip neural network inference for up to 16 packets during bulk transfers. Theoretical projections suggest 50-60% reduction in CPU cycles per packet, with expected performance around 60-80 cycles per ACK.

Do Not Expect Standard Performance: Because the standard version itself is already heavily reduced from the real NDM-TCP, and the ultra-optimized version makes even more aggressive compromises for CPU/RAM efficiency rather than prediction accuracy, users should have very low expectations for adaptive behavior, stability, or the low retransmission rates achieved by the standard version. This version may function as a basic congestion control mechanism but likely lacks the intelligence and adaptability that characterizes NDM-TCP.

NDM-TCP Hyper-Embedded (v3.0.4-hyper-entropy): Deterministic Execution (Character-Breaking Changes)

Testing Status: This version builds and loads successfully as an LKM but has not undergone performance testing in simulated or real network environments.

Critical Warning: Each optimization level involves deeper changes at both the architectural and design levels to achieve performance gains in CPU cycle usage and RAM efficiency, NOT to improve prediction performance. The hyper-embedded version is the most extreme optimization, with the most aggressive architectural changes prioritizing computational speed over adaptive intelligence.

The Hyper-Embedded build pushes optimization to the theoretical limit, targeting 100Gbps+ environments with sub-microsecond processing budgets. The design aims to eliminate CPU pipeline stalls and reduce per-ACK cycle count to the absolute minimum by removing every source of execution variability.

Zero-branch neural inference is the cornerstone of this version's design. By replacing if/else logic with bit-masking techniques (mask = -(s32)(rtt > min_rtt)), the CPU theoretically executes the neural forward pass in constant time, preventing Branch Target Buffer misses. However, this deterministic execution comes at the cost of adaptive flexibility.
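The bit-masking expression quoted above generalizes to a branch-free select. This sketch shows the mechanics with hypothetical delta values; the function name is illustrative:

```c
#include <stdint.h>

/* The comparison result (0 or 1) is negated into an all-zeros or
 * all-ones mask, which then selects one of two cwnd deltas with pure
 * bitwise ops -- no if/else for the branch predictor to miss. */
static int32_t branchless_delta(uint32_t rtt, uint32_t min_rtt,
                                int32_t delta_up, int32_t delta_down)
{
    int32_t mask = -(int32_t)(rtt > min_rtt);  /* 0x0 or 0xFFFFFFFF */
    return (delta_down & mask) | (delta_up & ~mask);
}
```

Both paths execute in the same number of cycles regardless of the outcome, which is the whole point: constant-time inference with no Branch Target Buffer pressure.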

The move to multiplier-free math using power-of-two shifts represents an extreme trade-off. By replacing standard integer multiplications with SHL/SHR approximations, neural weights are processed by simple ALU ports. This sacrifices significant precision for deterministic timing and CPU efficiency.
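Replacing multiplies with shifts can be sketched as storing each weight as a shift amount plus a sign. This is a generic illustration of the technique, not the module's actual weight encoding:

```c
#include <stdint.h>

/* A "weight" restricted to +/- powers of two: positive shift counts
 * multiply (SHL), negative ones divide (SHR), and `neg` flips the
 * sign. Only simple ALU ports are needed -- but any weight that is
 * not a power of two must be rounded to one, losing precision. */
static int32_t shift_weight(int32_t x, int8_t shift, int8_t neg)
{
    int32_t v = shift >= 0 ? (x << shift) : (x >> -shift);
    return neg ? -v : v;
}
```

A weight of 0.3, for instance, must be approximated as 0.25 (a right shift by 2), which is the precision sacrifice the paragraph above describes.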

The hyper-fast bit-stream entropy calculation uses Hamming Weight analysis via POPCNT, reducing entropy calculation from approximately 40 cycles to 3 cycles. The 8-bit RTT trend window provides limited historical context. The 1-600 scaled plasticity system replaces floating-point sensitivity with fixed-point integers, with a post-loss "cooldown" reducing plasticity by 25% per window.
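The Hamming-weight entropy idea can be illustrated with `__builtin_popcount` (GCC/Clang; the in-kernel equivalent is `hweight8`): a trend byte with half its bits set is maximally mixed, while all-zeros or all-ones indicates a uniform trend. The 0-4 integer scale is an illustrative assumption:

```c
#include <stdint.h>

/* Treat the 8-bit RTT trend window as a bit-stream. Popcount near 4
 * (half ones, half zeros) means maximally mixed bits -> high entropy;
 * popcount of 0 or 8 means a uniform trend -> low entropy. */
static unsigned trend_entropy(uint8_t trend_bits)
{
    unsigned ones = (unsigned)__builtin_popcount(trend_bits);
    /* distance from the uniform extremes: 0 at 0 or 8 ones, 4 at 4 */
    return ones <= 4 ? ones : 8 - ones;
}
```

A single POPCNT plus a compare and subtract is how the cycle count collapses from ~40 to a handful, at the cost of a much cruder entropy signal than the standard build's calculation.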

Extreme memory compression to 16 bytes per flow theoretically allows four flows to share a single L1 cache line fetch. Projections suggest approximately 70% cycle count reduction compared to the first optimized build, with expected performance around 25-30 cycles per packet.
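The 16-byte budget can be made explicit with a static assert. The field names below are hypothetical, chosen to match state this article mentions (plasticity, the 8-bit RTT trend window, a cached delta); the real layout may differ:

```c
#include <stdint.h>

/* A 16-byte per-flow state: four of these fit in one 64-byte x86 L1
 * cache line, so a single line fetch can serve four flows. */
struct ndm_flow16 {
    uint32_t cwnd;        /* congestion window, in packets         */
    uint32_t min_rtt_us;  /* RTT floor, microseconds               */
    uint16_t plasticity;  /* 1-600 fixed-point sensitivity         */
    uint8_t  rtt_trend;   /* 8-bit trend bit-window                */
    uint8_t  flags;
    int32_t  last_delta;  /* cached cwnd delta                     */
};
_Static_assert(sizeof(struct ndm_flow16) == 16,
               "flow state must stay within 16 bytes");
```

The `_Static_assert` turns the size budget into a compile-time guarantee: adding any field that pushes the struct past 16 bytes breaks the build rather than silently halving cache-line density.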

Do Not Expect Standard Performance: This version likely functions more as a deterministic state machine than an intelligent, adaptive congestion control algorithm. The aggressive architectural and design changes made to achieve CPU/RAM performance gains almost certainly eliminate the ML-based learning capability. Users should expect this to behave more like a traditional heuristic-based congestion control with very fast execution, not an adaptive algorithm that learns from network patterns. The stability and low retransmission rates demonstrated by the standard version are unlikely to be replicated here.

Conclusion: Understanding Limitations and Setting Realistic Expectations

NDM-TCP's core philosophy—stability and adaptability over raw throughput—has been demonstrated through testing of the standard v1.0 build in simulated network environments. However, it's critical to understand the fundamental limitations of this entire project.

This is NOT Standard Network TCP Congestion Control: This LKM implementation has some performance benefits based on its architectural and design approach, but it is not a standardized or formally validated TCP congestion control algorithm. The benefits are derived from specific design choices, not from formal validation against networking standards.

All Versions Are Simplified: Even the standard v1.0 build is a significantly reduced implementation compared to the real NDM-TCP (non-LKM version). To achieve manageable CPU cycle usage in kernel space, no real differential manifold calculations are performed, manifold detection is conceptual and greatly simplified, and the neural network style is closer to sophisticated manifolds than to the full differential manifold approach. Mathematical sophistication has been traded for practical efficiency.

Critical Understanding About Optimized Versions: Each optimization level involves deeper changes at both the architectural and design levels. These changes are made to achieve performance gains in CPU cycle usage and RAM efficiency, NOT to improve prediction performance or adaptive capabilities. The more optimized the version, the more aggressive the architectural changes, and the greater the sacrifice of adaptive intelligence for computational speed.

The three optimized variants—optimized (v2.0), ultra-optimized (v2.0.0-100g), and hyper-embedded (v3.0.4)—all build and load successfully as Linux Kernel Modules. They have been tested at the build/load level, confirming they work as functional kernel modules. However, they have NOT been tested for actual network performance behavior. Users should NOT expect standard-level performance from these highly optimized versions.

Testing Reality and Community Collaboration: The standard v1.0 has been tested in simulated environments and shows stable, adaptive results with low retransmission rates. The optimized versions have only been verified to build and load—their network behavior is completely untested due to time constraints and lack of access to advanced equipment and research lab facilities. This is where the community can help. If you have access to high-performance networking equipment (100Gbps+ links), testing infrastructure, or research facilities, your collaboration in validating these optimized versions would be invaluable.

Recommendations for Use:

Standard v1.0 has demonstrated stable, adaptive behavior with low retransmission rates in simulated testing. Suitable for environments where proven behavior is needed, though remember even this is simplified from real NDM-TCP.

Optimized v2.0 has moderate architectural changes made for CPU/RAM efficiency; may be worth testing but expect some degradation from standard performance in terms of adaptive behavior.

Ultra-Optimized and Hyper-Embedded are experimental only. These versions have extensive architectural and design changes prioritizing CPU/RAM performance over prediction accuracy. They likely do NOT behave like adaptive ML-based congestion control and should not be expected to match the stability or low retransmission rates of the standard version.

All versions are fundamentally experimental. The "recommendation" for standard or optimized versions is relative—they are less compromised than the ultra/hyper variants, but none of this is production-grade, standardized TCP congestion control. If you want more optimized versions or need validation for 100Gbps+ environments, the community can build upon these implementations, but realistic expectations about their capabilities are essential. Remember: architectural changes were made for computational performance, not for better network prediction or adaptation.
