GitHub Repository: https://github.com/hejhdiss/lkm-ndm-tcp
Important Clarification: What Was Updated and Why
There was significant confusion in the original analysis. When I uploaded the test results to Claude Sonnet AI, it compared v1's 6.82 Mbps result from the 100ms ±200ms extreme delay test against the 50ms ±100ms moderate delay results (21.8-27.3 Mbps range). The AI either mixed up the data or misunderstood which test was which.
The truth is more nuanced:
- Extreme delay (100ms ±200ms): v1 performed between good and excellent
- Moderate delay (50ms ±100ms): v1's performance was good but not excellent
The Pure Delay Problem: Two Different Scenarios
NDM-TCP v1 has different performance levels in two distinct pure delay scenarios. Understanding these differences is critical to understanding v1's actual capabilities and limitations.
Test 1: Extreme Pure Delay (100ms ±200ms)
Test Conditions:
- Delay: 100ms with ±200ms variation (0-300ms+ range)
- No packet loss, no bandwidth limits
- Pure delay only
- Duration: 40 seconds
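If you want to reproduce conditions like these, a delay profile of this shape can be emulated with Linux tc/netem. This is a hedged guess at the test harness (the actual scripts live in the repository), and the use of the loopback interface is an assumption:

```shell
# Assumed reproduction of the 100ms ±200ms profile via netem (requires root).
# The loopback device is a guess; substitute your test interface.
sudo tc qdisc add dev lo root netem delay 100ms 200ms

# Verify the qdisc is active
tc qdisc show dev lo

# Tear down after the test run finishes
sudo tc qdisc del dev lo root
```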
Results:
| Algorithm | Throughput | Retransmissions | Performance |
|---|---|---|---|
| NDM-TCP v1 | 6.82 Mbps | 34 | ✅ Between good and excellent |
| Cubic | 6.87 Mbps | 54 | Similar throughput |
| Reno | 7.94 Mbps | 36 | Slightly better throughput |
| BBR | 48.1 Mbps | 427 | ❌ High throughput, poor stability |
Analysis:
NDM-TCP v1 achieved good-to-excellent stability in this extreme delay scenario:
- Lowest retransmissions: 34 (vs Cubic's 54, Reno's 36, BBR's 427)
- Nearly identical throughput to Cubic: 6.82 vs 6.87 Mbps (only 0.05 Mbps difference)
- Only 1.1 Mbps behind Reno: 6.82 vs 7.94 Mbps (14% difference)
- Significantly more stable than BBR: 34 vs 427 retransmissions
Verdict for 100ms ±200ms: NDM-TCP v1 performed between good and excellent. Not perfect, but clearly competitive in both stability and throughput.
Test 2: Moderate Pure Delay (50ms ±100ms)
Test Conditions:
- Delay: 50ms with ±100ms variation (0-150ms range)
- No packet loss, no bandwidth limits
- Pure delay only
- Duration: 60 seconds
Results:
| Algorithm | Throughput | Retransmissions | Performance |
|---|---|---|---|
| Reno | 27.3 Mbps | 76 | Excellent |
| Cubic | 23.5 Mbps | 74 | Excellent |
| NDM-TCP v1 | 21.8 Mbps | 84 | Good, not excellent |
| BBR | 57.6 Mbps | 1,227 | ❌ Very high throughput, poor stability |
Analysis:
This is where v1's weakness becomes clear:
- v1 throughput: 21.8 Mbps - good but not excellent
- Reno throughput: 27.3 Mbps (25% higher than v1)
- Cubic throughput: 23.5 Mbps (8% higher than v1)
- v1 retransmissions: 84 - only slightly higher than Reno (76) and Cubic (74)
- No clear advantage: v1 doesn't excel in stability OR throughput
Verdict for 50ms ±100ms: NDM-TCP v1's performance was good but not excellent - and since NDM-TCP's philosophy is stability first, this means it failed its purpose in cases like this one.
Why I Said v1 "Failed" - The Philosophy Angle
Even though v1's performance was between good and excellent in extreme delay and good in moderate delay, I still say it failed its purpose. Here's why:
NDM-TCP is designed with stability as the core philosophy. For stability-critical scenarios, we need excellence, not just "good" performance.
In the 100ms ±200ms extreme delay case:
- ✅ v1 achieved excellent stability (34 retransmissions - lowest of all)
- ✅ v1 achieved good throughput (6.82 Mbps - nearly identical to Cubic)
- ✅ Overall: Between good and excellent performance
In the 50ms ±100ms moderate delay case:
- ⚠️ v1 achieved good stability (84 retransmissions - comparable to others)
- ⚠️ v1 achieved good throughput (21.8 Mbps - but 25% behind Reno)
- ❌ Overall: Good but not excellent - no clear advantage
The failure: In moderate pure delay scenarios similar to 50ms ±100ms, v1 doesn't demonstrate the excellence expected from a stability-focused algorithm. It's not bad, but it doesn't shine or excel.
When Do Pure Delay Scenarios Actually Occur?
These failure modes only manifest in highly specific edge cases that are extremely rare in normal network conditions:
- High-Frequency Trading (HFT) networks - Ultra-low latency links with minimal buffering where delay variation comes from route changes
- Dedicated fiber connections - Point-to-point links with minimal packet loss but variable delay from temperature or physical changes
- Satellite communication during clear weather - Atmospheric delay variation without signal loss
- Quality wireless links - Good signal strength but variable delay from interference patterns
- Data center cross-connects - Well-maintained links where delay varies but packet loss is extremely rare
These scenarios require a very specific combination: delay variation only, with NO loss, NO queueing, NO bandwidth constraints, and NO other typical network issues. This combination almost never occurs in real-world networks.
Why Not BBR-Inspired Approaches?
I considered implementing a BBR-inspired solution, since BBR excels at raw throughput in pure delay scenarios (81.6 Mbps, far above the other algorithms).
However, BBR demonstrates extreme instability with massive retransmissions:
- Pure delay test: 885 retransmissions
- Extreme delay test: 1,326 retransmissions
- High throughput achieved at the complete sacrifice of stability
This fundamentally conflicts with NDM-TCP's core principle: stability first.
v4 Development: Delay Enhancement Experiments
I attempted to create v4 with delay awareness to address the pure delay problem. Here's a summary of the experimental approaches:
Changes Implemented in v4
Two New Inputs Added (replacing dummy inputs 6 & 7):
Input 6 - Queuing Delay:
```c
/* Calculate absolute difference between current RTT and minimum RTT,
 * normalized to permille of min RTT and capped at 1000 */
u32 q_delay = (rtt_us > ca->min_rtt_us) ? (rtt_us - ca->min_rtt_us) : 0;
inputs[6] = (s32)min_t(u64, (q_delay * 1000ULL) / max(ca->min_rtt_us, 1U), 1000);
```
Input 7 - RTT Gradient:
```c
/* Calculate difference between current RTT and previous RTT,
 * scaled by 100 and clamped to ±1000 */
u16 last_rtt_ms = ca->rtt_history[(ca->history_index + ENTROPY_WINDOW_SIZE - 1) % ENTROPY_WINDOW_SIZE];
s32 rtt_diff = (s32)(rtt_us / 1000) - (s32)last_rtt_ms;
inputs[7] = clamp(rtt_diff * 100, -1000, 1000);
```
Modified Congestion Avoidance Logic
Three new response modes were added to ndm_tcp_cong_avoid():
```c
/* 1. DELAY-ONLY DETECTION: high queuing delay but no loss/entropy signals */
if (inputs[6] > 800 && !ca->congestion_detected) {
    /* Pure delay scenario: cautious growth */
    u32 delta = max(1U, acked * cwnd_delta / 3000);
    tcp_cong_avoid_ai(tp, tp->snd_cwnd, delta);
}
/* 2. REAL CONGESTION: low entropy + high RTT gradient */
else if (ca->has_data && ca->congestion_detected) {
    /* Conservative growth, extra conservative if the gradient is high */
    u32 divisor = (inputs[7] > 500) ? 4000 : 2000;
    u32 delta = max(1U, acked * cwnd_delta / divisor);
    tcp_cong_avoid_ai(tp, tp->snd_cwnd, delta);
}
/* 3. NOISE/CLEAR PATH: high entropy or low delay */
else if (ca->has_data && !ca->congestion_detected) {
    /* High entropy = noise: be aggressive */
    u32 delta = max(1U, acked * cwnd_delta / 1000);
    tcp_cong_avoid_ai(tp, tp->snd_cwnd, delta);
}
```
The Fundamental Issues with v4
Multiple variations were tested:
- BBR-like delay additions → Increased RTT and high retransmissions
- Delay queue + RTT variance → Lower retransmissions but introduced other problems
- RTT jitter + variance (current v4) → Low retransmissions but significantly reduced throughput
v4 Test Results (50ms ±100ms delay, 40 seconds):
- Throughput: 11.6 Mbps
- Retransmissions: 70
v1 Comparison (same delay conditions, 60 seconds):
- v1: 21.8 Mbps with 84 retransmissions
- v4: 11.6 Mbps with 70 retransmissions (linearly extrapolated to 60s: ~105 retransmissions)
The Design Philosophy Conflict
Even though v4's results aren't a complete failure, they create a fundamental philosophical problem with NDM-TCP's design.
NDM-TCP prioritizes stability over raw throughput. This is why we can't simply adopt Cubic or Reno approaches here - they have their own drawbacks in many scenarios where v1 excels (as demonstrated in previous test results).
The conflict: v4 achieves the lowest retransmission count (70), which aligns with stability goals. However, its throughput (11.6 Mbps) creates a problematic trade-off.
Since NDM-TCP follows a stability-focused design and has delivered excellent results in all other tests, I initially expected it to perform well in this delay case too. That expectation was misguided.
The delay-based mathematical additions significantly reduce throughput. The acceptability of this trade-off is unclear because:
- Stability requires balancing both retransmissions AND throughput
- v4 has excellent retransmission stability (70) but poor throughput (11.6 Mbps)
- v1 has good retransmission stability (84) with better throughput (21.8 Mbps)
I prefer v1 over v4 because v1 provides better overall balance based on all localhost tests conducted so far.
Critical Unknown: We haven't tested whether v4 causes problems in the other scenarios where v1 excels.
Why I'm Maintaining v1 as Primary
We cannot specialize for specific edge cases when it degrades general performance.
Based on all localhost tests conducted to date, NDM-TCP v1 performs well in general network scenarios and delivers better overall results than alternatives:
v1's Performance Summary:
- ✅ Constrained networks (loss + delay + bandwidth limits): Excellent
- ✅ Loss-heavy environments: Excellent (26-63 retransmissions)
- ✅ Mixed conditions: Good stability
- ✅ Extreme pure delay (100ms ±200ms): Between good and excellent (34 retransmissions - lowest of all)
- ❌ Moderate pure delay (50ms ±100ms): Good but not excellent - fails the stability-first philosophy
Recommendations for Pure Delay Edge Cases
For Extreme Pure Delay (100ms ±200ms or similar):
NDM-TCP v1 (Recommended):
- Extreme delay test: 6.82 Mbps, 34 retransmissions
- Between good and excellent stability
- Lowest retransmissions of all algorithms
- Competitive throughput
Reno (Alternative for higher throughput):
- Extreme delay test: 7.94 Mbps, 36 retransmissions
- Slightly higher throughput (14% more than v1)
- Still excellent stability
For Moderate Pure Delay (50ms ±100ms or similar):
When v1 fails to excel in moderate delay scenarios, users should consider:
Cubic (Recommended for Balance):
- Moderate delay test: 23.5 Mbps
- Retransmissions: 74 (lowest among traditional algorithms)
- Proven traditional algorithm
- Best compromise for pure delay scenarios
Reno (For Higher Throughput):
- Moderate delay test: 27.3 Mbps
- Retransmissions: 76
- Balanced approach with good throughput
BBR (Maximum Throughput, Stability Sacrifice):
- Moderate delay test: 57.6 Mbps
- Retransmissions: 1,227 (extremely high)
- Use only if raw throughput is critical
v1 Performance in Moderate Delay (For Comparison):
- 21.8 Mbps
- 84 retransmissions (comparable to Reno/Cubic)
- Actually performs reasonably well, just not with maximum throughput
v4 Performance in Moderate Delay:
- 11.6 Mbps (47% slower than v1)
- 70 retransmissions (lowest of all, but at significant cost)
- Too much throughput sacrificed for minimal retransmission improvement
I am not modifying v1 further. It works well for general cases based on all localhost testing conducted to date.
Community Invitation: Experiment with Your Own Version
The complete v4 code is available in the GitHub repository. If you want to experiment with delay enhancements, here is the full v4 congestion avoidance logic:
Complete v4 Congestion Avoidance Implementation
```c
/* Apply congestion control decision */
if (ca->in_slow_start) {
    /* Slow start: exponential growth */
    /* ADAPTIVE DELAY RESPONSE: if the RTT gradient (Input 7) is high,
     * slow down even if no loss has been detected yet. */
    if (ca->congestion_detected || inputs[7] > 400) {
        /* Detected congestion or rising delay: grow slower */
        tcp_slow_start(tp, acked / 2);
    } else {
        /* Normal slow start */
        tcp_slow_start(tp, acked);
    }
} else {
    /* Congestion avoidance */
    /* 1. DELAY-ONLY DETECTION: high queuing delay but no loss/entropy signals */
    if (inputs[6] > 800 && !ca->congestion_detected) {
        /* Pure delay scenario: use a cautious delta to avoid bufferbloat */
        u32 delta = max(1U, acked * cwnd_delta / 3000);
        tcp_cong_avoid_ai(tp, tp->snd_cwnd, delta);
    }
    /* 2. REAL CONGESTION: low entropy + high RTT gradient */
    else if (ca->has_data && ca->congestion_detected) {
        /* Real congestion: be conservative;
         * if the gradient (Input 7) is also high, be extra conservative */
        u32 divisor = (inputs[7] > 500) ? 4000 : 2000;
        u32 delta = max(1U, acked * cwnd_delta / divisor);
        tcp_cong_avoid_ai(tp, tp->snd_cwnd, delta);
    }
    /* 3. NOISE/CLEAR PATH: high entropy or low delay */
    else if (ca->has_data && !ca->congestion_detected) {
        /* High entropy = noise: be aggressive */
        u32 delta = max(1U, acked * cwnd_delta / 1000);
        tcp_cong_avoid_ai(tp, tp->snd_cwnd, delta);
    }
    /* 4. FALLBACK */
    else {
        /* Not enough data: fall back to standard Reno */
        tcp_reno_cong_avoid(sk, ack, acked);
    }
}
```
My Position on v1 vs v4
Based on all localhost tests conducted so far, v1 maintains superiority for general cases.
The pure delay scenario is problematic, but we can't simply borrow approaches from Reno, Cubic, or BBR. NDM-TCP has a fundamentally different philosophy: stability first.
If minimizing retransmissions is your only goal: v4 achieves this (70 retransmissions), but at the cost of poor throughput.
If you want balanced stability: v1 is superior (84 retransmissions with 21.8 Mbps throughput).
My preference remains v1 over v4 because stability means balancing both retransmissions AND throughput, not just minimizing one metric.
Alternative Recommendation for Pure Delay Cases
For pure delay scenarios where v4 is not preferred, use Cubic:
- 23.5 Mbps throughput
- 74 retransmissions (excellent stability)
- Proven traditional algorithm
- Better overall balance than v4 for these specific cases
You can modify v1 with this code if you wish to experiment, or create your own version with different thresholds and approaches.
Be aware: delay-based additions reduce throughput significantly. I am not modifying v1 further because the trade-offs remain unclear and v1 already performs well for general cases.
Final Recommendations
For General Network Use: NDM-TCP v1
- Proven across multiple localhost test scenarios
- Excellent stability in loss-heavy conditions
- Good performance in constrained networks
- Between good and excellent in extreme pure delay (100ms ±200ms): 6.82 Mbps, 34 retransmissions
- Good but not excellent in moderate pure delay (50ms ±100ms): 21.8 Mbps, 84 retransmissions
- Only fails to excel in moderate pure delay scenarios - still performs acceptably
For Extreme Pure Delay (100ms ±200ms or similar):
- NDM-TCP v1 (Recommended) - Between good and excellent: 6.82 Mbps, 34 retransmissions (lowest of all)
- Reno - Slightly higher throughput: 7.94 Mbps, 36 retransmissions
For Moderate Pure Delay (50ms ±100ms or similar):
- Reno (Recommended) - Best throughput: 27.3 Mbps, 76 retransmissions
- Cubic - Excellent balance: 23.5 Mbps, 74 retransmissions (lowest retransmissions)
- NDM-TCP v1 - Works but doesn't excel: 21.8 Mbps, 84 retransmissions
- BBR - Maximum throughput with massive retransmission cost: 57.6 Mbps, 1,227 retransmissions
- v4 - Only if absolute lowest retransmissions (70) is required and very low throughput (11.6 Mbps) is acceptable
I prefer v1 over v4 because stability requires balancing retransmissions AND throughput together, not optimizing a single metric.
Understanding the AI Comparison Confusion
What Happened
When I uploaded the test results to Claude Sonnet AI for analysis, the AI compared:
- v1's 6.82 Mbps from the 100ms ±200ms extreme delay test
- Against the 50ms ±100ms moderate delay test results (21.8-27.3 Mbps range)
The AI concluded v1 "failed catastrophically" because 6.82 Mbps is much lower than 21.8-27.3 Mbps.
The Problem
These are two different tests with different delay ranges. The AI either:
- Got the data mixed up (comparing wrong test scenarios)
- Had difficulty understanding which test result belonged to which scenario
The Reality
When compared within the same test, v1's performance looks very different:
- In 100ms ±200ms test: v1 got 6.82 Mbps, others got 6.87-7.94 Mbps (v1 is competitive)
- In 50ms ±100ms test: v1 got 21.8 Mbps, others got 23.5-27.3 Mbps (v1 is behind)
Why This Clarification Matters
Avoiding negative reader impressions: Some readers might see "6.82 Mbps" and "failed catastrophically" and think NDM-TCP v1 is terrible in all pure delay cases. That's not accurate.
The truth is:
- v1 is between good and excellent when delay variation is extreme (100ms ±200ms)
- v1 is good but fails its purpose when delay variation is moderate (50ms ±100ms)
- The AI comparison confused these two different scenarios
For general network use, v1 remains excellent based on all other localhost tests (loss, congestion, mixed conditions).
Critical Need: Real-World Testing
Community help is desperately needed for real-world testing. All these results come from localhost artificial testing. Real hardware validation is critical to understand:
- How these algorithms actually perform on production networks
- Whether v1's or v4's approach works better in real deployment scenarios
- What the actual trade-offs are outside of localhost simulation
- Which version should be recommended for real-world use
Community contributions are welcome if you can solve the pure delay problem without sacrificing v1's general-case performance.
Disclaimer: All results are from localhost artificial testing. None of these versions have been tested on real hardware. Real hardware validation is critically needed.
Key clarifications:
- v1 performs between good and excellent in extreme pure delay (100ms ±200ms)
- v1 performs good but not excellent in moderate pure delay (50ms ±100ms)
- v4 is experimental and has only been tested in one scenario beyond v1's test suite
- v1 remains the main version for general use based on localhost testing results across multiple scenarios
Community testing on real hardware is essential before production deployment. Pure delay-only scenarios are extremely rare edge cases - even in those, v1 performs between good and excellent in extreme conditions and acceptably (though not excellent) in moderate conditions.