Muhammed Shafin P
NDM-TCP: Lightweight Neural Networking for TCP Congestion Control

The Challenge: Intelligence Without Overhead

Most machine learning systems are resource-intensive. A typical deep learning model for network control might consume hundreds of megabytes of RAM, require TensorFlow or PyTorch libraries, and heavily utilize CPU resources. This works fine in data centers with abundant resources, but what about embedded systems, IoT devices, or resource-constrained environments?

NDM-TCP (Neural Differential Manifolds TCP) solves this challenge by bringing neural network intelligence to TCP congestion control while maintaining a minimal resource footprint comparable to traditional algorithms.


Memory Usage: Ultra-Compact Design

Understanding Memory: Per-Connection vs "The Model"

IMPORTANT CLARIFICATION: When we talk about memory usage in NDM-TCP, we need to distinguish between two concepts:

  1. Per-Connection State Memory: ~70 bytes per TCP connection (may vary by compiler: typically 72-88 bytes)
  2. "The Model" (Neural Network Weights): 0 bytes!

Note on struct size variation: The actual memory footprint depends on your compiler and CPU architecture due to memory alignment requirements. While the theoretical size is ~69 bytes, compilers add padding for optimal CPU access:

  • Theoretical size: 69 bytes (sum of all fields)
  • 32-bit systems: Typically ~72 bytes (4-byte alignment)
  • 64-bit systems: Typically ~80-88 bytes (8-byte alignment)
  • Verified at runtime: The kernel module prints the actual size during initialization

You can check the exact size on your system by loading the module and checking kernel logs:

sudo dmesg | grep "NDM-TCP: Structure size"
# Output: NDM-TCP: Structure size = 88 bytes (limit = 128 bytes)

This is radically different from traditional machine learning approaches:

Traditional ML TCP (e.g., DRL-based):

Neural network model (weights): 50-500 MB (stored once in memory)
Per-connection state: Additional 1-5 MB per connection
Total for 1000 connections: 50-500 MB + (1-5 MB × 1000) ≈ 1-5.5 GB

NDM-TCP's Revolutionary Approach:

Neural network model (weights): 0 bytes (NO MODEL STORED!)
Per-connection state: ~70 bytes per connection
Total for 1000 connections: 70 KB (yes, kilobytes!)

How NDM-TCP Has "No Model"

Traditional neural networks store weight matrices in memory. NDM-TCP uses a clever mathematical trick - it generates weights on-the-fly using a deterministic formula.

This means:

  • Zero memory for model weights
  • Same behavior every time (deterministic)
  • Unique weights for each neuron connection
  • No storage, just computation

The "model" exists only as a mathematical formula, not as data in memory!
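As a minimal sketch of this idea, a weight can be derived deterministically from the neuron and input indices each time it is needed. The function name and hash constants below are illustrative assumptions, not NDM-TCP's actual formula:

```c
#include <stdint.h>

/* Illustrative only: derive a small signed fixed-point weight for the
 * connection between neuron i and input j on demand, instead of storing
 * a weight matrix. Same (i, j) always yields the same weight. */
int32_t ndm_weight(uint32_t i, uint32_t j)
{
    uint32_t h = (i * 2654435761u) ^ (j * 40503u); /* multiplicative hash mix */
    h ^= h >> 16;                                  /* spread high bits down  */
    /* Map to a small signed range, e.g. [-128, 127] */
    return (int32_t)(h & 0xFF) - 128;
}
```

Because the weight is pure computation over the indices, the "model" costs a handful of CPU cycles per lookup and zero bytes of storage.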

Fixed Footprint Architecture

Total memory per TCP connection: ~70 bytes (actual struct size varies by compiler alignment, typically 72-88 bytes)

To put this in perspective:

  • A single emoji in a text message: ~4 bytes
  • NDM-TCP's entire state per connection: ~70 bytes
  • Traditional ML model weights alone: 50-500 MB (at least 700,000× larger!)
  • NDM-TCP for 10,000 connections: 700 KB total
  • Traditional ML for 10,000 connections: 500+ GB total

Memory Breakdown

Total: ~69 bytes (theoretical)
With compiler alignment padding: 72-88 bytes (actual)

Why the variation? Compilers insert padding bytes between struct fields to ensure proper memory alignment for the CPU architecture:

  • Padding for alignment: CPUs access memory more efficiently when data is aligned to word boundaries (4 or 8 bytes)
  • Architecture-dependent: 32-bit systems use 4-byte alignment, 64-bit systems use 8-byte alignment
  • Automatic optimization: The compiler does this automatically for performance reasons

The exact size will be printed when you load the kernel module - this ensures transparency and helps verify it fits within the kernel's size limit.

No Memory Growth Over Time

Unlike reinforcement learning approaches (like DQN-based systems) that accumulate experience in replay buffers, NDM-TCP uses a circular buffer strategy:

  • Hour 1: Uses ~70 bytes
  • Hour 100: Still uses ~70 bytes
  • Year 1: Still uses ~70 bytes

The memory footprint is identical whether handling the first packet or the millionth. This makes NDM-TCP predictable and reliable for long-running connections.
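The circular-buffer strategy can be sketched in a few lines; the function name and 16-slot window size here are assumptions for illustration:

```c
#include <stdint.h>

/* Illustrative sketch: new RTT samples overwrite the oldest slot of a
 * fixed 16-entry window, so memory never grows no matter how long the
 * connection lives. */
void record_rtt(uint32_t window[16], uint8_t *idx, uint32_t rtt_us)
{
    window[*idx] = rtt_us;     /* overwrite the oldest sample      */
    *idx = (*idx + 1) & 15;    /* wrap at 16 (power-of-two mask)   */
}
```

No allocation, no growth: the millionth sample lands in the same 64 bytes as the first.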

Kernel-Space Efficiency

NDM-TCP fits entirely within the Linux kernel's ICSK_CA_PRIV_SIZE buffer—a small memory region (typically 88-128 bytes) allocated per TCP connection.

This means:

  • ✅ No dynamic memory allocation during operation
  • ✅ No risk of memory fragmentation
  • ✅ No garbage collection pauses
  • ✅ Zero external dependencies (no Python, TensorFlow, or CUDA)
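To illustrate how a fixed-footprint state struct can be verified to fit the per-connection buffer at compile time, here is a hypothetical layout. The field names and sizes are invented for this sketch, not NDM-TCP's actual struct; a kernel module would use BUILD_BUG_ON against ICSK_CA_PRIV_SIZE instead of the portable _Static_assert shown:

```c
#include <stdint.h>

/* Hypothetical per-connection state in the spirit described above. */
struct ndm_state {
    uint32_t rtt_window[16];  /* recent RTT samples for entropy   */
    int16_t  hidden[8];       /* recurrent hidden activations     */
    uint32_t entropy;         /* last Shannon-entropy estimate    */
    uint8_t  sample_idx;      /* circular-buffer cursor           */
};

/* Portable stand-in for the kernel's compile-time size guard. */
_Static_assert(sizeof(struct ndm_state) <= 128,
               "state must fit in ICSK_CA_PRIV_SIZE");
```

Because the whole state lives inside the buffer the kernel already allocates per socket, there is no malloc/free path to fail or fragment.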

Memory Comparison

| Algorithm | Memory per Connection | Dependencies |
|---|---|---|
| TCP Reno | ~40 bytes | None |
| TCP Cubic | ~50 bytes | None |
| TCP BBR | ~80 bytes | None |
| NDM-TCP | ~72-88 bytes (varies by compiler/arch) | None |
| PCC Vivace | ~1 KB | User-space library |
| DRL-based TCP | 50+ MB | Python, TensorFlow/PyTorch |

Note: NDM-TCP's actual size depends on your system architecture and compiler.


CPU Usage: Efficient by Design

Computational Analysis

NDM-TCP performs two main operations per ACK:

1. Entropy Calculation (Every 8 Packets)

Operations:

  • Find min/max in 16-element array: ~32 comparisons
  • Create 16-bin histogram: ~16 divisions + 16 array writes
  • Calculate Shannon entropy: ~16 log₂ approximations + multiplications

Estimated cost: ~200-300 CPU cycles (amortized to ~25-40 cycles per packet)
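The min/max and histogram steps above can be sketched as follows. This is an illustrative simplification: it returns the number of occupied bins as a cheap spread measure, where a real entropy pass would additionally apply a fixed-point log₂ over the bucket counts:

```c
#include <stdint.h>

#define SAMPLES 16
#define BINS 16

/* Illustrative: bin 16 RTT samples into a 16-bucket histogram and count
 * occupied buckets. Many occupied buckets = noisy, spread-out RTTs;
 * one bucket = stable RTTs. */
int rtt_histogram_spread(const uint32_t rtt[SAMPLES])
{
    uint32_t min = rtt[0], max = rtt[0];
    uint32_t hist[BINS] = {0};
    int i, occupied = 0;

    for (i = 1; i < SAMPLES; i++) {       /* ~32 comparisons for min/max */
        if (rtt[i] < min) min = rtt[i];
        if (rtt[i] > max) max = rtt[i];
    }
    uint32_t range = (max - min) ? (max - min) : 1;
    for (i = 0; i < SAMPLES; i++)         /* 16 divisions + 16 writes    */
        hist[(rtt[i] - min) * (BINS - 1) / range]++;
    for (i = 0; i < BINS; i++)
        if (hist[i])
            occupied++;
    return occupied;
}
```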

2. Neural Network Forward Pass (Every ACK)

Operations:

  • Input normalization: 8 operations
  • Hidden layer: 8 neurons × 8 inputs = 64 multiplications + 64 additions
  • Activation functions: 8 tanh approximations (~5 ops each)
  • Recurrent connections: 8 multiplications + additions
  • Output layer: 8 multiplications + additions
  • Final activation: 1 sigmoid approximation

Estimated cost: ~250-350 CPU cycles per ACK
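The 8×8 multiply-accumulate core of the forward pass can be sketched compactly. Everything here is an illustrative assumption, not NDM-TCP's actual network: the weights come from a toy deterministic hash, and a saturating clamp stands in for the tanh approximation:

```c
#include <stdint.h>

#define N 8

/* Toy deterministic weight in [-128, 127] (stands in for the on-the-fly
 * weight formula; constants are illustrative). */
static int32_t toy_weight(uint32_t i, uint32_t j)
{
    uint32_t h = (i * 2654435761u) ^ (j * 40503u);
    return (int32_t)((h ^ (h >> 16)) & 0xFF) - 128;
}

/* Crude tanh stand-in: clamp to +/-256 (i.e. +/-1.0 with 256 = 1.0). */
static int32_t sat256(int64_t x)
{
    if (x > 256)  return 256;
    if (x < -256) return -256;
    return (int32_t)x;
}

/* 8 neurons x 8 inputs = 64 multiply-accumulates, then activation. */
void forward_pass(const int32_t in[N], int32_t hidden[N])
{
    for (int i = 0; i < N; i++) {
        int64_t acc = 0;
        for (int j = 0; j < N; j++)
            acc += (int64_t)toy_weight(i, j) * in[j];
        hidden[i] = sat256(acc >> 8);   /* rescale, then saturate */
    }
}
```

At 64 multiply-adds plus 8 cheap activations, the per-ACK cycle estimate above is plausible on any scalar integer pipeline.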

Total Overhead per ACK

~300-400 CPU cycles (entropy amortized + neural network)

On a modern 2 GHz processor, this translates to approximately 0.15-0.2 microseconds per packet—negligible compared to typical network round-trip times (10-100 milliseconds).

Integer-Only Mathematics

NDM-TCP avoids expensive floating-point operations by using fixed-point integer arithmetic.

Benefits:

  • 3-5× faster than floating-point math on processors without FPU
  • Fully deterministic execution time (no variable-latency FP operations)
  • Zero dependency on math libraries or hardware acceleration
  • Works on embedded processors without FPU
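The style of arithmetic involved can be sketched with a fixed-point multiply; the Q16.16 format here is an assumption for illustration, not necessarily the format NDM-TCP uses:

```c
#include <stdint.h>

/* Q16.16 fixed point: 16 integer bits, 16 fractional bits. */
typedef int32_t q16_t;
#define Q16_ONE (1 << 16)   /* 1.0 in Q16.16 */

/* Widen to 64 bits so the intermediate product cannot overflow,
 * then shift back down to Q16.16. Pure integer ops, no FPU needed. */
q16_t q16_mul(q16_t a, q16_t b)
{
    return (q16_t)(((int64_t)a * b) >> 16);
}
```

A multiply like this costs one integer multiply and one shift, with the same latency every time, which is what makes the per-ACK cycle counts deterministic.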

No Training Overhead

Unlike deep learning systems that require:

  • Periodic retraining (hours to days of GPU time)
  • Batch processing of experience replay
  • Gradient descent updates
  • Model checkpointing

NDM-TCP uses real-time adaptation—neurons adjust their internal state based on network patterns as they happen, with zero offline training required.


Performance Benchmarks

Real-World Test Results

All tests conducted using iperf3 on Linux with various network conditions:

Test 1: Clean Network (Loopback, No Artificial Impairment)

| Protocol | Transfer | Bitrate | Retransmissions |
|---|---|---|---|
| NDM-TCP | 838 MB | 702 Mbps | 10 |
| TCP Cubic | 825 MB | 692 Mbps | 20 (2× more retrans) |
| TCP Reno | 740 MB | 620 Mbps | 22 (2.2× more retrans) |

Result: NDM-TCP achieved highest throughput with 50-55% fewer retransmissions

Test 2: Constrained Network (20ms delay ±5ms, 0.5% loss, 50 Mbit cap)

| Protocol | Transfer | Bitrate | Retransmissions |
|---|---|---|---|
| NDM-TCP | 120 MB | 50.4 Mbps | 43 |
| TCP Cubic | 120 MB | 50.5 Mbps | 94 (2.2× more retrans) |
| TCP Reno | 119 MB | 50.1 Mbps | 101 (2.3× more retrans) |

Result: NDM-TCP maintained throughput with 54-57% fewer retransmissions under stress

Why Fewer Retransmissions Matter

Each retransmission represents:

  • ❌ Wasted bandwidth
  • ❌ Increased latency (waiting for retransmit)
  • ❌ Extra CPU cycles (processing duplicate packets)
  • ❌ Battery drain (on mobile devices)

NDM-TCP's 50%+ reduction in retransmissions means:

  • ✅ More efficient bandwidth utilization
  • ✅ Lower application latency
  • ✅ Reduced CPU usage overall (despite NN overhead)
  • ✅ Better battery life on mobile devices

CPU Overhead Analysis

Overhead Comparison

Based on the algorithmic complexity and operations count:

| Algorithm | Estimated Cycles/ACK | Overhead vs Reno |
|---|---|---|
| TCP Reno (baseline) | ~15-20 | 1.0× |
| TCP Cubic | ~25-35 | ~1.7× |
| TCP BBR | ~80-120 | ~5× |
| NDM-TCP | ~300-400 | ~18× vs Reno, ~10× vs Cubic |
| DRL-based TCP | ~50,000+ | ~2,500×+ |

Note: While NDM-TCP has higher per-packet CPU overhead than traditional algorithms, it's still 125× more efficient than deep learning approaches.

Real-World Impact

The overhead becomes meaningful only at very high packet rates:

100 Mbps Network (Typical Internet)

Packet rate: ~8,300 packets/sec
NDM-TCP CPU: 8,300 × 350 cycles = 2.9M cycles/sec
On 2 GHz CPU: 0.14% of one core ✅

Verdict: Completely negligible

1 Gbps Network (High-Speed)

Packet rate: ~83,000 packets/sec

NDM-TCP CPU: 83,000 × 350 cycles = 29M cycles/sec
On 2 GHz CPU: 1.4% of one core ✅

Verdict: Still very low

10 Gbps Network (Datacenter)

Packet rate: ~830,000 packets/sec
NDM-TCP CPU: 830,000 × 350 cycles = 290M cycles/sec

On 2 GHz CPU: 14.5% of one core ⚠️

Verdict: Noticeable but acceptable

100 Gbps Network (Extreme)

Packet rate: ~8.3M packets/sec
NDM-TCP CPU: 8.3M × 350 cycles = 2.9B cycles/sec
On 2 GHz CPU: 145% of one core (needs 2 cores) ⚠️

Verdict: Would need optimization for single connection
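The arithmetic in the four scenarios above can be packaged into one small helper; the 350-cycle and 2 GHz figures are the estimates used throughout this section:

```c
#include <stdint.h>

/* Fraction of one 2 GHz core consumed by NDM-TCP at a given packet rate,
 * in basis points (hundredths of a percent), assuming ~350 cycles/ACK.
 * Integer math throughout, so no rounding surprises. */
uint64_t core_load_bp(uint64_t packets_per_sec)
{
    const uint64_t cycles_per_ack = 350;
    const uint64_t cpu_hz = 2000000000ULL;   /* 2 GHz */
    return packets_per_sec * cycles_per_ack * 10000 / cpu_hz;
}
```

For example, 8,300 packets/sec works out to 14 basis points, i.e. the 0.14% figure quoted for the 100 Mbps case.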

Justifying the Overhead

The CPU overhead is justified by the gains:

Clean Network Test:

  • NDM-TCP: 10 retransmissions
  • Cubic: 20 retransmissions (100% more)

Each avoided retransmission saves:

  • Packet processing: ~200-500 cycles
  • Network stack overhead: ~500-1000 cycles
  • Application notification: ~100-300 cycles
  • Total saved: ~800-1800 cycles per avoided retransmission

Return on Investment:

  • NDM-TCP spends: ~300 extra cycles per ACK
  • NDM-TCP avoids: 10 retransmissions @ ~1,000 cycles each = ~10,000 cycles of processing saved
  • Beyond raw cycles, each avoided retransmission also saves a full round trip of latency and the bandwidth of the resent segment
  • Net benefit: positive, since the latency and bandwidth savings far outweigh the per-ACK cycle cost

Memory Scalability

Server with 10,000 Concurrent Connections

| Algorithm | Per Connection | Total for 10k | Scaling |
|---|---|---|---|
| TCP Reno | ~40 bytes | 400 KB | Linear |
| TCP Cubic | ~50 bytes | 500 KB | Linear |
| NDM-TCP | ~72-88 bytes | ~720-880 KB | Linear |
| BBR | ~80 bytes | 800 KB | Linear |
| DRL-based TCP | ~50 MB | 500 GB | Non-linear (grows with experience) |

NDM-TCP uses less than 1 MB for 10,000 connections—comparable to a single smartphone photo, while DRL approaches would require half a terabyte!

The exact memory usage scales linearly with connections and depends on your system architecture. On a typical 64-bit Linux system with ~88 bytes per connection, 10,000 connections use just 880 KB total.


The Engineering Trade-off

NDM-TCP occupies a unique sweet spot in the performance-intelligence spectrum:

More Intelligent Than:

  • Classic algorithms (Reno, Cubic, Vegas) that can't distinguish random noise from actual congestion
  • Simple heuristics that make fixed decisions regardless of network patterns

More Efficient Than:

  • Deep learning systems (DRL-based TCP) that require massive computational resources
  • User-space ML solutions that involve kernel-userspace context switching overhead

The Result:

Adaptive, entropy-aware congestion control that:

  • ✅ Distinguishes noise from congestion using Shannon entropy
  • ✅ Adapts in real-time with neural network intelligence
  • ✅ Runs efficiently on any hardware from IoT to datacenter
  • ✅ Uses minimal, fixed memory (no growth over time)
  • ✅ Requires no external dependencies or training infrastructure

Why This Matters

Use Cases Where NDM-TCP Excels

Wireless Networks (WiFi, 4G/5G)

  • High variability in RTT due to signal fluctuations
  • Entropy analysis distinguishes interference from congestion
  • Result: Fewer unnecessary slowdowns

Satellite Links

  • Long delays with variable jitter
  • Traditional TCP over-reacts to delay variations
  • Result: Better throughput on high-latency paths

Mobile Devices

  • Battery life matters
  • 50% fewer retransmissions = 50% less radio activity
  • Result: Extended battery life

IoT and Embedded Systems

  • Limited CPU and memory
  • ~70 bytes per connection is acceptable
  • Result: ML-enhanced TCP on resource-constrained devices

Congested Home Networks

  • Multiple devices competing
  • Bufferbloat and variable delays common
  • Result: Smarter adaptation to real network conditions

Acknowledgments

All performance tests were conducted transparently with real hardware and documented results. The approach is fully open for peer review and independent verification.

Test Environment: Linux kernel with iperf3, various network conditions simulated with tc netem

Code: Pure C implementation as Linux kernel module, GPL v2 licensed

Results: Reproducible with the provided benchmarks

Implementation: Linux Kernel Module | License: GPL v2
