InstaTunnel Team
Published by our engineering team
The TCP-over-TCP Tax: An Architectural Autopsy
Your tunnel isn’t slow because of your ISP; it’s slow because your packets are stuck in a “double-retransmission” loop. To a systems engineer, high-speed fiber feels like a dial-up connection the moment you wrap a TCP stream inside another TCP stream. This phenomenon, known in networking circles as the TCP-over-TCP Tax (or more dramatically, the TCP Meltdown), is a classic architectural anti-pattern.
In this autopsy, we will dissect the mathematical and algorithmic reasons why SSH tunnels, OpenVPN-TCP, and other nested TCP architectures fail under even minor packet loss, and why modern alternatives like WireGuard and QUIC are the only cure for “sluggish” tunnels.
The Anatomy of Encapsulation

To understand the tax, we must first look at the stack. When you run an SSH tunnel or a TCP-based VPN, you aren’t just sending data; you are encapsulating a full Inner TCP state machine inside an Outer TCP state machine.
In a standard non-tunneled connection, TCP manages flow control and reliability directly over the IP layer (which is “best-effort” and unreliable). However, in a tunnel:
The Inner TCP (your application) thinks it is talking to a remote host. It manages sequence numbers, ACKs, and a congestion window (cwnd).
The Outer TCP (the tunnel) sees the inner packets as raw payload. It adds its own sequence numbers, ACKs, and its own cwnd.
On a perfect network, this works. But the internet is never perfect.
The Mathematical Autopsy: Why 1% Loss Feels Like 90%

The core of the “Tax” is how the two layers react to packet loss. In a single-layer TCP connection, the throughput is roughly governed by the Mathis Equation:
Throughput ≤ MSS / (RTT × √p)
Where:
- MSS: Maximum Segment Size
- RTT: Round-Trip Time
- p: Probability of packet loss
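Plugging numbers into the bound shows how brutally the √p term bites. A quick sketch (the segment size and RTT values are illustrative):

```python
import math

def mathis_throughput(mss_bytes, rtt_s, loss_prob):
    """Upper bound on TCP throughput (bytes/s) per the Mathis equation."""
    return mss_bytes / (rtt_s * math.sqrt(loss_prob))

mss, rtt = 1460, 0.050  # 1460-byte segments, 50 ms RTT
for p in (0.0001, 0.001, 0.01):
    mbps = mathis_throughput(mss, rtt, p) * 8 / 1e6
    print(f"loss {p:.2%}: <= {mbps:.1f} Mbit/s")
```

On a 50 ms path, 1% loss caps a single TCP connection near 2.3 Mbit/s no matter how fat the pipe is; nesting a second TCP layer only compounds the damage.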
When you nest TCP, the probability of loss p doesn’t just reduce throughput linearly; it triggers a synchronization conflict between two independent timers.
The Double-Retransmission Loop
Imagine a single packet is lost on the physical wire:
The Outer TCP’s Response: The tunnel detects the missing packet and stops everything to retransmit it. The tunnel’s cwnd is halved.
The Inner TCP’s Perspective: While the tunnel is busy retransmitting, the inner application packet is “stuck” in the tunnel’s buffer. The Inner TCP sees a massive spike in RTT.
The Meltdown: If the time it takes for the Outer TCP to successfully retransmit is longer than the Inner TCP’s Retransmission Timeout (RTO), the Inner TCP also decides the packet is lost. It triggers its own retransmission.
Now, you have two layers sending the same data. The tunnel is already congested, and the application just doubled the load. This creates a feedback loop where the buffers fill up with redundant retransmissions, pushing the effective RTT to the seconds range.
According to recent research, the control loops in the inner and outer TCPs interfere destructively with each other because the outer TCP masks the connection’s health from the inner TCP. This interference is what transforms minor packet loss into catastrophic performance degradation.
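The timing race above can be caricatured in a few lines of code. This is a deliberately crude toy model, not a packet-level simulation: we assume the outer TCP needs roughly two RTTs to recover a loss, and that every premature inner retransmission adds its payload to the tunnel’s queue. All constants are ours, chosen for illustration:

```python
def rtt_trajectory(start_rtt, base_rtt=0.05, inner_rto=0.2, rounds=4):
    """Toy model of the feedback loop: if the outer TCP's loss recovery
    (~2 RTTs here) outlasts the inner RTO, the inner TCP retransmits too,
    and the duplicate data queues inside the tunnel, inflating RTT."""
    rtt, queued = start_rtt, 0.0
    trace = []
    for _ in range(rounds):
        outer_recovery = 2 * rtt           # time for the outer TCP to recover
        if outer_recovery > inner_rto:     # inner timer fires first: duplicates
            queued += outer_recovery       # redundant bytes sit in the buffer
        rtt = base_rtt + queued
        trace.append(round(rtt, 2))
    return trace

print(rtt_trajectory(0.05))   # healthy: recovery beats the inner RTO
print(rtt_trajectory(0.15))   # one congested moment tips it into meltdown
```

Starting from a healthy 50 ms RTT the trajectory stays flat, but a single congested moment (150 ms) tips the loop: in this toy model the trace climbs to roughly 0.35, 1.05, 3.15, then 9.45 seconds, matching the “effective RTT in the seconds range” failure mode.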
Head-of-Line (HoL) Blocking: The Sequential Prison

TCP is a reliable byte stream protocol. It guarantees that the application receives data in the exact order it was sent. This is its greatest strength and, in a tunnel, its fatal flaw.
The Sequential Queue
If Packet #1 is lost but Packets #2, #3, and #4 arrive safely, the TCP stack cannot hand Packets #2–4 to the application. They must sit in the buffer until Packet #1 is retransmitted and arrives.
In an SSH tunnel, this effect is global. If you are multiplexing multiple streams (e.g., a database connection and a web request) through one SSH tunnel, a single lost packet in the database stream will block the web request from finishing, even if the web packets arrived perfectly.
Mathematics of the Wait
The probability that a packet is delayed by HoL blocking in a stream with n outstanding packets and loss rate p is roughly 1 - (1-p)^n. As n (the window size) increases to fill a high-bandwidth pipe, the likelihood of a stall approaches 100%, even with p < 0.1%.
This problem becomes even more pronounced in modern high-bandwidth environments where window sizes are necessarily large to fill the bandwidth-delay product.
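Sizing the window for the bandwidth-delay product makes the stall odds concrete (the link speed and RTT here are illustrative):

```python
def stall_probability(loss, window_pkts):
    """Probability that at least one packet in the window is lost,
    stalling delivery of everything queued behind it."""
    return 1 - (1 - loss) ** window_pkts

# Window needed to fill a 1 Gbit/s pipe at 50 ms RTT (the BDP):
bdp_bytes = (1e9 / 8) * 0.050        # 6.25 MB in flight
window = int(bdp_bytes / 1460)       # ~4280 segments of 1460 bytes

print(window, f"{stall_probability(0.001, window):.1%}")
```

At 0.1% loss, a window large enough to fill a 1 Gbit/s, 50 ms path is missing at least one packet about 99% of the time, so the stream spends almost all of its life waiting on a retransmission.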
Real-World Impact: Recent Benchmarks and Case Studies

WireGuard vs. Traditional VPNs

Recent comprehensive studies provide concrete evidence of the TCP-over-TCP tax. In VMware environments, WireGuard demonstrated superior TCP throughput at 210.64 Mbps compared to OpenVPN’s 110.34 Mbps, with significantly lower packet loss of 12.35% versus 47.01%.
Field testing has shown even more dramatic results. In practical benchmark conditions, WireGuard was on average 3.3 times faster than OpenVPN, demonstrating the real-world cost of the TCP-over-TCP architecture.
Performance at Scale
Modern VPN solutions have evolved significantly. On gigabit networks, Netmaker achieved consistent data transfer speeds averaging 7.88 Gbits/sec, nearly identical to kernel WireGuard alone at 7.89 Gbits/sec. This near-native performance is possible only because WireGuard avoids the TCP-over-TCP tax entirely.
The performance gap widens further under challenging network conditions. WireGuard achieves its performance through several key factors: a lean design of approximately 4,000 lines of code compared to OpenVPN’s tens of thousands, modern ChaCha20-Poly1305 encryption that runs efficiently on all processors, and kernel integration that processes packets without expensive context switches.
Real-World Deployment Results
Consumer-grade hardware shows impressive WireGuard performance. Recent benchmarks on modern routers demonstrate that WireGuard can achieve nearly full link rate on symmetric gigabit fiber connections, with performance around 1,080 Mbps even on mid-range hardware.
The “Meltdown” and Congestion Control Conflict

TCP algorithms like CUBIC or NewReno are designed to probe for bandwidth until they see a drop. When two such algorithms are nested, they fight for the same resource:
The Outer TCP tries to fill the pipe.
The Inner TCP also tries to fill the pipe.
When a drop occurs, both back off.
Because the Outer TCP buffers the Inner TCP’s ACKs, the Inner TCP’s RTT estimation becomes “noisy.” It cannot accurately calculate the bandwidth-delay product (BDP).
This “ACK Compression” makes it impossible for the inner connection to ever reach a stable, high-speed steady state. Stacking TCP connections this way is what produces the TCP meltdown: a slower outer connection causes the upper layer to queue up more retransmissions than the lower layer is able to process.
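The consequence of that noisy RTT signal shows up directly in TCP’s standard retransmission-timeout estimator (RFC 6298 keeps a smoothed RTT plus four times its variance). A sketch with illustrative samples; the 400 ms spikes stand in for ACKs delayed by the outer TCP’s buffer:

```python
def rto_estimate(samples, alpha=1/8, beta=1/4):
    """Retransmission timeout per RFC 6298: SRTT + 4 * RTTVAR."""
    srtt, rttvar = samples[0], samples[0] / 2
    for s in samples[1:]:
        rttvar = (1 - beta) * rttvar + beta * abs(srtt - s)
        srtt = (1 - alpha) * srtt + alpha * s
    return srtt + 4 * rttvar

steady   = [0.050] * 20                 # direct path: flat 50 ms RTT
tunneled = [0.050, 0.050, 0.400] * 7    # tunnel buffering adds 400 ms spikes

print(f"direct RTO:   {rto_estimate(steady) * 1000:.0f} ms")
print(f"tunneled RTO: {rto_estimate(tunneled) * 1000:.0f} ms")
```

With spiky samples the variance term dominates and the inner RTO inflates to several hundred milliseconds, so the inner TCP notices real losses far later than a directly connected peer would.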
The MTU/MSS “Silent Killer”

Even if your packet loss is zero, you might still be paying a “fragmentation tax.”
A standard Ethernet frame is 1500 bytes. SSH and VPNs add headers (encryption, encapsulation). If the resulting packet is 1520 bytes, it must be fragmented into two packets.
Fragmentation:
- Doubles the packet count
- Doubles the interrupt overhead on the CPU
- If one fragment is lost, the entire original packet is discarded
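As a concrete overhead budget, consider WireGuard over IPv4 (header sizes taken from the respective specifications):

```python
# Per-packet overhead budget for a WireGuard tunnel over IPv4
ETH_MTU  = 1500  # standard Ethernet frame payload
IPV4_HDR = 20    # outer IPv4 header
UDP_HDR  = 8     # outer UDP header
WG_HDR   = 32    # WireGuard data-message header (16) + Poly1305 tag (16)

inner_packet_max = ETH_MTU - IPV4_HDR - UDP_HDR - WG_HDR
print(inner_packet_max)   # largest inner packet that avoids fragmentation
```

That leaves 1440 bytes for the inner packet; WireGuard’s default MTU of 1420 shaves a further 20 bytes so the same tunnel also survives a 40-byte IPv6 outer header.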
For an SSH tunnel, you must “clamp” the Maximum Segment Size (MSS) to ensure the inner TCP segments are small enough to fit inside the tunnel’s payload without fragmentation.
Example: IPTables MSS Clamping
# Prevent fragmentation by clamping MSS to Path MTU
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
The Modern Solution: UDP-Based Encapsulation

The engineering consensus is clear: tunnels should be stateless.
Why WireGuard Wins
WireGuard uses UDP. If an encrypted packet is lost, WireGuard doesn’t care. It doesn’t retransmit. It doesn’t have a congestion window. It simply passes the responsibility of “reliability” back to the Inner TCP.
Advantages:
No Double-Retransmission: Only one layer (the application) handles recovery.
No HoL Blocking: Packet #2 can be decrypted and delivered even if Packet #1 is missing.
Kernel Integration: WireGuard lives in the Linux kernel (since version 5.6), avoiding the context-switching overhead that plagues userspace SSH tunnels.
Because UDP never retransmits on its own, only the inner TCP connection retransmits lost packets; nothing stacks up, and the meltdown problem disappears.
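A stateless UDP forwarder is almost trivially small. The hypothetical relay below (plain Python sockets, not WireGuard’s actual code, and without the encryption a real tunnel would add) forwards datagrams without ACKing, buffering, or retransmitting; reliability stays with the inner TCP:

```python
import socket
import threading

def start_relay(target_addr):
    """Start a one-way UDP relay on an ephemeral loopback port and return
    its address. It keeps no per-packet state: a lost datagram is simply
    gone, and the inner TCP (not the tunnel) decides whether to resend."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))

    def pump():
        while True:
            data, _ = sock.recvfrom(65535)   # one datagram in...
            sock.sendto(data, target_addr)   # ...one datagram out, as-is

    threading.Thread(target=pump, daemon=True).start()
    return sock.getsockname()
```

Unlike an `ssh -L` forward, there is no second ACK clock to fight with the application’s; that absence is the essence of why WireGuard-style transports sidestep the meltdown.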
Why QUIC (HTTP/3) is the Future
QUIC essentially builds a “smarter” TCP on top of UDP. It supports Multiplexed Streams without HoL Blocking. If one stream in a QUIC tunnel loses a packet, other streams continue unaffected.
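The difference is easy to see in a toy per-stream reassembler (a sketch of the idea, not QUIC’s actual wire format):

```python
class StreamReceiver:
    """Toy per-stream reassembly in the spirit of QUIC: each stream is
    reordered independently, so a gap in one stream never blocks another."""

    def __init__(self):
        self.next_off = {}   # stream_id -> next byte offset expected
        self.pending  = {}   # stream_id -> {offset: chunk} awaiting gap fill
        self.output   = {}   # stream_id -> bytes delivered to the app

    def receive(self, sid, offset, chunk):
        self.next_off.setdefault(sid, 0)
        self.pending.setdefault(sid, {})[offset] = chunk
        self.output.setdefault(sid, b"")
        # deliver whatever is now contiguous -- on this stream only
        while self.next_off[sid] in self.pending[sid]:
            piece = self.pending[sid].pop(self.next_off[sid])
            self.output[sid] += piece
            self.next_off[sid] += len(piece)

rx = StreamReceiver()
rx.receive(1, 5, b"WORLD")   # stream 1 is missing bytes 0-4
rx.receive(2, 0, b"OK")      # stream 2 delivers immediately anyway
print(rx.output)             # stream 2 done while stream 1 waits
rx.receive(1, 0, b"HELLO")   # gap filled; stream 1 catches up
print(rx.output[1])
```

In the TCP equivalent there is a single `next_off` for the whole connection, which is exactly the sequential prison described above.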
QUIC Performance Breakthroughs
Recent real-world deployments demonstrate QUIC’s advantages:
Google’s original paper describing QUIC’s basic mechanisms reported an 8% average reduction in the time it took to load Google search results on desktop and 3.6% on mobile, with up to 16% faster loading for the slowest 1% of users.
For video streaming, the impact is even more dramatic. When looking at YouTube video streaming, researchers found up to 20% less video stalling in countries such as India.
Major CDN providers have validated these benefits at scale. For one large Akamai media customer during a European football live-streaming event popular in Latin America, about 69% of HTTP/3 connections reached a throughput of 5 Mbps or more compared to only 56% of HTTP/2 connections.
Connection Establishment Speed
By using QUIC, HTTP/3 can establish connections up to 33% faster compared to HTTP/2 by combining the transport and TLS handshakes into a single step, reducing the number of round trips needed.
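The arithmetic behind that figure is straightforward. Counting round trips to the first response byte (a TCP handshake plus TLS 1.3 costs two RTTs before the request; QUIC folds transport and TLS into one):

```python
def time_to_first_byte(rtt_ms, handshake_rtts):
    """Round trips before the first response byte arrives:
    the handshake RTTs plus one RTT for the request itself."""
    return (handshake_rtts + 1) * rtt_ms

rtt = 100  # ms, an illustrative mobile-ish RTT
tcp_tls13 = time_to_first_byte(rtt, 2)  # TCP SYN/ACK + TLS 1.3 = 2 RTTs
quic      = time_to_first_byte(rtt, 1)  # combined transport+TLS = 1 RTT
print(tcp_tls13, quic, f"{(tcp_tls13 - quic) / tcp_tls13:.0%} faster")
```

Three round trips shrink to two, which is where the roughly one-third reduction comes from; with 0-RTT resumption the gap widens further.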
HTTP/3 Adoption in 2024-2025
The protocol has moved from experimental to mainstream. By 2024-2025, major web browsers—Chrome, Firefox, Safari, Edge—all support HTTP/3 by default, and a substantial, still-growing share of web requests now use HTTP/3.
Major internet companies are leading the charge. Large companies such as Google and Meta are heavily using HTTP/3, meaning a large chunk of current Internet traffic already uses HTTP/3 today.
Real-World HTTP/3 Performance
Comprehensive benchmarks show consistent improvements. HTTP/3 was faster than its predecessor in every tested case; its truly multiplexed design means no head-of-line blocking occurs anywhere in the stack.
Mobile users see particularly significant benefits. A 2025 Akamai report shows HTTP/3 reduces latency by 30% on mobile networks, addressing one of the most challenging environments for traditional TCP.
Why are major platforms investing in the more expensive QUIC/HTTP/3 deployment? QUIC and HTTP/3 cost more CPU time and memory to host than TCP and HTTP/2, largely because of the more extensive encryption, but the protocols’ benefits evidently offset that cost.
This is why tools like cloudflared (Cloudflare Tunnels) have migrated to QUIC as their default transport.
The Engineer’s Checklist: Avoiding the Tax
If your tunnel feels sluggish despite your fiber connection, run through this architectural autopsy:

Stop using TCP-based VPNs
Switch to WireGuard or Tailscale. If you must use OpenVPN, switch the transport from TCP to UDP.
Why it matters: despite being considered the more reliable of the two protocols, a TCP VPN ironically only works reliably while the line is near perfect. Even when it doesn’t break down completely, duplicate retransmissions eventually pile up, increasing latency and noticeably decreasing VPN throughput.
Audit your SSH Tunnels
For development, ssh -L is fine. For production data movement, it is a bottleneck. Consider socat over UDP or a QUIC-based proxy.

Check your MTU
Use the following command to find the largest packet that passes without fragmentation:
ping -M do -s 1472 <destination>
If your tunnel is active, this number will be lower (usually 1420 or 1380). Adjust your MTU settings accordingly.
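If you want to automate the probe, the search over packet sizes is a textbook binary search. The `probe` callback here is a stand-in you would implement yourself, e.g. by shelling out to `ping -M do`; the lambda below merely simulates a 1420-byte path:

```python
def find_path_mtu(probe, lo=576, hi=1500):
    """Binary-search the largest IP packet size that passes unfragmented.
    `probe(size)` should send a don't-fragment probe of that total size
    (for ping, payload = size - 28 bytes of IP+ICMP headers) and return
    True on success. Assumes the probe result is monotonic in size."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if probe(mid):
            lo = mid          # mid fits: the answer is mid or larger
        else:
            hi = mid - 1      # mid fragments: the answer is smaller
    return lo

# Simulated tunnel whose path MTU is 1420 bytes:
print(find_path_mtu(lambda size: size <= 1420))
```

About ten probes cover the whole 576-1500 range, versus dozens for a linear scan.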
Check for “Bufferbloat”

Nested TCP exacerbates bufferbloat. Use fq_codel or CAKE as your queue discipline (qdisc) on the tunnel interface to mitigate latency spikes:
# Set CAKE qdisc on tunnel interface
tc qdisc replace dev wg0 root cake bandwidth 100mbit
Consider Modern Alternatives
For web applications, leverage HTTP/3 where possible. For custom tunneling needs, evaluate QUIC-based solutions rather than defaulting to SSH or OpenVPN-TCP.

Understanding the Trade-offs
When TCP-over-TCP Might Still Be Used
Despite all the drawbacks, TCP-over-TCP tunnels persist in specific scenarios:
Firewall Traversal: Some corporate networks only allow outbound TCP connections on port 443.
Legacy Systems: Existing infrastructure may not support UDP-based protocols.
Simplicity: SSH tunnels are ubiquitous and require no additional software.
However, even these constraints have a standard workaround: if your inner traffic must ride over TCP, run it through a UDP-based outer tunnel to sidestep the meltdown problem.
The Performance Reality
Understanding the specific conditions where TCP-over-TCP fails is crucial for system design:
A TCP tunnel aggregates the packets sent between end hosts into a single TCP connection. Since most applications on those hosts also use TCP, two congestion-control loops operate simultaneously and interfere with each other.

The result is an outer TCP with a severely reduced congestion window, an inflated retransmission timeout, and a full send buffer; at that point the inner TCP cannot write, and ACKs stop flowing in either direction.
Looking Forward: The Post-TCP Internet

The shift from TCP to UDP-based protocols represents a fundamental rethinking of internet transport:
QUIC’s Architectural Innovations
QUIC integrates tightly with the Transport Layer Security (TLS) protocol, and TLS encrypts large parts of the QUIC protocol itself. Metadata such as packet numbers and connection-close signals, which every middlebox could inspect in TCP, are visible only to the client and server in QUIC.
This integration provides both security and performance benefits, reducing connection setup latency while increasing privacy.
Stream Multiplexing Done Right
HTTP/2 knows about streams, but TCP does not: frames from different streams are serialized into one ordered byte stream, so when a segment is lost in transit, TCP retransmits it after a timeout and blocks every segment behind it, regardless of which stream it belongs to.
QUIC solves this by making streams a first-class concept at the transport layer, eliminating the head-of-line blocking that plagues HTTP/2 over TCP.
Mobile and Unreliable Networks
QUIC can persist a connection across network changes, unlike TCP which requires the same endpoint (IP and port) throughout the connection. This allows seamless handoffs when users switch between WiFi and mobile data—a common scenario TCP was never designed to handle.
Conclusion
The TCP-over-TCP Tax is a fundamental conflict of interest between two layers of reliability. By trying to be “too reliable,” the tunnel becomes unusable.
The evidence is overwhelming:
- WireGuard consistently outperforms TCP-based VPNs by 2-3x
- HTTP/3 delivers measurable improvements in real-world deployments
- Major internet platforms have adopted QUIC despite higher CPU costs
- Modern kernels include WireGuard natively, signaling industry consensus
In the world of hard engineering, the fastest path is often the one that knows when to let a packet go. Use UDP for your tunnels, let TCP handle the data at the application layer, and stop paying the tax.
The future of internet transport is clear: stateless tunneling with UDP-based protocols like WireGuard and QUIC. As networks become faster and more complex, the architectural lessons of the TCP-over-TCP problem become even more relevant. Choose your protocols wisely, understand the trade-offs, and build systems that work with network realities rather than against them.
Additional Resources
WireGuard Official Documentation
IETF QUIC Working Group
RFC 9000 - QUIC Transport Protocol
RFC 9114 - HTTP/3
Cloudflare Learning Center - What is QUIC?
Testing Your Own Setup
To benchmark your VPN or tunnel performance:
# Test throughput with iperf3
iperf3 -c <server_ip> -t 30 -P 4

# Test latency
ping -c 100 <server_ip>

# Check for packet loss
mtr -c 100 <server_ip>
Compare results with and without your tunnel active to quantify the performance impact.