Spot instances are up to 80% cheaper than on-demand, but AWS reclaims them with only a 2-minute warning.
If you are running stateless web requests, that’s fine. But if you are running modern LLM reasoning workloads—where a single request can take minutes to process—a 2-minute warning is a death sentence. Losing a node means losing gigabytes of computed KV Cache and instantly snapping the client's connection. User experience goes to zero.
I wanted to fix this.

*Demo recorded on localhost with `ZENO_MOCK_GPU=1`. Real cross-machine results (2× g4dn.xlarge, us-east-1, T4 16GB) are in the benchmarks section.*
## The Core Concept
At its core, a TCP connection's identity isn't tied to a Linux process. It is just a state machine—about 80 bytes of sequence numbers, MSS, and window sizes living in the kernel.
Tools like CRIU have done live migration for years, and CRIU even includes libsoccr for socket-level checkpoint/restore. libccmc builds on the same TCP_REPAIR foundation but adds the missing pieces specifically for cross-machine migration: an eBPF zero-window connection hold, a TIOCOUTQ flush barrier, and PAWS timestamp continuity.
## How it works (The Mechanics)
Here is the exact sequence to atomically extract a live socket without dropping a single byte:
1. **Drain the pipe:** A TC egress program rewrites outgoing packets to advertise `Window=0`. The client stops sending.
2. **Flush the buffer:** We poll `TIOCOUTQ` until it hits `0`. This guarantees the client has ACKed every byte we've sent. No unacknowledged ghosts left behind.
3. **The Extraction:** We use the Linux `TCP_REPAIR` socket option to export the ~80 bytes of state (send/receive sequences, window, and—crucially—timestamp offsets to prevent PAWS drops on the new machine).
4. **The eBPF Illusion:** We `close()` the socket on the source. Normally, the kernel would instantly fire an `RST`. Instead, an eBPF XDP program intercepts incoming client probes and replies with valid `Window=0` ACKs. The client's TCP stack enters its persist timer.
5. **The Resurrection:** We send the 80 bytes to the target server. It calls `TCP_REPAIR` to forge a new socket directly into the `ESTABLISHED` state.
6. **VIP Drift:** We reassign the VPC private IP to the target ENI via cloud APIs. The client's next packet hits the new server.
Result: The client sees a ~200ms network hiccup. The data stream continues seamlessly.
## libccmc vs. CRIU
CRIU is an incredible piece of engineering, but its primary workflow freezes the entire world: process, memory pages, file descriptors. It was not designed for this specific scenario, especially when your process is holding a GPU lock or NVIDIA UVM memory (which is always true for LLM inference).
While CRIU's libsoccr handles local socket extraction, libccmc acts as a complete cross-machine scalpel. It pairs socket extraction with active network-level client backpressure. You could extract a socket from one server process and restore it in a completely separate process on another machine—all while the client is securely held in a zero-window wait.
| Feature | CRIU (Standard) | libccmc |
|---|---|---|
| Checkpoint Size | ~5 MB | 80 bytes |
| Host Dependency | Tied to PID & Memory | Fully Decoupled (Agnostic) |
| Client Handling | Passive Drop | Zero-Window backpressure |
| Infrastructure | Part of larger CRIU framework | Standalone C library |
## The Real-World Numbers
I ran this across two AWS g4dn.xlarge instances in us-east-1 (same VPC) using NVIDIA T4 (16GB) GPUs, transferring an active vLLM SSE stream.
- TCP State Export: < 1 ms
- KV Cache D2H Transfer: 7.5 ms (2.4 MB, TinyLlama 1.1B, pinned memory)
- Data Plane Migration Time: < 12 ms (Total time to freeze and transfer state + KV)
- AWS IP Reassignment: A few seconds (Dominated by cloud API latency, not the migration itself)
- Client Experience: `curl` saw exactly 0 `RST`s. Tokens continued seamlessly after migration: zero gaps, zero duplicates.
- Survival Limit: I stress-tested the eBPF zero-window illusion; the connection survived for 10 minutes in a suspended state.
## Open Source & What's Next
libccmc is open source (Apache 2.0 + GPL for eBPF components). It’s a standalone C library with zero GPU or vLLM dependencies. I've tested it with SSE streams; in principle, it should work with any long-lived stateful TCP connection (WebSockets, gRPC, etc.).
👉 GitHub: https://github.com/DongSunchao/libccmc
The Bigger Picture: I built this library as the foundation for ZenoMigrate. While libccmc cleanly handles the 80-byte connection, migrating production LLMs requires orchestrating gigabytes of KV cache (which scales linearly with model size). If you are running LLM inference on AWS Spot instances and want zero-disconnection migration with full GPU KV Cache preservation, I am currently building the end-to-end orchestration system.
👉 Join the ZenoMigrate Waitlist here: https://tally.so/r/Y5zKJd