<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: sunchao dong</title>
    <description>The latest articles on DEV Community by sunchao dong (@sunchao_dong).</description>
    <link>https://dev.to/sunchao_dong</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3889759%2F4ded6627-9095-42c5-bfc1-13ec5e9908e8.png</url>
      <title>DEV Community: sunchao dong</title>
      <link>https://dev.to/sunchao_dong</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sunchao_dong"/>
    <language>en</language>
    <item>
      <title>I Froze a TCP Connection for 10 Minutes and Migrated It to Another Server</title>
      <dc:creator>sunchao dong</dc:creator>
      <pubDate>Tue, 21 Apr 2026 13:05:20 +0000</pubDate>
      <link>https://dev.to/sunchao_dong/i-froze-a-tcp-connection-for-10-minutes-and-migrated-it-to-another-server-31i8</link>
      <guid>https://dev.to/sunchao_dong/i-froze-a-tcp-connection-for-10-minutes-and-migrated-it-to-another-server-31i8</guid>
<description>&lt;p&gt;Spot instances are up to 80% cheaper than on-demand, but AWS kills them with a 2-minute warning.&lt;/p&gt;

&lt;p&gt;If you are running stateless web requests, that’s fine. But if you are running modern LLM reasoning workloads—where a single request can take minutes to process—a 2-minute warning is a death sentence. Losing a node means losing gigabytes of computed KV Cache and instantly snapping the client's connection. User experience goes to zero.&lt;/p&gt;

&lt;p&gt;I wanted to fix this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms62rspa3o8km0enthsr.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms62rspa3o8km0enthsr.gif" alt=" " width="1166" height="823"&gt;&lt;/a&gt;&lt;br&gt;
Demo recorded on localhost with ZENO_MOCK_GPU=1. Real cross-machine results (2× g4dn.xlarge, us-east-1, T4 16GB) in the benchmarks section.&lt;/p&gt;

&lt;h2&gt;
  The Core Concept
&lt;/h2&gt;

&lt;p&gt;At its core, a TCP connection’s identity isn't tied to a Linux process. It is just a state machine—about 80 bytes of sequence numbers, MSS, and window sizes living in the kernel.&lt;/p&gt;

&lt;p&gt;Tools like CRIU have done live migration for years, and CRIU even includes &lt;code&gt;libsoccr&lt;/code&gt; for socket-level checkpoint/restore. libccmc builds on the same &lt;code&gt;TCP_REPAIR&lt;/code&gt; foundation but adds the missing pieces specifically for cross-machine migration: an eBPF zero-window connection hold, a &lt;code&gt;TIOCOUTQ&lt;/code&gt; flush barrier, and PAWS timestamp continuity.&lt;/p&gt;
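
&lt;p&gt;To make those 80 bytes concrete, here is a minimal sketch of the raw kernel mechanism (the underlying syscalls, not libccmc's actual API): enter repair mode, then read the send and receive sequence numbers. It assumes a connected socket and &lt;code&gt;CAP_NET_ADMIN&lt;/code&gt;; error handling is trimmed.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/* extract.c -- read a live socket's identity via TCP_REPAIR.
 * Sketch only: requires CAP_NET_ADMIN; error handling trimmed. */
#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;sys/socket.h&amp;gt;
#include &amp;lt;netinet/in.h&amp;gt;
#include &amp;lt;netinet/tcp.h&amp;gt;  /* TCP_REPAIR, TCP_REPAIR_QUEUE, TCP_QUEUE_SEQ */

#ifndef TCP_RECV_QUEUE      /* queue ids live in linux/tcp.h */
#define TCP_RECV_QUEUE 1
#define TCP_SEND_QUEUE 2
#endif

/* fd is the ESTABLISHED TCP socket we want to extract. */
int export_tcp_state(int fd, uint32_t *snd_seq, uint32_t *rcv_seq)
{
    int on = 1, q;
    socklen_t len = sizeof(uint32_t);

    /* Repair mode silences the socket: the kernel emits no packets
     * and lets us read state that is normally immutable. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &amp;amp;on, sizeof(on)) &amp;lt; 0)
        return -1;

    q = TCP_SEND_QUEUE;
    setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &amp;amp;q, sizeof(q));
    getsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, snd_seq, &amp;amp;len);

    q = TCP_RECV_QUEUE;
    setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &amp;amp;q, sizeof(q));
    getsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, rcv_seq, &amp;amp;len);

    /* A full export also grabs TCP_REPAIR_WINDOW, the negotiated
     * MSS/window scale, and the timestamp offset for PAWS continuity. */
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;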

&lt;h2&gt;
  How It Works (The Mechanics)
&lt;/h2&gt;

&lt;p&gt;Here is the exact sequence to atomically extract a live socket without dropping a single byte (minimal code sketches of the key steps follow the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Drain the pipe: A TC egress program rewrites outgoing packets to advertise &lt;code&gt;Window=0&lt;/code&gt;. The client stops sending.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flush the buffer: We poll &lt;code&gt;TIOCOUTQ&lt;/code&gt; until it hits &lt;code&gt;0&lt;/code&gt;. This guarantees the client has ACKed every byte we've sent. No unacknowledged ghosts left behind.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Extraction: We use the Linux &lt;code&gt;TCP_REPAIR&lt;/code&gt; socket option to export the 80 bytes of state (Send/Recv sequences, Window, and crucially, Timestamp offsets to prevent PAWS drops on the new machine).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The eBPF Illusion: We &lt;code&gt;close()&lt;/code&gt; the socket on the source. Normally, the kernel would instantly fire an &lt;code&gt;RST&lt;/code&gt;. Instead, an eBPF XDP program intercepts incoming client probes and replies with valid &lt;code&gt;Window=0&lt;/code&gt; ACKs. The client's TCP stack starts its persist timer and settles into sending periodic zero-window probes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Resurrection: We send the 80 bytes to the target server, which uses &lt;code&gt;TCP_REPAIR&lt;/code&gt; to forge a new socket directly into the &lt;code&gt;ESTABLISHED&lt;/code&gt; state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VIP Drift: We reassign the VPC Private IP to the target ENI via cloud APIs. The client's next packet hits the new server.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
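
&lt;p&gt;Step 1 sounds exotic but is only a few lines of eBPF. Below is my deliberately simplified sketch of the window clamp (an illustration, not libccmc's actual program): IPv4 only, no IP options, and it clamps every flow, where the real program would match only the 4-tuple being migrated. The XDP responder from step 4 parses packets the same way but synthesizes &lt;code&gt;Window=0&lt;/code&gt; ACK replies via &lt;code&gt;XDP_TX&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/* zerowin_tc.c -- sketch of step 1: clamp the advertised TCP window
 * to 0 on egress so the client stops sending. Simplified: IPv4 only,
 * no IP options (ihl == 5), clamps every flow. */
#include &amp;lt;stddef.h&amp;gt;
#include &amp;lt;linux/bpf.h&amp;gt;
#include &amp;lt;linux/pkt_cls.h&amp;gt;
#include &amp;lt;linux/if_ether.h&amp;gt;
#include &amp;lt;linux/ip.h&amp;gt;
#include &amp;lt;linux/tcp.h&amp;gt;
#include &amp;lt;bpf/bpf_helpers.h&amp;gt;
#include &amp;lt;bpf/bpf_endian.h&amp;gt;

#define IP_OFF  sizeof(struct ethhdr)
#define TCP_OFF (IP_OFF + sizeof(struct iphdr))

SEC("tc")
int clamp_window(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb-&amp;gt;data;
    void *data_end = (void *)(long)skb-&amp;gt;data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) &amp;gt; data_end || eth-&amp;gt;h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *ip = data + IP_OFF;
    if ((void *)(ip + 1) &amp;gt; data_end || ip-&amp;gt;protocol != IPPROTO_TCP || ip-&amp;gt;ihl != 5)
        return TC_ACT_OK;

    struct tcphdr *tcp = data + TCP_OFF;
    if ((void *)(tcp + 1) &amp;gt; data_end)
        return TC_ACT_OK;

    __be16 old_win = tcp-&amp;gt;window;
    __be16 zero = 0;
    if (old_win == zero)
        return TC_ACT_OK;

    /* Patch the window field, fixing the TCP checksum incrementally. */
    bpf_l4_csum_replace(skb, TCP_OFF + offsetof(struct tcphdr, check),
                        old_win, zero, sizeof(zero));
    bpf_skb_store_bytes(skb, TCP_OFF + offsetof(struct tcphdr, window),
                        &amp;amp;zero, sizeof(zero), 0);
    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Compile with &lt;code&gt;clang -O2 -g -target bpf -c zerowin_tc.c -o zerowin_tc.o&lt;/code&gt;, then attach after &lt;code&gt;tc qdisc add dev eth0 clsact&lt;/code&gt; with &lt;code&gt;tc filter add dev eth0 egress bpf da obj zerowin_tc.o sec tc&lt;/code&gt;.&lt;/p&gt;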
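
&lt;p&gt;Step 2's flush barrier is a short polling loop (again a sketch, with a hypothetical helper name). &lt;code&gt;TIOCOUTQ&lt;/code&gt; on a TCP socket reports the bytes still queued: unsent plus sent-but-unacknowledged. Zero means the client has ACKed everything.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/* flush_barrier.c -- sketch of step 2: block until the client has
 * ACKed every byte we have written to the socket. */
#include &amp;lt;unistd.h&amp;gt;
#include &amp;lt;sys/ioctl.h&amp;gt;  /* TIOCOUTQ */

int wait_send_queue_drained(int fd)
{
    int queued;
    for (;;) {
        if (ioctl(fd, TIOCOUTQ, &amp;amp;queued) &amp;lt; 0)
            return -1;   /* fd is not a socket, or it went away */
        if (queued == 0)
            return 0;    /* flush barrier reached: nothing in flight */
        usleep(1000);    /* 1 ms poll; the window is already clamped */
    }
}
&lt;/code&gt;&lt;/pre&gt;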
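
&lt;p&gt;And the other half of the trick, step 5: while a socket is in repair mode, &lt;code&gt;connect()&lt;/code&gt; performs no handshake, so the target can forge a socket straight into &lt;code&gt;ESTABLISHED&lt;/code&gt;. Once more, this sketches the raw mechanism under the same assumptions as above, not the library's API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/* restore.c -- sketch of step 5: resurrect the connection on the
 * target host from the exported state. CAP_NET_ADMIN required;
 * error handling trimmed. Sequence numbers come from export_tcp_state(). */
#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;sys/socket.h&amp;gt;
#include &amp;lt;netinet/in.h&amp;gt;
#include &amp;lt;netinet/tcp.h&amp;gt;

#ifndef TCP_RECV_QUEUE
#define TCP_RECV_QUEUE 1    /* from linux/tcp.h */
#define TCP_SEND_QUEUE 2
#endif

int restore_socket(struct sockaddr_in *self, struct sockaddr_in *peer,
                   uint32_t snd_seq, uint32_t rcv_seq)
{
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    int on = 1, q;

    /* Repair mode must be entered before bind()/connect(). */
    setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &amp;amp;on, sizeof(on));
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &amp;amp;on, sizeof(on));
    setsockopt(fd, IPPROTO_IP, IP_FREEBIND, &amp;amp;on, sizeof(on)); /* VIP may not be local yet */
    bind(fd, (struct sockaddr *)self, sizeof(*self));

    /* Install the sequence numbers captured on the source. */
    q = TCP_SEND_QUEUE;
    setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &amp;amp;q, sizeof(q));
    setsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, &amp;amp;snd_seq, sizeof(snd_seq));
    q = TCP_RECV_QUEUE;
    setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &amp;amp;q, sizeof(q));
    setsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, &amp;amp;rcv_seq, sizeof(rcv_seq));

    /* In repair mode this performs no handshake: the socket lands
     * directly in ESTABLISHED. */
    connect(fd, (struct sockaddr *)peer, sizeof(*peer));

    /* The real flow also restores window state, MSS/window scale and
     * the PAWS timestamp offset before leaving repair mode. */
    on = 0;
    setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &amp;amp;on, sizeof(on));
    return fd;
}
&lt;/code&gt;&lt;/pre&gt;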

&lt;p&gt;&lt;strong&gt;Result: The client sees a ~200ms network hiccup. The data stream continues seamlessly.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  libccmc vs. CRIU
&lt;/h2&gt;

&lt;p&gt;CRIU is an incredible piece of engineering. But its primary workflow freezes the entire world—process, memory pages, file descriptors. CRIU was not designed for this specific scenario, especially when your process is holding a GPU lock or NVIDIA UVM memory (which is effectively always the case for LLM inference).&lt;/p&gt;

&lt;p&gt;While CRIU's &lt;code&gt;libsoccr&lt;/code&gt; handles local socket extraction, &lt;code&gt;libccmc&lt;/code&gt; is a scalpel built for the cross-machine case. It pairs socket extraction with active network-level client backpressure. You could extract a socket from one server process and restore it in a completely separate process on another machine, all while the client is safely parked in a zero-window wait.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;CRIU (Standard)&lt;/th&gt;
&lt;th&gt;libccmc&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Checkpoint Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~5 MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80 bytes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Host Dependency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tied to PID &amp;amp; Memory&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fully Decoupled&lt;/strong&gt; (process-agnostic)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Client Handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Passive Drop&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Zero-Window backpressure&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Part of larger CRIU framework&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Standalone C library&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  The Real-World Numbers
&lt;/h2&gt;

&lt;p&gt;I ran this across two AWS &lt;code&gt;g4dn.xlarge&lt;/code&gt; instances in &lt;code&gt;us-east-1&lt;/code&gt; (same VPC) using &lt;strong&gt;NVIDIA T4 (16GB) GPUs&lt;/strong&gt;, transferring an active vLLM SSE stream.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TCP State Export:&lt;/strong&gt; &amp;lt; 1 ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KV Cache D2H Transfer:&lt;/strong&gt; 7.5 ms (2.4 MB, TinyLlama 1.1B, pinned memory)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Plane Migration Time:&lt;/strong&gt; &amp;lt; 12 ms (Total time to freeze and transfer state + KV)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS IP Reassignment:&lt;/strong&gt; A few seconds (Dominated by cloud API latency, not the migration itself)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client Experience:&lt;/strong&gt; &lt;code&gt;curl&lt;/code&gt; saw exactly 0 &lt;code&gt;RST&lt;/code&gt;s. Tokens continued seamlessly after migration, zero gaps, zero duplicates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Survival Limit:&lt;/strong&gt; I stress-tested the eBPF Zero-Window illusion: the connection survived a full 10 minutes in a suspended state.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  Open Source &amp;amp; What's Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;libccmc&lt;/strong&gt; is open source (Apache 2.0 + GPL for eBPF components). It’s a standalone C library with zero GPU or vLLM dependencies. I've tested it with SSE streams; in principle, it should work with any long-lived stateful TCP connection (WebSockets, gRPC, etc.).&lt;/p&gt;

&lt;p&gt;👉 GitHub: &lt;a href="https://github.com/DongSunchao/libccmc" rel="noopener noreferrer"&gt;https://github.com/DongSunchao/libccmc&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Bigger Picture&lt;/strong&gt;: I built this library as the foundation for &lt;strong&gt;ZenoMigrate&lt;/strong&gt;. While &lt;code&gt;libccmc&lt;/code&gt; cleanly handles the 80-byte connection, migrating production LLMs requires orchestrating gigabytes of KV cache (which grows with both model size and context length). If you are running LLM inference on AWS Spot instances and want zero-disconnection migration with full GPU KV Cache preservation, I am currently building the end-to-end orchestration system.&lt;/p&gt;

&lt;p&gt;👉 Join the ZenoMigrate Waitlist here: &lt;a href="https://tally.so/r/Y5zKJd" rel="noopener noreferrer"&gt;https://tally.so/r/Y5zKJd&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
