Ethan Rodrigo

Posted on Jun 26

Teleporting Virtual Machines: Flipping the Script

#virtualmachine #cloud #computerscience

In the previous post in this series, Teleporting Servers, we examined how to move a live virtual machine between physical hosts using the pre-copy method. Pre-copy transfers the VM’s memory in multiple rounds while the source keeps running. When the number of dirtied pages falls below a configurable writable working set (WWS) threshold—or a preset maximum number of iterations is reached—the VM is suspended and its CPU state plus any remaining dirty pages are sent to the target host; that final transfer is the service downtime phase.

The Pre-Copy Bottleneck

We were doing fine with pre-copy, right? So why do we need a new method? Well, pre-copy can take seconds, and in the virtual world a second is an eternity. Let's see why pre-copy takes time.

What stops the iterations of memory transfer? Right, the WWS threshold. But what if the smallest WWS the VM ever reaches is still large? Then the final stop-and-copy phase has a lot of dirty pages to send, so the downtime is high and the applications inside the VM stay paused for longer. This means pre-copy works well only when the VM's workload is read-intensive (few pages are dirtied). When the workload is write-intensive, the downtime is higher and application performance suffers with it.
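To make the bottleneck concrete, here is a minimal sketch of the pre-copy loop as a toy simulation. It is not taken from the paper: the function name, the parameters, and the assumption that the VM dirties pages at a constant rate are all illustrative.

```python
def precopy_rounds(total_pages, dirty_rate, bandwidth, wws_threshold, max_rounds=30):
    """Toy model of pre-copy (illustrative, not a real hypervisor API).

    dirty_rate and bandwidth are in pages per second.
    Returns (rounds, pages_left_for_downtime, pages_sent_while_running).
    """
    to_send = total_pages
    total_sent = 0
    rounds = 0
    while to_send > wws_threshold and rounds < max_rounds:
        total_sent += to_send
        # Pages dirtied while this round was on the wire must be resent.
        # Integer arithmetic: round_time = to_send / bandwidth seconds.
        to_send = min(total_pages, to_send * dirty_rate // bandwidth)
        rounds += 1
    return rounds, to_send, total_sent


# Read-heavy VM: the dirty set shrinks quickly.
print(precopy_rounds(100_000, 1_000, 10_000, 50))   # (4, 10, 111100)
# Write-heavy VM: the dirty set barely shrinks, so the loop hits max_rounds.
print(precopy_rounds(100_000, 9_000, 10_000, 50))
```

In the write-heavy case the loop only ends because of the iteration cap, and whatever is still dirty at that point has to be sent while the VM is suspended.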

The Post-Copy Flip

Take the analogy from the previous post: moving to a new apartment. What if you take only your essentials on the first trip and start living in the new apartment right away? Then you can move your bigger items later and hand the key back to the owner.

Now apply that to VM migration. First you transmit the VM's processor state to the target and start the VM there. Then the source actively pushes the VM's memory pages to the target. Meanwhile, if the VM faults on a page that has not arrived yet, that page is fetched over the network from the source.

The Four Pillars of Post-Copy

Under the hood, post-copy uses four techniques to make the migration efficient.

Demand Paging - When the VM faults on a page that has not arrived yet, the page is requested from the source over the network.
Active Push - The source sends pages to the target even without a page fault, so that the target stops depending on the source as soon as possible.
Prepaging - A forecasting technique that uses the page access pattern to fetch the pages the VM is likely to need before it faults on them.
Dynamic Self-Ballooning (DSB) - Why send free pages over when you can just drop them? DSB takes care of that.
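How demand paging and active push interact at the target can be sketched with a small model. The class and method names below are hypothetical, and memory is just a dictionary from page number to contents; a real implementation works on page frames inside the hypervisor or guest kernel.

```python
class PostCopyTarget:
    """Toy model of the target host during post-copy (names are illustrative)."""

    def __init__(self, source_memory):
        self.source = source_memory   # page number -> contents, still on the source
        self.local = {}               # pages that have arrived at the target
        self.network_faults = 0

    def receive_push(self, page):
        # Active push: the source sends a page without being asked.
        if page not in self.local:
            self.local[page] = self.source[page]

    def read(self, page):
        # Demand paging: a missing page is fetched from the source on access.
        if page not in self.local:
            self.network_faults += 1
            self.local[page] = self.source[page]
        return self.local[page]


target = PostCopyTarget({0: "a", 1: "b", 2: "c"})
target.receive_push(0)
target.read(0)   # already pushed, no network fault
target.read(2)   # not yet pushed, costs one network fault
print(target.network_faults)   # 1
```

Every page crosses the network at most once, whichever path delivers it first.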

Predicting the Future with Bubbles

As mentioned above, prepaging forecasts which pages the VM will fault on at the target. Let's see what happens behind the scenes.

The goal of prepaging is to make pages available at the target before the running VM faults on them. Its effectiveness is measured by the percentage of page faults that require an explicit page request to be sent over the network to the source. The smaller the percentage, the better the prepaging algorithm.
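As a small worked example of that metric, the helper below computes the share of faults that had to go over the network. The function name and the sample numbers are made up for illustration.

```python
def network_fault_percentage(total_faults, network_faults):
    """Percentage of page faults that needed a request to the source."""
    if total_faults == 0:
        return 0.0
    return 100.0 * network_faults / total_faults


# Hypothetical run: 200 faults at the target, 42 of them went to the source.
print(network_fault_percentage(200, 42))   # 21.0
```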

But how do we decide which pages will be needed in the future? Computer programs don't usually access memory completely at random. If a program reads data at memory address 100, there is a very high probability that it will soon need addresses 101, 102, and 103. This is called "spatial locality."

There are two methods for deciding which pages to send next.

Bubbling with a Single Pivot

What happens when you throw a rock into a pond? Ripples spread outwards in circles. Now compare that with a network page fault. Initially the pages are sent in one direction (0, 1, 2, ...), with the pivot at page 0. When a network page fault occurs, the pivot moves to the faulted page. From there, pages are sent in both directions, forward and backward. Assume the fault is on page 50. The pivot becomes 50, and pages are sent backward (49, 48, 47, ...) and forward (51, 52, 53, ...). Whenever a page that has already been transferred is encountered, it is skipped, ensuring each page is transmitted only once.
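The single-pivot idea can be sketched as follows. This is an illustrative model, not the paper's code: the class name is hypothetical, pages are plain integers, and the forward and backward edges simply take turns.

```python
class SinglePivotBubble:
    """Toy model of bubbling around one pivot (illustrative names)."""

    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.sent = set()
        self.fault(0)   # start with the pivot at page 0

    def fault(self, page):
        # A network fault moves the pivot; both edges restart from there.
        self.forward = page
        self.backward = page - 1
        self.go_forward = True

    def next_page(self):
        """Return the next page to push, or None when everything was sent."""
        # Skip pages that were already transferred.
        while self.forward < self.num_pages and self.forward in self.sent:
            self.forward += 1
        while self.backward >= 0 and self.backward in self.sent:
            self.backward -= 1
        can_forward = self.forward < self.num_pages
        can_backward = self.backward >= 0
        if not (can_forward or can_backward):
            return None
        if can_forward and (self.go_forward or not can_backward):
            page = self.forward
        else:
            page = self.backward
        self.go_forward = not self.go_forward   # alternate directions
        self.sent.add(page)
        return page


bubble = SinglePivotBubble(10)
order = [bubble.next_page() for _ in range(3)]   # plain forward push from page 0
bubble.fault(6)                                  # the VM faults on page 6
order += list(iter(bubble.next_page, None))      # bubble grows around 6
print(order)   # [0, 1, 2, 6, 5, 7, 4, 8, 3, 9]
```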

Bubbling with Multiple Pivots

Now imagine the VM has multiple processes running: the migrated VM would fault on pages at multiple locations. So there need to be multiple pivots, producing multiple bubbles. Each bubble expands around an independent pivot. If one edge of a bubble reaches a page that has already been transmitted, that edge stops. As for efficiency, the authors found that limiting the number of pivots to around 7 works well. So whenever a new pivot appears after the limit has been hit, the new pivot replaces an old one.

As for the direction of bubble growth, they found that forward expansion is essential, backward-only expansion is counterproductive, and bi-directional expansion performs well.
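A rough sketch of the multi-pivot variant is below. The class name and the round-robin scheduling are assumptions for illustration, and so is the eviction policy: here the oldest pivot is dropped when the limit is reached, which the text above does not spell out.

```python
from collections import deque


class MultiPivotBubbles:
    """Toy model of bubbling around several pivots (illustrative names)."""

    def __init__(self, num_pages, max_pivots=7):
        self.num_pages = num_pages
        self.sent = set()
        # Each bubble is [forward_edge, backward_edge]; None marks a stopped edge.
        # deque(maxlen=...) silently drops the oldest bubble when full (assumed policy).
        self.bubbles = deque(maxlen=max_pivots)

    def fault(self, page):
        self.bubbles.append([page, page - 1])

    def next_pages(self):
        """One round-robin pass: every live edge pushes at most one page."""
        batch = []
        for bubble in self.bubbles:
            for edge, step in ((0, 1), (1, -1)):
                page = bubble[edge]
                if page is None:
                    continue
                if 0 <= page < self.num_pages and page not in self.sent:
                    self.sent.add(page)
                    batch.append(page)
                    bubble[edge] = page + step
                else:
                    # The edge met an already-sent page or the end of memory.
                    bubble[edge] = None
        return batch


bubbles = MultiPivotBubbles(20, max_pivots=2)
bubbles.fault(5)
bubbles.fault(12)
print(bubbles.next_pages())   # [5, 4, 12, 11]
print(bubbles.next_pages())   # [6, 3, 13, 10]
```

Once every edge has stopped, a real implementation would still need to push whatever pages remain; the sketch leaves that fallback out.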

What to Send First?

Generally, Linux maintains two linked lists in which pages are kept in Least Recently Used (LRU) order: one for active pages and one for inactive pages. A kernel daemon periodically moves pages between the two lists. Later on, pages on the inactive list are the ones swapped out of RAM to the swap device. This ordering is quite helpful in a post-copy implementation when deciding which pages to send first.

But Linux is lazy. If there's enough memory and no swap device, Linux just sits there and leaves the lists unsorted. And because the migration's pseudo-paging device is turned on only at the last moment before the migration, there's no time to organize the lists, so the migration algorithm has no idea which pages are active and which are inactive.

Therefore the developers have implemented a kernel thread which runs in the background long before the migration starts.

Each time an application touches a page, it flips a tiny switch on the page called the "Referenced bit".
As the thread walks through memory, if it sees a page with the Referenced bit turned on, it clears the bit and moves the page to the top of the active list.
If a page sits there for a long time without its Referenced bit being turned on, it slides down into the inactive list.
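The three steps above can be modelled in a few lines. This is only a sketch with assumed names and data structures: the lists are plain Python lists with the hottest page first, the Referenced bits are a set of page numbers, and "a long time" is approximated by a count of idle scans.

```python
def scan(active, inactive, referenced, idle, max_idle=2):
    """One pass of a hypothetical scanning thread over both lists.

    active / inactive: lists of page numbers, hottest first.
    referenced: set of pages touched since the last scan (the Referenced bits).
    idle: dict mapping page -> consecutive scans without a reference.
    """
    for page in list(active) + list(inactive):
        if page in referenced:
            referenced.discard(page)           # clear the bit
            idle[page] = 0
            if page in active:
                active.remove(page)
            else:
                inactive.remove(page)
            active.insert(0, page)             # move to the top of the active list
        elif page in active:
            idle[page] = idle.get(page, 0) + 1
            if idle[page] >= max_idle:         # untouched for too long
                active.remove(page)
                inactive.insert(0, page)


active, inactive, idle = [1, 2, 3], [4], {}
scan(active, inactive, {3, 4}, idle)   # pages 3 and 4 were touched
print(active, inactive)                # [4, 3, 1, 2] []
scan(active, inactive, set(), idle)    # nothing touched since the last scan
print(active, inactive)                # [4, 3] [2, 1]
```

With the lists kept in this order, a migration could push the active list before the inactive one.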

Setting the Trap: Catching Page Faults

There are three ways for the demand-paging component of post-copy to trap page faults.

Shadow Paging

The hypervisor keeps a read-only page table for each VM that maps its pseudo-physical pages to physical page frames. If the VM tries to read or write a page that hasn't arrived yet, the hypervisor catches the violation, pauses the VM, fetches the page, and lets the VM continue.

Page Tracking

During downtime, all pages of the target VM are marked as not present in their page table entries (PTEs). When the VM wakes up, every access to such a page raises a fault, which custom software intercepts. This needs a lot of hacking inside the guest OS.

Pseudo-paging

As soon as migration starts, the memory pages of the migrating VM at the source are swapped out to an in-memory pseudo-paging device that resides in the guest kernel. Then the CPU state and the non-pageable memory are transferred to the target during downtime. But there's a catch: if the OS detects that even a single page of its physical memory is missing, it panics (a kernel panic) and crashes. The solution? A heist.

If you found out there was a priceless diamond in a capital museum, how would you steal it without tripping the alarms? The easiest way is replacing the original with a replica of the exact same weight.

The same goes for pseudo-paging. If the OS won't let us take the memory, we just replace the original data pages with empty pages. This is called the MFN Exchange. Here's how it works:

Before the migration starts, the hypervisor goes to its reserves, gathers a massive pile of completely empty, useless physical memory (the sandbags), and temporarily doubles the VM's memory reserve. Next, all the running applications are temporarily frozen so they stop writing new data.

The guest OS is then instructed to swap out its memory. As it does this, the hypervisor intercepts the pointers. It takes the VM's internal page frame numbers (PFNs) and reconnects them to the empty, useless machine frames (MFNs). The VM is satisfied. Meanwhile, the hypervisor takes the real machine frames holding the actual data (the diamonds) and quietly hands them over to Domain 0 to be beamed to the new server.

With the memory safely stolen, the whole VM is suspended for just a few milliseconds and its "brain" is sent over to the target. Once awake on the new server, if the OS hits one of those empty replica pages, it throws a "page fault." A third-party software driver (the MemX client) intercepts this error and immediately pulls the missing data across the network from the old server's Domain 0.
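The swap at the heart of the heist can be pictured with a tiny model of the PFN-to-MFN table. Everything here is illustrative: the function name, the dictionary-based table, and the frame numbers are assumptions, not the hypervisor's real interface.

```python
def mfn_exchange(p2m, spare_mfns, pageable_pfns):
    """Swap the frame behind each pageable PFN for an empty one.

    p2m: dict mapping guest PFN -> machine frame (MFN) currently backing it.
    spare_mfns: list of empty frames reserved by the hypervisor.
    Returns a dict PFN -> original MFN, now free to be sent to the target.
    """
    handed_over = {}
    for pfn in pageable_pfns:
        handed_over[pfn] = p2m[pfn]     # the frame with the real data (the diamond)
        p2m[pfn] = spare_mfns.pop()     # an empty frame of the same size (the replica)
    return handed_over


p2m = {0: 100, 1: 101, 2: 102}           # PFN 0 stands in for non-pageable memory
stolen = mfn_exchange(p2m, [900, 901], [1, 2])
print(stolen)   # {1: 101, 2: 102}
print(p2m)      # {0: 100, 1: 901, 2: 900}
```

The guest still sees a frame behind every PFN, so nothing looks missing, while the frames holding the data are out of its hands.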

Handle the Free Memory

Transferring a large number of free pages wastes resources and increases the total migration time regardless of the migration algorithm you use. Also, in post-copy, if the migrated VM asks for a brand new empty page, there will be a page fault and an empty page will be fetched from the source. That is wasted time, because the page is overwritten as soon as it arrives anyway.

A technique called ballooning is used to resize the memory allocation of a VM. Usually there is a balloon driver in the guest kernel. It can either claim free memory from the guest and hand it to the hypervisor (inflate the balloon), or request pages from the hypervisor and return them to the guest (deflate the balloon).

Dynamic Self-Ballooning

That mechanism is used to avoid transmitting free pages during both pre-copy and post-copy migration. The VM performs ballooning continuously over its execution lifetime, and this is called Dynamic Self-Ballooning (DSB).

DSB has three components:

Inflate the balloon - The VM has a kernel-level DSB thread that allocates as much free memory as possible and hands it over to the hypervisor.
Detect memory pressure - Memory pressure means some entity needs to access a free page.
Deflate the balloon - In response to memory pressure the balloon must be partially deflated, i.e. the reverse of inflating. The DSB thread reclaims free memory from the hypervisor and then releases it to the guest kernel.

How to detect memory pressure?

Imagine you're the manager of a renowned restaurant. There are 100 tables, and a Reserved sign is put on 95 of them; 5 are left free so that walk-in guests can be seated. Now a massive VIP party walks in, and you panic and shout, "Get 20 tables and seats for the VIPs!"

The same applies when detecting memory pressure. The DSB process (the manager) takes up to 95% of the available free memory (inflates the balloon); taking 100% would trigger an out-of-memory condition. Then if an application (the VIPs) asks for memory and there isn't enough free memory, a panic is triggered. That panic is "memory pressure". In Linux, when this happens the kernel sounds an alarm by calling a shrinker function. It's like yelling "Hey, if anyone has unused cache, throw it away". The DSB periodically repeats this process to make sure the OS runs smoothly.

The developers of post-copy implemented the balloon driver so that it listens for this alarm, which lets it deflate as needed so the applications keep running smoothly.
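The restaurant story maps onto a small model like the one below. It is a sketch under stated assumptions: the class and method names are invented, memory is counted in whole pages, and the pressure callback stands in for the kernel's shrinker hook rather than reproducing it.

```python
class SelfBalloon:
    """Toy model of Dynamic Self-Ballooning (all names are illustrative)."""

    def __init__(self, free_pages, reserve_percent=5):
        self.free_pages = free_pages      # free pages visible to the guest
        self.balloon = 0                  # pages handed to the hypervisor
        self.reserve_percent = reserve_percent

    def inflate(self):
        # Take all free memory except a small reserve (95% by default).
        take = self.free_pages * (100 - self.reserve_percent) // 100
        self.free_pages -= take
        self.balloon += take
        return take

    def on_memory_pressure(self, pages_needed):
        # Shrinker-style callback: deflate just enough to cover the shortfall.
        shortfall = max(0, pages_needed - self.free_pages)
        give_back = min(shortfall, self.balloon)
        self.balloon -= give_back
        self.free_pages += give_back
        return give_back

    def allocate(self, pages):
        if pages > self.free_pages:
            self.on_memory_pressure(pages)
        if pages > self.free_pages:
            raise MemoryError("not enough memory even after deflating")
        self.free_pages -= pages


guest = SelfBalloon(free_pages=100)
guest.inflate()        # 95 tables reserved, 5 left free
guest.allocate(20)     # the VIP party arrives
print(guest.balloon, guest.free_pages)   # 80 0
```

The pages sitting in the balloon are the ones a migration never has to transfer.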

The Reality Check: Did it Actually Work?

The developers didn't just build this in theory; they put their Post-Copy prototype through the wringer with some heavy, real-world server applications. Here is the final scorecard:

The Triumphs (Where Post-Copy Wins):

The Bandwidth Savior: Because Post-Copy guarantees every page is transferred exactly once, it absolutely crushes Pre-Copy on network efficiency for write-heavy workloads. No more endless loops of re-sending dirtied data.
Bubbling is Magic: The multi-directional "Bubbling" algorithm, combined with the LRU (Least Recently Used) list sorting, worked beautifully. It predicted the VM's needs so well that it eliminated a massive chunk of the sluggish network page faults.
DSB is a Cheat Code: Dynamic Self-Ballooning drastically reduced the total amount of memory that needed to be transferred, speeding up the entire migration process from start to finish.

The Catch (Where it Stumbled):

The Downtime Spike: In a perfect world, Post-Copy downtime is near zero. However, because their specific "pseudo-paging" hack couldn't swap out the core, protected kernel memory, they had to pause the system to send that chunk over manually. This resulted in a slightly higher downtime than they wanted.
Read-Heavy Workloads: If your server is mostly just reading data (like a static web server), good old-fashioned Pre-Copy is still the reigning champion for overall speed.

The Final Verdict: Which should you choose?

There is no single silver bullet. They are different tools for different jobs.

If your server is a calm, read-intensive machine, use Pre-Copy. It's safe, reliable, and keeps downtime incredibly low.

But if your server is a chaotic, write-intensive beast—like a massive database actively changing gigabytes of memory every second—Post-Copy is the ultimate getaway vehicle.

Conclusion

We discussed how post-copy can be a complementary tool for pre-copy, and how it was implemented.

Note: This blog post is a summary of the research paper Post-Copy Live Migration of Virtual Machines, and the sole purpose of the post is to summarize what I have learnt by reading this paper.
