Say you want to upgrade your laptop's RAM. What's the first thing you do once you have the new RAM stick in your hand? Turn off the laptop, install the new RAM, and restart. Pretty easy, right? Now imagine you're a systems engineer at AWS. You are asked to upgrade the RAM of a physical machine, but it has virtual machines running on it. What would you do? If you stop the server even for a second, the users will notice immediately. How do you upgrade the RAM without shutting down the services?
The Zero-Downtime Dilemma
One thing you can do is migrate the processes running on those virtual machines to new ones. But the problem is that the old virtual machine may still be needed to serve some network and system calls for the processes that moved. Thus you can't shut down the old machine easily.
What if you could take a virtual machine from one physical host and “teleport” it to another—without anyone noticing?
That's not magic. It's called live migration. With live migration, users won't run into any issues (or even know) when you replace the host of their virtual world. And system engineers no longer have to worry about long migration times.
How Do We Live Migrate VMs?
But how is that possible? First, let's consider the two parties we have to keep happy. The end user, who should not see any downtime. And the operator of the data center, who wants the shortest possible total migration time. If both goals are achieved, everyone is happy.
To run the VM on a new machine, several things must be copied from the original. The most important one is memory, because it represents the VM's entire working state. Therefore first let's see three ways the memory can be transferred.
Push: Push the memory pages across the network while the source Virtual Machine (VM) is still running.
Stop-and-copy: Stop the source VM, copy the pages across to the new VM, then start the new VM.
Pull: Start the new VM before all of its memory has arrived. If the new VM accesses a page that hasn't been copied yet, it takes a page fault and the page is pulled across the network from the source VM.
None of these approaches works well on its own. For example, if the old VM is stopped completely before the new one starts (pure stop-and-copy), both the downtime for the users and the total migration time are high, since each grows with the amount of memory to copy. On the other hand, if the new VM is started first and pages are pulled as needed (pure pull), the downtime is short but the total migration time is high.
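To put a rough number on the stop-and-copy case, here is a back-of-the-envelope sketch. The function name and the figures (a 4 GiB guest and a link that moves 1 GiB per second) are assumptions chosen for illustration, not measurements.

```python
GIB = 1024 ** 3


def stop_and_copy_downtime_s(memory_bytes: int, bytes_per_s: float) -> float:
    """With pure stop-and-copy the VM is paused for the whole transfer,
    so downtime equals the time needed to ship all of its memory."""
    return memory_bytes / bytes_per_s


# Assumed example: 4 GiB of guest memory over a link moving 1 GiB/s.
downtime_s = stop_and_copy_downtime_s(4 * GIB, 1 * GIB)
print(f"VM paused for about {downtime_s:.1f} s")
```

A pause of several seconds is exactly the kind of outage users would notice.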
The Winning Approach
The solution is to combine them into pre-copy: a number of iterative push phases followed by a very short stop-and-copy phase. The memory pages are sent over the network iteratively, and after that the VM on the old host is suspended and its processor state is sent to the new host so that the VM can be started there.
The 5 Phases of Live Migration
- Step 01: Reservation
A request is issued to migrate an OS from host A to host B. A calls B and says, "Hey, here's something for you". Host B confirms that the necessary resources are available and reserves a VM container of that size. "Yeah, I got it, send it over".
NOTE: Failure to secure resources means the VM simply continues to run on A.
- Step 02: Iterative Pre-Copy
In the first iteration, all the pages are transferred to B. Subsequent iterations copy only the pages that were dirtied (modified) during the previous transfer.
- Step 03: Stop-and-Copy
Suspend the running OS instance at A and redirect its network traffic to B. Then the CPU state and any remaining inconsistent memory pages are transferred to B. Now both A and B have suspended copies of the VM. A's copy is still the primary and will be resumed in case of failure.
- Step 04: Commitment
Host B indicates to A that it has successfully received the OS image. Host A acknowledges this and may now discard the original VM. Host B becomes the primary host.
- Step 05: Activation
The migrated VM on B is now activated. Post-migration code runs to reattach device drivers to the new machine and advertise the moved IP address.
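The key property of these steps is that the VM always has exactly one primary copy, and that copy stays on A until commitment completes. A minimal sketch of that rule follows. The `Phase` enum, the `migrate` function and the `step_succeeds` callback are hypothetical names invented for this illustration.

```python
from enum import Enum, auto


class Phase(Enum):
    RESERVATION = auto()
    PRE_COPY = auto()
    STOP_AND_COPY = auto()
    COMMITMENT = auto()
    ACTIVATION = auto()


def migrate(step_succeeds):
    """Walk the phases in order.

    `step_succeeds(phase)` is a hypothetical callback that reports whether
    a phase worked. Returns (host holding the primary copy, phases completed).
    """
    completed = []
    for phase in Phase:  # Enum members iterate in definition order
        if not step_succeeds(phase):
            # Until commitment has completed, A still holds the primary copy.
            owner = "B" if Phase.COMMITMENT in completed else "A"
            return owner, completed
        completed.append(phase)
    return "B", completed


# Example: B has no room for the VM, so nothing moves.
owner, done = migrate(lambda phase: phase is not Phase.RESERVATION)
print(owner, done)
```

In this sketch a failed reservation leaves the VM running on A with no phases completed, which matches the note under Step 01.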
The Bottleneck: WWS
Great, so we can easily move VMs anywhere we want? Not really. It's too good to be true. There's a catch.
Writable Working Set
Consider this analogy. You're moving to a new apartment over the weekend. You have bought the apartment, and all you have to do is move your things from the old apartment to the new one. But the problem is that you can't move everything at once, it may take a few days, and you can't put your routine on hold: you still need to brush your teeth, wash, and eat.
In the virtual world, the routine tasks are the ones that dirty pages rapidly. Even after its memory has been copied, the VM continues running, which means new changes are constantly being made. The pages that keep getting modified this way are called the writable working set (WWS).
The "Moving"
Let's say for the first round you moved the furniture and other big items you don't use daily into your new apartment. And for the second round you move your clothes and your sports items. Then for the last round you can take all the utensils you use daily and put them in your car and give the keys to the owner and move yourself to the new apartment. There you can unpack those utensils again and live normally.
The same goes for VMs: the first rounds send the memory that doesn't change rapidly. Then, once the algorithm decides the remaining WWS is small enough, the VM is frozen and the rest is moved (stop-and-copy) to the new host.
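The shrinking rounds can be simulated with a toy model. This is a sketch under simplifying assumptions: the VM dirties pages at a constant rate, the set of pages it keeps rewriting is capped at a fixed WWS size, and every name and number below is made up for illustration.

```python
def simulate_precopy(total_pages, link_pages_per_s, dirty_pages_per_s,
                     wws_pages, stop_threshold_pages, max_rounds=30):
    """Return (pages sent in each pre-copy round, pages left for stop-and-copy)."""
    to_send = total_pages  # round 1 ships everything
    sent_per_round = []
    for _ in range(max_rounds):
        sent_per_round.append(to_send)
        # Pages written while this round was on the wire (integer maths).
        # The VM only keeps rewriting its hot set, so cap at the WWS size.
        dirtied = min(dirty_pages_per_s * to_send // link_pages_per_s, wws_pages)
        to_send = dirtied
        if to_send <= stop_threshold_pages:
            break  # small enough: freeze the VM and do the final copy
    return sent_per_round, to_send


rounds, final_pages = simulate_precopy(
    total_pages=100_000, link_pages_per_s=10_000,
    dirty_pages_per_s=2_000, wws_pages=5_000, stop_threshold_pages=600)
print(rounds, final_pages)  # [100000, 5000, 1000] 200
```

With these assumed numbers only 200 pages are left for the frozen phase, compared with 100,000 pages if the VM had been stopped up front.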
Knowing When to Freeze: The Stop Triggers
In practice, systems rely on dynamic heuristics to decide when to stop.
The "good enough" rule: Once a round is completed, the software asks itself, "How long will it take to send the remaining pages?" If the answer is a couple of milliseconds, it just freezes the VM and does the final copy.
Hitting the WWS floor: Let's say the first round copied 413 MB, the second 112 MB, the third 15 MB, the fourth 14 MB and the fifth 15 MB. The system knows it has hit the WWS floor because the dirtied memory isn't shrinking anymore. So it stops iterating and initiates the final copy.
Setting a hard limit: What if there's a process that never stops writing to memory? For that case the developers have put a cap on how many rounds the system will tolerate without making progress. Once the cap is reached, it aggressively slows down the rogue process and forces the migration to finish.
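The three triggers can be folded into one decision function. The following is a sketch only: the function name, the 50 ms downtime budget, the 0.9 "not shrinking" ratio and the round cap are assumptions picked for illustration, not values taken from any real hypervisor.

```python
def stop_reason(pending_history_mb, link_mb_per_s,
                max_downtime_s=0.05, floor_ratio=0.9, max_rounds=30):
    """Decide whether to end pre-copy.

    `pending_history_mb` holds the size of the dirty set measured at the
    end of each round so far. Returns a reason string, or None to keep going.
    """
    remaining = pending_history_mb[-1]
    # 1. Good enough: the final copy would fit inside the downtime budget.
    if remaining / link_mb_per_s <= max_downtime_s:
        return "good_enough"
    # 2. WWS floor: the dirty set has stopped shrinking between rounds.
    if len(pending_history_mb) >= 2 and remaining >= floor_ratio * pending_history_mb[-2]:
        return "wws_floor"
    # 3. Hard limit: too many rounds, force the migration to finish.
    if len(pending_history_mb) >= max_rounds:
        return "hard_limit"
    return None


# The sizes from the example above, on an assumed 100 MB/s link.
print(stop_reason([413, 112, 15], 100))      # None, still shrinking fast
print(stop_reason([413, 112, 15, 14], 100))  # wws_floor
```

Checking the cheapest exit first means a migration that is already good enough never waits for the floor or the cap to kick in.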
Under the Hood: Two Ways to Build It
There are two ways migration can be implemented. They differ in how the dirty pages are tracked and how the final freeze is handled.
Managed Migration
Managed migration is performed by migration daemons running in the management VMs of the source and destination hosts. (A management VM is a special virtual machine on a physical host that is used to administer and control the other virtual machines.)
Setting the Trap: Shadow Page Tables
As mentioned in the migration steps, the copying is done in rounds. The first round copies the whole memory, and each subsequent round copies only the pages dirtied during the previous round. To keep track of the pages dirtied during a round, the virtual machine monitor (VMM, the layer that creates and runs the virtual machines) maintains a dirty bitmap that starts out empty at the beginning of each round.
The VMM uses shadow page tables, which are populated from the guest OS's page tables (the maps from the guest's virtual addresses to physical memory pages), with every page table entry marked read-only. If the guest OS tries to write to a page, a page fault occurs. The VMM intercepts the fault, checks whether the guest OS actually has permission to write to that page, and if so, logs the page in the dirty bitmap and allows the guest OS to write and continue working.
Once a round is over, the bitmap is cleared and the shadow page table is destroyed and recreated.
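The trap-and-log cycle can be modelled in a few lines. This is a toy simulation, not hypervisor code: `DirtyTracker` and its methods are hypothetical names, and real shadow page tables work on hardware page table entries rather than Python sets.

```python
class DirtyTracker:
    """Toy model of dirty logging with write-protected shadow entries."""

    def __init__(self, num_pages, guest_writable):
        self.num_pages = num_pages
        # Pages the guest's own page table allows it to write.
        self.guest_writable = set(guest_writable)
        self.start_round()

    def start_round(self):
        """Clear the bitmap and rebuild the shadow table as all read-only."""
        self.dirty = [False] * self.num_pages
        self.shadow_writable = set()

    def write(self, page):
        """Simulate a guest write to `page`."""
        if page in self.shadow_writable:
            return  # already writable this round: no fault, nothing to log
        # The write faults and the VMM steps in.
        if page not in self.guest_writable:
            # A genuine protection fault, not one caused by dirty logging.
            raise PermissionError(f"guest may not write page {page}")
        self.dirty[page] = True         # log it in the dirty bitmap
        self.shadow_writable.add(page)  # let this and later writes proceed

    def dirty_pages(self):
        return [p for p, is_dirty in enumerate(self.dirty) if is_dirty]


tracker = DirtyTracker(num_pages=8, guest_writable=range(6))
for page in (2, 2, 5):
    tracker.write(page)
print(tracker.dirty_pages())  # [2, 5]
tracker.start_round()
print(tracker.dirty_pages())  # []
```

Note that only the first write to a page in each round pays the cost of a fault. After that the page stays writable until the next round resets everything.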
When it is determined that pre-copying is no longer beneficial (using the heuristics mentioned above), a control message is sent to the OS asking it to suspend itself. Once the OS has done this, the VMM informs the control software, the dirty bitmap is scanned one last time for inconsistent memory pages, and these are transferred to the destination together with the VM's checkpointed CPU register state.
When the final information is received at the destination, the VM state on the source machine can safely be discarded. The control software on the destination machine scans the memory map and rewrites the guest's page tables. Execution is then resumed by starting the new VM at the point where the old VM checkpointed itself.
Self Migration
Self migration is done within the OS being migrated (the migratee). It is similar to managed migration: at each pre-copying round every page table entry (PTE) is write-protected, and the OS maintains a dirty bitmap tracking the dirtied pages. But the OS also uses write-protection faults for other purposes of its own. To tell the two apart, a spare bit in each PTE is reserved to indicate that the page is write-protected only for dirty-logging purposes.
In managed migration, the migratee can simply be suspended to obtain a consistent checkpoint. In self migration, however, the OS must continue to run in order to transfer its final state. The final stage is therefore split into two steps. First, all OS activity except migration is disabled and a final scan of the dirty bitmap is performed, transferring the pages it marks. Any pages that are dirtied during this final scan are copied to a shadow buffer. Second, the shadow buffer is transferred as the final part of the OS checkpoint.
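The two-step ending can be sketched as follows. This is a heavily simplified model: the function and parameter names are hypothetical, and the writes that happen during the scan are passed in as a list instead of occurring concurrently.

```python
def self_migrate_final_stage(memory, dirty, late_writes, send):
    """Simplified two-step final stage of self migration.

    memory:      {page: contents} of the migrating OS
    dirty:       set of pages still marked in the dirty bitmap
    late_writes: [(page, contents)] made by the migration code itself
                 while the final scan is running
    send:        callable(page, contents) shipping one page to the destination
    """
    # Step 1: everything except migration is disabled. Scan the bitmap,
    # send each dirty page and clear its bit.
    for page in sorted(dirty):
        send(page, memory[page])
    dirty.clear()

    # Pages touched during the scan are snapshotted into a shadow buffer
    # instead of being chased with yet another round.
    shadow_buffer = {}
    for page, contents in late_writes:
        memory[page] = contents
        shadow_buffer[page] = contents

    # Step 2: ship the shadow buffer as the last part of the checkpoint.
    for page in sorted(shadow_buffer):
        send(page, shadow_buffer[page])
    return shadow_buffer


destination = {}
source = {0: "kernel", 1: "stack-v1", 2: "heap-v1"}
self_migrate_final_stage(source, {1, 2}, [(1, "stack-v2")], destination.__setitem__)
print(destination)
```

In this run page 1 is sent twice, and the destination ends up with the newer contents from the shadow buffer. Page 0 is not sent because it was not dirty.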
The Cheat Code: Paravirtualization
Paravirtualization is a virtualization technique where the guest operating system is modified to communicate directly with the hypervisor rather than relying on emulated hardware.
It benefits live migration in two ways.
Stunning Rogue Processes
Some processes may dirty memory faster than it can be transferred over the network. Those are called 'rogue' processes.
We can mitigate this by forking a monitoring thread within the OS kernel that watches the WWS of individual processes and takes action when required, for example by pausing ("stunning") a process that dirties pages too quickly.
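One way such a monitor could work is to count write faults per process and park any process that exceeds a budget. The sketch below assumes exactly that policy. The class name, the budget of 3 faults and the process IDs are invented for illustration.

```python
from collections import Counter


class WriteFaultMonitor:
    """Parks processes that take too many write faults during migration."""

    def __init__(self, fault_limit):
        self.fault_limit = fault_limit
        self.faults = Counter()
        self.stunned = set()

    def on_write_fault(self, pid):
        """Return True if the process may keep running, False if it is parked."""
        if pid in self.stunned:
            return False
        self.faults[pid] += 1
        if self.faults[pid] > self.fault_limit:
            self.stunned.add(pid)  # over budget: move it to a wait queue
            return False
        return True

    def release_all(self):
        """Called once migration is over: forget counts, wake everyone up."""
        self.faults.clear()
        self.stunned.clear()


monitor = WriteFaultMonitor(fault_limit=3)
print([monitor.on_write_fault(pid=1) for _ in range(4)])  # [True, True, True, False]
print(monitor.on_write_fault(pid=2))                      # True
```

Only the process that hammers memory gets slowed down. Everything else on the VM keeps running normally.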
Freeing Page Cache Pages
Some memory is effectively free but is being used to hold page cache pages. These pages don't need to be migrated with the VM, and sending them only slows the migration down. So we can drop the cache before sending. The trade-off is that if the contents of those pages are needed again, it takes a little time to read them back from disk.
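The saving is easy to picture as a filter over the pages to send. This is an illustrative sketch: the three page labels and the function name are assumptions, and a real guest would hand the pages back to the hypervisor rather than tag them in a dictionary.

```python
def pages_worth_sending(page_kinds):
    """page_kinds maps page number -> "in_use", "free" or "page_cache".

    Free pages and droppable cache pages are skipped, so only memory the
    guest actually needs goes over the network.
    """
    return sorted(p for p, kind in page_kinds.items() if kind == "in_use")


page_kinds = {0: "in_use", 1: "page_cache", 2: "free", 3: "in_use", 4: "page_cache"}
to_send = pages_worth_sending(page_kinds)
skipped = len(page_kinds) - len(to_send)
print(to_send, skipped)  # [0, 3] 3
```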
The Invisible Backbone of the Modern Cloud
It sounds like magic, but the engineers who designed this system managed to get the downtime for a migrating server down to as little as 60 milliseconds. That is faster than the blink of an eye.
When you read through the mechanics—from shadow page tables trapping memory faults to the operating system dropping its own cache to speed up the move—you realize it isn't magic at all. It is just incredibly clever systems engineering.
Today, this foundational concept is the invisible backbone of the modern internet. It is the reason AWS, Google Cloud, and Azure can perform massive hardware upgrades, balance millions of terabytes of traffic, and maintain physical servers without ever dropping your Netflix stream or pausing your online game. The next time you are playing a multiplayer game without a single glitch, just remember: the physical server hosting your match might have just been teleported to a completely different rack miles away, and you didn't even notice.
Note: This post is a simplified breakdown and interpretation of the original research. If you want to dive into the deep technical math and architecture, I highly recommend reading the original 2005 paper by Christopher Clark and his team at Cambridge University.