KVM storage live migration

One of the features I’ve been most excited about with this new server setup is KVM live migration using virsh migrate with --copy-storage-all. This is like a regular VM live migration, but it works even if your KVM hosts have no shared storage (usually a dedicated NAS holding the VM disk images). I don’t want to depend on too much hardware, and a NAS adds a lot of gear: the NAS itself, which you might want to make redundant, plus network equipment (do you want redundant switches too?).

My goal with this setup is to be able to do some hardware maintenance on hosts, as well as upgrading of the KVM host software and firewall software without having to take everything offline. If there is some unexpected hardware failure everything will still go offline and there is no fast way to recover from that but I’m OK with that for now. I also want the setup to be reasonably simple, so I wanted to stay away from any kind of clustering file systems that probably will not work well on just two hosts.

This is a simplified view of the setup I went for:

[Figure: Blinkenshell 2021 KVM host setup]

Both KVM hosts were installed at about the same time and have been running since the server migration. The second internet connection was added later, however, so I wasn’t able to test the internet failover part before we went live. I did do some tests migrating VMs between the hosts, but it was not possible to completely power off one of the KVM hosts since that would have meant losing internet connectivity. A couple of weeks ago I added the second internet connection, and I have been working on the firewall setup to make failover possible. I might do a separate post about this if someone is interested.

So, how do you actually migrate a VM between two KVM hosts without shared storage? First we should probably discuss the actual software setup in a bit more detail. I’m running Ubuntu hosts with KVM and manage my VMs using libvirt, either with virsh on the command line or, preferably, with the GUI in Virtual Machine Manager (virt-manager). I’ve also enabled the libvirtd TCP socket on a dedicated VLAN on the 10G trunk between the hosts to make the copying faster. I did this by overriding a few settings in the libvirtd-tcp.socket systemd unit that ships with Ubuntu: systemctl edit libvirtd-tcp.socket

[Socket]
ListenStream=
ListenStream=<local-copy-vlan-ip>:16509
IPTTL=1

And enable the socket: systemctl enable libvirtd-tcp.socket.
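Once the socket has been started, it can be worth checking that it is bound to the copy VLAN address only. This is a minimal sketch, not part of my original setup notes: listening_on is a hypothetical helper name, and it assumes the usual column layout of ss -ltn (state in the first column, local address:port in the fourth).

```shell
#!/bin/sh
# listening_on ADDR:PORT
# Reads `ss -ltn` output on stdin and succeeds (exit 0) only if a LISTEN
# line has exactly ADDR:PORT as its local address.
# Hypothetical helper for illustration; assumes the standard ss columns.
listening_on() {
    awk -v want="$1" '
        $1 == "LISTEN" && $4 == want { found = 1 }
        END { exit !found }
    '
}

# Usage on a KVM host (replace the placeholder with the real address):
#   ss -ltn | listening_on "<local-copy-vlan-ip>:16509" && echo "socket is up"
```

If the check fails, systemctl status libvirtd-tcp.socket is probably the first place to look.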

The magic is then accomplished by adding the option --copy-storage-all to the virsh migrate command, something like this:

virsh migrate --p2p testvm1 --undefinesource --persistent \
 --copy-storage-all qemu+ssh://<other-host-ip>/system tcp://<other-host-ip>

There are lots of options to virsh migrate depending on what you want to do. In my case I want to migrate the VM testvm1 to a new host, and I want the VM to run permanently on the new host until I make some other change. --undefinesource and --persistent make it so that the VM configuration is removed from the old host and only appears on the new host afterwards, where it is kept in a persistent state (instead of just a temporary move). --copy-storage-all copies the contents of all local disks attached to the VM, but you have to create empty qcow2/raw disk files with the correct sizes on the new host before you can start the migration! It’s important they have the same name and size as on the original host. The trailing tcp://<other-host-ip> migration URI is there so that the actual data transfer goes over the dedicated copy VLAN.
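The disk preparation step is easy to get wrong by hand, so here is a rough sketch of how it could be scripted. It is not what I actually ran: the function names are made up for illustration, it assumes every disk uses the image format you pass in (qcow2 in the usage example), and it assumes disk paths contain no spaces. It only prints the qemu-img create commands, so you can review them before running them on the new host.

```shell
#!/bin/sh
# list_disks: reads `virsh domblklist --details DOMAIN` output on stdin and
# prints "<target> <source-path>" for every row whose Device column is "disk"
# (this skips the header, the separator line and cdrom entries).
list_disks() {
    awk '$2 == "disk" { print $3, $4 }'
}

# capacity_bytes: reads `virsh domblkinfo DOMAIN TARGET` output on stdin and
# prints the Capacity value, i.e. the virtual disk size in bytes.
capacity_bytes() {
    awk '$1 == "Capacity:" { print $2 }'
}

# create_cmd FORMAT PATH SIZE_BYTES: prints (does not run) the command that
# creates an empty image of the same name and size.
create_cmd() {
    printf 'qemu-img create -f %s %s %s\n' "$1" "$2" "$3"
}

# print_create_cmds DOMAIN FORMAT: run on the SOURCE host; the printed
# commands are meant to be reviewed and then run on the DESTINATION host.
print_create_cmds() {
    dom=$1
    fmt=$2
    virsh domblklist --details "$dom" | list_disks |
    while read -r target path; do
        size=$(virsh domblkinfo "$dom" "$target" </dev/null | capacity_bytes)
        create_cmd "$fmt" "$path" "$size"
    done
}

# Usage: print_create_cmds testvm1 qcow2
```

Printing instead of executing is deliberate: creating an image at the wrong path on the destination could overwrite a disk that belongs to another VM.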

In my case most VMs are very small, and a copy including storage of a very small VM can complete in something like 30 seconds. A large VM with a few hundred gigabytes of storage takes almost 20 minutes to copy. The storage I’m using is NVMe SSDs that are a few generations old; in a RAID1 setup they manage to write at around 1GB/s, which fits pretty nicely with a single 10G NIC on the network side. I actually connected 2x10G hoping to get 2GB/s, but the write speed after RAID was a bit too slow for that. It’s still good enough for my scenario.
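For planning a maintenance window, a back-of-the-envelope calculation gives a lower bound on the copy time. The sketch below uses decimal units (1 GB = 1000 MB) and integer arithmetic; the function names and the 300 GB figure are just illustrative assumptions, and real migrations will take longer than this best case since the guest keeps writing while the copy runs.

```shell
#!/bin/sh
# link_mb_per_s GBIT: theoretical line rate of a link in MB/s
# (1 Gbit/s = 1000 Mbit/s, 8 bits per byte; protocol overhead ignored).
link_mb_per_s() {
    echo $(( $1 * 1000 / 8 ))
}

# transfer_seconds SIZE_GB RATE_MB_PER_S: best-case copy time in seconds,
# assuming the slower of disk and network sustains RATE_MB_PER_S throughout.
transfer_seconds() {
    echo $(( $1 * 1000 / $2 ))
}

# Usage:
#   link_mb_per_s 10          # one 10G NIC  -> 1250
#   transfer_seconds 300 1000 # 300 GB at 1 GB/s -> 300 (five minutes)
```

The first number shows why a single 10G link is a good match for disks that write at about 1GB/s: the network ceiling of 1250 MB/s sits only slightly above what the storage can absorb.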

The actual news in this post is that yesterday was the first time I actually completely powered off the KVM1 host to connect a UPS, and I managed to use this storage live migration to move triton (the SSH server) off from the KVM host, perform my maintenance, and move triton back without disconnecting any IRC sessions! So both KVM migration and firewall failover worked, yay! It’s still kind of a tricky maneuver so I fully expect I will mess it up next time 😀 Stay tuned
