Using Backup Exec to back up virtual environments that use Hyper-V 2 (Windows Server 2008 R2) as the hypervisor may not be as simple as it first looks.
In this paper I want to cover some of the reasons why partners and customers struggle to get reliable backups in those environments, especially when using clustered Hyper-V installations.
Cluster Shared Volume (CSV)
When you use Hyper-V in a clustered environment you will, in almost every case, use Cluster Shared Volumes (CSVs) to store your virtual machines.
Without going into deep detail, CSV is an add-on to NTFS that allows multiple computers to access the same storage volume at the same time. This means that multiple hypervisors can host their virtual machines on the same volume.
To prevent conflicts, every CSV has an owner assigned that decides which computer may access which block on the volume at a given time.
In Windows Server 2008 R2, CSVs are only supported in Hyper-V environments and may not hold any data other than that used for virtualization.
Virtual Machine Snapshots
Snapshotting a virtual machine in Hyper-V 2 means that the machine’s memory is dumped to the hard drive of the hypervisor and that the virtual hard drive (VHD) of the machine is set to read-only. In addition, a new hard drive file with the extension .AVHD is created, and any future changes to the hard drive are written to that file.
If you take a second snapshot of that machine, the memory is dumped once again, the first .AVHD file is set to read-only and a second .AVHD file is created.
So far there should be no issues; the process is similar to the way other hypervisors, like VMware, behave. The difference (and the problematic part) lies in the snapshot removal:
When you remove a snapshot via the Hyper-V management console, the snapshot is deleted from the console, but the .AVHD files remain on the disk. They are only removed when you shut down the virtual machine. And when I say shut down, I mean shut down: powered off. A simple reboot or restart of the machine does not remove the files.
Only when the machine is really powered off is Hyper-V able to merge the .AVHD file(s) into the VHD file and remove the .AVHD files.
With this caveat in mind, it’s easy to understand that it might not be a good idea to use VM snapshots for backup purposes. (Unless you want to allow your backup software to shut down your virtual machines right after the backup has finished…)
So, to work around this caveat, Backup Exec takes its snapshots not per VM but per volume, which means per CSV, by using VSS technologies on the hypervisor itself.
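To make the per-VM snapshot caveat concrete, here is a minimal Python sketch of the disk chain described above. The class ToyVM and its methods are hypothetical names made up for illustration; they do not correspond to any Hyper-V or Backup Exec API. As a simplification, the sketch assumes the merge only happens once all snapshots have been deleted from the console.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Simplified model of a Hyper-V 2 VM disk chain (illustrative only)."""
    name: str
    disk_files: list = field(default_factory=list)  # base VHD first, then AVHDs
    snapshots: list = field(default_factory=list)   # entries shown in the console
    counter: int = 0

    def __post_init__(self):
        if not self.disk_files:
            self.disk_files.append(f"{self.name}.vhd")

    def take_snapshot(self):
        # The current top of the chain becomes read-only; writes go to a new AVHD.
        self.counter += 1
        snapshot = f"snapshot-{self.counter}"
        self.snapshots.append(snapshot)
        self.disk_files.append(f"{self.name}_{self.counter}.avhd")
        return snapshot

    def delete_snapshot(self, snapshot):
        # Only the console entry disappears; the AVHD file stays on disk.
        self.snapshots.remove(snapshot)

    def reboot(self):
        # A restart does not power the VM off, so nothing is merged.
        pass

    def shut_down(self):
        # Assumption: with no snapshots left, all AVHDs are merged into the VHD.
        if not self.snapshots:
            del self.disk_files[1:]


vm = ToyVM("vmA")
first = vm.take_snapshot()
second = vm.take_snapshot()
vm.delete_snapshot(first)
vm.delete_snapshot(second)
vm.reboot()
print(vm.disk_files)  # the two AVHD files are still listed
vm.shut_down()
print(vm.disk_files)  # only the base VHD remains
```

The point of the model is that delete_snapshot and reboot never shrink the file list; only shut_down does.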
Redirected Access
Redirected Access means that traffic to a CSV is not written directly to it over Fibre Channel or whatever link the servers have to the storage, but is instead forwarded to the CSV’s owner via LAN and then written to the disk by the owner on behalf of the original server.
This technology is used if a server loses its entire connection to the storage, e.g. because someone pulls all the Fibre Channel cables at the same time. Instead of all the VMs hosted on that server crashing, the server just redirects the data stream via LAN to the owner of the CSV.
There is another situation where Redirected Access is used by Hyper-V: when a snapshot of one of the CSVs needs to be taken. When the command for taking a snapshot is received by the CSV’s owner, the server sets the CSV into Redirected Access, blocking all other servers from directly accessing the volume. When this is done, the snapshot is taken.
When the snapshot is deleted again, the server automatically removes the Redirected Access flag from the CSV, allowing the other servers to access the volume directly again.
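The two triggers for Redirected Access described above (a lost storage link and a volume snapshot) can be sketched as a small state model. ToyCSV, io_path and the returned strings are hypothetical and only illustrate the decision logic; the sketch also assumes that the owner itself always has a working link to the storage.

```python
from dataclasses import dataclass


@dataclass
class ToyCSV:
    """Simplified model of a CSV and its access mode (illustrative only)."""
    name: str
    owner: str
    redirected: bool = False

    def io_path(self, node, has_storage_link=True):
        # Assumption: the owner always writes directly to the storage.
        if node == self.owner:
            return "direct"
        # Other nodes go through the owner while the volume is in
        # Redirected Access or when their own storage link is gone.
        if self.redirected or not has_storage_link:
            return f"redirected via LAN to {self.owner}"
        return "direct"

    def create_snapshot(self, requester):
        # Only the owner takes the snapshot; it first sets Redirected Access.
        if requester != self.owner:
            raise PermissionError(f"{requester} does not own {self.name}")
        self.redirected = True

    def delete_snapshot(self):
        # Removing the snapshot also clears the Redirected Access flag.
        self.redirected = False


csv2 = ToyCSV("CSV 2", owner="Server A")
print(csv2.io_path("Server B"))                          # normal operation
print(csv2.io_path("Server B", has_storage_link=False))  # cables pulled
csv2.create_snapshot("Server A")
print(csv2.io_path("Server B"))                          # during the snapshot
csv2.delete_snapshot()
print(csv2.io_path("Server B"))                          # back to normal
```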
Backup Example 1 (simple)
In this example, there is a VM named A that is hosted on Hyper-V node A and has its VHDs stored on CSV 2.
When the backup is started, Backup Exec contacts the cluster to find out which node is responsible for the VM. Backup Exec then contacts that node and asks for a snapshot of the volume(s) that host the files belonging to that particular VM.
In this example, Server A would be asked to take a snapshot of CSV 2.
To be able to do so, Server A needs to be the owner of CSV 2, so the first step is to move CSV 2 to Server A. This is done without any downtime, so the VMs will not be impacted.
After the ownership is corrected, Server A sets CSV 2 to Redirected Access, takes the snapshot and forwards the backup data to Backup Exec.
When the backup job is finished, the snapshot is cleaned up and the Redirected Access mode is removed from the volume.
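The sequence in this example can be written down as an ordered list of steps. The following Python sketch is only a paper model of the flow described above; plan_backup and its dictionaries are hypothetical, and the assumption that CSV 2 is initially owned by Server B is made purely so that the ownership move shows up.

```python
def plan_backup(vm, vm_host, vm_volumes, csv_owner):
    """Return the ordered steps for backing up one VM in this toy model."""
    owners = dict(csv_owner)  # work on a copy, leave the caller's data alone
    node = vm_host[vm]
    steps = [f"cluster reports {node} as the node hosting {vm}"]
    for volume in vm_volumes[vm]:
        if owners[volume] != node:
            # The node taking the snapshot has to own the CSV first.
            steps.append(
                f"move ownership of {volume} from {owners[volume]} to {node}"
            )
            owners[volume] = node
        steps.append(f"{node} sets {volume} to Redirected Access")
        steps.append(f"{node} takes a snapshot of {volume}")
        steps.append(f"{node} sends the backup data from {volume} to Backup Exec")
        steps.append(
            f"{node} deletes the snapshot and clears Redirected Access on {volume}"
        )
    return steps


# Assumed starting point: CSV 2 is owned by Server B (not stated in the text).
vm_host = {"Hyper-V A": "Server A"}
vm_volumes = {"Hyper-V A": ["CSV 2"]}
csv_owner = {"CSV 2": "Server B"}

for number, step in enumerate(plan_backup("Hyper-V A", vm_host, vm_volumes, csv_owner), 1):
    print(number, step)
```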
Backup Example 2 (still simple)
In this second example we have two VMs on our cluster. We also have a backup strategy that forces us to back up both VMs at the same time in two different jobs. (Remember, this is just an example to show you some potential problems.)
The backup of Hyper-V A will run the same way as in the first example, and the backup of Hyper-V B is even simpler, since Server B is already the owner of the CSV that holds the data to be backed up.
So in this example, there shouldn’t be any real problems.
Let’s move on to the next one…
Backup Example 3 (getting more difficult)
We have the same number of VMs as in example 2. The difference is that someone created a second hard drive for Hyper-V B and put it on CSV 2.
So the backup of Hyper-V A runs the same way it always has.
But when we try to back up Hyper-V B, Server B needs to move the ownership of CSV 2 to itself to be able to create the snapshot. Since the backup of Hyper-V A is running and CSV 2 is therefore in Redirected Access, this move cannot be done. So Server B has to wait until the backup of Hyper-V A finishes and Server A removes the Redirected Access from CSV 2 before the backup of Hyper-V B can be started…
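The dependency in this example can be checked mechanically: a job has to wait whenever one of its volumes is already being backed up by a job on a different node. The Python sketch below is a hypothetical helper, not part of Backup Exec. The name "CSV 1" for Hyper-V B's original volume is an assumption, and the sketch deliberately ignores jobs that run on the same node.

```python
def blocking_jobs(vm, running, vm_host, vm_volumes):
    """Return (running VM, shared volumes) pairs that the backup of vm waits for."""
    needed = set(vm_volumes[vm])
    blockers = []
    for other in running:
        shared = needed & set(vm_volumes[other])
        # A CSV in Redirected Access cannot change its owner, so a job on
        # another node has to wait until the running job releases it.
        if shared and vm_host[other] != vm_host[vm]:
            blockers.append((other, sorted(shared)))
    return blockers


vm_host = {"Hyper-V A": "Server A", "Hyper-V B": "Server B"}

# Example 2: the VMs use separate volumes ("CSV 1" is an assumed name).
example_2 = {"Hyper-V A": ["CSV 2"], "Hyper-V B": ["CSV 1"]}
print(blocking_jobs("Hyper-V B", ["Hyper-V A"], vm_host, example_2))

# Example 3: Hyper-V B got a second hard drive on CSV 2.
example_3 = {"Hyper-V A": ["CSV 2"], "Hyper-V B": ["CSV 1", "CSV 2"]}
print(blocking_jobs("Hyper-V B", ["Hyper-V A"], vm_host, example_3))
```

For example 2 the list is empty; for example 3 it names Hyper-V A and CSV 2 as the reason for the wait.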
Backup Example 4 (real life)
In real life, we find multiple VMs per host, and quite a few of those VMs have multiple VHDs on the CSVs.
Without going into details, have a look at the picture above and imagine how you would need to set up your backup jobs to ensure they do not affect each other.
Solution (or at least a way to get it running)
After trying lots of ideas and monitoring quite a lot of jobs, the best (and only) solution I found for the dependency problem in Hyper-V is to strictly sort the machines and their resources.
So try to put the files for all machines that are hosted on Server A on a dedicated CSV, set Server A as the owner of that volume and create one backup job for all machines on that CSV. Do the same with the machines hosted on Server B, and so on.
This spares the servers from having to wait for other servers to finish their backups and allows you to successfully run multiple backup jobs at the same time (one per CSV).
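Whether an existing environment already follows this sorting rule can be verified with a few lines of code. The following Python sketch is a hypothetical check, assuming you have collected which node hosts each VM and which CSVs hold its files; the VM and server names are invented. It proposes one backup job per cleanly sorted CSV and lists the CSVs that are shared across nodes.

```python
from collections import defaultdict


def check_layout(vm_host, vm_volumes):
    """Return (jobs, violations) for a VM-to-CSV layout in this toy model."""
    hosts_per_csv = defaultdict(set)
    vms_per_csv = defaultdict(list)
    for vm, volumes in vm_volumes.items():
        for volume in volumes:
            hosts_per_csv[volume].add(vm_host[vm])
            vms_per_csv[volume].append(vm)
    # A CSV used by VMs from more than one node breaks the sorting rule.
    violations = {
        csv: sorted(hosts)
        for csv, hosts in hosts_per_csv.items()
        if len(hosts) > 1
    }
    # One backup job per cleanly sorted CSV, covering all VMs on it.
    jobs = {
        csv: sorted(vms)
        for csv, vms in vms_per_csv.items()
        if csv not in violations
    }
    return jobs, violations


vm_host = {"VM 1": "Server A", "VM 2": "Server A", "VM 3": "Server B"}

sorted_layout = {"VM 1": ["CSV 1"], "VM 2": ["CSV 1"], "VM 3": ["CSV 2"]}
print(check_layout(vm_host, sorted_layout))

mixed_layout = {"VM 1": ["CSV 1"], "VM 2": ["CSV 1", "CSV 2"], "VM 3": ["CSV 2"]}
print(check_layout(vm_host, mixed_layout))
```

In the mixed layout, CSV 2 is reported as a violation because it holds files of VMs from both Server A and Server B, which is exactly the situation from example 3.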
Hyper-V 3
The newly released version of Windows Server, Windows Server 2012, also brings a new version of Hyper-V with it: Hyper-V 3.0.
In Hyper-V 3, one of the most important features (at least from a backup perspective) is the ability to merge snapshots online. This means that the AVHD files created when the snapshot is taken are removed from the disk as soon as the snapshot is deleted from the virtual machine, without the need to shut the machine down.