VMware backups of VMs with multiple disks may read from the incorrect VMDK under specific conditions
Severity
Medium
Description
Backups of VMware virtual machines with multiple disks (VMDKs) may, under specific conditions, contain limited data corruption because some data is read from the incorrect VMDK during backup. Because the issue occurs only under specific conditions and affects a small amount of data (no more than 256KB per disk), most recovery operations are successful.
No error is reported during the backup. The corruption may be discovered in subsequent backups, when Accelerator checksum verification causes some jobs to fail with status 84.
The issue may be present if all of the following conditions are met:
- The VM being backed up has multiple disks.
- The backup uses or has used NBD or NBDSSL transport mode.
- The VMware policy has indexing/mapping enabled (the 'Enable file recovery from VM backup' check box is checked).
This issue occurs only when using specific NetBackup versions.
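The three conditions above can be expressed as a small check. The following is a minimal sketch, not a NetBackup tool: the function name meets_issue_conditions and its arguments are hypothetical, and the disk count, transport mode, and mapping setting must be supplied by the operator from the VM and policy configuration.

```shell
# Hypothetical helper: succeeds (exit status 0) only when all three
# conditions from this alert are met.
# Arguments: <disk_count> <transport_mode> <mapping_enabled: yes|no>
meets_issue_conditions() {
    disk_count=$1
    # Normalize the transport mode to lowercase so NBD and nbd both match.
    transport=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    mapping=$3

    # Condition 1: the VM has multiple disks.
    [ "$disk_count" -gt 1 ] || return 1

    # Condition 2: the backup uses NBD or NBDSSL transport mode.
    case "$transport" in
        nbd|nbdssl) ;;
        *) return 1 ;;
    esac

    # Condition 3: 'Enable file recovery from VM backup' is checked.
    [ "$mapping" = "yes" ] || return 1
    return 0
}
```

For example, `meets_issue_conditions 3 NBDSSL yes && echo "conditions met"` prints the message, while a single-disk VM or a policy with mapping disabled does not.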
Affected NetBackup Versions
NetBackup 10.0
NetBackup 10.0.0.1
NetBackup 10.1
NetBackup 10.1.1
NetBackup 10.2
NetBackup 10.2.0.1
Problem Fixed in:
NetBackup 10.3
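When checking many backup hosts, the version list above can be encoded in a script. This is a minimal sketch under stated assumptions: is_affected_version is a hypothetical name, and the caller is assumed to already know the installed version string for each host. How that string is obtained is outside the scope of the sketch.

```shell
# Hypothetical helper: succeeds when the given version string appears in
# this alert's list of affected NetBackup versions.
is_affected_version() {
    case "$1" in
        10.0|10.0.0.1|10.1|10.1.1|10.2|10.2.0.1) return 0 ;;
        *) return 1 ;;   # includes 10.3, where the problem is fixed
    esac
}
```

Usage: `is_affected_version 10.1.1 && echo "apply the configuration change or upgrade"`.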
Action Required
To prevent corruption in future images while an affected NetBackup version is installed, run the following commands as root (Linux) or Administrator (Windows) on every VMware backup host.
1. On all NetBackup Linux VMware Backup hosts, execute as root:
# echo VIX_LRU_READ_NUMBER_OF_BLOCKS = 0 | /usr/openv/netbackup/bin/nbsetconfig
2. To confirm the setting is correct, execute this command and verify that VIX_LRU_READ_NUMBER_OF_BLOCKS is set to 0.
# /usr/openv/netbackup/bin/nbgetconfig -x
3. On all NetBackup Windows VMware Backup hosts, execute at Administrator command prompt:
echo VIX_LRU_READ_NUMBER_OF_BLOCKS = 0 | "C:\Program Files\Veritas\NetBackup\bin\nbsetconfig.exe"
4. To confirm the setting is correct, execute this command and verify that VIX_LRU_READ_NUMBER_OF_BLOCKS is set to 0.
"C:\Program Files\Veritas\NetBackup\bin\nbgetconfig.exe" -x
5. Alternatively, upgrade to NetBackup 10.3. The upgrade also prevents corruption of future images without requiring the above configuration change.
Once the configuration change or the upgrade to NetBackup 10.3 has been completed, new backups will not experience this issue.
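On Linux backup hosts, the confirmation step can be scripted by filtering the nbgetconfig output. The sketch below is illustrative only: vix_lru_blocks_is_zero is a hypothetical helper, and it assumes the output lists entries in the same "NAME = value" form that is passed to nbsetconfig.

```shell
# Hypothetical helper: reads configuration text on stdin and succeeds only
# if VIX_LRU_READ_NUMBER_OF_BLOCKS is present and its value is 0.
# Assumption: entries appear one per line as "NAME = value".
vix_lru_blocks_is_zero() {
    awk -F'=' '
        {
            key = $1; val = $2
            # Strip surrounding whitespace from the key and the value.
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t\r]+|[ \t\r]+$/, "", val)
        }
        key == "VIX_LRU_READ_NUMBER_OF_BLOCKS" { found = 1; ok = (val == "0") }
        END { exit (found && ok) ? 0 : 1 }
    '
}
```

Usage on a backup host: `/usr/openv/netbackup/bin/nbgetconfig -x | vix_lru_blocks_is_zero && echo "setting applied"`.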
For virtual machine backup images that were created with NetBackup versions 10.0 through 10.2.0.1 and are suspected of corruption, establish a new baseline for future backups.
- If Accelerator is being used, then perform a full backup using the "Accelerator forced rescan" option.
- If Accelerator is not being used, then perform a full backup.
- Guidance for Accelerator Forced Rescan:
  - Perform this in a phased manner, prioritizing the most important VMs.
  - An Accelerator Forced Rescan run will not increase storage usage, but the job duration will be comparable to a non-Accelerator full backup.
Fix Scenario Considerations:
Scenario One
VMware backup images created with NetBackup 10.0 – 10.2.0.1 installed
Action: Make VIX_LRU_READ_NUMBER_OF_BLOCKS configuration change
Action: Establish a new baseline for suspect images
Scenario Two
VMware backup images created with NetBackup 10.0 – 10.2.0.1 installed
Action: Upgrade to NetBackup 10.3
Action: Establish a new baseline for suspect images
Notes related to reported corruption:
- Data corruption is limited to no more than 256KB per disk.
- The problem was observed during internal testing, and most attempts at single file recovery, as well as full VM restores, from impacted images were successful. Proactive expiration of older images is neither necessary nor recommended.
- Possible symptoms include the inability to boot a restored VM. Even in that case, an individual VMDK restore or a single file restore from the same image is likely to be successful.
- NetBackup 10.3 corrects the problematic behavior for new VMware backup images.
Detailed Questions and Answers:
Q1: Are all my VMware backup images corrupted?
No. The size of potential corruption is very small and almost all the impacted images can still be used for recovery purposes. We do not recommend expiring old VMware backup images.
Q2: Do we understand the root cause of this corruption?
Yes. In NetBackup 10.0, a read cache was introduced in the VxMS plugin to address a mapping performance issue. If indexing/mapping disk reads and data movement disk reads occur in a particular sequence, the backup may encounter this problem. Although the configuration described may be common, we do not believe that this problem would have much impact on recovery operations.
Q3: What are the details surrounding the conditions that may cause this possible corruption?
The issue occurs only with NBD or NBDSSL transport, when the VM has multiple disks and mapping/file indexing is enabled. SAN and HotAdd transport modes are not impacted.
With NBD or NBDSSL transport, disks are closed and reopened during the backup, and a VMDK handle may be reused for a different disk than intended.
If that happens, a data movement read may request the exact same offset and length as an earlier mapping/indexing read. If the old data is still in the cache, the cached data from the other disk is returned, and those blocks contain the wrong data in the backup image.
The read cache is small and is used only for reads that fall within an aligned 4KB extent. The cache consists of 64 buffers of 4KB each. During data movement, small reads that are not in the cache replace the least recently used cache buffers. For corruption to occur, a read must hit one of the buffers that was cached during mapping before that buffer has been overwritten.
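The cache dimensions above can be worked through with simple arithmetic. The sketch below is illustrative only and is not the VxMS implementation: read_fits_one_extent is a hypothetical helper that reflects one reading of "within an aligned 4KB extent" (the read does not cross a 4096-byte boundary). The computed total is consistent with the 256KB per-disk limit stated in the notes, although the alert does not explicitly derive that limit from the cache size.

```shell
# Values taken from the description above.
buffer_count=64      # number of cache buffers
buffer_size_kb=4     # size of each buffer, in KB
cache_size_kb=$((buffer_count * buffer_size_kb))
echo "Read cache capacity: ${cache_size_kb}KB"   # prints: Read cache capacity: 256KB

# Hypothetical helper: succeeds when a read of <length> bytes at byte
# <offset> stays inside a single aligned 4KB (4096-byte) extent.
read_fits_one_extent() {
    offset=$1
    length=$2
    [ "$length" -gt 0 ] || return 1
    first=$((offset / 4096))                  # extent holding the first byte
    last=$(((offset + length - 1) / 4096))    # extent holding the last byte
    [ "$first" -eq "$last" ]
}
```

For example, `read_fits_one_extent 8192 512` succeeds, while `read_fits_one_extent 4095 2` fails because the read crosses an extent boundary.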
Q4: Will the 256KB of corruption prevent a 10TB restored VM from booting or otherwise being usable?
Not necessarily. During the internal automated testing where the issue was found, only a couple of kilobytes of corruption occurred, at the beginning of the disk. We think the corruption was in the boot sector or partition table, which prevented the VM from booting. Most of the data was still recoverable. Uncorrupted VMDKs are intact whether restored through a full VM restore or a selective VMDK restore. On the corrupted VMDKs, most files are still recoverable via Single File Restore (SFR).
Q5: Does this impact a policy where mapping (Enable file recovery from VM backup) is disabled?
No. This only impacts VMware policies where mapping is enabled, i.e., the 'Enable file recovery from VM backup' check box is checked.