Data corruption seen after a node encounters ‘Data Corruption Protection Activated’ error

Article: 100028678
Last Published: 2013-02-18
Product(s): InfoScale & Storage Foundation

Problem

Data corruption is seen on AIX systems when the correct Dynamic LUN reconfiguration procedure is not followed. The node logs the following message in the vxconfigd log, and if any I/O has already happened on the problematic disks, or if I/O continues on these disks, data corruption will occur. A disk group deport would flush the disk group through the incorrect path and can overwrite the private region with another disk group's information. After the deport, the disk group fails to import because of the private region corruption. vxdisk list would show the disks with the altused and udid_mismatch flags set.

Error Message

# vxdisk scandisks
VxVM vxdisk ERROR V-5-1-16007  Data Corruption Protection Activated - User Corrective Action Needed
To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
Then, execute 'vxdisk rm' on the following devices before reinitiating device discovery:
        emc0_1069, emc0_1065, emc0_1066, emc0_1067, emc0_1068

# vxdisk -e list
DEVICE       TYPE           DISK         GROUP   STATUS                                OS_NATIVE_NAME   ATTR
emc0_10a0    auto:cdsdisk   -            -       online                                hdisk58          std
emc0_10a1    auto:cdsdisk   -            -       online altused shared udid_mismatch   hdisk59          std
emc0_10a2    auto:cdsdisk   -            -       online shared udid_mismatch           hdisk60          std
emc0_10a3    auto:cdsdisk   -            -       online altused shared udid_mismatch   hdisk61          std
emc0_10a4    auto:cdsdisk   -            -       online altused shared udid_mismatch   hdisk62          std
emc0_106a    auto:cdsdisk   emc0_106a    sdg     online shared                         hdisk7           std
emc0_106b    auto:cdsdisk   emc0_106b    sdg     online shared                         hdisk8           std
emc0_106c    auto:cdsdisk   emc0_106c    sdg     online shared                         hdisk9           std
 

Cause

If a SAN/Storage administrator performs Dynamic LUN reconfiguration such that a host loses one set of paths coming from a controller, and a different set of paths later reappears on the same controller, the AIX OS reuses the same device numbers as before without validating the LUN serial numbers. If the set of LUNs now visible on this controller is not the same as before, DMP does not know that the same device number corresponds to a different LUN, and data corruption can result. The following is one scenario that can cause the DCPA error followed by data corruption.
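As a hedged illustration of the mapping DMP relies on (using device names from the output above; exact output fields vary by array, ODM package, and VxVM release), the serial number reported for an OS device can be compared with the unique disk identifier (UDID) that VxVM has recorded for the corresponding DMP node:

# lscfg -vpl hdisk59
# vxdisk list emc0_10a1 | grep -i udid
# vxdmpadm getsubpaths dmpnodename=emc0_10a1

The first command shows the VPD (including the LUN serial number) for the OS device, the second shows the UDID VxVM stored for the DMP node, and the third lists which OS devices are currently paths of that DMP node. A mismatch between what the OS device reports and what VxVM has recorded is exactly the condition DCPA is designed to catch.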

The SAN/Storage Admin un-maps a masking view (zone), causing some or all DMP nodes to lose one path. DMP marks this path as failed for all the LUNs in the masking view. I/O still goes through the other available paths.

The SAN/Storage Admin now re-provisions a different set of LUNs (Storage Group, LUN Group) to the same HBA controller on the host(s). The new LUN set can be a subset of the old view, a completely new set of LUNs, or a mix of the two. The problem is that the AIX OS can reuse the same device numbers, so there is no guarantee that a device which has reappeared is in fact the same LUN.

The SAN/Storage Admin now re-provisions the LUNs to a different target and LUN number, for example because additional LUNs are added or some LUNs are removed from the masking view. The OS device tree changes, i.e. the OS-assigned device name (e.g. rhdisk#) is different after the change.

Let us assume LUNs 0xA100 to 0xA109 are the original LUNs, assigned to disks with serial numbers ABCD1000 to ABCD1009, and that each LUN has two paths, via HBA controllers fscsi1 and fscsi2. In the OS device tree these devices are therefore mapped as rhdisk1 through rhdisk20.
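As a hedged sketch of this assumed layout (fscsi1, fscsi2 and the rhdisk numbering come from the scenario above), the path-to-controller mapping can be listed per HBA controller:

# vxdmpadm getsubpaths ctlr=fscsi1
# vxdmpadm getsubpaths ctlr=fscsi2

Under this assumption, the first command would list the 10 paths rhdisk1 through rhdisk10 and the second the 10 paths rhdisk11 through rhdisk20, each path tied back to its DMP node and enclosure.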

The SAN team adds 10 more LUNs, 0x0100 to 0x0109, and wants to rebalance the LUNs at the same time.

The SAN team deletes the masking view for 0xA100 to 0xA109 on controller fscsi1. The host marks the devices on controller fscsi1 as offline, and DMP marks the paths rhdisk1 to rhdisk10 as failed. I/O still goes through the fscsi2 paths.
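At this point the path loss can be confirmed from the host (a hedged sketch; the enclosure name emc0 is taken from the vxdisk output earlier in this article). The same getsubpaths command as above would now show the fscsi1 paths in a DISABLED state, and a per-node summary shows the reduced path count:

# vxdmpadm getdmpnode enclosure=emc0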

The SAN team adds a masking view on fscsi1 for the 10 new LUNs 0x0100 to 0x0109 and the existing LUNs 0xA100 to 0xA109. The AIX OS assigns OS device names rhdisk1 through rhdisk10 to the first 10 LUNs and rhdisk21 through rhdisk30 to the next 10. Because the rhdisk# changed for the original LUNs and a new set of LUNs now occupies rhdisk1 to rhdisk10, VxVM triggers the DCPA error as soon as the system administrator runs vxdctl enable on the node.

The in-core DMP device tree mapping for rhdisk1 to rhdisk10 still remembers the serial numbers of the original LUNs 0xA100 to 0xA109, but device discovery now reports the serial numbers of LUNs 0x0100 to 0x0109 for those device numbers.

The DCPA error is logged when the LUN serial number does not match; if the device number matches, it is a case of device number reuse. The paths for which a LUN serial number change was observed are then disabled. However, in the 5.1SP1 and 6.0 releases, DMP fails while attempting to disable the problematic paths because they are busy. As a result, the problematic paths remain active and I/O continues down the incorrect path, leading to data corruption. An enhancement in the 6.1 release will override the "busy" state of the device and forcefully disable the problematic paths.
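One manual mitigation to stop I/O from being issued down the reused device numbers is to disable the affected controller or paths by hand (a hedged sketch using the controller name from the scenario above; on the 5.1SP1 and 6.0 releases described here this attempt can fail for the same busy reason):

# vxdmpadm disable ctlr=fscsi1
# vxdmpadm getsubpaths ctlr=fscsi1

The second command verifies whether the paths actually moved to a DISABLED state.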

 

Solution

As long as device numbers are not reused, Dynamic LUN reconfiguration is safe. In a case like the one above, however, the DMP device and the OS device must be removed first, before the same device numbers are reused; a sketch of this cleanup is shown below.
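The following is a minimal sketch of that cleanup, with device names reused from the output earlier in this article purely for illustration; hdisk# stands for whichever OS devices vxdmpadm reports as paths of the DMP device being removed. The authoritative procedure is the one in the DMP Administrator's Guide.

# vxdmpadm getsubpaths dmpnodename=emc0_1069
# vxdisk rm emc0_1069
# rmdev -dl hdisk#

(repeat for each device being reconfigured, then complete the SAN/Storage reconfiguration)

# cfgmgr
# vxdisk scandisks

cfgmgr rebuilds the AIX device tree, after which vxdisk scandisks (or vxdctl enable) re-runs device discovery against a clean OS view, so the reused device numbers are bound to the correct LUN serial numbers.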

Refer to the DMP Administrator's Guide for the detailed Dynamic LUN reconfiguration procedure.

 

