Data corruption is seen on AIX systems when the correct Dynamic LUN reconfiguration procedure is not followed. The node logs the following message in the vxconfigd log. If any I/O has occurred on the affected disks, or if I/O continues on them, data corruption will occur. Deporting the disk group flushes it through the incorrect path and can overwrite the private region with another disk group's information. After the deport, the disk group fails to import because of the private region corruption. The vxdisk list output shows the affected disks with the altused and udid_mismatch flags set.
# vxdisk scandisks
VxVM vxdisk ERROR V-5-1-16007 Data Corruption Protection Activated - User Corrective Action Needed
To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
Then, execute 'vxdisk rm' on the following devices before reinitiating device discovery:
emc0_1069, emc0_1065, emc0_1066, emc0_1067, emc0_1068
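The device list in the message above can be pulled out programmatically before running the cleanup. The following is a minimal Python sketch, not a Veritas tool: the function name devices_from_dcpa_message is hypothetical, and it assumes the message has the layout shown above, with the device names on the line(s) directly after the 'vxdisk rm' instruction. It only prints the commands; it does not run them.

```python
import re

DCPA_MARKER = "Data Corruption Protection Activated"

def devices_from_dcpa_message(text):
    """Return the device names listed after the 'vxdisk rm' instruction line.

    Assumption: the device list follows the instruction line directly and
    ends at the first blank line or at the end of the message.
    """
    lines = [line.strip() for line in text.splitlines()]
    if not any(DCPA_MARKER in line for line in lines):
        return []
    for i, line in enumerate(lines):
        if "'vxdisk rm'" in line:
            devices = []
            for follow in lines[i + 1:]:
                if not follow:
                    break
                devices.extend(d for d in re.split(r"[,\s]+", follow) if d)
            return devices
    return []

sample = """\
VxVM vxdisk ERROR V-5-1-16007 Data Corruption Protection Activated - User Corrective Action Needed
To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
Then, execute 'vxdisk rm' on the following devices before reinitiating device discovery:
emc0_1069, emc0_1065, emc0_1066, emc0_1067, emc0_1068
"""

devices = devices_from_dcpa_message(sample)
for dev in devices:
    # Print the command for review instead of executing it.
    print(f"vxdisk rm {dev}")
```

Printing the commands rather than executing them keeps the administrator in control: the OS device tree must be brought up to date first, as the message itself states.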
# vxdisk -e list
DEVICE     TYPE          DISK       GROUP  STATUS                               OS_NATIVE_NAME  ATTR
emc0_10a0  auto:cdsdisk  -          -      online                               hdisk58         std
emc0_10a1  auto:cdsdisk  -          -      online altused shared udid_mismatch  hdisk59         std
emc0_10a2  auto:cdsdisk  -          -      online shared udid_mismatch          hdisk60         std
emc0_10a3  auto:cdsdisk  -          -      online altused shared udid_mismatch  hdisk61         std
emc0_10a4  auto:cdsdisk  -          -      online altused shared udid_mismatch  hdisk62         std
emc0_106a  auto:cdsdisk  emc0_106a  sdg    online shared                        hdisk7          std
emc0_106b  auto:cdsdisk  emc0_106b  sdg    online shared                        hdisk8          std
emc0_106c  auto:cdsdisk  emc0_106c  sdg    online shared                        hdisk9          std
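On a host with many disks, the affected devices can be picked out of the listing by scanning the STATUS column for the two flags. The sketch below is illustrative only: flagged_disks is a hypothetical helper, and it assumes the column layout shown above, where the first four fields and the last two fields are single tokens and everything in between is the status.

```python
def flagged_disks(listing, flags=("altused", "udid_mismatch")):
    """Map each device name to the listed flags found in its STATUS column.

    Assumption: each row is DEVICE TYPE DISK GROUP <status words...>
    OS_NATIVE_NAME ATTR, separated by whitespace.
    """
    result = {}
    for line in listing.splitlines():
        tokens = line.split()
        if len(tokens) < 7 or tokens[0] == "DEVICE":
            continue  # skip blank lines and the header row
        device = tokens[0]
        status = tokens[4:-2]  # status words sit between GROUP and OS_NATIVE_NAME
        hits = [flag for flag in flags if flag in status]
        if hits:
            result[device] = hits
    return result

sample_listing = """\
DEVICE TYPE DISK GROUP STATUS OS_NATIVE_NAME ATTR
emc0_10a0 auto:cdsdisk - - online hdisk58 std
emc0_10a1 auto:cdsdisk - - online altused shared udid_mismatch hdisk59 std
emc0_10a2 auto:cdsdisk - - online shared udid_mismatch hdisk60 std
emc0_10a3 auto:cdsdisk - - online altused shared udid_mismatch hdisk61 std
emc0_10a4 auto:cdsdisk - - online altused shared udid_mismatch hdisk62 std
emc0_106a auto:cdsdisk emc0_106a sdg online shared hdisk7 std
"""

suspect = flagged_disks(sample_listing)
for device, hits in suspect.items():
    print(device, ",".join(hits))
```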
If a SAN/storage administrator performs a Dynamic LUN reconfiguration in which a host loses one set of paths from a controller and, after the SAN reconfiguration, a different set of paths reappears on the same controller, AIX reuses the same device numbers as before without validating the LUN serial numbers. If the new set of LUNs visible on this controller is not the same as before, DMP does not know that the same device number now corresponds to a different LUN, and data corruption can result. The following is one scenario that can cause a Data Corruption Protection Activated (DCPA) error followed by data corruption.
The SAN/storage administrator unmaps a masking view (zone), which causes some or all DMP nodes to lose one path. DMP marks this path as failed for all the LUNs in the masking view. I/O still goes through the other available paths.
The SAN/storage administrator then re-provisions a different set of LUNs (storage group, LUN group) to the same HBA controller on the host(s). The new LUN set can be a subset of the old view, an entirely new set of LUNs, or a mix of the two. The problem is that AIX can reuse the same device numbers, so there is no guarantee that a device that has reappeared is in fact the same device.
The SAN/storage administrator then re-provisions the LUNs to a different target and LUN number. This can happen because additional LUNs were added to, or some LUNs were removed from, the masking view. The OS device tree changes, that is, the OS-assigned device name (for example, rhdisk#) is different after the change.
As an example, assume that LUNs 0xA100 to 0xA109 are the original LUNs, assigned to disks with serial numbers ABCD1000 to ABCD1009, and that there are two paths, through HBA controllers fscsi1 and fscsi2. In the OS device tree, these devices are therefore mapped as rhdisk1 through rhdisk20.
The SAN team adds 10 more LUNs, 0x0100 to 0x0109, and wants to rebalance the LUNs at the same time.
The SAN team deletes the masking view for 0xA100 to 0xA109 on controller fscsi1. The host marks the devices on controller fscsi1 as offline, and DMP marks the paths for devices rhdisk1 to rhdisk10 as failed. I/O still goes through the fscsi2 paths.
The SAN team adds a masking view on fscsi1 that contains the 10 new LUNs 0x0100 to 0x0109 and the existing LUNs 0xA100 to 0xA109. AIX assigns the OS device names rhdisk1 through rhdisk10 to the first 10 LUNs and rhdisk21 through rhdisk30 to the next 10 LUNs. Because the rhdisk# has changed for the original LUNs and a new set of LUNs now occupies rhdisk1 to rhdisk10, VxVM triggers the DCPA error as soon as the system administrator runs vxdctl enable on the node.
The in-core DMP device tree mapping for rhdisk1 to rhdisk10 still refers to LUNs 0xA100 to 0xA109, but the devices now present LUNs 0x0100 to 0x0109, so the recorded LUN serial numbers no longer match.
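The example above can be modeled in a few lines to show which device names end up pointing at a different LUN. This is a toy model under stated assumptions, not DMP's implementation: the dictionaries stand in for the OS device tree before and after the change, the assignment of rhdisk11 to rhdisk20 to fscsi2 is inferred from the example, and reused_device_names is a hypothetical helper.

```python
def rhdisk_number(name):
    """Numeric suffix of an rhdisk name, used for natural ordering."""
    return int(name[len("rhdisk"):])

def reused_device_names(before, after):
    """Device names present in both trees that now map to a different LUN."""
    names = [
        name for name in before
        if name in after and before[name][1] != after[name][1]
    ]
    return sorted(names, key=rhdisk_number)

old_luns = [0xA100 + i for i in range(10)]
new_luns = [0x0100 + i for i in range(10)]

# Before: each original LUN is visible through fscsi1 and fscsi2.
before = {}
for i, lun in enumerate(old_luns):
    before[f"rhdisk{i + 1}"] = ("fscsi1", lun)
    before[f"rhdisk{i + 11}"] = ("fscsi2", lun)   # inferred numbering

# After: fscsi1 shows the new LUNs first, then the original ones.
after = {name: path for name, path in before.items() if path[0] == "fscsi2"}
for i, lun in enumerate(new_luns):
    after[f"rhdisk{i + 1}"] = ("fscsi1", lun)     # device names reused
for i, lun in enumerate(old_luns):
    after[f"rhdisk{i + 21}"] = ("fscsi1", lun)    # original LUNs renumbered

reused = reused_device_names(before, after)
for name in reused:
    print(f"{name}: was {before[name][1]:#06x}, now {after[name][1]:#06x}")
```

Every name the model reports is a device whose in-core mapping is stale; these are the paths on which I/O would reach the wrong LUN if they stayed active.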
The DCPA error is logged when the LUN serial number does not match while the device number does, which indicates that the device number has been reused. DMP then attempts to disable the paths on which the LUN serial number change was observed. Currently (in the 5.1SP1 and 6.0 releases), however, DMP fails to disable the problematic paths because they are busy. The problematic paths therefore remain active, and I/O continues down the incorrect path, leading to data corruption. An enhancement is planned for the 6.1 release to override the "busy" state of the device and forcibly disable the problematic paths.
As long as device numbers are not reused, Dynamic LUN reconfiguration is possible. In cases like the one above, however, the DMP device and the OS device must be removed first, before the same device numbers are reused.
Refer to the DMP Admin Guide for the detailed steps of the Dynamic LUN reconfiguration procedure.