DMP Product Defect: Disabling DMP iostat collection when the recoveryoption is set to timebound can result in miscalculation of start and end error analysis times (DMP timing out the I/O too soon)
Problem
When DMP iostat collection is disabled and the recovery option is set to "timebound", DMP can miscalculate how long an I/O took to complete and therefore time out the in-flight I/O too soon.
The issue applies only to Linux platforms and potentially to VMware environments. On all other platforms there is only one way to obtain the time, so the issue cannot occur there.
The problem occurs when all three of the following conditions are met:
1. The timebound recovery option is set on the affected enclosure.
2. The DMP iostat daemon is stopped.
3. An I/O error has been received from a lower layer.
If the SCSI layer takes longer than 300 seconds to fail the I/O, then DMP will not retry the I/O.
Because the DMP I/O threshold timeout is considered exhausted, the corresponding plex is detached and marked with the state "DETACHED IOFAIL".
If the surviving plex is also affected by a DMP I/O threshold timeout, the last remaining attached plex is not detached; the volume is detached instead.
When DMP iostat collection is stopped, the miscalculation leads DMP to conclude that the DMP I/O timeout was exhausted. The I/O is therefore not retried on other active paths and is failed.
Note: The volume is detached only when a klog write error on the volume is encountered; otherwise, I/O error messages continue to be reported in the syslog file.
Error Message
LINUX syslog sample message:
May 30 19:26:06 barney kernel: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300 secs) I/O with start 46130801458 and end 461309f92d8 time.
Time conversion:
End time minus start time (both values are hexadecimal and in microseconds):
0x461309f92d8 - 0x46130801458 = 0x1F7E80 = 2064000 microseconds
2064000 / 1000000 = approximately 2 seconds
In this instance, DMP timed out after only about 2 seconds instead of the configured 300 seconds, because of the miscalculation.
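The conversion above can be reproduced with shell arithmetic. The following is a minimal sketch, assuming (as stated above) that the start and end values in the message are hexadecimal microsecond counts. The variable names are illustrative only and are not part of any Veritas tool.

```shell
#!/bin/sh
# Sample message copied from the syslog excerpt in this article.
msg='May 30 19:26:06 barney kernel: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300 secs) I/O with start 46130801458 and end 461309f92d8 time.'

# Extract the hexadecimal start/end values and the reported threshold.
start_hex=$(printf '%s\n' "$msg" | sed -n 's/.* start \([0-9a-fA-F]*\) and end .*/\1/p')
end_hex=$(printf '%s\n' "$msg" | sed -n 's/.* and end \([0-9a-fA-F]*\) time.*/\1/p')
threshold_secs=$(printf '%s\n' "$msg" | sed -n 's/.*TimeOut (\([0-9]*\) secs).*/\1/p')

# Assumption: both values are microsecond counts, as described in the article.
elapsed_usec=$(( 0x$end_hex - 0x$start_hex ))
elapsed_secs=$(( elapsed_usec / 1000000 ))   # integer division, rounds down

echo "elapsed: ${elapsed_usec} usec (about ${elapsed_secs} s), threshold: ${threshold_secs} s"

# An elapsed time far below the threshold suggests the miscalculation.
if [ "$elapsed_secs" -lt "$threshold_secs" ]; then
    echo "I/O was timed out before the threshold was actually reached"
fi
```

To check a different message, replace the contents of msg with the line taken from the syslog file.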
Cause
LINUX:
When DMP iostat collection is disabled, the start and end times recorded for the DMP buffer are obtained from different sources.
Because the DMP function is then not comparing like-for-like values, it returns an unexpected result and the I/O is failed.
Solution
A supported hotfix has been made available for this issue for 6.x. Please contact Veritas Technical Support to obtain this fix.
1. Hotfix on top of 6.0.300.200
2. Hotfix on top of 6.0.5
3. Hotfix on top of 6.1.0.100
Note: 6.1.1 (MR1) for VxVM includes the fix.
Workarounds:
1. Change the recoveryoption to fixedretry instead of timebound, per enclosure:
# vxdmpadm setattr enclosure <enclosure-name> recoveryoption=fixedretry retrycount=3
The advantage of using fixedretry is that DMP no longer uses elapsed time to decide whether to fail an I/O, so this issue is avoided.
The disadvantage is that if an I/O takes a long time to complete at the lower layers, the upper layers might have to wait an indeterminate amount of time before the I/O is returned.
Or
2. If the timebound recovery option is preferred, enable DMP iostat collection:
# vxdmpadm iostat start
If DMP iostat collection is deliberately disabled at boot through the '/etc/rc.local' script on Linux, Veritas advises applying the hotfix.
Sample /etc/rc.local file content:
#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.
touch /var/lock/subsys/local
/sbin/vxdmpadm iostat stop
echo 0 > /proc/sys/kernel/hung_task_timeout_secs
vxdmpadm setattr enclosure emc_clariion0 recoveryoption=timebound iotimeout=300
vxdmpadm setattr enclosure emc_clariion1 recoveryoption=timebound iotimeout=300
vxdmpadm setattr enclosure emc0 recoveryoption=timebound iotimeout=300
vxdmpadm setattr enclosure emc1 recoveryoption=timebound iotimeout=300
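Where several enclosures are configured with timebound, as in the sample rc.local above, workaround 1 has to be applied to each enclosure in turn. The following is a minimal sketch of one way to do that. The enclosure names are taken from the sample file, while the DRY_RUN variable and the set_fixedretry function are hypothetical helpers introduced here for illustration. By default the script only prints the commands so that they can be reviewed first.

```shell
#!/bin/sh
# Enclosure names from the sample rc.local; adjust to match the system.
enclosures="emc_clariion0 emc_clariion1 emc0 emc1"

# DRY_RUN=1 (the default in this sketch) prints the commands without running them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: apply workaround 1 to a single enclosure.
set_fixedretry() {
    encl=$1
    cmd="vxdmpadm setattr enclosure $encl recoveryoption=fixedretry retrycount=3"
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$cmd"
    else
        $cmd
    fi
}

for encl in $enclosures; do
    set_fixedretry "$encl"
done
```

Running the script with DRY_RUN=0 executes the commands instead of printing them.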
Applies To
The issue is specific to Linux and VMware environments. The defect applies only to Veritas Volume Manager (VxVM) 6.x.x, not to 5.x.x.
VxVM 6.2 already has the fix included.
Linux Example:
# vxdmpadm getattr enclosure emc2
ENCLR_NAME ATTR_NAME                  DEFAULT        CURRENT
============================================================================
emc2       iopolicy                   MinimumQ       MinimumQ
emc2       partitionsize              512            512
emc2       use_all_paths              -              -
emc2       recoveryoption[throttle]   Nothrottle[0]  Nothrottle[0]
emc2       recoveryoption[errorretry] Timebound[300] Timebound[300]
emc2       redundancy                 0              0
emc2       dmp_lun_retry_timeout      0              0
emc2       failovermode
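To check whether an enclosure currently uses the timebound recovery option (condition 1 in the Problem section), the recoveryoption[errorretry] row of the output above can be filtered. The following is a minimal sketch that parses rows copied from the example. The errorretry_setting function is a hypothetical helper, and the sketch assumes the four-column layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read 'vxdmpadm getattr enclosure <name>' output on
# stdin and print the enclosure name with its CURRENT error-retry setting.
errorretry_setting() {
    awk '$2 == "recoveryoption[errorretry]" { print $1, $4 }'
}

# Rows copied from the example output in this article.
sample='emc2 iopolicy MinimumQ MinimumQ
emc2 recoveryoption[throttle] Nothrottle[0] Nothrottle[0]
emc2 recoveryoption[errorretry] Timebound[300] Timebound[300]'

result=$(printf '%s\n' "$sample" | errorretry_setting)
echo "$result"    # prints: emc2 Timebound[300]

case $result in
    *Timebound*) echo "timebound recovery is in effect" ;;
    *)           echo "timebound recovery is not in effect" ;;
esac
```

On a live system the same function could be fed directly, for example with: vxdmpadm getattr enclosure emc2 | errorretry_setting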