DMP Product Defect: Disabling DMP iostat collection when the recoveryoption is set to timebound can result in miscalculation of start and end error analysis times (DMP timing out the I/O too soon)

Article: 100012921
Last Published: 2015-01-28
Product(s): InfoScale & Storage Foundation

Problem

When DMP iostat collection is disabled and the recoveryoption attribute is set to "timebound", DMP can miscalculate how long an I/O took to complete and therefore time out the in-flight I/O too soon.

The issue applies only to Linux platforms and potentially to VMware environments. On all other platforms there is only one way to obtain the time, so the issue cannot occur there.

 
The problem occurs when all three of the following conditions are met:

1.        The timebound recovery option is set on the impacted enclosure.

2.        The DMP iostat daemon is stopped.

3.        An I/O error has been received from the layer below.


If the SCSI layer takes longer than 300 seconds to fail the I/O, then DMP will not retry the I/O.

When the DMP I/O threshold timeout is exhausted, the corresponding plex is detached and marked with the state "DETACHED IOFAIL".
If the surviving plex is also hit by a DMP I/O threshold timeout, the last remaining attached plex is not detached; the volume is detached instead.

When DMP iostat collection is stopped, the calculation wrongly concludes that the DMP I/O timeout was exhausted. The I/O is therefore not retried on the other active paths and is failed.

Note: The volume is only detached when a klog write error on the volume is encountered; otherwise, I/O error messages will continue to be reported in the syslog file.

 

Error Message


LINUX syslog sample message:

May 30 19:26:06 barney  kernel: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300 secs) I/O with start 46130801458 and end 461309f92d8 time.

Time conversion (end time minus start time):

461309f92d8 - 46130801458 = 1F7E80 (hex) = 2064000 (decimal)

2064000 usec (microseconds) / 1000000 = 2.064, or approximately 2 seconds.


In this instance, DMP timed out after only about 2 seconds because of the miscalculation.
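The conversion can be reproduced with shell arithmetic. The following is a minimal sketch, not a Veritas tool: the function names dmp_elapsed_usec and dmp_line_elapsed_usec are hypothetical, and it assumes, as this article does, that the start and end values in the message are hexadecimal microsecond counts.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not part of VxVM or DMP.

# Print (end - start) in microseconds.
# Both arguments are hex strings without a "0x" prefix.
dmp_elapsed_usec() {
    start_hex=$1
    end_hex=$2
    echo $(( 0x$end_hex - 0x$start_hex ))
}

# Extract the start and end values from a
# "Reached DMP Threshold IO TimeOut" syslog line and print the elapsed
# microseconds. Returns 1 if the line does not contain both values.
dmp_line_elapsed_usec() {
    line=$1
    pattern='.*I\/O with start \([0-9a-fA-F]*\) and end \([0-9a-fA-F]*\) time.*'
    start_hex=$(printf '%s\n' "$line" | sed -n "s/$pattern/\1/p")
    end_hex=$(printf '%s\n' "$line" | sed -n "s/$pattern/\2/p")
    if [ -z "$start_hex" ] || [ -z "$end_hex" ]; then
        return 1
    fi
    dmp_elapsed_usec "$start_hex" "$end_hex"
}
```

For the sample message, "dmp_elapsed_usec 46130801458 461309f92d8" prints 2064000, far short of the 300000000 microseconds that a genuine 300-second timeout would represent.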

Cause


LINUX:

Because DMP iostat collection is disabled, the start and end times of the DMP buffer are obtained from different sources.
The DMP function is therefore not comparing like-for-like values, so an unexpected result is returned and the I/O is failed.

 

Solution

A supported hotfix has been made available for this issue for 6.x. Please contact Veritas Technical Support to obtain this fix.

1.        HF on top of 6.0.300.200

2.        HF on top of 6.0.5

3.        HF on top of 6.1.0.100


Note: 6.1.1 (MR1) for VxVM includes the fix.


Workarounds:

1.       Change the recoveryoption to fixedretry instead of timebound on each enclosure.

# vxdmpadm setattr enclosure <enclosure-name> recoveryoption=fixedretry retrycount=3


The advantage of fixedretry is that DMP no longer uses elapsed time to decide whether to fail an I/O, so this issue is avoided.
The disadvantage is that if the I/O takes a long time to complete at the lower layers, the upper layers may have to wait an indeterminate amount of time before the I/O is returned.

Or  

2.       If the timebound recovery option is preferred, enable DMP iostat collection.

# vxdmpadm iostat start

 


If DMP iostat collection is deliberately disabled via the ‘/etc/rc.local’ boot script on Linux, Veritas advises applying the hotfix.

Sample /etc/rc.local file content:
 

#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.

touch /var/lock/subsys/local
/sbin/vxdmpadm iostat stop
echo 0 > /proc/sys/kernel/hung_task_timeout_secs
vxdmpadm setattr enclosure emc_clariion0 recoveryoption=timebound iotimeout=300
vxdmpadm setattr enclosure emc_clariion1 recoveryoption=timebound iotimeout=300
vxdmpadm setattr enclosure emc0 recoveryoption=timebound iotimeout=300
vxdmpadm setattr enclosure emc1 recoveryoption=timebound iotimeout=300
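To spot this combination in an existing boot script, a simple text check may help. The sketch below is an assumption-laden illustration rather than a supported tool: the function name script_has_risky_combo is hypothetical, and it only matches the literal command forms shown in this article (it does not query the running DMP state).

```shell
#!/bin/sh
# Hypothetical check, not a Veritas tool: succeed (exit 0) when a boot
# script both stops DMP iostat collection and sets recoveryoption=timebound
# on non-comment lines. Returns 1 otherwise, or if the file is unreadable.
script_has_risky_combo() {
    file=$1
    # Drop comment lines so that commented-out commands are ignored.
    active=$(grep -v '^[[:space:]]*#' "$file") || return 1
    printf '%s\n' "$active" |
        grep -q 'vxdmpadm[[:space:]]\{1,\}iostat[[:space:]]\{1,\}stop' || return 1
    printf '%s\n' "$active" |
        grep -q 'recoveryoption=timebound' || return 1
    return 0
}
```

For example, "script_has_risky_combo /etc/rc.local && echo exposed" would flag a script like the sample above; a match only indicates that conditions 1 and 2 are configured at boot, not that an I/O error has occurred.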

Applies To

The issue is specific to Linux and VMware environments only. The defect is only applicable to Veritas Volume Manager (VxVM) 6.x.x and not 5.x.x.
VxVM 6.2 already has the fix included.

Linux Example:

# vxdmpadm getattr enclosure emc2
ENCLR_NAME      ATTR_NAME                     DEFAULT        CURRENT
============================================================================
emc2            iopolicy                      MinimumQ       MinimumQ
emc2            partitionsize                 512            512
emc2            use_all_paths                 -              -
emc2            recoveryoption[throttle]      Nothrottle[0]  Nothrottle[0]
emc2            recoveryoption[errorretry]    Timebound[300] Timebound[300]
emc2            redundancy                    0              0
emc2            dmp_lun_retry_timeout         0              0
emc2            failovermode                  -              -
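The recoveryoption[errorretry] row is the one that shows whether an enclosure uses timebound. As a rough sketch, the output can be filtered with awk; the function name timebound_enclosures is hypothetical, and the column positions are assumed from the sample output above (enclosure name in column 1, attribute in column 2, current value in column 4).

```shell
#!/bin/sh
# Hypothetical helper, not a Veritas tool: read the output of
# "vxdmpadm getattr enclosure <name>" on stdin and print each enclosure
# whose CURRENT errorretry setting starts with "Timebound".
# Exits 0 if at least one such enclosure was found, 1 otherwise.
timebound_enclosures() {
    awk '$2 == "recoveryoption[errorretry]" && $4 ~ /^Timebound/ {
             print $1
             found = 1
         }
         END { exit(found ? 0 : 1) }'
}
```

Usage would look like "vxdmpadm getattr enclosure emc2 | timebound_enclosures", which prints emc2 for the sample output shown above.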

 
