Solaris hosts that use the 'sd' SCSI driver for 5.1SP1 VxVM disks must increase the Enclosure Timebound iotimeout setting to a value greater than the default 300 seconds. When a Solaris host uses the sd driver, the total SCSI timeout (with the defaults) is 300 seconds [ sd_io_time x sd_retry_count ]. Because the DMP iotimeout expires at the same 300-second mark, the Enclosure Timebound timeout fails the entire device rather than letting the SCSI timeout fail only the affected device path. The SCSI variable sd_io_time is commonly set to at least 60 seconds in /etc/system as part of configuration requirements from array vendors (EMC and Hitachi, for example). This setting is most important for external, SAN-connected disks, which are the most likely to experience an unresponsive disk device due to a transport failure (for example, a SAN fabric failure). Internal disks (with only a single path to disk) are also affected (the device will be disabled), but there are no alternate paths to try, and SCSI would have disabled the device at the 300-second point anyway.
This tunable is set at the enclosure level. The change can be made online and is non-disruptive. The setting persists across reboots.
An affected host logs a message similar to the following when the DMP iotimeout threshold is reached:
NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 49/0x32
sd_io_time defaults to 60 seconds and sd_retry_count defaults to 5: 60 seconds x 5 retries = 300 seconds before a SCSI command reports failure to DMP. This means the Enclosure Timebound timeout is triggered before, or at the same moment as, the SCSI command timeout. The result is an unnecessary whole-device failure instead of the expected path failure.
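The arithmetic above can be sketched as a quick shell calculation. The values here are the documented defaults; verify the values actually in effect on a given host (for example, in /etc/system) before relying on them:

```shell
# Default Solaris sd driver values (assumed defaults; verify on your host)
sd_io_time=60       # seconds per I/O attempt
sd_retry_count=5    # retries before the command is failed

# Total time before the sd driver reports a failed command up to DMP
echo $(( sd_io_time * sd_retry_count ))   # 300 seconds
```

With both the SCSI stack and the DMP iotimeout at 300 seconds, DMP gives up on the whole device at the same moment the first path failure could have been reported.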
This value should be high enough to allow the underlying SCSI driver to fail a path to a device before the iotimeout expires. If the iotimeout expires before the path is failed (for example, an unresponsive device due to a SAN fabric failure), alternate paths are not tried and the DMP node (the whole device) is disabled. A value greater than 300 is required to ensure the other paths are tried before the device is disabled.
The total SCSI timeout defaults for the Solaris embedded 'ssd' driver are different: there is a built-in 20-second delay for port events and ssd_retry_count is 3, yielding a total unresponsive-device timeout of:
[ ssd_io_time (60) x ssd_retry_count (3) + 20-second FCP timer = 200 seconds total ]
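The ssd calculation can be sketched the same way (again, these are the defaults described above, not values read from a live host):

```shell
# Default Solaris ssd (Leadville) driver values (assumed defaults)
ssd_io_time=60      # seconds per I/O attempt
ssd_retry_count=3   # retries before the command is failed
fcp_timer=20        # built-in FCP port-event delay, seconds

# Total unresponsive-device timeout for the ssd driver
echo $(( ssd_io_time * ssd_retry_count + fcp_timer ))   # 200 seconds
```

Because 200 seconds is well under the default 300-second DMP iotimeout, ssd hosts fail the path before the Timebound timeout fires, which is why they are not affected by this issue.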
Symantec's recommendation is to increase the iotimeout value to at least 15 seconds longer than the sd driver timeout period (300 seconds). Take care not to exceed the timeout allowed by the applications running above DMP (for example, a database instance or other application using the DMP device). Use the command 'vxdmpadm getattr enclosure <enclosure_name>' to display the current settings. Timebound with a 300-second iotimeout is the default.
#vxdmpadm getattr enclosure emc0
ENCLR_NAME  ATTR_NAME                    DEFAULT      CURRENT
emc0        iopolicy                     MinimumQ     MinimumQ
emc0        partitionsize                512          512
emc0        use_all_paths                -            -
emc0        failover_policy              Global       Global
emc0        recoveryoption[throttle]     Nothrottle   Nothrottle
emc0        recoveryoption[errorretry]   Timebound    Timebound    <-- iotimeout at 300 seconds
emc0        redundancy                   0            0
emc0        dmp_lun_retry_timeout        0            0
emc0        failovermode                 -            -
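The recommended minimum iotimeout follows directly from the sd driver timeout plus the suggested margin (both values taken from the defaults discussed above):

```shell
# sd driver total SCSI timeout with default settings
scsi_timeout=300
# Recommended margin so SCSI can fail the path before DMP gives up on the device
margin=15

echo $(( scsi_timeout + margin ))   # 315 seconds
```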
For this example, we increase the value to 315 seconds:
#vxdmpadm setattr enclosure emc0 recoveryoption=timebound iotimeout=315
The change takes effect online immediately, is non-disruptive, and persists across reboots.
This issue applies to Solaris hosts using the 'sd' SCSI driver with 5.1SP1 and later for SAN disks. Solaris hosts that use the embedded Leadville driver 'ssd' are not affected. Internal drives (except Fibre Channel attached) that use the sd driver have only one path to the device, so a path failure is fatal in any case; both SCSI and the Enclosure Timebound iotimeout would return fatal errors at the default 300 seconds.