How to reduce the time taken by Veritas Dynamic Multi-pathing (DMP) to react to disabled and enabled controller related events
Description
The following article outlines how to reduce the time taken by Veritas Dynamic Multi-pathing (DMP) to react to disabled and enabled controller related events.
Dynamic Multi-Pathing (DMP) maintains a kernel task that re-examines the condition of paths at a specified interval. The type of analysis that is performed on the paths depends on the checking policy that is configured.
The default dmp_restore_policy is check_disabled, and the default dmp_restore_interval is 300 seconds.
Oracle Solaris and DMP can be tuned to shorten the time needed to safely disable the paths associated with a disabled switch port, while ensuring that in-flight I/Os are processed correctly.
1.] With the check_all policy, the path restoration thread analyzes all paths in the system, revives the paths that are back online, and disables the paths that are inaccessible. The command to configure this policy is:
# vxdmpadm settune dmp_restore_policy=check_all
2.] The dmp_restore_interval tunable parameter specifies how often, in seconds, the path restoration thread examines the paths:
# vxdmpadm settune dmp_restore_interval=150
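The two settings can be applied and then read back in one step. The following is a minimal sketch rather than a Veritas-supplied script: the function name set_dmp_restore_tunables, the run wrapper, and the DRY_RUN switch are illustrative assumptions, while "vxdmpadm settune" and "vxdmpadm gettune" are the commands used to set and display DMP tunables. By default the sketch only prints the commands so that they can be reviewed first.

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it (requires root).
DRY_RUN="${DRY_RUN:-1}"

# Print or execute a command, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: $1 = restore policy, $2 = restore interval in seconds.
# Sets both tunables, then reads them back to confirm they were accepted.
set_dmp_restore_tunables() {
    run vxdmpadm settune dmp_restore_policy="$1" &&
    run vxdmpadm settune dmp_restore_interval="$2" &&
    run vxdmpadm gettune dmp_restore_policy &&
    run vxdmpadm gettune dmp_restore_interval
}
```

For example, "set_dmp_restore_tunables check_all 150" prints the four commands in dry-run mode; running it with DRY_RUN=0 as root would execute them.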
Where alternate controller paths are available and enabled, failed in-flight I/Os are retried down a different controller path.
Solaris (Oracle) Setting:
To reduce the time spent beneath DMP by the sd and ssd drivers, sd_io_time and ssd_io_time can be set to 0x1E (30 seconds).
Update the /etc/system file with the new values and reboot the server for the values to take effect:
* sd_io_time and ssd_io_time: 0x1E = 30 seconds
set sd:sd_io_time=0x1E
set ssd:ssd_io_time=0x1E
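Because /etc/system values are usually written in hexadecimal, it can help to confirm the conversion before editing the file. The sketch below uses two hypothetical helper names, hex_to_seconds and seconds_to_hex, and relies only on the standard printf utility. The mdb command in the final comment is a commonly used way to read the live kernel value on Solaris; it is shown as an assumption and is not executed here.

```shell
#!/bin/sh
# Convert a hexadecimal /etc/system value (for example 0x1E) to decimal seconds.
hex_to_seconds() {
    printf '%d\n' "$1"
}

# Convert decimal seconds to the hexadecimal form used in /etc/system.
seconds_to_hex() {
    printf '0x%X\n' "$1"
}

hex_to_seconds 0x1E    # prints 30
hex_to_seconds 0x3C    # prints 60
seconds_to_hex 30      # prints 0x1E

# After the reboot, the value in effect can typically be checked as root with:
#   echo "sd_io_time/D" | mdb -k
# (assumes the sd module is loaded; use ssd_io_time on hosts that use the ssd driver)
```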
With the above settings in place, users can expect DMP to be quicker in reacting to disabled and enabled events as follows:
Disable path:
17:47:06 - Switch Port Disabled
17:49:44 - Disabled all paths for enabled controller
Enable path:
18:03:03 - Switch Port Enabled
18:05:33 - DMP Enabled all paths for disabled controller
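The reaction times implied by the timestamps above can be worked out with a small helper. This is an illustrative sketch: elapsed_seconds is a hypothetical function name, and it assumes both timestamps fall on the same day. On Solaris, nawk may be needed in place of awk.

```shell
#!/bin/sh
# Seconds between two HH:MM:SS timestamps on the same day.
# $1 = start time, $2 = end time.
elapsed_seconds() {
    echo "$1 $2" | awk '{
        split($1, s, ":"); split($2, e, ":")
        t0 = s[1] * 3600 + s[2] * 60 + s[3]
        t1 = e[1] * 3600 + e[2] * 60 + e[3]
        print t1 - t0
    }'
}

echo "Disable: $(elapsed_seconds 17:47:06 17:49:44) seconds"   # prints "Disable: 158 seconds"
echo "Enable:  $(elapsed_seconds 18:03:03 18:05:33) seconds"   # prints "Enable:  150 seconds"
```

Both figures are close to the 150-second dmp_restore_interval configured above.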
Refer to article 100051997, "Why does it take Veritas Dynamic Multi-pathing (DMP) time to react to disabled and enabled controller related events?", for additional technical content.
ADDITIONAL TESTS
----------------------------------------------------------------------------------------------------------------------------------------------------------
A series of tests was also conducted to illustrate the difference in timings when different DMP tunables and Solaris settings were deployed:
## Test 1 - Attempt 1:
Settings:
- vxdmpadm setattr enclosure huawei-xsg10 recoveryoption=fixedretry retrycount=1
- set ssd:ssd_io_time=0x3C
- vxdmpadm settune dmp_restore_interval=300
Timings:
15:41:10 Switch port is disabled
15:41:16 Link offline message received from the OS
15:41:39 DMP path disable message received
15:42:23 Switch port is enabled
15:42:29 Link online message received from the OS
15:44:10 DMP path enable message for all paths
Outcome: Acceptable time
## Test 1 - Attempt 2:
15:44:47 Switch port disabled
15:44:53 Link offline message from OS
15:45:12 DMP path disable message received for SOME of the paths
15:47:25 DMP path disable message received for the remaining paths
15:47:51 Switch port enabled
15:47:57 Link online message from OS
15:50:00 DMP path enable message for all paths
Outcome: Acceptable time; however, there was a delay before all paths were disabled
----
## Test 2 - Attempt 1:
Settings:
- vxdmpadm setattr enclosure huawei-xsg10 recoveryoption=fixedretry retrycount=1
- vxdmpadm settune dmp_restore_interval=300
- set ssd:ssd_io_time=0x1E
NOTE: Because the value of "ssd_io_time" was changed, the server was restarted.
15:54:51 Switch port disabled
15:54:58 Link offline message from OS
After waiting 15 minutes, a rescan was forced, and only then did DMP disable the paths
16:11:47 Switch port enabled
16:11:53 Link online message from OS
16:12:34 DMP path enable message for all paths
Outcome: The paths were not disabled until a rescan was forced
## Test 2 - Attempt 2:
16:13:30 The switch port is disabled
16:13:36 Link offline message from OS
16:19:23 DMP disable message for all paths
16:20:09 The switch port is enabled
16:20:15 Link online message from OS
16:21:54 DMP enable message for all paths
Outcome: Approximately 6 min to detect disabled paths
----
## Test 3 - Attempt 1:
Settings:
- vxdmpadm setattr enclosure huawei-xsg10 recoveryoption=fixedretry retrycount=1
- set ssd:ssd_io_time=0x1E
- vxdmpadm settune dmp_restore_interval=150
16:23:12 The switch port is disabled
16:23:18 Link offline message from OS
16:24:25 DMP disable message for the paths
16:25:05 The switch port is enabled
16:25:11 Link online message from OS
16:26:56 DMP enable message for all paths
Outcome: Quick response times
## Test 3 - Attempt 2:
16:27:20 The switch port is disabled
16:27:26 Link offline message from OS
16:30:11 DMP disable message for some paths
16:32:41 DMP disable message for the remaining paths
16:33:07 The switch port is enabled
16:33:13 Link online message from OS
16:35:12 DMP enable message for all paths
Outcome: More than 5 min to detect all disabled paths
----
## Test 4 - Attempt 1:
Settings:
- vxdmpadm setattr enclosure huawei-xsg10 recoveryoption=timebound iotimeout=300
- set ssd:ssd_io_time=0x1E
- vxdmpadm settune dmp_restore_interval=150
18:42:43 The switch port is disabled
18:42:49 Link offline message from OS
18:43:52 DMP disable message for some paths
After waiting for some time with no change, "vxdisk scandisks" was run to detect the remaining disabled paths.
Outcome: More than 5 min to detect all disabled paths
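To compare the attempts side by side, the timestamps recorded above can be reduced to two durations per attempt: the time from the switch port being disabled until the last DMP disable message, and the time from the port being enabled until the DMP enable message. The sketch below is illustrative only. The function name summarize_attempts and the attempt labels are hypothetical, Test 2 - Attempt 1 and Test 4 are omitted because DMP did not disable all paths without a manual rescan, and the Test 3 - Attempt 1 disable message is assumed to cover all paths. On Solaris, nawk may be needed in place of awk.

```shell
#!/bin/sh
# Input lines:  label port_disabled all_paths_disabled port_enabled all_paths_enabled
# Output lines: label disable_seconds enable_seconds
summarize_attempts() {
    awk '
        # Convert HH:MM:SS to seconds since midnight.
        function secs(t,    p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        NF == 5 { print $1, secs($3) - secs($2), secs($5) - secs($4) }
    '
}

summarize_attempts <<'EOF'
test1-attempt1 15:41:10 15:41:39 15:42:23 15:44:10
test1-attempt2 15:44:47 15:47:25 15:47:51 15:50:00
test2-attempt2 16:13:30 16:19:23 16:20:09 16:21:54
test3-attempt1 16:23:12 16:24:25 16:25:05 16:26:56
test3-attempt2 16:27:20 16:32:41 16:33:07 16:35:12
EOF
```

With these inputs, the time to disable all paths ranges from 29 seconds (Test 1 - Attempt 1) to 353 seconds (Test 2 - Attempt 2), while the time to re-enable the paths stays between 105 and 129 seconds.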
Recommendations:
In light of the above inconsistent test results, Veritas recommends the following tunable values so that all path states are examined and the interval between examinations is reduced.
To ensure that the path restoration thread analyzes all paths in the system, revives the paths that are back online, and disables the paths that are inaccessible, set the dmp_restore_policy to "check_all".
Use:
# vxdmpadm settune dmp_restore_policy=check_all
# vxdmpadm settune dmp_restore_interval=150
The dmp_restore_interval tunable specifies how often the path restoration thread examines the paths presented to the host. If the system has more than 500 LUNs with multiple paths, the "dmp_restore_interval" setting may need to be increased to a higher value (either above 150 or back to the default value of 300).
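The guidance on LUN count can be expressed as a simple starting-point rule. This is a rough sketch and not a Veritas sizing formula: suggest_restore_interval is a hypothetical helper, the LUN count is supplied by the administrator, and the result should still be validated against observed behavior on the host.

```shell
#!/bin/sh
# Hypothetical helper: $1 = number of multipathed LUNs on the host.
# Prints a starting value for dmp_restore_interval, in seconds.
suggest_restore_interval() {
    if [ "$1" -gt 500 ]; then
        echo 300    # default interval, for hosts with more than 500 LUNs
    else
        echo 150    # reduced interval recommended in this article
    fi
}

suggest_restore_interval 200    # prints 150
suggest_restore_interval 800    # prints 300
```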