Veritas Dynamic Multi-Pathing (DMP) fails to fail over I/O during array activity when using EMC NDM and NDU
Problem
In rare cases during array maintenance (updates), Veritas Dynamic Multi-Pathing (DMP) may fail to fail over I/O to alternate active paths, resulting in unintended Veritas file system faults (outages).
Many external factors can contribute to this.
EMC has provided recommendations to help with such interoperability issues.
Error Message
Cause
NDM (Non-Disruptive Migration) overview
NDM is designed to help automate the process of migrating host applications to a PowerMax, VMAX All-Flash, or VMAX3 enterprise storage array with no downtime.
NDM leverages VMAX SRDF replication technologies to move the application data to the new storage array.
During this activity window, DMP may unintentionally fault a DMPNODE, resulting in an unwanted file system fault (outage).
EMC recommends amending the following DMP tunable parameters when performing EMC NDM migrations:
1. dmp_path_age 0
2. dmp_health_time 0
During the NDM activity, EMC does not want Veritas DMP to send any test inquiries to the volumes (LUNs) whose paths are marked as "dead" (disabled).
Setting the above DMP tunables to 0 allows DMP to skip such checks, potentially avoiding any unwanted disabling of DMPNODEs.
This solution has been tested and qualified by both Elab and Symmetrix engineering and is outlined in the following best practice document:
NOTE: At the time of writing this article, the details are outlined on Page 168.
| Parameter | Description |
|---|---|
| dmp_health_time | DMP detects intermittently failing paths, and prevents I/O requests from being sent on them. The value of dmp_health_time represents the time in seconds for which a path must stay healthy. The default value is 60 seconds. A value of 0 prevents DMP from detecting intermittently failing paths. |
| dmp_path_age | The time for which an intermittently failing path needs to be monitored as healthy before DMP again tries to schedule I/O requests on it. The default value is 300 seconds. A value of 0 prevents DMP from detecting intermittently failing paths. |
| dmp_restore_interval | The interval attribute specifies how often the path restoration thread examines the paths. Specify the time in seconds. The default value is 300. The value of this tunable can also be set using the vxdmpadm start restore command. |
Solution
The DMP parameter values can be modified using the following syntax:
# vxdmpadm settune dmp_tunable=value
# vxdmpadm gettune [dmp_tunable]
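Because the tunables need to be returned to their original values afterwards, it may help to record the current values before changing anything. The following is a minimal sketch, not a supported tool. It assumes that `vxdmpadm gettune <name>` prints a row whose first column is the tunable name and whose second column is the current value; verify this against the output on your release. The `VXDMPADM` variable, the function name, and the `name=value` file format are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Sketch: record the current value of each DMP tunable before changing it.
# ASSUMPTION: "vxdmpadm gettune <name>" prints a row of the form
#   <name>   <current value>   <default value>
# Check the real output on your system before relying on this parsing.

# Hypothetical override hook so the command can be replaced when testing.
VXDMPADM="${VXDMPADM:-vxdmpadm}"

save_dmp_tunables() {
    # $1 = output file; remaining arguments = tunable names
    out="$1"
    shift
    : > "$out"
    for name in "$@"; do
        # Pick the row whose first field is the tunable name and
        # take the second field as the current value.
        value=$("$VXDMPADM" gettune "$name" | awk -v n="$name" '$1 == n { print $2 }')
        if [ -z "$value" ]; then
            echo "could not read current value of $name" >&2
            return 1
        fi
        echo "$name=$value" >> "$out"
    done
}
```

For example, `save_dmp_tunables /var/tmp/dmp_tunables.before dmp_health_time dmp_path_age dmp_restore_interval` would write one `name=value` line per tunable to the named file (the path is only an example).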
These tunables should limit the inquiries executed on the paths while the migration or NDU activities are in progress.
# vxdmpadm settune dmp_health_time=0
# vxdmpadm settune dmp_path_age=0
# vxdmpadm settune dmp_restore_interval=10
The drawback is that if a path fails with an I/O error, DMP will not re-enable it automatically. The administrator must enable the path manually once the activity is complete.
The impact of dmp_restore_interval is limited, because it only controls how often the disabled paths are probed. Setting it to 10 only changes how frequently inquiries are triggered on a disabled path.
To limit the inquiries triggered on the path, the following command must be used:
# vxdmpadm stop restore
NOTE: Once the array activity window is complete, the DMP parameters should be set back to their original values.
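If the original values were noted in a file before the change, they can be reapplied with a short loop. This is a sketch under stated assumptions: the file holds one `name=value` line per tunable (a hypothetical format chosen for this example), and the `VXDMPADM` variable and function name are likewise illustrative. Only the `vxdmpadm settune` syntax comes from this article.

```shell
#!/bin/sh
# Sketch: reapply previously recorded DMP tunable values.
# ASSUMPTION: the input file contains lines such as
#   dmp_health_time=60
# This file format is hypothetical and not produced by vxdmpadm itself.

# Hypothetical override hook so the command can be replaced when testing.
VXDMPADM="${VXDMPADM:-vxdmpadm}"

restore_dmp_tunables() {
    # $1 = file of name=value lines
    while IFS='=' read -r name value; do
        # Skip blank lines.
        [ -n "$name" ] || continue
        # Stop at the first tunable that cannot be set.
        "$VXDMPADM" settune "$name=$value" < /dev/null || return 1
    done < "$1"
}
```

The sketch restores tunable values only. The restore daemon still has to be restarted separately with `vxdmpadm start restore`.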
NDU (Non-disruptive upgrade) Activity
A non-disruptive upgrade (NDU) is an update to software or hardware that does not interrupt access to data or system services. An NDU does not require the system to be rebooted when the upgrade process is completed.
Summary of commands prior to starting array activity:
# vxdmpadm settune dmp_health_time=0
# vxdmpadm settune dmp_path_age=0
# vxdmpadm settune dmp_restore_interval=10
# vxdmpadm stop restore
Summary of commands post array activity:
# vxdmpadm settune dmp_health_time=60
# vxdmpadm settune dmp_path_age=300
# vxdmpadm settune dmp_restore_interval=300
# vxdmpadm start restore
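The two command summaries above could be wrapped in a small helper so that each sequence stops at the first failing command. The following is an illustrative sketch only. The function name and the `VXDMPADM` variable are hypothetical, and the `post` values are the defaults listed in this article; if your systems used different values before the activity, substitute those.

```shell
#!/bin/sh
# Sketch: run the pre- or post-activity command sequence from this article.
# The function name and VXDMPADM override are hypothetical.

VXDMPADM="${VXDMPADM:-vxdmpadm}"

dmp_array_window() {
    case "$1" in
        pre)
            # Before array activity: relax path checks, then stop the restore daemon.
            "$VXDMPADM" settune dmp_health_time=0 &&
            "$VXDMPADM" settune dmp_path_age=0 &&
            "$VXDMPADM" settune dmp_restore_interval=10 &&
            "$VXDMPADM" stop restore
            ;;
        post)
            # After array activity: return to the documented defaults
            # (ASSUMPTION: the defaults were the original values).
            "$VXDMPADM" settune dmp_health_time=60 &&
            "$VXDMPADM" settune dmp_path_age=300 &&
            "$VXDMPADM" settune dmp_restore_interval=300 &&
            "$VXDMPADM" start restore
            ;;
        *)
            echo "usage: dmp_array_window pre|post" >&2
            return 2
            ;;
    esac
}
```

Calling `dmp_array_window pre` before the maintenance window and `dmp_array_window post` afterwards runs the same commands as the summaries, in the same order.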