EMC Clariion Disk Array in Asymmetric Logical Unit Access (ALUA) mode with Dynamic Multipathing (DMP) failover_policy of "global" can cause unnecessary I/O failure
Problem
EMC Clariion Disk Array in Asymmetric Logical Unit Access (ALUA) mode with Dynamic Multipathing (DMP) failover_policy of "global" can cause unnecessary I/O failure
The following describes an I/O failure caused by the "global" failover policy even though an alternate path is still available. The current failover policy can be checked with the following vxdmpadm command.
# vxdmpadm getattr enclosure emc_clariion0 failover_policy
ENCLR_NAME DEFAULT CURRENT
=============================================
emc_clariion0 Global Global
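When several enclosures have to be checked, the CURRENT column can be pulled out of this output with a small filter. The following is a minimal sketch, assuming the three-column layout shown above; the function name current_failover_policy is hypothetical and not part of VxVM.

```shell
# Print the CURRENT failover_policy for one enclosure.
# Reads `vxdmpadm getattr enclosure <name> failover_policy` output on stdin.
# Assumes the column order shown above: ENCLR_NAME DEFAULT CURRENT.
current_failover_policy() {
    awk -v encl="$1" '$1 == encl { print $3 }'
}

# Hypothetical usage on a live system:
#   vxdmpadm getattr enclosure emc_clariion0 failover_policy |
#       current_failover_policy emc_clariion0
```

The header and separator lines are skipped because their first field never equals the enclosure name.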
The following is a diagram of the connections between the host system I/O controllers (fscsi0 and fscsi1) and the disk array service processors (SPA and SPB). The two LUNs (LUN15 and LUN136) are balanced between the two service processors: SPA is the owner of LUN15 and SPB is the owner of LUN136.
------------------------ ------------------------- -----------------------
| Node 05 (CVM Master) | | Node 09 (CVM Slave) | | Node 10 (CVM Slave) |
------------------------ ------------------------- -----------------------
fscsi0 fscsi1 fscsi0 fscsi1 fscsi0 fscsi1
\ . / . / .
LUN15-Pri \ ................... . / . LUN15-Sec
LUN136-Sec \ / . . / . LUN136-Pri
\ /------/ . . / .
\ / . . / .
\ / . . / .
\ / /--------------------------------/ .
\ / / . . .
\ / / . . ...............
\ / / . . .
SPA (LUN15 Owner) SPB (LUN136 Owner)
-------------------------------------------------------
| | . . | |
| | .................... . | |
| | . . | |
| | . | |
| | . . | |
| | .................... . | |
| | . . | |
| LUN15 LUN136 |
| |
| Clariion Diskarray in ALUA Mode |
| |
-------------------------------------------------------
The systems have two controllers (fscsi0 and fscsi1) connecting to the Clariion Service Processors (SPA and SPB) respectively.
=====================================================================================================
Node 05 is the CVM master. Nodes 09 and 10 are CVM slaves. LUN15 has SPA as the default owner, while LUN136 has SPB as the default owner. Initially, CVM chose the default owner SP (Primary Path) as the ACTIVE path.
Node 05 (CVM Master)
fscsi0 ---> SPA (Primary Path) ---> LUN15 ACTIVE
fscsi1 --- SPB (Secondary Path) --- LUN15
fscsi0 --- SPA (Secondary Path) --- LUN136
fscsi1 ---> SPB (Primary Path) ---> LUN136 ACTIVE
Node 09 (CVM Slave)
fscsi0 ---> SPA (Primary Path) ---> LUN15 ACTIVE
fscsi1 --- SPB (Secondary Path) --- LUN15
fscsi0 --- SPA (Secondary Path) --- LUN136
fscsi1 ---> SPB (Primary Path) ---> LUN136 ACTIVE
Node 10 (CVM Slave)
fscsi0 ---> SPA (Primary Path) ---> LUN15 ACTIVE
fscsi1 --- SPB (Secondary Path) --- LUN15
fscsi0 --- SPA (Secondary Path) --- LUN136
fscsi1 ---> SPB (Primary Path) ---> LUN136 ACTIVE
Path Failure occurred on controller fscsi0 (connected to SPA) on Node 09 (CVM Slave)
===================================================================
Node 05 (CVM Master) decided to switch to SPB for LUN15 because Node 09 has problems accessing LUN15 using SPA.
Node 05 (CVM Master)
fscsi0 --- SPA (Primary Path) --- LUN15 <<< switched from SPA to SPB
fscsi1 ---> SPB (Secondary Path) ---> LUN15 ACTIVE
fscsi0 --- SPA (Secondary Path) --- LUN136
fscsi1 ---> SPB (Primary Path) ---> LUN136 ACTIVE
Logs from the DMP Event Log (/etc/vx/dmpevents.log).
Wed Feb 2 20:54:27.875: CURPRI set to secondary for Dmpnode emc_clariion0_15 without quiescing
Node 09 (the node with controller failure) caused the CVM to switch to use SPB for LUN15 globally because of the global failover_policy.
Node 09 (CVM Slave)
fscsi0 -X- SPA (Primary Path) -X- LUN15 DISABLED <<< controller failure on Node 09
fscsi1 ---> SPB (Secondary Path) ---> LUN15 ACTIVE
fscsi0 -X- SPA (Secondary Path) -X- LUN136
fscsi1 ---> SPB (Primary Path) ---> LUN136 ACTIVE
Wed Feb 2 20:55:27.807: Failover initiated for Dmpnode emc_clariion0_15 without quiescing
Wed Feb 2 20:55:29.025: CURPRI set to secondary for Dmpnode emc_clariion0_15 without quiescing
Node 10 also switched to use SPB because of the global failover_policy.
Node 10 (CVM Slave)
fscsi0 --- SPA (Primary Path) --- LUN15 <<< switched from SPA to SPB
fscsi1 ---> SPB (Secondary Path) ---> LUN15 ACTIVE
fscsi0 --- SPA (Secondary Path) --- LUN136
fscsi1 ---> SPB (Primary Path) ---> LUN136 ACTIVE
Wed Feb 2 20:55:24.834: CURPRI set to secondary for Dmpnode emc_clariion0_15 without quiescing
Path Failure occurred on controller fscsi1 (connected to SPB) on Node 10 (another CVM slave)
=========================================================================
Node 05 (CVM Master) decided to switch access to LUN15 back to SPA because Node 10 has problems accessing LUN15 using SPB.
Node 05 (CVM Master)
fscsi0 ---> SPA (Primary Path) ---> LUN15 ACTIVE
fscsi1 --- SPB (Secondary Path) --- LUN15 <<< switched from SPB to SPA
fscsi0 --- SPA (Secondary Path) --- LUN136
fscsi1 ---> SPB (Primary Path) ---> LUN136 ACTIVE
Wed Feb 2 20:58:50.260: CURPRI set to primary for Dmpnode emc_clariion0_136 without quiescing
Wed Feb 2 20:58:51.786: CURPRI set to primary for Dmpnode emc_clariion0_15 without quiescing
Node 09 now can't access LUN15 because the ACTIVE path is switched back to SPA which has failed on Node 09.
Node 09 (CVM Slave)
fscsi0 -X-> SPA (Primary Path) -X-> LUN15 DISABLED <<< No active path because controller failed
fscsi1 --- SPB (Secondary Path) --- LUN15 <<< No active path because CVM Master chose SPA
fscsi0 -X- SPA (Secondary Path) -X- LUN136
fscsi1 ---> SPB (Primary Path) ---> LUN136 ACTIVE
Wed Feb 2 20:58:54.454: CURPRI set to primary for Dmpnode emc_clariion0_136 without quiescing
Wed Feb 2 20:58:54.513: CURPRI set to NULL for Dmpnode emc_clariion0_15 without quiescing
Wed Feb 2 20:58:58.381: I/O error occured (errno=0x6) on Dmpnode emc_clariion0_15
Node 10 can't access LUN136 because the ACTIVE path failed on Node 10.
Node 10 (CVM Slave)
fscsi0 ---> SPA (Primary Path) ---> LUN15 ACTIVE
fscsi1 -X- SPB (Secondary Path) -X- LUN15
fscsi0 --- SPA (Secondary Path) --- LUN136 <<< No active path because CVM Master chose SPB
fscsi1 -X-> SPB (Primary Path) -X-> LUN136 DISABLED <<< No active path because controller failed on Node 10
Wed Feb 2 20:57:52.150: Failover initiated for Dmpnode emc_clariion0_136 without quiescing
Wed Feb 2 20:57:53.311: CURPRI set to NULL for Dmpnode emc_clariion0_136 without quiescing
Wed Feb 2 20:57:53.381: CURPRI set to primary for Dmpnode emc_clariion0_15 without quiescing
Wed Feb 2 20:57:57.181: I/O error occured (errno=0x6) on Dmpnode emc_clariion0_136
Error Message
Wed Feb 2 20:57:57.181: I/O error occured (errno=0x6) on Dmpnode emc_clariion0_136
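To find out quickly which DMP nodes were left without a usable path, the event log can be filtered for the two message forms quoted in this article. This is a rough sketch under that assumption; the function name dmp_lost_access is hypothetical, and other DMP releases may word these messages differently.

```shell
# List DMP nodes that lost every usable path, according to a DMP event log.
# Matches only the two message forms quoted in this article; the spelling
# "occured" is copied from the log text on purpose.
dmp_lost_access() {
    awk '/CURPRI set to NULL/ || /I\/O error occured/ {
             # The DMP node name is the word that follows "Dmpnode".
             for (i = 1; i < NF; i++)
                 if ($i == "Dmpnode") print $(i + 1)
         }' "$1" | sort -u
}

# Hypothetical usage:
#   dmp_lost_access /etc/vx/dmpevents.log
```

Each DMP node is printed once, even when it appears in both a CURPRI message and an I/O error message.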
Cause
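The problem is caused by the global DMP failover_policy. With this policy, the path chosen by the CVM master is applied on every node in the cluster, so a node that can reach a LUN only through the other service processor is left with no active path.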
Solution
For an ALUA disk array, this problem can be avoided by setting the DMP failover_policy to local.
# vxdmpadm set enclosure <enclosure name> failover_policy=local
If the DMP path failover policy is set to “local”, then each node in the cluster sets the Current Primary Path (CURPRI) based on path accessibility on that particular node.
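The difference between the two policies can be illustrated with a toy model of the scenario described above. This is a simplified sketch, not DMP's actual selection logic: the function choose_curpri and its arguments are hypothetical, and the "local" branch simply assumes a node takes the first service processor it can still reach.

```shell
# Toy model only, not DMP's real algorithm.
# choose_curpri POLICY MASTER_SP "HEALTHY_SPS"
#   POLICY       global or local
#   MASTER_SP    the SP the CVM master selected for the LUN (SPA or SPB)
#   HEALTHY_SPS  space-separated SPs this node can still reach, preferred first
choose_curpri() {
    policy=$1 master_sp=$2 healthy=$3
    case "$policy" in
        global)
            # The node follows the master's choice, reachable or not.
            case " $healthy " in
                *" $master_sp "*) echo "$master_sp" ;;
                *)                echo NULL ;;
            esac
            ;;
        local)
            # The node picks its own first reachable SP.
            set -- $healthy
            echo "${1:-NULL}"
            ;;
        *)
            echo "unknown policy: $policy" >&2
            return 1
            ;;
    esac
}

# Node 09 and LUN15 after the second failure:
# the master chose SPA, but Node 09 can reach only SPB.
choose_curpri global SPA "SPB"   # prints NULL
choose_curpri local  SPA "SPB"   # prints SPB
```

In this model the global case reproduces the "CURPRI set to NULL" outcome seen on Node 09, while the local case keeps I/O flowing through SPB.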
Applies To
The DMP attribute failover_policy is only used in a Cluster Volume Manager (CVM) environment. It has no effect on local (non-CVM shared) Veritas Volume Manager (VxVM) disks.