EMC Clariion Disk Array in Asymmetric Logical Unit Access (ALUA) mode with Dynamic Multipathing (DMP) failover_policy of "global" can cause unnecessary I/O failure

Article: 100004962
Last Published: 2011-12-16
Product(s): InfoScale & Storage Foundation

Problem

An EMC Clariion disk array in Asymmetric Logical Unit Access (ALUA) mode with a Dynamic Multipathing (DMP) failover_policy of "global" can cause unnecessary I/O failure.

The following describes an I/O failure that occurs under the "global" failover policy even though an alternate path is still available. The current failover policy can be checked with the following vxdmpadm command.

# vxdmpadm getattr enclosure emc_clariion0 failover_policy
ENCLR_NAME     DEFAULT        CURRENT
=============================================
emc_clariion0  Global         Global
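
When several enclosures need to be checked, the output above can be parsed by a small script. The following is a minimal sketch in Python. It assumes the three-column layout shown in this article (ENCLR_NAME, DEFAULT, CURRENT); the function names are hypothetical, and the sample text is copied from the output above rather than captured from a live system.

```python
# Sample text copied from the vxdmpadm getattr output shown in this article.
SAMPLE_OUTPUT = """\
ENCLR_NAME     DEFAULT        CURRENT
=============================================
emc_clariion0  Global         Global
"""


def parse_failover_policy(output):
    """Return {enclosure: (default_policy, current_policy)}, lower-cased.

    Assumes each data row has exactly three whitespace-separated fields.
    """
    policies = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, the ===== separator, and the header row.
        if len(fields) != 3 or fields[0] == "ENCLR_NAME":
            continue
        name, default, current = fields
        policies[name] = (default.lower(), current.lower())
    return policies


def enclosures_with_global_policy(output):
    """Return the sorted names of enclosures whose CURRENT policy is global."""
    parsed = parse_failover_policy(output)
    return sorted(name for name, (_, current) in parsed.items()
                  if current == "global")


if __name__ == "__main__":
    print(enclosures_with_global_policy(SAMPLE_OUTPUT))
```

On a real host the text would come from running the vxdmpadm command and capturing its standard output; that step is left out here so the sketch stays self-contained.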

 

The following is a diagram of the connections between the host system I/O controllers (fscsi0 and fscsi1) and the disk array service processors (SPA and SPB). The two LUNs (LUN15 and LUN136) are balanced between the two service processors. SPA is the owner of LUN15 and SPB is the owner of LUN136.

 ------------------------   -----------------------   -----------------------
 | Node 05 (CVM Master) |   | Node 09 (CVM Slave) |   | Node 10 (CVM Slave) |
 ------------------------   -----------------------   -----------------------
     fscsi0    fscsi1           fscsi0    fscsi1          fscsi0    fscsi1
       |         .                |         .               |         .
       |         .                |         .               |         .
  +----+---------.----------------+---------.---------------+         .
  |              .                          .                         .
  |              ......................................................
  |                                         .
  |  LUN15-Pri                              .  LUN15-Sec
  |  LUN136-Sec                             .  LUN136-Pri
  |                                         .
 SPA (LUN15 Owner)                      SPB (LUN136 Owner)
 ----------------------------------------------------------------
 |                                                              |
 |   LUN15                                LUN136                |
 |                                                              |
 |               Clariion Disk Array in ALUA Mode               |
 |                                                              |
 ----------------------------------------------------------------

Solid lines (| and -) are the fscsi0 paths to SPA. Dotted lines (.) are the fscsi1 paths to SPB.

Each system has two controllers (fscsi0 and fscsi1), connected to the Clariion Service Processors SPA and SPB respectively.
=====================================================================================================

Node 05 is the CVM master. Nodes 09 and 10 are CVM slaves. LUN15 has SPA as the default owner, while LUN136 has SPB as the default owner. Initially, CVM chose the path through the default owner SP (the Primary Path) as the ACTIVE path for each LUN.

Node 05 (CVM Master)
fscsi0 ---> SPA (Primary Path)   ---> LUN15      ACTIVE
fscsi1 ---  SPB (Secondary Path) ---  LUN15  
fscsi0 ---  SPA (Secondary Path) ---  LUN136
fscsi1 ---> SPB (Primary Path)   ---> LUN136     ACTIVE
 

 

Node 09 (CVM Slave)
fscsi0 ---> SPA (Primary Path)   ---> LUN15      ACTIVE
fscsi1 ---  SPB (Secondary Path) ---  LUN15
fscsi0 ---  SPA (Secondary Path) ---  LUN136
fscsi1 ---> SPB (Primary Path)   ---> LUN136     ACTIVE

 

Node 10 (CVM Slave)
fscsi0 ---> SPA (Primary Path)   ---> LUN15      ACTIVE
fscsi1 ---  SPB (Secondary Path) ---  LUN15
fscsi0 ---  SPA (Secondary Path) ---  LUN136
fscsi1 ---> SPB (Primary Path)   ---> LUN136     ACTIVE


Path Failure occurred on controller fscsi0 (connected to SPA) on Node 09 (CVM Slave)
===================================================================

Node 05 (CVM Master) switched to SPB for LUN15 because Node 09 had problems accessing LUN15 through SPA.

Node 05 (CVM Master)
fscsi0 ---  SPA (Primary Path)   ---  LUN15               <<< switched from SPA to SPB
fscsi1 ---> SPB (Secondary Path) ---> LUN15      ACTIVE
fscsi0 ---  SPA (Secondary Path) ---  LUN136
fscsi1 ---> SPB (Primary Path)   ---> LUN136     ACTIVE

Logs from the DMP event log (/etc/vx/dmpevents.log):

Wed Feb  2 20:54:27.875: CURPRI set to secondary for Dmpnode emc_clariion0_15 without quiescing

Node 09 (the node with the controller failure) caused CVM to switch LUN15 to SPB on all nodes because of the global failover_policy.

Node 09 (CVM Slave)
fscsi0 -X-  SPA (Primary Path)   -X-  LUN15      DISABLED    <<< controller failure on Node 09
fscsi1 ---> SPB (Secondary Path) ---> LUN15      ACTIVE
fscsi0 -X-  SPA (Secondary Path) -X-  LUN136    
fscsi1 ---> SPB (Primary Path)   ---> LUN136     ACTIVE

 

Wed Feb  2 20:55:27.807: Failover initiated for Dmpnode emc_clariion0_15 without quiescing
Wed Feb  2 20:55:29.025: CURPRI set to secondary for Dmpnode emc_clariion0_15 without quiescing

 

Node 10 also switched LUN15 to SPB because of the global failover_policy.

Node 10 (CVM Slave)
fscsi0 ---  SPA (Primary Path)   ---  LUN15             <<< switched from SPA to SPB
fscsi1 ---> SPB (Secondary Path) ---> LUN15     ACTIVE
fscsi0 ---  SPA (Secondary Path) ---  LUN136
fscsi1 ---> SPB (Primary Path)   ---> LUN136    ACTIVE

 

Wed Feb  2 20:55:24.834: CURPRI set to secondary for Dmpnode emc_clariion0_15 without quiescing

Path Failure occurred on controller fscsi1 (connected to SPB) on Node 10 (another CVM slave)
=========================================================================

Node 05 (CVM Master) switched access to LUN15 back to SPA because Node 10 had problems accessing LUN15 through SPB.

Node 05 (CVM Master)
fscsi0 ---> SPA (Primary Path)   ---> LUN15     ACTIVE
fscsi1 ---  SPB (Secondary Path) ---  LUN15            <<< switched from SPB to SPA
fscsi0 ---  SPA (Secondary Path) ---  LUN136
fscsi1 ---> SPB (Primary Path)   ---> LUN136    ACTIVE

 

Wed Feb  2 20:58:50.260: CURPRI set to primary for Dmpnode emc_clariion0_136 without quiescing
Wed Feb  2 20:58:51.786: CURPRI set to primary for Dmpnode emc_clariion0_15 without quiescing

 

Node 09 can no longer access LUN15 because the ACTIVE path was switched back to SPA, and the path to SPA (fscsi0) has failed on Node 09.

 

Node 09 (CVM Slave)
fscsi0 -X-> SPA (Primary Path)   -X-> LUN15     DISABLED  <<< No active path because controller failed
fscsi1 ---  SPB (Secondary Path) ---  LUN15               <<< No active path because CVM Master chose SPA
fscsi0 -X-  SPA (Secondary Path) -X-  LUN136
fscsi1 ---> SPB (Primary Path)   ---> LUN136    ACTIVE

 

Wed Feb  2 20:58:54.454: CURPRI set to primary for Dmpnode emc_clariion0_136 without quiescing
Wed Feb  2 20:58:54.513: CURPRI set to NULL for Dmpnode emc_clariion0_15 without quiescing
Wed Feb  2 20:58:58.381: I/O error occured (errno=0x6) on Dmpnode emc_clariion0_15

 

Node 10 cannot access LUN136 because the ACTIVE path through SPB (fscsi1) has failed on Node 10, while the CVM master keeps LUN136 on SPB.


Node 10 (CVM Slave)
fscsi0 ---> SPA (Primary Path)   ---> LUN15    ACTIVE
fscsi1 -X-  SPB (Secondary Path) -X-  LUN15                
fscsi0 ---  SPA (Secondary Path) ---  LUN136             <<< No active path because CVM Master chose SPB
fscsi1 -X-> SPB (Primary Path)   -X-> LUN136   DISABLED  <<< No active path because controller failed on Node 10

  

Wed Feb  2 20:57:52.150: Failover initiated for Dmpnode emc_clariion0_136 without quiescing
Wed Feb  2 20:57:53.311: CURPRI set to NULL for Dmpnode emc_clariion0_136 without quiescing
Wed Feb  2 20:57:53.381: CURPRI set to primary for Dmpnode emc_clariion0_15 without quiescing
Wed Feb  2 20:57:57.181: I/O error occured (errno=0x6) on Dmpnode emc_clariion0_136

 

Error Message

Wed Feb  2 20:57:57.181: I/O error occured (errno=0x6) on Dmpnode emc_clariion0_136
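
To spot this pattern in a longer event log, the two tell-tale messages can be matched together: "CURPRI set to NULL" for a Dmpnode, followed by an I/O error on the same Dmpnode. The following is a minimal sketch in Python, assuming the message wording shown in the excerpts in this article (including the log's own spelling "occured"). The function name and the sample text are illustrative only.

```python
import re

# Sample lines copied from the Node 10 log excerpt in this article.
SAMPLE_LOG = """\
Wed Feb  2 20:57:52.150: Failover initiated for Dmpnode emc_clariion0_136 without quiescing
Wed Feb  2 20:57:53.311: CURPRI set to NULL for Dmpnode emc_clariion0_136 without quiescing
Wed Feb  2 20:57:53.381: CURPRI set to primary for Dmpnode emc_clariion0_15 without quiescing
Wed Feb  2 20:57:57.181: I/O error occured (errno=0x6) on Dmpnode emc_clariion0_136
"""

# Patterns follow the wording seen in the excerpts; other releases may differ.
CURPRI_NULL = re.compile(r"CURPRI set to NULL for Dmpnode (\S+)")
IO_ERROR = re.compile(r"I/O error occured \(errno=(0x[0-9a-fA-F]+)\) on Dmpnode (\S+)")


def find_lost_dmpnodes(log_text):
    """Return (dmpnode, errno) pairs for I/O errors that follow CURPRI=NULL."""
    null_nodes = set()
    errors = []
    for line in log_text.splitlines():
        match = CURPRI_NULL.search(line)
        if match:
            null_nodes.add(match.group(1))
            continue
        match = IO_ERROR.search(line)
        if match and match.group(2) in null_nodes:
            errors.append((match.group(2), match.group(1)))
    return errors


if __name__ == "__main__":
    for dmpnode, errno in find_lost_dmpnodes(SAMPLE_LOG):
        print(dmpnode, errno)
```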

Cause

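The problem is caused by the global DMP failover_policy. With this policy, the service processor that the CVM master selects for a LUN is used by every node in the cluster, so a node whose only working path leads to the other service processor is left without an active path even though that alternate path is healthy.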
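
The scenario in this article can be reproduced with a small model. The following Python sketch is a simplified illustration, not DMP's actual algorithm: it assumes that under the global policy one service processor is selected for a LUN and every node must reach the LUN through it. The cabling map comes from the diagram in this article; the function names are hypothetical.

```python
# Which service processor each host controller is cabled to (per the diagram).
CONTROLLER_TO_SP = {"fscsi0": "SPA", "fscsi1": "SPB"}

# Controller failures described in this article.
FAILURES = {
    "Node05": set(),
    "Node09": {"fscsi0"},   # path to SPA lost
    "Node10": {"fscsi1"},   # path to SPB lost
}


def usable_sps(failed_controllers):
    """Return the set of SPs a node can still reach."""
    return {sp for ctrl, sp in CONTROLLER_TO_SP.items()
            if ctrl not in failed_controllers}


def access_under_global(chosen_sp, failures):
    """Return {node: True/False}: can the node reach the globally chosen SP?"""
    return {node: chosen_sp in usable_sps(failed)
            for node, failed in failures.items()}


if __name__ == "__main__":
    for sp in ("SPA", "SPB"):
        print(sp, access_under_global(sp, FAILURES))
```

In this model, whichever service processor is chosen, one slave node loses access: choosing SPA cuts off Node 09 and choosing SPB cuts off Node 10, although each of them still has one healthy path.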

Solution

For an ALUA disk array, this problem can be avoided by setting the DMP failover_policy to local.

# vxdmpadm setattr enclosure <enclosure name> failover_policy=local

If the DMP path failover policy is set to “local”, then each node in the cluster sets the Current Primary Path (CURPRI) based on path accessibility on that particular node.
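
The per-node selection can be sketched as follows. This Python model is an illustration under stated assumptions, not DMP's implementation: it assumes each node prefers the path through the LUN's owning service processor and otherwise falls back to any path that still works on that node. The maps come from this article's diagram; the function name is hypothetical.

```python
# Cabling and LUN ownership as described in this article.
CONTROLLER_TO_SP = {"fscsi0": "SPA", "fscsi1": "SPB"}
LUN_OWNER = {"LUN15": "SPA", "LUN136": "SPB"}


def local_curpri(lun, failed_controllers):
    """Return the (controller, sp) one node would use, or None if no path works."""
    usable = [(ctrl, sp) for ctrl, sp in CONTROLLER_TO_SP.items()
              if ctrl not in failed_controllers]
    # Assumption: prefer the path through the owning SP (sort key False first).
    usable.sort(key=lambda path: path[1] != LUN_OWNER[lun])
    return usable[0] if usable else None


if __name__ == "__main__":
    failures = {"Node05": set(), "Node09": {"fscsi0"}, "Node10": {"fscsi1"}}
    for node, failed in failures.items():
        for lun in LUN_OWNER:
            print(node, lun, local_curpri(lun, failed))
```

With the two controller failures from this article, every node in the model still has a path to both LUNs: Node 09 uses fscsi1 through SPB and Node 10 uses fscsi0 through SPA.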

 

Applies To

The DMP attribute failover_policy is only used in a Cluster Volume Manager (CVM) environment. It has no effect on local (non-CVM shared) Veritas Volume Manager (VxVM) disks.
