CVM Cluster join cannot be established unless the nodes are started in a specific order and sequence

Article: 100017152
Last Published: 2012-01-24
Product(s): InfoScale & Storage Foundation

Problem

A CVM cluster join cannot be established unless the nodes are started in a specific order. This happens when the nodes in a cluster do not see the same number of paths to a shared disk.


For example, node A sees 2 paths to every LUN in the data disk groups, whereas node B sees only 1 path to one of those LUNs (c2t0d6s2).

Node B (1 path):

# /usr/sbin/vxdmpadm getsubpaths dmpnodename=c2t0d6s2
NAME STATE PATH-TYPE[M] CTLR-NAME ENCLR-TYPE ENCLR-NAME ATTRS
================================================================================
c2t0d6s2 ENABLED - c2 HDS9960 HDS99600

Node A (2 paths):

# /usr/sbin/vxdmpadm getsubpaths dmpnodename=c2t0d6s2
NAME STATE PATH-TYPE[M] CTLR-NAME ENCLR-TYPE ENCLR-NAME ATTRS
================================================================================
c2t0d6s2 ENABLED - c2 HDS9960 HDS99600 -
c3t1d6s2 ENABLED - c3 HDS9960 HDS99600 -

In this example configuration, this is the only LUN whose path count differs between the two nodes.
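
To check this on your own nodes, you can capture the vxdmpadm getsubpaths output on each node and count the subpath rows. The Python sketch below assumes the column layout shown above (a header row, a line of '=' characters, then one row per subpath with STATE in the second column); the function name count_subpaths is hypothetical and is not part of any Veritas tool.

```python
from typing import Optional


def count_subpaths(output: str, state: Optional[str] = None) -> int:
    """Count subpath rows in captured `vxdmpadm getsubpaths` output.

    Assumes the layout shown in this article: a header row, a line of '='
    characters, then one row per subpath with STATE in the second column.
    If `state` is given (e.g. "ENABLED"), only rows in that state are counted.
    """
    lines = output.splitlines()
    # Locate the '=' separator; subpath rows start on the line after it.
    start = None
    for i, line in enumerate(lines):
        if line.strip().startswith("="):
            start = i + 1
            break
    if start is None:
        return 0  # unexpected format: no separator line found
    count = 0
    for line in lines[start:]:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated lines
        if state is None or fields[1] == state:
            count += 1
    return count


# Output captured on node B (1 path), copied from the example above.
node_b = """\
NAME STATE PATH-TYPE[M] CTLR-NAME ENCLR-TYPE ENCLR-NAME ATTRS
================================================================================
c2t0d6s2 ENABLED - c2 HDS9960 HDS99600
"""

# Output captured on node A (2 paths), copied from the example above.
node_a = """\
NAME STATE PATH-TYPE[M] CTLR-NAME ENCLR-TYPE ENCLR-NAME ATTRS
================================================================================
c2t0d6s2 ENABLED - c2 HDS9960 HDS99600 -
c3t1d6s2 ENABLED - c3 HDS9960 HDS99600 -
"""

print(count_subpaths(node_b), count_subpaths(node_a))  # 1 2
```

By default every listed subpath is counted, since a path in another state is still visible to the node; pass state="ENABLED" to count only enabled paths.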
 
This requirement is not explicitly documented, but it is implied for CVM/SFORAC environments: all nodes should have the same number of HBAs (2 in this case) for CVM and DMP to operate correctly, for example during path failover and failback.
 
The reason is as follows. If node A starts first, it becomes the master, and it sees 2 paths to the disk. When node B then tries to join, it is expected to see the same 2 paths, but it sees only 1. The join therefore fails, because the joining slave node sees fewer paths than the master.
 
However, if node B starts first, it becomes the master, and it sees only 1 path to the disk. When node A then joins, it sees every path that node B sees (which is fine), plus an extra path that only node A can see. The cluster can therefore be formed, because the master (node B) knows nothing about the additional path.
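
The ordering behavior described above can be captured in a small model. This is a hypothetical simplification of this article's explanation, not CVM's actual join protocol: the first node started becomes master, and a later node can join only if it sees every path the master sees. The function names are illustrative, and path names are used as plain labels; in practice OS device names need not be identical on every node.

```python
def join_succeeds(master_paths, joiner_paths):
    """Simplified rule from this article: a joiner must see every path the master sees."""
    return set(master_paths) <= set(joiner_paths)


def simulate_startup(start_order, paths_by_node):
    """Hypothetical model: the first node in start_order becomes master,
    and each remaining node attempts to join it."""
    master, *joiners = start_order
    return {node: join_succeeds(paths_by_node[master], paths_by_node[node])
            for node in joiners}


paths = {
    "A": {"c2t0d6s2", "c3t1d6s2"},  # node A sees 2 paths
    "B": {"c2t0d6s2"},              # node B sees 1 path
}

print(simulate_startup(["A", "B"], paths))  # {'B': False}: B lacks a path the master sees
print(simulate_startup(["B", "A"], paths))  # {'A': True}: A sees B's path plus an extra one
```

The model also shows why the start-order dependency disappears once both nodes see the same paths: join_succeeds is then true in either direction.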

Error Message

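The following sequence of events is logged in /var/adm/messages when this situation occurs. GAB ports v and w establish membership (membership 01), but vxconfigd then reports a connection timeout for the joining node (V-5-1-8756 "allow join for node 1 failed: Connection timed out"). A few minutes later, cluster_establish times out (V-5-1-4852), the membership of ports v and w falls back to node 0 alone, and the master re-establishes the cluster with no joiners, so the CVM join fails.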

Aug 15 03:21:16 H22RRMOBDB01 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port v gen  143b11b membership 01
Aug 15 03:21:27 H22RRMOBDB01 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port w gen  143b11d membership 01
Aug 15 03:21:27 H22RRMOBDB01 vxvm:vxconfigd: [ID 511694 daemon.error] V-5-1-8756 allow join for node 1 failed: Connection timed out
Aug 15 03:21:27 H22RRMOBDB01 vxvm:vxconfigd: [ID 448643 daemon.notice] V-5-1-3765 master: cluster join complete for node 1
Aug 15 03:21:27 H22RRMOBDB01 vxvm:vxconfigd: [ID 699813 daemon.notice] V-5-1-7899 CVM_VOLD_CHANGE command received
Aug 15 03:21:27 H22RRMOBDB01 vxvm:vxconfigd: [ID 322665 daemon.notice] V-5-1-7961 establishing cluster
Aug 15 03:21:27 H22RRMOBDB01 vxvm:vxconfigd: [ID 277465 daemon.notice] V-5-1-8062 master: not a cluster startup
Aug 15 03:24:42 H22RRMOBDB01 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port w gen  143b11e membership 0
Aug 15 03:24:42 H22RRMOBDB01 gab: [ID 674723 kern.notice] GAB INFO V-15-1-20038 Port w gen  143b11e k_jeopardy ;1
Aug 15 03:24:42 H22RRMOBDB01 gab: [ID 513393 kern.notice] GAB INFO V-15-1-20040 Port w gen  143b11e    visible ;1
Aug 15 03:24:42 H22RRMOBDB01 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port v gen  143b11c membership 0
Aug 15 03:24:42 H22RRMOBDB01 gab: [ID 674723 kern.notice] GAB INFO V-15-1-20038 Port v gen  143b11c k_jeopardy ;1
Aug 15 03:24:42 H22RRMOBDB01 gab: [ID 513393 kern.notice] GAB INFO V-15-1-20040 Port v gen  143b11c    visible ;1
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 699813 daemon.notice] V-5-1-7899 CVM_VOLD_CHANGE command received
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 778436 daemon.error] V-5-1-4109 -1 returned from volcvm_establish
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 886039 daemon.error] V-5-1-4852 cluster_establish: timed out
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 391371 daemon.error] V-5-1-11111 kernel_fail_join() : master_takeover is -1
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 565473 daemon.notice] V-5-1-9543 Timeout is not reset: another reconfig in progress
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 322665 daemon.notice] V-5-1-7961 establishing cluster
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 277465 daemon.notice] V-5-1-8062 master: not a cluster startup
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 451250 daemon.notice] V-5-1-8061 master: no joiners
Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 738708 daemon.notice] V-5-1-4123 cluster established successfully
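
When troubleshooting, it can help to pull only the relevant vxconfigd errors out of /var/adm/messages. The sketch below looks for the message IDs that appear in the excerpt above (V-5-1-8756, V-5-1-4109 and V-5-1-4852); treating exactly these IDs as join-failure indicators is an assumption, and find_join_failures is a hypothetical helper. On a real node you would pass it an open file, for example open("/var/adm/messages").

```python
import re

# Message IDs taken from the log excerpt in this article; treating them as
# join-failure indicators is an assumption for illustration.
JOIN_FAILURE_IDS = {"V-5-1-8756", "V-5-1-4109", "V-5-1-4852"}

_MSG_ID_RE = re.compile(r"\bV-\d+-\d+-\d+\b")


def find_join_failures(lines):
    """Return (message_id, message_text) for vxconfigd lines with a listed ID."""
    hits = []
    for line in lines:
        if "vxconfigd" not in line:
            continue  # ignore GAB and other non-vxconfigd messages
        match = _MSG_ID_RE.search(line)
        if match and match.group(0) in JOIN_FAILURE_IDS:
            hits.append((match.group(0), line[match.end():].strip()))
    return hits


sample = [
    "Aug 15 03:21:16 H22RRMOBDB01 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port v gen  143b11b membership 01",
    "Aug 15 03:21:27 H22RRMOBDB01 vxvm:vxconfigd: [ID 511694 daemon.error] V-5-1-8756 allow join for node 1 failed: Connection timed out",
    "Aug 15 03:21:27 H22RRMOBDB01 vxvm:vxconfigd: [ID 448643 daemon.notice] V-5-1-3765 master: cluster join complete for node 1",
    "Aug 15 03:24:42 H22RRMOBDB01 vxvm:vxconfigd: [ID 886039 daemon.error] V-5-1-4852 cluster_establish: timed out",
]

for msg_id, text in find_join_failures(sample):
    print(msg_id, text)
```

With the sample lines above, only the V-5-1-8756 and V-5-1-4852 entries are reported; the GAB line and the V-5-1-3765 notice are skipped.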

Solution

To address this issue, please make sure all the nodes see the same number of paths to all disks/LUNs in shared disk groups.
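
One way to apply this check across a cluster is to collect a per-LUN path count from every node (for example from vxdmpadm getsubpaths output) and flag any LUN whose count differs. The sketch below assumes each LUN is keyed by an identifier that is the same on every node, which OS device names are not guaranteed to be; path_count_mismatches and the node names are hypothetical.

```python
def path_count_mismatches(counts_by_node):
    """Return {lun: {node: count}} for every LUN whose path count differs between nodes.

    counts_by_node maps node name -> {LUN identifier: number of paths}.
    A LUN missing from a node's map is treated as 0 paths on that node.
    """
    luns = set()
    for counts in counts_by_node.values():
        luns.update(counts)
    mismatches = {}
    for lun in sorted(luns):
        per_node = {node: counts.get(lun, 0) for node, counts in counts_by_node.items()}
        if len(set(per_node.values())) > 1:
            mismatches[lun] = per_node
    return mismatches


counts = {
    "nodeA": {"c2t0d6s2": 2, "hypothetical_lun": 2},  # second LUN is illustrative only
    "nodeB": {"c2t0d6s2": 1, "hypothetical_lun": 2},
}

print(path_count_mismatches(counts))  # {'c2t0d6s2': {'nodeA': 2, 'nodeB': 1}}
```

An empty result means every LUN has the same path count on all nodes, which is the condition this solution asks for.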