CVM node fails to join cluster after a reboot when using VMware virtual disks because the guest does not have the disk.EnableUUID parameter set

Article: 100012772
Last Published: 2023-11-10
Product(s): InfoScale & Storage Foundation

Problem

SFCFS (Storage Foundation Cluster File System) 6.0.x slave nodes may fail to join the CVM cluster after a reboot.

Error messages in the system log report "slave: missing disk <disk_id>" or "Cannot find disk on slave node", even though all shared disks are present and accessible on the slave node.

 

Error Message

Reported by the system log during cluster startup:

May 11 02:54:02 node1 vxvm:vxconfigd: V-5-1-7899 CVM_VOLD_CHANGE command received
May 11 02:54:07 node1 kernel: GAB INFO V-15-1-20036 Port w[GAB_USER_CLIENT (refcount 0)] gen   f3423b membership 01
May 11 02:54:07 node1 vxvm:vxconfigd: V-5-1-8222 slave: missing disk 1395546193.23.node1
---SNIP---
May 11 02:54:07 node1 vxvm:vxconfigd: V-5-1-7830 cannot find disk 1395546193.23.node1
May 11 02:54:07 node1 vxvm:vxconfigd: V-5-1-11092 cleanup_client: (Cannot find disk on slave node) 222
May 11 02:54:07 node1 vxvm:vxconfigd: V-5-1-11467 kernel_fail_join() : Reconfiguration interrupted: Reason is retry to add a node failed (13, 0)
May 11 02:54:07 node1 kernel: VxVM vxio V-5-0-164 Failed to join cluster testclus, aborting
---SNIP---
May 11 02:54:07 node1 kernel: GAB INFO V-15-1-20032 Port v closed
May 11 02:54:07 node1 vxvm:vxconfigd: V-5-1-7901 CVM_VOLD_STOP command received
May 11 02:54:07 node1 kernel: GAB INFO V-15-1-20032 Port w closed
May 11 02:54:18 node1 Had[5739]: VCS ERROR V-16-20006-1005 (node1) CVMCluster:cvm_clus:monitor:node - state: out of cluster reason: Cannot find disk on slave node: retry to add a node failed


 

With debug enabled on vxconfigd, the debug log contains these messages:

05/11 20:51:56:  VxVM vxconfigd DEBUG  V-5-1-26865 dup_priv_to_disk_attrs: setting INVALID udid for disk attributes - diskid = 1395546193.23.node1
05/11 20:51:57:  VxVM vxconfigd DEBUG  V-5-1-18475 priv_build_tagid: add tag udid_other
---SNIP---
05/11 20:51:57:  VxVM vxconfigd DEBUG  V-5-1-18518 vold_get_udid_type: Found LIBNAME otherdisks for disk sdd sdd
05/11 20:51:57:  VxVM vxconfigd DEBUG  V-5-1-18509 vold_get_type_udid_entry: Found OTHER_DISK match: 2 udid_other VMware%5FVirtual%20disk%5FOTHER%5FDISKS%5Fnode2%5F%2Fdev%2Fsdd
05/11 20:51:57:  VxVM vxconfigd DEBUG  V-5-1-26865 dup_priv_to_disk_attrs: setting INVALID udid for disk attributes - diskid = 1376023067.9.node2
05/11 20:51:57:  VxVM vxconfigd DEBUG  V-5-1-22899 vold_disk_check_udid_mismatch: disk sdd IGNORING UDID_MISMATCH
priv region UDID VMware%5FVirtual%20disk%5FOTHER%5FDISKS%5Fnode2%5F%2Fdev%2Fsdd
current ddl UDID VMware%5FVirtual%20disk%5FOTHER%5FDISKS%5Fnode1%5F%2Fdev%2Fsdd (type 2)
---SNIP---
05/11 20:51:57:  VxVM vxconfigd DEBUG  V-5-1-22968 da_find_slave_diskid: find diskid 1395546193.23.node1 flags 0x4
05/11 20:51:57:  VxVM vxconfigd WARNING  V-5-1-8222 slave: missing disk 1395546193.23.node1
05/11 20:51:57:  VxVM vxconfigd WARNING  V-5-1-7830 cannot find disk 1395546193.23.node1
05/11 20:51:57:  VxVM vxconfigd DEBUG  V-5-1-27084 setup_remote_disks failed. Aborting join.
05/11 20:51:57:  VxVM vxconfigd DEBUG  V-5-1-5963 slave_abort called reason is 222;
05/11 20:51:57:  VxVM vxconfigd ERROR  V-5-1-11092 cleanup_client: (Cannot find disk on slave node) 222
05/11 20:51:57:  VxVM vxconfigd ERROR  V-5-1-11467 kernel_fail_join() : Reconfiguration interrupted: Reason is retry to add a node failed (13, 0)
05/11 20:51:57:  VxVM vxconfigd DEBUG  V-5-1-681 IOCTL CLUSTER_FAIL_JOIN: return 0(0x0)
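
For reference, vxconfigd debug logging of this kind can be enabled with the vxdctl debug facility; the debug level and log path below are examples only:

# vxdctl debug 9 /var/log/vxconfigd_debug.log
  ( reproduce the join attempt, then disable debugging )
# vxdctl debug 0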

 

The "vxdisk list" output shows the UDID of the disk being different across cluster nodes (the UDID contains the host name):

( on CVM master node )

# vxdisk list sdg |egrep '^(disk|flags|udid):'
disk:      name=sdg id=1395546193.23.node1
flags:     online ready private autoconfig shared autoimport imported
udid:      VMware%5FVirtual%20disk%5FOTHER%5FDISKS%5Fnode2%5F%2Fdev%2Fsdg

( on CVM slave node )

# vxdisk list sdg |egrep '^(disk|flags|udid):'
disk:      name= id=1395546193.23.node1
flags:     online ready private autoconfig shared autoimport
udid:      VMware%5FVirtual%20disk%5FOTHER%5FDISKS%5Fnode1%5F%2Fdev%2Fsdg
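
To confirm which role a node currently holds before comparing outputs, the cluster mode can be queried on each node; the output shown is illustrative:

# vxdctl -c mode
mode: enabled: cluster active - MASTER
master: node1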

 

Cause
 

When VMDK disks are shared by multiple hosts, the ESX server appends the virtual server name to the disk serial number. This causes a UDID mismatch in the cluster environment, with each cluster node seeing different serial numbers.
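
The percent-encoded UDIDs in the output above show this directly once decoded. For example, decoding with Python's standard urllib (any URL decoder will do):

# python3 -c 'from urllib.parse import unquote; print(unquote("VMware%5FVirtual%20disk%5FOTHER%5FDISKS%5Fnode1%5F%2Fdev%2Fsdg"))'
VMware_Virtual disk_OTHER_DISKS_node1_/dev/sdg

The decoded string embeds both the node name (node1) and the local device path (/dev/sdg), so the UDID computed on one node can never match the UDID computed on another.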

 

Solution

A fix is included in either of these updates:

  • VRTSaslapm package version 6.0.100.100 or higher.
  • SFCFS, SFHA, or SFRAC (Storage Foundation Cluster File System, Storage Foundation High Availability, or Storage Foundation for Oracle RAC) version 6.0.5 or higher, on all cluster nodes that use VMware virtual disks as a storage subsystem.

The latest version of the VRTSaslapm package can be downloaded from Veritas SORT. Refer to the README file on the package details page for installation procedures:
https://sort.Veritas.com/asl/latest
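
To check which VRTSaslapm version is currently installed on a Linux node before deciding whether the update is needed:

# rpm -q VRTSaslapm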

 

In addition:

Enabling disk UUID on virtual machines

You must set the disk.EnableUUID parameter to "TRUE" for each VM. This step is necessary so that the VMDK always presents a consistent UUID to the VM, allowing the disk to be mounted properly. For each virtual machine node (VM) that will participate in the cluster, follow the steps below from the vSphere client:

 

To enable disk UUID on a virtual machine

  1. Power off the guest.

  2. Select the guest and select Edit Settings.

  3. Select the Options tab at the top.

  4. Select General under the Advanced section.

  5. Select Configuration Parameters... on the right-hand side.

  6. Check whether the disk.EnableUUID parameter is present; if it is, make sure it is set to TRUE.

    If the parameter is not present, select Add Row and add it.

  7. Power on the guest.
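
Alternatively, if the vSphere client is not available, the same parameter can be added directly to the guest's configuration (.vmx) file while the guest is powered off; this assumes direct access to the .vmx file on the datastore:

disk.EnableUUID = "TRUE"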

 

NOTE: 

If the disk.EnableUUID parameter is not set in the VMware Virtual Machine properties for the disks, then inquiry pages become unavailable for such devices.

As a result, all such devices are claimed under the other_disks category, and the udid_mismatch or clone_disk flags may be visible.
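
After the parameter is set and the guests are powered on, the claim can be re-checked from VxVM; reusing the sdg example from above (device names vary by environment):

# vxdisk scandisks
# vxdisk list sdg | egrep '^(flags|udid):'

The udid line should now report the same value on every cluster node, and the flags line should no longer show udid_mismatch.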

 

Applies To

This issue applies to cluster nodes under these conditions:

  • Nodes running Storage Foundation Cluster File System High Availability (SFCFSHA) for Linux (all 6.0.x releases prior to 6.0.5)
  • Nodes running Storage Foundation for Oracle RAC (SF Oracle RAC) for Linux (all 6.0.x releases prior to 6.0.5)
  • Nodes running on VMware virtual servers (guests)
  • Shared LUNs backed by VMware virtual disk (.vmdk) files
