Problem
In this instance, CVM is failing to start on a single host because one or more disks carry a conflicting cluster ID, while the same disks are accessible and imported on the other nodes of the existing cluster.
In a CVM cluster, the vxdg -s import option imports a disk group as cluster-sharable.
When attempting to start the CVM service group on a node, it fails with "Disk in use by another cluster" while starting the cluster, or with "No valid disk found containing disk group: retry to add a node failed" if one or more nodes are rebooted from the running cluster.
The vxdg -s import operation is only valid if the CVM clustering components are active on the importing host.
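To confirm that the CVM clustering components are active on a host before attempting the import, the cluster mode can be checked, for example (the exact output wording varies by release, but an active node reports a cluster active MASTER or SLAVE state):
# vxdctl -c mode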
Ensure that all the disks in a shared disk group are physically accessible by all hosts. A host which cannot access all the disks in a shared disk group cannot join the cluster.
Disks in a shared disk group are stamped with the ID of the cluster and with the shared flag.
When a host joins the cluster, it automatically imports disk groups whose disks are stamped with the cluster ID.
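As an illustration, using the datadg disk group name seen later in this article, a shared import performed manually from the CVM master would look like:
# vxdg -s import datadg
In normal operation this is not required, as the disk group is auto-imported when the node joins the cluster.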
Error Message
When attempting to online the CVM service group, one of the following log messages can be seen:
reason: Disk in use by another cluster: retry to add a node failed
Or
Jul 1 17:32:09 VCS ERROR V-16-20006-1005 CVMCluster:cvm_clus:monitor:node - state: out of cluster#012reason: No valid disk found containing disk group: retry to add a node failed
Cause
When verifying the cluster ID for all shared disks, one of the nodes in the cluster reports a cluster ID mismatch.
In this instance, the correct cluster ID should be reflected as "fred". However, a subset of the shared disks report an incorrect cluster ID of "barney".
How to display the cluster ID for all shared disks
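NOTE: The loop below assumes the shell variable "disk" already holds the device names of the shared disks being checked. As an illustration only, it could be populated for the datadg disk group with something like the following (adjust the awk field number to match the GROUP column of the vxdisk list output on your system):
# disk=$(vxdisk list | awk '$4 == "datadg" {print $1}')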
# for i in $disk; do echo $i; vxdisk list $i | grep -i clusterid; done
3pardata0_60
clusterid: barney
3pardata0_61
clusterid: barney
3pardata0_62
clusterid: fred
3pardata0_63
clusterid: fred
The cluster ID mismatch needs to be corrected and aligned across all the nodes in the Veritas cluster.
The /etc/vx/diag.d/vxprivutil utility can be used to validate the cluster ID written in the disk's private region (on-disk).
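For example (the device path below is illustrative; substitute the raw DMP path of the disk being checked):
# /etc/vx/diag.d/vxprivutil list /dev/vx/rdmp/3pardata0_60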
diskid: 1478039917.53.charlie
group: name=datadg id=1478040527.89.charlie
flags: shared autoimport cds
hostid: barney <<<< should state fred
version: 3.1
iosize: 512
public: slice=3 offset=65792 len=503229520
private: slice=3 offset=256 len=65536
update: time=1574987544 seqno=0.75
ssb: actual_seqno=0.0
headers: 0 240
configs: count=1 len=51360
logs: count=1 len=4096
tocblks: 0
tocs: 16/65520
Defined regions:
config priv 000048-000239[000192]: copy=01 offset=000000 enabled
config priv 000256-051423[051168]: copy=01 offset=000192 enabled
log priv 051424-055519[004096]: copy=01 offset=000000 enabled
lockrgn priv 055520-055663[000144]: part=00 offset=000000
tagid priv 065488-065503[000016]: tag=udid_asl=3PARdata%5FVV%5FA536%5F003C0001A536
NOTE: In the above example, the incorrect cluster ID of "barney" is shown instead of the expected cluster ID of "fred".
Occasionally, the cluster ID reported by the VxVM kernel via "vxdisk list <disk-name>" may disagree with what is actually written on-disk.
NOTE: It is critical to check every host that might reference the conflicting cluster ID and to confirm that the shared disk group is not imported on any host with the conflicting cluster ID.
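For example, on each such host the following can be used to confirm that the disk group (datadg in this example) is not reported as imported there:
# vxdg list | grep datadg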
Solution
To ensure the VxVM disk group configuration structure held in the kernel is refreshed on-disk, run "vxdg flush <dg-name>" from the master node (after confirming the master node has the correct cluster ID stamped on all the disks).
The flush operation updates the on-disk content, correcting any mismatch between the kernel and the on-disk private region and thereby clearing the conflicting cluster ID.
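A minimal example, assuming the datadg disk group from the output above: first confirm which node is the current CVM master, then run the flush from that node.
# vxdctl -c mode
# vxdg flush datadg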
Following the flush operation, validate that the correct cluster ID is now shown by re-running the for loop and vxprivutil list commands above. For example:
diskid: 1478039917.53.charlie
group: name=datadg id=1478040527.89.charlie
flags: shared autoimport cds
hostid: fred <<<< the correct cluster ID is now shown
version: 3.1
iosize: 512
public: slice=3 offset=65792 len=503229520
private: slice=3 offset=256 len=65536
update: time=1574989544 seqno=0.76
ssb: actual_seqno=0.0
headers: 0 240
configs: count=1 len=51360
logs: count=1 len=4096
tocblks: 0
tocs: 16/65520
Defined regions:
config priv 000048-000239[000192]: copy=01 offset=000000 enabled
config priv 000256-051423[051168]: copy=01 offset=000192 enabled
log priv 051424-055519[004096]: copy=01 offset=000000 enabled
lockrgn priv 055520-055663[000144]: part=00 offset=000000
tagid priv 065488-065503[000016]: tag=udid_asl=3PARdata%5FVV%5FA536%5F003C0001A536