InfoScale and Storage Foundation for Windows (SFW) log the following error: "Reservation refresh has been suspended for cluster disk group"
Problem
Servers that are part of a Windows Failover Cluster (WFC) or Microsoft Cluster Server (MSCS) log an error in the System Event Log indicating that reservation refresh has been suspended. The volumes contained in the affected disk group(s) also become inaccessible, which causes application errors due to volume access failures and possibly resource faults in the cluster. This issue is seen in the following versions:
- SFW 6.0 and 6.1
- InfoScale 7.0, 7.1, 7.2, and 7.3
Note: Starting with InfoScale 7.4, this functionality is disabled by default, so this issue is not expected to occur on those versions.
Error Message
Event ID: 52
Source: VXIO
Description: Cluster software communication timeout. Reservation refresh has been suspended for cluster disk group "DGGUID"
V-40-49157-52
vxio: Cluster software communication timeout. Reservation refresh has been suspended for cluster disk group "DGGUID"
Cause
When a VMDg (Volume Manager Diskgroup) resource is online, the SFW/InfoScale software (Volume Manager) expects to receive periodic isAlive requests from the cluster. The process is as follows:
- RHS (the cluster's Resource Hosting Subsystem) sends an isAlive request to vxres.dll (the DLL for the VMDg resource type registered with the cluster). By default, each VMDg resource should receive an isAlive request from the cluster's RHS process every 60 seconds.
- vxres.dll communicates directly with Volume Manager to verify that the Diskgroup and Volumes are online and accessible.
- Volume Manager returns the state to vxres.dll, which then informs RHS of the status (Success or Failure).
By default, if Volume Manager does not get an isAlive request for 10 cycles (60 seconds x 10 cycles = 10 minutes), then it assumes this cluster node may be in a hung state, so it suspends its reservation refresh to allow another node in the cluster to perform a takeover if necessary.
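The default timeout arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the 60 seconds x 10 cycles calculation, not the vxio implementation; the function names are hypothetical, and whether the product suspends at exactly the limit or just after it is an assumption.

```python
# Defaults taken from the article: one isAlive request every 60 seconds,
# and a limit of 10 missed cycles before reservation refresh is suspended.
ISALIVE_INTERVAL_SECONDS = 60
MAX_MISSED_CYCLES = 10


def suspension_timeout_seconds(interval_s=ISALIVE_INTERVAL_SECONDS,
                               max_cycles=MAX_MISSED_CYCLES):
    """Return how long Volume Manager waits without an isAlive request."""
    return interval_s * max_cycles


def reservation_suspended(seconds_since_last_isalive,
                          interval_s=ISALIVE_INTERVAL_SECONDS,
                          max_cycles=MAX_MISSED_CYCLES):
    """Hypothetical check: True once the silent period reaches the timeout.

    Assumption: suspension happens when the limit is reached (>=).
    """
    timeout = suspension_timeout_seconds(interval_s, max_cycles)
    return seconds_since_last_isalive >= timeout


if __name__ == "__main__":
    timeout = suspension_timeout_seconds()
    print(f"Default timeout: {timeout} seconds ({timeout // 60} minutes)")
    print("Suspended after 5 minutes of silence:", reservation_suspended(300))
    print("Suspended after 10 minutes of silence:", reservation_suspended(600))
```

With the defaults, the timeout works out to 600 seconds, which matches the 10 minutes stated above.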
Solution
Microsoft:
There have been several recent issues in which RHS did not perform isAlive requests in the cluster. These caused this issue to be seen frequently and were the main reason this functionality is disabled going forward. According to Microsoft, this has been fixed and confirmed in the following:
KB4075212: Microsoft Engineering informed us that the RHS fix was included in this KB; however, it is undocumented. This fix should be included in cumulative KBs from this point on.
KB4013429: A customer informed us that Microsoft Support confirmed the fix was included in this KB, but also recommended patching up to KB4462917. Again, all future cumulative updates from Microsoft should include these RHS patches.
Even if the RHS fixes from Microsoft are put in place, it is strongly recommended that the workaround or fixes (depending on the SFW/InfoScale software version) be implemented. This reservation suspension functionality was well intentioned, but has been shown to cause more harm than good.
VERITAS:
SFW 6.0/6.1: Implement the workaround provided in the 'Workaround' section below. This will set the isAlive cycle value to an extremely large number, effectively disabling the feature.
InfoScale 7.1: Contact Support and request the available fix that will disable this functionality.
InfoScale 7.2: Contact Support and request the available fix that will disable this functionality.
InfoScale 7.3: Upgrade to 7.4 (or above) or implement the workaround provided in the 'Workaround' section below.
InfoScale 7.4+: No action is needed as this functionality is disabled by default.
Workaround
1. Open the Registry Editor (Start > Run > regedit)
2. Browse to HKEY_LOCAL_MACHINE\SOFTWARE\VERITAS\VxSvc\CurrentVersion\VolumeManager
3. Create the following value:
- Name: MSCSMaxResLoopsB4Timeout
- Value Type: REG_DWORD
- Value Data: Set to 9999999 (decimal)
4. Open a command prompt and restart the vxsvc service:
- net stop vxsvc
- net start vxsvc
Note: Restarting vxsvc may cause VMDg resources to fault when performed on an Active node. It is recommended that this step be performed on Passive nodes only, or during an outage window. However, the registry change can be made in advance; it then takes effect the next time the vxsvc service is restarted (or the server is rebooted).
5. Perform the same steps on all additional cluster nodes.
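When the same value has to be created on several nodes, it may be convenient to prepare it once as a .reg file and import it on each node. The Python sketch below only renders the file text from the key path, name, and data given in the steps above; the helper names are hypothetical, and the year estimate assumes the default 60-second isAlive interval. It also shows why 9999999 cycles effectively disables the feature.

```python
# Values taken from the workaround steps in this article.
KEY_PATH = r"HKEY_LOCAL_MACHINE\SOFTWARE\VERITAS\VxSvc\CurrentVersion\VolumeManager"
VALUE_NAME = "MSCSMaxResLoopsB4Timeout"
VALUE_DATA = 9999999  # decimal


def render_reg_file(key_path, name, data):
    """Render a REG_DWORD value as .reg file text.

    .reg files store DWORD data as 8 hexadecimal digits, so the decimal
    value from the article is converted here.
    """
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        f'"{name}"=dword:{data:08x}',
        "",
    ]
    return "\r\n".join(lines)


def effective_timeout_years(cycles, interval_seconds=60):
    """Approximate timeout in years, assuming the default 60-second interval."""
    seconds_per_year = 365 * 24 * 3600
    return cycles * interval_seconds / seconds_per_year


if __name__ == "__main__":
    print(render_reg_file(KEY_PATH, VALUE_NAME, VALUE_DATA))
    years = effective_timeout_years(VALUE_DATA)
    print(f"Effective timeout: about {years:.0f} years")
```

The rendered text can be saved to a file with a .reg extension and imported on each node. The restart of vxsvc described in step 4 is still required for the change to take effect.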
Re-enabling the feature:
If the private fixes are applied to 7.1 or 7.2, or the server is running 7.4 or above, the functionality is disabled by default. However, if this functionality is wanted, it can be enabled or disabled with the vxtune command.
To enable, open a command prompt (Run as Administrator) and run the following command:
vxtune clustercommunication_timeout True
To disable (which is the default setting):
vxtune clustercommunication_timeout False
This updates the following registry value automatically:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vxio\VVRParams
"CLUSTER_COMM_TIMEOUT"
0 = Disabled
1 = Enabled
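To confirm the current setting on a node without opening the Registry Editor, the value can be read with a short script. The following Python sketch uses the standard winreg module (available on Windows only) and the key and value names listed above; the function names are hypothetical, and it assumes the value is stored as an integer (REG_DWORD), which this article does not state explicitly.

```python
# Key and value names taken from this article.
KEY_PATH = r"SYSTEM\CurrentControlSet\Services\vxio\VVRParams"
VALUE_NAME = "CLUSTER_COMM_TIMEOUT"


def describe_setting(raw_value):
    """Map the registry data to the states listed in the article."""
    states = {0: "Disabled", 1: "Enabled"}
    return states.get(raw_value, f"Unexpected value: {raw_value!r}")


def read_setting():
    """Read the value from the local registry (Windows only).

    Raises FileNotFoundError if the key or value does not exist.
    """
    import winreg  # imported here so the module still loads on other platforms

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        raw_value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    return describe_setting(raw_value)


if __name__ == "__main__":
    print(f"{VALUE_NAME}: {read_setting()}")
```

The script only reads the value. Changes should still be made with vxtune as described above, so that the registry is updated by the product itself.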
Further reading:
Further information about troubleshooting messages that are reported by vxio can be found in the article that is linked in the "Related Documents" section.
Applies To
Microsoft Cluster Server and Windows Failover Cluster