InfoScale and Storage Foundation for Windows (SFW) log the following error: "Reservation refresh has been suspended for cluster disk group"

Article: 100018047
Last Published: 2019-04-23
Product(s): InfoScale & Storage Foundation

Problem

Servers that are part of a Windows Failover Cluster (WFC) or Microsoft Cluster Server (MSCS) log an error in the System Event Log indicating a reservation suspension. In addition, the volumes contained in the affected Diskgroup(s) are no longer accessible, resulting in application errors due to volume access failures, and possibly resource faults in the cluster. This issue is seen in the following versions:

- SFW 6.0 and 6.1
- InfoScale 7.0, 7.1, and 7.3

Note: Starting with InfoScale 7.4, this functionality is disabled by default and this issue should no longer be seen going forward.
 

Error Message

Event ID: 52
Source: VXIO
Description: Cluster software communication timeout. Reservation refresh has been suspended for cluster disk group "DGGUID"

V-40-49157-52
vxio: Cluster software communication timeout. Reservation refresh has been suspended for cluster disk group "DGGUID"
 

Cause

When a VMDg (Volume Manager Diskgroup) resource is online, the SFW/InfoScale software (Volume Manager) expects to receive an isAlive request from the cluster. The process is as follows:

- RHS sends an isAlive request to vxres.dll (VMDg resource Type registered with the cluster). By default, each VMDg resource should receive an isAlive request from the cluster's RHS process every 60 seconds. 

- vxres.dll communicates directly with Volume Manager to verify the Diskgroup and Volumes are online and accessible.

- Volume Manager returns the state to vxres.dll, which then informs RHS of the status (Success or Failure).

By default, if Volume Manager does not get an isAlive request for 10 cycles (60 seconds x 10 cycles = 10 minutes), then it assumes this cluster node may be in a hung state, so it suspends its reservation refresh to allow another node in the cluster to perform a takeover if necessary.
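
The timeout logic described above can be modeled with a short Python sketch. This is an illustration only and not Veritas code: the class, function, and constant names are hypothetical, and the assumption that a suspension persists once triggered is made purely for the model.

```python
# Illustrative model only; this is NOT Veritas code. All names are hypothetical.
ISALIVE_INTERVAL_SECONDS = 60   # default isAlive interval described in the article
DEFAULT_MAX_MISSED_CYCLES = 10  # default cycle limit described in the article


def timeout_seconds(max_cycles=DEFAULT_MAX_MISSED_CYCLES,
                    interval=ISALIVE_INTERVAL_SECONDS):
    """Time without an isAlive request before the refresh is suspended."""
    return max_cycles * interval


class ReservationWatchdog:
    """Counts consecutive cycles that pass without an isAlive request."""

    def __init__(self, max_cycles=DEFAULT_MAX_MISSED_CYCLES):
        self.max_cycles = max_cycles
        self.missed = 0
        self.suspended = False

    def cycle(self, isalive_received):
        """Advance one cycle; return True if the refresh is suspended."""
        if self.suspended:
            # Assumption for this model: a suspension is not lifted automatically.
            return True
        if isalive_received:
            self.missed = 0  # any isAlive request resets the counter
        else:
            self.missed += 1
            if self.missed >= self.max_cycles:
                self.suspended = True
        return self.suspended


if __name__ == "__main__":
    print(timeout_seconds())         # 600 seconds = 10 minutes
    print(timeout_seconds(9999999))  # 599999940 seconds, roughly 19 years
```

With the defaults the model suspends after 600 seconds without an isAlive request. With a cycle limit of 9999999 the same arithmetic gives about 19 years, which is why such a value effectively disables the feature.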

Prior to InfoScale 7.4, suspending the reservation results in all volumes in the affected Diskgroup(s) being taken offline (volume removal). This behavior was implemented after SFW 5.1 due to concerns that file system corruption could result because SFW was no longer actively monitoring the reservation thread, which in internal testing allowed 2 nodes to have access to the disks for a very short period of time.
 
Due to this change, this issue now causes a true outage. In addition, because the RHS process is not actively monitoring any of the resources in the cluster (due to issues with RHS), the cluster may not show the outage. It is therefore possible that the cluster will not take action for quite some time, resulting in an excessive outage. In these scenarios, all resources show online in the cluster even though many of them may actually be offline (e.g. VMDg resources are offline and the SQL Server service has terminated due to loss of access to disk, yet all show online in the cluster).



Solution


Microsoft:
There have been several recent issues with RHS not performing isAlive requests in the cluster. These resulted in this issue being seen frequently and were the main driving factor in our decision to disable this functionality going forward. According to Microsoft, this has been fixed and confirmed in the following:

KB4075212: Informed by Microsoft Engineering that the RHS fix was included in this KB; however, it is undocumented. This fix should be included in Cumulative KBs from this point.

KB4013429: Informed by a customer that Microsoft Support confirmed the fix was included in this KB, but also recommended they patch up to KB4462917. Again, all future Cumulative fixes from Microsoft should include these RHS patches.

Even if the RHS fixes from Microsoft are put in place, it is strongly recommended that the workaround or fixes (depending on the SFW/InfoScale software version) are implemented. This reservation suspension functionality was well intentioned, but has been shown to cause more harm than good.
 

VERITAS:

SFW 6.0/6.1: Implement the workaround provided in the 'Workaround' section below. This will set the isAlive cycle value to an extremely large number, effectively disabling the feature.

InfoScale 7.1: Contact Support and request the available fix that will disable this functionality.

InfoScale 7.2: Contact Support and request the available fix that will disable this functionality.

InfoScale 7.3: Upgrade to 7.4 (or above) or implement the workaround provided in the 'Workaround' section below.

InfoScale 7.4 +: No action is needed as this functionality is disabled by default.
 

Workaround

Steps for increasing the 10 minute timeout:

1. Open the Registry Editor (start > run > regedit)

2. Browse to HKEY_LOCAL_MACHINE\SOFTWARE\VERITAS\VxSvc\CurrentVersion\VolumeManager

3. Create the following value:

- Name: MSCSMaxResLoopsB4Timeout
- Value Type: REG_DWORD
- Value Data: Set to 9999999 (decimal)

4. Open a command prompt and restart the vxsvc service:

- net stop vxsvc
- net start vxsvc

Note: Restarting vxsvc may result in VMDg resources faulting when performed on an Active node. It is recommended that this step be performed on Passive nodes only, or during an outage window. However, the registry change can be made in advance; it then takes effect at the next restart of the vxsvc service (or reboot of the server).

5. Perform the same steps on all additional cluster nodes.
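
When the same value has to be created on several nodes, the registry change in step 3 could also be prepared as a .reg file instead of being typed into the Registry Editor. The Python sketch below only generates the text of such a file; it is a convenience example, not a Veritas-provided tool, and the function name is hypothetical. The key path, value name, and value data are taken from the steps above.

```python
# Sketch only: builds the text of a .reg file for the value described in step 3.
# build_reg_file is a hypothetical helper name, not part of any Veritas tooling.
KEY_PATH = r"HKEY_LOCAL_MACHINE\SOFTWARE\VERITAS\VxSvc\CurrentVersion\VolumeManager"
VALUE_NAME = "MSCSMaxResLoopsB4Timeout"
VALUE_DATA = 9999999  # decimal, as given in step 3


def build_reg_file(key_path=KEY_PATH, name=VALUE_NAME, data=VALUE_DATA):
    """Return .reg file text that sets a REG_DWORD value."""
    if not 0 <= data <= 0xFFFFFFFF:
        raise ValueError("REG_DWORD data must fit in 32 bits")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        # .reg files express DWORD data as 8 hexadecimal digits
        f'"{name}"=dword:{data:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_reg_file())
```

The generated text can be saved with a .reg extension and imported on each node. The vxsvc restart in step 4 is still required for the change to take effect.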

 

Re-enabling the feature:
If the private fixes are applied to 7.1 or 7.2, or the server is running 7.4 or above, the functionality is disabled by default. However, if this functionality is wanted, it can be enabled or disabled with the vxtune command.

To enable, open a command prompt (Run as Administrator) and run the following command:

vxtune clustercommunication_timeout True
 

To disable (which is the default setting):

vxtune clustercommunication_timeout False


This updates the following registry value automatically:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vxio\VVRParams
"CLUSTER_COMM_TIMEOUT"

0 = Disabled
1 = Enabled
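
To confirm the current setting on a node, the registry value above can be read back. The following Python sketch uses the standard winreg module, which is available on Windows only; the function names are hypothetical, and the key path, value name, and the 0/1 meanings are the ones listed above.

```python
# Sketch only: reads the value that vxtune maintains, as described above.
# Function names are hypothetical; this is not a Veritas-provided tool.
SUBKEY = r"SYSTEM\CurrentControlSet\Services\vxio\VVRParams"
VALUE_NAME = "CLUSTER_COMM_TIMEOUT"


def describe_setting(value):
    """Translate the registry data into the meaning given in the article."""
    if value == 0:
        return "Disabled"
    if value == 1:
        return "Enabled"
    return f"Unexpected value: {value!r}"


def read_setting():
    """Read CLUSTER_COMM_TIMEOUT from HKEY_LOCAL_MACHINE (Windows only)."""
    import winreg  # imported here so the module can still load on other platforms

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SUBKEY) as key:
        value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    return value
```

On a cluster node, print(describe_setting(read_setting())) reports the state. If the value has not been created on that system, read_setting() raises FileNotFoundError.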


Further reading:
Further information about troubleshooting messages that are reported by vxio can be found in the article that is linked in the "Related Documents" section.

Applies To
Microsoft Cluster Server and Windows Failover Cluster

References

JIRA: a287196
UMI: V-40-49157-52
UMI: V-203-57349-52
Etrack: 52
