Storage Foundation for Windows logs that reservation refresh is suspended for a disk group

Problem

On servers in a Microsoft Cluster Server (MSCS) or Windows Failover Cluster (WFC), Storage Foundation for Windows (SFW) logs an error to the System Event Log that it is suspending the reservation refresh for the Volume Manager Disk Group (VMDG) resource(s) that are currently online on this cluster node.

In SFW version 5.1 and earlier, this did NOT result in a cluster fault or application outage.
In SFW version 6.0 and later, this results in a loss of access to the volumes in the affected disk group(s), resulting in an application outage.

Error Message

Event ID: 52

Source: VXIO

Description: Cluster software communication timeout. Reservation refresh has been suspended for cluster disk group "DGGUID"


V-40-49157-52
vxio: Cluster software communication timeout. Reservation refresh has been suspended for cluster disk group "DGGUID"

Cause

VXIO expects to receive heartbeat communications from the cluster whenever VMDG resources are online; while it does, it maintains a SCSI reservation thread for the disks that make up the online disk groups. The cluster service, via the VMDG resource DLL (vxres.dll), maintains a record that the disks are SCSI-reserved, and the cluster monitoring cycle for LooksAlive / IsAlive completes and shows the VMDG resources as online.

If VXIO does not hear from the cluster for a set timeout period (10 minutes by default), it suspends the reservation refresh.
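This timeout behaves like a miss counter over fixed-interval loops. The sketch below models it in shell; the variable names and the sample heartbeat trace are illustrative only, not SFW internals:

```shell
# Sketch of the reservation-refresh timeout (illustrative names, not SFW code).
max_loops=200      # default number of consecutive missed-heartbeat loops
loop_seconds=3     # the reservation thread wakes every 3 seconds

misses=0
# Sample heartbeat trace: 1 = heartbeat received this loop, 0 = missed.
for heartbeat in 1 1 0 1 0 0; do
  if [ "$heartbeat" -eq 1 ]; then
    misses=0                     # any heartbeat resets the counter
  else
    misses=$((misses + 1))       # consecutive misses accumulate
  fi
done

# 200 loops x 3 seconds = 600 seconds (10 minutes) before suspension.
echo "suspend after $((max_loops * loop_seconds)) seconds without a heartbeat"
```

With the default of 200 loops, suspension occurs only after 10 minutes of consecutive misses; a single heartbeat resets the count to zero.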

Solution

SFW 5.1 and below:

In SFW 5.1 and earlier, this error generally does not impact cluster operations. Its occurrence does not cause the VMDG resources to fail; the cluster service still records the VMDG resources as online and will not attempt to fail over the service group.
 
While there are online VMDG resources on a cluster node, VXIO issues a SCSI reservation request to each of the disks making up the online disk groups. This is done every 3 seconds. VXIO in turn receives heartbeats every 3-5 seconds from the cluster resource DLL (vxres.dll) via the cluster resource monitor. If VXIO does not hear from the resource monitor for 10 minutes, SFW suspends the reservation thread. The reservation is still held by this cluster node, so if there is no other issue affecting physical disk access, the VMDg resource stays online and does not cause a service group failover.
 
If there were an issue with access to the disks from this cluster node, VXIO would not be able to reserve the disks and the VMDG resource would fault, or the application would be unable to read/write to the disks and the application resource would fault. If a service group resource faults, this node will not defend the reservation challenge (the reservation thread is suspended), and the service group can fail over.

Had VXIO instead assumed that the communications from the resource monitor stopped because the cluster software had failed, and terminated the SCSI reservation, the service group might fail over unnecessarily; suspending the refresh avoids this.

Recommendations:

  • Verify that the cluster software is running. This error is indicative of high resource utilization on the cluster node, and can also occur when the cluster is in a 'hung' state.
  • Review event logs for indications that other applications are experiencing resource shortages.
  • Review event logs for cluster service messages indicating that there are issues monitoring resources.
  • Consider separate resource monitors for the VMDG resources. This can be done by going into the properties of each VMDg resource in the cluster, selecting the General tab, and checking the 'Run this resource in a separate Resource Monitor' option.
    • Note: This should be used as a troubleshooting step:
      • If the issue is still seen, further investigation will be needed by Veritas Software Support.
      • If the issue is no longer seen when the VMDg resources are running under their own resource monitors, this would show that another non-VMDg resource in the cluster is causing the RHS issue and Microsoft should be contacted to investigate.
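As an alternative to the GUI steps above, the same property can be set from a command prompt with cluster.exe on MSCS. SeparateMonitor is a standard cluster resource common property; the resource name below is a placeholder for your actual VMDG resource name:

```shell
:: Enable a separate resource monitor for one VMDG resource.
:: "VMDg-Resource" is a placeholder; substitute the actual resource name.
cluster resource "VMDg-Resource" /prop SeparateMonitor=1

:: Confirm the new setting.
cluster resource "VMDg-Resource" /prop SeparateMonitor
```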

SFW 6.0 and above:

In SFW 6.0 and later, the suspension of the reservation also results in all volumes in the affected disk group(s) being taken offline (volume removal). This was implemented to address the concern that file system corruption could result: with SFW no longer actively monitoring the reservation thread, another node could gain access to the disks in parallel, possibly resulting in corruption.

Due to this change, this issue now causes a true outage. In addition, since the RHS process is not actively monitoring resources, the cluster may not detect the outage until the RHS process resumes performing LooksAlive / IsAlive operations on the affected resources. Because of this, it is possible that the cluster will not take action for quite some time, resulting in an extended outage. If and when RHS returns to normal operation, the cluster will detect the fault and take appropriate action.

Recommendations:

  • Verify that the cluster software is running. This error is indicative of high resource utilization on the cluster node.
  • Review event logs for indications that other applications are experiencing resource shortages.
  • Review event logs for cluster service messages indicating that there are issues monitoring resources.
  • Consider separate resource monitors for the VMDG resources. This can be done by going into the properties of each VMDg resource in the cluster, selecting the 'Advanced Policies' tab, and checking the option 'Run this resource in a separate Resource Monitor':
    • Note: This should be used as a troubleshooting step:
      • If the issue is still seen, further investigation will be needed by Veritas Software Support.
      • If the issue is no longer seen when the VMDg resources are running under their own resource monitors, this would show that another non-VMDg resource in the cluster is causing the RHS issue and Microsoft should be contacted to investigate.
  • Another option is to increase the 10-minute timeout that is in place by default. Note that this is a workaround; the cause of the underlying RHS issue should still be investigated and resolved as the ultimate solution. However, with the changes in 6.x and above (volumes being taken offline), it may be necessary to increase the timeout temporarily or permanently to prevent this problem from causing an outage. With the timeout increased, SFW will continue to maintain reservations on the disks, which may prevent the cluster from performing a takeover in a case where the node is truly hung. In those cases, manual intervention may be needed; for example, it may be necessary to reboot the "hung" server so that another node in the cluster can obtain a reservation on the disk(s) in the affected disk group(s) and bring the VMDg resources online.
 
Steps for increasing the 10 minute timeout (6.0 and above):
The following steps can cause online VMDg resources to fault. For that reason, it is strongly recommended that the cluster be taken down, or that the steps be performed on the passive nodes first; once complete, move the service groups to those nodes so that the steps can then be applied to the originally active nodes.
  1. Open the Registry Editor (start > run > regedit)
  2. Browse to HKEY_LOCAL_MACHINE\SOFTWARE\VERITAS\VxSvc\CurrentVersion\VolumeManager
  3. Create the following value:
    • Name: MSCSMaxResLoopsB4Timeout
    • Value Type: REG_DWORD
    • Value Data: Set a value above 200 (decimal) if a longer delay is required (200 loops x 3 seconds per loop = 600 seconds, or 10 minutes)*
  4. Open a command prompt and restart the vxsvc service:
    • net stop vxsvc
    • net start vxsvc
  5. Perform the same steps on all additional cluster nodes.
* The MSCS reservation thread in vxio.sys executes once every 3 seconds and checks for the IS_ALIVE notification from WFC/VM. If it detects that the IS_ALIVE notification has not been received, it increments an internal counter and sleeps 3 seconds until the next loop. If IS_ALIVE is received on a subsequent loop, the counter is reset to zero. If IS_ALIVE is not received for a given number of loops in sequence, the reservation thread terminates the reservation after logging an event with ID 50. This tunable controls the number of consecutive loops to wait before self-terminating the SCSI reservation thread. The default value of 200 (used when the registry value is not set) yields a timeout of 10 minutes (3 seconds per loop, for a total of 200 loops). Setting this to a very large number essentially disables the timeout.
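Steps 1 through 4 above can also be scripted from an elevated command prompt on each node. This is a sketch only; it assumes a desired timeout of 20 minutes (400 loops x 3 seconds = 1200 seconds), so adjust the value to suit:

```shell
:: Create or overwrite the tunable (400 decimal = 20-minute timeout).
reg add "HKLM\SOFTWARE\VERITAS\VxSvc\CurrentVersion\VolumeManager" ^
    /v MSCSMaxResLoopsB4Timeout /t REG_DWORD /d 400 /f

:: Restart the vxsvc service so the new value is read.
net stop vxsvc
net start vxsvc
```

As with the manual steps, repeat this on every cluster node, and observe the caution above about online VMDg resources faulting.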

Further reading:

Further information about troubleshooting messages that are reported by vxio can be found in the technote that is linked in the "Related Documents" section.

Applies To

Microsoft Cluster Server and Windows Failover Cluster
