When onlining the Volume Manager Disk Group (VMDG) resource in a Windows cluster, the resource times out. By default, Windows Failover Cluster (Windows 2008) applies two timeouts when a resource is brought online: the Pending Online Timeout of 180 seconds and the Deadlock Timeout of 300 seconds. MSCS (Windows 2003 cluster server) has only the Pending Online Timeout.
The issue becomes apparent when these timeouts expire. The Windows event logs indicate an RHS.EXE failure with vxres.dll, or, in Windows 2008, Windows Error Reporting logs WSFC Resource Deadlock messages for ONLINERESOURCE.
These messages need further investigation to determine whether the problem lies in vxres.dll or in the underlying resource taking too long to come online.
This document covers a scenario in which the dynamic disk group and its volumes can be seen to arrive. The arrivals are visible in the application event log, where the Storage Foundation for Windows providers log them:
<date> <time_t1> INFORMATION 19867(0x65154d9b) VxSvc_vxvm <server>
Importing dynamic disk group <vmdg_resource_name>.
<date> <time_t2> INFORMATION 800(0x65100320) VxSvc_pnp <server>
Device \Device\HarddiskDmVolumes\<vmdg_resource_name>\<volume> has arrived.
Assuming there are no more volumes, or that every volume has a corresponding event logged, VxSvc_vxvm should then log that the disk group is imported, and the cluster resource should go online.
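The check that every volume has a corresponding arrival event can be scripted against an exported copy of the application event log. The following is a minimal sketch, not a vendor tool: it assumes the log has been exported as plain text, that the arrival message has the exact wording shown in the excerpt above, and that volume names contain no spaces. The function name missing_arrivals and its parameters are hypothetical.

```python
import re

# Matches the VxSvc_pnp arrival message as it appears in the excerpt above:
#   Device \Device\HarddiskDmVolumes\<disk group>\<volume> has arrived.
# Assumption: volume names contain no whitespace.
ARRIVAL_RE = re.compile(
    r"Device \\Device\\HarddiskDmVolumes\\(?P<group>[^\\]+)\\(?P<volume>\S+) has arrived\."
)


def missing_arrivals(log_text, disk_group, expected_volumes):
    """Return the expected volumes that have no arrival event for disk_group.

    log_text is the exported event log as one string; expected_volumes is
    the list of volume names the administrator expects in the disk group.
    """
    seen = {
        match.group("volume")
        for match in ARRIVAL_RE.finditer(log_text)
        if match.group("group") == disk_group
    }
    return sorted(set(expected_volumes) - seen)
```

An empty result means every expected volume logged an arrival. If the resource still fails to come online in that case, the delay lies elsewhere in the import, which is the scenario this document describes.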
When 180 seconds have elapsed between time_t1 and time_t3, the following event is logged on Windows Server 2008 R2:
<date> <time_t3> INFORMATION 1001(0x000003e9) Windows Error Reporting <server>
Fault bucket , type 0
Event Name: WSFC Resource Deadlock
Response: Not available
Cab Id: 0
P2: Volume Manager Disk Group
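To relate the event timestamps to the cluster timeouts, the elapsed time between the import event (time_t1) and the Windows Error Reporting event (time_t3) can be compared with the default values quoted at the start of this document. The following is a minimal sketch under stated assumptions: the timestamp format is an assumption about how the log was exported, the function name exceeded_timeouts is hypothetical, and the timeout values are the defaults given in this document, which may have been changed on a particular cluster.

```python
from datetime import datetime

# Default values as stated in this document; a cluster may be configured differently.
PENDING_ONLINE_TIMEOUT_S = 180
DEADLOCK_TIMEOUT_S = 300  # Windows 2008 failover clusters only


def exceeded_timeouts(t1, t3, fmt="%Y-%m-%d %H:%M:%S"):
    """Return (elapsed_seconds, names of default timeouts reached).

    t1 is the timestamp of the "Importing dynamic disk group" event and
    t3 the timestamp of the Windows Error Reporting event. The format
    string is an assumption and must match the exported log.
    """
    elapsed = (datetime.strptime(t3, fmt) - datetime.strptime(t1, fmt)).total_seconds()
    exceeded = []
    if elapsed >= PENDING_ONLINE_TIMEOUT_S:
        exceeded.append("Pending Online Timeout")
    if elapsed >= DEADLOCK_TIMEOUT_S:
        exceeded.append("Deadlock Timeout")
    return elapsed, exceeded
```

For example, timestamps three minutes apart give an elapsed time of 180 seconds, which reaches the default Pending Online Timeout but not the default Deadlock Timeout.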
The problem is that the disk group import never finishes despite the arrival of the volumes.
While the VMDG resource is coming online, the disk group configuration is updated in the SFW VEA database. This update will not finish if another process is holding a database lock.
Note: this issue first became apparent on failover when a disk was missing from the disk group.
The hotfix for this issue is available in SFW 5.1 SP2 CP7.