Intentional Panic of a node in a Veritas Cluster Server cluster by GAB

Problem

Intentional Panic of a node in a Veritas Cluster Server cluster by GAB

Error Message

GAB: Port h process failed to heartbeat
GAB: Port h attempting to kill process due to client process failure

from /var/adm/messages

Solution

This technote examines why Veritas Cluster Server (VCS) triggers an intentional panic when a node becomes very busy, and which tunables give finer control over this behavior.
When a node panics, GAB logs the following messages in the system log:
GAB: Port h process failed to heartbeat
GAB: Port h attempting to kill process due to client process failure
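To check whether a node logged these messages, the system log can be searched for them. The following is a minimal sketch: the function name gab_port_h_events is hypothetical, and the log path is passed as an argument (for example /var/adm/messages, as named above).

```shell
#!/bin/sh
# Print the GAB "Port h" heartbeat-failure and kill-attempt messages
# found in a system log file.
# gab_port_h_events is a hypothetical helper name, not a VCS command.
gab_port_h_events() {
    logfile=$1
    # -E enables the alternation between the two message texts.
    grep -E 'GAB: Port h (process failed to heartbeat|attempting to kill process)' "$logfile"
}
```

For example, gab_port_h_events /var/adm/messages prints every matching line with its syslog timestamp, which helps correlate the events with system load at the time.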
When a node or domain gets so busy that the VCS engine (HAD) does not respond to other cluster members, GAB panics the node. In VCS 1.3 and later, GAB tries to kill HAD five times before it panics the node or commits to panicking it.
A panic flushes buffers and produces a core dump, which provides diagnostic information for determining the cause. This behavior is better than a halt because a halt does not provide any diagnostic information and does not guarantee that a node is halted within a definite period of time. If the node is very busy, the halt process may get swapped out.
The HAD process regularly heartbeats to the GAB module. HAD can miss its heartbeat and time out for several reasons:
  • It may not get a chance to run because of the system load. In order to minimize this possibility, the HAD process runs as a high priority real time (RT) process.
  • It may be swapped out because of lack of physical memory. In VCS 1.3 and above, HAD startup pages are locked in memory.
  • It may be executing a system call in the kernel.
Tuning Parameters
GAB provides various tunable parameters to reduce the probability of a panic. However, we recommend using these parameters with utmost caution. In most cases, a panic is the desired behavior because it ensures data integrity and high availability in the cluster. If you do need to change these parameters, these changes must be done carefully by a VCS administrator. The parameter definitions are as follows:
VCS_GAB_TIMEOUT
This indicates the maximum amount of time GAB waits for a heartbeat from HAD before it declares that HAD is unresponsive and tries to kill the HAD process.
You can adjust this tunable by changing the environment variable VCS_GAB_TIMEOUT.  The default is 15000 milliseconds. You must restart the HAD process for the new value to take effect.
Example procedure to increase the timeout to 30 seconds (30000 milliseconds):
# VCS_GAB_TIMEOUT=30000
# export VCS_GAB_TIMEOUT
# haconf -dump -makero
# hastop -local -force
# hastart
This variable can also be exported in the /etc/rc3.d/S99vcs script before VCS starts up. We recommend that you do not increase this value beyond 30 seconds, because a hung HAD on one node can block HA processing throughout the cluster.
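Before exporting a new value, it can help to confirm that it is a whole number of milliseconds and stays within the recommended 30-second ceiling. The following is a minimal sketch; check_gab_timeout is a hypothetical helper, not part of VCS, and the 30000 ms limit is taken from the recommendation above.

```shell
#!/bin/sh
# Validate a proposed VCS_GAB_TIMEOUT value, given in milliseconds.
# check_gab_timeout is a hypothetical helper name, not a VCS command.
# Returns 0 if acceptable, 1 if above the recommended ceiling,
# 2 if the value is not a whole number.
check_gab_timeout() {
    value=$1
    max_ms=30000   # recommended ceiling from this technote
    case $value in
        ''|*[!0-9]*)
            echo "not a whole number of milliseconds: $value" >&2
            return 2
            ;;
    esac
    if [ "$value" -gt "$max_ms" ]; then
        echo "$value ms exceeds the recommended maximum of $max_ms ms" >&2
        return 1
    fi
    return 0
}
```

One possible use is to guard the export: check_gab_timeout 30000 && { VCS_GAB_TIMEOUT=30000; export VCS_GAB_TIMEOUT; }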
GAB IOFENCE timeout
Indicates the maximum amount of time GAB waits for the HAD process to exit before taking the next action, which could be one of the following:
  • Try to kill HAD again
  • Panic the node or commit to panic the node
Note that GAB tries to kill HAD five times before it panics the node or commits to panicking it. The default value of this tunable is 15000 milliseconds. It can be changed using the gabconfig command:
/sbin/gabconfig -f <iofence>
Once GAB tries to kill the HAD process, it expects the process to die within the time specified by <iofence>.
Typically, the HAD process does not die within the <iofence> timeout if:
  • the HAD process is swapped out and the system is too busy to swap in the process in a timely manner in order to kill it
  • the process is executing a system call in the kernel
We recommend that you do not increase this value beyond 30 seconds, because a hung HAD on one node can block HA processing throughout the cluster.
GAB_ISOLATE_TIME
When HAD stops sending heartbeats to GAB, GAB tries to kill the HAD process five times, after which GAB declares that HAD is dead on the affected node. GAB also starts the GAB_ISOLATE_TIME timer at this point. The other nodes resume operations with a new cluster membership which does not include the affected node. This ensures that broadcast messages are not blocked after GAB declares HAD dead on the affected node.
The GAB_ISOLATE_TIME tunable defines how long GAB waits for HAD to be killed on the isolated node before panicking the node. In effect, it delays the panic by up to that amount of time.
If HAD does not recover within this timeout, GAB panics the node.
GAB_ISOLATE_TIME is specified as a GAB kernel tunable. The default value is 120000 milliseconds (2 minutes). The minimum value is 16 seconds; the maximum value is 240000 milliseconds (4 minutes).
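To get a feel for how the three tunables interact, the waits can be added up into a rough worst-case delay between HAD's last heartbeat and a panic. This is only a sketch under a stated assumption: the technote does not say that the intervals add up linearly, so treat the result as an estimate. worst_case_panic_ms is a hypothetical helper name.

```shell
#!/bin/sh
# Rough worst-case delay, in milliseconds, between HAD's last heartbeat
# and a node panic.
# ASSUMPTION: the heartbeat timeout, the five kill attempts and the
# isolate timer simply add up; the technote does not state this.
worst_case_panic_ms() {
    gab_timeout=$1    # VCS_GAB_TIMEOUT, in ms
    iofence=$2        # gabconfig -f <iofence>, in ms
    isolate=$3        # GAB_ISOLATE_TIME, in ms
    kill_attempts=5   # number of kill attempts stated in this technote
    echo $(( gab_timeout + kill_attempts * iofence + isolate ))
}
```

With 15000, 15000 and 120000 as the three arguments, the function prints 210000, that is, about three and a half minutes. Doubling both 15000 ms timeouts to 30000 raises the estimate to 300000, which illustrates why large values keep an unresponsive node in the cluster for a long time.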
Preventing system panics
The panic can be completely disabled by issuing the /sbin/gabconfig -k command. If you run this command, instead of panicking the node, GAB tries to kill the HAD process every <iofence> milliseconds. In VCS 1.3 and earlier, this option was called /sbin/gabconfig -r.
We recommend that you do not increase the <iofence> value beyond 30 seconds, because a hung HAD on one node can block HA processing throughout the cluster.
In versions before VCS 1.3, specifying /sbin/gabconfig -r has one more side effect. If a node forms a cluster on its own and subsequently rejoins the cluster without /sbin/gabconfig -r, that node is ejected and panicked. With /sbin/gabconfig -r, the HAD process on that node is killed instead.
The -f, -r, and -k options to gabconfig may be used at any time. You must run the gabconfig command with the same options on all nodes in the cluster. To reverse /sbin/gabconfig -r or -k, you must stop VCS, unload GAB with /sbin/gabconfig -U, reconfigure GAB, and start VCS.
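The reversal sequence can be written down as a dry run that only prints the commands, so that they can be reviewed before anything is executed on a node. This is a sketch, not a verified procedure: print_gab_reset_steps is a hypothetical helper, the use of "hastop -local" to stop VCS and of /etc/gabtab to reconfigure GAB are assumptions not stated in this technote, and other GAB clients may also need to be stopped before GAB can be unloaded.

```shell
#!/bin/sh
# Print (do not run) the steps for reversing gabconfig -r or -k on one node.
# print_gab_reset_steps is a hypothetical helper name, not a VCS command.
print_gab_reset_steps() {
    echo "hastop -local"        # ASSUMPTION: stop VCS on this node
    echo "/sbin/gabconfig -U"   # unload GAB, as stated in this technote
    echo "sh /etc/gabtab"       # ASSUMPTION: reconfigure GAB from /etc/gabtab
    echo "hastart"              # start VCS again
}
```

Because the same gabconfig options must be in effect on every node, the printed sequence would need to be reviewed and carried out on each node in the cluster.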
 
 

 
