Role of "ShutdownTimeout" and "gab_isolate_time" in service group failover

Article: 100001375
Last Published: 2022-02-28
Product(s): InfoScale & Storage Foundation

Problem

Role of "ShutdownTimeout" and "gab_isolate_time" in service group failover

 

Solution

ShutdownTimeout takes effect only after HAD has been killed, whereas gab_isolate_time takes effect after port 'h' is closed but before the node goes down. VCS can clear the AutoDisabled flag itself, but for that to happen the machine must unload GAB within 120 seconds of the VCS engine (HAD) hanging or dying, which means the machine must go down within that time.

gab_isolate_time and ShutdownTimeout are designed for different objectives.

First, gab_isolate_time gives HAD enough time to be killed so that the panic condition is avoided.

"From Cluster Server 4.0 onwards, if the client process is not killed, GAB will forcefully close (unregister) the port and start an isolate timer. If after gab_isolate_time, the process has not been killed, the system is halted. The gab_isolate_time is meant to give HAD more time to be killed and avoid the panic condition. gab_isolate_time is a kernel tunable parameter that defaults to 120 seconds. The minimum value for this timer is 16 seconds and the maximum is 240 seconds."

Second, ShutdownTimeout defines the time window within which a normal reboot must complete for service group failover to proceed.

"If a system reboots, it becomes unavailable until the reboot is complete. The reboot process kills all processes, including HAD. When the VCS process is killed, other systems in the cluster mark all service groups that can go online on the rebooted system as autodisabled. The AutoDisabled flag is cleared when the system goes offline. As long as the system goes offline within the interval specified in the ShutdownTimeout value, VCS treats this as a system reboot. The ShutdownTimeout default value of 120 can be changed by modifying the attribute."
 
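The default value for both "ShutdownTimeout" and "gab_isolate_time" is 2 minutes (120 seconds). In the listing below, gab_isolate_time appears as 120000, which is consistent with a value in milliseconds.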

gab_isolate_time              120000  Default

hasys_display:search1    ShutdownTimeout    120
hasys_display:search2    ShutdownTimeout    120


If the leaving node is not rebooted within 2 minutes, the other nodes do not fail over the service groups from the leaving node.

However, because both parameters default to the same 120 seconds, when HAD cannot be killed (for example, because it is stuck in the kernel), the service group that was online on the failed system cannot be failed over to another system.

If this is the first time you have encountered this failover issue, and the service groups have failed over successfully in the past (whether through a manual switch or a failover caused by the node being rebooted or panicking), we recommend setting ShutdownTimeout to a value greater than gab_isolate_time.
 
For example, if 'had' takes approximately 3 minutes to die, set ShutdownTimeout to 3 minutes (180 seconds) on both nodes. This should help with the automatic failover of the service groups in similar situations (such as a node panic) in the future, without affecting the everyday functionality of the cluster.
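
The comparison behind this recommendation can be sketched in a few lines of shell. This is only an illustration: shutdown_timeout_ok is a hypothetical helper, not a VCS command, and it assumes that gab_isolate_time is expressed in milliseconds (as the 120000 value in the listing above suggests) while ShutdownTimeout is expressed in seconds.

```shell
#!/bin/sh
# Hypothetical helper (not part of VCS): succeeds only when ShutdownTimeout
# (assumed to be in seconds) is strictly greater than gab_isolate_time
# (assumed to be in milliseconds).
shutdown_timeout_ok() {
    isolate_ms=$1
    shutdown_s=$2
    [ $((shutdown_s * 1000)) -gt "$isolate_ms" ]
}

# Default values from this article: both timers are 120 seconds.
if shutdown_timeout_ok 120000 120; then
    echo "120s: ShutdownTimeout is greater than gab_isolate_time"
else
    echo "120s: ShutdownTimeout is not greater than gab_isolate_time"
fi

# Value suggested in this article: 180 seconds.
if shutdown_timeout_ok 120000 180; then
    echo "180s: ShutdownTimeout is greater than gab_isolate_time"
else
    echo "180s: ShutdownTimeout is not greater than gab_isolate_time"
fi
```

With the default values the check fails, because the two timers are equal; with ShutdownTimeout at 180 seconds it passes.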

You can modify ShutdownTimeout using the following commands:

# haconf -makerw
# hasys -modify sysname ShutdownTimeout <Newtime>
# haconf -dump -makero
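
As a rough sketch, the same sequence can be generated for several systems at once. The function below is a hypothetical dry-run helper that only prints the commands shown above so they can be reviewed before being run as root. The names search1 and search2 are assumed to be the system names from the listing earlier in this article, and 180 is the example value discussed above.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the command sequence for the given
# timeout and system names instead of executing it.
print_shutdown_timeout_cmds() {
    new_timeout=$1
    shift
    # Open the cluster configuration for writing.
    echo "haconf -makerw"
    # One hasys -modify line per system name passed in.
    for sys in "$@"; do
        echo "hasys -modify $sys ShutdownTimeout $new_timeout"
    done
    # Save the configuration and make it read-only again.
    echo "haconf -dump -makero"
}

# Assumed system names and example timeout; adjust for your cluster.
print_shutdown_timeout_cmds 180 search1 search2
```

Printing rather than executing keeps the sketch safe to try on any machine; the printed lines can be pasted into a root shell on a cluster node once the values have been checked.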
