Article: 100001375
Last Published: 2022-02-28
Product(s): InfoScale & Storage Foundation
Problem
Role of "ShutdownTimeout" and "gab_isolate_time" in service group failover
Solution
ShutdownTimeout comes into effect only after HAD has been killed, while gab_isolate_time comes into effect after port 'h' is closed but before the node goes down. VCS can clear the autodisable flag itself, but the machine must be able to unload GAB within 120 seconds of the VCS HAD engine hanging or dying, which means the machine must go down.
The gab_isolate_time and ShutdownTimeout parameters are designed for different objectives.
First, gab_isolate_time gives HAD enough time to be killed so that the panic condition is avoided.
"From Cluster Server 4.0 onwards, if the client process is not killed, GAB will forcefully close (unregister) the port and start an isolate timer. If after gab_isolate_time the process has not been killed, the system is halted. The gab_isolate_time is meant to give HAD more time to be killed and avoid the panic condition. gab_isolate_time is a kernel tunable parameter that defaults to 120 seconds. The minimum value for this timer is 16 seconds and the maximum is 240 seconds."
Second, ShutdownTimeout defines the time window allowed for a normal reboot so that service group (SG) failover is preserved.
"If a system reboots, it becomes unavailable until the reboot is complete. The reboot process kills all processes, including HAD. When the VCS process is killed, other systems in the cluster mark all service groups that can go online on the rebooted system as autodisabled. The AutoDisabled flag is cleared when the system goes offline. As long as the system goes offline within the interval specified in the ShutdownTimeout value, VCS treats this as a system reboot. The ShutdownTimeout default value of 120 can be changed by modifying the attribute."
The default value for both "ShutdownTimeout" and "gab_isolate_time" is 2 minutes (120 seconds).
gab_isolate_time 120000 Default
hasys_display:search1 ShutdownTimeout 120
hasys_display:search2 ShutdownTimeout 120
If the leaving node is not rebooted within 2 minutes, the other nodes do not fail over the service groups from the leaving node.
However, because both parameters default to the same value of 120 seconds, when HAD cannot be killed (for example, because it is stuck in the kernel), the SG that was online on the failed system cannot be failed over to another system.
If this is the first time you have encountered this failover issue, and the service groups have failed over successfully in the past (whether through a manual switch or a failover caused by the node being rebooted or panicking), we recommend setting ShutdownTimeout to a value greater than gab_isolate_time.
If 'had' takes approximately 3 minutes to die, set ShutdownTimeout to 3 minutes (180 seconds) on both nodes. This should help with automatic failover of the service groups in similar situations (such as a node panic) in the future, without affecting the everyday functionality of the cluster.
You can modify ShutdownTimeout using the following commands:
# haconf -makerw
# hasys -modify sysname ShutdownTimeout <Newtime>
# haconf -dump -makero
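The recommendation above (ShutdownTimeout greater than gab_isolate_time) can be sanity-checked with a small shell sketch. The function name check_timeouts is hypothetical and not part of VCS. The sketch assumes ShutdownTimeout is expressed in seconds and gab_isolate_time in milliseconds, as the displayed values 120 and 120000 suggest. It only compares two numbers supplied by hand and does not query the cluster.

```shell
#!/bin/sh
# Hypothetical helper, not a VCS command: compares the two timers.
# Assumption: ShutdownTimeout is in seconds, gab_isolate_time in milliseconds.
check_timeouts() {
    shutdown_timeout_s=$1
    gab_isolate_ms=$2
    # Convert the GAB value to seconds so both are in the same unit.
    gab_isolate_s=$((gab_isolate_ms / 1000))
    if [ "$shutdown_timeout_s" -gt "$gab_isolate_s" ]; then
        echo "OK: ShutdownTimeout (${shutdown_timeout_s}s) > gab_isolate_time (${gab_isolate_s}s)"
    else
        echo "WARN: ShutdownTimeout (${shutdown_timeout_s}s) <= gab_isolate_time (${gab_isolate_s}s)"
    fi
}

check_timeouts 120 120000   # the default values from this article
check_timeouts 180 120000   # ShutdownTimeout raised to 180 seconds
```

With the default values the check prints a warning; with ShutdownTimeout raised to 180 seconds it reports OK, which matches the recommendation in this article.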