All VCS HA commands are unresponsive

Article: 100008957
Last Published: 2012-07-28
Product(s): InfoScale & Storage Foundation

Problem

All VCS HA commands become unresponsive and hang after a few days of operation.

The HAD daemon must be restarted to clear the condition; otherwise GAB panics the system once HAD stops sending heartbeats.

Error Message

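No specific error message is reported, but the engine log is no longer updated.

A gdb backtrace of thread 2 of the hung HAD process looks similar to the output below. Frames #5 to #7 show the heartbeat alarm handler calling VCSSyslog after interrupting the main thread.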

(gdb) thread 2
[Switching to thread 2 (Thread 23965)]#0  0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x00ba4783 in getifaddrs () from /lib/libc.so.6
#2  0x00b4860c in __old_glob_in_dir () from /lib/libc.so.6
#3  0x00b93fd3 in _res_hconf_reorder_addrs () from /lib/libc.so.6
#4  0x00b9454a in do_init () from /lib/libc.so.6
#5  0x082587bc in VCSSyslog (
    bufp=0x82f8060 "VCS WARNING V-16-1-51100 HAD Self Check: Excessive delay in the HAD heartbeat to GAB (10 seconds)") at Platform.C:732
#6  0x08258a70 in gab_heartbeat_alarm_handler (sig_num=14) at Platform.C:1992
#7  <signal handler called>
#8  0xffffe410 in __kernel_vsyscall ()
#9  0x00b87583 in sprofil () from /lib/libc.so.6
#10 0x00b4a6d0 in glob64@@GLIBC_2.2 () from /lib/libc.so.6
#11 0x00b49712 in glob64@@GLIBC_2.2 () from /lib/libc.so.6
#12 0x00b4a16d in glob64@@GLIBC_2.2 () from /lib/libc.so.6
#13 0x00b4e8c6 in internal_fnwmatch () from /lib/libc.so.6
#14 0x00b4e81f in internal_fnwmatch () from /lib/libc.so.6
#15 0x0821f701 in Log::write_ffdc (sev=35, whop=0x82d0f24 "gabtcp_compute_visible_membership",
    filep=0x82d0f71 "GabTcpAux.C", line=3295, flags=49152, cat=50, id=0,
    msgp=0x9019460 "membership is 0, local membership is 1") at Log.C:1650
#16 0x080e62c2 in gabtcp_compute_visible_membership () at GabTcpAux.C:3295
#17 0x080f4040 in GabTcp::lowest_master (this=0xf7115108) at GabTcp.C:420
#18 0x080692e0 in MAIN (argc=3, argv=0xffa23f54) at had.C:3260
#19 0x0806b757 in main (argc=64768, argv=0x0) at had.C:3776
(gdb)
 

To capture a core image of the hung HAD process with gcore, run:

# gcore <pid_of_hung_had_process>

Cause

The SIGALRM handler that the HAD daemon uses to check its heartbeats to GAB calls syslog().

Because syslog() is not async-signal-safe, this call can sometimes cause HAD to sleep indefinitely.
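The following is a minimal sketch of a SIGALRM handler restricted to async-signal-safe operations. It is not the HAD source; every name in it (heartbeat_alarm_handler, heartbeat_monitor_init, heartbeat_delay_pending, the message text, and the status file path passed by the caller) is hypothetical. It assumes the status file can be opened once at startup, so the handler only needs write(), which POSIX lists as async-signal-safe, and a flag that the main loop inspects later.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical names throughout; an illustration, not the HAD code. */
static volatile sig_atomic_t hb_delay_seen = 0;
static int hb_status_fd = -1;            /* opened once, outside the handler */
static const char hb_msg[] = "heartbeat delay detected\n";

/* SIGALRM handler: sets a flag and appends to an already-open file.
 * It deliberately avoids syslog(), which is not async-signal-safe. */
static void heartbeat_alarm_handler(int sig_num)
{
    int saved_errno = errno;             /* write() may change errno */
    (void)sig_num;

    hb_delay_seen = 1;
    if (hb_status_fd >= 0) {
        ssize_t n = write(hb_status_fd, hb_msg, sizeof hb_msg - 1);
        (void)n;                         /* nothing safe to do on failure */
    }
    errno = saved_errno;
}

/* Open the status file and install the handler. Returns 0 on success. */
int heartbeat_monitor_init(const char *path)
{
    struct sigaction sa;

    hb_status_fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (hb_status_fd < 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = heartbeat_alarm_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGALRM, &sa, NULL);
}

/* Return 1 if the handler flagged a delay since the last call, else 0.
 * SIGALRM is blocked while the flag is read and cleared so that no
 * notification is lost. A threaded program would use pthread_sigmask(). */
int heartbeat_delay_pending(void)
{
    sigset_t block, old;
    int seen;

    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    sigprocmask(SIG_BLOCK, &block, &old);
    seen = hb_delay_seen;
    hb_delay_seen = 0;
    sigprocmask(SIG_SETMASK, &old, NULL);
    return seen;
}
```

In this arrangement the main loop would call heartbeat_delay_pending() at a convenient point and do any heavier reporting, including syslog(), in normal program context rather than inside the handler.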

 

This issue is tracked as Etrack incident 2747052.

Solution

When the HAD daemon enters this state, GAB generally kills and restarts HAD, which clears the problem.

In a single-node VCS cluster, GAB does not restart the HAD daemon. Instead, restart it manually by killing the hung daemons and then running the # hastart -onenode command.

 

This issue is fixed by removing the syslog() call from the handler; the HAD daemon updates a file instead.

The following hotfixes include this fix:

VCS 5.1SP1RP1HF5

VCS 5.1SP1RP2HF3 

Please contact Veritas support to obtain these hotfixes.

 

This issue is also fixed in the next rolling patch, 5.1SP1RP3.

 


Applies To

VCS 5.1SP1 / VCS 5.1SP1RP1 / VCS 5.1SP1RP2. Also applies to single node VCS clusters.
 

References

Etrack : 2747052
