Problem
The VCS HAD daemon fails to start, or crashes repeatedly, on HP-UX 11.31 (IA64) systems that have patch PHKL_41700 installed.
Error Message
When the HAD daemon crashes, messages similar to the following are logged in the syslog file:
Nov 28 13:39:27 foo500b Had[17461]: VCS NOTICE V-16-1-10075 Building from remote system
Nov 28 13:39:27 foo500b Had[17461]: VCS ERROR V-16-1-50129 Operation 'MSG_NOTIFIER_NOTIFY' rejected as the node is in REMOTE_BUILD state
Nov 28 13:40:17 foo500b vmunix: GAB INFO V-15-1-20036 Port h gen 6f8124 membership ;1
Nov 28 13:40:17 foo500b vmunix: GAB INFO V-15-1-20038 Port h gen 6f8124 k_jeopardy 0
Nov 28 13:40:17 foo500b vmunix: GAB INFO V-15-1-20040 Port h gen 6f8124 visible 0
Nov 28 13:40:17 foo500b Had[17461]: VCS ERROR V-16-1-10468 Node providing snapshot has left the cluster. Local node leaving cluster to be restarted
Nov 28 13:40:18 foo500b vmunix: GAB WARNING V-15-1-20161 Port h client process killed, GAB will initiate regmon action syslog after 200 sec
Nov 28 13:40:18 foo500b vmunix: GAB INFO V-15-1-20032 Port h closed
Nov 28 13:40:18 foo500b syslog[15648]: VCS ERROR V-16-1-11103 VCS exited. It will restart
Nov 28 13:40:18 foo500b syslog[15648]: VCS ERROR V-16-1-11104 VCS has faulted 6 times since Mon Nov 28 13:40:18 2011 hashadow will not restart VCS. Correct the problem and restart
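The syslog messages above can be checked for with a small shell helper. This is only a sketch: had_crash_signature is a hypothetical function name, not a VCS utility, and matching on the two message IDs V-16-1-10468 and V-16-1-11104 is an assumption about what identifies this failure. Other faults may log the same IDs, so a match should be confirmed against the stack trace.

```shell
#!/bin/sh
# Sketch only: had_crash_signature is a hypothetical helper, not part of VCS.
# It reports whether a syslog file contains both message IDs seen in this issue.
had_crash_signature() {
    logfile=$1
    # V-16-1-10468: node providing the snapshot left the cluster
    # V-16-1-11104: hashadow stopped restarting VCS after repeated faults
    if grep -q 'V-16-1-10468' "$logfile" && grep -q 'V-16-1-11104' "$logfile"; then
        echo "signature found"
        return 0
    fi
    echo "signature not found"
    return 1
}
```

It would be called with the path of the syslog file to inspect, for example had_crash_signature /var/adm/syslog/syslog.log, assuming the default HP-UX syslog location is in use.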
The stack trace of the HAD daemon (located in /var/VRTSvcs/diag/had/) resembles the following:
(0) 0x00000000049209b0 _Z12VCSDumpStackPKc + 0x300 at Platform.C:1874 [/opt/VRTSvcs/bin/had]
(1) 0x0000000004921ee0 VCSAbrtHandler + 0xf0 at Platform.C:2091 [/opt/VRTSvcs/bin/had]
(2) 0xe00000012f05f420 ---- Signal 6 (SIGABRT) delivered ----
(3) 0x60000000c044cc10 _select_sys + 0x30 [/usr/lib/hpux32/libc.so.1]
(4) 0x60000000c0463120 _select + 0xe0 at ../../../../../core/libs/libc/shared_em_32_perf/../core/syscalls/t_select.c:21 [/usr/lib/hpux32/libc.so.1]
(5) 0x00000000048141f0 _ZN9IpmHandle6eventsEP5DListPS1_S1_S2_i + 0xb30 at Ipm.C:537 [/opt/VRTSvcs/bin/had]
(6) 0x0000000004823d50 _ZN9IpmHandle4sendEP5VListi + 0x13d0 at Ipm.C:2461 [/opt/VRTSvcs/bin/had]
(7) 0x0000000004780c40 _ZN6System12process_dumpEPvP6MsgHdr + 0xa90 at System.C:5157 [/opt/VRTSvcs/bin/had]
(8) 0x000000000422a8f0 _Z15process_messagePvP5VListi + 0x1350 at had.C:500 [/opt/VRTSvcs/bin/had]
(9) 0x0000000004247190 _Z4MAINmPPc + 0xeff0 at had.C:3236 [/opt/VRTSvcs/bin/had]
(10) 0x000000000425cfe0 main + 0x40 at had.C:3779 [/opt/VRTSvcs/bin/had]
(11) 0x60000000c00427c0 main_opd_entry + 0x50 [/usr/lib/hpux32/dld.so]
Cause
Changes to the select(2) system call introduced by patch PHKL_41700 interfere with the timer functionality that the HAD daemon uses.
Solution
HP has fixed this issue in patch PHKL_41967. Installing this patch is the recommended solution; note that it requires a system reboot.
Until the patch can be installed, the following workaround can be used. It does not require a reboot.
1. Enable the high-resolution timer functionality:
# kctune hires_timeout_enable=1
2. Restart the HAD daemon:
# hastart
Applies To
VCS 5.1SP1 on HP-UX 11.31 (IA64) with patch PHKL_41700 installed
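To judge whether a given system matches this configuration, the installed patch list can be inspected. The sketch below reads a patch listing on standard input and prints a hint. remediation_hint is a hypothetical helper name. The sketch assumes that the listing, for example the output of swlist -l patch, contains the patch identifiers as plain text. It also assumes that the presence of PHKL_41967 means the fix is in place.

```shell
#!/bin/sh
# Sketch only: remediation_hint is a hypothetical helper, not an HP-UX or VCS tool.
# It reads a patch listing on stdin and prints which case applies.
remediation_hint() {
    listing=$(cat)
    if printf '%s\n' "$listing" | grep -q 'PHKL_41967'; then
        # Assumption: the fixing patch being listed means the issue is resolved.
        echo "fixed"
    elif printf '%s\n' "$listing" | grep -q 'PHKL_41700'; then
        # The problem patch is present without the fix.
        echo "affected"
    else
        echo "not affected"
    fi
}
```

On an HP-UX host it might be invoked as swlist -l patch | remediation_hint. A result of "affected" points to either installing PHKL_41967 or applying the kctune workaround described in the Solution section.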