When NIS or LDAP is used for Oracle user authentication, the agent may, in rare conditions, hang in the getpwnam_r() library function and stop managing resources. The agent heartbeat continues to work, so HAD does not restart the agent.
The agent's monitor or online entry points neither finish nor time out.
2011/09/09 01:36:17 VCS NOTICE V-16-1-10301 Initiating Online of Resource Ora_Oracle (Owner: Unspecified, Group: OraGrp) on System node01
The suspected cause is an intermittent LDAP or NIS problem that makes the getpwnam_r() function hang when the agent looks up the Oracle user. Because this is an operating system library function, the agent cannot cancel the thread.
In one customer case, the agent called getpwnam_r() within the monitor entry point on Solaris, and the Solaris implementation of getpwnam_r() disabled thread cancellation before blocking. Because the NIS server was having problems, the getpwnam_r() call was stuck and all service threads became dysfunctional. The agent framework could not cancel them because getpwnam_r() had disabled cancellation internally. The agent was effectively hung, but because the timer thread was still heartbeating successfully with the engine, no problem was detected.
Normally blocking calls should not disable cancellation.
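One way to check whether the name service is currently slow or unresponsive is to run the same kind of user lookup from the shell with a time limit. The following is a minimal sketch, not part of the product: it assumes the getent and timeout utilities are installed (timeout is a GNU coreutils tool and may be missing on Solaris), and the function name check_user_lookup and the user name oracle are only illustrative.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the passwd lookup for the given user
# completes within the time limit (default 10 seconds), non-zero otherwise.
# getent goes through the configured name services (files, NIS, LDAP),
# so a hang here suggests the agent's getpwnam_r() call could hang too.
check_user_lookup() {
    user=$1
    limit=${2:-10}
    # timeout exits with 124 when the limit is reached;
    # getent exits with 2 when the user is not found.
    timeout "$limit" getent passwd "$user" >/dev/null
}
```

For example, "check_user_lookup oracle 10 || echo lookup failed or timed out" reports a problem if the lookup does not return a passwd entry within 10 seconds.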
To recover from this condition, find the Oracle agent process, kill it, and restart the agent using haagent -start.
The hastop command will not work because the agent is not responsive.
1. Find the Oracle agent process ID
# ps -elf | grep OracleAgent
4 S root 18708 1 0 75 0 - 4209 stext Mar31 ? 00:01:04 /opt/VRTSagents/ha/bin/Oracle/OracleAgent -type Oracle -agdir /opt/VRTSagents/ha/bin/Oracle
2. Kill the OracleAgent
# kill -9 18708
3. Restart the agent
# haagent -start Oracle -sys node01
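The three steps above can be combined into a small script. This is a sketch under stated assumptions rather than a supported tool: it assumes a POSIX shell, the agent binary path shown in the example output, and that haagent is in the PATH. The function names agent_pids and restart_oracle_agent are hypothetical.

```shell
#!/bin/sh
# Path of the agent binary, taken from the example ps output above.
AGENT_BIN=/opt/VRTSagents/ha/bin/Oracle/OracleAgent

# Read "PID ARGS" lines (as printed by `ps -eo pid,args`) on stdin and
# print the PID of every process whose executable is the Oracle agent.
# Matching on the executable avoids picking up the grep process itself.
agent_pids() {
    while read -r pid cmd rest; do
        if [ "$cmd" = "$AGENT_BIN" ]; then
            echo "$pid"
        fi
    done
}

# Kill the hung agent on the local host, then start a new agent
# for the given system name.
restart_oracle_agent() {
    sys=$1
    pids=$(ps -eo pid,args | agent_pids)
    if [ -z "$pids" ]; then
        echo "OracleAgent is not running on this host" >&2
        return 1
    fi
    # SIGKILL is needed because the hung threads cannot be cancelled.
    kill -9 $pids || return 1
    haagent -start Oracle -sys "$sys"
}
```

After sourcing the script as root on the affected node, run "restart_oracle_agent node01", replacing node01 with the local system name.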
VCS with LDAP or NIS used for Oracle user authentication