VCS Application agent reports a bogus VCS INFO V-16-2-13075 message (resource has reported unexpected OFFLINE)
VCS INFO V-16-2-13075 Resource(rman) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(5).
When a resource is brought online, the agent process invokes the program specified in StartProgram.
When the agent invokes StartProgram, it redirects the program's STDOUT/STDERR to one of the "tmp/Application-*" files, so that any output or error can be captured in the engine logs.
Any process invoked from StartProgram is a child of StartProgram.
On a UNIX operating system (OS), a child process inherits the open file descriptors of its parent process. Consequently, any process that StartProgram launches (typically one that keeps running in the background) also inherits the parent's file descriptors, including the redirected STDOUT/STDERR.
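The inheritance described above can be reproduced with a small shell sketch. It is only an illustration, not the agent's actual code: the temporary file stands in for the agent's capture file, and the inner `sh -c` stands in for StartProgram.

```shell
#!/bin/sh
# Hypothetical demo: a background child inherits the redirected STDOUT/STDERR
# of the program that launched it.

capture=$(mktemp)   # stand-in for the agent's capture file (assumption)

# The inner shell plays the role of StartProgram. Its STDOUT/STDERR are
# redirected to $capture, as the agent does for the real StartProgram.
sh -c '
    # Background child with no redirection of its own:
    # it inherits file descriptors 1 and 2 from its parent.
    ( sleep 1; echo "late output from background child" ) &
    echo "StartProgram finished"
' >"$capture" 2>&1

# StartProgram has already exited, but the child still holds the descriptors.
sleep 2
cat "$capture"
```

Both lines end up in the same file, even though the second one is written after the stand-in StartProgram has exited.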
Because the VCS agent has no control over the new process, it cannot close the file descriptors that the process inherited. Ideally, a process configured under VCS should not continuously write to STDOUT/STDERR, because it has no associated terminal. In most cases, a process designed to run as a daemon either closes its STDOUT/STDERR file descriptors or redirects them to a specific file.
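Based on the above, the application debug log shows that the file to which the agent writes the output of its monitor command is being overwritten by another process, i.e. the program executed from StartProgram. The agent finds the correct PID, but the arguments it then reads back for that PID are not the output of its own command, so the process match fails and the resource is reported Offline.
The relevant lines from the agent debug log are: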
2011/07/19 04:41:10 VCS DBG_5 V-16-50-0 Application:rman:monitor:process <ora_dbw1_rman> before squeeze() Application.C:isMonitorProcessesConfigured
2011/07/19 04:41:10 VCS DBG_5 V-16-50-0 Application:rman:monitor:inside function squeeze(),arguments passed is <ora_dbw1_rman> >>>>> argument passed is correct here...
2011/07/19 04:41:10 VCS DBG_5 V-16-50-0 Application:rman:monitor:Exiting from function squeeze, no of tokens are <1> , returning <0> Application.C:squeeze
2011/07/19 04:41:10 VCS DBG_2 V-16-50-0 Application:rman:monitor:Command prepared for getting pid is </bin/ps --cols=100000 --User=oracmss1 -o pid,args | /bin/egrep 'ora_dbw1_rman' | /bin/egrep -v /bin/grep | /usr/bin/tr -s " " " " | /bin/sed -e 's/^ //' | /bin/cut -f1 -d" ">. Application.C:processExists
2011/07/19 04:41:11 VCS DBG_5 V-16-50-0 Application:rman:monitor:New process string after removing extra space:<ora_dbw1_rman>; User:<oracmss1>. Application.C:processExists
2011/07/19 04:41:11 VCS DBG_4 V-16-50-0 Application:rman:monitor:pidstring:14011, process:ora_dbw1_rman.Application.C:getPidOfArg >>>>>>> correct pid has been picked up here
2011/07/19 04:41:12 VCS DBG_5 V-16-50-0 Application:rman:monitor:Calling VCSAgExec for pid:14011Application.C:getArgsOfPid
2011/07/19 04:41:12 VCS DBG_4 V-16-50-0 Application:rman:monitor:for pid:14011, arguments are:/home/commsc/amdi Application.C:getArgsOfPid >>>>>>> somehow the argument has changed now, and wrong arguments too
2011/07/19 04:41:12 VCS DBG_4 V-16-50-0 Application:rman:monitor:Process:ora_dbw1_rman; return state: Offline. Application.C:application_monitor
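One way to check whether a running process still holds inherited descriptors is to look at where its STDOUT and STDERR point. The sketch below assumes a Linux host with `/proc` and `readlink` available (an assumption, suggested by the procps-style `ps` options in the log above); the function name `show_std_fds` is hypothetical.

```shell
#!/bin/sh
# Linux-only sketch: print the files behind STDOUT (fd 1) and STDERR (fd 2)
# of a given PID. show_std_fds is a hypothetical helper name.
show_std_fds() {
    pid=$1
    for fd in 1 2; do
        # /proc/<pid>/fd/<n> is a symbolic link to the file the descriptor refers to.
        target=$(readlink "/proc/$pid/fd/$fd" 2>/dev/null) || target="(unreadable)"
        echo "pid $pid fd $fd -> $target"
    done
}

# Default to the current shell when no PID is given on the command line.
show_std_fds "${1:-$$}"
```

If a process started from StartProgram shows one of the agent's capture files as the target of fd 1 or fd 2, it is a candidate for the interference seen in the log. Reading another user's `/proc/<pid>/fd` entries usually requires root privileges.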
To avoid this issue, any child process invoked from StartProgram should redirect its own STDOUT/STDERR to a file instead of using the file descriptors inherited from its parent process.
In other words, processes and commands invoked through the parent process should not write their output to the inherited STDOUT/STDERR; that output should be redirected to a separate file.
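A minimal sketch of such a redirection in a StartProgram wrapper script is shown here. The function name `start_detached`, the log location, and the stand-in daemon command are all hypothetical; substitute the real application and a log path appropriate for the site.

```shell
#!/bin/sh
# Hypothetical StartProgram wrapper: start a background process with its own
# STDIN/STDOUT/STDERR so that it does not keep the agent's descriptors open.
start_detached() {
    logfile=$1
    shift
    # STDIN comes from /dev/null; STDOUT and STDERR are appended to the
    # process's own log file instead of the inherited descriptors.
    "$@" </dev/null >>"$logfile" 2>&1 &
}

# Demo with a stand-in daemon (assumption: replace with the real command).
demo_log=$(mktemp)
start_detached "$demo_log" sh -c 'echo "daemon output"; echo "daemon error" >&2'
sleep 1
cat "$demo_log"
```

With this pattern the background process writes only to its own log file, so nothing it prints later can land in the file the agent uses to capture command output. Only descriptors 0, 1 and 2 are handled here; a program that opens further descriptors before starting children would need to close those as well.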