bptm consistently dies with "media manager terminated by parent process" within 10 minutes after starting
Problem
bptm consistently dies with "media manager terminated by parent process" about 10 minutes after starting
Cause
The Job Details show the job dying after about 10 minutes of writing:
3/6/2009 2:21:48 AM - begin writing
3/6/2009 2:31:29 AM - end writing; write time: 00:09:41
media manager killed by signal(82)
In the example above, "media manager" indicates the process.
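To check whether other failed jobs follow the same roughly 10-minute pattern, the write time can be recomputed from the two timestamps. The following is a minimal sketch, assuming the Job Details lines use the month/day/year, 12-hour layout shown in the excerpt above; the function name write_duration is made up for illustration.

```python
from datetime import datetime

# Timestamp layout assumed from the Job Details excerpt (12-hour clock with AM/PM).
STAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def write_duration(begin_line, end_line):
    """Return the elapsed time between a 'begin writing' and an 'end writing' line."""
    # The timestamp is everything before the first " - " separator.
    begin_stamp = begin_line.split(" - ")[0]
    end_stamp = end_line.split(" - ")[0]
    begin = datetime.strptime(begin_stamp, STAMP_FORMAT)
    end = datetime.strptime(end_stamp, STAMP_FORMAT)
    return end - begin

elapsed = write_duration(
    "3/6/2009 2:21:48 AM - begin writing",
    "3/6/2009 2:31:29 AM - end writing; write time: 00:09:41",
)
print(elapsed)  # 0:09:41
```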
There are two ways that bptm can get this signal, which is simply a SIGHUP (or signal 1). The most common way is from the parent bpbrm process when the backup is failing for a legitimate reason. That is not the case in this situation.
The other way is from nbjm on a master server when a bptm process needs to be terminated, such as when a job has been cancelled.
Normally this would come from the legitimate master server. But in this scenario, there is a second 'rogue' master server initiating the signal. The nbjm and nbrb processes running on this rogue server can see that the media server is using this drive. However, since bptm was started by a job initiated from the legitimate master server, the rogue server does not know why the drive should be active. As a result, it connects to bpcd on the media server and tells it to send SIGHUP to the bptm PID.
The host that is terminating bptm via bpcd can be identified with the bpcd log. In the example below, that host is called rogue-master.domain.com.
The bpcd debug log for the PID handling the request shows the connecting host, the signal being sent, and the bptm PID to be targeted:
bpcd log location
Unix: /usr/openv/netbackup/logs/bpcd/
Windows: <install_path>\Veritas\NetBackup\logs\bpcd
06:20:23.320 [16367] <2> bpcd peer_hostname: Connection from host rogue-master.domain.com (131.212.32.32) port 4070
...snip...
06:20:23.930 [16367] <2> bpcd main: BPCD_SEND_SIGNAL_TO_PID
06:20:23.930 [16367] <4> bpcd main: sending signal 1 to pid 12107
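The correlation described above (connecting host and signal request logged under the same bpcd PID) can be automated when the log is large. The following is a minimal sketch, assuming the bpcd debug log lines look like the excerpt above; the regular expressions and the function name find_signal_senders are illustrative, and other NetBackup versions may word these lines differently.

```python
import re

# Patterns modeled on the bpcd excerpt; they are assumptions, not a documented log format.
LINE = re.compile(r"^\S+ \[(?P<pid>\d+)\] <\d+> (?P<msg>.*)$")
PEER = re.compile(r"peer_hostname: Connection from host (\S+) \((\S+)\)")
SIGNAL = re.compile(r"sending signal (\d+) to pid (\d+)")

def find_signal_senders(log_lines):
    """Pair each 'sending signal' entry with the host that connected to the same bpcd PID."""
    peers = {}  # bpcd PID -> (hostname, IP address) of the connecting host
    found = []
    for raw in log_lines:
        line = LINE.match(raw.strip())
        if not line:
            continue  # skip lines that do not follow the "time [pid] <level> message" shape
        pid, msg = line.group("pid"), line.group("msg")
        peer = PEER.search(msg)
        if peer:
            peers[pid] = peer.groups()
            continue
        sig = SIGNAL.search(msg)
        if sig:
            host, addr = peers.get(pid, ("unknown", "unknown"))
            found.append({
                "bpcd_pid": int(pid),
                "host": host,
                "address": addr,
                "signal": int(sig.group(1)),
                "target_pid": int(sig.group(2)),
            })
    return found

sample = [
    "06:20:23.320 [16367] <2> bpcd peer_hostname: Connection from host rogue-master.domain.com (131.212.32.32) port 4070",
    "06:20:23.930 [16367] <2> bpcd main: BPCD_SEND_SIGNAL_TO_PID",
    "06:20:23.930 [16367] <4> bpcd main: sending signal 1 to pid 12107",
]
senders = find_signal_senders(sample)
for entry in senders:
    print(entry["host"], "sent signal", entry["signal"], "to pid", entry["target_pid"])
```

To scan a real log, pass an open file object for a file in the bpcd log directory instead of the sample list.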
The bptm debug log shows the signal being received, the bpcd PID from which it originated, and the termination of the bptm process.
bptm log location
Unix: /usr/openv/netbackup/logs/bptm
Windows: <install_path>\Veritas\NetBackup\logs\bptm
06:20:27.669 [12107] <2> Media_dispatch_signal: calling catch_signal for 1 (bptm.c:23778) delay 0 seconds
06:20:27.669 [12107] <2> Media_siginfo_print: 0: delay 0 signo SIGHUP:1 code <MINUS_ONE -1) pid 16367
06:20:27.669 [12107] <16> catch_signal: media manager terminated by parent process
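The sending PID in the Media_siginfo_print line (16367 here) is the bpcd PID to look for in the bpcd log. The following is a minimal sketch for pulling that value out of a bptm log, assuming the line layout matches the excerpt above; the pattern and the function name signal_origins are illustrative only.

```python
import re

# Pattern modeled on the Media_siginfo_print line in the excerpt; treat it as an assumption.
SIGINFO = re.compile(
    r"\[(?P<bptm_pid>\d+)\] <\d+> Media_siginfo_print: .*"
    r"signo (?P<name>\w+):(?P<number>\d+) .*pid (?P<sender_pid>\d+)"
)

def signal_origins(log_lines):
    """Return (bptm PID, signal name, signal number, sending PID) for each siginfo line."""
    origins = []
    for raw in log_lines:
        match = SIGINFO.search(raw)
        if match:
            origins.append((
                int(match.group("bptm_pid")),
                match.group("name"),
                int(match.group("number")),
                int(match.group("sender_pid")),
            ))
    return origins

sample = [
    "06:20:27.669 [12107] <2> Media_dispatch_signal: calling catch_signal for 1 (bptm.c:23778) delay 0 seconds",
    "06:20:27.669 [12107] <2> Media_siginfo_print: 0: delay 0 signo SIGHUP:1 code <MINUS_ONE -1) pid 16367",
    "06:20:27.669 [12107] <16> catch_signal: media manager terminated by parent process",
]
origins = signal_origins(sample)
print(origins)  # [(12107, 'SIGHUP', 1, 16367)]
```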
If the problematic media server is running NetBackup 8.1 or newer, certificates are required for successful inter-host communication. Run the following commands to help identify hosts which have issued certificates to the media server.
nbcertcmd -listCACertDetails
nbcertcmd -listCertDetails
One set of certificates will be from the legitimate master. The other sets may reveal which host is the rogue server if the bpcd log is not available.
Solution
Once the rogue server has been identified, follow the steps below for the NetBackup version on the media server to disconnect the media server from the rogue server.
For NetBackup 8.1+ media servers, take these steps to disconnect the rogue server from the media server:
1. On the media server, remove the SERVER= entry allowing connections from the rogue server.
2. On the media server, run: nbcertcmd -deletecertificate -hostid <hostID of the cert issued by the rogue server>
3. On the rogue server: remove configurations pointing to the media server which does not belong (storage units, SERVER entries, nbemm entries via 'nbemmcmd -deletehost', etc.).
For NetBackup 7.x and 8.0 media servers, take these steps to disconnect the rogue server from the media server:
1. On the media server: remove the SERVER= entry allowing connections from the rogue server.
2. On the rogue server: remove configurations pointing to the media server which does not belong (storage units, SERVER entries, nbemm entries via 'nbemmcmd -deletehost', etc.).
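Before and after removing the SERVER entry on the media server, it can help to confirm whether the rogue host is still listed. The following is a minimal, read-only sketch, assuming a UNIX media server whose configuration is a bp.conf-style text file with one `KEY = value` pair per line; on Windows these settings are not kept in such a file, so this check does not apply there. The host names and the function name server_entries are placeholders.

```python
def server_entries(conf_text):
    """Return the host names from SERVER lines in bp.conf-style text, in file order."""
    hosts = []
    for line in conf_text.splitlines():
        key, sep, value = line.partition("=")
        # Match the SERVER key exactly so that keys such as MEDIA_SERVER are not included.
        if sep and key.strip() == "SERVER":
            hosts.append(value.strip())
    return hosts

# Hypothetical file content; every host name here is a placeholder.
sample_conf = """SERVER = legit-master.domain.com
SERVER = rogue-master.domain.com
MEDIA_SERVER = media2.domain.com
CLIENT_NAME = media1.domain.com
"""

rogue = "rogue-master.domain.com"
entries = server_entries(sample_conf)
print(rogue in entries)  # True
```

The function only reads text and never changes the configuration; the removal itself is still done by hand as described in the steps above.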