The backup starts and makes progress, but then fails with status 13 after bpbrm reports the following condition.
Error bpbrm (pid=XXXXX) socket read failed: errno = 62 - Timer expired
The Job Details in the Activity Monitor show that the backup was active and progressing until a timer expired, which caused the status 13.
09:00:12 started process bpbrm (pid=7249)
09:00:20 connected; connect time: 0:00:00
09:06:20 Error bpbrm (pid=7249) socket read failed: errno = 62 - Timer expired
09:06:20 end writing file read failed (13)
The bpbrm debug log confirms that the media manager has mounted and positioned the media, but that a timer expired some time afterward. The log may or may not show meta data arriving from the client before the failure. In either case, the gap in the timestamps will exceed the current CLIENT_READ_TIMEOUT setting on the media server. This will cause bpbrm to terminate bptm.
09:00:20.183  <2> write_file_names: successfully wrote buffer to COMM_SOCK
09:00:20.183  <2> bpbrm write_filelist: wrote CONTINUE on COMM_SOCK
09:00:20.183  <2> bpbrm write_filelist: closing COMM_SOCK 12
09:00:42.975  <2> bpbrm mm_sig: received ready signal from media manager
09:06:20.087  <16> bpbrm readline: socket read failed: errno = 62 - Timer expired
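The timestamp gap described above can be located mechanically. The following is a minimal Python sketch, not a NetBackup tool: the function name find_gaps is hypothetical, the log lines are assumed to begin with an HH:MM:SS.mmm timestamp as in the excerpt, and the 300 second timeout is an assumed value that should be replaced with the CLIENT_READ_TIMEOUT actually configured on the media server. It does not handle logs that cross midnight.

```python
from datetime import datetime


def find_gaps(log_lines, timeout_seconds):
    """Return (previous_line, next_line, gap_seconds) for each gap above the timeout."""
    gaps = []
    previous = None  # (timestamp, line) of the last line that parsed
    for line in log_lines:
        stamp = line.split(maxsplit=1)[0] if line.strip() else ""
        try:
            current = datetime.strptime(stamp, "%H:%M:%S.%f")
        except ValueError:
            continue  # skip lines that do not start with HH:MM:SS.mmm
        if previous is not None:
            gap = (current - previous[0]).total_seconds()
            if gap > timeout_seconds:
                gaps.append((previous[1], line, gap))
        previous = (current, line)
    return gaps


# Lines taken from the bpbrm excerpt above.
sample = [
    "09:00:20.183  <2> bpbrm write_filelist: closing COMM_SOCK 12",
    "09:00:42.975  <2> bpbrm mm_sig: received ready signal from media manager",
    "09:06:20.087  <16> bpbrm readline: socket read failed: errno = 62 - Timer expired",
]

# 300 is an assumed timeout in seconds, used only for illustration.
for before, after, gap in find_gaps(sample, timeout_seconds=300):
    print(f"{gap:.3f} s gap before: {after}")
```

With the sample lines, the only gap reported is the one of roughly 337 seconds between the ready signal from the media manager and the socket read failure.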
When bpbrm and bptm terminate, the network sockets to the client will be removed. As a result, the bpbkar debug log will show a broken socket when bpbkar next attempts to send image data to bptm, or meta data to bpbrm.
09:06:21.948  <16> bpbkar: ERR - bpbkar killed by SIGPIPE
09:06:21.948  <16> bpbkar: ERR - bpbkar FATAL exit status = 40: network connection broken
09:06:21.949  <4> bpbkar: INF - EXIT STATUS 40: network connection broken
During a backup, the NetBackup client sends a steady stream of meta data to bpbrm as it sends each file to bptm. It also sends application keep-alives to bpbrm as progress is made, since large files may take some time to transfer. If the data transfer from bpbkar to bptm slows or stops, then bpbrm will not receive further updates in a timely manner and the timer will expire. At that point bpbrm will terminate bptm so that the storage resources can be released and then used for a different backup that may make forward progress.
Please note that if the SIGPIPE occurs in the bpbkar log before the timeout in bpbrm, then the TCP stack on the client host believes the socket is closed while the TCP stack on the media server is unaware of it. This indicates a problem in the TCP stack on one of the hosts or in the network between the hosts. Such a problem should be investigated at the O/S and network levels.
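The ordering of the two events decides which line of investigation applies. The following Python sketch compares the two timestamps; the function name classify and the wording of its return values are hypothetical. It assumes that both logs use HH:MM:SS.mmm timestamps and that the clocks on the client and the media server are synchronized, since clock skew between the hosts can reverse the apparent order.

```python
from datetime import datetime


def classify(sigpipe_time, timeout_time):
    """Compare the bpbkar SIGPIPE time with the bpbrm timer expiry time."""
    fmt = "%H:%M:%S.%f"
    sigpipe = datetime.strptime(sigpipe_time, fmt)
    timeout = datetime.strptime(timeout_time, fmt)
    if sigpipe < timeout:
        # The client saw a closed socket before bpbrm gave up.
        return "SIGPIPE first: investigate the TCP stacks and the network"
    # bpbrm gave up first; the broken socket is a consequence of that.
    return "timeout first: investigate the slow or stalled data transfer"


# Timestamps from the bpbkar and bpbrm excerpts above.
print(classify("09:06:21.948", "09:06:20.087"))
```

For the excerpts in this article the SIGPIPE follows the timer expiry by about two seconds, so the second branch applies.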
If the SIGPIPE occurs after bpbrm exits, then there are three considerations.
1) As a temporary workaround, increase the CLIENT_READ_TIMEOUT on the media server (only) to extend the time that bpbrm will wait for meta data updates or application keep-alives from the client. This may allow a slow backup to complete, but will not be of use if the data transfer from bpbkar to bptm has stopped completely.
2) Then investigate the situation closely to identify and resolve the root cause. Some possible causes are as follows.
* Some segment of the network between the media server and client hosts is congested by traffic from other backups or applications, so that the connection between bpbkar and bptm gets only a small share of the total bandwidth.
* A TCP tuning or other problem between the media server and client hosts that slows the transfer of the image to bptm.
* A TCP tuning or other problem in the TCP stack of one of the hosts, or in a network device between them, that causes packets to stop flowing on the socket to bptm.
* Very slow delivery of data to bpbkar by the storage subsystem on which the file system that is being backed up resides.
* If the problem only affects differential backups and not full backups, then it may be that the file system is very large, but contains many small files that do not change very often. As a result, bpbkar may have to scan millions of files before finding the next file that has changed and passing it to bptm.
* A similar situation can occur if there are very large sparse files that actually contain very little data.
The bpbkar_path_tr touch file and the General=2 and Verbose=5 debugging options will make the bpbkar log more detailed, so that it shows which files were being backed up when the backup slowed down.
If there are gaps between the PrintFile and the next Select file, then truss, strace, or Process Monitor can be used to determine how many files bpbkar needs to scan before finding one to back up.
Running bpbkar standalone, with its output directed to null, at the same time that the backup would normally run will show how fast the file system can be read.
On UNIX/Linux, 'netstat -na' may show congestion on the TCP sockets between the client and the media server. Check whether the TCP windows are closing and whether the TCP queues are full.
A network trace can be used to identify if/how often TCP packets are being lost and retransmitted or if acknowledgements are slow.
Using FTP or rcp to transfer files between the two hosts over TCP at the same time as the backup will show if the problem affects more than just NetBackup.
3) Once the root cause is identified and mitigated, the CLIENT_READ_TIMEOUT on the media server can be reduced to the original/default value.
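Before raising the timeout, and again when restoring it, it helps to record the value currently in effect. The following Python sketch reads the setting from the text of a configuration file. It rests on assumptions that should be verified for the installed release: that the media server is a UNIX/Linux host keeping the setting in bp.conf as a "CLIENT_READ_TIMEOUT = seconds" line, and that 300 seconds applies when no such line is present. The function name client_read_timeout and the host name master1 are hypothetical.

```python
DEFAULT_TIMEOUT = 300  # assumed default, in seconds, when the entry is absent


def client_read_timeout(bp_conf_text, default=DEFAULT_TIMEOUT):
    """Return the CLIENT_READ_TIMEOUT found in the given bp.conf text."""
    for line in bp_conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "CLIENT_READ_TIMEOUT":
            return int(value.strip())
    return default


# Hypothetical file contents showing a temporarily raised timeout.
example = """SERVER = master1
CLIENT_READ_TIMEOUT = 1800
"""
print(client_read_timeout(example))
```

On a real media server the text would be read from the bp.conf file itself, commonly /usr/openv/netbackup/bp.conf on UNIX/Linux installations.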
Note: Waiting for bptm to mount a new tape will not result in this failure, because the media_not_ready raised to bpbrm by bptm when the first tape fills will suspend the timer. The timer will not reset/restart until the media_ready that follows the successful mount and positioning of the next tape.
Note: Changing the CLIENT_READ_TIMEOUT on the client host will not have any effect on this situation.
Note: Oracle RMAN backups may have additional causes and better workaround options. See the related articles.
NetBackup 5.x, 6.x, 7.x