Backup fails with status 13 after bpbrm reports; socket read failed: errno = 62 - Timer expired

Article: 100021314
Last Published: 2021-03-18
Product(s): NetBackup & Alta Data Protection

Problem

The backup is started and progressing, but then fails with status 13 after bpbrm reports the following condition.

Error bpbrm (pid=XXXXX) socket read failed: errno = 62 - Timer expired

Error Message

The Job Details in the Activity Monitor show that the backup was active and progressing until a timer expired, which caused the status 13.

09:00:12 started process bpbrm (pid=7249)
09:00:20 connecting
09:00:20 connected; connect time: 0:00:00
09:06:20 Error bpbrm (pid=7249) socket read failed: errno = 62 - Timer expired
09:06:20 end writing file read failed (13)

The bpbrm debug log confirms that the media manager has mounted and positioned the media, but that after some amount of time a timer expired.  The log may or may not show any meta data arriving from the client before the failure.  However, the gap in the timestamps will exceed the current CLIENT_READ_TIMEOUT setting on the media server, which causes bpbrm to terminate bptm.

09:00:20.183 [7249] <2> write_file_names: successfully wrote buffer to COMM_SOCK
09:00:20.183 [7249] <2> bpbrm write_filelist: wrote CONTINUE on COMM_SOCK
09:00:20.183 [7249] <2> bpbrm write_filelist: closing COMM_SOCK 12
09:00:42.975 [7249] <2> bpbrm mm_sig: received ready signal from media manager
09:06:20.087 [7249] <16> bpbrm readline: socket read failed: errno = 62 - Timer expired

When bpbrm and bptm terminate, the network sockets to the client are closed.  As a result, the bpbkar debug log will show a broken socket when bpbkar next attempts to send image data to bptm, or meta data to bpbrm.

09:06:21.948 [8924] <16> bpbkar: ERR - bpbkar killed by SIGPIPE
09:06:21.948 [8924] <16> bpbkar: ERR - bpbkar FATAL exit status = 40: network connection broken
09:06:21.949 [8924] <4> bpbkar: INF - EXIT STATUS 40: network connection broken

Cause

During a backup, the NetBackup client sends a steady stream of meta data to bpbrm as it sends each file to bptm.  It also sends application keep-alives to bpbrm as progress is made, since large files may take some time to transfer.  If the data transfer from bpbkar to bptm slows or stops, then bpbrm will not receive further updates in a timely manner and the timer will expire.  At that point bpbrm will terminate bptm so that the storage resources can be released and used for a different backup that may make forward progress.

Solution

Please note that if the SIGPIPE occurs in the bpbkar log before the timeout in bpbrm, then the TCP stack on the client host believes the socket is closed while the TCP stack on the media server is unaware of it.  That indicates a problem in the TCP stack on one of the hosts, or in the network between the hosts, and should be investigated at the O/S and network levels.

If the SIGPIPE occurs after bpbrm exits, then there are three considerations.

  1. As a temporary workaround, increase the CLIENT_READ_TIMEOUT on the media server (only) to extend the time that bpbrm will wait for meta data updates or application keep-alives from the client (see the command example after step 3 below).  This may allow a slow backup to complete, but will not help if the data transfer from bpbkar to bptm has stopped completely.
  2. Then closely investigate the situation to identify and resolve the root cause; some possible causes are as follows.
  • Some segment of the network between the media server and client hosts is congested by traffic from other backups or applications, so the connection between bpbkar and bptm receives only a small share of the total bandwidth.
  • A TCP tuning or other problem between the media server and client hosts that slows the transfer of the image to bptm.
  • A TCP tuning or other problem in the TCP stack on one of the hosts, or in a network device between them, that causes packets to stop flowing on the socket to bptm.
  • Very slow delivery of data to bpbkar by the storage subsystem on which the file system that is being backed up resides.
  • If the problem only affects differential backups and not full backups, then it may be that the file system is very large, but contains many small files that do not change very often. As a result, bpbkar may have to scan millions of files before finding the next file that has changed and passing it to bptm.
  • A similar situation can occur if there are very large sparse files that actually contain very little data.

The bpbkar_path_tr touch file, together with the General=2 and Verbose=5 debugging options, will make the bpbkar log detailed enough to see which files are being backed up at the time the backup slows down.  If there are gaps between a PrintFile entry and the next SelectFile entry, then truss, strace, or Process Monitor can be used to determine how many files bpbkar must scan before finding one to back up.
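
As a sketch, on a UNIX/Linux client the touch file and log directory can be created as shown below (paths assume a default installation; on a Windows client, the touch file goes in the install_path\NetBackup folder and the General and Verbose levels are set in the client Host Properties).

mkdir -p /usr/openv/netbackup/logs/bpbkar            # create the bpbkar log directory if it does not already exist
touch /usr/openv/netbackup/bpbkar_path_tr            # enable per-file path tracing in the bpbkar log
echo "VERBOSE = 5" >> /usr/openv/netbackup/bp.conf   # raise client verbosity if no VERBOSE entry already exists

Remove the touch file and reduce the verbosity once the investigation is complete, as this level of logging can grow very large on big file systems.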

Running bpbkar standalone, writing the output to null, at the same time of day that the backup would normally run will also show how fast the file system can be read.
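
One commonly used form of that baseline read test on UNIX/Linux is sketched below; the path and output file are placeholders, and the exact options may vary by NetBackup release.

time /usr/openv/netbackup/bin/bpbkar -nocont -dt 0 -nofileinfo -nokeepalives /data > /dev/null 2> /tmp/bpbkar_speed.out

Dividing the amount of data in the file system by the elapsed time gives the read throughput that the storage subsystem can sustain, independent of the network and the storage unit.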

On UNIX/Linux, 'netstat -na' may show congestion on the TCP sockets between the client and media server.  Are the TCP windows closing?  Are the TCP Send/Receive queues full?
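
For example, the sockets of interest can be isolated by filtering on the peer host (the IP address below is a placeholder).

netstat -na | grep 10.0.0.5

Values that remain non-zero in the Send-Q column on the sending host, or in the Recv-Q column on the receiving host, indicate that data is queued on the socket but is not being consumed.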

On Linux, using 'netstat -naopt' repeatedly during the backup will also display any TCP timers that might be running.  Normally the 'keepalive' timer would be running.  If the storage unit slows or stops ingesting the data that comprises the backup tar image, then the timer on the DATA connection between bpbkar and bptm.child will change to 'probe', indicating a TCP Zero Window condition, until the ingest of data resumes.  If a 'probe' condition occurs on the NAME connection between bpbkar and bpbrm, then the TCP stack on the media server host has stopped presenting data to bpbrm.  If either connection shows the timer is 'on' and is counting down, perhaps for many seconds or minutes, then that connection is in TCP Retransmission and there is a problem in the networking layers, either in the TCP stack on one of the hosts or on some device between the hosts.
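
As a sketch, the connections and their timers can be observed every few seconds during the backup with a command such as the following (the client IP address is a placeholder).

watch -n 5 "netstat -naopt | grep 10.0.0.5"

The timer appears in the last column, for example keepalive (7193.45/0/0), probe (0.58/0/4), or on (1.20/3/0); as described above, 'probe' indicates a TCP Zero Window and 'on' indicates TCP Retransmission.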

A network packet trace (tcpdump, snoop, wireshark, tshark) from both hosts - and perhaps the adjacent switch ports - can be used to identify if/how often TCP packets are being dropped (data packets and/or reply/ACK packets) and where along the connection path.
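
As an illustration on Linux, the capture can be limited to the relevant hosts and ports (the interface name, file path, peer address, and ports below are assumptions; substitute the ports actually used by the backup).

tcpdump -i eth0 -s 0 -w /tmp/backup_trace.pcap 'host 10.0.0.5 and (port 1556 or port 13724)'

Capturing on both hosts over the same interval allows the traces to be compared to determine where along the path the packets are being dropped.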

Using FTP or rcp to transfer files between the two hosts over TCP at the same time as the backup will show whether the problem affects more than just NetBackup.
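
For example, scp can be used in place of rcp (note that scp adds encryption overhead; the sizes, paths, and host name below are placeholders).

dd if=/dev/zero of=/tmp/testfile bs=1M count=1024    # create a 1 GB test file
time scp /tmp/testfile mediaserver:/tmp/             # time the transfer to the media server

If the copy achieves normal throughput while the backup is slow, the problem is more likely specific to the backup data path; if the copy is also slow, the problem lies in the TCP or network layers.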

In rare instances, there may be a configuration mismatch along the connection path involving MTU, MSS, and jumbo frames.  This may allow smaller frames to pass through the connection, but prevent frames over a specific size from being transmitted end-to-end.  In this situation, the smaller frames allow the protocol to set up the connection successfully, but the larger frames transporting the backup image will not reach the media server (backup) or client (restore).  This situation can be detected on UNIX/Linux by using the command 'traceroute <remoteIPaddress> <size>'.  Try it first with size 1500; if successful, try 3000, 4500, 6000, 7500, 9000, and then the MTU/MSS size.  If the smaller sizes go through but the larger ones do not, there is a problem with the jumbo frame configuration somewhere on or between the hosts.
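
For example (the remote address is a placeholder):

traceroute 10.0.0.5 1500
traceroute 10.0.0.5 4500
traceroute 10.0.0.5 9000

On Linux, 'ping -M do -s <payload_size> 10.0.0.5' can be used for the same purpose; the -M do option sets the Don't Fragment bit so that frames larger than the supported MTU fail visibly instead of being fragmented.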

  3. Once the root cause is identified and mitigated, the CLIENT_READ_TIMEOUT on the media server can be reduced to the original/default value.
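
As a sketch of steps 1 and 3 from the media server command line on UNIX/Linux (the value 1800 is only an example; the default is 300 seconds, and the same setting can also be changed in the media server Host Properties under Timeouts):

/usr/openv/netbackup/bin/admincmd/bpgetconfig CLIENT_READ_TIMEOUT                   # display the current value
echo "CLIENT_READ_TIMEOUT = 1800" | /usr/openv/netbackup/bin/admincmd/bpsetconfig   # raise the timeout to 1800 seconds

Repeat the bpsetconfig command with the original value to restore the setting once the root cause has been resolved.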

Note: Waiting for bptm to mount a new tape will not result in this failure, because the media_not_ready that bptm raises to bpbrm when the first tape fills will suspend the timer.  The timer will not reset/restart until the media_ready that follows the successful mount and positioning of the next tape.

Note: Changing the CLIENT_READ_TIMEOUT on the client host will not have any effect on this situation.

Note: Oracle RMAN backups may have additional causes and better workaround options.  See the related articles.
