Status 87 and Status 40 / 636 on long running Client side deduplication jobs across a slow connection

Article: 100009509
Last Published: 2019-07-22
Ratings: 0 1
Product(s): NetBackup & Alta Data Protection

Problem

Client side deduplication backups are failing with Status 87 and/or Status 40 / 636 for remote clients across the WAN/firewall. The error occurred at either the end of a storage unit fragment or at the end of a backup when the last [or only] fragment is closed.

Backups of short duration are not affected, but those that run longer are always affected.

Error Message

The Job Details will show the lengthy runtime for the job, the slow transfer performance over the WAN, and that the job ended with status 87 after the client had transferred the entire image.

05/02/2013 18:51:04 - Info bptm (pid=5186) setting receive network buffer to 262144 bytes
05/02/2013 22:21:41 - Info bpbkar (pid=8736) bpbkar waited 18630 times for empty buffer, delayed 578492 times
05/02/2013 22:21:52 - Critical bptm (pid=5186) sts_close_handle failed: 2060017 system call failed
05/02/2013 22:21:52 - Critical bptm (pid=5186) cannot write image to disk, media close failed with status 2060017
05/02/2013 22:21:53 - Info b7pnbmed1 (pid=5186) StorageServer=PureDisk:nbmed1; Report=PDDO Stats for (nbmed1): scanned: 24659201 KB, CR sent: 0 KB, CR sent over FC: 0 KB, dedup: 100.0%
05/02/2013 22:21:53 - Critical bptm (pid=5186) sts_close_server failed: error 2060005 object is busy, cannot be closed
05/02/2013 22:23:32 - Info bptm (pid=5186) EXITING with status 87 <----------
05/02/2013 22:23:34 - Info bpbkar (pid=8736) done. status: 87: media close error
05/02/2013 22:23:34 - end writing; write time: 3:32:21 media close error  (87)

Alternatively, the Job Details may show lengthy runtime, several errors communicating from the media to master servers, and then a status 40 which leads to a server side shutdown before the client to storage handshaking is done. Notice also that nbjm logged a status 636.

05/02/2013 23:52:25 - Info bpbkar (pid=16661) 40000 entries sent to bpdbm
05/03/2013 01:39:50 - Error bpbrm (pid=3763) could not write FILE ADDED message to OUTSOCK
05/03/2013 01:40:17 - Error bpbrm (pid=3763) could not write FILE ADDED message to OUTSOCK
05/03/2013 01:40:19 - Info bpbkar (pid=16661) 45000 entries sent to bpdbm
05/03/2013 01:40:26 - Error bptm (pid=3806) image copy failed: error 2060017: system call failed
05/03/2013 01:40:27 - Error bpbrm (pid=3763) from client client01: ERR - bpbkar exiting because backup is aborting
05/03/2013 01:40:30 - Critical bptm (pid=3806) sts_close_handle failed: 2060017 system call failed
05/03/2013 01:40:30 - Info media01 (pid=3806) StorageServer=PureDisk:media01; Report=PDDO Stats for (media01): scanned: 25534554 KB, CR sent: 856 KB, CR sent over FC: 0 KB, dedup: 100.0%
05/03/2013 01:40:31 - Critical bptm (pid=3806) sts_close_server failed: error 2060005 object is busy, cannot be closed
05/03/2013 01:40:40 - Error bptm (pid=3806) media manager terminated by parent process
05/03/2013 01:40:43 - Error bpbrm (pid=3763) could not write EXIT STATUS to OUTSOCK
05/03/2013 01:40:43 - Info bpbkar (pid=16661) done. status: 40: network connection broken
read from input socket failed  (636)

This nbostpxy debug log (OID 112) shows a client side abort for the first job.

22:21:52.132[8585.29][DEBUG][dummy][1318:bptm:5186:media01][DEBUG] PDVFS: pdvfs_lib_log: CRSessionEnd: media01.test.com: session closed, status aborted
22:21:52.132[8585.29][DEBUG][dummy][1318:bptm:5186:media01][DEBUG] PDVFS: pdvfs_lib_log: CRSessionOrphan: unbinding session 2718096627 to media01.test.com
22:21:52.132[8585.29][ERROR][dummy][1318:bptm:5186:media01][ERROR] PDVFS: pdvfs_lib_log: _crBinaryMessageSend2: Error sending data: aborted
22:21:52.132[8585.29][DEBUG][dummy][1318:bptm:5186:media01][DEBUG] PDVFS: pdvfs_lib_log: Store request for segment 0ff29e7fc56673fbe7390c56596d39ad to cr at media01.test.com failed (aborted) (../stream.c:soRequest:1593)

The spoold session log shows a timeout a few minutes earlier:

May 02 21:17:33 ERR [1080609088]: 25015: Timed out reached while waiting for session id 1960943674 to reconnect. Reconnect Failed.
May 02 21:17:33 ERR [1080609088]: 25015: binaryMessageRead:3402, Could not receive binary message: expected 4 bytes got -1 bytes. Cause: connection timed out

Cause

The failure occurs due to a firewall dropping idle sockets.

Because of the deduplication, there may long spans of time when no data is flowing between the client and media server and consequently media server and master server. A firewall may mistake this for an idle socket that is no longer needed and drop the connection silently. Once the fragment (or job) is complete and NetBackup attempts to use the socket, the dropped connection(s) will be detected result in the failure.

In the first job above, the dropped connection was between the media server and client was detected first and thus detected by both of those hosts and resulted in the status 87.

In the second job above, the connection between master and media server was also dropped and the sequence of operations lead bpbrm to report the status 40. When bpbrm subsequently shutdown bptm and the storage side, then the remaining active connections to the client were also dropped. Notice also that nbjm had a read pending on the socket to bpbrm, the master server TCP stack would be unaware of the dropped idle socket (since no data was arriving) but when the TCP Keepalive Interval expired the O/S attempted to send a keepalive which resulted in a TCP reset which was returned to nbjm which logged the status 636.

Solution

To improve network performance between the media server and client host, make sure that NetBackup is not attempting to adjust the TCP send and receive space. Doing so disables auto-tune on modern TCP stacks which can lead to poorer performance. By improving the performance, the backup may complete before the firewall drops the idle socket.

On UNIX and Linux media servers and client hosts:
echo "0" > /usr/openv/netbackup/NET_BUFFER_SZ

On Windows media servers:
echo "0" > install_path\NetBackup\NET_BUFFER_SZ

To keep the firewall from dropping idle sockets, either lengthen the idle socket timeout on the firewall or shorten the TCP keepalive frequency on the hosts on either side of the firewall. The frequency should be less than the idle socket timeout setting on the firewall. The default frequency is 2 hours, which is much too long for most sites. A frequency of 15 minutes is usually appropriate, but use a shorter frequency if needed.

 
Operating System Parameter for frequency of probes Values Commands
AIX tcp_keepidle 1,800 half secs $ no -o tcp_keepidle=1800
HP-UX 11i tcp_keepalive_interval 900,000 ms $ ndd -set /dev/tcp tcp_keepalive_interval 900000
Linux tcp_keepalive_time 900 secs $ sysctl -w net.ipv4.tcp_keepalive_time=900
Solaris tcp_keepalive_interval 900,000 ms $ ndd -set /dev/tcp tcp_keepalive_interval 900000
Windows KeepAliveTime 900,000 ms See Related Article 100028680 below.

Applies To

Any environment with client side deduplication jobs that run for a longer period of time than the idle socket timeout on the firewall between the hosts.

Was this content helpful?