
STATUS CODE: 23, 24, 25, 53 sporadic backup failures with the error "(11) network write error" on Solaris

Article: 100017746
Last Published: 2020-05-21
Product(s): NetBackup

Problem

STATUS CODE: 23, 24, 25, 53 sporadic backup failures with the error "(11) network write() error" on NetBackup systems running Solaris 10. Additional problems that may result from this issue are poor backup performance and catalog backups exiting with status code 11.


A variety of backup failures can occur due to a known operating system problem on Solaris 10 machines running various levels of patch revision 118833-xx.  These can include, but are not limited to, NetBackup status codes 23, 24, 25, and 53.  Additional error messages such as "(11) network write() error" may appear in the log files.  This patch-related issue is specific to systems running Solaris 10.

Similar issues may occur on Solaris 9, 10, or 11 when the Solaris TCP Fusion feature is enabled and NetBackup is running on a host that has few CPUs or significant contention for the available CPUs.

Errors:
These error messages can occur in various log files during different NetBackup operations.  The Log Files section below lists some possible examples that can appear on Solaris 10 systems.  If any of these messages appear to be the cause of the problem, refer to the resolution listed below.

Log Files:

1. The /usr/openv/netbackup/logs/bpjobd/log.<date> file may show failures when trying to connect to the job daemon.  This would occur when the system is trying to update the Activity Monitor in the NetBackup Administration Console.

<2> put_long: (11) network write() error: Interrupted system call (4); socket = 6
<16> start_bpbrm: Could not write CONTINUE to CMD_SOCK
<16> read_bpbrm_stderr: error processing new job:  socket write failed (24)
<2> put_long: (11) network write() error: Bad file number (9); socket = -1

2. The /usr/openv/netbackup/logs/bpsched/log.<date> file on NetBackup master servers may show different failures.


Example 1:
<2> ?: CLIENTXYZ exited with status 24 (socket write failed)
<16> log_in_errorDB: backup of client CLIENTXYZ exited with status 24 (socket write failed)

Example 2:
<2> ?: CLIENTABC exited with status 25 (cannot connect on socket)

Example 3:
<2> ?: CLIENT123 exited with status 53 (backup restore manager failed to read the file list)

3. The /var/adm/messages file may also report a problem with a process fork, similar to the following:
Nov 11 18:15:30 nbu_back sshd[468]: [ID 800047 auth.error] error: fork: Error 0
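
The messages listed above can be searched for with a small helper. The following is a minimal sketch, not a supported tool: the function name count_fusion_signatures is hypothetical, the patterns are taken only from the examples in this article, and a POSIX awk is assumed (on Solaris 10, nawk or /usr/xpg4/bin/awk rather than /usr/bin/awk).

```sh
#!/bin/sh
# Sketch only: count lines in one log file that match the error
# signatures quoted in this article. AWK can be overridden, for
# example AWK=nawk on Solaris 10 (assumption: a POSIX awk is used).
count_fusion_signatures() {
    logfile=$1
    "${AWK:-awk}" '
        /network write\(\) *error/                         { n++; next }
        /Could not write CONTINUE to CMD_SOCK/             { n++; next }
        /exited with status (23|24|25|53)([^0-9]|$)/       { n++; next }
        /fork: Error/                                      { n++; next }
        END { print n + 0 }
    ' "$logfile"
}
```

For example, count_fusion_signatures /var/adm/messages prints the number of matching lines in that file. A non-zero count only indicates that the messages are present; it does not prove that TCP Fusion is the cause.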


Cause:
This issue is related to TCP optimizations for local traffic (traffic to the same host) introduced in recent kernel update patches.  NetBackup uses a number of processes which require various TCP connections to the same host.
 
In addition to the patch noted below, TCP Fusion may cause problems if the process writing to the connection (data provider) is 'on' CPU and the process reading from the connection (data consumer) is not 'on' CPU.  In those instances, TCP Fusion will cause an error indication to the data provider if more than 8 (default) consecutive writes occur without any reads by the data consumer. 
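
The write limit described above can be illustrated with a toy model. This is only a sketch of the rule as stated in this article, not of the kernel implementation: the function name fusion_model and the event string are hypothetical, and it assumes that any read by the consumer resets the count of unread writes.

```sh
#!/bin/sh
# Toy model only. 'w' is a write by the data provider, 'r' is a read
# by the data consumer. An error is reported once more than LIMIT
# consecutive writes occur with no read in between (default 8).
fusion_model() {
    events=$1
    limit=${2:-8}
    unread=0
    n=0
    while [ -n "$events" ]; do
        rest=${events#?}            # everything after the first event
        ev=${events%"$rest"}        # the first event itself
        events=$rest
        n=$((n + 1))
        case $ev in
            w) unread=$((unread + 1)) ;;
            r) unread=0 ;;          # assumption: a read clears the count
        esac
        if [ "$unread" -gt "$limit" ]; then
            echo "error at event $n"
            return 0
        fi
    done
    echo "ok"
}
```

With the default limit, fusion_model wwwwwwwww (nine writes, no reads) reports an error at the ninth event, while fusion_model wwwwwwwwrw reports ok because the read arrives before the ninth write.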
 

Solution:
The patch-specific problem is resolved in Sun patch 125100-10 or later for the SPARC platform, and Sun patch 125101-10 or later for the x86 platform.  These are kernel update patches.
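
As a rough aid, the kernel patch level can be compared against the patch IDs named in this article. The sketch below assumes that uname -v reports the kernel update patch in the form Generic_<id>-<rev> (for example Generic_125100-10). The function name classify_kernel_patch and its result strings are hypothetical, and any patch ID not named in this article is reported as unknown rather than guessed.

```sh
#!/bin/sh
# Sketch only: classify a kernel version string such as the output
# of `uname -v` using just the patch IDs mentioned in this article.
classify_kernel_patch() {
    ver=$1
    case $ver in
        Generic_*-*) ;;
        *) echo "unknown"; return 0 ;;
    esac
    id=${ver#Generic_}
    rev=${id#*-}
    id=${id%%-*}
    case $rev in
        ''|*[!0-9]*) echo "unknown"; return 0 ;;   # revision must be numeric
    esac
    case $id in
        118833) echo "possibly-affected" ;;        # "various levels" of 118833-xx
        125100|125101)
            if [ "$rev" -ge 10 ]; then
                echo "fix-level-or-later"
            else
                echo "below-fix-level"
            fi ;;
        *) echo "unknown" ;;                       # not covered by this article
    esac
}
```

A typical call would be classify_kernel_patch "$(uname -v)". An unknown result means only that this article does not cover that patch ID; confirm the patch status with the operating system vendor.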

For status updates please contact Sun.  Document ID 87866 contains additional information regarding this issue and is available at the following link:
  https://sunsolve.sun.com/search/document.do?assetkey=1-9-87866-1

Sun recommends that customers disable TCP Fusion in order to work around this issue.  TCP Fusion can also be disabled to prevent the more general problem of CPU contention resulting in connection write failure and subsequent connection drop.

Check whether TCP Fusion is enabled or disabled as follows.
 
# echo "do_tcp_fusion/D" | mdb -k
do_tcp_fusion:
do_tcp_fusion:  1
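
The mdb output shown above can be turned into a single word for use in scripts. This is a minimal sketch under two assumptions: the output has the two-line format shown in this article, and any non-zero value means enabled. The function name tcp_fusion_state is hypothetical, and a POSIX awk is assumed (nawk or /usr/xpg4/bin/awk on Solaris 10).

```sh
#!/bin/sh
# Sketch only: read `mdb -k` output for do_tcp_fusion on stdin and
# print enabled, disabled, or unknown.
tcp_fusion_state() {
    "${AWK:-awk}" '
        $1 == "do_tcp_fusion:" && NF == 2 { val = $2; seen = 1 }
        END {
            if (!seen)              print "unknown"
            else if (val + 0 == 0)  print "disabled"
            else                    print "enabled"
        }
    '
}
```

It would be used as: echo "do_tcp_fusion/D" | mdb -k | tcp_fusion_state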

 

There are two methods to disable the use of TCP Fusion on a Solaris system.  Please contact Sun Enterprise Technical Services with any questions on disabling TCP Fusion.  

1.  Use the modular debugger (mdb).
Note: This method does NOT require a system reboot, but it will not be persistent when the system is rebooted. If this option is chosen then these steps will need to be followed every time the system is rebooted.  This could potentially disrupt system operation if mistakes occur, so great caution is required.

a. When there are no active backup or restore operations, run the following command.
    # echo 'do_tcp_fusion/W 0' | mdb -kw
 
b. The NetBackup processes will also need to be restarted.
# cd /usr/openv/netbackup/bin/goodies
# ./netbackup stop
# ./netbackup start

2. Use the /etc/system file.
This option has less potential for disrupting the system but does require the system to be rebooted.
 
Add the following line to the /etc/system file.
 
set ip:do_tcp_fusion = 0
 
Once the /etc/system file is updated it will be necessary to restart the system before the workaround will take effect.
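
Before rebooting, it can be useful to confirm that the entry for method 2 is present and not commented out. The sketch below is an assumption-laden helper, not a vendor tool: the function name fusion_disabled_in_system_file is hypothetical, it accepts only the values 0 and 0x0, and it relies on /etc/system comment lines beginning with an asterisk. A POSIX awk is assumed.

```sh
#!/bin/sh
# Sketch only: succeed (exit status 0) if the given file, by default
# /etc/system, contains an active "set" entry that assigns 0 to
# do_tcp_fusion in the ip module.
fusion_disabled_in_system_file() {
    file=${1:-/etc/system}
    "${AWK:-awk}" '
        $1 == "set" {                       # comment lines start with "*"
            rest = $0
            sub(/^[ \t]*set[ \t]+/, "", rest)
            gsub(/[ \t]/, "", rest)         # ignore spacing around "="
            if (rest == "ip:do_tcp_fusion=0" || rest == "ip:do_tcp_fusion=0x0")
                found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$file"
}
```

For example: fusion_disabled_in_system_file && echo "entry present". The check confirms only that the line exists in the file; the running kernel is not changed until the system is restarted.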
 
Note:  Disabling TCP Fusion is preferred over trying to tune tcp_fusion_rcv_unread_min and other related settings.
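
Because method 1 must be repeated after every reboot, the steps can be kept in one place to reduce the chance of mistakes. The following is a minimal sketch: the function names run_step and disable_tcp_fusion_runtime and the APPLY variable are hypothetical. By default it only prints the commands from method 1; nothing is executed unless APPLY=yes is set. It does not check for active backup or restore operations, so that check remains a manual step.

```sh
#!/bin/sh
# Sketch only: print (default) or run the commands from method 1.
# Set APPLY=yes to execute; any other value prints the commands.
run_step() {
    if [ "${APPLY:-no}" = "yes" ]; then
        sh -c "$1"
    else
        printf 'would run: %s\n' "$1"
    fi
}

disable_tcp_fusion_runtime() {
    # Step a: clear do_tcp_fusion in the running kernel (not persistent).
    run_step "echo 'do_tcp_fusion/W 0' | mdb -kw" || return 1
    # Step b: restart the NetBackup processes.
    run_step "cd /usr/openv/netbackup/bin/goodies && ./netbackup stop" || return 1
    run_step "cd /usr/openv/netbackup/bin/goodies && ./netbackup start"
}
```

Running disable_tcp_fusion_runtime with APPLY unset lists the three commands for review. Running it as root with APPLY=yes executes them in order and stops at the first failure.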

 
