STATUS CODE: 23, 24, 25, 53 sporadic backup failures with the error "(11) network write() error" on Solaris
Problem
STATUS CODE: 23, 24, 25, 53 sporadic backup failures with the error "(11) network write() error" on NetBackup systems running Solaris 10. Related symptoms can include poor backup performance and catalog backups exiting with status code 11.
A variety of backup failures can occur due to a known operating system problem on Solaris 10 machines running various levels of patch revision 118833-xx. These include, but are not limited to, NetBackup status codes 23, 24, 25, and 53. Additional error messages such as "(11) network write() error" may appear in the log files. This patch-related issue is specific to systems running Solaris 10.
Similar issues may occur on Solaris 9, 10, or 11 when the Solaris TCP Fusion feature is enabled and NetBackup is running on a host that has few CPUs or significant contention for the available CPUs.
Errors:
These error messages can occur in various log files during different NetBackup operations. The Log Files section below lists some possible examples that can appear on Solaris 10 systems. If any of these messages appear to be the cause of the problem, refer to the resolution listed below.
Log Files:
1. The /usr/openv/netbackup/logs/bpjobd/log.<date> file may show failures when trying to connect to the job daemon. This occurs when the system is trying to update the Activity Monitor in the NetBackup Administration Console.
<2> put_long: (11) network write() error: Interrupted system call (4); socket = 6
<16> start_bpbrm: Could not write CONTINUE to CMD_SOCK
<16> read_bpbrm_stderr: error processing new job: socket write failed (24)
<2> put_long: (11) network write() error: Bad file number (9); socket = -1
2. The /usr/openv/netbackup/logs/bpsched/log.<date> file on NetBackup master servers may show different failures.
Example 1:
<2> ?: CLIENTXYZ exited with status 24 (socket write failed)
<16> log_in_errorDB: backup of client CLIENTXYZ exited with status 24 (socket write failed)
Example 2:
<2> ?: CLIENTABC exited with status 25 (cannot connect on socket)
Example 3:
<2> ?: CLIENT123 exited with status 53 (backup restore manager failed to read the file list)
3. The /var/adm/messages file may also report a problem with a process fork, similar to the following:
Nov 11 18:15:30 nbu_back sshd[468]: [ID 800047 auth.error] error: fork: Error 0
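The signatures listed above can be searched for in a single log file with a short script. The following is a minimal sketch, not a NetBackup utility: the function name count_nbu_socket_errors and the choice of patterns are illustrative assumptions, and it assumes a POSIX shell and a grep that supports -E (on Solaris 10, /usr/xpg4/bin/grep rather than /usr/bin/grep).

```shell
# Hypothetical helper (not shipped with NetBackup): count the lines in a
# log file that match the error signatures described in this article.
count_nbu_socket_errors() {
    # $1 = path to a log file, for example
    #      /usr/openv/netbackup/logs/bpsched/log.<date>
    # grep -c prints the number of matching lines (0 when there are none).
    # The trailing ([^0-9]|$) keeps "status 24" from matching "status 240".
    grep -c -E 'network write\(\) error|exited with status (23|24|25|53)([^0-9]|$)' "$1"
}
```

A nonzero count only indicates that the signatures are present; the same status codes can have other causes, so the patch level and TCP Fusion setting should still be checked.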
Cause:
This issue is related to TCP optimizations for local traffic (traffic to the same host) introduced in recent kernel update patches. NetBackup uses a number of processes that require various TCP connections to the same host.
Solution:
The patch-specific problem is resolved in Sun patch 125100-10 or later for the SPARC platform, and Sun patch 125101-10 or later for the x86 platform. These are kernel update patches.
For status updates please contact Sun. Document ID 87866 contains additional information regarding this issue and is available at the following link:
https://sunsolve.sun.com/search/document.do?assetkey=1-9-87866-1
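Whether one of the fixed patch revisions is installed can be checked from the installed patch listing. The sketch below is an assumption-laden example: has_patch_rev is a hypothetical helper name, it assumes a POSIX awk (nawk or /usr/xpg4/bin/awk on Solaris 10), and it assumes each installed patch is listed on a line beginning with "Patch: <id>-<rev>", as printed by showrev -p. It reads the listing on standard input, so it can be tried against saved output.

```shell
# Hypothetical helper: succeed (exit 0) if the patch listing on stdin
# contains patch $1 at revision $2 or later.
# Assumed usage on a Solaris 10 SPARC host:
#   showrev -p | has_patch_rev 125100 10 && echo "fixed patch installed"
has_patch_rev() {
    awk -v id="$1" -v min="$2" '
        $1 == "Patch:" {
            # Second field is assumed to look like 125100-10.
            split($2, p, "-")
            if (p[1] == id && p[2] + 0 >= min + 0) found = 1
        }
        END { exit(found ? 0 : 1) }'
}
```

This simple check looks only for the named patch ID. A later kernel patch that supersedes 125100 or 125101 would not be recognized by it, so a negative result should be confirmed manually.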
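Sun recommends that customers disable TCP Fusion in order to work around this issue. TCP Fusion can also be disabled to prevent the more general problem of CPU contention resulting in connection write failure and a subsequent connection drop. To check whether TCP Fusion is currently enabled, run the following command as root (a value of 1 means enabled):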
# echo "do_tcp_fusion/D" | mdb -k
do_tcp_fusion:
do_tcp_fusion: 1
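The check above can be wrapped so that a script can act on the result. This is a minimal sketch, assuming the mdb output has the two-line form shown above, with the value as the last field; tcp_fusion_state is a hypothetical helper name. It reads the output on standard input, so it can be exercised without running mdb.

```shell
# Hypothetical helper: read the mdb output on stdin and report the setting.
# Assumed usage on a Solaris host, as root:
#   echo "do_tcp_fusion/D" | mdb -k | tcp_fusion_state
tcp_fusion_state() {
    # Use the last field of the last "do_tcp_fusion:" line that
    # actually carries a value (the first line is only a label).
    awk '/^do_tcp_fusion:/ && NF >= 2 { v = $NF }
         END {
             if (v == "1")      print "enabled"
             else if (v == "0") print "disabled"
             else               print "unknown"
         }'
}
```

A result of "unknown" means the output did not match the assumed format and should be inspected by hand.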
There are two methods to disable the use of TCP Fusion on a Solaris system. Please contact Sun Enterprise Technical Services with any questions on disabling TCP Fusion.
1. Use the modular debugger (mdb).
Note: This method does NOT require a system reboot, but it will not be persistent when the system is rebooted. If this option is chosen then these steps will need to be followed every time the system is rebooted. This could potentially disrupt system operation if mistakes occur, so great caution is required.
# echo 'do_tcp_fusion/W 0' | mdb -kw
# cd /usr/openv/netbackup/bin/goodies
# ./netbackup stop
# ./netbackup start
2. Use the /etc/system file.
This option has less potential for disrupting the system but does require the system to be rebooted.
set ip:do_tcp_fusion = 0
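When the entry is added by a script rather than by hand, it should not be written twice. The following is a minimal sketch, assuming a POSIX shell and grep; ensure_tcp_fusion_off is a hypothetical helper name. It takes the file path as an argument so that it can be tried on a copy of /etc/system first, and it appends the line only when no active do_tcp_fusion entry exists.

```shell
# Hypothetical helper: append the setting to an /etc/system-style file
# unless an active (uncommented) do_tcp_fusion entry is already there.
ensure_tcp_fusion_off() {
    file=$1
    # Comment lines in /etc/system begin with "*", so they do not match
    # this pattern, which requires the line to start with "set".
    if grep '^[[:space:]]*set[[:space:]][[:space:]]*ip:do_tcp_fusion' "$file" >/dev/null 2>&1; then
        echo "entry already present in $file; review it manually"
    else
        echo 'set ip:do_tcp_fusion = 0' >> "$file"
        echo "entry added to $file; reboot required"
    fi
}
```

For example, copy /etc/system to a scratch location, run ensure_tcp_fusion_off on the copy, and compare the two files before changing the real file. An existing entry is reported rather than rewritten, because it may have been set deliberately to a different value.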