Status codes 14, 40 and 41 are seen on the Windows Client and Media Server

Article: 100020853
Last Published: 2010-01-14
Product(s): NetBackup & Alta Data Protection

Problem

Exit status codes 14, 40 and 41 are seen on the Windows Client and Media Server

Solution

Overview:
Backup jobs fail after running for a period of time. Status codes 14, 40 and 41 are observed in the Activity Monitor and in the Media Server and Client logs.

Details:
Different NetBackup processes generate slightly different status codes when an established socket between two processes is unexpectedly disconnected. For instance:

Status Code 14: A write to a file or socket failed.
Status Code 40: The connection between the client and the server was broken.
Status Code 41: The server did not receive any information from the client for too long a period of time.

Because socket/network-related root causes occur at a lower level than NetBackup can log, all NetBackup can do is indicate that there was an unexpected communication problem.

Identifying the exact root cause of a communication problem often requires capturing network packets while the error occurs, as well as the expertise to analyze those captures.

Troubleshooting:
One way to validate whether the network environment on a Windows host is problematic is to review its TCP statistics.

The command "netstat -s" displays these statistics and can show whether communication problems have occurred:

 TCP Statistics
 Active Opens                  = 4023
 Passive Opens                 = 46
 Failed Connection Attempts    = 113
 Reset Connections             = 166
 Current Connections           = 13
 Segments Received             = 781309
 Segments Sent                 = 575802
 Segments Retransmitted        = 381

If data is dropped or lost en route between the sending and receiving machines, the TCP stack is designed to re-send the missing data. The number of re-send attempts is controlled by a parameter called TcpMaxDataRetransmissions, which defaults to 5 retransmissions. In the example above, Segments Retransmitted has a value of 381. If a packet still has not reached its intended destination after the allowed number of retransmissions, the TCP stack is designed to shut down the established socket (a sketch for checking the configured limit follows the log excerpts below). This can cause errors like the following:

Client-side bpbkar log (General 2 / TCP 3):
4:00:58.504 PM: [8056.1768] <2> dtcp_write: TCP - success: send socket (868), 32768 of 32768 bytes
4:00:59.317 PM: [8056.1768] <16> dtcp_write: TCP - failure: send socket (868) (TCP 10054: Connection reset by peer)
4:00:59.317 PM: [8056.1768] <16> dtcp_write: TCP - failure: attempted to send 32768 bytes
4:00:59.317 PM: [8056.1768] <4> tar_base::V_vTarMsgW: INF - tar message received from tar_backup::backup_data_state
4:00:59.317 PM: [8056.1768] <2> tar_base::V_vTarMsgW: FTL - tar file write error (10054)
4:00:59.317 PM: [8056.1768] <2> dtcp_write: TCP - success: send socket (812), 35 of 35 bytes
4:00:59.317 PM: [8056.1768] <2> tar_base::V_vTarMsgW: INF - Client completed sending data for backup
4:00:59.317 PM: [8056.1768] <4> tar_base::V_StopKeepaliveThread: INF - Waiting for Keepalive Thread to EXIT
4:00:59.317 PM: [8056.7404] <4> tar_base::V_KeepaliveThread: INF - Keepalive Thread Terminating. Mutex: WAIT_OBJECT_0
4:00:59.317 PM: [8056.1768] <4> tar_base::V_StopKeepaliveThread: INF - The Keepalive Thread has Exited. Wait Reason: WAIT_OBJECT_0
4:00:59.317 PM: [8056.1768] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 14: file write failed
4:00:59.317 PM: [8056.1768] <2> dtcp_write: TCP - success: send socket (812), 40 of 40 bytes
4:00:59.317 PM: [8056.1768] <4> tar_backup::backup_done_state: INF - Not waiting for server status

Media Server-side bpbrm log (Verbose 5):
16:56:27.971 [3872.4236] <2> process_cpr_message: db_addCHKPT returned the requested operation was successfully completed (0)
16:56:27.971 [3872.4236] <2> process_cpr_message: Sent bpdbm the CPR info.
17:00:59.240 [3872.4236] <32> bpbrm main: from client big-server: FTL - tar file write error (10054)
17:00:59.334 [3872.4236] <2> vnet_vnetd_service_socket: vnet_vnetd.c.2034: VN_REQUEST_SERVICE_SOCKET: 6 0x00000006
17:00:59.584 [3872.4236] <2> logconnections: BPDBM CONNECT FROM 10.180.12.12.4453 TO 10.180.32.24.13724
17:00:59.990 [3872.4236] <2> bpbrm main: client big-server EXIT STATUS = 14: file write failed
17:00:59.990 [3872.4236] <2> bpbrm kill_child_process_Ex: start
17:01:00.115 [3872.4236] <2> bpbrm wait_for_child: start
17:01:00.115 [3872.4236] <2> bpbrm wait_for_child: child exit_status = 150
17:01:00.115 [3872.4236] <2> inform_client_of_status: INF - Server status = 14
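
As noted above, the number of allowed retransmissions is governed by the TcpMaxDataRetransmissions registry value under the standard Tcpip Parameters key. The following is a minimal illustrative sketch (not a NetBackup tool), assuming Python 3 on the Windows host in question; when the value is absent, the TCP stack uses its built-in default of 5:

# Illustrative sketch only: report whether TcpMaxDataRetransmissions has been
# customized on this Windows host (assumes Python 3 on Windows).
import winreg

TCP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCP_PARAMS) as key:
    try:
        value, _ = winreg.QueryValueEx(key, "TcpMaxDataRetransmissions")
        print("TcpMaxDataRetransmissions is explicitly set to %d" % value)
    except FileNotFoundError:
        # Value not defined in the registry, so the stack default of 5 applies.
        print("TcpMaxDataRetransmissions is not set; the default of 5 applies")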

Resolution:
The above failure was traced to a customized TCP stack setting, TcpMaxDataRetransmissions, on the Windows Client and the Windows Media Server. To isolate the problem, Wireshark network captures were run on both the Windows Media Server and the Windows Client. A Microsoft Network Engineer analyzed the captured data and discovered that the socket was being closed after the second retransmission of a problematic packet.

Both servers in this environment had TcpMaxDataRetransmissions set to a value of 2.

According to Microsoft, it is common practice to increase TcpMaxDataRetransmissions in an effort to make TCP more resilient; this customer had reduced the value instead. After the value was returned to the default of 5, backups began to succeed.
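
For reference, a customized setting can be returned to the default of 5 either by deleting the TcpMaxDataRetransmissions registry value or by setting it back to 5. The sketch below is illustrative only (not a Microsoft or NetBackup utility) and assumes Python 3 run from an elevated prompt on the affected host; a reboot is generally required before the TCP stack picks up the change:

# Illustrative sketch only: set TcpMaxDataRetransmissions back to the default
# of 5. Run from an elevated (Administrator) prompt; reboot afterwards.
import winreg

TCP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCP_PARAMS, 0,
                    winreg.KEY_SET_VALUE) as key:
    # REG_DWORD value of 5 matches the documented default.
    winreg.SetValueEx(key, "TcpMaxDataRetransmissions", 0, winreg.REG_DWORD, 5)

print("TcpMaxDataRetransmissions set to 5; reboot for the change to take effect")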

This Microsoft article describes all TCP/IP configuration parameters, including TcpMaxDataRetransmissions:

 
