TCP Keepalive Best Practices - detecting network drops and preventing idle socket timeout

Article: 100028680
Last Published: 2013-04-19
Product(s): NetBackup

Problem

When a NetBackup job becomes active, the Job Manager (nbjm) on the master server creates TCP connections to the Backup Restore Manager (bpbrm) on the media server.  The master server monitors the socket waiting for updates and job exit status.

The connection may drop without the TCP stack on the master server (and thus nbjm) becoming aware of it if any of the following occur: problems with the TCP stacks on the hosts or with the network between the hosts; catastrophic application failure on the media server (crash/core); explicit termination of the application process (kill or TaskMgr); or unusually long processing delays.

Depending on the root cause, the media server and client processes may continue and may even complete the operation successfully.

When the TCP stack on the master server eventually detects the lost connection, nbjm will report the job as failed and retry the job.

How can the dropped connection be detected sooner, so that job resources are released and the job is retried more quickly?

This situation may also be encountered on other NetBackup connections, such as client-initiated jobs using the wait (-w) option, VMware mapping information exchange, duplications, and image clean-up.
 

Error Message

Regardless of the job status determined by the media server, nbjm will eventually record a status 40 in the Job Details, typically just over 2 hours later.

12:43:08.415 [Debug] [CallbackQueue::queueRequest] queueing JL updateJobStatus : jobid=249061, birthtime=1221446343, status=40 -- retry count=-1(CallbackQueue.cpp:1212)
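
When scanning nbjm debug logs for affected jobs, a small parser can pull the job ID and status out of lines like the one above. The following Python sketch is an illustration only: the function name and the regular expression are assumptions based solely on the sample line shown here, and other log lines may be formatted differently.

```python
import re

# Pattern for the updateJobStatus line shown above; other nbjm log lines
# may be formatted differently.
JOB_STATUS_RE = re.compile(r"updateJobStatus\s*:\s*jobid=(\d+),.*?\bstatus=(\d+)")


def parse_job_status(line):
    """Return (jobid, status) from an updateJobStatus line, or None."""
    match = JOB_STATUS_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = (
    "12:43:08.415 [Debug] [CallbackQueue::queueRequest] queueing JL "
    "updateJobStatus : jobid=249061, birthtime=1221446343, status=40 "
    "-- retry count=-1(CallbackQueue.cpp:1212)"
)
result = parse_job_status(sample)
print(result)
```

Filtering the parsed results for status 40 gives a quick list of jobs that may have been affected by a dropped control connection.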

If the TCP keepalive retry configuration is set too low/short, additional failure symptoms are possible such as status 13, status 14, status 24, status 42, status 44, or status 636.

Cause

Most NetBackup tasks complete within seconds, and most jobs within a few minutes or perhaps an hour.  In situations like those above, NetBackup has a controlling process and connection waiting for return status while other processes and connections on other hosts complete the tasks for the job.  If those hosts, network segments, or processes are congested or behave sub-optimally at times, then the tasks may take longer and the overall result status may be delayed.

Compounding this situation, many sites implement an idle socket timeout in a network component, either on a firewall/device in the network or in the TCP stack on one of the hosts.  The timer, if it expires, silently drops the control connection before the other tasks for the job are completed.  No notifications are sent at that time, so the applications at either end of the connection are unaware.  If the idle socket timeout occurs on a firewall or other device between the hosts, the TCP stacks on both hosts are also unaware.

If the TCP stack on the media server is not reliably sending packets on the control connection, or the remote process has faulted or been terminated, or an idle socket timeout has dropped the connection, then nbjm will be unaware of the failure.  However, NetBackup sets the SO_KEEPALIVE socket option on the socket, so the master server operating system (O/S) will eventually send a TCP Keepalive packet.  The network should deliver the keepalive to the media server, and the TCP stack on that host should respond with an immediate TCP RST if the remote process is no longer running.  However, in the case of an idle socket timeout, the keepalive may be silently discarded by the device or software that dropped the connection.  The TCP stack that sent the keepalive should then retransmit the TCP Keepalive until it concludes that the connection is no longer valid.
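
To illustrate what setting SO_KEEPALIVE means at the socket level, the following Python sketch enables the option on a socket. This is not NetBackup code; the helper name and the per-socket timer overrides are assumptions for illustration. The TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT options exist only on some platforms, so the sketch applies them only where the socket module exposes them. Elsewhere, the system-wide settings discussed in this article apply.

```python
import socket


def enable_keepalive(sock, idle_s=None, interval_s=None, probes=None):
    """Enable TCP keepalive on sock; optionally override per-socket timers.

    Hypothetical helper for illustration only, not part of NetBackup.
    """
    # SO_KEEPALIVE asks the O/S to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket overrides are platform specific; skip any the platform lacks.
    overrides = (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", probes),
    )
    for name, value in overrides:
        option = getattr(socket, name, None)
        if option is not None and value is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock, idle_s=700, interval_s=10, probes=20)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

When an application sets only SO_KEEPALIVE, as described above, the probe timing comes from the O/S-wide parameters, which is why the tuning in this article is done at the operating system level.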

Once the master server O/S detects the socket is no longer valid, an error indication is provided to nbjm, the job fails and becomes eligible for retry.
 

Solution

The root problem is whatever prevents the media server host from sending data on the socket in the expected timeframe.  Analyze and resolve that problem to prevent the initial job failure.

As a work-around, to detect network drops more quickly, and retry jobs sooner, adjust the TCP Keepalive settings on the master server to send the keepalives more frequently and fail within a reasonable timeframe.  Settings which detect the failure within 5 to 15 minutes are appropriate for modern networks.

Note: Decreasing the TCP Keepalive interval sufficiently will also prevent idle socket timeout.

Below are tuning examples for several different platforms.  Please check your O/S vendor documentation for details on these or equivalent parameters.  The goals are three-fold:

  1. Send TCP Keepalives more frequently and detect loss of the remote endpoint within 15 minutes.
  2. Send TCP Keepalives successfully (within 15 minutes), before any idle socket timeout expires (typically 30 or 60 minutes).
  3. Make sure TCP Keepalives retry at least as robustly as TCP data retransmission to prevent spurious connection drop.  By default, Windows typically retransmits either 5 or 10 times and drops the connection within 10 to 150 seconds if no response.  By default, UNIX/Linux typically retransmits 10 to 20 times over 8 to 20 minutes.

Table 1: Parameters for different operating systems and sample values per the goals above.
  

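Operating System: AIX
  Parameter for frequency of probes: tcp_keepidle = 700 (default is 1440)
  Parameter for interval between failed probes: *tcp_keepintvl = 20 (default is 150)
  Parameter for max probes or time before connection failure: *tcp_keepcnt = 20 (default is 9)
  Units for timers: half-seconds

Operating System: HP-UX 11i
  Parameter for frequency of probes: tcp_keepalive_interval = 300000 (default is 7200000)
  Parameter for interval between failed probes: n/a (uses normal data retransmit settings)
  Parameter for max probes or time before connection failure: *tcp_keepalives_kill = 1; tcp_ip_abort_interval = 600000 (default is 600000)
  Units for timers: milliseconds

Operating System: Linux
  Parameter for frequency of probes: tcp_keepalive_time = 700 (default is 7200)
  Parameter for interval between failed probes: *tcp_keepalive_intvl = 10 (default is 75)
  Parameter for max probes or time before connection failure: *tcp_keepalive_probes = 20 (default is 9)
  Units for timers: seconds

Operating System: Solaris
  Parameter for frequency of probes: tcp_keepalive_interval = 420000 (default is 7200000)
  Parameter for interval between failed probes: n/a (uses normal data retransmit settings)
  Parameter for max probes or time before connection failure: *tcp_keepalive_abort_interval = 480000 (default is 480000)
  Units for timers: milliseconds

Operating System: Windows (requires reboot)
  Parameter for frequency of probes: KeepAliveTime = 750000 (default is 7200000)
  Parameter for interval between failed probes: KeepAliveInterval = 15000 (default is 1000)
  Parameter for max probes or time before connection failure: pre-2008: *TcpMaxDataRetransmission = 10 (default is 5); Win2008+: hard-coded at 10
  Units for timers: milliseconds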
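
As a rough check of the sample values in Table 1 against the 15-minute goal, the following Python sketch converts each platform's settings to seconds and adds them up. It assumes the approximate detection time is the idle time before the first probe plus the time spent retrying probes (interval multiplied by probe count, or the abort interval on HP-UX and Solaris). That formula is a simplification and the function name is hypothetical; actual timing depends on each TCP stack.

```python
# Sample values from Table 1, converted to seconds.
# Assumption: detection time ~= idle time + time spent retrying probes.
UNITS_PER_SECOND = {"half-seconds": 2, "seconds": 1, "milliseconds": 1000}


def detection_seconds(idle, retry_span, units):
    """Approximate time from last traffic until the connection is failed."""
    return (idle + retry_span) / UNITS_PER_SECOND[units]


samples = {
    "AIX": detection_seconds(700, 20 * 20, "half-seconds"),
    "HP-UX 11i": detection_seconds(300000, 600000, "milliseconds"),
    "Linux": detection_seconds(700, 10 * 20, "seconds"),
    "Solaris": detection_seconds(420000, 480000, "milliseconds"),
    "Windows": detection_seconds(750000, 15000 * 10, "milliseconds"),
}
# Linux defaults: 7200 s idle, then 9 probes 75 s apart.
linux_default = detection_seconds(7200, 75 * 9, "seconds")

for name, seconds in samples.items():
    print(f"{name}: {seconds / 60:.1f} minutes")
print(f"Linux default: {linux_default / 60:.1f} minutes")
```

Under this approximation, the Linux defaults come to 7875 seconds, which is consistent with the status 40 appearing just over 2 hours after the connection is lost.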

Windows Server 2008 Additional Registry Entries

* Caution: When decreasing the TCP Keepalive interval, be sure to review the keepalive retry interval and retry count; the latter may need to be increased.  If they are set lower than normal TCP data retransmission settings, and there is packet loss, then decreasing the TCP keepalive interval may expose connections to an increased risk of connection drop due to failure to receive at least one TCP keepalive acknowledgement.

* Caution: Decreasing the timespan during which TCP Keepalives will be retransmitted, if less than the TCP data retransmission timespan, may limit the length of temporary network outage that could normally be overcome by retransmission.  You need to make an explicit decision as to which is more important, detecting a dropped connection quickly or allowing a longer period of retries to potentially overcome a network outage.
 

Table 2: Commands to display the current values for TCP parameters.
 
Operating System   Command
AIX                no -o parameter (or) no -a
HP-UX 11i          ndd -get /dev/tcp parameter
Linux              sysctl -n net.ipv4.parameter (or) sysctl -a
Solaris            ndd -get /dev/tcp parameter
Windows            From the Start menu, choose Run and enter regedit to view the parameter located in the registry key HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters
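
On Linux, the net.ipv4 sysctl values are also exposed as files under /proc/sys/net/ipv4, so the current settings can be read from a script. The following Python sketch is an illustration only: the function name and the base_dir parameter are hypothetical, and the parameter names are the Linux ones from Table 1.

```python
from pathlib import Path

# Linux parameter names from Table 1.
KEEPALIVE_PARAMS = (
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)


def read_keepalive_settings(base_dir="/proc/sys/net/ipv4"):
    """Return {parameter: int value} for each parameter file that exists."""
    settings = {}
    for name in KEEPALIVE_PARAMS:
        path = Path(base_dir) / name
        if path.is_file():
            settings[name] = int(path.read_text().strip())
    return settings


if __name__ == "__main__":
    for name, value in read_keepalive_settings().items():
        print(f"net.ipv4.{name} = {value}")
```

On platforms without these files, the function returns an empty dictionary rather than raising an error.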

 
Table 3: Commands to change the TCP parameters.  Check your O/S documentation for how to make those values persist through a reboot.
 
Operating System   Command
AIX                no -o parameter=new_value
HP-UX 11i          ndd -set /dev/tcp parameter new_value
Linux              sysctl -w net.ipv4.parameter=new_value
Solaris            ndd -set /dev/tcp parameter new_value
Windows            Run regedit to edit the Windows Registry key located in the path HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters.
                   Some parameters require a restart of the computer for the change to take effect.
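
To apply the Linux sample values from Table 1 consistently across several hosts, the commands can be generated rather than typed by hand. The following Python sketch only renders the command strings in the Table 3 format and does not execute them. The function and variable names are hypothetical, and running the commands requires root privileges.

```python
# Linux sample values from Table 1 (seconds, seconds, probe count).
LINUX_SAMPLE = {
    "tcp_keepalive_time": 700,
    "tcp_keepalive_intvl": 10,
    "tcp_keepalive_probes": 20,
}


def sysctl_commands(settings):
    """Render one 'sysctl -w' command per parameter, per Table 3."""
    return [
        f"sysctl -w net.ipv4.{name}={value}"
        for name, value in settings.items()
    ]


commands = sysctl_commands(LINUX_SAMPLE)
print("\n".join(commands))
```

Values set this way do not survive a reboot; consult the O/S documentation for the persistent equivalent.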
 

Applies To

Any environment that is experiencing network connection drops.

Older (pre-6.5) versions of NetBackup with clustered media servers may experience this problem if there is a hardware or software fault and the cluster fails over to the passive node.
 

References

Etrack : 3064492
