TCP Keepalive Best Practices - detecting network drops and preventing idle socket timeout

Article: 100028680
Last Published: 2022-07-08
Product(s): NetBackup

Problem

When a NetBackup job becomes active, the Job Manager (nbjm) on the master server creates TCP connections to the Backup Restore Manager (bpbrm) on the media server.  The master server monitors the socket waiting for updates and job exit status.

If there are problems with the TCP stacks on the hosts, or network between the hosts, or catastrophic application failure on the media server (crash/core), or explicit termination of the application process (kill or TaskMgr), or unusually long processing delays, then the connection may drop and the TCP stack on the master server (and thus nbjm) will be unaware of the situation.

Depending on the root cause, the media server and client processes may continue and may even complete the operation successfully.

When the TCP stack on the master server eventually detects the lost connection, nbjm will report the job as failed and retry the job.

How can the dropped connection be detected sooner, so that job resources are released and the job is retried more quickly?

This situation may also be encountered on other NetBackup connections, such as client initiated jobs using the wait option (bpbackup -w ..., bprestore -w ...), VMware mapping information exchange,  duplications, and image clean-up.
 

Error Message

Regardless of the job status determined by the media server, nbjm will eventually record a status 40 in the Job Details, typically just over 2 hours later.

12:43:08.415 [Debug] [CallbackQueue::queueRequest] queueing JL updateJobStatus : jobid=249061, birthtime=1221446343, status=40 -- retry count=-1(CallbackQueue.cpp:1212)

If the TCP keepalive retry configuration is set too low/short, additional failure symptoms are possible such as status 13, status 14, status 24, status 42, status 44, or status 636.

Cause

Most NetBackup tasks complete within seconds, most jobs within a few minutes or perhaps an hour.  In situations like those above, NetBackup has a controlling process and connection waiting for return status while other processes and connections on other hosts complete the tasks for the job.  If those hosts, network segments, or processes are congested or behave sub-optimally at times, then the tasks may take longer and overall results status may be delayed.  

Compounding this situation, many sites implement an idle socket timeout in a network component, either on a firewall or other device in the network or in the TCP stack on one of the hosts.  If the timer expires, it silently drops the control connection before the other tasks for the job are completed.  No notifications are sent at that time, so the applications at either end of the connection are unaware.  If the idle socket timeout occurs on a firewall or other device between the hosts, the TCP stacks on both hosts are also unaware.

If the TCP stack on the media server is not reliably sending packets on the control connection, or the remote process has faulted or been terminated, or an idle socket timeout has dropped the connection, then nbjm will be unaware of the failure.  NetBackup will, however, have set the SO_KEEPALIVE socket option on the socket, so the master server operating system (O/S) will eventually send a TCP Keepalive packet.  The network should deliver the keepalive probe to the media server, and the TCP stack on that host should respond with an immediate TCP RST if the remote process is no longer running.  In the case of an idle socket timeout, however, the keepalive may be silently discarded by the device or software that dropped the connection.  The TCP stack that sent the keepalive should then retransmit it until it concludes the connection is no longer valid.
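
As an illustration of what setting SO_KEEPALIVE involves, the following is a minimal Python sketch, not NetBackup's own code. The helper name enable_keepalive and its arguments are hypothetical. The per-socket overrides (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT) exist only on some platforms, so the sketch applies them only when the local Python build exposes them; without them the kernel-wide settings discussed in this article apply.

```python
import socket

def enable_keepalive(sock, idle=None, interval=None, count=None):
    """Turn on SO_KEEPALIVE; optionally override kernel defaults per socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides are platform specific, so each option is applied
    # only if a value was given and this Python build exposes the constant.
    overrides = (
        ("TCP_KEEPIDLE", idle),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval),  # time between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the drop
    )
    for name, value in overrides:
        option = getattr(socket, name, None)
        if option is not None and value is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
# A non-zero value means the kernel will probe this connection when idle.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

With only SO_KEEPALIVE set, as in this usage, the timing of the probes is governed entirely by the operating system parameters.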

Once the master server O/S detects that the socket is no longer valid, an error indication is provided to nbjm, and the job fails and becomes eligible for retry.
 

Solution

The root problem is whatever prevents the media server host from sending data on the socket in the expected timeframe.  Analyze and resolve that problem to prevent the initial job failure.

As a work-around, to detect network drops more quickly, and retry jobs sooner, adjust the TCP Keepalive settings on the master server to send the keepalives more frequently and fail within a reasonable timeframe.  Settings which detect the failure within 5 to 15 minutes are appropriate for modern networks.

Note: Decreasing the TCP Keepalive interval sufficiently will also prevent idle socket timeout.

What to change?

Below are tuning examples for several different platforms.  Please check your O/S vendor documentation for details on these or equivalent parameters.  The goals are three-fold:

  1. Send TCP Keepalives more frequently and detect loss of the remote endpoint within 15 minutes.
  2. Send TCP Keepalives successfully (within 15 minutes), before any idle socket timeout expires (typically 30 or 60 minutes).
  3. Make sure TCP Keepalives retry at least as robustly as TCP data retransmission to prevent spurious connection drop.  By default, Windows typically retransmits either 5 or 10 times and drops the connection within 10 to 150 seconds if no response.  By default, UNIX/Linux typically retransmits 10 to 20 times over 8 to 20 minutes.

Table 1: Parameters for different operating systems and sample values per the goals above.
  

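AIX (units for timers: half-seconds)
  Parameter for frequency of probes:            tcp_keepidle = 700 (default is 1440)
  Parameter for interval between failed probes: *tcp_keepintvl = 20 (default is 150)
  Parameter for max probes or time before connection failure: *tcp_keepcnt = 20 (default is 9)

HP-UX 11i (units for timers: milliseconds)
  Parameter for frequency of probes:            tcp_keepalive_interval = 300000 (default is 7200000)
  Parameter for interval between failed probes: n/a (uses normal data retransmit settings)
  Parameter for max probes or time before connection failure: *tcp_keepalives_kill = 1, tcp_ip_abort_interval = 600000 (default is 600000)

Linux (units for timers: seconds)
  Parameter for frequency of probes:            tcp_keepalive_time = 700 (default is 7200)
  Parameter for interval between failed probes: *tcp_keepalive_intvl = 10 (default is 75)
  Parameter for max probes or time before connection failure: *tcp_keepalive_probes = 20 (default is 9)

Solaris (reboot may be required; units for timers: milliseconds)
  Parameter for frequency of probes:            tcp_keepalive_interval = 420000 (default is 7200000)
  Parameter for interval between failed probes: n/a (uses normal data retransmit settings)
  Parameter for max probes or time before connection failure: *tcp_keepalive_abort_interval = 480000 (default is 480000)

Windows (requires reboot; units for timers: milliseconds)
  Parameter for frequency of probes:            KeepAliveTime = 750000 (default is 7200000)
  Parameter for interval between failed probes: KeepAliveInterval = 15000 (default is 1000)
  Parameter for max probes or time before connection failure: pre-2008: *TcpMaxDataRetransmission = 10 (default is 5); Win2008+: hard-coded at 10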

Windows Server 2008 Additional Registry Entries

* Caution: When decreasing the TCP Keepalive interval, be sure to review the keepalive retry interval and retry count; the latter may need to be increased.  If they are set lower than normal TCP data retransmission settings, and there is packet loss, then decreasing the TCP keepalive interval may expose connections to an increased risk of connection drop due to failure to receive at least one TCP keepalive acknowledgement.

* Caution: Decreasing the timespan during which TCP Keepalives will be retransmitted, if less than the TCP data retransmission timespan, may limit the length of temporary network outage that could normally be overcome by retransmission.  You need to make an explicit decision as to which is more important, detecting a dropped connection quickly or allowing a longer period of retries to potentially overcome a network outage.
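
The sample values in Table 1 can be checked against the 15-minute goal with simple arithmetic. The sketch below is an approximation under stated assumptions, not a vendor formula: it treats the worst-case detection time as the idle time before the first probe plus the time spent retrying, where the retry span is interval times count on platforms that have both, and the abort interval on HP-UX and Solaris. The function name detection_seconds is hypothetical.

```python
def detection_seconds(idle, retry_span, units_per_second):
    """Idle time before the first probe plus the time spent retrying it."""
    return (idle + retry_span) / units_per_second

# Sample values from Table 1, in each platform's native timer units.
samples = {
    "AIX": detection_seconds(700, 20 * 20, 2),              # half-seconds
    "HP-UX 11i": detection_seconds(300000, 600000, 1000),   # milliseconds
    "Linux": detection_seconds(700, 10 * 20, 1),            # seconds
    "Solaris": detection_seconds(420000, 480000, 1000),     # milliseconds
    "Windows": detection_seconds(750000, 15000 * 10, 1000), # milliseconds
}

for name, seconds in samples.items():
    print(f"{name}: {seconds:.0f} seconds ({seconds / 60:.1f} minutes)")
```

Under these assumptions the HP-UX, Linux, Solaris, and Windows samples each come to 900 seconds (15 minutes), and the AIX sample to 550 seconds.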
 

Table 2: Commands to display the current values for TCP parameters.
 
Operating System   Command
AIX                no -o parameter (or) no -a
HP-UX 11i          ndd -get /dev/tcp parameter
Linux              sysctl -n net.ipv4.parameter (or) sysctl -a
Solaris            ndd -get /dev/tcp parameter
Windows            From the Start menu, choose Run and enter regedit to view the parameter located in the Registry key HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters
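
On Linux, the values that sysctl reports under net.ipv4 are also exposed as files under /proc/sys/net/ipv4, so they can be collected by a script. The following is a minimal sketch assuming a Linux host; the function name read_keepalive_settings and its base argument are hypothetical, and on other platforms the function simply returns an empty result.

```python
from pathlib import Path

KEEPALIVE_PARAMS = (
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)

def read_keepalive_settings(base="/proc/sys/net/ipv4"):
    """Return the keepalive parameters found under base as integers."""
    settings = {}
    for name in KEEPALIVE_PARAMS:
        path = Path(base) / name
        # Skip parameters that do not exist (for example on non-Linux hosts).
        if path.is_file():
            settings[name] = int(path.read_text().strip())
    return settings

for name, value in read_keepalive_settings().items():
    print(f"net.ipv4.{name} = {value}")
```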

 
Table 3: Commands to change the TCP parameters.  Check your O/S documentation for how to make those values persist through a reboot.
 
Operating System   Command
AIX                no -o parameter=new_value
HP-UX 11i          ndd -set /dev/tcp parameter new_value
Linux              sysctl -w net.ipv4.parameter=new_value
Solaris            ndd -set /dev/tcp parameter new_value
Windows            Run regedit to edit the Windows Registry key located in the path HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters.
                   Some parameters require a restart of the computer for the change to take effect.


On which host to make the kernel tuning change?

Typically the application program at each end of the network connection should be setting the SO_KEEPALIVE option on the socket, so the operating system on each host should be sending TCP Keepalives at the configured interval.  Accordingly the kernel tuning change could be made on either or both hosts, with the following considerations.

The NetBackup primary server is a good candidate because it is dedicated to NetBackup operations and not other applications, so connectivity to the media servers and many clients from that host will benefit from changing the one host.  The NetBackup media servers are also good candidates because that would benefit the connections to many client hosts, as well as to the primary server.  Changing the client hosts involves tuning potentially tens or hundreds of hosts which is a lot more work, plus those hosts are typically running other applications and the connections used by those applications will get the same TCP Keepalive settings which may or may not be desirable.

Whether the tuning change requires a reboot or is picked up dynamically may also be a consideration when deciding where to implement the tuning.  For example, the change could be made immediately on one host for short-term benefit, then reverted once the tuning has been put in place on the remote host for long-term benefit.

How to confirm the change? 

To confirm the change, use 'netstat -naopt' (on Linux) or take a TCP packet capture, using tcpdump, snoop, windump, or Wireshark, of an idle connection between the changed host and a remote NetBackup host with which it normally communicates. The following command shows the connection(s) to be monitored in the capture and holds the connections open and idle on the changed/connecting host for 7210 seconds (just over two hours), while the other tools monitor for the TCP Keepalive to be sent and a response to be received:

bptestbpcd -client <remote_hostname> -wait_to_close 7210 -auth_only
1 1 1
127.0.0.1:44588 -> 127.0.0.1:45208 PROXY 192.168.1.15:54336 -> 192.168.1.12:1556
127.0.0.1:58195 -> 127.0.0.1:42938 PROXY 192.168.1.15:48070 -> 192.168.1.12:1556
192.168.1.15:54052 -> 192.168.1.12:1556

On the accepting host, bpcd will time out after 300 seconds, leaving the connecting host with the auth-only socket in TCP CLOSE_WAIT and the secure connections in TCP ESTABLISHED, and the kernel will send TCP Keepalives per the tuning.  On Linux, the shortened keepalive timer used for the countdown and transmission can be observed in the netstat output:

...snip...
Thu Jul  7 15:51:30 CDT 2022
tcp  0  0 192.168.1.15:48070  192.168.1.12:1556  ESTABLISHED 11826/vnetd       keepalive (53.27/0/0)
tcp  0  0 192.168.1.15:54052  192.168.1.12:1556  ESTABLISHED 20056/bptestbpcd  keepalive (53.27/0/0)
tcp  0  0 192.168.1.15:54336  192.168.1.12:1556  ESTABLISHED 11826/vnetd       keepalive (53.26/0/0)

Here, at 15:51:33.292, bpcd timed out after 300 seconds waiting for the next BPCD_*_RQST.

Thu Jul  7 15:51:40 CDT 2022
tcp  0  0 192.168.1.15:48070  192.168.1.12:1556  ESTABLISHED 11826/vnetd       keepalive (43.23/0/0)
tcp  1  0 192.168.1.15:54052  192.168.1.12:1556  CLOSE_WAIT  20056/bptestbpcd  keepalive (43.23/0/0)
tcp  0  0 192.168.1.15:54336  192.168.1.12:1556  ESTABLISHED 11826/vnetd       keepalive (43.23/0/0)
Thu Jul  7 15:51:51 CDT 2022
tcp  0  0 192.168.1.15:48070  192.168.1.12:1556  ESTABLISHED 11826/vnetd       keepalive (33.19/0/0)
tcp  1  0 192.168.1.15:54052  192.168.1.12:1556  CLOSE_WAIT  20056/bptestbpcd  keepalive (33.19/0/0)
tcp  0  0 192.168.1.15:54336  192.168.1.12:1556  ESTABLISHED 11826/vnetd       keepalive (33.19/0/0)
Thu Jul  7 15:52:01 CDT 2022
tcp  0  0 192.168.1.15:48070  192.168.1.12:1556  ESTABLISHED 11826/vnetd       keepalive (23.16/0/0)
tcp  1  0 192.168.1.15:54052  192.168.1.12:1556  CLOSE_WAIT  20056/bptestbpcd  keepalive (23.16/0/0)
tcp  0  0 192.168.1.15:54336  192.168.1.12:1556  ESTABLISHED 11826/vnetd       keepalive (23.16/0/0)
Thu Jul  7 15:52:11 CDT 2022
tcp  0  0 192.168.1.15:48070  192.168.1.12:1556  ESTABLISHED 11826/vnetd       keepalive (13.13/0/0)
tcp  1  0 192.168.1.15:54052  192.168.1.12:1556  CLOSE_WAIT  20056/bptestbpcd  keepalive (13.13/0/0)
tcp  0  0 192.168.1.15:54336  192.168.1.12:1556  ESTABLISHED 11826/vnetd       keepalive (13.13/0/0)
Thu Jul  7 15:52:21 CDT 2022
tcp  0  0 192.168.1.15:48070  192.168.1.12:1556  ESTABLISHED 11826/vnetd       keepalive (3.09/0/0)
tcp  1  0 192.168.1.15:54052  192.168.1.12:1556  CLOSE_WAIT  20056/bptestbpcd  keepalive (3.09/0/0)
tcp  0  0 192.168.1.15:54336  192.168.1.12:1556  ESTABLISHED 11826/vnetd       keepalive (3.09/0/0)

Here, the timer set by tcp_keepalive_time=75 counted down to 0 and started over.

Thu Jul  7 15:52:31 CDT 2022
tcp  0  0 192.168.1.15:48070  192.168.1.12:1556  ESTABLISHED 11826/vnetd       keepalive (68.31/0/0)
tcp  1  0 192.168.1.15:54052  192.168.1.12:1556  CLOSE_WAIT  20056/bptestbpcd  keepalive (68.31/0/0)
tcp  0  0 192.168.1.15:54336  192.168.1.12:1556  ESTABLISHED 11826/vnetd       keepalive (68.31/0/0)

Above, the second field in the keepalive counters did not increment from 0 to 1.
This indicates that TCP replies were received and the connections are still connected host-to-host.
If not connected, replies would not have been received, and the retransmit fields would have incremented.
If max retransmissions are sent without receiving a reply, the connection is dropped/reset by the kernel.
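
For unattended monitoring, the timer column of the netstat output can be parsed by a script. The sketch below is illustrative only: the function names and the field labels (countdown, retransmits, probes) are assumptions based on the netstat -o timer format shown above, and the check flags a connection when either counter is non-zero.

```python
import re

# Matches a trailing timer column such as "keepalive (53.27/0/0)".
TIMER_RE = re.compile(r"(\w+) \((\d+(?:\.\d+)?)/(\d+)/(\d+)\)\s*$")

def parse_timer(line):
    """Return the timer fields from one netstat -o line, or None."""
    match = TIMER_RE.search(line)
    if match is None:
        return None
    return {
        "timer": match.group(1),
        "countdown": float(match.group(2)),  # seconds until the timer fires
        "retransmits": int(match.group(3)),
        "probes": int(match.group(4)),
    }

def keepalive_unanswered(line):
    """True if a keepalive timer shows any non-zero retry counter."""
    fields = parse_timer(line)
    if fields is None or fields["timer"] != "keepalive":
        return False
    return fields["retransmits"] > 0 or fields["probes"] > 0

sample = ("tcp  0  0 192.168.1.15:48070  192.168.1.12:1556  "
          "ESTABLISHED 11826/vnetd       keepalive (53.27/0/0)")
print(parse_timer(sample))
print(keepalive_unanswered(sample))
```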

On any operating system, a packet capture will also show the TCP Keepalives and, if the hosts are still connected, the TCP Keepalive replies.

$ tcpdump -n -nn -i eth2 host 192.168.1.12 and port 1556
...snipped application traffic when bptestbpcd first connected to bpcd.
15:51:09.409304 IP 192.168.1.12.1556 > 192.168.1.15.48070: Flags [P.], seq 5547:5549, ack 3943, win 192, ... length 2
15:51:09.410528 IP 192.168.1.12.1556 > 192.168.1.15.54336: Flags [P.], seq 5482:5521, ack 4530, win 259, ... length 39
15:51:09.410825 IP 192.168.1.15.54336 > 192.168.1.12.1556: Flags [.], ack 5521, win 335, ... length 0
15:51:09.448804 IP 192.168.1.15.48070 > 192.168.1.12.1556: Flags [.], ack 5549, win 335, ... length 0
...Above, the last two TCP segments containing application data (length >0) and the TCP ACK for each of them.
...Here, all connections are idle for tcp_keepalive_time = 75 seconds.
...Below, the local TCP sends a TCP Keepalive on each connection and the remote TCP replies with a TCP ACK.
15:52:24.314806 IP 192.168.1.15.54052 > 192.168.1.12.1556: Flags [.], ack 5737, win 357, ... length 0
15:52:24.314830 IP 192.168.1.12.1556 > 192.168.1.15.54052: Flags [.], ack 4191, win 214, ... length 0
15:52:24.408832 IP 192.168.1.15.48070 > 192.168.1.12.1556: Flags [.], ack 5549, win 335, ... length 0
15:52:24.408861 IP 192.168.1.12.1556 > 192.168.1.15.48070: Flags [.], ack 3943, win 192, ... length 0
15:52:24.410663 IP 192.168.1.15.54336 > 192.168.1.12.1556: Flags [.], ack 5521, win 335, ... length 0
15:52:24.410673 IP 192.168.1.12.1556 > 192.168.1.15.54336: Flags [.], ack 4530, win 259, ... length 0
...Here, all connections are again idle for tcp_keepalive_time = 75 seconds.
15:53:39.578785 IP 192.168.1.15.54052 > 192.168.1.12.1556: Flags [.], ack 5737, win 357, ... length 0
15:53:39.578808 IP 192.168.1.12.1556 > 192.168.1.15.54052: Flags [.], ack 4191, win 214, ... length 0
15:53:39.578841 IP 192.168.1.15.48070 > 192.168.1.12.1556: Flags [.], ack 5549, win 335, ... length 0
15:53:39.578843 IP 192.168.1.12.1556 > 192.168.1.15.48070: Flags [.], ack 3943, win 192, ... length 0
15:53:39.578847 IP 192.168.1.15.54336 > 192.168.1.12.1556: Flags [.], ack 5521, win 335, ... length 0
15:53:39.578853 IP 192.168.1.12.1556 > 192.168.1.15.54336: Flags [.], ack 4530, win 259, ... length 0
...repeat every 75 seconds until the application either sends data or closes the connections.
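
The idle period before the first keepalive can be confirmed from the capture timestamps. The following sketch takes two lines from the capture above for the connection on local port 54336, the final ACK of application data and the first keepalive, and computes the gap between them. The function name capture_time is hypothetical, and the sketch assumes both lines fall on the same day.

```python
from datetime import datetime

def capture_time(line):
    """Parse the leading HH:MM:SS.ffffff timestamp of a tcpdump line."""
    return datetime.strptime(line.split()[0], "%H:%M:%S.%f")

# Final ACK of application data on the connection from port 54336.
last_ack = ("15:51:09.410825 IP 192.168.1.15.54336 > 192.168.1.12.1556: "
            "Flags [.], ack 5521, win 335, ... length 0")
# First TCP Keepalive sent on the same connection.
keepalive = ("15:52:24.410663 IP 192.168.1.15.54336 > 192.168.1.12.1556: "
             "Flags [.], ack 5521, win 335, ... length 0")

# Timestamps carry no date, so this does not handle a capture spanning midnight.
gap = (capture_time(keepalive) - capture_time(last_ack)).total_seconds()
print(f"idle time before first keepalive: {gap:.3f} seconds")
```

The gap should be very close to the configured tcp_keepalive_time, which was 75 seconds in this test.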

Applies To

Any environment that is experiencing network connection drops.

Older (pre-6.5) versions of NetBackup with clustered media servers may experience this problem if there is a hardware or software fault and the cluster fails over to the passive node.
 

References

Etrack : 3064492
