VMware backup with "File Level Recovery" fails with Status 636 due to disconnect between Master and Media Server
Problem
When performing a VMware backup with File Level Recovery, the socket between the master server and the media server can sit idle if no updates are sent to the master server for an extended period. A firewall between the two servers may then close the idle socket. By default, most firewalls close a connection after 2 hours without traffic.
Error Message
Detail Status:
3/31/2014 11:55:45 AM - begin writing
3/31/2014 11:57:09 AM - Info bpbkar32(pid=2844) 0 entries sent to bpdbm
read from input socket failed(636)
3/31/2014 7:24:46 PM - Error bpbrm(pid=3644) could not write FILE ADDED message to OUTSOCK
3/31/2014 7:24:51 PM - Error bpbrm(pid=3644) could not write FILE ADDED message to OUTSOCK
3/31/2014 7:24:56 PM - Error bpbrm(pid=3644) could not write FILE ADDED message to OUTSOCK
3/31/2014 7:25:01 PM - Error bpbrm(pid=3644) could not write FILE ADDED message to OUTSOCK
3/31/2014 7:25:06 PM - Error bpbrm(pid=3644) could not write FILE ADDED message to OUTSOCK
3/31/2014 7:25:11 PM - Error bpbrm(pid=3644) could not write FILE ADDED message to OUTSOCK
3/31/2014 7:25:16 PM - Error bpbrm(pid=3644) could not write FILE ADDED message to OUTSOCK
3/31/2014 7:25:21 PM - Error bpbrm(pid=3644) could not write FILE ADDED message to OUTSOCK
3/31/2014 7:25:27 PM - Error bpbrm(pid=3644) could not write FILE ADDED message to OUTSOCK
3/31/2014 7:25:32 PM - Error bpbrm(pid=3644) could not write FILE ADDED message to OUTSOCK
3/31/2014 7:25:39 PM - Error bpbrm(pid=3644) could not write FILE ADDED message to OUTSOCK
3/31/2014 7:25:45 PM - Error bpbrm(pid=3644) db_FLISTsend failed: no entity was found (227)
3/31/2014 7:27:22 PM - Info bpbkar32(pid=0) bpbkar waited 0 times for empty buffer, delayed 0 times.
3/31/2014 7:27:22 PM - Critical bpbrm(pid=3644) unexpected termination of client VMCLIENT
3/31/2014 7:29:23 PM - Error bpbrm(pid=3644) could not write EXIT STATUS to OUTSOCK
3/31/2014 7:29:23 PM - Info bpbkar32(pid=0) done. status: 227: no entity was found
bpbrm log snippet from the media server:
12:10:23.472 [3644.4264] <2> bpbrm main: ADDED FILES TO DB FOR VMCLIENT_1396281339 250 v4recovery
19:24:46.066 [3644.4264] <2> put_strlen_str: cannot write data to network: An existing connection was forcibly closed by the remote host.
19:24:46.066 [3644.4264] <16> bpbrm main: could not write FILE ADDED message to OUTSOCK
19:24:46.066 [3644.4264] <2> set_job_details: Tfile (263102): LOG 1396308286 16 bpbrm 3644 could not write FILE ADDED message to OUTSOCK
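To estimate how long the socket was idle, the gap between the last successful bpbrm entry (12:10:23) and the first write failure (19:24:46) in the snippet above can be computed. The following is a minimal sketch only: the helper names to_seconds and idle_gap_seconds are hypothetical, and it assumes both timestamps fall on the same day.

```shell
# Convert an HH:MM:SS[.mmm] timestamp to whole seconds since midnight.
# Fractional seconds are ignored.
to_seconds() {
  echo "$1" | awk -F'[:.]' '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print the gap in seconds between two same-day timestamps ($1 earlier, $2 later).
idle_gap_seconds() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Timestamps taken from the bpbrm log snippet.
gap=$(idle_gap_seconds 12:10:23.472 19:24:46.066)
printf 'idle for %d s (%dh %dm %ds)\n' "$gap" $((gap / 3600)) $((gap % 3600 / 60)) $((gap % 60))
```

For the timestamps above this reports 26063 seconds (7h 14m 23s), well beyond a 2-hour idle timeout.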
Cause
The issue occurs when a device (such as a firewall) between the master server and the media server closes the open socket while it is idle.
Solution
- Identify what is closing the socket and address the issue.
- Change the default OS KeepAliveTime to lower than the identified device disconnect time.
For example, if the firewall closes the connection in 2 hours, then a 15 minute OS KeepAliveTime may help avoid the issue.
To keep the firewall from dropping idle sockets, either lengthen the idle socket timeout on the firewall or shorten the TCP keepalive interval on the hosts on either side of the firewall. The interval must be less than the idle socket timeout setting on the firewall. The default interval is 2 hours, which is much too long for most sites. An interval of 15 minutes is usually appropriate, but use a shorter one if needed.
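The comparison described above can be checked on a Linux host with a short script. This is a sketch under stated assumptions: the helper name keepalive_beats_firewall is hypothetical, and the 7200-second firewall timeout is only an example value that should be replaced with the timeout identified for the actual device.

```shell
# Hypothetical helper: succeed when the keepalive idle time ($1, seconds)
# is strictly shorter than the firewall idle timeout ($2, seconds).
keepalive_beats_firewall() {
  [ "$1" -lt "$2" ]
}

firewall_timeout=7200   # assumed firewall idle timeout: 2 hours
proc_file=/proc/sys/net/ipv4/tcp_keepalive_time

# Only Linux exposes this file; on other systems the check is skipped.
if [ -r "$proc_file" ]; then
  current=$(cat "$proc_file")
  if keepalive_beats_firewall "$current" "$firewall_timeout"; then
    echo "OK: keepalive ${current}s is below firewall timeout ${firewall_timeout}s"
  else
    echo "WARNING: keepalive ${current}s is not below firewall timeout ${firewall_timeout}s"
  fi
fi
```

Run the check on the hosts on both sides of the firewall, since either side may be the one expected to keep the socket active.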
| Operating System | Parameter for frequency of probes | Values | Commands |
| --- | --- | --- | --- |
| AIX | tcp_keepidle | 1,800 half-seconds | $ no -o tcp_keepidle=1800 |
| HP-UX 11i | tcp_keepalive_interval | 900,000 ms | $ ndd -set /dev/tcp tcp_keepalive_interval 900000 |
| Linux | tcp_keepalive_time | 900 seconds | $ sysctl -w net.ipv4.tcp_keepalive_time=900 |
| Solaris | tcp_keepalive_interval | 900,000 ms | $ ndd -set /dev/tcp tcp_keepalive_interval 900000 |
| Windows | KeepAliveTime | 900,000 ms | See Related Documentation below. |
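On Linux, sysctl -w changes only the running kernel, so the value is lost at the next reboot unless it is also written to a sysctl configuration file. The following sketch shows one way to do that. The helper name write_keepalive_conf and the file name 90-netbackup-keepalive.conf are assumptions chosen for illustration, not names taken from NetBackup documentation.

```shell
# Write a sysctl configuration file that sets the keepalive idle time.
# $1 = target file, $2 = idle time in seconds (defaults to 900).
write_keepalive_conf() {
  printf 'net.ipv4.tcp_keepalive_time = %s\n' "${2:-900}" > "$1"
}

# Example usage as root (the file name is an assumption):
#   write_keepalive_conf /etc/sysctl.d/90-netbackup-keepalive.conf 900
#   sysctl -p /etc/sysctl.d/90-netbackup-keepalive.conf
```

The sysctl -p call loads the file immediately, so the setting takes effect without waiting for a reboot.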
Related Documentation: NetBackup™ Backup Planning and Performance Tuning Guide (veritas.com)
