Problem
When initiating a restore from a Windows VCS clustered primary server you may receive the error "Client / Server handshake failed". Restore (bprestore/etc) and browse/list (bplist/etc) requests from other hosts may fail with status 26 or status 23, or be delayed several minutes before the job goes active or the list request completes.
In this situation, bpclntcmd -pn run from the active node of the primary server will also fail with a connection timeout. Similar delays or failures may occur when the requesting host is remote from the primary server.
Error Message
Errors observed in the bprestore debug log on the primary server:
<2> get_long_base: (1) cannot read (byte 1) from network: Connection timed out.
<2> get_bprdRCVD: get_string() failed, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (10060), network read error
<2> bprestore: get_bprdRCVD() returned -3
When running bpclntcmd -pn -verbose -debug from the primary server we see the following errors.
<2> get_long_base: (1) cannot read (byte 1) from network: Connection timed out.
<2> get_bprdresp: get_string() failed, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (10060), network read error
The commands above all connect to the bprd process on the primary server. The bprd PID handling each request must determine whether the connecting peername is a known/expected hostname by having the database manager (bpdbm) search for a matching configured client name. That search spans the altnames database, the policy clients database, and all network hostname aliases for the known hostnames. If any portion of the search is delayed excessively, the requesting process may time out before the search finishes.
This bprd debug log PID shows a get_ccname request to bpdbm that did not return for 52 seconds.
12:38:09.626 [92936.54260] <2> get_ccname: determine configured name for requesting-host.domain
+ 00:52.591
12:39:02.217 [92936.54260] <2> get_ccname: configured name is: requesting-host.domain
These bprd debug log PIDs show the connection to bpdbm, but do not show a successful completion of the bpdbm query.
12:38:09.861 [bprd.54260] <2> get_ccname: determine configured name for other-requestor.domain
...snip...
12:38:09.861 [bprd.54260] <2> logconnections: BPDBM CONNECT FROM 10.1.1.11.63895 TO 10.1.1.13.13721 fd = 972
...doesn't show get_ccname results even though the log continues for hours...
The corresponding bpdbm debug log PIDs show long delays in the O/S: first when starting a bpdbm child process to perform the search, then name service resolution delays of 2-3 seconds for many hostnames. Those name service queries were also unsuccessful (WINSOCK 11001 = WSAHOST_NOT_FOUND). There are over 1,500 name service lookup failures; at roughly 2.3 seconds each that spans more than an hour, hence the delay and timeout.
12:40:19.913 [bpdbm.33632] <2> logconnections: BPDBM ACCEPT FROM 10.1.1.11.63492 TO 10.1.1.13.13721 fd = 1224
...snip...
12:40:19.929 [118592.33632] <4> bpdbm: VERBOSE = 5
...snip...
12:40:19.929 [118592.33632] <2> db_logimagerec: protocol version 15 qtype 84 vers 0 client other-requestor.domain policy ?
...snip...
12:40:19.929 [118592.33632] <2> image_db: Q_IMAGE_GET_CCNAME
...snip...
12:40:19.975 [118592.33632] <2> init_cache: [vnet_hosts.c:862] host_failed_cache_size=40, cache_time=3600
12:40:22.304 [118592.33632] <2> vnet_same_host_and_update: [vnet_addrinfo.c:3215] Comparing name1:[p10512-bu.net] name2:[p10512-bu]
12:40:22.304 [118592.33632] <8> vnet_cached_getaddrinfo_and_update: [vnet_addrinfo.c:1851] in failed file cache ERR=11001 NAME=p10512-bu SVC=NULL
... The hostname above failed to resolve in the prior hour and the failure was cached by
... NetBackup, so this failure completed quickly because name services were not consulted.
...snip...
12:40:24.634 [118592.33632] <2> read_all_clients:
... begins reading the policy client files, and checking for hostname aliases for each.
...snip...
... The hostnames below are stale in the NetBackup Host Cache, and
... each subsequent query to name services takes >2 seconds to get WSAHOST_NOT_FOUND.
12:40:30.073 [118592.33632] <2> vnet_cached_getaddrinfo_and_update: [vnet_addrinfo.c:1822] file cache has expired NAME=p505483 SVC=NULL
12:40:32.355 [118592.33632] <8> retry_getaddrinfo_for_real: [vnet_addrinfo.c:1230] getaddrinfo() failed RV=11001 NAME=p505483 SVC=0
--
12:40:33.231 [118592.33632] <2> vnet_cached_getaddrinfo_and_update: [vnet_addrinfo.c:1822] file cache has expired NAME=p506354 SVC=NULL
12:40:35.560 [118592.33632] <8> retry_getaddrinfo_for_real: [vnet_addrinfo.c:1230] getaddrinfo() failed RV=11001 NAME=p506354 SVC=0
--
12:40:35.592 [118592.33632] <2> vnet_cached_getaddrinfo_and_update: [vnet_addrinfo.c:1822] file cache has expired NAME=p506381 SVC=NULL
12:40:37.920 [118592.33632] <8> retry_getaddrinfo_for_real: [vnet_addrinfo.c:1230] getaddrinfo() failed RV=11001 NAME=p506381 SVC=0
--
12:40:37.951 [118592.33632] <2> vnet_cached_getaddrinfo_and_update: [vnet_addrinfo.c:1822] file cache has expired NAME=p506404 SVC=NULL
12:40:40.280 [118592.33632] <8> retry_getaddrinfo_for_real: [vnet_addrinfo.c:1230] getaddrinfo() failed RV=11001 NAME=p506404 SVC=0
--
12:40:40.280 [118592.33632] <2> vnet_cached_getaddrinfo_and_update: [vnet_addrinfo.c:1822] file cache has expired NAME=p506485 SVC=NULL
12:40:42.678 [118592.33632] <8> retry_getaddrinfo_for_real: [vnet_addrinfo.c:1230] getaddrinfo() failed RV=11001 NAME=p506485 SVC=0
--
12:40:42.694 [118592.33632] <2> vnet_cached_getaddrinfo_and_update: [vnet_addrinfo.c:1822] file cache has expired NAME=p506504 SVC=NULL
12:40:45.070 [118592.33632] <8> retry_getaddrinfo_for_real: [vnet_addrinfo.c:1230] getaddrinfo() failed RV=11001 NAME=p506504 SVC=0
... snipped 1,499 more such name services failures ...
13:41:43.615 [118592.33632] <2> vnet_cached_getaddrinfo_and_update: [vnet_addrinfo.c:1822] file cache has expired NAME=p12430 SVC=NULL
13:41:45.928 [118592.33632] <8> retry_getaddrinfo_for_real: [vnet_addrinfo.c:1230] getaddrinfo() failed RV=11001 NAME=p12430 SVC=0
...snip...
13:41:45.959 [118592.33632] <2> db_sendSTATUS: Status:227
13:41:45.959 [118592.33632] <2> process_request: request complete: exit status 227 no entity was found; query type: 84
Example #2
This is an Exchange DAG restore request from a Windows client to a Linux primary server, demonstrating two related aspects. First, it has cascading restore jobs where the requesting host for the child jobs is localhost (the primary server). Second, name services on the primary server are non-responsive, so there are delays checking hostnames.
The bprd debug log shows the original restore request originated from the NbWin GUI at 10.1.1.100, which name services on the primary server resolved to exchangeclient, which NetBackup recognized as the configured client exchangedag. Notice the get_ccname request here took only 202 milliseconds.
11:45:18.836 [39658.39658] <2> connected_peer: Connection from host exchangeclient, 10.1.1.100, on non-reserved port 30776
...snip...
11:45:18.836 [39658.39658] <2> logconnections: BPRD ACCEPT FROM 127.0.0.1.59040 TO 127.0.0.1.15356 fd = 3
11:45:18.836 [39658.39658] <2> logconnections: PROXY ACCEPT FROM 10.1.1.100.30776 TO 10.1.1.1.1556 fd = 3 REMOTE PROXY PID 9004 PEER PID 27768 NbWin
...snip...
11:45:18.847 [39658.39658] <2> process_request: command C_BPRESTORE_7_6 (169) received
...snip...
11:45:18.847 [39658.39658] <2> get_ccname: determine configured name for exchangeclient
...snipped the fast successful lookup.
11:45:19.049 [39658.39658] <2> get_ccname: configured name is: exchangedag
Restore of the Exchange DAG requires an additional job. The bprd PID above connects to the listener to start another bprd. The first bprd encounters a 300-second timeout waiting for the second bprd to finish and fails with status 23.
11:45:35.618 [39658.39658] <2> logconnections: BPRD CONNECT FROM 127.0.0.1.3949 TO 127.0.0.1.13720 fd = 3
11:50:35.618 [39658.39658] <2> get_long_base: (1) cannot read (byte 1) from network: Interrupted system call (4)
11:50:35.618 [39658.39658] <2> get_bprdRCVD: get_string() failed, Interrupted system call (4), connection timed out
11:50:35.618 [39658.39658] <2> make_bpresolver_call: get_bprdRCVD() returned -2
11:55:35.619 [39658.39658] <2> get_long_base: (1) cannot read (byte 1) from network: Interrupted system call (4)
11:55:35.619 [39658.39658] <2> get_bprdresp: get_string() failed, Interrupted system call (4), connection timed out
...snip...
11:55:35.619 [39658.39658] <2> set_exit_status: entered; status = 23
The second bprd performs the same steps as the first, but now the connection originates from the primary server rather than the client host. Hence the source IP address is 127.0.0.1, which resolves to 'localhost', and the get_ccname request begins a search to determine whether that hostname is known to NetBackup.
11:45:35.620 [39949.39949] <2> connected_peer: Connection from host localhost, 127.0.0.1, on non-reserved port 3949
...snip...
11:45:35.620 [39949.39949] <2> logconnections: BPRD ACCEPT FROM 127.0.0.1.3949 TO 127.0.0.1.13720 fd = 3
...snip...
11:45:35.658 [39949.39949] <2> process_request: command C_BPRESTORE_SYNC_DAG (135) received
...snip...
11:45:35.658 [39949.39949] <2> get_ccname: determine configured name for localhost
...snip...
11:45:35.660 [39949.39949] <2> logconnections: BPDBM CONNECT FROM 10.1.1.1.3740 TO 10.1.1.1.13721 fd = 6
... Hangs here waiting on bpdbm, until the first bprd times out.
The bpdbm debug log shows the query criteria from bprd, and then the getaddrinfo API calls to the O/S name services for the various hostnames already known to the primary server. Many calls return EAI_AGAIN because name services are often unable to answer the requests in a timely fashion. This happens repeatedly, delaying the validation of the hostname 'localhost' and causing the timeout in bprd to be exceeded.
11:45:35.660 [39807] <2> logconnections: BPDBM ACCEPT FROM 10.1.1.1.3740 TO 10.1.1.1.13721 fd = 10
...snip...
11:45:35.660 [39807] <2> db_logimagerec: protocol version 16 qtype 84 vers 0 client localhost policy ?
...snip...
11:45:35.660 [39807] <2> image_db: Q_IMAGE_GET_CCNAME
...snip...
11:45:35.660 [39807] <2> configured_client_name: searching for client matching localhost
...snip...
11:45:54.798 [39807] <8> retry_getaddrinfo_for_real: ... EAI_AGAIN retry NAME=localhost.mydomain TRYN=0
11:46:13.890 [39807] <8> retry_getaddrinfo_for_real: ... EAI_AGAIN took too long NAME=localhost.mydomain TIME=30
11:46:37.904 [39807] <2> vnet_legacy_cached_gethostbyname: ... TRY_AGAIN current=0, max=3
11:46:57.045 [39807] <2> vnet_legacy_cached_gethostbyname: ... DNS lookup slow for host localhost.mydomain, elapsed time=44
...snipped many other timeouts resolving other hostnames.
...during which the first bprd timed out.
12:01:16.197 [39807] <8> retry_getaddrinfo_for_real: ... EAI_AGAIN took too long NAME=client2 TIME=30
12:01:16.197 [39807] <8> retry_getaddrinfo: ... retry_getaddrinfo_for_real failed RV=-3 NAME=client2 SVC=0
12:01:35.334 [39807] <8> retry_getaddrinfo_for_real: ... EAI_AGAIN retry NAME=client3 TRYN=0
12:01:54.415 [39807] <8> retry_getaddrinfo_for_real: ... EAI_AGAIN took too long NAME=client3 TIME=30
12:01:54.415 [39807] <8> retry_getaddrinfo: ... retry_getaddrinfo_for_real failed RV=-3 NAME=client3 SVC=0
12:02:13.557 [39807] <8> retry_getaddrinfo_for_real: ... EAI_AGAIN retry NAME=clientX TRYN=0
... log file was gathered at this time, the search had not yet finished.
Cause
Fast and reliable name resolution is critical for NetBackup functionality. Resolution delays cause NetBackup processes to take seconds or minutes, instead of the normal fraction of a second, to resolve each hostname. When many hostnames cannot be resolved quickly, the cumulative delay can reach minutes or hours and cause NetBackup operations to time out and fail. The problem may be compounded by the use of IP addresses or hostnames that have not been defined to NetBackup and/or name services.
During the restore process NetBackup performs a get_ccname request to resolve host names. The time required to resolve all of the hostnames involved can delay, and sometimes fail, the restore.
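The per-lookup delay described above can be confirmed outside of NetBackup. As a rough illustration (a sketch, not a NetBackup tool), this Python snippet times a forward lookup the same way a getaddrinfo() caller such as bpdbm would experience it; a healthy name service answers in a fraction of a second.

```python
import socket
import time

def time_lookup(name):
    """Time one forward lookup via getaddrinfo(); return (seconds, error-or-None)."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(name, None)
        error = None
    except socket.gaierror as exc:  # e.g. WSAHOST_NOT_FOUND or EAI_AGAIN
        error = exc
    return time.monotonic() - start, error

elapsed, error = time_lookup("localhost")
print(f"localhost: {elapsed:.3f}s {'OK' if error is None else error}")
```

A lookup that takes multiple seconds, or that errors for a hostname NetBackup is configured to use, reproduces the pattern seen in the bpdbm logs above.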
Solution
Make sure the NetBackup host cache configuration settings have not been changed from the defaults, so that name service lookups are still cached for an hour. This helps minimize calls to name services.
UNIX/Linux: nbgetconfig | grep HOST_CACHE
Windows: nbgetconfig | findstr HOST_CACHE
HOST_CACHE_TTL = 3600
HOST_CACHE_RESYNC_INTERVAL = 900
HOST_CACHE_RESYNC_TIMEOUT = 120
Make sure that name services can consistently and quickly resolve all hostnames known to NetBackup: both those configured in NetBackup, and those that result from resolving the source/inbound IP addresses from which connections arrive.
- All hostnames in the NetBackup configuration (nbgetconfig).
- All current clients (bpplclients -allunique).
- All past clients that still have backup images (ls -ld netbackup/db/images/*).
- All peernames from the IP addresses assigned to the primary server (bplocaladdrs).
- All peernames from the IP addresses assigned to all nodes of a clustered primary server.
- All peernames from the IP addresses assigned to the media servers and client hosts.
- All mapped hostnames (nbhostmgmt -list -short).
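The checklist above can be audited in bulk by resolving each name and flagging failures or slow responses. This is an illustrative sketch, not a NetBackup utility; the 1-second threshold and the example name list are assumptions, and the real input would be the hostnames gathered from the commands above.

```python
import socket
import time

def check_names(names, slow_threshold=1.0):
    """Resolve each hostname; return (name, seconds, failed) for failed or slow lookups."""
    problems = []
    for name in names:
        start = time.monotonic()
        try:
            socket.getaddrinfo(name, None)
            failed = False
        except socket.gaierror:
            failed = True
        elapsed = time.monotonic() - start
        if failed or elapsed > slow_threshold:
            problems.append((name, elapsed, failed))
    return problems

# Feed in the names gathered from the commands above (illustrative list here).
for name, elapsed, failed in check_names(["localhost"]):
    print(f"{name}: {'FAILED' if failed else 'SLOW'} ({elapsed:.1f}s)")
```

Any name reported as FAILED or SLOW is a candidate for the cumulative delays shown in the bpdbm logs and should be corrected in name services.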
Workarounds
If name services cannot be fixed in a timely fashion, especially when time-critical restore operations need to be performed, the following workarounds may bypass some or all of the slow/failed name resolution attempts.
Option 1. Perform the restore from a different host if possible: the primary server instead of the client, or the client instead of the primary server.
The peername for the client/primary may be identified more quickly than when the restore request originates from the primary/client, especially if the client/primary has only one IP address and network hostname.
Option 2. Add a policy (it can be inactive) that contains all the hostnames of the hosts involved in the restore: the requesting host, the source/browse host, and the destination/target host. The hostnames for each host should include the short name, the fully qualified domain name, and any backup/private network hostnames in case name service resolution changes in the future, especially the peername to which the primary server resolves the requesting source IP address.
Option 3. Configure the host.xlate file on the primary server.
This file is searched first during get_ccname processing. A match avoids the need to perform these additional searches:
- Scan the policy database and read the policy and policy client files.
- Scan the image database and read the client subdirectory names.
- Perform subsequent name resolution of those client names to see if any has a network alias hostname matching the peername for the requesting/inbound host source IPs.
An example of useful entries for the host.xlate file is given below. The first block is for the peernames for the primary server. The second block is for the peernames from which a multi-NIC client might connect.
more <install_path>\netbackup\db\altnames\host.xlate
0 localhost cluster-name.public
0 cluster-name.public cluster-name.public
0 cluster-name.backup cluster-name.backup
0 cluster cluster-name.public
0 node1.public cluster-name.public
0 node1.backup cluster-name.public
0 node1.net-3 cluster-name.public
0 node1 cluster-name.public
0 node2.public cluster-name.public
0 node2.backup cluster-name.public
0 node2.net-3 cluster-name.public
0 node2 cluster-name.public
0 client-pub client-pub
0 client-bk client-pub
0 client.net3 client-pub
0 client-alias client-pub
Column #1 is always a zero (0).
Column #2 is the peername to which the primary server resolved the requesting/source IP.
Column #3 is the hostname by which NetBackup knows the peer host, i.e. policy client.
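When maintaining a large host.xlate file, a quick format check can catch malformed rows before they are put in place. The sketch below is an illustration, not a Veritas tool; it enforces only the three-column layout described above (whether blank lines are tolerated by the real file is an assumption here).

```python
def parse_host_xlate(text):
    """Parse host.xlate content into {peername: configured client name}.

    Every non-blank line must have the form: 0 <peername> <client-name>.
    """
    mapping = {}
    for lineno, line in enumerate(text.splitlines(), 1):
        cols = line.split()
        if not cols:
            continue  # tolerate blank lines (assumption)
        if len(cols) != 3 or cols[0] != "0":
            raise ValueError(f"host.xlate line {lineno}: expected '0 <peername> <client-name>'")
        mapping[cols[1]] = cols[2]
    return mapping

sample = """\
0 localhost cluster-name.public
0 node1.backup cluster-name.public
0 client-bk client-pub
"""
print(parse_host_xlate(sample))
```

Each peername (column #2) maps to the name NetBackup should treat it as (column #3), mirroring the example file above.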
These workarounds help protect the get_ccname functionality; they do not replace a fast and reliable name service. The lack of one will still negatively affect other parts of NetBackup and should be addressed.