Considerations for using cluster hostname failover without IP address failover with the NetBackup Host Cache
Problem
The virtual hostname of a cluster is transferred from a failed node to a newly active node in the cluster, but the IP address to which the virtual hostname resolves is not transferred. Instead, name services are updated so that the already known cluster hostname resolves to a new IP address.
This causes applications that cache hostname-to-IP mappings to have an outdated view of the network topology, and connections to the clustered hostname will fail until the cache is updated from name services (NIS, DNS, or even hosts files).
Implementations that may be affected:
- DNS IP Failover
- GEO Clusters
- Any other cluster technology that transfers a cluster hostname, but not the cluster IP address, between nodes
- Any other cluster technology that transfers the NetBackup var directory, which houses the NetBackup Host Cache, between nodes
Note: This does not apply to most VCS and MSC clusters where the cluster IP also fails over to the newly active node and the cluster hostname to cluster IP mapping does not change. But it can apply to VCS global clusters.
NetBackup versions 7.0.1 and above implement a host cache. Inter-host connections and other name resolutions may fail for up to an hour, or until the host cache is cleared or its entries expire and are refreshed.
DNS IP Failover implements a hostname and IP address for each node in the cluster, but there is no IP address specific to the cluster, only a cluster hostname. The DNS server is custom configured to determine which node is actively hosting the clustered application and associates the IP address of that node with the cluster hostname. Upon failover, DNS is simply updated to change the IP address associated with the cluster hostname from the original node IP to the previously passive node IP.
GEO Clustering is similar, plus it replicates the application volume between the nodes in the cluster. This makes a copy of the NetBackup var directory from the original node available to NetBackup processes on the newly active node. Because the var directory contains the host cache, the newly active node uses the stale hostname-to-IP mapping files from the failed node.
 
Error Message
Scenario #1: Failover of the virtual hostname of a clustered master server.
There will be a cascade of problems when failover occurs.
- When the NetBackup master server component tries to start on the formerly passive node, the VCS AGENT_DEBUG.log shows the following.
 
 Wed Oct 26 13:26:55 2011 vcs/Online.pl: Calling: /usr/openv/netbackup/bin/cluster/util/online /usr/openv/netbackup/bin/cluster/NBU_RSP
 Wed Oct 26 13:26:55 2011 vcs/online.pl: /usr/openv/netbackup/bin/cluster/util/online /usr/openv/netbackup/bin/cluster/NBU_RSP exited with 99
 
 This occurs because the formerly passive node will have previously received from name services the virtual hostname and IP address of the failed node. It expects to connect to itself via the virtual hostname, but those connections will go to the IP address on the failed node, and db_srv and the other master server processes will fail to start.
 
- Other NetBackup processes on the newly active node may be similarly confused, as shown by this portion of the bprd debug log.
 <2> bprd: lock file fd = 5
 <16> running_as_a_master_server: Running on an inactive node of a cluster.
 <2> bprd: nbu-master is not the primary server
- Some NetBackup media server processes, still using the old hostname-to-IP mapping, will be unable to connect to the master server and will show as offline.
 
 Host properties > Media Servers show status 'Offline'
 
- Some clients making client-directed requests (including the XBSA type agents for Datastore, DB2, Informix, Oracle, SAP, SQL-Server, Sybase) will also fail to reach the newly active node for the master server.
Scenario #2: Failover of the virtual hostname of a clustered client
There may or may not be a failure.
- Scheduled backups will fail with status 25 if the failed node was contacted in the prior hour and is no longer reachable.
 
- If the failed node is reachable, then the backup will proceed and may succeed or fail for other reasons. However, this backup will be of the failed node, not the active node so the backup image captured may not be as expected.
Cause
NetBackup 7.0.1 and above implement a host cache to store the results of name services forward lookups (hostname to IP address). This minimizes the delays that may otherwise be encountered repeatedly waiting for an external name service to look up a hostname, especially one that cannot be resolved by the primary name server and is forwarded to a secondary or tertiary server. The cache also greatly reduces the number of requests that a busy NetBackup environment will make to name services, allowing the DNS servers to respond to other requests much more quickly.
The NetBackup host cache retains entries with a time-to-live (TTL) of 3600 seconds by default. Depending on when the failover occurs, it may be up to an hour before the entry is refreshed from name services. Once the refresh occurs, normal NetBackup operations will resume.
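The "up to an hour" figure follows directly from the TTL arithmetic. The sketch below is illustrative only: the function name is hypothetical and it assumes an entry is refreshed as soon as its TTL has elapsed since it was cached.

```shell
# Seconds remaining until a cached entry expires and can be refreshed.
# cached_at and now are epoch seconds; ttl defaults to 3600, the default
# TTL stated above. The function name is hypothetical.
seconds_until_refresh() {
    cached_at=$1
    now=$2
    ttl=${3:-3600}
    remaining=$(( cached_at + ttl - now ))
    # An entry that has already expired needs no further waiting.
    if [ "$remaining" -lt 0 ]; then
        remaining=0
    fi
    echo "$remaining"
}

# Entry cached 600 seconds before the failover: 3000 seconds of stale data remain.
seconds_until_refresh 1000 1600
```

An entry cached just before the failover gives the worst case, a wait of the full TTL.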
Solution
Steps A & B below should be taken on the newly active and newly passive cluster nodes.
A delay of up to an hour before resuming successful operations may be acceptable for many hosts at many sites; typically NetBackup file system clients or NetBackup media servers that are not busy most hours of the day. Those hosts do not need any special adjustments when the master server faults and the cluster fails over to the passive node.
If there are time-critical backup and restore operations, then steps may need to be taken on the master server and any hosts which communicate with the cluster.
A. The clusterware that starts the NetBackup processes on the newly active cluster node must clear the NetBackup host cache prior to starting the NetBackup processes. This is necessary to allow NetBackup to recognize that the cluster hostname/IP is now local to this host.
UNIX/Linux: /usr/openv/netbackup/bin/bpclntcmd -clear_host_cache
Windows: <install>\Veritas\NetBackup\bin\bpclntcmd -clear_host_cache
B. Ideally, the clusterware would also run the same command on the newly passive cluster node. This is necessary to allow NetBackup to recognize that the cluster hostname/IP is no longer local to this host.
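How the clusterware invokes the command depends on the cluster product. As a minimal sketch for UNIX/Linux, a hypothetical wrapper function that a pre-online or post-offline hook could call might look like this. The function name and the BPCLNTCMD override variable are assumptions for illustration; only the bpclntcmd path and the -clear_host_cache option come from this article.

```shell
# Path to bpclntcmd; override BPCLNTCMD for non-default install locations.
BPCLNTCMD=${BPCLNTCMD:-/usr/openv/netbackup/bin/bpclntcmd}

# Clear the NetBackup host cache, failing loudly if the binary is missing
# so that the clusterware can log the problem.
clear_nbu_host_cache() {
    if [ ! -x "$BPCLNTCMD" ]; then
        echo "bpclntcmd not found or not executable: $BPCLNTCMD" >&2
        return 1
    fi
    "$BPCLNTCMD" -clear_host_cache
}
```

On the newly active node the hook would call clear_nbu_host_cache before the NetBackup processes are started; on the newly passive node it would call it after they have stopped.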
C. Determine whether there are other NetBackup hosts with an urgent need to communicate with the cluster in the first hour after failover.
 
 For clustered master servers this may include:
 * Media servers
 * Clients that initiate backup or restore operations 24x7; notably XBSA-using database agents such as Datastore, DB2, Informix, Oracle, SAP, SQL-Server, and Teradata
 * Targeted AIR master servers
 * Administrative and reporting consoles; e.g. OpsCenter, APTARE
 
 For clustered clients this may include:
 * Media servers
 * Master servers that are either performing stream discovery, or backing up XBSA-using database agents.
 * For SharePoint farms, the other frontend and SQL-Server hosts.
 
D. Ensure the hosts from step C will be able to resume communication with the cluster, within the timeframe needed, after cluster failover.
 
 Option #1
 Clear the NetBackup Host Cache on each of those hosts immediately after failover.
 
 Option #2
 If option #1 cannot be coordinated in near real-time, determine the length of time that is acceptable for detecting the cluster failover. It should be as long as possible and still meet the needs of the business.
 
 Then review the Host Cache TTL on each host and note those that will not detect the cluster failover in time; they will have to be adjusted in step F or G.
 
 Display the Host Cache TTL. The current value is displayed, followed by the default value in brackets. Both values are in seconds.
 
 UNIX/Linux: /usr/openv/netbackup/bin/bptestnetconn -bv | grep 3600
 Windows: <install_path>\Veritas\NetBackup\bin\bptestnetconn -bv | findstr 3600
 
 VNET_OPTIONS (cache time) = 900 [3600]
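 When many hosts have to be reviewed, the two values can be extracted from that line with a small filter. This is a sketch only: the function names are hypothetical, and the parser assumes the output line has exactly the shape of the sample above.

```shell
# Read bptestnetconn -bv output on stdin and print "current default"
# (both in seconds) from a line shaped like:
#   VNET_OPTIONS (cache time) = 900 [3600]
parse_cache_time() {
    sed -n 's/.*(cache time) = \([0-9][0-9]*\) \[\([0-9][0-9]*\)].*/\1 \2/p'
}

# Succeed when the current TTL is longer than the acceptable detection
# window, i.e. the host is a candidate for step F or G.
ttl_needs_adjustment() {
    current=$1
    max_acceptable=$2
    [ "$current" -gt "$max_acceptable" ]
}
```

 For example, on UNIX/Linux: /usr/openv/netbackup/bin/bptestnetconn -bv | parse_cache_time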
 
E. On the master server, ensure that jobs will retry for at least as long as necessary for the hosts noted in steps C & D.
 
 Display the current configuration on the master server: (defaults shown)
UNIX/Linux: /usr/openv/netbackup/bin/admincmd/bpconfig -U
Windows: <install>\Veritas\NetBackup\bin\admincmd\bpconfig -U
Job Retry Delay:              10 minutes
...snip...
Backup Tries:                 2 time(s) in 12 hour(s)
Coordinate the adjustment of the Job Retry Delay and the Backup Tries to ensure that jobs will retry for a period of time longer than it takes to get the host caches cleared or expired and eligible to be refreshed.
In this example, the delay between tries is increased from 10 to 30 minutes, and the number of tries is increased from 2 to 3, so that the first try is at time 0, the second try at >30 minutes, and the third try at >60 minutes, by which time all caches would be expired and refreshed.
bpconfig -wi 30
bpconfig -tries 3 -period 12
Note-1: Adjusting the Job Retry Delay and Backup Tries upward is preferred to decreasing the Host Cache TTL.
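The retry arithmetic in the example can be checked before changing the configuration. The sketch below uses hypothetical function names and assumes each retry starts as soon as the Job Retry Delay has elapsed; in practice retries start somewhat later, which only lengthens the window.

```shell
# Minutes from the first try to the last retry.
retry_window_minutes() {
    delay_min=$1
    tries=$2
    echo $(( delay_min * (tries - 1) ))
}

# Succeed when the last retry falls at or after host cache expiry.
# ttl_sec defaults to the 3600 second default TTL.
retries_outlast_ttl() {
    delay_min=$1
    tries=$2
    ttl_sec=${3:-3600}
    window_min=$(retry_window_minutes "$delay_min" "$tries")
    [ $(( window_min * 60 )) -ge "$ttl_sec" ]
}
```

With the defaults (10 minutes, 2 tries) the window is only 10 minutes, so retries_outlast_ttl 10 2 fails; with 30 minutes and 3 tries the window is 60 minutes and the check succeeds.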
Cautions for step F:
1. Before shortening the Host Cache TTL, it is strongly recommended that the SERVER and MEDIA_SERVER lists on the host be made concise. Delete all entries for hosts that do not communicate with this host. Do the same for policy clients on the master server. This will minimize the number of hostnames that need to be resolved on a frequent basis.
2. Be aware that shortening the TTL will partially defeat the purpose and benefits of the host cache. This will increase queries to the external name services and may negatively impact both NetBackup and other applications.
3. Be aware that if name services do not consistently respond in a timely fashion, any shortening of the TTL will magnify the already present negative impact to NetBackup operations.
4. If shortening the TTL, use the largest value possible that is still small enough to meet service objectives. The value should never be less than the frequency with which the DNS or clusterware detects and performs failover. Values below 300 seconds are strongly discouraged and should only be used if name services are consistently very fast. If the failover must be detected quickly, but name services are not reliably fast, then consider using step G instead of step F.
F. Update the Host Cache TTL on the hosts from step D option #2 that require a lower value.
 
 echo HOST_CACHE_TTL=1200 | nbsetconfig
 
 UNIX/Linux: /usr/openv/netbackup/bin/nbsetconfig
 Windows: <install>\Veritas\NetBackup\bin\nbsetconfig
 
 Note-2: See Cautions 1-4 above.
 
 Note-3: If the clustered host is the master server, and the TTL is being decreased to a value of 600 - 900, then also decrease HOST_CACHE_RESYNC_INTERVAL to a value 300 seconds less than the TTL. Values below 300 may do more harm than good; in those cases, continue to use the default of 900 seconds.
 
 echo HOST_CACHE_RESYNC_INTERVAL=600 | nbsetconfig
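 The rule in Note-3 can be written out as a small helper. This is one reading of the note, not a documented formula: the function name is hypothetical, and the assumption is that any TTL outside the 600 - 900 range leaves the resync interval at its default of 900 seconds.

```shell
# Suggest a HOST_CACHE_RESYNC_INTERVAL (seconds) for a given HOST_CACHE_TTL.
# TTL 600 - 900: 300 seconds less than the TTL.
# Any other TTL: keep the default of 900 (assumed reading of Note-3).
resync_interval_for_ttl() {
    ttl=$1
    if [ "$ttl" -ge 600 ] && [ "$ttl" -le 900 ]; then
        echo $(( ttl - 300 ))
    else
        echo 900
    fi
}
```

 A TTL of 900 gives 600, matching the example command in Note-3, and a TTL of 1200 leaves the interval at 900.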
 
 Note-4: Step F above can also be implemented from the master server using 'bpsetconfig -h hostname' instead of nbsetconfig, so that logging on to each client or media server host is not necessary. Step G below can only be implemented directly on the target hosts.
G. In some cases, it may be useful to use custom scripting on the hosts from step D option #2. The commands below can be used to expire just one or a few hostnames from the NetBackup Host Cache. They could be scripted to run as frequently as necessary to detect failover promptly.
 
 This would be most useful in two scenarios.
 - The clustered hosts are NetBackup clients, and shortening the TTL for all hostnames on the master and media servers would adversely affect the load on name services.
 - The clustered host is the master server, and shortening the TTL for all the hostnames on the media servers and many clients would adversely affect the load on name services.
 
 - Delete the individual host cache files for the clustered hostname and the IP addresses that could be associated with that hostname.
 
 Replace 'clusterHN' with the cluster hostname and '10.0.0.11' and '10.0.0.12' with appropriate values for the potential cluster IP addresses.
 
 UNIX/Linux:
 cd /usr/openv/var/host_cache
 /bin/rm */*clusterHN*.txt
 /bin/rm */*10.0.0.11*.txt
 /bin/rm */*10.0.0.12*.txt
 Windows:
 cd <install_path>\Veritas\NetBackup\var\host_cache
 del *clusterHN*.txt /S
 del *10.0.0.11*.txt /S
 del *10.0.0.12*.txt /S
- Update the modification time on the file which indicates the host cache has been cleared, ideally replacing the contents with the current epoch seconds. This file is in the parent directory.
 
 UNIX/Linux:
 ...wait for at least one second, then...
 cd ..
 touch clear_cache_time.txt
 or
 perl -le 'print ( time )' > clear_cache_time.txt
 
 Windows:
 ...wait for at least one second, then...
 cd ..
 echo see_file_mtime > clear_cache_time.txt
 or
 powershell [Int](Get-Date -UFormat '%s') > clear_cache_time.txt
 or
 perl -le "print ( time )" > clear_cache_time.txt
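 
 The UNIX/Linux commands of step G can be combined into one function suitable for cron or a clusterware trigger. This is a sketch under assumptions: the function name is hypothetical, and it uses date +%s for the epoch seconds, which is not available on every UNIX platform (the perl command above is the portable alternative).

```shell
# expire_cluster_entries CACHE_DIR NAME_OR_IP...
# Deletes the host cache files matching each name or IP address, then
# rewrites clear_cache_time.txt in the parent directory of CACHE_DIR.
expire_cluster_entries() {
    cache_dir=$1
    shift
    for pattern in "$@"; do
        # Cache files sit one directory below host_cache, as in step G.
        rm -f "$cache_dir"/*/*"$pattern"*.txt
    done
    # Wait at least one second so the new timestamp is later than the
    # deleted entries, then record the current epoch seconds.
    sleep 1
    date +%s > "$cache_dir/../clear_cache_time.txt"
}
```

 Example: expire_cluster_entries /usr/openv/var/host_cache clusterHN 10.0.0.11 10.0.0.12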
 
Need further assistance? Please contact NetBackup Technical Support for additional guidance with adjusting the host cache behaviors. The configuration keywords noted above are intentionally not documented to minimize the likelihood of changing them to inappropriate values and adversely affecting NetBackup operation. Hostname failover, without IP failover, is the only known reason for adjusting the default settings for the NetBackup Host Cache.
Applies To
NetBackup 7.x and 8.x and beyond.
