Media servers bounce between "Active for Disk" and "Active for Disk and Tape"

Article: 100021074
Last Published: 2010-01-29
Product(s): NetBackup & Alta Data Protection

Problem

GENERAL ERROR: Media servers bounce between "Active for Disk" and "Active for Disk and Tape."

Error Message

N/A: Media servers bounce between "Active for Disk" and "Active for Disk and Tape."

Cause

This article gives an overview of how NetBackup determines whether a media server is Active for Disk or Active for Disk and Tape.
It is useful when troubleshooting media servers whose state bounces between Active for Disk and Active for Disk and Tape.

This "state change" occurs when the heartbeat / handshake between a master server and a media server, which is attempted every 5 minutes, does not complete successfully.

Details:
Every 5 minutes, the ltid process on the Media Server initiates a heartbeat to nbemm on the Master Server. nbemm receives the heartbeat and sends an acknowledgement back to ltid on the Media Server. When VERBOSE is set in the vm.conf file, this exchange is logged in the ltid log.

On Unix: /usr/openv/volmgr/debug/ltid
On Windows: <install_path>\VERITAS\Volmgr\debug\ltid

11:58:17.759 [4744.3816] <4> SendEmmHeartbeat: Detected change in MachineState...
11:58:17.759 [4744.3816] <4> SendEmmHeartbeat: Detected change in MachineState... going Active

nbemm listens for heartbeat updates. Every minute, nbemm checks when it last received a heartbeat from each Media Server:

These checks are logged under OID 111 (nbemm), with internal OIDs 219 (resource event manager) and 144 (device allocator), all at verbose level 6:

4/28/2009 11:52:07.847 [Debug] NB 51216 da 144 PID:788 TID:7324 File ID:111 [No context] 5 [DA_Thread_Pool::CheckForHeartbeat] Host MyMedia05 last heartbeat time was 1240944701, heartbeat interval is 300
4/28/2009 11:53:07.864 [Debug] NB 51216 da 144 PID:788 TID:7324 File ID:111 [No context] 5 [DA_Thread_Pool::CheckForHeartbeat] Host MyMedia05 last heartbeat time was 1240944701, heartbeat interval is 300
4/28/2009 11:54:07.866 [Debug] NB 51216 da 144 PID:788 TID:7324 File ID:111 [No context] 5 [DA_Thread_Pool::CheckForHeartbeat] Host MyMedia05 last heartbeat time was 1240944701, heartbeat interval is 300
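
When reviewing a long nbemm log, the CheckForHeartbeat entries can be summarized with a short script. The following is a minimal sketch, assuming the message format matches the sample lines above. The function name parse_check_for_heartbeat and the regular expression are illustrative and are not part of NetBackup.

```python
import re

# Pattern assumed from the sample log lines above; other NetBackup versions may differ.
HEARTBEAT_RE = re.compile(
    r"\[DA_Thread_Pool::CheckForHeartbeat\] Host (?P<host>\S+) "
    r"last heartbeat time was (?P<last>\d+), heartbeat interval is (?P<interval>\d+)"
)

def parse_check_for_heartbeat(line):
    """Return (host, last_heartbeat_epoch, interval_seconds), or None if no match."""
    match = HEARTBEAT_RE.search(line)
    if match is None:
        return None
    return match["host"], int(match["last"]), int(match["interval"])

sample = (
    "4/28/2009 11:52:07.847 [Debug] NB 51216 da 144 PID:788 TID:7324 File ID:111 "
    "[No context] 5 [DA_Thread_Pool::CheckForHeartbeat] Host MyMedia05 "
    "last heartbeat time was 1240944701, heartbeat interval is 300"
)
print(parse_check_for_heartbeat(sample))
```

If the last heartbeat time for a host stays the same across several consecutive one-minute checks, as it does for MyMedia05 above, no new heartbeat has arrived in that period.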

nbemm expects to receive a heartbeat update every 5 minutes:

4/28/2009 11:57:11.855 [Diagnostic] NB 51216 rem 219 PID:9228 TID:9864 File ID:111 [No context] 1 V-219-1 [ResourceEventMgr_i::updateInfo ] Heartbeat received from host MyMedia05

If nbemm does not receive the heartbeat in 5 minutes, it allows an additional 60 second grace period and then changes the Media Server state accordingly:

4/28/2009 11:55:37.884 [Debug] NB 51216 da 144 PID:788 TID:7324 File ID:111 [No context] 5 [DA_Thread_Pool::CheckForHeartbeat] Host MyMedia09 last heartbeat time was 1240944576, heartbeat interval is 300
4/28/2009 11:55:37.884 [Debug] NB 51216 da 144 PID:788 TID:7324 File ID:111 [No context] 1 [DA_Thread_Pool::CheckForHeartbeat] Found machine < MyMedia09 > not sending heartbeat
4/28/2009 11:55:37.884 [Application] NB 51216 da 144 PID:788 TID:7324 File ID:111 [No context] [Error] V-144-1049 EMMServer generic error = Machine < MyMedia09 > not sending heartbeat, changing to OFFLINE

Convert the ctime above to see the last successful heartbeat time:

E:\Program Files\Veritas\NetBackup\bin>bpdbm -ctime 1240944576
1240944576 = Tue Apr 28 11:49:36 2009

In this case, 6 minutes passed (5 minutes + 60 second grace period) between the last successful heartbeat and the state change: 11:49:36 until 11:55:37.
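
The same arithmetic can be reproduced outside NetBackup. The following is a minimal sketch, not a NetBackup tool. It assumes the master server's local time is UTC-7, which is inferred from the bpdbm output above, and all names in it are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the master server's local time is UTC-7, inferred from the bpdbm output.
LOCAL_TZ = timezone(timedelta(hours=-7))

last_heartbeat = 1240944576   # epoch value from the CheckForHeartbeat log line
heartbeat_interval = 300      # seconds, as logged by nbemm
grace_period = 60             # seconds, the default grace period

def fmt(epoch):
    """Format an epoch value in the style of the bpdbm -ctime output."""
    return datetime.fromtimestamp(epoch, LOCAL_TZ).strftime("%a %b %d %H:%M:%S %Y")

deadline = last_heartbeat + heartbeat_interval + grace_period
print("last heartbeat:", fmt(last_heartbeat))
print("deadline      :", fmt(deadline))
```

Under these assumptions the deadline works out to 11:55:36, one second before the state change logged at 11:55:37. That is consistent with nbemm acting at its first one-minute check after the deadline has passed.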

Alternatively, if nbemm receives the heartbeat, it responds to the sending Media Server (ltid) with an acknowledgement. If the acknowledgement cannot be delivered, nbemm assumes the Media Server is unreachable and performs a state change.

4/28/2009 12:02:59.223 [Debug] NB 51216 137 PID:9228 TID:9744 File ID:111 [No context] 4 [TAO] TAO (9228|9744) - PBXIOP_Transport[2248]::recv_i, read failure - An existing connection was forcibly closed by the remote host.
4/28/2009 12:02:59.223 [Debug] NB 51216 137 PID:9228 TID:9756 File ID:111 [No context] 4 [TAO] TAO (9228|9756) - PBXIOP_Transport[2248]::recv_i, read failure - An existing connection was forcibly closed by the remote host.

If ltid does not receive the acknowledgement from nbemm, a message is logged to the ltid log, but a retry is not performed - it simply continues to send a heartbeat every 5 minutes:

12:03:17.065 [4744.3816] <16> emmlib_UpdateMachineState: (0) updateMachineState failed, Error < 0x2DC6C4 >
12:03:17.065 [4744.3816] <16> SendEmmHeartbeat: (-) Translating EMM_ERROR_CorbaException(3000004) to 334 in the device management context
12:03:17.065 [4744.3816] <16> SendEmmHeartbeat: Could not update machine state, emm error code = 334
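
The hexadecimal error in the first of these lines and the decimal code in the second are the same value. This can be confirmed with a quick conversion (a Python one-off for illustration, not a NetBackup command):

```python
# 0x2DC6C4 is the error value logged by emmlib_UpdateMachineState.
emm_error = int("0x2DC6C4", 16)
print(emm_error)  # 3000004, the EMM_ERROR_CorbaException code in the next log line
```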

There can be many reasons why a heartbeat / handshake is unsuccessful, for example incorrect name resolution, an inaccessible network route, or network problems such as TCP port resets or dropped packets. Please search the knowledge base for articles relating to "Active for Disk" or "Active for Disk and Tape".

https://www.veritas.com/support/en_US/dpp.NetBackup

Solution

If the problem is caused by an unstable network environment, the NetBackup heartbeat / handshake communication can be tuned to allow more frequent heartbeat attempts from the Media Server and to extend the 60 second grace period on the Master Server.

Note: These configuration changes only mask the real problem if an unstable network environment exists. If there are no communication problems between the master server and the media server, these adjustments are not necessary. Although they might make NetBackup more resilient to the underlying network problems, those problems will continue to exist and may affect other NetBackup operations or other processes.

Media Server Modification:

Because ltid does not retry when a heartbeat fails or is not acknowledged, it can help to shorten the heartbeat interval from the default of 5 minutes (300 seconds), for example to 2 minutes (120 seconds).

Create a new vm.conf entry:
SCAN_HOST_STATUS_INTERVAL = 120

On Unix: /usr/openv/volmgr/vm.conf
On Windows: <install_path>\VERITAS\Volmgr\vm.conf

Note: For this setting to take effect, a restart of the NetBackup Device Manager (ltid) and NetBackup Volume Manager (vmd) services is required.

Master Server Modification:

nbemm requires a successful heartbeat communication within 5 minutes plus the 60 second grace period before it makes a state change. The grace period can be raised from its default of 60 seconds. For example, increasing the grace period to 10 minutes (600 seconds) results in a combined required update interval of 15 minutes (5 minutes + 10 minute grace period). With the 120 second heartbeat interval from the Media Server modification above, this gives ltid seven possible attempts to send and receive a successful heartbeat before a state change would take place.
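
The number of attempts follows from simple division. The following is a minimal sketch of the arithmetic, using this article's example values of a 120 second heartbeat interval on the Media Server and a 600 second grace period on the Master Server. The variable names are illustrative.

```python
expected_interval = 300   # seconds nbemm expects between heartbeats
grace_period = 600        # seconds, example NUM_DA_GRACE_PERIOD value
send_interval = 120       # seconds, example SCAN_HOST_STATUS_INTERVAL value

window = expected_interval + grace_period   # 900 seconds = 15 minutes
attempts = window // send_interval          # whole heartbeats that fit in the window
print(window, attempts)                     # 900 7
```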

Create a new emm.conf entry:
NUM_DA_GRACE_PERIOD = 600

On Unix: /usr/openv/var/global/emm.conf
On Windows: <install_path>\VERITAS\NetBackup\var\global\emm.conf

Note: For this setting to take effect, a restart of the NetBackup Enterprise Media Manager (nbemm) service is required.

 
