VxUpdate fails with status code 7207 when updating AIX clients

Article: 100046330
Last Published: 2019-09-24
Product(s): Appliances, NetBackup

Problem

When running a Deployment policy to update AIX clients, the job fails after one hour with status code 7207.

Error Message

The Job Details and the admin debug log from the media server that was communicating with the client show:

Exit status [7207]

Cause

The AIX svmon program is taking many minutes, instead of the usual fraction of a second, to generate output for each process running on the host.

During the upgrade, nbliveup uses backlevel_kill.sh to detect and terminate all previously started NetBackup processes.  The script uses the AIX svmon program to inspect each running PID and determine whether the process is running from a NetBackup directory.  This prevents backlevel_kill.sh from inadvertently terminating a third-party process that happens to have the same name as a NetBackup process.

The script runs svmon once for every process present in the process table.  E.g.

root 14352428        1   0 09:52:00      -  0:01 /usr/openv/tmp/nbliveup ...
root 15597630 14352428   0 09:52:13      -  0:00 /bin/sh /tmp/backlevel_kill.sh
root  8126504 15597630 120 10:34:01      -  0:19 svmon -P 7471358 ...
root 16384240  6357032   0 10:35:10  pts/0  0:00 egrep 'nbliveup|backlevel|svmon'

Observe that each svmon execution is taking many minutes to complete, and there are hundreds of processes on the host.  As a result, it will take several hours for AIX to run svmon against every process, and backlevel_kill.sh and nbliveup will appear to be hung for that entire time.  All of this can be observed using the 'ps' command on the client host during the upgrade.  Notice above that the svmon process with PID 8126504 has used 19 CPU seconds during the 69 seconds (10:35:10 - 10:34:01) of its lifetime, and has not yet completed.
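
The arithmetic above can be sketched in shell to get a rough lower bound on how long the client side will be busy. This is only an illustration: the function names are hypothetical, and the process count of 300 is an assumed stand-in for "hundreds of processes", not a value taken from the affected host.

```shell
#!/bin/sh
# Convert an HH:MM:SS timestamp (as printed by ps) to seconds since midnight.
# awk treats "09" as decimal 9, which avoids octal parsing in shell arithmetic.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Elapsed seconds between two timestamps taken on the same day.
elapsed_seconds() {
    echo $(( $(hms_to_seconds "$2") - $(hms_to_seconds "$1") ))
}

# Rough total: process count multiplied by seconds per svmon call.
estimate_total_seconds() {
    echo $(( $1 * $2 ))
}

# svmon start time and the time the ps sample was taken (from the listing above).
age=$(elapsed_seconds 10:34:01 10:35:10)
echo "svmon age so far: ${age}s"

# 300 is a hypothetical process count, not a value from the article.
echo "lower-bound total: $(estimate_total_seconds 300 "$age")s"
```

With the timestamps from the ps listing this reports an age of 69 seconds. Because that svmon had not yet finished, 300 such calls would need at least 20700 seconds (5.75 hours), well beyond a one-hour timeout.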

In the meantime, the nbmtrans process on the media server, which started nbliveup, uses a timeout to protect itself from an infinite hang by either the client side processing or the TCP connection between the hosts.  The admin debug log on the media server confirms that nbmtrans encountered a problem after exactly one hour.

09:52:04.667 [2123] <2> execute_remote_nbinstallagent: Successfully set timeout of [3600] seconds for reading nbinstallagent messages.
... Notice the timeout being set in the line above.
... snip ...
09:52:12.616 [2123] <2> execute_remote_nbinstallagent: Running service shutdown command: [/usr/openv/netbackup/bin/goodies/netbackup stop 2>&1].
09:52:13.586 [2123] <2> execute_remote_nbinstallagent: Execution finished.
09:52:13.586 [2123] <2> execute_remote_nbinstallagent: netbackup stop returned following output:
09:52:13.586 [2123] <2> execute_remote_nbinstallagent: ------------------------------------------------------------
09:52:13.586 [2123] <2> execute_remote_nbinstallagent: stopping the NetBackup Discovery Framework
09:52:13.586 [2123] <2> execute_remote_nbinstallagent: stopping the NetBackup client daemon
09:52:13.586 [2123] <2> execute_remote_nbinstallagent: stopping the NetBackup network daemon
09:52:13.586 [2123] <2> execute_remote_nbinstallagent:
09:52:13.586 [2123] <2> execute_remote_nbinstallagent: ------------------------------------------------------------
09:52:13.587 [2123] <2> execute_remote_nbinstallagent: Service shutdown successful.
... 3600 second delay before the timeout occurs due to svmon slow execution.
10:52:13.765 [2123] <16> execute_remote_nbinstallagent: Socket error - recv failed, retval=-1, errno=11
10:52:23.766 [2123] <8> vnet_close_socket_safely_ex: [vnet.c:1163] vnet_sock_ready() failed 11 0xb
10:52:33.767 [2123] <8> vnet_close_socket_safely_ex: [vnet.c:1163] vnet_sock_ready() failed 11 0xb
10:52:33.768 [2123] <2> nbmtrans_main: Execution of the nbinstallagent on the remote host failed
10:52:33.768 [2123] <16> nbmtrans_main: Exit status [7207]

Notice that the error is during an AIX recv() API call that fails after exactly one hour.  The errno=11 (EAGAIN/EWOULDBLOCK) indicates the recv() was waiting and would have continued to wait.

This occurs because nbmtrans set SO_RCVTIMEO to 3600 seconds on the connection, to prevent an indefinite client side problem from hanging the media server side.  To the media server, the svmon delays look like exactly that kind of problem.
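
To confirm the same pattern in another admin debug log, the wait between the service shutdown and the recv() failure can be measured with a short awk filter. This is a minimal sketch: the function name is hypothetical, it assumes the log line format shown above, and it assumes both lines fall on the same calendar day.

```shell
#!/bin/sh
# Print the whole seconds between the "Service shutdown successful" line
# and the first following "Socket error - recv failed" line in a log file.
# Assumes each line starts with an HH:MM:SS.mmm timestamp, as shown above.
recv_wait_seconds() {
    awk '
        # Convert HH:MM:SS.mmm to seconds since midnight.
        function secs(ts,   p) {
            split(ts, p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        /Service shutdown successful/ { start = secs($1); seen = 1 }
        /Socket error - recv failed/ {
            if (seen) { printf "%d\n", secs($1) - start; exit }
        }
    ' "$1"
}

# Usage (the path is a placeholder for the actual admin log file):
#   recv_wait_seconds /path/to/admin/log
```

A result equal to the configured timeout, 3600 in the excerpt above, suggests the job ended because nbmtrans stopped waiting rather than because the client reported a failure.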

Solution

The root problem is the svmon delays.  AIX technical resources should be able to provide an appropriate solution which allows the svmon execution to become sub-second.

As a work-around, review the nbliveup and other client side evidence to determine how long the client side processing delay might be expected to take.  Then increase the timeout used by nbmtrans to wait for nbliveup completion to a value larger than the observed delays, so that the server side waits long enough for the client side processing to complete.
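
One way to turn an observed delay into a timeout value is to add a safety margin and round up to a whole hour. The sketch below is illustrative only: the function name, the 50 percent default margin, and the 20700 second example delay are all assumptions, not NetBackup recommendations.

```shell
#!/bin/sh
# Suggest a timeout in seconds: observed delay plus a percentage margin,
# rounded up to the next whole hour. The default margin of 50 percent is
# an arbitrary assumption; choose one that fits the environment.
suggest_timeout() {
    observed=$1
    margin_pct=${2:-50}
    padded=$(( observed + observed * margin_pct / 100 ))
    echo $(( (padded + 3599) / 3600 * 3600 ))
}

# Hypothetical observed delay of 20700 seconds (5.75 hours).
echo "suggested timeout: $(suggest_timeout 20700) seconds"
```

For a delay of 20700 seconds and the default margin, this suggests 32400 seconds (9 hours).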

Be aware that this parameter is an ad hoc nbconf setting which is not initially visible in bp.conf, nor can an initial value be set using nbsetconfig.  E.g.

$ grep VXUPDATE /usr/openv/netbackup/bp.conf
$

$ /usr/openv/netbackup/bin/nbgetconfig VXUPDATE_CLIENT_READ_TIMEOUT_SECONDS
VXUPDATE_CLIENT_READ_TIMEOUT_SECONDS = 3600

$ echo VXUPDATE_CLIENT_READ_TIMEOUT_SECONDS=10800|/usr/openv/netbackup/bin/nbsetconfig
nbsetconfig> nbsetconfig>

$ /usr/openv/netbackup/bin/nbgetconfig VXUPDATE_CLIENT_READ_TIMEOUT_SECONDS
VXUPDATE_CLIENT_READ_TIMEOUT_SECONDS = 3600

$ grep VXUPDATE /usr/openv/netbackup/bp.conf
$

However, the parameter can be set manually, and adjusted thereafter by nbsetconfig.  E.g.

$ echo 'VXUPDATE_CLIENT_READ_TIMEOUT_SECONDS = 7200' >> /usr/openv/netbackup/bp.conf

$ /usr/openv/netbackup/bin/nbgetconfig VXUPDATE_CLIENT_READ_TIMEOUT_SECONDS
VXUPDATE_CLIENT_READ_TIMEOUT_SECONDS = 7200

$ echo VXUPDATE_CLIENT_READ_TIMEOUT_SECONDS=10800|/usr/openv/netbackup/bin/nbsetconfig
nbsetconfig> nbsetconfig>

$ grep VXUPDATE /usr/openv/netbackup/bp.conf
VXUPDATE_CLIENT_READ_TIMEOUT_SECONDS = 10800
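
The manual step can be wrapped so that the entry is appended only when it is missing, which avoids duplicate lines if the procedure is repeated. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the bp.conf path is passed in by the caller rather than assumed.

```shell
#!/bin/sh
# Append the timeout entry to a bp.conf-style file only if it is absent.
# Returns 0 if the entry was added, 1 if one already exists, and 2 if the
# value is not a plain positive integer string.
ensure_vxupdate_timeout() {
    conf=$1
    seconds=$2
    key=VXUPDATE_CLIENT_READ_TIMEOUT_SECONDS

    # Reject empty or non-numeric values before touching the file.
    case $seconds in
        ''|*[!0-9]*) echo "timeout must be a whole number of seconds" >&2
                     return 2 ;;
    esac

    if grep -q "^[[:space:]]*$key[[:space:]]*=" "$conf" 2>/dev/null; then
        echo "$key already present in $conf; adjust it with nbsetconfig" >&2
        return 1
    fi

    printf '%s = %s\n' "$key" "$seconds" >> "$conf"
}

# Usage, with the path from the examples above:
#   ensure_vxupdate_timeout /usr/openv/netbackup/bp.conf 7200
```

Once the entry exists, later changes can go through nbsetconfig as shown above, so the function deliberately refuses to rewrite an existing value.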

 

 
