Backups of multiple clients across multiple media servers fail concurrently while performing peer host validation
Problem:
A subset of backup jobs, for multiple clients across multiple media servers, fail at about the same time, shortly after the backup window first opens. Some jobs, perhaps many, complete successfully, but multiple jobs fail with status codes 26, 23, and 7643.
Occurs only when many hundreds (perhaps thousands) of jobs are allowed to go active at once when the backup window opens.
Applies only to master servers running NetBackup 8.1 or higher. Most likely encountered after upgrading to NetBackup 8.2 or higher.
Error Message:
The Job Details will show that bpbrm reports that peer validation has failed with status 26 and status 23.
Jan 16, 2020 2:11:35 AM - Error bpbrm (pid=9180) [PROXY] Received status: 26 with message Unable to perform peer host name validation. Curl error has occurred for peer name: nb-client, self name: nb-media
Jan 16, 2020 2:11:35 AM - Error bpbrm (pid=9180) [PROXY] Encountered error (VALIDATE_PEER_HOST_PROTOCOL_RUNNING) while processing(ValidatePeerHostProtocol).
Jan 16, 2020 2:11:35 AM - Error bpbrm (pid=9180) bpcd on nb-client exited with status 23: socket read failed
Jan 16, 2020 2:12:06 AM - Error bpbrm (pid=9180) [PROXY] Received status: 26 with message Unable to perform peer host name validation. Curl error has occurred for peer name: nb-client, self name: nb-media
Jan 16, 2020 2:12:06 AM - Error bpbrm (pid=9180) [PROXY] Encountered error (VALIDATE_PEER_HOST_PROTOCOL_RUNNING) while processing(ValidatePeerHostProtocol).
Jan 16, 2020 2:12:06 AM - Error bpbrm (pid=9180) cannot send mail because BPCD on nb-client exited with status 23: socket read failed
Jan 16, 2020 2:12:06 AM - Info bpbkar32 (pid=0) done. status: 23: socket read failed
Jan 16, 2020 2:12:06 AM - end writing
socket read failed (23)
The matching nbpxyhelper debug log from the media server will show a peer validation timeout and status 7643.
01/16/20 02:12:05.741 [Debug] NB 51216 nbpxyhelper 486 PID:16612 TID:140522929985280 File ID:486 [{DBB5B85C-3155-11EA-9C6C-C507200DCEE0}:OUTBOUND] 1 [ValidatePeerHostProtocol::setState] Peer host validation timed out for SECURE connection; Peer host: nb-client , nbu status = 7643, severity = 1, Additional Message: [PROXY] Encountered error (VALIDATE_PEER_HOST_PROTOCOL_RUNNING) while processing(ValidatePeerHostProtocol)., nbu status = x, severity = 1 (../machines/LibNbPxyValidatePeerHost.cpp:1342) 
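
The signature above (status 26 from the proxy followed by status 23 from bpcd) can be checked for mechanically when many failed jobs need to be triaged. The following is a minimal Python sketch, assuming the Job Details text has been saved to a string and that its lines are shaped like the excerpt above. The function names are illustrative and are not part of NetBackup.

```python
import re
from datetime import datetime

# Matches Job Details lines shaped like the excerpt above, for example:
# "Jan 16, 2020 2:11:35 AM - Error bpbrm (pid=9180) <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) - "
    r"(?P<level>\w+) (?P<proc>\w+) \(pid=(?P<pid>\d+)\) (?P<msg>.*)$"
)
# Status numbers appear as "status: 26" or "status 23" in the excerpt.
STATUS_RE = re.compile(r"status:? (\d+)")


def parse_job_details(text):
    """Return (timestamp, process, pid, [status codes]) for each matching line."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines such as "end writing" that carry no pid
        ts = datetime.strptime(m["ts"], "%b %d, %Y %I:%M:%S %p")
        statuses = [int(s) for s in STATUS_RE.findall(m["msg"])]
        events.append((ts, m["proc"], int(m["pid"]), statuses))
    return events


def has_peer_validation_signature(events):
    """True when both status 26 and status 23 were reported for the job."""
    seen = {s for _, _, _, statuses in events for s in statuses}
    return {26, 23} <= seen
```

A job whose details contain both codes is a candidate for this issue. Confirming it still requires the status 7643 timeout in the nbpxyhelper log on the media server.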
  
 
Cause:
Several factors may combine to cause this situation.
- The storage unit (STU) configuration for the domain allows many hundreds, even thousands, of jobs to be active concurrently. Contributing settings include the number of STUs, jobs/streams per STU, and MPX settings per device.
- The backup policy windows are not staggered and many/all jobs for the night/day are allowed to queue and go active at the same time.
- Starting with NetBackup 8.2, more than one job per second is allowed to become active. If a backup window contains many clients, or if a policy allows multiple data streams and clients have multiple streams or mount points, then after the upgrade more jobs will go active together very quickly instead of being spaced out over time.
- Increasing the number of clients in the backup window can have the same effect.
The ability to start a large number of jobs in a short span of time may overwhelm the ability of the NetBackup Web Management Console (nbwmc) on the master server to respond to all the requests from all of the many media servers and clients.
Solution:
The problem can be resolved in several ways. Depending on the operational needs of the site, apply one or more as applicable.
- If enough backups will be queued that they will run for hours, stagger the backup windows of the policies so that they do not all open at once. Having all storage units busy with a few hundred jobs in the queue is relatively common. But if the number of queued jobs approaches or exceeds a thousand, NetBackup is forced to re-evaluate all of them regularly, which is very inefficient and also allows many more jobs to go active in a short span of time.
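
As a planning aid for staggering, the sketch below groups policies into successive window-open times so that the estimated number of jobs released at each time stays under a chosen cap. It is only an illustration in Python: the policy names, job estimates, cap, and gap are hypothetical inputs, and the function does not read or change any NetBackup configuration.

```python
from datetime import datetime, timedelta


def plan_staggered_windows(job_counts, max_jobs_per_window, first_open, gap):
    """Assign each policy a window-open time.

    job_counts: dict of policy name -> estimated jobs (hypothetical input).
    Policies are packed in the given order; a new open time, `gap` later,
    is started whenever adding a policy would exceed max_jobs_per_window.
    """
    plan = {}
    open_time = first_open
    jobs_in_window = 0
    for policy, jobs in job_counts.items():
        if jobs_in_window and jobs_in_window + jobs > max_jobs_per_window:
            open_time += gap
            jobs_in_window = 0
        plan[policy] = open_time
        jobs_in_window += jobs
    return plan


# Hypothetical example: four policies, at most 800 jobs released per open time.
plan = plan_staggered_windows(
    {"policy_a": 400, "policy_b": 300, "policy_c": 500, "policy_d": 100},
    max_jobs_per_window=800,
    first_open=datetime(2020, 1, 16, 20, 0),
    gap=timedelta(hours=1),
)
for policy, opens_at in plan.items():
    print(policy, opens_at.strftime("%H:%M"))
```

The greedy packing keeps the original policy order, which is the simplest choice. A site may prefer to order policies by priority or by expected duration before packing.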
 
- Review the backup policies with Allow Multiple Datastreams enabled to see if any are creating an excessive number of jobs. Inappropriate use of wildcards may be causing tens, or even hundreds, of small jobs for a client instead of a smaller number of large jobs. Adjust the policy configuration as appropriate: disable Allow Multiple Datastreams, or adjust the wildcards or NEW_STREAM directives to create a reasonable number of reasonably sized jobs.
 
 The bpdbjobs and/or bpimagelist output is useful for determining how many jobs per client per policy occur daily.
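
 Once the job records have been extracted, the count per client per policy is a simple tally. The Python sketch below assumes the records have already been reduced to (client, policy) pairs by whatever means suits the site. It does not parse bpdbjobs or bpimagelist output itself, and the threshold is a hypothetical value.

```python
from collections import Counter


def jobs_per_client_policy(records):
    """Count jobs for each (client, policy) pair.

    records: iterable of (client, policy) tuples, one per job
    (assumed to have been extracted beforehand).
    """
    return Counter((client, policy) for client, policy in records)


def excessive_pairs(counts, threshold):
    """Return the (client, policy) pairs with more than `threshold` jobs."""
    return sorted(pair for pair, n in counts.items() if n > threshold)


# Hypothetical records for one day.
records = [("nb-client", "policy_a")] * 120 + [("nb-client2", "policy_a")] * 3
counts = jobs_per_client_policy(records)
print(excessive_pairs(counts, threshold=50))
```

 Pairs that stand out are the ones to review for wildcard or NEW_STREAM directives that split a client into many small jobs.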
 
- If the problem appeared only after upgrading to NetBackup 8.2 or higher, then review the bpdbjobs output and/or bperror output to determine how many jobs per second are going active after the upgrade of the master server.
 
 If the number of jobs/second is greater than two or three, then restrict the number of jobs that can go active per second. Make the following change to the configuration on the master server.
 
 echo DBM_NEW_IMAGE_DELAY=1000|nbsetconfig
 
 UNIX/Linux: /usr/openv/netbackup/bin/nbsetconfig
 Windows: <install_path>\Veritas\NetBackup\bin\nbsetconfig
 
 This allows only one job per 1000 milliseconds, the same as NetBackup 8.1.2, instead of the default, which is constrained only by the speed at which the master server host (CPU, memory, network) can start processes and make connections.
 
 If the environment can handle more than one job/second, and without the setting there are five or more jobs/second, then consider a value of 500 (two jobs/second) or 333 (three jobs/second) instead of 1000 milliseconds.
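
 The arithmetic behind these values can be sketched in Python as follows. The sketch assumes the job start times have already been extracted as datetime values (it does not parse bpdbjobs or bperror output), and the function names are illustrative. The delay is 1000 milliseconds divided by the target number of jobs per second, rounded down, which reproduces the 1000, 500, and 333 values mentioned above.

```python
from collections import Counter
from datetime import datetime


def peak_jobs_per_second(start_times):
    """Largest number of jobs that went active within one clock second."""
    per_second = Counter(t.replace(microsecond=0) for t in start_times)
    return max(per_second.values(), default=0)


def delay_ms_for_rate(jobs_per_second):
    """Delay in milliseconds that limits starts to the given whole jobs/second."""
    if jobs_per_second < 1:
        raise ValueError("jobs_per_second must be at least 1")
    return 1000 // jobs_per_second


def config_line(jobs_per_second):
    """Text to pipe to nbsetconfig, in the same form as the command above."""
    return f"DBM_NEW_IMAGE_DELAY={delay_ms_for_rate(jobs_per_second)}"


# Hypothetical start times: two jobs in the same second, one in the next.
starts = [
    datetime(2020, 1, 16, 2, 11, 35),
    datetime(2020, 1, 16, 2, 11, 35, 500000),
    datetime(2020, 1, 16, 2, 11, 36),
]
print(peak_jobs_per_second(starts))
print(config_line(3))
```

 The sketch only produces the configuration text. Applying it is still done with nbsetconfig on the master server as shown above.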
 
- If there is a need to start many jobs within a span of a few minutes using only one or a few media servers, then it will be beneficial to make a change on the media servers.
 
 The media servers normally discard any cached peer validation information at the start of each job, which causes a connection from the media server to nbwmc for each job. The whitelist cache is cleared to ensure that any certificate changes since the prior backup window are re-evaluated. But if multiple jobs will be started for the same client in a short span, during which the certificate situation will not change, then the clearing of the cache can be disabled to prevent the connection to nbwmc. Be aware that certificates that expired in the preceding hours may be accepted and honored until the cached information expires.
 
 A media server patch has been made available for this issue (4). This patch has not yet gone through extensive QA testing. Consequently, if you are not adversely affected by this problem and have a satisfactory temporary workaround in place, we recommend that you wait for the public release of this hotfix. Please contact Veritas Technical Support to obtain this patch.
