backup or restore fails with status 41 a few seconds after bpbrm starts

Article: 100027688
Last Published: 2021-10-12
Ratings: 2 0
Product(s): NetBackup & Alta Data Protection

Problem

This intermittent problem can affect backup and restore.   The jobs queue and start but then fails with status 41 less than a minute later, long before any application timeout could occur.

 

The example below is for an Oracle multi-stream and multiplexed backup.   But the problem could affect backup or restore, file system or extension, MPX or non-MPX, tape or disk, SAN media server or remote client.

 

Error Message

The dbclient debug log shows that one job was able to get connected and begin transferring a tar image to the mediaserver.  In this case the IP addresses and shared memory flags indicate the client is the master server which is also the mediaserver, even though the client name (myClient) is different from the server name (myServer).

 

17:44:56.332 [17970] <2> logconnections: BPRD CONNECT FROM 10.0.0.1.64743 TO 10.0.0.1.1556 fd = 21

17:44:56.334 [17970] <4> serverResponse: initial client_read_timeout = <32400>

17:45:02.334 [17970] <4> serverResponse: read comm file:<17:45:02 Initiating backup>

17:45:13.336 [17970] <4> serverResponse: read comm file:<17:45:13 INF - Starting bpbrm>

...snip...

17:45:18.402 [17970] <4> serverResponse: read comm file:<17:45:18 INF - Master server = myServer.com>

17:45:18.402 [17970] <4> serverResponse: read comm file:<17:45:18 INF - Media server = myServer>

...snip...

17:45:19.402 [17970] <4> serverResponse: read comm file:<17:45:18 INF - New data socket = myClient.1024>

17:45:19.402 [17970] <4> serverResponse: INF - connecting to server on NEW DATA socket myClient.1024

17:45:19.402 [17970] <4> connectSockStr: INF - port id = 1024

17:45:19.402 [17970] <4> connectSockStr: INF - hostname = myClient

...snip...

17:45:19.406 [17970] <4> serverResponse: read comm file:<17:45:18 INF - Use shared memory = 1>

...snip...

17:47:35.781 [17970] <2> int_WriteData: INF - writing buffer # 1 of size 262144

 

But a job that started just a few moments later receives a server status 41 within a minute without getting connectivity established with the media server.

 

17:44:56.699 [17975] <2> logconnections: BPRD CONNECT FROM 10.0.0.1.64747 TO 10.0.0.1.1556 fd = 21

17:44:56.700 [17975] <4> serverResponse: initial client_read_timeout = <32400>

17:45:03.701 [17975] <4> serverResponse: read comm file:<17:45:03 Initiating backup>

17:45:20.703 [17975] <4> serverResponse: read comm file:<17:45:19 INF - Waiting for mount of media id 004077 on server myServer for writing>

17:45:26.703 [17975] <4> serverResponse: read comm file:<17:45:25 INF - Server status = 41>

17:45:26.704 [17975] <16> serverResponse: ERR - server exited with status 41: network connection timed out

17:45:26.704 [17975] <16> CreateNewImage: ERR - serverResponse() failed

17:45:26.704 [17975] <4> closeApi: entering closeApi.

17:45:26.704 [17975] <4> closeApi: INF - EXIT STATUS 6: the backup failed to back up the requested files

17:45:26.704 [17975] <16> VxBSACreateObject: ERR - Could not create new image with file /backup_SID_20120919_45102.bkp.

 

 

The bpbrm debug log shows that the status 41 was received from bptm instead of the expected DATASOCKET information.

 

17:45:16.353 [18472] <2> bpbrm read_parent_msg: read from parent START BACKUP ... -b myClient_1348087509  ... 17975.0.1348087496 ...

...snip...

17:45:16.747 [18472] <2> bpbrm tell_mm_to_start_backup: telling media manager to start

17:45:16.748 [18472] <2> bpbrm multiplexed_backup: calling read_media_msg for DATASOCKET.

17:45:25.855 [18472] <2> bpbrm read_media_msg: read from media manager: EXIT myClient_1348087509 41

17:45:25.855 [18472] <2> bpbrm process_media_msg: media manager for backup id myClient_1348087509 exited with status 41: network connection timed out

 

 

The bptm debug log shows that, for the successful job, a legacy call-back connection is being used for the DATASOCKET and that local port 1024 is bound.

 

17:45:15.740 [18502] <2> read_brm_msg: START BACKUP -b myClient_1348087508 .... 17970.0.1348087496 ...

...snip...

17:45:16.114 [18502] <2> setup_bpbkar_info: /usr/openv/netbackup/db/config/shm/myClient_1348087508 file successfully created

17:45:16.114 [18502] <2> get_data_socket: tmmpx.c.4129: port_status: 16843008 16843008 0x01010100

17:45:16.114 [18502] <2> get_data_socket: tmmpx.c.4130: use_local_sock_protocol: 1 1 0x00000001

17:45:16.114 [18502] <2> get_data_socket: tmmpx.c.4133: client_hostname: myClient

17:45:16.115 [18502] <2> get_data_socket: tmmpx.c.4197: *port: 1024 1024 0x00000400

17:45:16.115 [18502] <2> get_data_socket: tmmpx.c.4198: mpx_ipc:

...snip...

17:45:16.117 [18502] <2> send_brm_msg: DATASOCKET myClient_1348087508 1024

 

The bptm log shows that for the next job, the attempt to bind an available port fails, and confirms that the status 41 was returned to bpbrm.

 

17:45:20.502 [18502] <2> read_brm_msg: START BACKUP -b myClient_1348087509 ... 17975.0.134808 ...

...snip...

17:45:20.834 [18502] <2> setup_bpbkar_info: /usr/openv/netbackup/db/config/shm/myClient_1348087509 file successfully created

17:45:20.834 [18502] <2> get_data_socket: tmmpx.c.4129: port_status: 16843008 16843008 0x01010100

17:45:20.834 [18502] <2> get_data_socket: tmmpx.c.4130: use_local_sock_protocol: 1 1 0x00000001

17:45:20.834 [18502] <2> get_data_socket: tmmpx.c.4133: client_hostname: myClient

17:45:20.835 [18502] <2> vnet_bind_to_port_addr_extra_ipi: vnet_bind.c.195: 0: Function failed: 10 0x0000000a

17:45:20.844 [18502] <16> get_data_socket: listen() error on data socket, Address already in use

17:45:20.844 [18502] <2> mpx_start_child: tmmpx.c.1169: get_data_socket failed: 41 41 0x00000029

17:45:25.844 [18502] <2> send_brm_msg: EXIT myClient_1348087509 41

 

Cause

The root cause is that there all of the port numbers on the media server host that are available to NetBackup are already in use, hence the 'Address already in use' message.

 

The reason is that the NetBackup Server Port Window, which is used for legacy call-back connections, has been narrowed to just one port number.  Hence only one backup job at a time can bind a port and run to completion. 

 

SERVER_PORT_WINDOW = 1024 1024

 

This problem can occur with a larger window if there are enough concurrent operations running.

 

Solution

The range of port numbers that NetBackup is allowed to use for call-back connections must be large enough to accommodate all of the backup jobs that must run concurrently.   The default is not to configure the SERVER_PORT_WINDOW which allows NetBackup to utilize from 1024 to the O/S defined upper limit.   If other applications desire a specific range, NetBackup can be configured to use a smaller, but sufficiently large range.   E.g.

 

SERVER_PORT_WINDOW = 1024 3023

 

Alternatively, ensure that NetBackup can avoid the use of call-back connections.   This is accomplished by ensuring that port 1556/PBX (and 13724/vnetd, if pre-7.1) are open between the NetBackup hosts and that the connect options are set to '1 1 0' or '1 1 1'.

 

A similar problem can occur with source ports for connections to other NetBackup processes.   Simply size the Client Port Window appropriately large.   E.g.

 

CLIENT_PORT_WINDOW = 3024 5023

 

Another alternative is to limit the number of concurrently active jobs.

 

 

 

Was this content helpful?