Problem
This intermittent problem can affect backup and restore. The jobs queue and start but then fails with status 41 less than a minute later, long before any application timeout could occur.
The example below is for an Oracle multi-stream and multiplexed backup. But the problem could affect backup or restore, file system or extension, MPX or non-MPX, tape or disk, SAN media server or remote client.
Error Message
The dbclient debug log shows that one job was able to get connected and begin transferring a tar image to the mediaserver. In this case the IP addresses and shared memory flags indicate the client is the master server which is also the mediaserver, even though the client name (myClient) is different from the server name (myServer).
17:44:56.332 [17970] <2> logconnections: BPRD CONNECT FROM 10.0.0.1.64743 TO 10.0.0.1.1556 fd = 21
17:44:56.334 [17970] <4> serverResponse: initial client_read_timeout = <32400>
17:45:02.334 [17970] <4> serverResponse: read comm file:<17:45:02 Initiating backup>
17:45:13.336 [17970] <4> serverResponse: read comm file:<17:45:13 INF - Starting bpbrm>
...snip...
17:45:18.402 [17970] <4> serverResponse: read comm file:<17:45:18 INF - Master server = myServer.com>
17:45:18.402 [17970] <4> serverResponse: read comm file:<17:45:18 INF - Media server = myServer>
...snip...
17:45:19.402 [17970] <4> serverResponse: read comm file:<17:45:18 INF - New data socket = myClient.1024>
17:45:19.402 [17970] <4> serverResponse: INF - connecting to server on NEW DATA socket myClient.1024
17:45:19.402 [17970] <4> connectSockStr: INF - port id = 1024
17:45:19.402 [17970] <4> connectSockStr: INF - hostname = myClient
...snip...
17:45:19.406 [17970] <4> serverResponse: read comm file:<17:45:18 INF - Use shared memory = 1>
...snip...
17:47:35.781 [17970] <2> int_WriteData: INF - writing buffer # 1 of size 262144
But a job that started just a few moments later receives a server status 41 within a minute without getting connectivity established with the media server.
17:44:56.699 [17975] <2> logconnections: BPRD CONNECT FROM 10.0.0.1.64747 TO 10.0.0.1.1556 fd = 21
17:44:56.700 [17975] <4> serverResponse: initial client_read_timeout = <32400>
17:45:03.701 [17975] <4> serverResponse: read comm file:<17:45:03 Initiating backup>
17:45:20.703 [17975] <4> serverResponse: read comm file:<17:45:19 INF - Waiting for mount of media id 004077 on server myServer for writing>
17:45:26.703 [17975] <4> serverResponse: read comm file:<17:45:25 INF - Server status = 41>
17:45:26.704 [17975] <16> serverResponse: ERR - server exited with status 41: network connection timed out
17:45:26.704 [17975] <16> CreateNewImage: ERR - serverResponse() failed
17:45:26.704 [17975] <4> closeApi: entering closeApi.
17:45:26.704 [17975] <4> closeApi: INF - EXIT STATUS 6: the backup failed to back up the requested files
17:45:26.704 [17975] <16> VxBSACreateObject: ERR - Could not create new image with file /backup_SID_20120919_45102.bkp.
The bpbrm debug log shows that the status 41 was received from bptm instead of the expected DATASOCKET information.
17:45:16.353 [18472] <2> bpbrm read_parent_msg: read from parent START BACKUP ... -b myClient_1348087509 ... 17975.0.1348087496 ...
...snip...
17:45:16.747 [18472] <2> bpbrm tell_mm_to_start_backup: telling media manager to start
17:45:16.748 [18472] <2> bpbrm multiplexed_backup: calling read_media_msg for DATASOCKET.
17:45:25.855 [18472] <2> bpbrm read_media_msg: read from media manager: EXIT myClient_1348087509 41
17:45:25.855 [18472] <2> bpbrm process_media_msg: media manager for backup id myClient_1348087509 exited with status 41: network connection timed out
The bptm debug log shows that, for the successful job, a legacy call-back connection is being used for the DATASOCKET and that local port 1024 is bound.
17:45:15.740 [18502] <2> read_brm_msg: START BACKUP -b myClient_1348087508 .... 17970.0.1348087496 ...
...snip...
17:45:16.114 [18502] <2> setup_bpbkar_info: /usr/openv/netbackup/db/config/shm/myClient_1348087508 file successfully created
17:45:16.114 [18502] <2> get_data_socket: tmmpx.c.4129: port_status: 16843008 16843008 0x01010100
17:45:16.114 [18502] <2> get_data_socket: tmmpx.c.4130: use_local_sock_protocol: 1 1 0x00000001
17:45:16.114 [18502] <2> get_data_socket: tmmpx.c.4133: client_hostname: myClient
17:45:16.115 [18502] <2> get_data_socket: tmmpx.c.4197: *port: 1024 1024 0x00000400
17:45:16.115 [18502] <2> get_data_socket: tmmpx.c.4198: mpx_ipc:
...snip...
17:45:16.117 [18502] <2> send_brm_msg: DATASOCKET myClient_1348087508 1024
The bptm log shows that for the next job, the attempt to bind an available port fails, and confirms that the status 41 was returned to bpbrm.
17:45:20.502 [18502] <2> read_brm_msg: START BACKUP -b myClient_1348087509 ... 17975.0.134808 ...
...snip...
17:45:20.834 [18502] <2> setup_bpbkar_info: /usr/openv/netbackup/db/config/shm/myClient_1348087509 file successfully created
17:45:20.834 [18502] <2> get_data_socket: tmmpx.c.4129: port_status: 16843008 16843008 0x01010100
17:45:20.834 [18502] <2> get_data_socket: tmmpx.c.4130: use_local_sock_protocol: 1 1 0x00000001
17:45:20.834 [18502] <2> get_data_socket: tmmpx.c.4133: client_hostname: myClient
17:45:20.835 [18502] <2> vnet_bind_to_port_addr_extra_ipi: vnet_bind.c.195: 0: Function failed: 10 0x0000000a
17:45:20.844 [18502] <16> get_data_socket: listen() error on data socket, Address already in use
17:45:20.844 [18502] <2> mpx_start_child: tmmpx.c.1169: get_data_socket failed: 41 41 0x00000029
17:45:25.844 [18502] <2> send_brm_msg: EXIT myClient_1348087509 41
Cause
The root cause is that there all of the port numbers on the media server host that are available to NetBackup are already in use, hence the 'Address already in use' message.
The reason is that the NetBackup Server Port Window, which is used for legacy call-back connections, has been narrowed to just one port number. Hence only one backup job at a time can bind a port and run to completion.
SERVER_PORT_WINDOW = 1024 1024
This problem can occur with a larger window if there are enough concurrent operations running.
Solution
The range of port numbers that NetBackup is allowed to use for call-back connections must be large enough to accommodate all of the backup jobs that must run concurrently. The default is not to configure the SERVER_PORT_WINDOW which allows NetBackup to utilize from 1024 to the O/S defined upper limit. If other applications desire a specific range, NetBackup can be configured to use a smaller, but sufficiently large range. E.g.
SERVER_PORT_WINDOW = 1024 3023
Alternatively, ensure that NetBackup can avoid the use of call-back connections. This is accomplished by ensuring that port 1556/PBX (and 13724/vnetd, if pre-7.1) are open between the NetBackup hosts and that the connect options are set to '1 1 0' or '1 1 1'.
A similar problem can occur with source ports for connections to other NetBackup processes. Simply size the Client Port Window appropriately large. E.g.
CLIENT_PORT_WINDOW = 3024 5023
Another alternative is to limit the number of concurrently active jobs.