Legacy and CORBA connection delays between processes, most often bpbrm to the client

Article: 100012326
Last Published: 2014-04-09
Ratings: 0 0
Product(s): NetBackup & Alta Data Protection

Problem

When there are a large number of network interfaces on the host, and the Preferred Network configuration prohibits at least one of the local interfaces from being used, it is observed that there are significant delays connecting to some hosts.  

 

These delays most commonly occur between when bpbrm is started and when it is connected to the client.   Once connected, the client and bptm will complete the data transfer normally, sometime in only a few seconds for small jobs.   The connection delays are more noticeable with small jobs because they account for a larger percentage of the elapse time for the job.   Large backups are equally affected, but may not be noticed.

 

This problem may affect CORBA connections, but the effect is less because the delay can occur only once per source and destination process due to the ConnectList cache, and the delay is shorter.

 

These delays will also affect restore, and other operations, between the same hosts.

 

Error Message

There are no error messages, but the Job Details show the delays after the job becomes active.  There will be two or three spots, depending on job type, where there will be delays of 20 or more seconds.  In this example, the data transfer takes only 2 seconds out of the total elapse time of 287 seconds.

 

...snip...

12:05:10 - Info nbjm (pid=28274) started backup (backupid=myclient_1393869869) job ...

…delay…

12:06:10 - Info bpbrm (pid=3429) myclient is the host to backup data from

12:06:10 - Info bpbrm (pid=3429) reading file list for client

…delay…

12:07:41 - connecting

12:07:41 - Info bpbrm (pid=3429) listening for client connection

…delay…

12:09:43 - Info bpbrm (pid=3429) accepted connection from client

12:09:43 - Info dbclient (pid=25892) Backup started

12:09:43 - Info bpbrm (pid=3429) bptm pid: 18775

12:09:43 - Info bptm (pid=6553) start

12:09:43 - connected; connect time: 0:02:02

...snip...

12:09:52 - Info dbclient (pid=25892) dbclient(pid=10427) wrote first buffer(size=262144)

12:09:54 - Info dbclient (pid=25892) dbclient waited 346 times for empty buffer, delayed 445 times

12:09:54 - Info dbclient (pid=25892) done. status: 0

12:09:54 - Info bptm (pid=6553) waited for full buffer 206 times, delayed 5672 times

12:09:57 - Info bptm (pid=6553) EXITING with status 0 <----------

 

 

The bpbrm debug log shows there are no delays connecting to the master server.  Notice that there are preferred network settings configured. At least one of the local interfaces is prohibited, and there is not a match for the destination IP address.  If the master server was on the 2.2.2.0/24 network, a 10 second delay would be seen here.

 

...snip...

12:05:10.504 [3429] <2> verify_client: ... *connect_opts: 16974338 16974338 0x01030202

12:05:10.504 [3429] <2> ConnectionCache::connectAndCache: Acquiring new connection for host mymaster, query type 223

12:05:10.504 [3429] <2> vnet_get_pref_netconnection: ... targaddr=1.1.1.1 No match, some prohibited, prefnet=IF_LIST

12:05:10.506 [3429] <2> logconnections: BPDBM CONNECT FROM 1.1.1.2.46242 TO 1.1.1.1.1556 fd = 4

 

However, when connecting to bpcd on the client host there is 3 delays of 10 seconds each.  The delays come between each of 4 attempts to connect to the host.  The fourth is successful to the PBX port so this is not a problem with blocked ports in the firewall or service not listening.  Notice that the preferred network settings also do not match on this target IP address.

 

12:05:10.554 [3429] <2> bpbrm start_bpcd_stat: ... bpcd_client: myclient

...snip...

12:05:10.556 [3429] <2> vnet_get_pref_netconnection: ... targaddr=4.4.4.3 No match, some prohibited, prefnet=IF_LIST

12:05:10.556 [3429] <2> async_connect: ... connect in progress 1 0x1

12:05:20.566 [3429] <2> async_connect: ... connect in progress 2 0x2

12:05:30.576 [3429] <2> async_connect: ... connect in progress 3 0x3

12:05:40.586 [3429] <2> async_connect: ... connect in progress 4 0x4

12:05:40.588 [3429] <2> logconnections: BPCD CONNECT FROM 4.4.4.2.34346 TO 4.4.4.3.1556 fd = 8

 

The same sequence of delays occurs when connecting to vnetd for the forward socket that bpcd will also use.

 

12:05:40.589 [3429] <2> vnet_get_pref_netconnection: ... targaddr=4.4.4.3 No match, some prohibited, prefnet=IF_LIST

12:05:40.590 [3429] <2> async_connect: ... connect in progress 1 0x1

12:05:50.784 [3429] <2> async_connect: ... connect in progress 2 0x2

12:06:00.785 [3429] <2> async_connect: ... connect in progress 3 0x3

12:06:10.795 [3429] <2> async_connect: ... connect in progress 4 0x4

12:06:10.797 [3429] <2> async_connect: ... connect async CONNECT FROM 4.4.4.2.44451 TO 4.4.4.3.1556 fd = 9

12:06:10.798 [3429] <8> vnet_vnetd_connect_forward_socket_begin: ... VN_REQUEST_CONNECT_FORWARD_SOCKET 10 0xa

 

This first bpcd connection was to verify several settings on the client.

 

12:06:10.930 [3429] <2> bpcr_get_hostname_rqst: Server hostname length = 12

12:06:10.930 [3429] <2> bpcr_get_version_rqst: bpcd version: 07500004

12:06:10.937 [3429] <2> verify_client: ... bpcr_update_resiliency : 0 0 0x00000000

12:06:10.938 [3429] <2> verify_client: ... *connect_opts: 16843008 16843008 0x01010100

...snip...

 

The next pair of connections to bpcd and vnetd incurs the same 10 second delays.

 

12:06:10.946 [3429] <2> async_connect: ... connect in progress 1 0x1

12:06:20.956 [3429] <2> async_connect: ... connect in progress 2 0x2

12:06:30.963 [3429] <2> async_connect: ... connect in progress 3 0x3

12:06:40.973 [3429] <2> async_connect: ... connect in progress 4 0x4

12:06:40.974 [3429] <2> logconnections: BPCD CONNECT FROM 4.4.4.2.36029 TO 4.4.4.3.1556 fd = 9

12:06:40.975 [3429] <2> async_connect: ... connect in progress 1 0x1

12:06:50.976 [3429] <2> async_connect: ... connect in progress 2 0x2

12:07:00.984 [3429] <2> async_connect: ... connect in progress 3 0x3

12:07:10.994 [3429] <2> async_connect: ... connect in progress 4 0x4

12:07:10.996 [3429] <8> do_pbx_service: ... via PBX VNETD CONNECT FROM 4.4.4.2.55183 TO 4.4.4.3.1556 fd = 10

12:07:11.115 [3429] <2> bpcr_get_version_rqst: bpcd version: 07500004

12:07:11.115 [3429] <2> bpcr_get_charset_rqst: bpcd charset: 0000000d

...snip...

 

Because this is an Oracle stream-based application backup using XBSA, there are three vnetd forward sockets created to the client to be used as the New Data, Data, and Name connections.  Each incurs the same delays.  This would also be true of DB2, Informix, SAP, SQL-Server, Sybase, and DataStore backup jobs.

 

12:07:11.160 [3429] <2> async_connect: ... connect in progress 1 0x1

12:07:21.166 [3429] <2> async_connect: ... connect in progress 2 0x2

12:07:31.176 [3429] <2> async_connect: ... connect in progress 3 0x3

12:07:41.186 [3429] <2> async_connect: ... connect in progress 4 0x4

12:07:41.188 [3429] <8> do_pbx_service: ... via PBX VNETD CONNECT FROM 4.4.4.2.54261 TO 4.4.4.3.1556 fd = 9

...snip...

12:07:41.267 [3429] <2> set_job_details: Tfile (271461): LOG 1393870061 4 bpbrm 3429 listening for client connection

12:07:41.267 [3429] <2> bpbrm listen_for_client: HOT_ORACLE_DB_BACKUP

...snip...

12:07:41.268 [3429] <2> async_connect: ... connect in progress 1 0x1

12:07:51.270 [3429] <2> async_connect: ... connect in progress 2 0x2

12:08:01.278 [3429] <2> async_connect: ... connect in progress 3 0x3

12:08:11.288 [3429] <2> async_connect: ... connect in progress 4 0x4

12:08:11.289 [3429] <8> do_pbx_service: ... via PBX VNETD CONNECT FROM 4.4.4.2.46340 TO 4.4.4.3.1556 fd = 10

12:08:11.366 [3429] <2> bpbrm listen_for_client: ... accept_socket_data: 10 10 0x0000000a

...snip...

12:08:11.368 [3429] <2> async_connect: ... connect in progress 1 0x1

12:08:21.378 [3429] <2> async_connect: ... connect in progress 2 0x2

12:08:31.388 [3429] <2> async_connect: ... connect in progress 3 0x3

12:08:41.390 [3429] <2> async_connect: ... connect in progress 4 0x4

12:08:41.392 [3429] <8> do_pbx_service: ... via PBX VNETD CONNECT FROM 4.4.4.2.40519 TO 4.4.4.3.1556 fd = 11

12:08:41.471 [3429] <2> bpbrm listen_for_client: ... accept_socket_name: 11 11 0x0000000b

...snip...

 

The same delays are present when making the connections, to bpcd and vnetd, needed to update the comm file.  Those updates make the client aware of the prior connections and the rest of the job attributes.

 

12:08:41.473 [3429] <2> async_connect: ... connect in progress 1 0x1

12:08:51.483 [3429] <2> async_connect: ... connect in progress 2 0x2

12:09:01.490 [3429] <2> async_connect: ... connect in progress 3 0x3

12:09:11.500 [3429] <2> async_connect: ... connect in progress 4 0x4

12:09:11.501 [3429] <2> logconnections: BPCD CONNECT FROM 4.4.4.2.59704 TO 4.4.4.3.1556 fd = 12

...snip...

12:09:11.503 [3429] <2> async_connect: ... connect in progress 1 0x1

12:09:21.513 [3429] <2> async_connect: ... connect in progress 2 0x2

12:09:31.524 [3429] <2> async_connect: ... connect in progress 3 0x3

12:09:41.534 [3429] <2> async_connect: ... connect in progress 4 0x4

12:09:41.535 [3429] <8> do_pbx_service: ... via PBX VNETD CONNECT FROM 4.4.4.2.54019 TO 4.4.4.3.1556 fd = 13

...snip...

12:09:43.001 [3429] <2> bpbrm write_msg_to_progress_file: INF - Data socket = mymm-bk.IPC:/tmp/vnet-...

12:09:43.005 [3429] <2> bpbrm write_msg_to_progress_file: INF - Name socket = mymm-bk.IPC:/tmp/vnet-...

12:09:43.006 [3429] <2> bpbrm write_msg_to_progress_file: INF - Job id = 271461

12:09:43.006 [3429] <2> bpbrm write_msg_to_progress_file: INF - Backup id = myclient_1393869869

12:09:43.010 [3429] <2> bpbrm write_msg_to_progress_file: INF - New data socket = mymm-bk.IPC:/tmp/vnet-...

12:09:43.010 [3429] <2> bpbrm write_msg_to_progress_file: INF - Encrypt = 0

12:09:43.011 [3429] <2> bpbrm write_msg_to_progress_file: INF - Use FAT shared memory

12:09:43.011 [3429] <2> bpbrm write_msg_to_progress_file: INF - Compression = 0

…snip…

 

Once the dbclient is connected to the Data, Name, and New Data connections, it sends the BACKUP START and bpbrm then starts bptm.  Thereafter the data transfer is completed in a few seconds.

 

12:09:43.210 [3429] <2> set_job_details: Tfile (271461): LOG 1393870183 4 bpbrm 3429 accepted connection from client

12:09:43.250 [3429] <2> bpbrm main: from client myclient: read client start message

12:09:43.250 [3429] <2> bpbrm spawn_child: /usr/openv/netbackup/bin/bptm ...

12:09:43.251 [3429] <2> set_job_details: Tfile (271461): LOG 1393870183 4 bpbrm 3429 bptm pid: 6553

...snip...

12:09:46.994 [3429] <2> bpbrm mm_sig: received ready signal from media manager

12:09:52.265 [3429] <2> bpbrm main: from client myclient: change timeout to 10800

12:09:52.357 [3429] <2> set_job_details: ... dbclient(pid=25892) wrote first buffer(size=262144)

12:09:54.167 [3429] <2> bpbrm main: ADDED FILES TO DB FOR myclient_1393869869 1 /arc.DB.s5299.p1.t841233867

12:09:54.167 [3429] <2> set_job_details: ... dbclient waited 3 times for empty buffer, delayed 46 times

12:09:54.167 [3429] <2> bpbrm main: client myclient EXIT STATUS = 0: the requested operation was successfully completed

...snip...

12:09:54.167 [3429] <2> ConnectionCache::connectAndCache: Acquiring new connection for host mymaster, query type 78

12:09:54.170 [3429] <2> logconnections: BPDBM CONNECT FROM 1.1.1.2.60701 TO 1.1.1.1.1556 fd = 5

...snip...

12:09:54.236 [3429] <2> bpbrm wait_for_child: start

12:09:57.750 [3429] <2> bpbrm wait_for_child: child exit_status = 0 signal_status = 0

12:09:57.750 [3429] <2> bpbrm wait_for_child: child exit normal

12:09:57.750 [3429] <2> bpbrm main: validating image for client myclient

12:09:57.750 [3429] <2> ConnectionCache::connectAndCache: Acquiring new connection for host mymaster, query type 74

12:09:57.752 [3429] <2> logconnections: BPDBM CONNECT FROM 1.1.1.2.57337 TO 1.1.1.1.1556 fd = 5

...snip...

12:09:57.837 [3429] <2> bpbrm Exit: client backup EXIT STATUS 0: the requested operation was successfully completed

 

 

The bptm debug log does not show any delays in this example, because it connects only to the master server and not to the client host.  It confirms that the data transfer took only a few seconds.  If the master was on the 2.2.2.0/24 network, then bptm would have delays making both legacy and CORBA connections, 10 and 5 seconds respectively.

 

12:09:43.679 [6553] <2> bptm: INITIATING (VERBOSE = 5): -w -c myclient ...

...snip...

12:09:44.392 [6553] <2> io_init: child delay = 1, parent delay = 1 (milliseconds)

...snip...

12:09:46.994 [6553] <4> write_backup: begin writing backup id myclient_1393869869, ...

...snip...

12:09:52.359 [6553] <2> write_data: received first buffer (262144 bytes), begin writing data

...snip...

12:09:54.176 [6553] <2> write_data: writing short block, 32768 bytes, remainder 0

12:09:54.176 [6553] <2> write_data: waited for full buffer 206 times, delayed 5672 times

 

 

The sequence of events, shown by bpbrm above, can also be seen in the bptestbpcd output on the media server.

 

mymm$ bptestbpcd -client myclient -debug

 

Cause

The delays are due to the number of available network interfaces on the host, combined with having NetBackup configured to not use at least one of the interfaces.  By prohibiting an interface, the operating system (OS) cannot be allowed to pick ANY source binding and instead NetBackup must try to make each outbound connection from each of the remaining local interfaces until one is successful.

 

In this example, the master server is network routable out the first interface (1.1.1.0/24) so there are no delays connecting to that host.  But the client is not network routable via the first three interfaces (1.1.1.0/24, 2.2.2.0/24, or 3.3.3.0/24).  For legacy connections this results in a 10 second delay per non-routable interface (30 seconds total), multiplied by the number of legacy connections (9) that need to be established to the client.  As a result there are 270 seconds of connection delay to perform a data transfer that takes less than 10 seconds.

 

The delay for outbound CORBA connections is 5 seconds per non-routable source interface when connecting to the destination, multiplied by the number of unique processes on the remote host to which connections must be made.

 

The details are shown in the bptestnetconn connection (-c) output from bptestnetconn targeting the client name as the host (-H) and including the configured servers (-s) list. The connection (CN) delay shown is for a CORBA connection.

 

mymm$ bptestnetconn -v6 -a -p -t5 -o10 -c -s -Hmyclient

adding hostname = myclient

SERVER = mymaster

SERVER = mymm

...snip...

FL: myclient -> 4.4.4.3  :  1 ms FAST (< 5 sec) SRC: 1.1.1.2,2.2.2.2,3.3.3.2,4.4.4.2

FL: mymaster -> 1.1.1.1  :  1 ms FAST (< 5 sec) SRC: 1.1.1.2,2.2.2.2,3.3.3.2,4.4.4.2

FL:     mymm -> 1.1.1.2  :  0 ms FAST (< 5 sec) [local] SRC: ANY

...snip...

CN:  myclient :  15010 ms [SUCCESS] PBX: Yes VNETD: Yes BPCD: Yes

CN:  mymaster :      3 ms [SUCCESS] PBX: Yes VNETD: Yes BPCD: Yes

CN:      mymm :      2 ms [SUCCESS] PBX: Yes VNETD: Yes BPCD: Yes

CN: localhost :      3 ms [SUCCESS] PBX: Yes VNETD: Yes BPCD: Yes

...snip...

Displaying all locally configured IP addresses

and if these addresses are found in the DNS

 

Address      Local  DNS Entry

-------      -----  ----------

192.168.1.1  Yes    No - private/autoconfigured

127.0.0.1    Yes    No - private/autoconfigured

5.5.5.2      Yes    No - private/autoconfigured

4.4.4.2      Yes    No - private/autoconfigured

3.3.3.2      Yes    No - private/autoconfigured

2.2.2.2      Yes    No - private/autoconfigured

1.1.1.2      Yes    Yes      

 

PREFERRED_NETWORK rules are applied in this order:

[0] PREFERRED_NETWORK = 5.5.5.2 PROHIBITED

[1] PREFERRED_NETWORK = 192.168.1.1 PROHIBITED

 

The delay for each outbound legacy connection attempt is 10 seconds to allow for TCP try and retry the connection on networks that drop a few packets.  Depending on which local network interfaces are routable to the target host, the delay to each specific host may vary.  Typically bpbrm will connect to bpcd at least twice when starting a backup or restore, resulting in 40 seconds of delay (bpcd, vnetd, bpcd, vnetd).  The delays will be multiplied if there are additional non-routable interfaces.

 

Application backup jobs, such as the one above, make additional connections and incur additional delays.

 

Solution

These delays can be avoided by using different or additional Preferred Network configurations.

 

If there is a local interface that should not be used, replace the prohibited interface with a match setting for that network to specify the local source binding.   Typically this will be the default route on the host, but could be another interface.   For the situation above:

 

PREFERRED_NETWORK = 192.168.1.0/24 MATCH 1.1.1.2

PREFERRED_NETWORK = 5.5.5.0/24 MATCH 1.1.1.2

 

Results in the following source binding lists, which let the OS routing table pick the source interface binding:

 

FL: myclient -> 4.4.4.3   :   1 ms FAST (< 5 sec) SRC: ANY

FL: mymaster -> 1.1.1.1   :   1 ms FAST (< 5 sec) SRC: ANY

FL:      mymm -> 1.1.1.2   :   0 ms FAST (< 5 sec) [local] SRC: ANY

 

 

Alternatively, match settings can be implemented for the specific locally attached networks, with a catch-all that uses the default route.

 

PREFERRED_NETWORK = 2.2.2.0/24 MATCH 2.2.2.2

PREFERRED_NETWORK = 3.3.3.0/24 MATCH 3.3.3.2

PREFERRED_NETWORK = 4.4.4.0/24 MATCH 4.4.4.2

PREFERRED_NETWORK = 0/0 MATCH 1.1.1.2

 

Resulting in the following source binding lists:

 

FL: myclient -> 4.4.4.3   :   1 ms FAST (< 5 sec) SRC: 4.4.4.2

FL: mymaster -> 1.1.1.1   :   1 ms FAST (< 5 sec) SRC: 1.1.1.2

FL:      mymm -> 1.1.1.2   :   0 ms FAST (< 5 sec) [local] SRC: ANY

 


Applies To

NetBackup 7.x server and client hosts with multiple network interfaces

Was this content helpful?