Using Server Read Timeout to accommodate Oracle delays during RMAN non-PROXY backup and restore operations
Problem
During backup and restore operations, Oracle RMAN makes repeated calls to the System Backup Tape (SBT) API. Typically, there is minimal delay between the API calls. However, when using large data files, especially in bigfile tablespaces, the delays may be lengthy in some Oracle configurations. Lengthy delays make it difficult for NetBackup to know if Oracle is hung versus just delayed. If the delay is excessively long, then NetBackup will fail the job.
The Server Read Timeout has been added to NetBackup 7.6.1.1 and above to make it easier to accommodate specific backup jobs that might encounter an unusually long Oracle delay.
Error Message
A verbose = 6 dbclient debug log will show the Oracle delays. In these examples the delays are minimal as the Oracle instance is well tuned, but longer delays at the indicated points may cause a failure.
For a backup, the Oracle delays related to the transfer of the backup image may be observed at these points after the storage unit is ready to receive the backup image.
08:36:00.447 [23189] <2> sbtbackup: INF – leaving
…waiting for Oracle to call sbtwrite2() and provide the first buffer of the backup piece.
08:36:00.454 [23189] <2> int_WriteData: INF - writing buffer # 1 of size 262144
08:36:00.454 [23189] <2> xbsa_SendData: INF - entering
08:36:00.454 [23189] <4> VxBSASendData: INF - entering SendData.
08:36:00.454 [23189] <4> dbc_put: INF - sending 262144 bytes
08:36:00.454 [23189] <2> writeToServer: bytesToSend = 262144 bytes
08:36:00.455 [23189] <2> writeToServer: bytesSent = 262144 bytes
08:36:00.455 [23189] <4> dbc_put: INF - sent: 262144 Bytes
08:36:00.455 [23189] <2> xbsa_SendData: INF - leaving (0)
…waiting for Oracle to call sbtwrite2() and provide more of the backup piece,
…or sbtclose2() if the entire piece has been provided.
08:36:00.456 [23189] <2> xbsa_SendData: INF - entering
08:36:00.456 [23189] <4> VxBSASendData: INF - entering SendData.
08:36:00.456 [23189] <4> dbc_put: INF - sending 262144 bytes
08:36:00.456 [23189] <2> writeToServer: bytesToSend = 262144 bytes
…snipped the rest of the data transfer…
08:36:00.497 [23189] <2> writeToServer: bytesSent = 262144 bytes
08:36:00.497 [23189] <4> dbc_put: INF - sent: 9961472 Bytes
08:36:00.497 [23189] <2> xbsa_SendData: INF - leaving (0)
… waiting for Oracle to call sbtclose2() and end the data transfer.
08:36:00.497 [23189] <2> sbtclose2: INF - entering
08:36:00.497 [23189] <2> int_CloseImage: INF - entering
08:36:00.497 [23189] <2> int_CloseImage: INF - Backup - closing <cntrl_46_1_869474153>
For a restore, the Oracle delays related to the transfer of the backup image may be observed at these points after the storage unit is ready to transmit the backup image.
13:07:50.995 [8637] <2> sbtrestore: INF - leaving
…waiting for Oracle to call sbtread2() to receive the first buffer of the backup piece.
13:07:50.995 [8637] <2> int_ReadData: INF - reading buffer # 1 of size 262144
13:07:50.995 [8637] <2> xbsa_GetData: INF - entering
13:07:50.995 [8637] <4> VxBSAGetData: INF - entering GetData.
13:07:50.995 [8637] <4> dbc_get: INF - reading 262144 bytes
13:07:50.995 [8637] <4> readFromServer: entering readFromServer.
13:07:50.995 [8637] <4> readFromServer: INF - reading 262144 bytes
13:07:50.996 [8637] <4> dbc_get: INF - bytesRead = 262144
13:07:50.996 [8637] <4> dbc_get: INF - ObjectSize: 9699328 Bytes
13:07:50.996 [8637] <2> xbsa_GetData: INF - leaving (0)
13:07:50.996 [8637] <2> int_ReadData: INF - exiting sbtread2 : Read <1> buffers, <262144> bytes
…waiting for Oracle to call sbtread2() to receive more of the backup piece.
…or sbtclose2() if it has received as much of the piece as it needs.
13:07:50.996 [8637] <2> xbsa_GetData: INF - entering
13:07:50.996 [8637] <4> VxBSAGetData: INF - entering GetData.
13:07:50.996 [8637] <4> dbc_get: INF - reading 262144 bytes
13:07:50.997 [8637] <4> readFromServer: entering readFromServer.
13:07:50.997 [8637] <4> readFromServer: INF - reading 262144 bytes
…snipped the rest of the data transfer…
13:07:51.050 [8637] <4> dbc_get: INF - bytesRead = 262144
13:07:51.050 [8637] <4> dbc_get: INF - ObjectSize: 0 Bytes
13:07:51.050 [8637] <4> dbc_get: INF - successful END of DATA: </cntrl_46_1_869474153>
13:07:51.050 [8637] <2> int_ReadData: INF - exiting sbtread2 : Read <38> buffers, <9961472> bytes
13:07:51.051 [8637] <2> xbsa_GetData: INF - entering
13:07:51.051 [8637] <4> VxBSAGetData: INF - entering GetData.
13:07:51.051 [8637] <4> dbc_get: INF - reading 0 bytes
13:07:51.051 [8637] <4> dbc_get: /cntrl_46_1_869474153
13:07:51.051 [8637] <2> xbsa_GetData: INF - leaving (18)
13:07:51.051 [8637] <2> int_ReadData: INF - Got end-of-file
…snipped NetBackup end of transfer processing…
13:07:51.052 [8637] <2> sbterror: INF - entering
13:07:51.052 [8637] <2> sbterror: INF - Error=7504: Got end-of-file
13:07:51.052 [8637] <2> sbterror: INF - leaving
… waiting for Oracle to call sbtclose2() and end the data transfer.
13:07:51.109 [8637] <2> sbtclose2: INF - entering
13:07:51.109 [8637] <2> int_CloseImage: INF - entering
13:07:51.109 [8637] <2> int_CloseImage: INF - Restore - closing <cntrl_46_1_869474153>
If an Oracle delay is excessively long, bpbrm will not see any forward progress in the job and the Client Read Timeout will expire causing the job to fail with status 13. The Job Details will show both the timeout while reading from the network connection to the client and the status 13.
09:50:33 - begin writing
09:50:38 - Info dbclient (pid=23339) dbclient(pid=23339) wrote first buffer(size=262144)
13:09:40 - Error bpbrm (pid=23364) socket read failed: errno = 62 - Timer expired
13:09:42 - Error bptm (pid=23384) media manager terminated by parent process
13:10:01 - Info dbclient (pid=23339) done. status: 13: file read failed
13:10:01 - end writing; write time: 3:19:28
file read failed (13)
Solution
Note: Server Read Timeout applies only to Oracle RMAN backup images where the backup was performed without the PROXY option.
Note: Adjusting the Server Read Timeout is only useful if there is a temporarily delay in the flow of data during a backup or restore. If the transfer of the backup image actually hangs, increasing the timeout will merely delay detection of the hang condition.
Note: Before adjusting Server Read Timeout, always review the dbclient debug log - as noted above - to confirm that the delay is during the backup image transfer. Adjusting the timeout will not work-around job delays that occur either prior to or after the transfer.
NetBackup 7.6.1.1 implements the Server Read Timeout configuration setting on the client host. This setting can be configured to instruct the dbclient to lengthen or shorten the timeout used by the media server when waiting for progress status updates from the client during transfer of the backup image.
Typically, this timer should not need to be increased because Oracle should make SBT API calls with minimal delays. However, it may be useful to adjust the timeout under the following conditions:
· Temporarily increase the timeout to allow backup or restore to complete while isolating the cause of the Oracle delays between SBT API calls.
· Permanently increase the timeout if the Oracle delays cannot be minimized.
· Permanently decrease the timeout (default is 3 hours) if it is preferred to have NetBackup fail the job sooner and release the storage unit if the Oracle delays are expected to be non-recoverable or should not be allowed to persist beyond a shorter maximum length of time.
The setting is configured on the client host, and specifies a value in seconds. The setting can be configured in four different locations, which determines the scope of the jobs that will use the value. The order of precedence is top to bottom
· On the RMAN SEND statement. The value will be used by all backup or restore operations that use the previously allocated channels.
ALLOCATE CHANNEL ch00 … ;
ALLOCATE CHANNEL ch01 … ;
SEND ‘NB_ORA_SERVER_READ_TIMEOUT=777;
BACKUP TABLESPACE …. ;
· On each RMAN channel allocation statement. The value will be used by all backup or restore operations that use that specific channel.
ALLOCATE CHANNEL … PARMS='ENV=(NB_ORA_SERVER_READ_TIMEOUT=666)’;
BACKUP TABLESPACE …. ;
· On UNIX, in the $HOME/bp.conf file of the Oracle user. The value will be used by all RMAN non-PROXY backup or restore operations by that Oracle user.
SERVER_READ_TIMEOUT = 555
· In the NetBackup configuration, bp.conf file or registry. The value will be used by all RMAN non-PROXY backup and restore operations on the client host.
From the master server: push the setting to the client
echo SERVER_READ_TIMEOUT=444 | bpsetconfig –h myoraclient
It is recommended to use the NB_ORA_SERVER_READ_TIMEOUT instead of the SERVER_READ_TIMEOUT. That way the value can be tuned to the specific backup job (typically a specific data file or tablespace) for a specific Oracle instance, and will not affect other jobs for other instances.
Note: The media server will honor the Server Read Timeout value received from the client only if the value is greater than the Client Read Timeout already configured on the media server. Typically timeouts should be set as low as possible to prevent holding idle resources that could be used by other jobs that are able to progress.
Note: It is recommended to use distinct and easily recognized values like those above, and avoid using values that are the same as normal NetBackup default values such as 300, 900, 1800, 3600, 7200, and 10800. That way, when/if a timeout does occur, the configuration setting that limited/detected the delay can be more easily identified.
If the NetBackup version is prior to NetBackup 7.6.1.1, adjusting the Client Read Timeout setting on the media server can be used as a work-around for the same issues. However, that change will affect all jobs using the media server.
Applies To
Oracle RMAN stream-based backup pieces, i.e. when the backup was taken without the PROXY option.
NetBackup 7.6.1.1 client and higher