Problem
Both NetBackup and Oracle Recovery Manager (RMAN) maintain independent catalogs that track which backups have occurred and when those backups expire. Keeping the catalogs in sync is necessary for efficient and effective operations.
Background Information
Backup piece names
During a backup, RMAN identifies data and objects to be backed up and associates them with a backup piece; sometimes one-to-one, sometimes many-to-one. As part of a backup, RMAN may create one or more backup pieces. The Oracle recovery catalog tracks the backup piece names, contents, and status. RMAN also provides the piece name to the SBT API when interacting with third-party backup software such as NetBackup for Oracle.
Thus each piece name must be unique so that it accurately represents the contents that will be needed during a restore. For AUTOBACKUP CONTROLFILE operations, RMAN automatically generates a prospective piece name and then queries the SBT API (to make sure the name is not already known to the backup software) before using the new piece name. For other backup operations, the RMAN configuration or input should include a FORMAT statement which results in a unique piece name.
Following each backup, RMAN will immediately query the SBT API to ensure the piece name or names are known and restore-able. RMAN does not consider a backup to be complete unless this step is also successful.
RMAN to SBT API (NetBackup)
In the case of a normal (stream-based) backup, RMAN does not inform NetBackup which objects are in the backup, and each backup piece is a separate NetBackup job and backup image. This allows NetBackup to close, catalog, and validate the backup image before RMAN queries to confirm it is restore-able.
In the case of proxy (file-based) backups, RMAN provides a backup piece name for each file that it requests NetBackup to include in the backup. In this case, NetBackup knows the file/object name associated with each piece, and the backup job and associated image may contain one or more backup pieces depending on whether it is a backup of a single file, an entire database, or an entire disk volume.
As part of Oracle catalog maintenance, RMAN can both query NetBackup to confirm that a backup piece is still known and available for restore (crosscheck) and also request NetBackup to discard the backup image associated with a backup piece (delete). Either request requires NetBackup to locate the backup image that contains the piece of interest.
Searching for a piece name
The search for a piece name that already exists is usually quick and efficient. The search for a piece name that is either not yet known to NetBackup (a new, prospective, piece name) or no longer known to NetBackup (was in a backup image that has already expired) requires a broader search. Either can be less efficient, depending on how the backup and catalog comparison operations are configured and performed.
NetBackup saves the following information about the backup piece and the backup image within which it is stored:
- the piece name
- the ownership; user & group for the Oracle process performing the backup
- the access permissions; default is BKUP_IMAGE_PERM=GROUP, can be configured for USER or ANY
- the backup policy name
- the backup policy client performing the backup
- the backup time; UNIX time on the master server when the backup job changed from Queued to Active
When the NetBackup for Oracle client, on behalf of RMAN, queries the master server it must specify sufficient search criteria to allow the appropriate backup image to be either found or verified as not existing:
- the piece name
- the user and group of the requesting Oracle server process
- the browse client name; the configured NB_ORA_CLIENT or CLIENT_NAME
- an optional policy name; NB_ORA_POLICY if configured
- a time range
Because searching a wide time range can be inefficient, the client will typically search a narrow range first, and then widen the range until the piece is found or all of time has been searched. For example:
- the Oracle time, if included in the piece name at backup time, +/- 24 hours
- the current client host time, +/- 24 hours
- the prior 1-4 weeks and then prior 2-6 months
- January 1, 1970 to the present time on the master server (does not search into the future)
Note: +/- 24 hours is used to allow for reasonable variation between host clocks and time zones on the client and master server.
Note: Enabling USEDEFAULTDATERANGE will cause some queries to skip the narrow ranges. This should rarely, if ever, be needed.
If the time range searches are not successful, the client will then query the master server to get the host name by which the master server knows the client. This is the configured name (ccname) from the policy client list, and may differ from the browse client, typically a short hostname vs a fully qualified hostname vs a network hostname alias. The time ranges are then searched a second time using the configured name.
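The widening search described above can be sketched as follows. This is only an illustration of the ordering, not the NetBackup for Oracle client's actual implementation: the function names, the `lookup` callback, and the exact widths chosen for the "prior weeks" and "prior months" windows (4 weeks and 180 days) are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
DAY = timedelta(days=1)


def search_windows(now, oracle_time=None):
    """Return (start, end) ranges ordered from narrowest to widest.

    The window widths are illustrative assumptions, not documented values.
    """
    windows = []
    if oracle_time is not None:
        # Oracle time parsed from the piece name, +/- 24 hours.
        windows.append((oracle_time - DAY, oracle_time + DAY))
    # Current client host time, +/- 24 hours.
    windows.append((now - DAY, now + DAY))
    # Prior weeks, then prior months (assumed widths).
    windows.append((now - timedelta(weeks=4), now))
    windows.append((now - timedelta(days=180), now))
    # All of time up to the present; the future is never searched.
    windows.append((EPOCH, now))
    return windows


def find_piece(lookup, client_names, now, oracle_time=None):
    """Search every window for the first client name, then repeat for the next.

    `lookup(client_name, start, end)` is a hypothetical stand-in for the query
    sent to the master server; it returns an image or None.
    """
    for name in client_names:
        for start, end in search_windows(now, oracle_time):
            image = lookup(name, start, end)
            if image is not None:
                return image
    return None
```

Passing the browse client name first and the configured policy client name second mirrors the two passes described above. A piece that is no longer known to NetBackup costs every window for every name, which is why queries for expired pieces are the expensive case.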
Manually reviewing backup pieces in NetBackup
To review the backup pieces known to NetBackup, use the following command on the master server.
bplist -C <browse_client_name> -t 4 -l -R -s 01/01/1970 /
The output will show what pieces are available for restore under a specific browse client, their ownership & access permissions & size, and whether RMAN was using an appropriate FORMAT for the piece name. If the starting date is not specified, only the prior six months will be searched and displayed.
Note that in NetBackup all piece names are relative to root (/), but Oracle knows the pieces without the leading slash. Here’s an example of a query received from RMAN, what NetBackup searched for, and what was reported back to Oracle. Notice that the leading slash appears only in the middle row of output.
sbtinfo2: INF - requesting image info for <cntrl_13_1_764858031>
...
VxBSAQueryObject: INF - Object </cntrl_13_1_764858031> was found on media …
...
DumpSbtInfo: INF - Media Information for Backup File : <cntrl_13_1_764858031>
Solution
The overall goal is five-fold.
- Manage the RMAN Recovery Catalog size by not retaining records about backup pieces any longer than needed.
- Retain backup images in NetBackup until RMAN no longer needs them.
- Manage the NetBackup catalog size and storage media utilization by not retaining backup images that are no longer needed.
- Minimize the number of times RMAN queries NetBackup for any specific backup piece.
- Minimize the amount of effort it will take NetBackup to locate (or confirm non-existent) the backup image that contains a backup piece of interest to RMAN.
The best approach to managing the catalog synchronization is to use the Oracle and Veritas recommended method. This method is to set the backup retention primarily as an RMAN attribute, and secondarily as a NetBackup attribute, and then have RMAN delete the backup pieces which are obsolete, but not expired, from NetBackup. The procedures to do this are documented in the Oracle Recovery Manager Guides for Oracle versions nine and later (www.oracle.com/technology/documentation/index.html). An outline of those procedures is shown below, but refer to the NetBackup and Oracle documentation for a complete description.
- Set RMAN retention for the number or duration of backup sets. This is to keep the pieces in the RMAN catalog as long as needed. If there is no RMAN catalog then use SQL to set an appropriate value for "control_file_record_keep_time". The minimum appropriate time would be the required backup retention time plus the maximum time between catalog maintenance operations.
- Set the NetBackup retention for Oracle backup images to be longer than the RMAN retention. This allows NetBackup to retain the backup images until RMAN both determines the images to be obsolete and deletes them from both catalogs, thus avoiding exhaustive searches for piece names that are known to RMAN but no longer known to NetBackup. If RMAN backups of the Oracle instance utilize more than one NetBackup policy and/or schedule, be sure each schedule is configured with the appropriate retention. Use an infinite retention only if the backup images will either be expired manually at some future date or must be retained forever. Unnecessary bloat, accumulated over time, will slow some searches.
- Ensure that the FORMAT specified for all RMAN backup piece names, except for AUTOBACKUP CONTROLFILE, ends with a '_%t' as documented in the NetBackup for Oracle Administrator's Guide. Both the trailing '%t' and the default format for AUTOBACKUP CONTROLFILE are parse-able for the Oracle backup time, and allow NetBackup to search a narrow time range for non-expired piece names. (See Note-1 and Note-2 below.)
- On a regular and frequent basis, run the RMAN "delete obsolete" command to expire obsolete images from both the RMAN catalog & control file and also from NetBackup. This minimizes the size of both catalogs as soon as possible, and also releases the storage media used for those backups. (See Note-3 below.)
- An RMAN "crosscheck" command is typically not necessary. If used, run it after obsolete pieces have been deleted, to minimize the number of pieces that need to be checked. Use the command infrequently to avoid regularly and repeatedly querying the master server for the same pieces, especially if NetBackup retention is shorter than RMAN retention. That condition will result in exhaustive searches which consume more CPU and I/O resources on the master. In the latter case, the RMAN "delete expired backup" command should be run afterwards to make sure the same pieces are not queried, with the same exhaustive search and results, by the next crosscheck. (See also Note-4 below.)
- Stagger the initiation of RMAN catalog maintenance operations to limit the number of concurrent crosscheck or deletion requests that clients send to the NetBackup master server, especially if multiple channels are being allocated for multiple databases that back up to the same policy client. In addition, catalog maintenance should ideally not overlap with busy Oracle backup windows for the same client, since backups perform similar queries. Overlap with image duplication and replication operations is not an issue.
- Where possible, avoid the creation of many small RMAN backup pieces, especially for archivelog and controlfile backups. Modern network and storage devices can transfer gigabits of data per second, so backup piece sizes should be in the tens/hundreds/thousands of gigabytes to minimize the ratio of job overhead delay to actual backup transfer time. As an additional side effect, fewer backup piece names will be present and need to be tracked, queried, and expired. Carefully consider the RMAN configuration for the backups; fewer channels, transporting larger backup sets and pieces, filled from a larger number of files, will result in fewer pieces. This is most applicable to archivelog backups, but may be useful for database files which have poor deduplication ratios.
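The retention relationships in the first two items of this outline reduce to simple arithmetic, sketched below. The function and parameter names are hypothetical and all values are in days; the sketch only restates the two rules from the outline (the control file keep time is at least the RMAN retention plus the longest gap between catalog maintenance runs, and the NetBackup retention is longer than the RMAN retention).

```python
def minimum_keep_time_days(rman_retention_days, max_days_between_maintenance):
    """Smallest appropriate control_file_record_keep_time, in days."""
    return rman_retention_days + max_days_between_maintenance


def check_retention(rman_retention_days, netbackup_retention_days,
                    keep_time_days, max_days_between_maintenance):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if netbackup_retention_days <= rman_retention_days:
        problems.append(
            "NetBackup retention should be longer than RMAN retention")
    needed = minimum_keep_time_days(rman_retention_days,
                                    max_days_between_maintenance)
    if keep_time_days < needed:
        problems.append(
            f"control_file_record_keep_time should be at least {needed} days")
    return problems


# Example: 60-day RMAN recovery window, weekly maintenance,
# 90-day NetBackup retention, 90-day control file keep time.
print(check_retention(60, 90, 90, 7))  # prints []
```

With a 60-day recovery window and weekly maintenance, the minimum keep time works out to 67 days, so the 90-day value passes with margin.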
Additional Considerations
- Keep the clocks on the master server and client hosts set accurately, and with appropriate time zones, so that the first query with time range +/- 24 hours is successful.
- Ensure that 'bpclntcmd -pn' completes quickly, ideally sub-second, when run on each Oracle client host. If not, troubleshoot the network connectivity between the client and the master server, and the name services on both hosts, until any delays have been identified and resolved. Each request from RMAN to NetBackup for Oracle is going to cause one or more similar connections. RMAN will often be making tens, hundreds, or even thousands of requests, so the cumulative delays can be significant.
- Delete inactive policies that are unlikely to ever be returned to service. Otherwise NetBackup will be repeatedly reading them to build policy client lists.
- Ensure all policy clients are the 'right' hostnames for the client. The master server and client should generally identify the client host by the same hostname. (See Note-5 below.)
- Ensure all policy client hostnames consistently resolve quickly on the master server. This allows the master server to quickly find network aliases which might be in use by the client. (See Note-5 below.)
- Check whether other resource-intensive (CPU, memory, I/O, etc.) operations are occurring on the master server host at the same time, such as many hundreds or thousands of processes and network connections for other active backup/restore jobs, catalog backup, media server processes running on the master, or file system maintenance.
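One way to check the 'bpclntcmd -pn' response time mentioned above is to time a few runs and look at the slowest. The sketch below is a generic command timer; the helper names are hypothetical, and the path shown in the closing comment is the usual UNIX install location, which is an assumption about the local host.

```python
import subprocess
import time


def time_command(argv, timeout=60.0):
    """Run argv once and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    completed = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return time.perf_counter() - start, completed.returncode


def slowest(argv, runs=5):
    """Return the longest elapsed time, in seconds, over several runs."""
    return max(time_command(argv)[0] for _ in range(runs))


# Example call on an Oracle client host (path is an assumption):
# slowest(["/usr/openv/netbackup/bin/bpclntcmd", "-pn"])
```

Using the slowest of several runs rather than the average makes intermittent name service or connection delays visible, which matters because RMAN may repeat the request hundreds or thousands of times.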
If it is not possible to use the RMAN retention policy and "delete obsolete", then use these steps with RMAN "crosscheck" and "delete expired" commands.
- Regularly review the retention of backup images in NetBackup and ensure they are not excessively long.
- Regularly use the NetBackup bpexpdate command to expire any images that can be deleted sooner than their initial retention setting.
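- Regularly review steps 5 and 6 of the outline above (crosscheck frequency and staggered catalog maintenance) to ensure that RMAN crosscheck and delete expired are scheduled regularly, but not so often that they perform redundant operations before additional images have expired from within NetBackup.
- Regularly review recent backups to ensure steps 3 and 7 of the outline above (the '_%t' piece name FORMAT and avoiding many small backup pieces) are being followed.
- Regularly confirm steps 8-12 above (the first five items under Additional Considerations).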
Note-1: This does not benefit queries for expired images because they will still search the entire time span available when the piece is not found in the narrow time span. That is because a piece name might be formatted to include some other string of digits in the location where an Oracle time is expected, so all backup images must be searched to ensure the piece name is truly not present.
Note-2: In February of 2019, the Oracle time rolled over from 9 digits to 10 digits. NetBackup for Oracle client versions prior to NetBackup 8.2 will not recognize the new time format without having a fix applied. See the related article for details.
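The rollover date can be reproduced with a short sketch. It assumes the commonly described encoding of the Oracle '%t' stamp (seconds counted from 1988 with every month treated as 31 days); that encoding is not stated in this article and should be treated as an assumption, as should the helper names. Under that assumption the first 10-digit value, 1000000000, decodes to a date in February 2019, which agrees with the note above.

```python
import re
from datetime import datetime


def trailing_stamp(piece_name):
    """Return the digits after the last underscore as an int, or None."""
    match = re.search(r"_(\d+)$", piece_name)
    return int(match.group(1)) if match else None


def decode_stamp(stamp):
    """Decode an Oracle-style stamp into a datetime (assumed encoding).

    Returns None when the digits do not form a valid calendar date, which
    can happen if a piece name ends in digits that are not a real stamp.
    """
    rest, second = divmod(stamp, 60)
    rest, minute = divmod(rest, 60)
    rest, hour = divmod(rest, 24)
    rest, day = divmod(rest, 31)
    year, month = divmod(rest, 12)
    try:
        return datetime(1988 + year, month + 1, day + 1, hour, minute, second)
    except ValueError:
        return None


print(decode_stamp(10**9))  # prints 2019-02-12 01:46:40
print(trailing_stamp("cntrl_13_1_764858031"))  # prints 764858031
```

The None result for digits that do not decode to a valid date illustrates the point of Note-1: trailing digits in a piece name are not guaranteed to be an Oracle time, so a failed narrow search cannot prove the piece is absent.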
Note-3: For proxy backups where a single NetBackup image contains multiple RMAN backup pieces, an RMAN request to delete a piece will not result in the deletion/expiration of the image because other pieces still depend upon the image being present. If RMAN later re-catalogs the piece, NetBackup will be unaware. Hence these images will persist until the NetBackup retention expires.
Note-4: If performing crosscheck operations, be careful to ensure that the Oracle server process performing the sbtinfo API calls is running as a user or group which matches the backup piece ownership and access permissions. Otherwise, NetBackup will report the piece 'not found' and RMAN will mark it as expired, and then at a later time potentially delete its record of the supposedly expired piece. The same is true if the crosscheck uses a browse client that does not match the policy client used at the time of backup; see also Note-5 below. A similar result also occurs if NB_ORA_POLICY is configured during the crosscheck (typically it is not) and does not match the policy used for some of the backups. If NetBackup reports that a backup piece is not found, due to incorrect query criteria, Oracle 10.2+ can re-query NetBackup (with correct criteria) and re-catalog the piece using the RMAN "catalog device type 'sbt_tape' backuppiece '<piecename>'" command.
Note-5: Through normal business flow (equipment replacement or relocation, mergers & acquisitions, etc.) the hostname by which a host is known may change. If the host is backed up using a new policy client name, then the new backup images will be stored under the new name, and a crosscheck of both old and new images will only be successful for one group of images (whichever is currently configured as the NB_ORA_CLIENT or CLIENT_NAME). To avoid such failures, the new hostname can be made a client alias of the original hostname, so the crosscheck can be performed using either client name. Ideally, this would be done before any backups are taken using the new hostname, but it is also possible to merge images taken using two client names. See the related article for details.
Sample RMAN commands
SQL> show parameter control_file_record_keep_time;
SQL> alter system set control_file_record_keep_time=90 scope=both;
RMAN> show all;
RMAN> configure retention policy to recovery window of 60 days;
RMAN> list backupset of database;
RMAN> list backupset of archivelog all;
RMAN> list backupset of controlfile;
RMAN> report obsolete;
RMAN> report obsolete orphan;
RMAN> allocate channel for maintenance type 'sbt_tape';
RMAN> delete obsolete;
RMAN> delete obsolete orphan;
RMAN> crosscheck backupset of database;
RMAN> crosscheck backupset of controlfile;
RMAN> crosscheck backupset of archivelog all;
RMAN> delete expired backup;
A backup can be exempted from the retention policy with the backup or change command using the "keep" option. Be sure that these backups also use a NetBackup policy and schedule with the appropriate retention period.
RMAN> backup database keep forever logs;
RMAN> change backup of database <recordspec> keep until time 'sysdate+365' logs;