Tuning SLP/lifecycle parameters for best duplication from disk to tape performance

Article: 100014136
Last Published: 2014-10-01
Product(s): NetBackup

Problem

Duplication to tape from disk storage units such as AdvancedDisk, MSDP, and other OST technologies may not be performing optimally. This can cause:

  • Excessive tape mounts
  • Excessive drive wear
  • High SLP duplication backlogs
  • High number of queued duplication jobs
  • Backup jobs running out of time in the start windows; status 196s

This article covers tuning Storage Lifecycle Policy (SLP)/lifecycle parameters, IO streams, job limits, SLP windows, and other resource management methods to optimize disk-to-tape duplication performance. Possible hardware issues, driver updates, storage file system issues, and similar factors should be explored in combination with these steps.

Error Message

No error is directly caused by poor duplication performance, but some indirect failures could include:

  • Status code 13
  • Status code 23
  • Status code 24
  • Status code 83
  • Status code 84
  • Status code 191
  • Status code 196

Cause

 

The default SLP/lifecycle parameters and other resource management settings are not optimized in a way that makes best use of tape hardware. These should be tuned based on the backup/duplication load, available hardware, and server resources. 

Solution

To optimize duplications to tape, the generation of the tape/drive needs to be known. Knowing the maximum write speed of the tape drive in relation to the read speed (IO) of the source storage unit helps determine how many drives should be allocated to the storage unit. The native capacity and maximum theoretical write speed of some tape drive generations are listed below.

  • LTO1 = 100 GB, 20 MB/sec
  • LTO2 = 200 GB, 40 MB/sec
  • LTO3 = 400 GB, 80 MB/sec
  • LTO4 = 800 GB, 120 MB/sec
  • LTO5 = 1536 GB, 140 MB/sec
  • LTO6 = 2560 GB, 160 MB/sec
  • LTO7 = 6144 GB, 300 MB/sec

For example, if a given MSDP storage server can rehydrate at 100 MB/sec on average, it would be impractical to allocate four LTO4 drives, which can write at a combined 480 MB/sec, for duplication use. Because reads from MSDP are slower, the tape write buffer empties faster than it can be filled and the drive has to stop writing. When the next buffer is full and data can be written again, the drive must reposition the tape back to where it last wrote. This back-and-forth motion is known as shoe-shining and significantly increases drive wear, shortening the life span of the drive. If using dedicated storage units for duplication, the "max write drives" parameter can be used to control how many drives can be used at any given time.

Note: Allocate tape drives for duplication so that the combined write speed is as close as possible to the average read speed from disk. Do not allocate more drives for duplication than the source storage unit can keep supplied with data.
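The drive allocation rule above can be sketched as simple arithmetic. The following Python snippet is an illustration only; the function name drives_for_duplication is hypothetical, and it assumes the theoretical write speeds listed earlier and a measured average read speed from the source storage unit.

```python
import math


def drives_for_duplication(source_read_mb_s: float, drive_write_mb_s: float) -> int:
    """Largest drive count whose combined write speed does not exceed
    the source read speed. At least one drive is always returned,
    since duplication needs a drive even when the source is slower."""
    if source_read_mb_s <= 0 or drive_write_mb_s <= 0:
        raise ValueError("speeds must be positive")
    return max(1, math.floor(source_read_mb_s / drive_write_mb_s))


# MSDP rehydrating at 100 MB/sec, duplicating to LTO4 (120 MB/sec per drive)
print(drives_for_duplication(100, 120))  # 1
# A faster source reading at 400 MB/sec could feed three LTO4 drives
print(drives_for_duplication(400, 120))  # 3
```

When the result is 1 and the source is still slower than a single drive, as in the first call, some shoe-shining may remain and the other tuning steps in this article become more important.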

Next, the SLP parameters (7.6) or LIFECYCLE_PARAMETERS (7.5 and older) values should be considered. If the file doesn't exist on the master server (/usr/openv/netbackup/db/config/LIFECYCLE_PARAMETERS for 7.5 and older) then the default values are being used. The defaults which are most relevant in duplication performance are listed below.

  • DUPLICATION_SESSION_INTERVAL_MINUTES 5
  • MAX_GB_SIZE_PER_DUPLICATION_JOB 25
  • MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB 30
  • MIN_GB_SIZE_PER_DUPLICATION_JOB 8

With the default parameters, there would be a separate duplication job for any images larger than 25 GB. There may also be smaller batches of images submitted if the 30 minute timer is reached to force a small duplication. This leads to a buildup of jobs in the queue which can prevent backup, restore, and other jobs from going active, leading to status 196s. It also dismounts and reallocates media between duplications, leading to large amounts of time being wasted where data isn't being written. This can be avoided by changing the min/max batch sizes nbstserv uses to create image batches to submit for duplication. Using the same example above with LTO4 drives at 800 GB native capacity, consider the new parameter values below.

  • MAX_GB_SIZE_PER_DUPLICATION_JOB 800
  • MIN_GB_SIZE_PER_DUPLICATION_JOB 200

Note: Set the max size to the native capacity of the tape and the min size to a fraction of the max size, preferably 1/4 or 1/3.
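As a rough illustration of this sizing rule, the Python sketch below derives the two batch sizes from the native tape capacity and estimates how many duplication jobs it could take to fill one tape. The function names are hypothetical. The estimate assumes one tape mount per duplication job, every job at the minimum batch size, and no hardware compression.

```python
import math


def batch_sizes_gb(tape_native_gb: int, divisor: int = 4) -> tuple:
    """Return (min_gb, max_gb): max is the native tape capacity,
    min is that capacity divided by `divisor` (4 or 3 suggested)."""
    if tape_native_gb <= 0 or divisor <= 0:
        raise ValueError("capacity and divisor must be positive")
    return tape_native_gb // divisor, tape_native_gb


def worst_case_jobs_per_tape(tape_native_gb: int, min_batch_gb: int) -> int:
    """Jobs needed to fill a tape if every job is the minimum size."""
    return math.ceil(tape_native_gb / min_batch_gb)


min_gb, max_gb = batch_sizes_gb(800)          # LTO4 native capacity
print(min_gb, max_gb)                         # 200 800
print(worst_case_jobs_per_tape(800, min_gb))  # 4
print(worst_case_jobs_per_tape(800, 8))       # 100 with the default 8 GB minimum
```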

These values will force nbstserv to create a batch of images between 200 and 800 GB. The upper value is enough to fill the tape (without factoring hardware compression) and the minimum value is high enough (factor of 4) to minimize the number of times the tape is mounted before it is full (ideally 4). However, with the other two settings at their defaults, this can lead to another problem. Depending on the backup load and size of the SLP backlog, nbstserv might not have enough time to create optimally sized image batches to submit which reduces the effectiveness of changing these two values alone. It would be ideal to allow more time to process the backlog and create batches closer to the max size parameter, so consider increasing the force and session intervals. 

  • DUPLICATION_SESSION_INTERVAL_MINUTES 15
  • MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB 120

Note: This might initially create a higher backlog of images pending SLP processing. The SLP backlog can be monitored by running "nbstlutil report" from the master server. However, the benefits of these changes typically outweigh the initial rise in backlog. A baseline aggregate throughput figure should be determined before changes are made and a second (or more) measurement should be taken after to measure the effectiveness of these changes. Additional adjustments may be needed to further improve performance. 

With these values, jobs are submitted at most once every 15 minutes, which better regulates the duplication load, such as IO streams to a disk pool. Increasing the force interval as shown above allows an image to remain in the SLP queue for up to two hours before a duplication is initiated. This gives nbstserv extra time to batch as many images as possible together and create job sizes closer to the max GB size parameter. If the backup images in the environment are relatively small, or an optimal batch size might take longer than two hours to build, consider raising these values further.
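To make the combined result concrete, the sketch below renders the four tuned values as text. It is only an illustration: it assumes the LIFECYCLE_PARAMETERS file (7.5 and older) holds one "NAME VALUE" pair per line, as the lists in this article suggest, and the function name is hypothetical. It prints the text rather than writing to the master server, so the output should be reviewed before it is copied into place.

```python
def render_lifecycle_parameters(params: dict) -> str:
    """Render parameters as 'NAME VALUE' lines (assumed file layout)."""
    min_gb = params.get("MIN_GB_SIZE_PER_DUPLICATION_JOB")
    max_gb = params.get("MAX_GB_SIZE_PER_DUPLICATION_JOB")
    # Guard against an inverted min/max pair before rendering.
    if min_gb is not None and max_gb is not None and min_gb > max_gb:
        raise ValueError("minimum batch size exceeds maximum batch size")
    return "".join(f"{name} {value}\n" for name, value in params.items())


# Values from the LTO4 example in this article
tuned = {
    "DUPLICATION_SESSION_INTERVAL_MINUTES": 15,
    "MAX_GB_SIZE_PER_DUPLICATION_JOB": 800,
    "MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB": 120,
    "MIN_GB_SIZE_PER_DUPLICATION_JOB": 200,
}
print(render_lifecycle_parameters(tuned), end="")
```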

Note: NetBackup 7.6 has moved LIFECYCLE_PARAMETERS into the GUI where they can be easily changed. These can be found under Host Properties > Master Server > SLP Parameters. The property names are similar to the LIFECYCLE_PARAMETERS file but note that the default max size value has changed from 25 GB to 100 GB. 

  • Minimum size per duplication job 8 GB
  • Maximum size per duplication job 100 GB
  • Force interval for small job 30 minutes
  • Image processing interval 5 minutes

Another factor to consider is the IO stream limit on the disk pool where the images are being read from. For most MSDP and PureDisk pools, the ideal value is around 60, but it can vary depending on the hardware and features being used. With a FibreTransport media server, for example, the ideal value is lower. Generally, the lower the IO stream limit, the better the individual stream performance. A lower limit also reduces memory usage on the storage server and reduces random IOs.

Note: Limit IO streams to a value appropriate for the storage server's load and feature set. Consider lowering max jobs for the storage unit (STU) to 2/3 of the IO stream limit on the disk pool. This effectively reserves streams for restore and duplication jobs.
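The 2/3 guideline works out as follows. This Python sketch is only a worked example of the arithmetic; the function name is hypothetical and the result is rounded down to a whole number of jobs.

```python
def max_jobs_for_stu(io_stream_limit: int) -> int:
    """Two thirds of the disk pool IO stream limit, rounded down,
    with a floor of one job."""
    if io_stream_limit <= 0:
        raise ValueError("IO stream limit must be positive")
    return max(1, (2 * io_stream_limit) // 3)


limit = 60                       # typical MSDP/PureDisk value cited above
jobs = max_jobs_for_stu(limit)
print(jobs)                      # 40 backup jobs on the STU
print(limit - jobs)              # 20 streams left for restores/duplications
```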

Another resource management feature, new in 7.6, is the ability to create SLP start windows, which can be used to defer processing. For example, an environment performs incremental backups Monday through Friday starting at 8 PM, which typically complete before 4 AM. Fulls run on Saturday starting at 6 AM and typically complete by 6 PM on Sunday. This leaves an opportunity on weekdays between 4 AM and 8 PM where duplication jobs can be performed without disrupting backups or clients, making this an ideal scenario for using an SLP start window.

Disk arrays perform better when there isn't a mixture of read and write IOs. As backups are write operations and duplication jobs are read operations, it is preferable to separate these as much as possible to make the most efficient use of the disk array, specifically caching on the RAID controller. Both backup and restore/duplication performance have been shown to improve substantially when SLP windows are implemented in this fashion, because all environment resources are dedicated to backups in the backup window and to restores/duplications in the SLP window. See 000019536 for more information on configuring SLP windows in 7.6.

Note: Review the schedules and job run times for each day of the week. Look for "quiet times" of at least three hours which could be used for an SLP window and duplication session. 
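One way to look for such quiet times is to list the busy backup periods for a day and compute the gaps between them. The Python sketch below is a simplified illustration with a hypothetical function name. It assumes busy periods are given as whole hours within a single 24-hour day, so a run that crosses midnight is split into two entries.

```python
def quiet_windows(busy, min_hours=3):
    """Return gaps of at least `min_hours` between busy periods.

    `busy` is a list of (start_hour, end_hour) tuples within one day,
    with 0 <= start_hour < end_hour <= 24.
    """
    gaps = []
    cursor = 0  # end of the latest busy period seen so far
    for start, end in sorted(busy):
        if start - cursor >= min_hours:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if 24 - cursor >= min_hours:
        gaps.append((cursor, 24))
    return gaps


# Weekday from the example: incrementals busy from midnight to 4 AM
# (previous night's run) and again from 8 PM to midnight.
print(quiet_windows([(20, 24), (0, 4)]))  # [(4, 20)]
```

Here the 4 AM to 8 PM gap is the candidate SLP window. Real schedules should be checked against actual job run times rather than planned start windows alone.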

Note: Applying an SLP start window to an SLP creates a new version of the SLP. Therefore, all images written to the queue before the window was applied will use the previous configuration or default 24x7 window. If there are many images in the SLP queue at the time the window was applied, the entire backlog will need to be processed before the benefits of this change are realized. Alternatively, the start window can be added to previous versions of the SLP using "nbstl -modify_version."

Note: If there is a high number of queued or active SLP jobs, changes to the parameters above will not apply to them. These jobs can all be selected and cancelled in the activity monitor and will retry per SLP logic. When they are cancelled, nbstserv will read its parameter set and rebatch the images using the new values. This can lead to faster realization of the new values and reduce the backlog sooner. If no action is taken, the existing jobs will be processed using the previous values.

 

Applies To

This behavior is more pronounced in larger environments with multiple storage units where hundreds to thousands of jobs are run each day, but could also be seen in smaller environments.
