Problem
When a NetBackup master server is initially set up, it is possible to anticipate the amount of work to be performed given the constraints at that time. Months or years later, however, the master server can become overcommitted. This is characterized by missed Service Level Agreements for backups or restores, or by Storage Lifecycle Policy (SLP) processing taking longer and longer until it can no longer complete a day's worth of work within a day. The end result is that SLP processing falls further and further behind, which can have calamitous results following a restart of NetBackup, especially if a close eye is not kept on the backlog.
Error Message
After a restart of services on the NetBackup master, it can take many hours, or even days, for SLP processing to finish its recovery logic, during which time no SLP duplications are submitted. This results in an even greater backlog, making the situation worse.
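To see how large the backlog is, and whether it is shrinking after a restart, one option is to count the images whose SLP processing is still incomplete. Below is a minimal sketch, assuming the standard nbstlutil command at its default UNIX path and assuming each incomplete image appears on an "Image:" line in the -U listing; verify both points against your NetBackup version and adjust for Windows or non-default installations.

    #!/usr/bin/env python3
    # Rough gauge of the SLP backlog on a NetBackup master (sketch only).
    import subprocess

    # Default UNIX/Linux location of nbstlutil (assumption; adjust as needed).
    NBSTLUTIL = "/usr/openv/netbackup/bin/admincmd/nbstlutil"

    def incomplete_slp_images() -> int:
        """Count images whose SLP processing has not yet completed."""
        listing = subprocess.run(
            [NBSTLUTIL, "stlilist", "-image_incomplete", "-U"],
            capture_output=True, text=True, check=True,
        ).stdout
        # Output-format assumption: one "Image:" line per incomplete image.
        return sum(1 for line in listing.splitlines()
                   if line.lstrip().startswith("Image:"))

    if __name__ == "__main__":
        print("Images with incomplete SLP processing:", incomplete_slp_images())

Recording this count at the same time each day shows whether the backlog is trending up or down.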
Cause
More work is given to the master server in X hours than it can complete in X hours.
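As a simple illustration, with entirely made-up numbers: if backups create 30,000 images per day that require SLP duplication but the master can only complete 25,000 duplications per day, the backlog grows by 5,000 images every day and never recovers on its own.

    # Illustrative arithmetic only; all numbers are hypothetical.
    images_submitted_per_day = 30_000   # new images needing SLP duplication
    images_completed_per_day = 25_000   # duplications the master can finish
    daily_growth = images_submitted_per_day - images_completed_per_day
    print(f"Backlog grows by {daily_growth} images/day,"
          f" or {daily_growth * 30} images over 30 days")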
Solution
We must either reduce the amount of work, increase the ability of the master server to perform the work, or, often, both. Replacing the master server hardware with faster equipment can improve processing by large factors (2x, 3x, or more). Server technology improves from year to year, and by examining which server resources are constrained on the master, it is possible to anticipate which hardware upgrades will be the most effective.
Often, faster CPU cores provide the biggest gain: NetBackup favors fewer, faster CPUs over numerous lightweight CPUs. NetBackup does not have the characteristics of a web server, so hardware that works well for web servers will not work well for NetBackup. Faster disk drives, coupled with more attention to how the data is laid out and presented to the operating system, can improve disk I/O throughput, if that is determined to be a factor. If server RAM is committed to the point that processes are swapping, additional memory can improve performance. The subject of hardware performance tuning is beyond the scope of a short document like this.
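To get a first look at which resource is constrained, sample CPU, memory/swap, and disk activity on the master while it is busy. The snippet below is a generic sketch using the third-party psutil module (an assumption about available tooling, not a NetBackup utility); standard OS tools such as vmstat, iostat, and top expose the same information.

    #!/usr/bin/env python3
    # Rough first look at which master server resource is the bottleneck.
    # Requires the third-party psutil module (pip install psutil).
    import time
    import psutil

    # CPU: NetBackup benefits from fast cores, so watch for a few cores
    # pinned near 100% while others sit idle.
    print("CPU per core (%):", psutil.cpu_percent(interval=5, percpu=True))

    # Memory: sustained swap usage suggests more RAM would help.
    vm, swap = psutil.virtual_memory(), psutil.swap_memory()
    print(f"RAM used: {vm.percent}%  swap used: {swap.percent}%")

    # Disk: throughput over a short interval, to compare against what the
    # underlying storage is rated for.
    before = psutil.disk_io_counters()
    time.sleep(5)
    after = psutil.disk_io_counters()
    mib = 1024 * 1024
    print(f"Disk read:  {(after.read_bytes - before.read_bytes) / 5 / mib:.1f} MiB/s")
    print(f"Disk write: {(after.write_bytes - before.write_bytes) / 5 / mib:.1f} MiB/s")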
On the other side of the equation, reducing the workload can help. This does not get all of the work done; the deferred work simply is not performed. It does, however, prioritize the workload so that the mission-critical work can complete successfully.
It is possible to deactivate an SLP that manages less-than-business-critical data. This takes those backup images out of play and, if there are tens or hundreds of thousands of such images, can significantly improve SLP processing for the remaining work. As noted earlier, you cannot cut one end off a blanket and sew it to the other end to make the blanket larger: once the SLP is deactivated, this work simply will not get done. Normal SLP behavior would leave these backup images on the storage indefinitely -- or until SLP processing completes. By waiting until the majority (or all) of the backup images have passed their retention period, you reduce the amount of work NetBackup must perform to process them. Once the images have expired, you can temporarily re-activate the SLP that was deactivated; the expired images will not be duplicated or processed further. Instead, SLP processing will identify the images as having passed their retention period and will invoke normal image expiration, which requires fewer resources than completing the remaining SLP processing.
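For reference, the outline below sketches how the deactivate / wait-for-expiration / re-activate cycle could be scripted. It assumes that nbstlutil accepts the inactive and active operations with a -lifecycle argument, and the SLP name shown is hypothetical; confirm the exact command syntax for your NetBackup version before using anything like this in production.

    #!/usr/bin/env python3
    # Sketch of the deactivate / wait / re-activate cycle for one SLP.
    # Assumes nbstlutil supports "inactive" and "active" operations with a
    # -lifecycle argument; verify against your NetBackup version first.
    import subprocess
    import sys

    # Default UNIX/Linux location of nbstlutil (assumption; adjust as needed).
    NBSTLUTIL = "/usr/openv/netbackup/bin/admincmd/nbstlutil"

    def set_slp_processing(slp_name: str, active: bool) -> None:
        """Suspend or resume SLP processing for images under one SLP."""
        operation = "active" if active else "inactive"
        subprocess.run([NBSTLUTIL, operation, "-lifecycle", slp_name], check=True)

    if __name__ == "__main__":
        # Usage: slp_toggle.py <slp_name> inactive|active
        # Run once with "inactive"; run again with "active" days or weeks
        # later, once most or all of that SLP's images have passed retention,
        # so the images are expired rather than duplicated.
        slp, state = sys.argv[1], sys.argv[2]
        set_slp_processing(slp, active=(state == "active"))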