NetBackup™ Backup Planning and Performance Tuning Guide
Leveraging requirements and best practices
Design plans include many new features.
When the data gathering phase is complete, the next step is to use that data to calculate the capacity, I/O, and compute requirements. These calculations yield three key numbers: back-end terabytes (BETB), IOPS, and compute (memory and CPU) resources. Veritas recommends that customers engage the Veritas Presales Team to assist with these calculations and to size the solution. It is then important to apply best practices for sizing and performance, and to ensure that the solution has some flexibility and headroom.
Because of how MSDP works, its memory requirements are driven by the cache, spoold, and spad. The guideline is 1GB of memory per 1TB of MSDP storage. For a 500TB MSDP pool, the recommendation is a minimum of 500GB of memory. Features such as Accelerator can also be memory intensive, so account for that additional demand when sizing memory.
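The 1GB-per-1TB guideline can be expressed as a short calculation. The following is a minimal sketch only; the function name and the `gb_per_tb` parameter are illustrative assumptions and are not part of NetBackup.

```python
def min_msdp_memory_gb(pool_size_tb: float, gb_per_tb: float = 1.0) -> float:
    """Return the guideline minimum memory (GB) for an MSDP pool.

    Applies the 1GB of memory per 1TB of MSDP guideline. The result is a
    floor, not a full sizing: memory-intensive features such as
    Accelerator are not accounted for here.
    """
    if pool_size_tb < 0:
        raise ValueError("pool size must be non-negative")
    return pool_size_tb * gb_per_tb


# Example from the text: a 500TB pool needs at least 500GB of memory.
print(min_msdp_memory_gb(500))  # 500.0
```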
For the workloads that have very high job numbers, it is recommended that smaller disk drives be leveraged to increase IOPS performance. Sometimes 4TB drives are a better fit than 8TB drives. Consider this suggestion as a factor along with the workload type, data characteristics, retention, and secondary operations.
Where MSDP storage servers are virtual, whether through VMware, Docker, or in the cloud, it is important not to share physical LUNs between instances. Significant performance issues have been observed in MSDP storage servers that are deployed in AWS, Azure, VMware, and Docker when the physical LUNs are shared between instances.
Often, customers mistakenly believe that setting a high number of data streams on an MSDP pool increases backup performance. However, the goal is to set a stream count that satisfies the workload's needs without creating a bottleneck from too many concurrent streams competing for resources. For example, a single MSDP storage server with a 500TB pool protecting Oracle workloads exclusively at 60K jobs per day was configured with a maximum concurrent stream count of 275. Initially, this count was set to 200 and then gradually increased to 275.
One way to determine whether the stream count is too low is to measure how long a single job waits in the queue during the busiest times of the day. If many jobs wait in the queue for lengthy periods, the stream count may be too low.
That said, it is important to gather performance data like SAR from the storage server to see how compute and I/O resources are used. If those resources are heavily used at the current state of a specific stream count, and yet there are still large numbers of jobs waiting in the queue for a lengthy period of time, then additional MSDP storage servers may be required to meet a customer's window for backups and secondary operations.
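The reasoning in the two paragraphs above can be sketched as a simple decision rule. This is an illustration under stated assumptions: the `Observation` fields, the `suggest_action` function, and both threshold values are hypothetical and are not NetBackup settings or Veritas recommendations. The utilization figures are assumed to be derived by hand from sar output (for example, `sar -u` for CPU and `sar -d` for disk).

```python
from dataclasses import dataclass


@dataclass
class Observation:
    long_wait_fraction: float  # share of busy-hour jobs queued too long (0.0-1.0)
    cpu_util: float            # CPU utilization on the storage server (0.0-1.0)
    io_util: float             # disk utilization on the storage server (0.0-1.0)


def suggest_action(obs: Observation,
                   wait_limit: float = 0.2,
                   busy_limit: float = 0.85) -> str:
    """Map queue-wait and resource data to a next step (thresholds are assumed)."""
    if obs.long_wait_fraction <= wait_limit:
        # Few jobs wait for long, so the current stream count looks adequate.
        return "keep stream count"
    if max(obs.cpu_util, obs.io_util) >= busy_limit:
        # Jobs queue up even though the server is already heavily used.
        return "consider additional MSDP storage servers"
    # Jobs queue up while the server still has spare capacity.
    return "consider raising stream count gradually"


print(suggest_action(Observation(0.4, 0.5, 0.6)))
# consider raising stream count gradually
```

The thresholds should be tuned to the customer's own backup window; the sketch only captures the order in which the two signals are evaluated.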
When it comes to secondary operations, the goal should be to process all SLP backlog within the same 24 hours it was placed in queue. As an example, if there are 40K backup images per day that must be replicated and duplicated, the goal is to process those images consistently within a 24-hour period to prevent a significant SLP backlog.
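To make the 24-hour goal concrete, the sketch below computes the sustained processing rate needed for a given daily image count, and shows how a backlog accumulates when the processing rate falls short. The function names and the 38,000-per-day processing figure are illustrative assumptions, not measured values.

```python
def required_rate_per_hour(images_per_day: float, window_hours: float = 24.0) -> float:
    """Images per hour that must be processed to clear each day's queue in the window."""
    return images_per_day / window_hours


def backlog_after(days: int, images_per_day: int, processed_per_day: int, start: int = 0) -> int:
    """Backlog (in images) after a number of days at constant arrival and processing rates."""
    backlog = start
    for _ in range(days):
        # A backlog cannot drop below zero, even when processing outpaces arrivals.
        backlog = max(0, backlog + images_per_day - processed_per_day)
    return backlog


# Example from the text: 40K images per day within a 24-hour period.
print(round(required_rate_per_hour(40_000), 1))  # 1666.7

# Hypothetical shortfall: processing only 38K of 40K images per day for a week.
print(backlog_after(7, 40_000, 38_000))  # 14000
```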
Customers often make the mistake of oversubscribing Maximum Concurrent Jobs in their storage units (STUs), setting it to a number larger than the Maximum Concurrent Streams on the MSDP pool. This approach is not a correct way to use STUs. Customers may also create multiple STUs that reference the same MSDP storage server, each with a stream count that is no higher than the Maximum Concurrent Streams on the MSDP pool, but that add up to a higher number when all STUs that reference that storage server are combined. This approach is also an improper use of STUs.
Across all concurrently active STUs that reference a single, specific MSDP storage server, the sum of the Maximum Concurrent Jobs settings must be less than or equal to the Maximum Concurrent Streams on the MSDP pool. STUs are used to throttle workloads that reference a single storage resource. For example, if Maximum Concurrent Streams for an MSDP pool is set to 200 and two storage units each have Maximum Concurrent Jobs set to 150, the maximum number of jobs that can be processed at any given time is still 200, even though the sum of the two STUs is 300. This type of configuration isn't recommended. Furthermore, it is important to understand why more than one STU should be created to reference the same MSDP pool. A clean, concise NetBackup configuration is easier to manage and highly recommended. It is rare that a customer must have more than one STU referencing the same MSDP storage server and associated pool.
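The rule above amounts to a simple sum check. The following sketch assumes the STU settings have been collected by hand into a dictionary; the function name and the STU names are hypothetical, and the code does not query NetBackup.

```python
def check_stu_totals(pool_max_streams: int, stu_max_jobs: dict) -> dict:
    """Compare the summed STU Maximum Concurrent Jobs with the pool's stream limit."""
    total = sum(stu_max_jobs.values())
    return {
        "total": total,
        # The pool limit caps concurrency regardless of the STU settings.
        "effective": min(total, pool_max_streams),
        "oversubscribed": total > pool_max_streams,
    }


# Example from the text: pool limit 200, two STUs at 150 each.
result = check_stu_totals(200, {"stu_a": 150, "stu_b": 150})
print(result["total"], result["effective"], result["oversubscribed"])
# 300 200 True
```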
Another consideration is that SLPs need one or more streams to process secondary operations. Duplications and replications may not always have the luxury of running during a window when no backups are running. Therefore, it is recommended that the sum of the Maximum Concurrent Jobs on all STUs referencing a specific MSDP storage server be 7-10% less than the Maximum Concurrent Streams on the MSDP pool to accommodate secondary operations while backup jobs are running. For example, the Maximum Concurrent Streams on the MSDP pool is set to 275 while the sum of all Maximum Concurrent Jobs set on the STUs that reference that MSDP storage server is 250. This configuration leaves up to 25 streams for other activities such as restores, replications, and duplications while backup jobs are also running.
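The 7-10% reserve can be worked out as follows. This is a minimal sketch; the function names are illustrative assumptions, and integer arithmetic is used so that results round down to whole streams.

```python
def stu_total_range(pool_max_streams: int,
                    min_reserve_pct: int = 7,
                    max_reserve_pct: int = 10) -> tuple:
    """Return (low, high) bounds for the summed STU Maximum Concurrent Jobs."""
    low = pool_max_streams * (100 - max_reserve_pct) // 100
    high = pool_max_streams * (100 - min_reserve_pct) // 100
    return low, high


def reserved_streams(pool_max_streams: int, stu_total: int) -> int:
    """Streams left over for restores, replications, and duplications."""
    return pool_max_streams - stu_total


# Example from the text: pool limit 275, STU total 250.
print(stu_total_range(275))        # (247, 255)
print(reserved_streams(275, 250))  # 25
```

The STU total of 250 from the example falls inside the computed range of 247 to 255.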
Although it is tempting to minimize the number of MSDP storage servers and size pools to the maximum of 960TB, there are some performance implications worth considering. It has been observed that heavy mixed workloads sent to a single 960TB MSDP pool don't perform as well as the same workloads split across two 480TB MSDP pools, with each workload consistently backed up to the same pool. For example, consider two workload types, VMware and Oracle, that both happen to be very large. Sending both workloads to a single large pool can affect performance, especially because VMware and Oracle are both resource-intensive and generate high job counts. In this scenario, creating one 480TB MSDP pool as the target for VMware workloads and another 480TB MSDP pool for Oracle workloads can often deliver better performance.
Some customers incorrectly believe that alternating MSDP pools as the target for the same data is a good idea. It isn't. In fact, this approach decreases deduplication efficiency. Veritas does not recommend that a customer send the same client data to two different pools. Veritas also does not recommend that a customer send the same workloads to two different pools. Doing so negatively affects solution performance and capacity.
The only exceptions would be in the case that the target MSDP pool isn't available due to maintenance, and the backup jobs can't wait until it is available, or perhaps the MSDP pool is tight on space and juggling workloads temporarily is necessary while additional storage resources are added.
Many customers believe that minimizing the number of MSDP pools while maximizing the number of fingerprint media servers (FPMS) can increase performance significantly. In the past, there was some evidence that using FPMS to offload some of the compute activity from the storage server would increase performance. While there are some scenarios where FPMS might still be helpful, those scenarios are less frequent. In fact, often the opposite is true. There has been repeated evidence that large numbers of FPMSs leveraging a small number of storage servers can waste resources, increase complexity, and degrade performance by overwhelming the storage server. We have consistently seen that using more storage servers with MSDP pools in the range of 500TB tends to perform better than a handful of FPMSs directing workloads to a single MSDP storage server. Therefore, it is recommended that the use of FPMS be deliberate and conservative, if they are indeed required.
The larger the pool, the larger the MSDP cache, and the longer it takes to run an MSDP check when the need arises. The fewer the pools, the greater the effect that taking a single pool offline for maintenance has on the overall capability of the solution. Therefore, designing for more pools of a smaller size, instead of a minimum number of larger pools, can provide flexibility in your solution design and increase performance.
For virtual platforms such as Flex, there is value to creating MSDP pools and associated storage server instances that act as a target for a specific workload type. With multiple MSDP pools that do not share physical LUNs, the end result produces less I/O contention while minimizing the physical footprint.
Customers who run their environments very close to full capacity tend to put themselves in a difficult position when a single MSDP pool becomes unavailable for any reason. When designing a solution that involves defining the size and number of MSDP pools, it is important to minimize single points of failure (SPOF), whether due to capacity, maintenance, or component failure. Furthermore, where there are many secondary activities such as duplications or replications, some additional capacity headroom is important because certain types of maintenance activity might lead to a short-term SLP backlog. A guideline of 25% headroom in each MSDP pool is recommended for these purposes, whether to absorb an SLP backlog or to juggle workloads temporarily when a pool is unavailable or short on space.
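The 25% headroom guideline translates into a planned-usage ceiling for each pool. The sketch below is illustrative only; the function names are assumptions, and the 480TB pool size is borrowed from the earlier VMware and Oracle example.

```python
def max_planned_usage_tb(pool_size_tb: float, headroom: float = 0.25) -> float:
    """Capacity (TB) to plan against after reserving the headroom fraction."""
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in the range [0, 1)")
    return pool_size_tb * (1.0 - headroom)


def has_headroom(used_tb: float, pool_size_tb: float, headroom: float = 0.25) -> bool:
    """True if current usage still leaves the guideline headroom free."""
    return used_tb <= max_planned_usage_tb(pool_size_tb, headroom)


# A 480TB pool with 25% headroom should be planned for at most 360TB of data.
print(max_planned_usage_tb(480))  # 360.0
print(has_headroom(400, 480))     # False
```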