For backups, delays at the beginning and end of the job increase in length, and may eventually result in status 1, 2, 6, and other failures. This typically affects Oracle archive log and SQL-Server transaction log backups but may affect any other backup policy that runs frequently and generates a large number of image fragments. The distinguishing characteristic is that one (or a few) clients in one (or a few) policies are affected but other policies and clients backing up to the same deduplication storage pool are not affected.
The first indication is typically when the dbclient process begins to suffer timeouts waiting for progress indications from the media server. Initially this may coincide with deduplication processing, or when the storage server is under maximum backup/duplication loads, or when other CPU or disk I/O processing coincides with the backups that fail.
For duplication and replication, performance degradation will occur over time for the same backup images noted above. However, in the absence of a client process to suffer timeouts, the slower duplication times may go unnoticed until a backlog of unduplicated images has accumulated.
Before applying the solutions below, confirm the problem is present or eminent. All of the following must be true.
- Do the Job Details confirm that there are delays of 5+ minutes between the bptm 'start' and when it is ready to 'begin writing'?
6/11/2013 12:56:01 PM - Info bptm(pid=5829) start
6/11/2013 01:27:43 PM - begin writing
- Does the bptm or plugin debug log confirm there are significant delays (5+ minutes) processing the Path Objects (POs) for a single image by a single PID?
The <pid> and total PO count (<xxx>) will be the same for a single image.
$ grep 'PO.s remaining' <bptm_log_filename> | sort +1 -2
12:56:01.273 [<pid>] ... 2 out of <xxx> PO's remaining
01:27:43.083 [<pid>] ... <xxx> out of <xxx> PO's remaining
- Are significant numbers of POs present for the client/policy pair, or perhaps for other pairs that that are being backed up, duplicated, or replicated at the same time?
The POs can be counted by tallying the number of files in each of these directories on the storage server.
Client/policy pairs with 5,000 to 20,000 POs may be impacted if other concurrent processing is significant. If other processing is not significant, much larger PO counts (e.g. 80,000) may be present but not impacting performance in any significant way.
In extreme cases, the dbclient debug log shows additional symptoms. These symptoms can have other causes so ensure that the ones noted above are also present.
- The database agent may timeout waiting for the "Beginning backup ..." message in the comm file because of the PO processing delays on the media server.
12:56:00.080 [12860.3116] <4> handshake: INF - CONTINUE BACKUP message received
12:56:00.080 [12860.3116] <4> handshake: INF - Beginning backup timeout value = 900
12:56:01.094 [12860.3116] <4> readCommMessages: Entering readCommMessages
13:11:14.731 [12860.3116] <16> readCommFile: ERR - timed out after 900 seconds while reading from ...\Logs\user_ops\dbext\logs\3116.0.1370948152
13:26:28.368 [12860.3116] <16> CreateNewImage: ERR - serverResponse() failed
13:26:28.368 [12860.3116] <4> closeApi: INF - EXIT STATUS 6: the backup failed to back up the requested files
- The Application backup job completes with status 0 on the media server, but more than 15 minutes after the data transfer completed. Meanwhile, the database agent timed out after a hard coded 900 seconds waiting for server status.
00:38:50.655  <4> closeApi: INF - EXIT STATUS 0: the requested operation was successfully completed
00:38:50.722  <4> closeApi: serverReadTimeout is <900> seconds
00:38:50.722  <4> serverExitStatus: entering serverExitStatus.
00:38:50.722  <4> readCommMessages: Entering readCommMessages
00:53:51.796  <16> readCommFile: ERR - timed out after 900 seconds while reading from …/logs/user_ops/dbext/logs/26402.0.1361453415
00:53:51.796  <4> serverExitStatus: read from comm file:<>
00:53:51.796  <4> closeApi: INF - server EXIT STATUS = -1: invalid error number
Each image fragment in a backup generates 4 to 6 path objects (POs) in the catalog directory on the storage server for the client/policy pair that performed the backup. During backup/duplication/replication operations, all POs for the same client & policy are retrieved from the catalog manager (spad for MSDP, SPA metabase for PureDisk) to be examined. Over time, if backup images accumulate without expiring, the number of POs increases and processing them begins to consume significant amounts of CPU and disk I/O. This activity stresses the resources on the plugin host during backup, duplication, and the replication import processing. Eventually the processing delays are what cause the performance degradation and potentially dbclient timeouts during Oracle and SQL-Server backups.
The number of POs necessary to induce the problem varies by operational load and processing capability of the platform. But numbers above 20,000 for a client/policy pair often impact performance. Having multiple client/policy pairs above 5,000 and performing concurrent operations may have a similar effect.
In general, small, frequent [and therefore numerous] backups, when retained for extended periods, then affect all concurrent operations on the host.
The formal resolution for this issue is included in the following release(s):
For PDDO: NetBackup 220.127.116.11 with PureDisk 6.6.4 (D 18.104.22.168 on 5x20 appliances)
For MSDP: NetBackup 22.214.171.124 or Appliance 126.96.36.199
Please visit this link for download and README information:
NetBackup 7.1 – 7.5 and Appliance 2.0.x – 2.5.x
Can affect backups to deduplication pools.
Can affect duplications from non-deduplication storage to deduplication pool.
Can affect replication (AIR) of deduplicated images between NetBackup domains.
Does not affect optimized duplication between pools in the same NetBackup domain.