Incremental backup to EMC DataDomain via OST plugin fails with status 84.

Article: 100028533
Last Published: 2022-02-26
Product(s): NetBackup & Alta Data Protection

Problem

Incremental backups of very large file systems, or of file systems containing millions of files, may fail with status 84 (and storage server error 2060046), while a full backup of the same filesystem succeeds.

Error Message

On the Media Server, the bptm log will show messages similar to the snippets below:

11:19:46.724 [19298] <2> bp_sts_write_and_include_image: MEMCPY bytesNeeded=1024 bytesStillNeeded=512
11:19:53.044 [19298] <2> bp_sts_write_and_include_image: MEMCPY bytesNeeded=1024 bytesStillNeeded=512
.
.  < 5057 Error message received from DataDomain>
.
11:20:04.150 [19298] <16> 5029640:bptm:19298:DD_Server: [4B62:11B4F50] ddp_read() failed Offset 3950729216, BytesToRead 42496, BytesRead 0 Err: 5057-nfs readext remote failed (nfs: Stale file handle)
11:20:04.150 [19298] <16> 5029640:bptm:19298:DD_Server: /usr/openv/lib/ost-plugins/libstspiDataDomain.so:stspi_read_image STS_EPLUGIN [DDErrNo = 5057 (File handle is stale)]
11:20:04.150 [19298] <16> verify_include_one: sts_read_image failed, cur_pos=3950729216, retval=2060046, read_len=42496, read_len_out=0
11:20:04.151 [19298] <32> do_include_image: accelerator verification failed: backupid=server01_1587614322, offset=3950729216, length=42496, error=2060022, error message:
11:20:04.152 [19298] <32> write_data: image write failed: error 2060022:
11:20:04.397 [19298] <16> 5029640:bptm:19298:DD_Server: [4B62:11B4F50] Boost HA has been disabled for file dd6800_01/server01_1588173008_C1_F2:1588173008:Server01-Backup:4:1::
.
11:20:13.738 [19298] <16> write_data: cannot write image to disk, Invalid argument
11:20:13.738 [19298] <4> write_backup: Calling close_all_ft_pipes
11:20:13.738 [19298] <2> KILL_MM_CHILD: Sending SIGUSR2 (kill) to child 19400 (../bptm.c:18635)
11:20:13.780 [19298] <2> bptm: EXITING with status 84 <----------
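The two key signatures in this log are the "Stale file handle" read error and the final status 84 exit. A minimal sketch for spotting them follows; the sample lines below stand in for a real bptm log (typically found under /usr/openv/netbackup/logs/bptm, but check your installation):

```shell
# Sample lines used in place of a real bptm log file.
log_sample='11:20:04.150 [19298] <16> 5029640:bptm:19298:DD_Server: ddp_read() failed Err: 5057-nfs readext remote failed (nfs: Stale file handle)
11:20:13.780 [19298] <2> bptm: EXITING with status 84'

# Count the lines matching either signature.
matches=$(printf '%s\n' "$log_sample" | grep -cE 'Stale file handle|EXITING with status 84')
echo "$matches"
```

Running the same grep against the real log file quickly confirms whether a status 84 failure matches this article.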

On the DataDomain Server, the ddfs.info log will show messages similar to the snippets below:

nfsproc3_ddcp_open_file_3_svc: ddcp ctx 47: did not close file /data/col1/testdd201-lsu1/test_1358695712_C1_F1:1358695712:Test_Prod1:4:0:: in 10800 seconds

Cause

The issue is related to the time the backup client takes to send data to the DataDomain via the OST plugin on the media server, versus the amount of time the DataDomain will wait for more data from the OST plugin before it cancels the fragment write (because no updates arrived within the OST_ABANDON_TIMEOUT period).

This has been observed with incremental backups (where the client OS takes too long traversing millions of files on a slow filesystem), and also on VMware backups that experience extensive delays on the backup host before writing any data.
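The failure mode described above can be sketched as a simple timeout comparison. The values here are illustrative only; 10800 seconds matches the 3-hour window seen in the ddfs.info message above:

```shell
# Illustrative values only: the Data Domain abandons an open fragment
# when no data arrives for OST_ABANDON_TIMEOUT seconds.
OST_ABANDON_TIMEOUT=10800   # 3 hours, as seen in the ddfs.info log
idle_seconds=11000          # hypothetical gap while the client scans millions of files

if [ "$idle_seconds" -gt "$OST_ABANDON_TIMEOUT" ]; then
  result="fragment abandoned; next write fails with a stale file handle"
else
  result="write continues"
fi
echo "$result"
```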

Solution

Consult EMC documentation or EMC technical support about raising the OST_ABANDON_TIMEOUT value on the Data Domain so that the incremental file scan can complete.

You can check the current OST_ABANDON_TIMEOUT value with the following commands on the DataDomain CLI:

priv set SE
se sysparam show OST_ABANDON_TIMEOUT

To change the setting:

1. Stop access to the device (for example, unmount it or shut down the backup application).
2. Then at the DataDomain CLI:

priv set SE
filesystem disable
se sysparam set OST_ABANDON_TIMEOUT=43200
filesystem enable

Note: The value is in seconds; 43200 seconds is 12 hours, and 10800 seconds (the value seen in the ddfs.info log above) is 3 hours.
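The arithmetic in the note above can be checked quickly; the hour values are just the two examples mentioned:

```shell
# Convert hours to the seconds value expected by OST_ABANDON_TIMEOUT.
hours_to_seconds() { echo $(( $1 * 3600 )); }

new_timeout=$(hours_to_seconds 12)      # the raised value used in the example
old_timeout=$(hours_to_seconds 3)       # the value seen in the ddfs.info log
echo "$new_timeout $old_timeout"
```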

Alternatively, troubleshoot the client filesystem performance to determine why the incremental file scan takes so long.

 

Applies To

This issue has been observed with various versions of NetBackup and various Data Domain appliances.
