DOCUMENTATION: How Symantec NetBackup determines if a tape should be frozen or the status of a tape drive should be changed to down, and how to change this behavior
When a read, write, or position error occurs on tape, it is difficult to know whether the error is caused by the media or by the drive itself, because the only error reported comes from the operating system and says only "I/O ERROR". To prevent bad media or a bad drive from causing all backups in a given timeframe to fail, NetBackup uses past error history to determine whether a media or drive is bad.
Each time an I/O error occurs on a read, write, or position, bptm logs the error into an errors file. Each entry consists of the time of the error, the media ID, the drive index, and the type of error.
Sample entries in this file are:
05/21/06 04:15:17 A00167 4 WRITE_ERROR
05/26/06 12:37:47 A00168 4 READ_ERROR
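The entry layout described above can be read with a few lines of Python. This is only a sketch: it assumes the fields are whitespace-separated and that the date is in MM/DD/YY form, as the samples suggest. The names ErrorEntry and parse_error_line are hypothetical and are not part of NetBackup.

```python
from datetime import datetime
from typing import NamedTuple


class ErrorEntry(NamedTuple):
    """One line of the errors file (hypothetical helper type)."""
    when: datetime
    media_id: str
    drive_index: int
    error_type: str


def parse_error_line(line: str) -> ErrorEntry:
    """Split an entry into time, media ID, drive index, and error type.

    Assumes whitespace-separated fields and an MM/DD/YY date.
    """
    date_s, time_s, media_id, drive_s, error_type = line.split()
    when = datetime.strptime(f"{date_s} {time_s}", "%m/%d/%y %H:%M:%S")
    return ErrorEntry(when, media_id, int(drive_s), error_type)


entry = parse_error_line("05/21/06 04:15:17 A00167 4 WRITE_ERROR")
print(entry.media_id, entry.drive_index, entry.error_type)
```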
Each time an entry is made, past entries in the file are scanned to determine whether the same media ID or drive has had the same type of error in the past "n" hours, where "n" is the TIME_WINDOW. The default time window is 12 hours. NetBackup does not normally freeze a media or down a drive the first time the error is encountered. Two other parameters, MEDIA_ERROR_THRESHOLD and DRIVE_ERROR_THRESHOLD, control how many errors are required; the default value for each is 3.
- If the same media ID gets write errors three times within the time window, on more than one drive, it is assumed that the media is bad and NetBackup freezes the media.
- If different media IDs get the same error three times within the time window on the same drive, it is assumed that the drive is bad and NetBackup places that drive into a "DOWN" state.
- If the same drive gets errors three times within the time window with the same media ID, NetBackup assumes that the media is bad and freezes it.
The TIME_WINDOW, MEDIA_ERROR_THRESHOLD, and DRIVE_ERROR_THRESHOLD values are all configurable. If the MEDIA_ERROR_THRESHOLD or DRIVE_ERROR_THRESHOLD value is set to 0, the freeze or down occurs on the first error. MEDIA_ERROR_THRESHOLD is evaluated first, so if both are set to 0, freezing the media takes precedence over downing the drive. This configuration is not recommended.
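The rules above can be approximated in a short Python sketch. It is an illustration of the described behavior, not bptm's actual code. It assumes that the new error counts toward the threshold, that only entries of the same error type inside the time window are counted, and that the media check runs before the drive check. The names ErrorEntry and decide_action and the return strings are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional


class ErrorEntry(NamedTuple):
    """One logged I/O error (hypothetical helper type)."""
    when: datetime
    media_id: str
    drive_index: int
    error_type: str


def decide_action(history: List[ErrorEntry], new: ErrorEntry,
                  time_window_hours: int = 12,
                  media_threshold: int = 3,
                  drive_threshold: int = 3) -> Optional[str]:
    """Return "FREEZE_MEDIA", "DOWN_DRIVE", or None for a new error."""
    cutoff = new.when - timedelta(hours=time_window_hours)
    # Keep only same-type errors inside the time window, plus the new one.
    recent = [e for e in history
              if cutoff <= e.when <= new.when
              and e.error_type == new.error_type]
    recent.append(new)

    # The media threshold is evaluated first, so it wins when both apply.
    media_count = sum(1 for e in recent if e.media_id == new.media_id)
    if media_count >= media_threshold:
        return "FREEZE_MEDIA"

    # Reaching this point means no single media hit its threshold, so
    # repeated errors on one drive point at the drive instead.
    drive_count = sum(1 for e in recent if e.drive_index == new.drive_index)
    if drive_count >= drive_threshold:
        return "DOWN_DRIVE"
    return None


t0 = datetime(2006, 5, 21, 4, 0, 0)
history = [
    ErrorEntry(t0, "A00167", 4, "WRITE_ERROR"),
    ErrorEntry(t0 + timedelta(hours=1), "A00168", 4, "WRITE_ERROR"),
]
new = ErrorEntry(t0 + timedelta(hours=2), "A00169", 4, "WRITE_ERROR")
print(decide_action(history, new))
```

In this usage example, three different media IDs fail on drive 4 within the window, so the sketch reports the drive rather than any one media. Because the new error is included in the count, a threshold of 0 (or 1) triggers on the first error, which matches the precedence described above when both thresholds are 0.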
If any one or a combination of the above files exists, bptm logs a message indicating which value is used each time it goes through the algorithm. The log messages are:
"using time window of %d hours"
"using media error threshold of %d"
"using drive error threshold of %d"
where %d is the number obtained from the file.
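To confirm which values are in effect, the three messages can be picked out of bptm log text with regular expressions. The sketch below matches only the message wording quoted above and makes no assumption about the rest of the log line (timestamps, process IDs, and so on). The name extract_settings and the sample lines are hypothetical.

```python
import re
from typing import Dict, Iterable

# One pattern per message; the text mirrors the quoted log messages.
PATTERNS = {
    "TIME_WINDOW": re.compile(r"using time window of (\d+) hours"),
    "MEDIA_ERROR_THRESHOLD": re.compile(r"using media error threshold of (\d+)"),
    "DRIVE_ERROR_THRESHOLD": re.compile(r"using drive error threshold of (\d+)"),
}


def extract_settings(log_lines: Iterable[str]) -> Dict[str, int]:
    """Return the last value seen for each setting in the given lines."""
    found: Dict[str, int] = {}
    for line in log_lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                found[name] = int(match.group(1))
    return found


# Hypothetical sample lines; real bptm log lines carry extra prefix text.
sample = [
    "using time window of 24 hours",
    "using media error threshold of 5",
    "unrelated line",
]
print(extract_settings(sample))
```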
In general, the freeze and down behavior is designed to aid in getting backups completed successfully. If read errors occur during a restore attempt, freezing of the media has little effect, as it is still necessary to have that same tape to perform the restore (or another copy if it exists). In the case of a restore, downing a bad drive may help, assuming the problem is with the drive.
To view the error threshold and window settings, run the following nbemmcmd command:
<Install_Path>\Veritas\NetBackup\bin\admincmd>nbemmcmd -listsettings -machinename <machine name>
Several parameters are displayed, including DRIVE_ERROR_THRESHOLD, MEDIA_ERROR_THRESHOLD, and TIME_WINDOW.
To change the error threshold and window settings, run the following nbemmcmd command, specifying one or more of the settings listed beneath it:
<Install_Path>\Veritas\NetBackup\bin\admincmd>nbemmcmd -changesetting -machinename <machine name>
DRIVE_ERROR_THRESHOLD <unsigned integer>
MEDIA_ERROR_THRESHOLD <unsigned integer>
TIME_WINDOW <unsigned integer>