Images written as part of a Storage Lifecycle Policy (SLP) may be prematurely marked as "lifecycle complete"

Article: 100020944
Last Published: 2015-08-25
Product(s): NetBackup

Problem

Images written as part of a Storage Lifecycle Policy (SLP) may be prematurely marked as "lifecycle complete" prior to scheduled duplication due to a problem allocating resources from the operating system. This can lead to data loss when these images expire without ever having been duplicated.

Solution

Due to an issue in nbstserv which can result in requests for more open file descriptors than can be allocated, images written as part of a Storage Lifecycle Policy (SLP) may be erroneously marked "Lifecycle Complete" before they are duplicated.  This can allow the only copy of an image to expire prematurely, resulting in data loss when the expected duplicate copies turn out to be unavailable because they were never created.

This condition has only been reported on Solaris master servers.  Although it may be seen on any supported operating system, the underlying cause is a failure to obtain a file descriptor within a limited range because every value in that range is already in use.  On Solaris that range is 0-255, so the problem is uniquely constrained and far more likely to be encountered there; on other operating systems, the number of open file descriptors could rise well into the thousands before this issue would theoretically be seen.

This condition occurs when nbstserv requests an open file descriptor when none is available due to a file descriptor leak, resulting in an erroneous "defined storage service not found" report.  When this happens prior to an SLP duplication, the original image is marked as complete even though duplication did not take place.  The image is then reassigned the retention level configured in the SLP backup setting - often a much shorter interval than the one assigned to the duplicate copy - and expires as scheduled based upon that retention level.

What is Affected:
This condition is rare, but any operating system running the following NetBackup versions and utilizing Storage Lifecycle Policies is potentially affected:
  • NetBackup 6.5 and above
Note: This issue has only been reported on Solaris systems.  It is expected that this issue would be extremely rare on other operating systems or architectures, if it could occur at all.

The following factors make this issue more likely to occur:
  • Solaris master server
  • Many Storage Lifecycle Policies configured and running
  • An nbstserv process that has been running for a very long time
  • Frequent activation and deactivation of SLPs, which can increase the number of open file descriptors in use
  • Displaying many SLPs in a short time using the GUI, which may also exacerbate the issue

How to Determine if Affected:
Examine the nbstserv log (OID 226) for log entries containing the phrase "defined storage service not found":

# cd /usr/openv/logs
# grep -i defined 51216-226* | grep "storage service not found"

If results are returned, the log should be scrutinized further, especially if a "reject image" message is seen.
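To focus on images that were actually rejected, both phrases can be combined into a single pass over the log files.  The sketch below runs against a synthetic one-line log file (modeled on the excerpt shown later in this article) rather than a live /usr/openv/logs directory, so the directory and file it creates are stand-ins for illustration:

```shell
# Sketch: count "reject image ... defined storage service not found" entries
# per nbstserv (OID 226) log file.  The log line below is a synthetic sample;
# on a master server, the real logs live in /usr/openv/logs.
logdir=$(mktemp -d)
cat > "$logdir/51216-226-sample.log" <<'EOF'
03/18/09 07:16:15.348 226 PID:19187 TID:5 [No context] [ReplicationSession::SelectImagesForProcessing] reject image: app1_1237360465 defined storage service not found, Storage service=SLP-PROD-1-MONTH(ReplicationSession.cpp:1199)
EOF
# Report, per log file, how many rejects mention the error phrase.
for f in "$logdir"/51216-226*; do
  printf '%s: %s\n' "$f" "$(grep -c 'defined storage service not found' "$f")"
done
```

On a live system, point logdir at /usr/openv/logs instead of the temporary directory.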

In the following example, this error is recorded and the image is marked complete:
03/18/09 07:16:15.347 226 PID:19187 TID:5 [No context] [ReplicationSession::SelectImagesForProcessing]        5:  ReplicationSession::SelectImagesForProcessing: ####
03/18/09 07:16:15.347 226 PID:19187 TID:5 [No context] [ReplicationSession::SelectImagesForProcessing]        5:  ReplicationSession::SelectImagesForProcessing: ######## new image: app1_1237360465
03/18/09 07:16:15.347 226 PID:19187 TID:5 [No context] [ReplicationSession::SelectImagesForProcessing]        5:  ReplicationSession::SelectImagesForProcessing: ####
03/18/09 07:16:15.347 226 PID:19187 TID:5 [No context] [ReplicationSession::getStorageServiceRec] +++ ENTERING +++ : obj = 31d9c70 (ReplicationSession.cpp:2032)
03/18/09 07:16:15.348 226 PID:19187 TID:5 [No context] [ReplicationSession::getStorageServiceRec] --- EXITING --- : obj = 31d9c70 (ReplicationSession.cpp:2032)
03/18/09 07:16:15.348 226 PID:19187 TID:5 [No context] [ReplicationSession::SelectImagesForProcessing] reject image: app1_1237360465 defined storage service not found, Storage service=SLP-PROD-1-MONTH(ReplicationSession.cpp:1199)      
03/18/09 07:16:15.348 226 PID:19187 TID:5 [No context] [ImageInProcess::processLifecycleComplete] backupid= app1_1237360465 storage service= SLP-PROD-1-MONTH(ImageInProcess.cpp:841)
03/18/09 07:16:15.348 226 PID:19187 TID:5 [No context] [ImageInProcess::findMaxFixedRetention] +++ ENTERING +++ : obj = 744630 (ImageInProcess.cpp:723)
03/18/09 07:16:15.348 226 PID:19187 TID:5 [No context] [ImageInProcess::findMaxFixedRetention] --- EXITING --- : obj = 744630 (ImageInProcess.cpp:723)
03/18/09 07:16:15.348 226 PID:19187 TID:5 [No context] [ImageInProcess::processLifecycleComplete] set fixed expiration for app1_1237360465 copy number 1, Current Expiration time=2147483647, Current Try_to_keep=1237619665, New Expiration time = 1237619665(ImageInProcess.cpp:892)
03/18/09 07:16:15.489 226 PID:19187 TID:5 [No context] [iipCopy::setExpireTime] +++ ENTERING +++ : obj = b4dbfe0 (ImageInProcess.cpp:216)
03/18/09 07:16:15.489 137 PID:19187 TID:5 [No context] [emmlib_UpdateImageCopyExpireTime] (1) Updating Image copy records from EMM database
               MasterServerName = < master1 >
               BackupID = < app1_1237360465 >
               CopyNumber = < 1 >
               ExpireTime = < 1237619665 >
03/18/09 07:16:15.514 137 PID:19187 TID:5 [No context] [emmlib_UpdateImageCopyExpireTime] (1) updateImageCopy successful, emmError = 0, nbError = 0
03/18/09 07:16:15.514 226 PID:19187 TID:5 [No context] [iipCopy::setExpireTime] --- EXITING --- : obj = b4dbfe0 (ImageInProcess.cpp:216)
...
03/18/09 07:16:15.606 226 PID:19187 TID:5 [No context] [catalogImgRec::retrieveCatalogImgRecord] +++ ENTERING +++ : obj = 744634 (ImageInProcess.cpp:2264)
03/18/09 07:16:15.607 226 PID:19187 TID:5 [No context] [catalogImgRec::retrieveCatalogImgRecord]        5:  catalogImgRec::retrieveCatalogImgRecord: pfi type = false
03/18/09 07:16:15.607 226 PID:19187 TID:5 [No context] [catalogImgRec::retrieveCatalogImgRecord]        5:  catalogImgRec::retrieveCatalogImgRecord: returning catalog record for image: app1_1237360465, primary copy: 1, number of copies: 1, PFI type: false
03/18/09 07:16:15.608 226 PID:19187 TID:5 [No context] [catalogImgRec::retrieveCatalogImgRecord] --- EXITING --- : obj = 744634 (ImageInProcess.cpp:2264)
03/18/09 07:16:15.608 226 PID:19187 TID:5 [No context] [catalogImgRec::setComplete] +++ ENTERING +++ : obj = 744634 (ImageInProcess.cpp:2179)
03/18/09 07:16:15.609 226 PID:19187 TID:5 [No context] [catalogImgRec::setComplete] catalog image app1_1237360465 set to completed: 3 status: 0(ImageInProcess.cpp:2211)    
03/18/09 07:16:15.609 226 PID:19187 TID:5 [No context] [catalogImgRec::setComplete] --- EXITING --- : obj = 744634 (ImageInProcess.cpp:2179)
03/18/09 07:16:15.609 226 PID:19187 TID:5 [No context] [catalogImgRec::setComplete] catalog image app1_1237360465 set to completed: 3 status: 0(ImageInProcess.cpp:2211)
03/18/09 07:16:15.609 226 PID:19187 TID:5 [No context] [catalogImgRec::setComplete] --- EXITING --- : obj = 744634 (ImageInProcess.cpp:2179)
03/18/09 07:16:15.609 226 PID:19187 TID:5 [No context] [ImageInProcess::processLifecycleComplete] setting image: app1_1237360465 as lifecycle complete in EMM database(ImageInProcess.cpp:1071)
03/18/09 07:16:15.609 226 PID:19187 TID:5 [No context] [ImageInProcess::setImageState] +++ ENTERING +++ : obj = 744630 (ImageInProcess.cpp:1300)
03/18/09 07:16:15.610 137 PID:19187 TID:5 [No context] [emmlib_UpdateImageStorageServiceState] (1) Updating Image records from EMM database
               MasterServerName = < master1 >
               BackupID = < app1_1237360465 >
               StorageServiceState = < 3 >
03/18/09 07:16:15.632 137 PID:19187 TID:5 [No context] [emmlib_UpdateImageStorageServiceState] (1) updateImage failed, emmError = 2007031, nbError = 0
03/18/09 07:16:15.633 226 PID:19187 TID:5 [No context] [ImageInProcess::setImageState] --- EXITING --- : obj = 744630 (ImageInProcess.cpp:1300)
03/18/09 07:16:15.633 226 PID:19187 TID:5 [No context] [ImageInProcess::processLifecycleComplete] image: app1_1237360465 - failed to set as lifecycle complete in EMM database!(ImageInProcess.cpp:1075)
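A detail worth noting in the excerpt above: the new expiration time is simply the image's Try_to_keep value, which sits exactly 259200 seconds (3 days) after the backup time embedded in the backup ID app1_1237360465.  The arithmetic can be checked directly:

```shell
# From the log excerpt above: the image's expiration drops from
# 2147483647 (effectively never) to the Try_to_keep value, i.e. the
# retention configured on the SLP backup copy (3 days in this example).
backup_time=1237360465   # epoch time embedded in backupid app1_1237360465
new_expire=1237619665    # "New Expiration time" from processLifecycleComplete
echo "retention window: $(( (new_expire - backup_time) / 86400 )) days"
# prints: retention window: 3 days
```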

From the bpdbm log on the master:
07:14:45.916 [506] <4> emmlib_UpdateImage: (0)
               MasterServerName = < master1 > < TRUE >
               BackupID = < app1_1237360465 > < TRUE >
               ClientName = < NULL > < FALSE >
               BackupTime = < 0 > < FALSE >
               PolicyName = < NULL > < FALSE >
               ClientType = < 0 > < FALSE >
               ScheduleType = < 0 > < FALSE >
               StorageServiceName = < SLP-PROD-1-MONTH > < TRUE >
               StorageServiceState = < 1 > < TRUE >
               TimeInProcess = < 0 > < FALSE >
               ClassificationID = < NULL > < FALSE >
07:14:45.939 [506] <4> emmlib_UpdateImage: (0) updateImage successful, emmError = 0, nbError = 0
07:14:45.939 [506] <4> db_EMMSetStorageState: emmlib_UpdateImage successfully updated image(0)
07:14:45.940 [506] <4> DeleteConnectionID: (-) Cleared Connection ID 0
07:14:45.941 [506] <4> uninitialize: (0) Destroying the orb
07:14:45.946 [506] <4> emmlib_uninitialize: (0) Successfully released EMM session and database objects
07:14:45.946 [506] <2> bpdbm: request complete: exit status 0  

bpdbm transaction once nbstserv sends request to update:
07:16:15.608 [921] <2> get_image_by_bid: likely found backupid in /usr/openv/netbackup/db/images/app1/1237000000/UNIX-PROD_1237360465_INCR
07:16:15.609 [921] <2> get_image_by_bid: found app1_1237360465
07:16:15.609 [921] <2> write_imagerec: /usr/openv/netbackup/db/images/app1/1237000000/UNIX-PROD_1237360465_INCR
07:16:15.609 [921] <2> write_imagerec: fragnum: 1, expir: 1237619665, vers: 8 copy: 1
07:16:15.609 [921] <2> write_imagerec: NULL prev_backup_image
07:16:15.609 [921] <2> read_next_image: calling db_IMAGEreceive() for next image
07:16:15.634 [919] <2> image_by_backupid: trying app1 for app1_1237360463
07:16:15.635 [919] <2> image_by_backupid: likely found backupid in /usr/openv/netbackup/db/images/app1/1237000000/UNIX-PROD_1237360463_INCR
07:16:15.635 [919] <2> image_by_backupid: found app1_1237360463
07:16:15.635 [919] <2> image_by_backupid: calling db_IMAGEreceive() for next image
07:16:15.637 [921] <2> update_and_chgexp: Q_IMAGE_UPDATE

Once that message is recorded, examine the corresponding log_<ctime> file located in /usr/openv/netbackup/db/error.  In the following example, log_1239826000 is examined:
1239826258 1 33412 4 slpmedia 756 753 0 slpmedia bptm begin writing backup id slpmedia_1239826251, copy 1, fragment 1, destination path /opt/fb1
1239826258 1 4 4 slpmedia 756 753 0 slpmedia bptm successfully wrote backup id slpmedia_1239826251, copy 1, fragment 1, 800 Kbytes at 9638.554 Kbytes/sec  
1239826361 1 2 4 slpmedia 0 0 0 *NULL* bpdbm successfully modified backupid slpmedia_1239826251

In the above excerpt, it can be seen that the backup image in question was modified only 103 seconds after it was written, when it was marked "SLP complete."
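Because the first field of each error-log record is an epoch timestamp, the write-to-modify interval can be computed directly.  A minimal sketch follows; here the two relevant lines from the excerpt above are fed in via a heredoc instead of a real log_<ctime> file:

```shell
# Measure the gap between "successfully wrote" and "successfully modified"
# for a backup image, using the epoch time in field 1 of each record.
# Real records live in the log_<ctime> files under /usr/openv/netbackup/db/error.
errlog=$(mktemp)
cat > "$errlog" <<'EOF'
1239826258 1 4 4 slpmedia 756 753 0 slpmedia bptm successfully wrote backup id slpmedia_1239826251, copy 1, fragment 1, 800 Kbytes at 9638.554 Kbytes/sec
1239826361 1 2 4 slpmedia 0 0 0 *NULL* bpdbm successfully modified backupid slpmedia_1239826251
EOF
wrote=$(awk '/successfully wrote/ {print $1}' "$errlog")
modified=$(awk '/successfully modified/ {print $1}' "$errlog")
echo "$((modified - wrote)) seconds between write and modification"
# prints: 103 seconds between write and modification
```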

Detection in the nbstserv log file can be hampered if nbstserv has stopped logging.  This can occur at the normal midnight log rollover: if nbstserv attempts to open a file descriptor for the new log file but cannot because of this issue, logging stops.

On Solaris systems, pfiles can be used to monitor file descriptors in use.  Two examples of pfiles output follow.

This is an example of output from a system where this issue may be seen:
# pgrep nbstserv
1102
# pfiles 1102
1102:   /usr/openv/netbackup/bin/nbstserv
 Current rlimit: 8192 file descriptors
  0: S_IFCHR mode:0666 dev:351,0 ino:6815752 uid:0 gid:3 rdev:13,2
     O_RDONLY
     /devices/pseudo/mm@0:null
  1: S_IFCHR mode:0666 dev:351,0 ino:6815752 uid:0 gid:3 rdev:13,2
     O_WRONLY|O_CREAT|O_TRUNC
     /devices/pseudo/mm@0:null
  2: S_IFCHR mode:0666 dev:351,0 ino:6815752 uid:0 gid:3 rdev:13,2
     O_WRONLY|O_CREAT|O_TRUNC
     /devices/pseudo/mm@0:null
  3: S_IFREG mode:0664 dev:32,14 ino:2067521 uid:0 gid:0 size:30197009
     O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
     /opt/openv653/logs/51216-226-2216927724-090428-0000000000.log
  4: S_IFDOOR mode:0444 dev:360,0 ino:57 uid:0 gid:0 size:0
     O_RDONLY|O_LARGEFILE FD_CLOEXEC  door to nscd[180]
     /var/run/name_service_door
  5: S_IFREG mode:0664 dev:32,14 ino:2067521 uid:0 gid:0 size:30197009
     O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
     /opt/openv653/logs/51216-226-2216927724-090428-0000000000.log
.
.   [uninterrupted sequence of file descriptors from 6 to 248]
.
249: S_IFSOCK mode:0666 dev:357,0 ino:1883 uid:0 gid:0 size:0
     O_RDWR
       SOCK_STREAM
       SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
       sockname: AF_INET <IP Address>  port: 35837
       peername: AF_INET <IP Address>  port: 13724
250: S_IFSOCK mode:0666 dev:357,0 ino:1882 uid:0 gid:0 size:0
     O_RDWR
       SOCK_STREAM
       SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
       sockname: AF_INET <IP Address>  port: 35843
       peername: AF_INET <IP Address>  port: 13724
251: S_IFSOCK mode:0666 dev:357,0 ino:1881 uid:0 gid:0 size:0
     O_RDWR
       SOCK_STREAM
       SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
       sockname: AF_INET <IP Address>  port: 35849
       peername: AF_INET <IP Address>  port: 13724
252: S_IFSOCK mode:0666 dev:357,0 ino:1880 uid:0 gid:0 size:0
     O_RDWR
       SOCK_STREAM
       SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
       sockname: AF_INET <IP Address>  port: 35855
       peername: AF_INET <IP Address>  port: 13724
253: S_IFSOCK mode:0666 dev:357,0 ino:1879 uid:0 gid:0 size:0
     O_RDWR
       SOCK_STREAM
       SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
       sockname: AF_INET <IP Address>  port: 35861
       peername: AF_INET <IP Address>  port: 13724
254: S_IFSOCK mode:0666 dev:357,0 ino:1878 uid:0 gid:0 size:0
     O_RDWR
       SOCK_STREAM
       SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
       sockname: AF_INET <IP Address>  port: 35867
       peername: AF_INET <IP Address>  port: 13724
255: S_IFSOCK mode:0666 dev:357,0 ino:1877 uid:0 gid:0 size:0
     O_RDWR
       SOCK_STREAM
       SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
       sockname: AF_INET <IP Address> port: 35873
       peername: AF_INET <IP Address> port: 13724
256: S_IFIFO mode:0000 dev:359,0 ino:632 uid:0 gid:0 size:0
     O_RDWR|O_NONBLOCK FD_CLOEXEC
257: S_IFIFO mode:0000 dev:359,0 ino:632 uid:0 gid:0 size:0
     O_RDWR FD_CLOEXEC
.
.     [uninterrupted sequence of file descriptors from 258 to 622]
.
623: S_IFSOCK mode:0666 dev:357,0 ino:42470 uid:0 gid:0 size:0
     O_RDWR
       SOCK_STREAM
       SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
       sockname: AF_INET <IP Address>  port: 39392
       peername: AF_INET <IP Address>  port: 13724

Note:  In the example above, all file descriptors from 0 to 255 have been exhausted.  In general, it is abnormal for nbstserv to consume this many file descriptors.

This is example output from a system running normally:
Current rlimit: 8192 file descriptors
  0: S_IFIFO mode:0000 dev:358,0 ino:433506 uid:0 gid:0 size:723
     O_RDWR
  1: S_IFCHR mode:0620 dev:350,0 ino:12582954 uid:0 gid:7 rdev:24,19
     O_RDWR|O_LARGEFILE
     /devices/pseudo/pts@0:19
  2: S_IFCHR mode:0620 dev:350,0 ino:12582954 uid:0 gid:7 rdev:24,19
     O_RDWR|O_LARGEFILE
     /devices/pseudo/pts@0:19
  3: S_IFDOOR mode:0444 dev:359,0 ino:58 uid:0 gid:0 size:0
     O_RDONLY|O_LARGEFILE FD_CLOEXEC  door to nscd[167]
     /var/run/name_service_door
  4: S_IFREG mode:0600 dev:32,8 ino:10419 uid:0 gid:0 size:537
     O_RDWR|O_CREAT
     /etc/vx/vss/VRTSat.conf.lock
  5: S_IFREG mode:0600 dev:32,12 ino:25252 uid:0 gid:1 size:20
     O_RDWR|O_CREAT
     /var/VRTSat/.VRTSat/profile/VRTSatlocal.conf.lock
  6: S_IFREG mode:0600 dev:32,12 ino:24904 uid:0 gid:0 size:0
     O_RDWR
     /var/VRTSat/.VRTSat/profile/certstore/keystore/KeyStore.lock
  7: S_IFREG mode:0664 dev:32,14 ino:2067530 uid:0 gid:0 size:49
     O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
     /opt/openv653/logs/51216-261-2216927724-090426-0000000000.log
256: S_IFIFO mode:0000 dev:358,0 ino:433668 uid:0 gid:0 size:0
     O_RDWR|O_NONBLOCK FD_CLOEXEC
257: S_IFIFO mode:0000 dev:358,0 ino:433668 uid:0 gid:0 size:0
     O_RDWR FD_CLOEXEC
258: S_IFIFO mode:0000 dev:358,0 ino:433669 uid:0 gid:0 size:0
     O_RDWR|O_NONBLOCK FD_CLOEXEC
259: S_IFIFO mode:0000 dev:358,0 ino:433669 uid:0 gid:0 size:0
     O_RDWR FD_CLOEXEC
260: S_IFSOCK mode:0666 dev:356,0 ino:51296 uid:0 gid:0 size:0
     O_RDWR|O_NONBLOCK FD_CLOEXEC
       SOCK_STREAM
       SO_KEEPALIVE,SO_SNDBUF(65536),SO_RCVBUF(57344),IP_NEXTHOP(0.0.224.0)
       sockname: AF_INET <IP Address>  port: 44505
       peername: AF_INET <IP Address> port: 1556

In this example, far fewer file descriptors are in use, and plenty of values in the 0-255 range remain available.
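To reduce pfiles output to a single number, the descriptor entries (lines beginning with "NNN: S_IF...") can be counted and filtered to the critical 0-255 range.  A sketch follows; the three sample lines are a hypothetical stand-in for piping in the output of pfiles on a live system:

```shell
# Count file descriptors numbered below 256 in pfiles output.
# pfiles prints each descriptor as "  NNN: S_IF..." at the start of a line;
# awk's numeric conversion of "NNN:" yields NNN.
count_low_fds() {
  awk '/^ *[0-9]+: S_IF/ { if ($1 + 0 < 256) low++ } END { print low + 0 }'
}
# Hypothetical three-descriptor sample; fd 300 must not be counted:
printf '%s\n' \
  '  0: S_IFCHR mode:0666 dev:351,0 ino:6815752 uid:0 gid:3 rdev:13,2' \
  ' 255: S_IFSOCK mode:0666 dev:357,0 ino:1877 uid:0 gid:0 size:0' \
  ' 300: S_IFSOCK mode:0666 dev:357,0 ino:1900 uid:0 gid:0 size:0' \
  | count_low_fds
# prints: 2
```

On a live Solaris master, the equivalent live check would be:  pfiles $(pgrep nbstserv) | count_low_fds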

Formal Resolution:
A partial resolution to this issue is included in the following release:
  • NetBackup 6.5 Release Update 4 (6.5.4), linked in the Related Articles section
In this release, the conditions leading to the file descriptor leak are addressed.  This should dramatically reduce the already rare potential for this issue to be seen.

The formal resolution to this issue (Etrack 1588246) is scheduled to be included in the following release:
  • NetBackup 6.5 Release Update 5 (6.5.5), scheduled for release in the fourth quarter of 2009.
When NetBackup 6.5.5 is released, please visit the following link for download and readme information:
  https://www.veritas.com/support/en_US/15143.html


Workaround:
Stop and restart nbstserv before the number of open file descriptors grows too large.  pfiles may be used to monitor file descriptor use by nbstserv.
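A minimal sketch of such monitoring follows.  The threshold value is an arbitrary illustrative choice, not a documented limit, and the commented live invocation using /proc/<pid>/fd (present on Solaris as well as Linux) is an assumption for illustration:

```shell
# Warn when nbstserv's open descriptor count approaches the 0-255 range,
# so a restart can be scheduled before the leak causes failures.
THRESHOLD=200   # illustrative value only, not a documented limit

fd_alert() {    # usage: fd_alert <open-descriptor-count>
  if [ "$1" -gt "$THRESHOLD" ]; then
    echo "restart recommended: $1 descriptors in use"
  else
    echo "ok: $1 descriptors in use"
  fi
}

# On a live master server, feed in the current count, for example:
#   fd_alert "$(ls /proc/$(pgrep nbstserv)/fd | wc -l)"
fd_alert 240    # prints: restart recommended: 240 descriptors in use
```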

If this workaround is unfeasible, please contact Veritas technical support, referencing this document ID and Etrack 1588155 to obtain a fix for this issue.