Problem
On a NetBackup Master Server at version 8.1, the NetBackup Policy Execution Manager (NBPEM) can periodically hang. In the rare circumstances where this condition is reached, it can prevent any new scheduled or immediate backup tasks from starting.
Error Message
Example: An extract from a GNU Debugger (GDB) stack trace, showing threads that have encountered a deadlock condition. Each thread shown is blocked in pthread_mutex_lock() while acquiring an ACE_Recursive_Thread_Mutex.
Thread 13 (Thread 0x7f7310c96700 (LWP 2779)):
#0 0x00007f731c0a24cd in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f731c09dde6 in _L_lock_870 () from /lib64/libpthread.so.0
#2 0x00007f731c09dcdf in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f731f8ee9e6 in v6::ACE_Recursive_Thread_Mutex::acquire() () from /usr/openv/lib/libvxACE.so.6
#4 0x00000000006fe731 in Symantec::NetBackup::PEM::PemPolicyCache::getPolicy(Symantec::NetBackup::PEM::AutoDereference<Symantec::NetBackup::PEM::PemNameCaseSensitive>&, Symantec::NetBackup::PEM::PemTask&, void*, Symantec::NetBackup::PEM::PolicyReloadOptions, bool, unsigned long) ()
#5 0x00000000004e2f6e in Symantec::NetBackup::PEM::BaseJob::run(void*, Symantec::NetBackup::PEM::PemEvent*) ()
#6 0x000000000052f5bf in Symantec::NetBackup::PEM::CompoundJob::run(void*, Symantec::NetBackup::PEM::PemEvent*) ()
#7 0x00000000004c1acc in Symantec::NetBackup::PEM::ApplicationCompoundJob::run(void*, Symantec::NetBackup::PEM::PemEvent*) ()
#8 0x000000000073b62a in Symantec::NetBackup::PEM::PemTask::invokeRun(void*, Symantec::NetBackup::PEM::PemEvent*) ()
#9 0x00000000006c9f87 in Symantec::NetBackup::PEM::PemHandlerEntry::invokeRunNoCancelSafety(Symantec::NetBackup::PEM::PemEvent*) ()
#10 0x0000000000737662 in Symantec::NetBackup::PEM::PemTask::invokeRunWithCancelSafety(Symantec::NetBackup::PEM::PemHandlerEntry*, Symantec::NetBackup::PEM::PemEvent*, bool) ()
#11 0x00000000006cbbd3 in Symantec::NetBackup::PEM::PemHandlerEntry::invokeRunWithCancelSafety(Symantec::NetBackup::PEM::PemEvent*, bool) ()
#12 0x00000000007bd0b5 in Symantec::NetBackup::PEM::QueueHandler::run(v6::ACE_Guard<Symantec::NetBackup::PEM::PemLockRecursive>&) ()
#13 0x00000000007bdefd in Symantec::NetBackup::PEM::QueueHandler::svc() ()
#14 0x00007f731f94a49b in v6::ACE_Task_Base::svc_run(void*) () from /usr/openv/lib/libvxACE.so.6
#15 0x00007f7324a9ab2c in Symantec::NetBackup::Ncf::ACEThreadHook::start(void* (*)(void*), void*) () from /usr/openv/lib/libncf.so
#16 0x00007f73253b0d46 in ThreadHook::start(void* (*)(void*), void*) () from /usr/openv/lib/libnborb.so
#17 0x00007f731f8f7b67 in v6::ACE_Thread_Adapter::invoke() () from /usr/openv/lib/libvxACE.so.6
#18 0x00007f731c09bdd5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f731bdc5b3d in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x7f7311299700 (LWP 2772)):
#0 0x00007f731c0a24cd in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f731c09dde6 in _L_lock_870 () from /lib64/libpthread.so.0
#2 0x00007f731c09dcdf in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f731f8ee9e6 in v6::ACE_Recursive_Thread_Mutex::acquire() () from /usr/openv/lib/libvxACE.so.6
#4 0x000000000073742c in Symantec::NetBackup::PEM::PemTask::invokeRunWithCancelSafety(Symantec::NetBackup::PEM::PemHandlerEntry*, Symantec::NetBackup::PEM::PemEvent*, bool) ()
#5 0x00000000006cbbd3 in Symantec::NetBackup::PEM::PemHandlerEntry::invokeRunWithCancelSafety(Symantec::NetBackup::PEM::PemEvent*, bool) ()
#6 0x00000000006f5b3e in Symantec::NetBackup::PEM::PemPolicyCache::offLoadThread() ()
#7 0x000000000070a681 in Symantec::NetBackup::PEM::PemOffLoadProcessing::handle_exception(int) ()
#8 0x00007f731f91173a in v6::ACE_Select_Reactor_Notify::dispatch_notify(v6::ACE_Notification_Buffer&) () from /usr/openv/lib/libvxACE.so.6
#9 0x00007f731f91326b in v6::ACE_TP_Reactor::handle_notify_events(int&, v6::ACE_TP_Token_Guard&) () from /usr/openv/lib/libvxACE.so.6
#10 0x00007f731f9138ea in v6::ACE_TP_Reactor::dispatch_i(v6::ACE_Time_Value*, v6::ACE_TP_Token_Guard&) () from /usr/openv/lib/libvxACE.so.6
#11 0x00007f731f913a1f in v6::ACE_TP_Reactor::handle_events(v6::ACE_Time_Value*) () from /usr/openv/lib/libvxACE.so.6
#12 0x00007f732017ac4b in v6::TAO_ORB_Core::run(v6::ACE_Time_Value*, int) () from /usr/openv/lib/libvxTAO.so.6
#13 0x00007f7325361cea in Orb::RunFunc(void*) () from /usr/openv/lib/libnborb.so
#14 0x00007f7324a9ab2c in Symantec::NetBackup::Ncf::ACEThreadHook::start(void* (*)(void*), void*) () from /usr/openv/lib/libncf.so
#15 0x00007f73253b0d46 in ThreadHook::start(void* (*)(void*), void*) () from /usr/openv/lib/libnborb.so
#16 0x00007f731f8f7b67 in v6::ACE_Thread_Adapter::invoke() () from /usr/openv/lib/libvxACE.so.6
#17 0x00007f731c09bdd5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f731bdc5b3d in clone () from /lib64/libc.so.6
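When a full trace of all threads is available (for example from GDB's `thread apply all bt` command), it can help to list only the threads whose innermost frame is a lock wait. The following is a minimal sketch, not a Veritas tool: the function name `threads_waiting_on_lock` and the shortened sample text are illustrative assumptions, and the parsing assumes the trace layout shown in the extract above.

```python
import re

# Matches thread headers such as:
#   Thread 13 (Thread 0x7f7310c96700 (LWP 2779)):
THREAD_RE = re.compile(r"^Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP (\d+)\)\):")


def threads_waiting_on_lock(trace):
    """Return {gdb_thread_number: lwp} for threads whose frame #0 is __lll_lock_wait.

    Hypothetical helper for illustration; it only inspects the innermost frame.
    """
    waiting = {}
    thread = None
    lwp = None
    for raw_line in trace.splitlines():
        line = raw_line.strip()
        header = THREAD_RE.match(line)
        if header:
            # Remember which thread the following frames belong to.
            thread, lwp = int(header.group(1)), int(header.group(2))
        elif (thread is not None
              and line.startswith("#0 ")
              and " in __lll_lock_wait " in line):
            waiting[thread] = lwp
    return waiting


# Shortened sample built from the first two lines of each thread in the extract.
sample = """\
Thread 13 (Thread 0x7f7310c96700 (LWP 2779)):
#0 0x00007f731c0a24cd in __lll_lock_wait () from /lib64/libpthread.so.0
Thread 11 (Thread 0x7f7311299700 (LWP 2772)):
#0 0x00007f731c0a24cd in __lll_lock_wait () from /lib64/libpthread.so.0
"""

print(threads_waiting_on_lock(sample))  # {13: 2779, 11: 2772}
```

A thread waiting on a lock is not by itself proof of a deadlock. The result is only a starting point for comparing which locks each listed thread already holds.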
Cause
The cause has been identified as a deadlock: a thread enters a waiting state for a resource that is held by another thread, and that other thread is itself waiting. If neither thread can ever leave its waiting state, the threads are considered to be deadlocked.
This scenario is very rare, and can sometimes result in the nbpem process terminating unexpectedly.
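The general pattern behind this kind of deadlock can be sketched with two threads that take the same two locks in opposite order. This is a generic illustration only and not the nbpem code: the lock names, the `try_in_order` and `demo` functions, and the timeout value are all assumptions made for the example. A timeout is used so that the sketch terminates, whereas a plain blocking acquire would wait forever.

```python
import threading


def try_in_order(first, second, barrier, results, name, timeout=0.2):
    """Hold `first`, then try to take `second` (hypothetical helper)."""
    with first:
        # Wait until both threads hold their first lock.
        barrier.wait()
        # Each thread now wants the lock the other one is holding.
        got_second = second.acquire(timeout=timeout)
        results[name] = got_second
        if got_second:
            second.release()
        # Keep `first` held until both attempts have finished, so the
        # outcome does not depend on which thread times out first.
        barrier.wait()


def demo():
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    barrier = threading.Barrier(2)
    results = {}
    # t1 takes A then B; t2 takes B then A (opposite order).
    t1 = threading.Thread(target=try_in_order,
                          args=(lock_a, lock_b, barrier, results, "t1"))
    t2 = threading.Thread(target=try_in_order,
                          args=(lock_b, lock_a, barrier, results, "t2"))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results


if __name__ == "__main__":
    # Neither thread obtains its second lock: both values are False.
    print(demo())
```

Without the timeout, both threads would remain blocked in their second acquire indefinitely, which is the state a debugger reports as threads sitting in a lock-wait frame.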
Solution
The formal resolution for this issue is available via a binary fix that addresses the code defect in the NetBackup Policy Execution Manager (NBPEM):
NetBackup_8.1_3958927
NetBackup_8.1.1_3940466
NetBackup_8.1.2_3959197
This issue is resolved in the next scheduled release of NetBackup (8.2).