Long-running disk image cleanup job fails with error code 23.

Article: 100040524
Last Published: 2017-10-29
Product(s): NetBackup & Alta Data Protection

Problem

For image cleanup jobs, a CORBA connection is established between the media server (bpdm) and the master server (nbjm). Status is sent across this connection after each image is deleted. If the cleanup of a single disk image takes a long time (say, more than an hour), the connection carries no traffic for that whole period. Processes external to NetBackup may close such an idle connection after a certain amount of time. When NetBackup then tries to reuse the closed connection, the job fails with status code 23 (socket read failed).

Error Message

Status code 23, socket read failed

Some errors in bpdbm logs:
jmcomm_processException: retrying call after CORBA::COMM_FAILURE

Some errors in libnbpxyhelper logs:
Error Desc:       NBPXY_SE_UNKNOWN
Status Msg:       A SSL socket read failed. Status: 5 Msg: , nbu status = 23, 

Cause

There may be TCP problems with the long-idle socket, or a firewall between (or on) the master server and the media server might drop idle connections.
The timeout may vary depending on the firewall and TCP Keepalive settings.

Solution

To prevent the connection from being dropped by something external to NetBackup, configure the TCP Keepalive settings so that the socket does not stay idle long enough to be closed. This should be done on the master server, and optionally on the media server. Refer to the article TCP Keepalive Best Practices - detecting network drops and preventing idle socket timeout.
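
As a rough illustration of what "appropriately" means here, the following minimal Python sketch reads the system-wide TCP Keepalive settings on a Linux host and checks whether the first keepalive probe would be sent before an idle timeout expires. It is not a NetBackup tool. The function names are hypothetical, and the one-hour idle timeout is only an assumed example value; the real timeout depends on the firewall in use, and recommended values should be taken from the best practices article. On Linux, tcp_keepalive_time commonly defaults to 7200 seconds, which is longer than a one-hour idle timeout.

```python
from pathlib import Path

# Linux exposes the system-wide TCP keepalive settings as files in this directory.
# tcp_keepalive_time and tcp_keepalive_intvl are in seconds; tcp_keepalive_probes is a count.
PROC_DIR = Path("/proc/sys/net/ipv4")
SETTINGS = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive_settings(proc_dir=PROC_DIR):
    """Return the current keepalive settings as a dict of integers."""
    return {name: int((proc_dir / name).read_text().strip()) for name in SETTINGS}


def first_probe_beats_timeout(keepalive_time_s, idle_timeout_s):
    """Return True if the first keepalive probe is sent before the idle timeout expires."""
    return keepalive_time_s < idle_timeout_s


if __name__ == "__main__":
    # Assumption for illustration only: a firewall that drops connections idle for one hour.
    ASSUMED_IDLE_TIMEOUT_S = 3600

    if all((PROC_DIR / name).is_file() for name in SETTINGS):
        settings = read_keepalive_settings()
        ok = first_probe_beats_timeout(settings["tcp_keepalive_time"], ASSUMED_IDLE_TIMEOUT_S)
        print(settings)
        print("first probe is sent in time" if ok else "first probe is sent too late")
    else:
        print("Keepalive settings not found under /proc/sys/net/ipv4 (not a Linux host?).")
```

If the check reports that the first probe is sent too late, the idle socket can be dropped before any keepalive traffic is generated, which matches the failure described in this article.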

References

Etrack : 3927293
