SQL Server parent backup job fails with status code 636 even though the child backup jobs complete successfully
Problem
SQL Server parent backup job fails with status code 636 even though the child backup jobs complete successfully.
The same SQL Server parent job completes successfully only if the destination storage unit resides on the master server itself.
There are no issues with file system backups, irrespective of their destination storage unit.
Error Message
From Job Details
================
BPHDB (the parent job that runs the batch file on the client) completed with status code 0, but the job ended with status code 636:
7/5/2013 2:33:40 PM - Info bphdb(pid=7696) done. status: 0: the requested operation was successfully completed
read from input socket failed(636)
From BPBRM log
==============
The BPBRM process received status code 0 from the client but then encountered an error when closing the socket to NBJM:
14:33:40.566 [7472.10224] <2> bpbrm Exit: client backup EXIT STATUS 0: the requested operation was successfully completed
14:34:17.179 [7472.10224] <8> vnet_close_socket_safely: [vnet.c:2017] error on read EOF 0 0x0
14:34:17.179 [7472.10224] <2> vnet_close_socket_safely: [vnet.c:2029] safe close 9 0x9
14:34:17.179 [7472.10224] <2> bpbrm Exit: Error occured during closure of socket to nbjm, vnet status 9
From NBJM log
=============
NBJM did not receive any indication of the exit status from BPBRM. About 5 minutes after the backup completed, it failed the job with status code 636:
7/5/2013 14:39:45.679 [Diagnostic] NB 51216 nbjm 117 PID:5984 File ID:117 [jobid=42856 parentid=42856] 1 V-117-239 [BackupJob::terminateThisJob] terminated job, jobid=42856, status=636
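The gap between the BPBRM exit and the NBJM termination can be checked from the timestamps in the log excerpts above. The following is a minimal sketch, not part of the original investigation. It assumes the clocks on the media server and the master server are synchronised and in the same time zone. The function and constant names are hypothetical.

```python
from datetime import datetime

# Timestamps copied from the log excerpts in this article.
BPBRM_EXIT = "14:33:40.566"          # bpbrm: client backup EXIT STATUS 0
SOCKET_CLOSE_ERROR = "14:34:17.179"  # bpbrm: error during closure of socket to nbjm
NBJM_TERMINATED = "14:39:45.679"     # nbjm: terminated job, status=636


def elapsed_seconds(start, end, fmt="%H:%M:%S.%f"):
    """Return the number of seconds between two same-day log timestamps."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


if __name__ == "__main__":
    # Time from the successful client exit until nbjm failed the job.
    print(f"{elapsed_seconds(BPBRM_EXIT, NBJM_TERMINATED):.1f} s")
    # Time from the socket close error until nbjm failed the job.
    print(f"{elapsed_seconds(SOCKET_CLOSE_ERROR, NBJM_TERMINATED):.1f} s")
```

Both intervals come out at between five and six minutes, which matches the delay described for the NBJM log.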
Cause
The TCP KeepAliveTime value on the master server, although already reduced to 900,000 ms (15 minutes), was still too high for this environment.
Solution
Reducing the TCP KeepAliveTime setting on the master server to 300,000 ms (5 minutes), followed by a reboot of the master server, allowed the SQL Server parent backup jobs to complete successfully when backing up to the affected media servers.
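As an illustration of the change, the sketch below converts the keepalive intervals mentioned in this article to milliseconds and builds a `reg add` command line for the new value. It assumes the standard Windows location of the setting, the REG_DWORD value KeepAliveTime under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters. The helper names are hypothetical, and the script only prints the command rather than running it. Verify the command against your own change procedure before use.

```python
# Standard Windows registry key that holds the TCP KeepAliveTime value (milliseconds).
TCPIP_PARAMETERS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def minutes_to_ms(minutes):
    """Convert a keepalive interval in minutes to milliseconds."""
    return int(minutes * 60 * 1000)


def build_keepalive_command(keepalive_ms):
    """Build (but do not run) a reg.exe command that sets KeepAliveTime."""
    if keepalive_ms <= 0:
        raise ValueError("KeepAliveTime must be a positive number of milliseconds")
    return [
        "reg", "add", TCPIP_PARAMETERS_KEY,
        "/v", "KeepAliveTime",   # value name
        "/t", "REG_DWORD",       # value type
        "/d", str(keepalive_ms), # value data in milliseconds
        "/f",                    # overwrite without prompting
    ]


if __name__ == "__main__":
    for label, minutes in [("Windows default", 120),
                           ("previous setting", 15),
                           ("new setting", 5)]:
        print(f"{label}: {minutes_to_ms(minutes):,} ms")
    print(" ".join(build_keepalive_command(minutes_to_ms(5))))
```

The command would need to be run from an elevated prompt on the master server. As noted above, the server was rebooted for the new value to take effect.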
Applies To
-- Master server "NBMASTER1" running Windows 2008 R2 and NBU 7.5.0.4.
-- Master server also functions as a Media Server.
-- SQL backups to master server's storage unit are successful.
-- SQL Servers run Windows 2008 R2 and NBU 7.5.0.4 and are part of a 2-node Microsoft cluster ("SQLNODE1" and "SQLNODE2").
-- SQL Servers also function as SAN Media servers.
-- SQL backups (parent jobs) to BOTH media servers' storage units are failing with status code 636.
-- TCP KeepAliveTime setting on both Master and Media servers already set to 900,000 ms (15 mins), previously reduced from the default of 7,200,000 ms (2 hours).
-- Windows Firewall disabled on master server "NBMASTER1" and on the affected media servers "SQLNODE1" and "SQLNODE2".
-- BPBRM on the affected media servers and NBJM on the master server communicate over secondary / backup NICs using HP FlexFabric 10Gb 2-port 554FLB Adapters.
-- These HP 10Gb NICs are connected via a 10Gb SAN switch, so there is no firewall present on this 10Gb link.