"failed waiting for child process (34)" and "media manager - system error occurred (174)" errors experienced when attempting a replication SLP operation using the NetBackup Replication Director feature
Problem
The errors "failed waiting for child process (34)" and "media manager - system error occurred (174)" are experienced when attempting a replication SLP operation using the NetBackup Replication Director feature.
These errors are seen when using the SnapVault or SnapMirror replication methods.
Error Message
04/03/2012 23:44:37 - Error (pid=5036) ReplicationJob::WaitForReplicationCommandStatus: Replication failed for backup id netapp1_1330904461: media manager - system error occurred (174)
04/03/2012 23:44:37 - Replicate failed for backup id netapp1_1330904461 with status 174
04/03/2012 23:44:37 - end operation
failed waiting for child process(34)
Cause
Common causes for status 174 and 34 errors are:
- Incorrect source/destination storage controller relationships or access permissions
- Name resolution failures or network issues
- DFM member(s) removed from a dataset (a member is the volume/path being protected in a dataset)
- Broken or corrupted member relationships within DFM
- Hosts (storage controllers) added to the "ignore list" within DFM
- Non-conformant datasets (suffering from the above symptoms within DFM)
Solution
Consider the following:
The first operation in the SLP is a snapshot to the primary storage unit. The second operation is a replication to a target storage unit. This operation fails with the error:
04/03/2012 23:44:37 - Error (pid=5036) ReplicationJob::WaitForReplicationCommandStatus: Replication failed for backup id netapp1_1330904461: media manager - system error occurred (174)
04/03/2012 23:44:37 - Replicate failed for backup id netapp1_1330904461 with status 174
04/03/2012 23:44:37 - end operation
failed waiting for child process(34)
Storage controller relationships/access permissions/name resolution/network issues:
Check the source and destination filer consoles for messages similar to the following (in this case the message was displayed on the destination filer's console):
Sun Mar 4 21:02:34 GMT [netapp2:replication.dst.err:error]: SnapVault: destination transfer from netapp1:/vol/vol2/- to /vol/vol2_1/NetBackup_1330893853_netapp1_vol2 : cannot connect to source filer.
The snapmirror log shows:
dst Sun Mar 4 21:02:29 GMT netapp1:/vol/vol2/- netapp2:/vol/vol2_1/NetBackup_1330893853_netapp1_vol2 Request (Initialize)
dst Sun Mar 4 21:02:34 GMT netapp1:/vol/vol2/- netapp2:/vol/vol2_1/NetBackup_1330893853_netapp1_vol2 Abort (cannot connect to source filer)
dst Sun Mar 4 21:03:03 GMT netapp1:/vol/vol2/- netapp2:/vol/vol2_1/NetBackup_1330893853_netapp1_vol2 Request (Retry)
dst Sun Mar 4 21:03:04 GMT netapp1:/vol/vol2/- netapp2:/vol/vol2_1/NetBackup_1330893853_netapp1_vol2 Abort (cannot connect to source filer)
cmd Sun Mar 4 21:03:09 GMT - netapp2:/vol/vol2_1/NetBackup_1330893853_netapp1_vol2 Stop_command
Other network related errors can show:
dst Tue Jun 25 13:01:32 BST netapp2.domain.com:xxx_xxx netapp1:NetBackup_1364xxx_mirror_netapp2_xxx_xxx Request (Update)
dst Tue Jun 25 13:01:40 BST netapp2.domain.com:xxx_xxx netapp1:NetBackup_1364xxx_mirror_netapp2_xxx_xxx Abort (transfer aborted because of network error)
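The pattern to look for in these excerpts is a Request entry followed by an Abort entry, with the failure reason in parentheses. As a rough triage aid (this is an illustrative sketch, not a NetBackup or NetApp tool), the abort reasons can be pulled out of a saved copy of the snapmirror log with a few lines of Python:

```python
import re

# Hypothetical helper (not part of NetBackup/DFM): scan saved snapmirror
# log lines for Abort records and collect the reason in parentheses.
ABORT_RE = re.compile(r"\bAbort \((?P<reason>[^)]+)\)")

def find_abort_reasons(log_lines):
    """Return the Abort reason for every aborted-transfer line."""
    reasons = []
    for line in log_lines:
        match = ABORT_RE.search(line)
        if match:
            reasons.append(match.group("reason"))
    return reasons

# Sample lines taken from the log excerpt in this article.
sample = [
    "dst Sun Mar 4 21:02:29 GMT netapp1:/vol/vol2/- netapp2:/vol/vol2_1/NetBackup_1330893853_netapp1_vol2 Request (Initialize)",
    "dst Sun Mar 4 21:02:34 GMT netapp1:/vol/vol2/- netapp2:/vol/vol2_1/NetBackup_1330893853_netapp1_vol2 Abort (cannot connect to source filer)",
]
print(find_abort_reasons(sample))  # ['cannot connect to source filer']
```

A recurring reason such as "cannot connect to source filer" or "transfer aborted because of network error" points directly at the connectivity and name-resolution checks described next.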
Ensure connectivity and correct name resolution between the source and destination filers via /etc/hosts or DNS.
Check:
- Short/long host name mixture as well as upper/lower case host names
- Lack of connectivity on designated IP addresses via a basic ping test, for example:
  netapp1> ping 10.x.x.16
  no answer from 10.x.x.16
  netapp1> ping netapp2
  no answer from netapp2
- Check that the DNS domain name is correct - "options dns"
- If using "legacy" for snapmirror.access, ensure the names in snapmirror.allow resolve correctly (forward and reverse lookups). It is preferred to use "snapmirror.access host=" and "snapvault.access host="
- The snapmirror.checkip.enable on/off setting determines how the IP addresses and hostnames in snapmirror.allow are verified (refer to NetApp documentation for more information)
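As a supplementary check from an administrative host (not from the filer itself), the forward and reverse lookups can be compared programmatically. The following is a minimal sketch using only the Python standard library; the filer names "netapp1" and "netapp2" are the hypothetical hosts from the examples above:

```python
import socket

def check_name_resolution(hostname):
    """Forward-resolve a hostname, then reverse-resolve the address,
    and report whether the round trip is consistent (short and
    fully-qualified names both accepted, case-insensitive)."""
    try:
        addr = socket.gethostbyname(hostname)
    except socket.gaierror as exc:
        return {"host": hostname, "ok": False, "error": str(exc)}
    try:
        rname, _aliases, _addrs = socket.gethostbyaddr(addr)
    except socket.herror as exc:
        return {"host": hostname, "addr": addr, "ok": False, "error": str(exc)}
    short = hostname.split(".")[0].lower()
    return {"host": hostname, "addr": addr, "reverse": rname,
            "ok": rname.lower().startswith(short)}

# Example: check both filers as seen from the administering host.
for filer in ("netapp1", "netapp2"):   # hypothetical names from this article
    print(check_name_resolution(filer))
```

A mismatch between the name resolved forward and the name returned by the reverse lookup is exactly the short/long or upper/lower-case inconsistency listed above.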
DFM configuration:
When a NetBackup Replication Director policy is run for the first time (this example uses an NDMP policy), a "Create relationship" job is created in DFM, which creates and defines the dataset and the relationship between the primary member and its destination. Subsequent runs of the policy (if the policy's backup selection list is not added to) generate "On-demand protection" jobs in DFM rather than the initial "Create relationship" job:
Ensure protected members are not missing from the dataset:
210 Backup 202 NetBackup_filer1_vol1_q_958nbmast volume filer2:/vol1_2
If a member, such as ID 152 (the Primary data member in this example), were removed from the dataset or the environment, then when NetBackup runs the Replication Director policy and consequently the DFM "On-demand protection" job, the operation may fail with status 174 or 34.
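A dataset that has lost its primary member can be spotted by checking the member listing programmatically. The sketch below is illustrative only: the column layout (member ID, node name, dataset ID, dataset name, member type, member path) is assumed from the excerpt above, so verify it against your own `dfpm dataset list -m` output before relying on it:

```python
import re

# Assumed column layout: id, node name, dataset id, dataset name,
# member type, member path. Node names may contain spaces
# (e.g. "Primary data"), hence the non-greedy group.
MEMBER_RE = re.compile(
    r"^(?P<id>\d+)\s+(?P<node>.+?)\s+(?P<dataset_id>\d+)\s+"
    r"(?P<dataset>\S+)\s+(?P<type>\S+)\s+(?P<path>\S+)$"
)

def parse_members(lines):
    """Parse member lines; silently skip lines that do not match."""
    return [m.groupdict() for m in
            (MEMBER_RE.match(l.strip()) for l in lines) if m]

def missing_primary(members):
    """True if no member sits on the 'Primary data' node."""
    return not any(m["node"] == "Primary data" for m in members)

# Sample line from the listing in this article: only a Backup member.
sample = ["210  Backup  202  NetBackup_filer1_vol1_q_958nbmast  volume  filer2:/vol1_2"]
print(missing_primary(parse_members(sample)))  # True
```

A `True` result here corresponds to the failure scenario described above: the protected primary path is no longer a member of the dataset.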
Non-conformant datasets:
Inconsistencies in the dataset may result in a "Nonconformant" error being displayed in the "Status" box of the "Datasets" window in the NetApp Management Console, as indicated by the yellow arrow below:
To view possible reasons why the dataset is in a nonconformant state, or to look for inconsistencies that are not already apparent, run "dfpm dataset conform -D (dataset ID)" (ensure you use the -D option, as this performs a non-destructive "dry run" without making any changes). It is wise to engage NetApp for further assistance with a nonconformant dataset.
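If you script this check, it is safest to hard-code the -D flag so the helper can only ever report inconsistencies, never attempt to repair them. A hypothetical wrapper (the `dfpm` binary is assumed to be on the DFM server's PATH; nothing here is an official Symantec or NetApp tool):

```python
import subprocess

def conform_dry_run(dataset_id, runner=subprocess.run):
    """Execute the non-destructive `dfpm dataset conform -D <id>` dry run.

    The -D flag is fixed on purpose: this helper can only report on the
    dataset's conformance, not change it."""
    cmd = ["dfpm", "dataset", "conform", "-D", str(dataset_id)]
    return runner(cmd, capture_output=True, text=True)
```

The `runner` parameter exists only so the command construction can be exercised without a DFM server present; in normal use the default `subprocess.run` is fine.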
Hosts set to ignore:
Ensure hosts involved in Replication Director operations are not set to ignore:
If a host is ignored but should not be, right-click the host and select "Undo Ignore".
Applies To
DFM 5.0 - OnCommand
Data ONTAP 8.1
NetBackup 7.5