Problem
In network environments with packet loss or heavy packet reordering, ‘secondary logging’ (logging of data on the secondary’s SRL, or Storage Replicator Log) may result in a hang or a panic at the DR (Disaster Recovery) site in VVR/CVR (Volume Replicator) configurations.
1. ‘vx’ commands on the DR site may hang with one of the following stack traces:
A.
vxio:vol_commit_iowait_objects()
vxio:vol_commit_iolock_objects()
vxio:vol_ktrans_commit()
vxio:volconfig_ioctl() - frame recycled
B.
genunix:cv_wait()
genunix:delay_common()
genunix:delay()
vxio:vol_rv_transaction_prepare()
vxio:vol_commit_iolock_objects()
vxio:vol_ktrans_commit()
vxio:volconfig_ioctl() - frame recycled
2. The DR site may panic with one of the following stack traces:
A.
unix:panic()
unix:mutex_enter() - frame recycled
vxio:vol_rv_sec_write_childdone()
vxio:vol_rv_transaction_prepare()
vxio:vol_commit_iolock_objects()
vxio:vol_ktrans_commit()
vxio:volconfig_ioctl() - frame recycled
vxio:volsioctl_real()
B.
unix:panic()
unix:mutex_enter() - frame recycled
vxio:vol_rv_check_logend_queue()
vxio:vol_rv_sec_write_childdone()
vxio:volsiodone() - frame recycled
vxio:vol_subdisksio_done()
vxio:volkcontext_process()
C.
unix:vpanic()
vxio:vol_mv_write_childdone()
vxio:volsiodone()
vxio:volsiodone_fun()
vxio:volkcontext_process()
D.
unix:trap()
<trap>vxio:vol_rv_inactivate_wsio()
vxio:vol_rv_restart_wsio()
vxio:vol_rv_serialise_sec_logging()
vxio:vol_rv_serialize()
vxio:vol_rv_errorhandler_start()
vxio:voliod_iohandle()
vxio:voliod_loop()
Both of the following conditions must be met for the issue to be triggered:
- The DR site's SRL logs out-of-order writes, which can happen only when there is heavy packet reordering or packet loss in the network.
- A VxVM transaction is active at the same time.
Because the issue is timing-related and depends on network behavior, it may not occur every time, even when both conditions are present.
Cause
Reason for the Hang of ‘vx’ commands
- Vxconfigd hangs, waiting for the VxVM transaction to complete
- A deadlock occurs between the VxVM transaction and the out-of-order writes, resulting in the hang
Reason for the Panic
When out-of-order writes occur on the DR site's SRL, these are unwound at the time of a VxVM transaction. Unwinding of these IO transactions may race with their natural completion, and may cause a system panic if both of them occur simultaneously.
Solution
Disable secondary logging by resetting its tunables. The setting persists across reboots.
Here are the implementation procedures for the following VVR versions:
6.0.x
6.1 to 7.0
Procedure to Disable Secondary Logging for Versions 6.0.x
To disable secondary logging, the tunables that are associated with secondary logging should be reset to ‘0’. Replication should be paused and resumed for the tunables to take effect.
Note: The Secondary’s data should be fully synchronized once (such as with autosync) after disabling secondary logging.
Detailed steps:
1. Pause replication on both the primary and secondary sites of the VVR pair.
vxrlink -g [dgname] pause [rlink_name]
Note: Repeat the above operation at both the primary and secondary sites for all the RVGs/Rlinks that are configured in the system.
2. Run the following commands on all of the servers, at both the primary and secondary sites, of the VVR/CVR pair, to disable the tunables.
# vxtune vol_rv_do_secondary_logging 0
# vxtune vol_rv_sec_logging_enabled 0
Note: If any of the VVR pairs are in CVM (Cluster Volume Manager)/CVR, these tunables should be turned off on all of the hosts within the cluster, including the master and all slaves of the primary and secondary sites.
To check the current value of these tunables, the following commands are useful:
A. Variable values before modification:
# vxtune vol_rv_do_secondary_logging
Tunable                           Current Value   Default Value Reboot
--------------------------------- --------------- ------------- ------
vol_rv_do_secondary_logging       1               1             N
# vxtune vol_rv_sec_logging_enabled
Tunable                           Current Value   Default Value Reboot
--------------------------------- --------------- ------------- ------
vol_rv_sec_logging_enabled        1               1             N
B. Variable values after modification:
# vxtune vol_rv_do_secondary_logging
Tunable                           Current Value   Default Value Reboot
--------------------------------- --------------- ------------- ------
vol_rv_do_secondary_logging       0               1             N
# vxtune vol_rv_sec_logging_enabled
Tunable                           Current Value   Default Value Reboot
--------------------------------- --------------- ------------- ------
vol_rv_sec_logging_enabled        0               1             N
3. Resume replication from both the primary and secondary sites of the VVR pair:
vxrlink -g [dgname] resume [rlink_name]
Note: Repeat the above operation at both sites (primary and DR) for all the RVGs/Rlinks configured in the system.
4. Perform an autosync on all of the RVGs (Replicated Volume Groups) in the system where these tunables were turned off. Note that these are system-wide tunables, so autosync must be performed for every RVG present on the setup.
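The steps above can be previewed as a dry run before touching a live system. The following is a minimal sketch that only prints commands and executes nothing. It assumes a hypothetical input format of one `dgname rlink_name` pair per line on standard input, assumes that `resume` is the `vxrlink` counterpart of `pause`, and assumes that both tunables shown in the verification output are reset. The function name `print_plan` is illustrative. The autosync step is printed only as a reminder.

```shell
#!/bin/sh
# Dry-run sketch: prints the 6.0.x command sequence, executes nothing.
# Input (hypothetical format): one "dgname rlink_name" pair per line on stdin.

# Assumption: both tunables from the verification output are reset.
TUNABLES="vol_rv_do_secondary_logging vol_rv_sec_logging_enabled"

print_plan() {
    pairs=$(cat)    # keep the pairs so they can be walked twice

    # Step 1: pause every rlink.
    printf '%s\n' "$pairs" | while read -r dg rlink; do
        if [ -n "$dg" ]; then
            echo "vxrlink -g $dg pause $rlink"
        fi
    done

    # Step 2: reset the tunables.
    for t in $TUNABLES; do
        echo "vxtune $t 0"
    done

    # Step 3: resume every rlink (assumed counterpart of pause).
    printf '%s\n' "$pairs" | while read -r dg rlink; do
        if [ -n "$dg" ]; then
            echo "vxrlink -g $dg resume $rlink"
        fi
    done

    # Step 4: reminder only.
    echo "# step 4: autosync every RVG on this system"
}
```

For example, `printf 'dg1 rlk_a\ndg1 rlk_b\n' | print_plan` prints the plan for two rlinks in one disk group. The plan must be run on every host at both sites, as the notes above describe.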
Procedure to Disable Secondary Logging for Versions 6.1 to 7.0
For versions 6.1 to 7.0, ‘bulk transfer’ must be disabled along with secondary logging. To disable both, the tunables associated with them should be reset to ‘0’. Replication should be paused and resumed for the tunables to take effect. The secondary’s data should be fully synchronized once (such as with autosync) after disabling secondary logging and bulk transfer.
Detailed steps:
1. Pause replication on both Primary & Secondary Sites of the VVR pair:
vxrlink -g [dgname] pause [rlink_name]
Note: Repeat the above operation at both sites (Primary & DR) for all the RVGs/Rlinks configured in the system.
2. Execute the following commands on all the servers at both the Primary & Secondary Sites of the VVR/CVR pair to disable the tunables:
# vxtune vol_rv_do_secondary_logging 0
# vxtune vol_rv_sec_logging_enabled 0
# vxtune vol_rv_bulk_transfer 0
Note: If any of the VVR pairs are in CVM/CVR/FSS, these tunables should be turned off on all the hosts of the cluster (that is, the master and all slaves of the Primary and DR sites).
To check the current value of these tunables, the following commands are useful:
A. Variable values before modification:
# vxtune vol_rv_do_secondary_logging
Tunable                         Current Value Default Value Reboot Clusterwide
------------------------------- ------------- ------------- ------ -----------
vol_rv_do_secondary_logging     1             1             N      N
# vxtune vol_rv_sec_logging_enabled
Tunable                         Current Value Default Value Reboot Clusterwide
------------------------------- ------------- ------------- ------ -----------
vol_rv_sec_logging_enabled      1             1             N      N
# vxtune vol_rv_bulk_transfer
Tunable                         Current Value Default Value Reboot Clusterwide
------------------------------- ------------- ------------- ------ -----------
vol_rv_bulk_transfer            1             1             N      N
B. Variable values after modification:
# vxtune vol_rv_do_secondary_logging
Tunable                         Current Value Default Value Reboot Clusterwide
------------------------------- ------------- ------------- ------ -----------
vol_rv_do_secondary_logging     0             1             N      N
# vxtune vol_rv_sec_logging_enabled
Tunable                         Current Value Default Value Reboot Clusterwide
------------------------------- ------------- ------------- ------ -----------
vol_rv_sec_logging_enabled      0             1             N      N
# vxtune vol_rv_bulk_transfer
Tunable                         Current Value Default Value Reboot Clusterwide
------------------------------- ------------- ------------- ------ -----------
vol_rv_bulk_transfer            0             1             N      N
3. Resume replication from both the Primary & Secondary Sites of the VVR pair:
vxrlink -g [dgname] resume [rlink_name]
Note: Repeat the above operation at both sites (Primary & DR) for all the RVGs/Rlinks configured in the system.
4. Perform an autosync on all the RVGs existing in the system where these tunables were turned off. Note that these are system-wide tunables, and therefore, autosync needs to be performed for all the RVGs present on the setup.
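Before resuming replication, it can help to confirm mechanically that every tunable reads 0 on each host. The following is a minimal sketch, assuming the `vxtune <tunable>` output layout shown in step 2 (a header line, a dashed line, then one data row whose second field is the current value). The helper names `current_value` and `all_disabled` are illustrative and are not part of VxVM.

```shell
#!/bin/sh
# Sketch: check that the listed tunables currently read 0.
# Assumes the vxtune output layout shown in this article: the data row
# starts with the tunable name and its second field is the current value.

# Print the "Current Value" field of the data row read from stdin.
current_value() {
    awk '/^vol_rv_/ { print $2 }'
}

# Return 0 only if every tunable given as an argument currently reads 0.
all_disabled() {
    for t in "$@"; do
        v=$(vxtune "$t" | current_value)
        if [ "$v" != "0" ]; then
            echo "$t is still '$v'" >&2
            return 1
        fi
    done
    return 0
}
```

A possible use on each host is `all_disabled vol_rv_do_secondary_logging vol_rv_sec_logging_enabled vol_rv_bulk_transfer && echo "tunables are off"`. The check is per host, so it would need to be repeated on every node at both sites.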
The feature of Secondary Logging is being redesigned and will be available in a future release.
The fix for Etrack 3729078 must be applied.
The incident in question is:
Etrack 3729078 - Object flags for RVG should not use the same flag value(s) reserved for Base Object
The above incident causes VVR to misinterpret another kernel flag as the secondary logging flag. This causes various issues, such as panics, VVR-related hangs, and rlinks that repeatedly disconnect and reconnect. For example, an rlink can get stuck in the following message loop:
VxVM VVR vxio V-5-0-265 Rlink rlk_name connected to remote
VxVM VVR vxio V-5-0-267 Rlink rlk_name disconnecting due to ack timeout on start_update message
VxVM VVR vxio V-5-0-266 rlink rlk_name disconnected from remote
VxVM VVR vxio V-5-0-330 Unable to connect to rlink rlk_name on rvg rvg_name: Rlink already connected to remote
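A loop like the one above can be spotted by counting the ack-timeout disconnect message (V-5-0-267) per rlink. The following is a minimal sketch that reads log lines from standard input. The function name `flapping_rlinks` and the default threshold of 3 are illustrative assumptions, and the location of the system log is deliberately left to the caller.

```shell
#!/bin/sh
# Sketch: report rlinks that logged the ack-timeout disconnect message
# (V-5-0-267) at least <threshold> times. Reads log lines from stdin.
# The default threshold of 3 is an arbitrary, illustrative value.

flapping_rlinks() {
    threshold=${1:-3}
    awk -v limit="$threshold" '
        /V-5-0-267/ {
            # In this message the rlink name follows the word "Rlink".
            for (i = 1; i < NF; i++) {
                if ($i == "Rlink") { count[$(i + 1)]++; break }
            }
        }
        END {
            for (r in count)
                if (count[r] >= limit) print r, count[r]
        }'
}
```

For example, piping a log excerpt through `flapping_rlinks 5` prints each rlink name with its count when it hit five or more ack timeouts. A non-empty result only suggests the loop described above and does not by itself confirm the Etrack 3729078 condition.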
The following fixes can be downloaded from the Services and Operations Readiness Tools (SORT) website to fix the issue.
https://sort.veritas.com/patch/patch_lookup
SF 6.2.1
vm-sles11_x86_64-Patch-6.2.1.300
vm-sol11_sparc-Patch-6.2.1.400
vm-sol10_sparc-Patch-6.2.1.400
vm-aix-Patch-6.2.1.400
sfha-rhel7.3_x86_64-Patch-6.2.1.400
sfha-rhel6_x86_64-Patch-6.2.1.300
sfha-sles12sp2_x86_64-Patch-6.2.1.300
SF 6.1.1
sfha-rhel5_x86_64-Patch-6.1.1.100
sfha-rhel6.8_x86_64-Patch-6.1.1.400
sfha-rhel6.9_x86_64-Patch-6.1.1.600
sfha-rhel6_x86_64-Patch-6.1.1.500
sfha-sles11sp4_x86_64-Patch-6.1.1.100
sfha-sles11_x86_64-Patch-6.1.1.200
sfha-sol10_sparc-Patch-6.1.1.100
sfha-sol11_sparc-Patch-6.1.1.100
sfha-aix-Patch-6.1.1.200
vm-rhel5_x86_64-Patch-6.1.1.400
vm-rhel6_x86_64-Patch-6.1.1.400
vm-sles11_x86_64-Patch-6.1.1.200
vm-sles11_x86_64-Patch-6.1.1.400
vm-sol10_sparc-Patch-6.1.1.400
vm-sol11_sparc-Patch-6.1.1.400
vm-aix-Patch-6.1.1.400
SF 6.0.5
sfha-rhel6.8_x86_64-Patch-6.0.5.400
sfha-sles11sp4_x86_64-Patch-6.0.5.100
sfha-sol11.3_sparc-Patch-6.0.5.200
sfha-sol11.3_x64-Patch-6.0.5.200
vm-rhel5_x86_64-Patch-6.0.5.300
vm-rhel6_x86_64-Patch-6.0.5.300
vm-sles10_x86_64-Patch-6.0.5.300
vm-sles11_x86_64-Patch-6.0.5.300
vm-sol10_sparc-Patch-6.0.5.300
vm-sol10_x64-Patch-6.0.5.300
vm-sol11_sparc-Patch-6.0.5.300
vm-sol11_x64-Patch-6.0.5.300
vm-aix-Patch-6.0.5.300
vm-hpux1131-Patch-6.0.5.300
Additional Notes
What is Secondary Logging and what is the impact of disabling it?
Secondary logging is an advanced feature that improves replication performance and throughput. It uses the DR/Secondary’s SRL to stage data before writing it to the data volumes. When Secondary Logging is disabled, replication performance (that is, the rate of data transfer from the Primary to the DR site) is still enhanced in 6.x, but the improvement in CVR environments ceases: the network-transfer performance boost gained by logging data on the Secondary’s SRL can no longer be leveraged. Replication performance would nevertheless be as good as in pre-6.x versions.
What is Bulk Transfer and what is the impact of disabling it?
With DG versions 190 and above, Bulk Transfer is enabled automatically to use network bandwidth effectively for replication; data is replicated to the disaster recovery (DR) site in bulk, 256 KB at a time. Bulk Transfer requires Secondary Logging to be enabled, so disabling Secondary Logging/Bulk Transfer disables replication of data in bulk. Replication performance would nevertheless be as good as in pre-6.x versions.
Note: As of 7.1, secondary logging has been redesigned, and enabling it is mandatory:
# vxtune vol_rv_do_secondary_logging 1
# vxtune vol_rv_sec_logging_enabled 1
# vxtune vol_rv_bulk_transfer 1