Resolving vxconfigd Core Dump Due to Duplicate Disk Access Records

Article: 100074241
Last Published: 2025-04-25
Product(s): InfoScale & Storage Foundation

Description

In a shared-storage CVM cluster configured with inconsistent enclosure names across nodes, the vxconfigd process can crash with a segmentation fault (SIGSEGV) during a storage array live upgrade or a transient storage path failure, taking disk groups offline unexpectedly. The crash occurs because duplicate Disk Access (DA) records point to the same device, which leads to invalid memory access and a core dump.

 

Issue Overview

During a live storage upgrade with Veritas Volume Manager (VxVM) online, disks are temporarily removed and re-added almost simultaneously across cluster nodes. This sequence of events can lead to inconsistencies in Disk Access (DA) and Disk Media (DM) record associations, causing vxconfigd to crash.

Key Observations

  1. Disk Removal and Re-addition:
    • Disks are removed during the upgrade, leaving the associated Disk Media (DM) records with no Disk Access (DA) record.
    • After disks are re-added, VxVM begins re-onlining the impacted disks.
  2. Time Lag Between Nodes:
    • A small time lag exists between the two nodes detecting the re-added disks.
    • The slave node detects the disk first and initiates the volume asymmetry process:
      • Adds a remote DA record on the master node (which hasn’t detected the disk yet).
      • Associates the remote DA with the DM record.
      • Starts volume recovery.
  3. Simultaneous Detection on Master Node:
    • The master node detects the same disk locally and creates a local DA record.
    • The volume asymmetry process occasionally selects the local DA record for DM association (expected behavior).
  4. Conflict Between Local and Remote DA Records:
    • The master node onlines the local DA record and notices a remote DA record with the same UDID.
    • Replaces the remote DA record with the local DA record.
    • Deletes the local DA record after completing the online process.
    • Fails to recognize that the local DA record was already associated with a DM record:
      • Local DA record is deleted.
      • Remote DA record (converted) is left unassociated with the DM record.
  5. Resulting State:
    • DM record points to an invalid DA record.
    • This inconsistency causes vxconfigd to core dump.

 

Impact

  • vxconfigd crash disrupts volume management and may impact application availability.
  • Duplicate DA records pointing to the same device are observed in the environment.
 

 

Solution

Workaround

To resolve the issue and prevent further vxconfigd crashes, follow these steps:

Step 1: Identify Duplicate Records

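  1. Run the following command to count Disk Access (DA) record entries per device. Any count greater than 1 indicates a duplicate. The output is sorted first because uniq only collapses adjacent identical lines:

                        vxdisk -o alldgs list | awk '{print $1}' | grep -v DEVICE | sort | uniq -c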
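
As a minimal sketch of the same check, the helper below prints only the device names that occur more than once, so a clean system produces no output. The function name find_duplicate_da is hypothetical, and the sketch assumes the device name is the first column and the header line starts with DEVICE, as in the command above.

```shell
# Read `vxdisk -o alldgs list` output on standard input and print each
# device name (first column) that appears more than once.
# Assumption: the header line begins with the word DEVICE.
find_duplicate_da() {
    awk '$1 != "" && $1 != "DEVICE" { print $1 }' | sort | uniq -d
}
```

Typical use would be: vxdisk -o alldgs list | find_duplicate_da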

 

Step 2: Clean Up Duplicate Records

Perform the following steps on the affected node(s):

  1. Take the Node Out of the Cluster:
    • Stop the cluster services on the affected node.
    • Ensure the node is isolated from the cluster.
  2. Rename Configuration Files:
    • Rename the /etc/vx/salrecs file:

                            mv /etc/vx/salrecs /etc/vx/salrecs.bkp

    • Rename the /etc/vx/disk.info file:

                            mv /etc/vx/disk.info /etc/vx/disk.info.bkp

  3. Reset vxconfigd:
    • Run the following command to reset vxconfigd:

                            vxconfigd -kr reset

  4. Verify Duplicate Records Are Removed:
    • Check the output of the following command to ensure no duplicate records exist:

                            vxdisk list

  5. Rejoin the Node to the Cluster:
    • Start the cluster services on the node.
    • Rejoin the node to the cluster.
  6. Final Verification:
    • Run the following command again to confirm no duplicate records exist:

                            vxdisk list
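
The rename step above overwrites any earlier .bkp copy if the procedure is repeated. The following is a minimal sketch of a variant that keeps every backup by appending a timestamp. The function name move_aside and the suffix format are illustrative assumptions, not part of the product.

```shell
# Move each named file aside with a timestamp suffix so that repeated
# runs never overwrite an earlier backup. Files that do not exist are
# reported and skipped.
move_aside() {
    stamp=$(date +%Y%m%d%H%M%S)
    for f in "$@"; do
        if [ -e "$f" ]; then
            mv -- "$f" "$f.bkp.$stamp" && echo "moved $f -> $f.bkp.$stamp"
        else
            echo "skipped $f (not present)"
        fi
    done
}
```

On the isolated node this could be called as: move_aside /etc/vx/salrecs /etc/vx/disk.info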

 

 

Preventive Measures

To avoid encountering this issue in the future:

  1. Stop VxVM Before Storage Array Upgrades:
    • Always stop VxVM services before performing live storage upgrades.
  2. Unify Enclosure Names Across Nodes:
    • Ensure consistent enclosure names are used across all nodes in the cluster.
  3. Clean Up Stale Configuration Files:
    • Remove stale entries from /etc/vx/salrecs and /etc/vx/disk.info files before performing storage upgrades.
  4. Ensure Stable Storage Connectivity:
    • Verify that storage paths are stable and consistent across all nodes.
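
One possible way to check enclosure naming consistency is to save an enclosure listing on each node (for example, the output of vxdmpadm listenclosure all) and compare the files. The sketch below assumes the first column of the listing is the enclosure name and the third is the enclosure serial number, and that header and separator lines start with ENCLR_NAME or consist of = characters. Verify this layout against the actual output before relying on it. The function names are hypothetical.

```shell
# Print sorted "name serial" pairs from a saved enclosure listing.
# Assumption: column 1 is the enclosure name, column 3 the serial number.
enclosure_keys() {
    awk '$1 != "" && $1 != "ENCLR_NAME" && $1 !~ /^=+$/ { print $1, $3 }' "$1" | sort -u
}

# Compare two saved listings, one per node. Prints the pairs found on
# only one node and returns 1 if there are any; returns 0 if they match.
compare_enclosures() {
    a=$(mktemp)
    b=$(mktemp)
    enclosure_keys "$1" > "$a"
    enclosure_keys "$2" > "$b"
    # comm -3 suppresses lines common to both files.
    diff_out=$(comm -3 "$a" "$b")
    rm -f "$a" "$b"
    if [ -n "$diff_out" ]; then
        printf '%s\n' "$diff_out"
        return 1
    fi
    return 0
}
```

For example, after copying each node's listing to one host: compare_enclosures node1.txt node2.txt (the file names are placeholders). Pairing the name with the serial number also catches the case where both nodes use the same set of names but assign them to different arrays.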
 

 

Technical Details

Root Cause Analysis (RCA)

  • The issue occurs due to inconsistent handling of Disk Access (DA) and Disk Media (DM) records during live storage upgrades.
  • Duplicate DA records are created when the slave node detects the disk first and initiates volume asymmetry, while the master node detects the same disk locally shortly afterward.
  • The master node replaces the dangling remote DA record with the local DA record, which is already associated with a DM record, and then deletes the local DA record. The DM record is left pointing to an invalid DA record.

Logs and Evidence

  • Logs show simultaneous detection of the disk by both nodes, leading to duplicate DA records.
  • Core dump analysis indicates invalid memory access due to the dangling DA pointer.
 

 

Conclusion

By following the workaround steps and implementing preventive measures, the vxconfigd core dump issue can be resolved and avoided in the future. It is recommended to use consistent enclosure names for shared storage and stop CVM services before performing live storage upgrades to prevent similar issues.

 

 

 

References

JIRA : STESC-9264
