How to replace a failed disk that is under Volume Manager control

Article: 100019230
Last Published: 2019-04-03
Product(s): InfoScale & Storage Foundation

Problem

This article discusses how to replace a failed disk that is under Volume Manager control.


Solution

There are several scenarios for replacing a failed disk.
  • Disk failed, but came back online after some time.
  • Disk failed, corrupt, needs to be replaced, and spare Disk is already available in the configuration.
  • Disk failed, corrupt, needs to be replaced, and new Disk needs to be added to the configuration.

Note: A replacement requires a disk that is visible to Volume Manager and does not already belong to a disk group.


Disk failed, but came back online after some time

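Assume the following configuration: two disks (c1t13d0, c1t14d0) in a disk group (testdg) with one mirrored volume (testvol01). A third disk (c1t15d0) is visible but not initialized. The listings in this article have the format of vxdisk list and vxprint -ht output: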
c1t13d0s2    auto:cdsdisk    disk01       testdg       online
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid


dg testdg       default      default  24000    1223026544.30.jerome

dm disk01       c1t13d0s2    auto     65536    35774960 -
dm disk02       c1t14d0s2    auto     65536    35774960 -

v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         c1t13d0  ENA
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA


Now c1t13d0 goes offline because of a SAN issue:
 
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
-            -         disk01       testdg       failed was:c1t13d0s2


dg testdg       default      default  24000    1223026544.30.jerome

dm disk01       -            -        -        -        NODEVICE
dm disk02       c1t14d0s2    auto     65536    35774960 -

v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    DISABLED NODEVICE 2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         -        RLOC
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA


c1t13d0 now shows as failed, and plex testvol01-01 is disabled. After some time the SAN comes back online and the disk is available to the system again:
 
c1t13d0s2    auto:cdsdisk    -            (testdg)     online
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
-            -         disk01       testdg       failed was:c1t13d0s2

The device is back online, but the disk media record disk01 still shows as failed. First run vxreattach with the -c option, which only checks whether the disk can be reattached:
 
# vxreattach -c c1t13d0s2


If the check succeeds, reattach the disk. The following command reattaches it under its old disk media name in its old disk group; -r runs vxrecover if needed, and -b performs the work in the background:

# vxreattach -rb c1t13d0s2


Use vxtask to check the status of the recovery:
 
# vxtask list
TASKID  PTID TYPE/STATE    PCT      PROGRESS
  206           PARENT/R  0.00%    1/0(1) VXRECOVER disk01 testdg
  207   207     ATCOPY/R 04.10%    0/2097152/86016 PLXATT testvol01 testvol01-01 testdg
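
The PCT value can be cross-checked against the PROGRESS column. The sketch below is only an illustration: it reads saved vxtask list output on standard input instead of calling VxVM, the function name copy_progress is hypothetical, and it assumes that PROGRESS is formatted as start/end/current, as in the listing above.

```shell
#!/bin/sh
# copy_progress: hypothetical helper, not part of VxVM.
# Reads `vxtask list` output on stdin and prints "<taskid> <volume> <percent>"
# for each ATCOPY task, recomputing the percentage from the PROGRESS column.
# Assumption: PROGRESS is start/end/current, as in the listing above.
copy_progress() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^ATCOPY\//) {
                    # After TYPE/STATE come PCT, PROGRESS, operation, volume
                    split($(i + 2), p, "/")
                    span = p[2] - p[1]
                    if (span > 0)
                        printf "%s %s %.2f\n", $1, $(i + 4), (p[3] - p[1]) * 100 / span
                }
            }
        }'
}

# Example with the listing from this article:
copy_progress <<'EOF'
TASKID  PTID TYPE/STATE    PCT      PROGRESS
  206           PARENT/R  0.00%    1/0(1) VXRECOVER disk01 testdg
  207   207     ATCOPY/R 04.10%    0/2097152/86016 PLXATT testvol01 testvol01-01 testdg
EOF
```

For the sample listing, 86016 of 2097152 sectors is about 4.10 percent, which agrees with the PCT column.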


Once vxrecover has finished, the configuration is back to normal:
 
c1t13d0s2    auto:cdsdisk    disk01       testdg       online
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid



Disk failed, corrupt, needs to be replaced, and spare Disk is already available in the configuration

We use the same configuration as before, but this time the disk is corrupt and needs to be replaced:

c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
-            -         disk01       testdg       failed was:c1t13d0s2
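
Disk media records that have lost their device can be picked out of such a listing mechanically. The following is a minimal sketch, assuming the column layout shown above; it filters saved vxdisk list output from standard input rather than calling VxVM, and the function name failed_media is hypothetical.

```shell
#!/bin/sh
# failed_media: hypothetical helper, not part of VxVM.
# Reads `vxdisk list` output on stdin and prints
# "<disk media name> <disk group> <state> <last known device>"
# for every disk media record whose device is missing.
failed_media() {
    awk '
        # Such records have "-" in the DEVICE and TYPE columns and a
        # trailing "was:<device>" field, as in the listing above.
        $1 == "-" && $2 == "-" && $6 ~ /^was:/ {
            sub(/^was:/, "", $6)
            print $3, $4, $5, $6
        }'
}

# Example with the listing from this article:
failed_media <<'EOF'
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
-            -         disk01       testdg       failed was:c1t13d0s2
EOF
```

The same filter also matches records in the removed state, because it keys on the "was:" field and not on the state name.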

A disk that is not yet in a disk group (c1t15d0) is already available, so we can use it as the replacement.

There are two ways to do this:
  • Use vxdiskadm, which guides you through the replacement and runs the required commands for you.
  • Run the commands yourself on the command line.

Both are shown below, starting with vxdiskadm.

Run vxdiskadm and select option 5.

# vxdiskadm

Volume Manager Support Operations
Menu: VolumeManager/Disk

1      Add or initialize one or more disks
2      Encapsulate one or more disks
3      Remove a disk
4      Remove a disk for replacement
5      Replace a failed or removed disk
6      Mirror volumes on a disk
7      Move volumes from a disk
8      Enable access to (import) a disk group
9      Remove access to (deport) a disk group
10     Enable (online) a disk device
11     Disable (offline) a disk device
12     Mark a disk as a spare for a disk group
13     Turn off the spare flag on a disk
14     Unrelocate subdisks back to a disk
15     Exclude a disk from hot-relocation use
16     Make a disk available for hot-relocation use
17     Prevent multipathing/Suppress devices from VxVM's view
18     Allow multipathing/Unsuppress devices from VxVM's view
19     List currently suppressed/non-multipathed devices
20     Change the disk naming scheme
21     Get the newly connected/zoned disks in VxVM view
22     Change/Display the default disk layouts
23     Mark a disk as allocator-reserved for a disk group
24     Turn off the allocator-reserved flag on a disk
list   List disk information


?      Display help about menu
??     Display help about the menuing system
q      Exit from menus

Select an operation to perform:  5


On the next screen, select the disk to be replaced and the device to use as the replacement:
 
Replace a failed or removed disk
Menu: VolumeManager/Disk/ReplaceDisk
 Use this menu operation to specify a replacement disk for a disk
 that you removed with the "Remove a disk for replacement" menu
 operation, or that failed during use.  You will be prompted for
 a disk name to replace and a disk device to use as a replacement.
 You can choose an uninitialized disk, in which case the disk will
 be initialized, or you can choose a disk that you have already
 initialized using the Add or initialize a disk menu operation.

Select a removed or failed disk [<disk>,list,q,?] list

Disk group: testdg

DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE

dm disk01       -            -        -        -        NODEVICE


Select a removed or failed disk [<disk>,list,q,?] disk01
 The following devices are available as possible replacements after being
 initialized (or reinitiliazed):

       c0t0d0 c1t15d0

 You can choose one of these devices to replace disk01.
 Choose "none" to abort the replacement of disk01.

Choose a device, or select "none"
[<device>,none,q,?] (default: c0t0d0) c1t15d0
 VxVM  INFO V-5-2-378
The requested operation is to initialize disk device c1t15d0 and
 to then use that device to replace the removed or failed disk
 disk01 in disk group testdg.

Continue with operation? [y,n,q,?] (default: y) y

Use FMR for plex resync? [y,n,q,?] (default: n) n
 VxVM  INFO V-5-2-282
Replacement of disk disk01 in group testdg with disk device
 c1t15d0 completed successfully.

Replace another disk? [y,n,q,?] (default: n) n


This initializes the disk, adds it to the disk group under the old disk media name, and runs vxrecover on the volume.
 
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:cdsdisk    disk01       testdg       online


dg testdg       default      default  24000    1223026544.30.jerome

dm disk01       c1t15d0s2    auto     65536    35774960 -
dm disk02       c1t14d0s2    auto     65536    35774960 -

v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         c1t15d0  ENA
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA


The only thing left is to remove the failed disk so that it can be physically replaced. You can do this with option 3 in vxdiskadm. Adding a new disk is covered in the next scenario.

The same replacement can be done manually on the command line.
Starting again from the failed state:
 
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
-            -         disk01       testdg       failed was:c1t13d0s2


The following steps replace the disk on the command line: initialize the new disk, add it to the disk group under the same disk media name, and start a recovery in the background.
 
# vxdisksetup -i c1t15d0 format=cdsdisk
# vxdg -g testdg -k adddisk disk01=c1t15d0s2
# vxrecover -b testvol01
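
The three commands can be wrapped so that each step only runs if the previous one succeeded. This is a sketch, not a supported tool: the names replace_failed_disk, run and DRY_RUN are hypothetical, the "s2" slice suffix assumes the Solaris device naming used in this article, and by default the commands are only printed, not executed.

```shell
#!/bin/sh
# run: print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# replace_failed_disk: hypothetical wrapper around the three commands above.
# Usage: replace_failed_disk <diskgroup> <disk media name> <new device> <volume>
replace_failed_disk() {
    if [ $# -ne 4 ]; then
        echo "usage: replace_failed_disk diskgroup dmname device volume" >&2
        return 2
    fi
    dg=$1 dm=$2 dev=$3 vol=$4
    # Chain with && so that a failed step stops the sequence.
    run vxdisksetup -i "$dev" format=cdsdisk &&
    run vxdg -g "$dg" -k adddisk "$dm=${dev}s2" &&
    run vxrecover -b "$vol"
}

# Dry run with the names from this article; prints the three commands.
replace_failed_disk testdg disk01 c1t15d0 testvol01
```

Setting DRY_RUN=0 would execute the commands for real, so review the printed sequence first.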


After the recovery is done, the configuration looks like this:

c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:cdsdisk    disk01       testdg       online


dg testdg       default      default  24000    1223026544.30.jerome

dm disk01       c1t15d0s2    auto     65536    35774960 -
dm disk02       c1t14d0s2    auto     65536    35774960 -

v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         c1t15d0  ENA
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA


Finally, remove the corrupt disk from Volume Manager so that it can be physically replaced. Adding a new disk is covered in the next scenario.

# vxdisk rm c1t13d0s2
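
Before removing a device entry, it can be worth confirming that no disk media record still points at it. The check below is a hedged sketch that works on saved vxdisk list output; the function name device_unused is hypothetical and the column positions are assumed to match the listings in this article.

```shell
#!/bin/sh
# device_unused: hypothetical helper, not part of VxVM.
# Reads `vxdisk list` output on stdin. Succeeds (exit 0) only if the given
# device is listed and both its DISK and GROUP columns are "-".
device_unused() {
    awk -v dev="$1" '
        $1 == dev {
            found = 1
            if ($3 != "-" || $4 != "-") busy = 1
        }
        END {
            rc = (found && !busy) ? 0 : 1
            exit rc
        }'
}

# Example with the listing shown after the recovery:
listing='c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:cdsdisk    disk01       testdg       online'

if printf '%s\n' "$listing" | device_unused c1t13d0s2; then
    echo "c1t13d0s2 is not used by any disk group"
fi
```

The check is deliberately conservative: a device that still shows a disk group in parentheses, or that is not listed at all, is reported as not safe to remove.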



Disk failed, corrupt, needs to be replaced, and new Disk needs to be added to the configuration

In the previous two scenarios we either reused the same disk or used a disk that was already in the configuration. This scenario shows how to physically remove the failed disk and replace it with a new one.

Here is our configuration:
c1t13d0s2    auto:cdsdisk    disk01       testdg       online
c1t14d0s2    auto:cdsdisk    disk02       testdg       online


dg testdg       default      default  24000    1223026544.30.jerome

dm disk01       c1t13d0s2    auto     65536    35774960 -
dm disk02       c1t14d0s2    auto     65536    35774960 -

v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         c1t13d0  ENA
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA


Now, again, c1t13d0 fails.

c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
-            -         disk01       testdg       failed was:c1t13d0s2


First, remove the failed disk for replacement. This is option 4 in vxdiskadm:

Volume Manager Support Operations
Menu: VolumeManager/Disk

1      Add or initialize one or more disks
2      Encapsulate one or more disks
3      Remove a disk
4      Remove a disk for replacement
5      Replace a failed or removed disk
6      Mirror volumes on a disk
7      Move volumes from a disk
8      Enable access to (import) a disk group
9      Remove access to (deport) a disk group
10     Enable (online) a disk device
11     Disable (offline) a disk device
12     Mark a disk as a spare for a disk group
13     Turn off the spare flag on a disk
14     Unrelocate subdisks back to a disk
15     Exclude a disk from hot-relocation use
16     Make a disk available for hot-relocation use
17     Prevent multipathing/Suppress devices from VxVM's view
18     Allow multipathing/Unsuppress devices from VxVM's view
19     List currently suppressed/non-multipathed devices
20     Change the disk naming scheme
21     Get the newly connected/zoned disks in VxVM view
22     Change/Display the default disk layouts
23     Mark a disk as allocator-reserved for a disk group
24     Turn off the allocator-reserved flag on a disk
list   List disk information


?      Display help about menu
??     Display help about the menuing system
q      Exit from menus

Select an operation to perform: 4


Then we select the failed disk to be removed.
 
Remove a disk for replacement
Menu: VolumeManager/Disk/RemoveForReplace
 Use this menu operation to remove a physical disk from a disk
 group, while retaining the disk name.  This changes the state
 for the disk name to a "removed" disk.  If there are any
 initialized disks that are not part of a disk group, you will be
 given the option of using one of these disks as a replacement.

Enter disk name [<disk>,list,q,?] list

Disk group: testdg

DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE

dm disk01       -            -        -        -        NODEVICE
dm disk02       c1t14d0s2    auto     65536    35774960 -

Enter disk name [<disk>,list,q,?] disk01
 VxVM  NOTICE V-5-2-371
The following volumes will lose mirrors as a result of this
 operation:

       testvol01

 No data on these volumes will be lost.
 VxVM  NOTICE V-5-2-381
The requested operation is to remove disk disk01 from disk group
 testdg.  The disk name will be kept, along with any volumes using
 the disk, allowing replacement of the disk.

 Select "Replace a failed or removed disk" from the main menu
 when you wish to replace the disk.

Continue with operation? [y,n,q,?] (default: y) y
 VxVM  INFO V-5-2-265 Removal of disk disk01 completed successfully.

Remove another disk? [y,n,q,?] (default: n) n


After this we should see the following output:
 
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
-            -         disk01       testdg       removed was:c1t13d0s2


You can now physically remove the disk and install the new one. Once the new disk is in place, Volume Manager must be told to look for it.

Run the following:
 
# vxdiskconfig
 VxVM  INFO V-5-2-1401 This command may take a few minutes to complete execution
 Executing Solaris command: devfsadm (part 1 of 2) at 10:35:05 BST
 Executing VxVM command: vxdctl enable (part 2 of 2) at 10:35:18 BST
 Command completed at 10:35:21 BST


vxdiskconfig first uses devfsadm to make the OS scan for new devices, and then runs vxdctl enable to make them visible to Volume Manager.
The new disk should now appear:

c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
-            -         disk01       testdg       removed was:c1t13d0s2
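
In the listings in this article, a disk that is visible but not yet initialized shows as auto:none with the status "online invalid". A small filter can list such replacement candidates. This is a sketch under that assumption; it reads saved vxdisk list output from standard input, and the function name uninitialized_disks is hypothetical.

```shell
#!/bin/sh
# uninitialized_disks: hypothetical helper, not part of VxVM.
# Reads `vxdisk list` output on stdin and prints the device name of every
# disk whose TYPE is auto:none and whose STATUS is "online invalid".
uninitialized_disks() {
    awk '$2 == "auto:none" && $5 == "online" && $6 == "invalid" { print $1 }'
}

# Example with the listing from this article:
uninitialized_disks <<'EOF'
c1t13d0s2    auto            -            -            error
c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:none       -            -            online invalid
-            -         disk01       testdg       removed was:c1t13d0s2
EOF
```

Comparing the result before and after vxdiskconfig is one way to spot the newly added disk.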


Run vxdiskadm again and select option 5 to replace the removed disk:
 
Volume Manager Support Operations
Menu: VolumeManager/Disk

1      Add or initialize one or more disks
2      Encapsulate one or more disks
3      Remove a disk
4      Remove a disk for replacement
5      Replace a failed or removed disk
6      Mirror volumes on a disk
7      Move volumes from a disk
8      Enable access to (import) a disk group
9      Remove access to (deport) a disk group
10     Enable (online) a disk device
11     Disable (offline) a disk device
12     Mark a disk as a spare for a disk group
13     Turn off the spare flag on a disk
14     Unrelocate subdisks back to a disk
15     Exclude a disk from hot-relocation use
16     Make a disk available for hot-relocation use
17     Prevent multipathing/Suppress devices from VxVM's view
18     Allow multipathing/Unsuppress devices from VxVM's view
19     List currently suppressed/non-multipathed devices
20     Change the disk naming scheme
21     Get the newly connected/zoned disks in VxVM view
22     Change/Display the default disk layouts
23     Mark a disk as allocator-reserved for a disk group
24     Turn off the allocator-reserved flag on a disk
list   List disk information


?      Display help about menu
??     Display help about the menuing system
q      Exit from menus

Select an operation to perform: 5


Now select the removed disk and the new device to use as its replacement:
 
Replace a failed or removed disk
Menu: VolumeManager/Disk/ReplaceDisk
 Use this menu operation to specify a replacement disk for a disk
 that you removed with the "Remove a disk for replacement" menu
 operation, or that failed during use.  You will be prompted for
 a disk name to replace and a disk device to use as a replacement.
 You can choose an uninitialized disk, in which case the disk will
 be initialized, or you can choose a disk that you have already
 initialized using the Add or initialize a disk menu operation.

Select a removed or failed disk [<disk>,list,q,?] list

Disk group: testdg

DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE

dm disk01       -            -        -        -        REMOVED


Select a removed or failed disk [<disk>,list,q,?] disk01
 The following devices are available as possible replacements after being
 initialized (or reinitiliazed):

       c0t0d0 c1t15d0

 You can choose one of these devices to replace disk01.
 Choose "none" to abort the replacement of disk01.

Choose a device, or select "none"
[<device>,none,q,?] (default: c0t0d0) c1t15d0
 VxVM  INFO V-5-2-378
The requested operation is to initialize disk device c1t15d0 and
 to then use that device to replace the removed or failed disk
 disk01 in disk group testdg.

Continue with operation? [y,n,q,?] (default: y) y

Use FMR for plex resync? [y,n,q,?] (default: n) n
 VxVM  INFO V-5-2-282
Replacement of disk disk01 in group testdg with disk device
 c1t15d0 completed successfully.

Replace another disk? [y,n,q,?] (default: n) n


vxdiskadm automatically starts vxrecover in the background. Once the recovery is done, we can remove the old disk entry:
 
# vxdisk rm c1t13d0s2

The configuration should then be back to normal:

c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:cdsdisk    disk01       testdg       online


dg testdg       default      default  24000    1223026544.30.jerome

dm disk01       c1t15d0s2    auto     65536    35774960 -
dm disk02       c1t14d0s2    auto     65536    35774960 -

v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         c1t15d0  ENA
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA


You can do the same from the command line. Instead of using option 4 in vxdiskadm, you use this command to remove the disk:

# vxdg -g testdg -k rmdisk disk01
 

Then run the following to make the OS scan for the new disk, add it to Volume Manager, and initialize it for use:
 
# vxdiskconfig
# vxdisksetup -i c1t15d0 format=cdsdisk


Then run the following to add the new disk under the old disk media name, recover the mirror, and remove the old disk entry from Volume Manager:
 
# vxdg -g testdg -k adddisk disk01=c1t15d0s2
# vxrecover -b testvol01
# vxdisk rm c1t13d0s2

Once the recovery is done, we should see the following:


c1t14d0s2    auto:cdsdisk    disk02       testdg       online
c1t15d0s2    auto:cdsdisk    disk01       testdg       online


dg testdg       default      default  24000    1223026544.30.jerome

dm disk01       c1t15d0s2    auto     65536    35774960 -
dm disk02       c1t14d0s2    auto     65536    35774960 -

v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk01-01    testvol01-01 disk01   0        2097152  0         c1t15d0  ENA
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
sd disk02-02    testvol01-02 disk02   2097152  2097152  0         c1t14d0  ENA
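
To confirm that the mirror is healthy again, the plex lines of such a listing can be checked for any state other than ENABLED and ACTIVE. The sketch below assumes the vxprint -ht column layout shown in this article and reads saved output from standard input; the function name unhealthy_plexes is hypothetical.

```shell
#!/bin/sh
# unhealthy_plexes: hypothetical helper, not part of VxVM.
# Reads `vxprint -ht` style output on stdin and prints
# "<plex> <volume> <kstate> <state>" for every plex that is not
# ENABLED and ACTIVE. No output means all plexes look healthy.
unhealthy_plexes() {
    awk '$1 == "pl" && ($4 != "ENABLED" || $5 != "ACTIVE") { print $2, $3, $4, $5 }'
}

# Example with the plex lines from the failed state shown earlier:
unhealthy_plexes <<'EOF'
v  testvol01    -            ENABLED  ACTIVE   2097152  SELECT    -        fsgen
pl testvol01-01 testvol01    DISABLED NODEVICE 2097152  CONCAT    -        RW
pl testvol01-02 testvol01    ENABLED  ACTIVE   2097152  CONCAT    -        RW
EOF
```

Run against the final listing above, the filter prints nothing, because both plexes are ENABLED and ACTIVE.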
 

 
