My Experience with passthru state of Replicated Volume Group/RVG

Zahid_Haseeb · ‎12-21-2012

Recently one of my clients found that the RVG on the Primary site had gone into passthru state. In short: "A SAN disk that was part of an SRL volume started logging I/O errors on the host and VVR went into passthru mode."

More Details of the Problem

As far as FAILING disks are concerned, sometimes the errors are only transient and could come from anywhere. Clearing the FAILING flag and watching whether it returns is the easiest way to tell the difference: if there really is a problem with the disk, the FAILING flag will come back. Regarding the SAN, if the problem is persistent, it needs to be followed up with the SAN vendor. It doesn't matter how reliable you think your SAN is: if a host sees an I/O error, you have to investigate the root cause with the SAN vendor (including the connectivity channels between the SAN and the host). This is a data integrity issue that should be taken seriously.

My Environment

Operating System = RHEL 5.3

SFHA/DR version = 5.0 MP3 RP3

Cluster Nodes = Two at the Primary Site and one at the Secondary/DR Site (it's a Global Cluster)

Service Groups = Two: one Application Service Group and one Replication Service Group

Disks = Two with 100GB each

Volumes = One Data Volume (149GB) and one SRL Volume (49GB)

Problem

The RVG on the Primary Site is in passthru mode. See the repstatus output below for reference:

# vradmin -g DISKGROUP repstatus RVG

Replicated Data Set: RVG

Primary:

Host name: 192.168.0.1

RVG name: RVG

DG name: DISKGROUP

RVG state: enabled for I/O (passthru)

Data volumes: 1

VSets: 0

SRL name: VOL-SRL

SRL size: XXX.00 G

Total secondaries: 1

Secondary:

Host name: 192.168.0.2

RVG name: RVG

DG name: DISKGROUP

Data status: consistent, stale

Replication status: stopped (primary detached)

Current mode: N/A

Logging to: N/A

Timestamp Information: N/A

Activity Plan

Freeze the Replication Service Group
Check repstatus to verify which RVG is in passthru state
Remove the FAILING flag from the disk
Detach the RLINK if it is attached
Disassociate the SRL Volume from the RVG which is in passthru state, and verify
Associate the SRL Volume back to the same RVG, and verify
Stop and start replication, then verify the replication status
Unfreeze the Service Group
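
As a rough aid, the command-line part of the plan above can be written down as a dry run before touching the cluster. The sketch below only prints the commands quoted in this article, in order; it does not execute anything. The function name `print_passthru_plan` and its argument order are my own for illustration, and the values passed in are placeholders you must replace with your own object names.

```shell
# Dry run: print (do not execute) the CLI commands used in this article.
# Arguments: disk group, RVG, SRL volume, RLINK, disk media name.
# Every value is a placeholder supplied by the caller.
print_passthru_plan() {
    local dg=$1 rvg=$2 srl=$3 rlink=$4 disk=$5
    echo "vradmin -g $dg repstatus $rvg"          # check for passthru
    echo "vxedit -g $dg set failing=off $disk"    # clear the FAILING flag
    echo "vxrlink -g $dg det $rlink"              # detach the RLINK if attached
    echo "vxvol -g $dg dis $srl"                  # disassociate the SRL volume
    echo "vxvol -g $dg aslog $rvg $srl"           # associate the SRL volume again
    echo "vradmin -g $dg -f stoprep $rvg"         # stop replication
    echo "vradmin -g $dg -a startrep $rvg"        # start replication
    echo "vradmin -g $dg repstatus $rvg"          # verify the result
}
```

For example, `print_passthru_plan DISKGROUP RVG VOL-SRL RlinkName DISKGROUP04` prints eight lines that can be reviewed, or pasted one at a time, during the activity. The freeze and unfreeze steps are left out because this article performs them from the Java Console.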

1.) Freeze the Replication Service Group (optional)

Open the Java Console, select the Replication Service Group, right-click it, click Freeze, and choose temporary or persistent as per your need.

Save the configuration (if the freeze was persistent)

2.) Check repstatus to verify which RVG state is passthru

Run the command below to verify which Replicated Volume Group/RVG is in passthru state.

# vradmin -g DiskGroup repstatus RVGname

Example result of the above command

# vradmin -g DISKGROUP repstatus RVG

Replicated Data Set: RVG

Primary:

Host name: 192.168.0.1

RVG name: RVG

DG name: DISKGROUP

RVG state: enabled for I/O (passthru)

Data volumes: 1

VSets: 0

SRL name: VOL-SRL

SRL size: XXX.00 G

Total secondaries: 1

Secondary:

Host name: 192.168.0.2

RVG name: RVG

DG name: DISKGROUP

Data status: consistent, stale

Replication status: stopped (primary detached)

Current mode: N/A

Logging to: N/A

Timestamp Information: N/A
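
If you want to check the state from a script rather than by eye, the repstatus text can be filtered. This is a minimal sketch that assumes the output layout shown above (a line beginning with "RVG state:"); the helper names `rvg_state` and `is_passthru` are hypothetical, not Veritas commands.

```shell
# Read `vradmin repstatus` text on stdin and print the first RVG state value.
rvg_state() {
    awk '/^[ \t]*RVG state:/ {
        sub(/^[ \t]*RVG state:[ \t]*/, "")   # strip the label, keep the value
        print
        exit                                 # only the first match is reported
    }'
}

# Exit 0 when the RVG state read from stdin mentions passthru, else 1.
is_passthru() {
    case "$(rvg_state)" in
        *passthru*) return 0 ;;
        *)          return 1 ;;
    esac
}
```

A possible use is `vradmin -g DiskGroup repstatus RVGname | is_passthru && echo "RVG is in passthru"`.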

3.) Remove the failing flag from disk

Run the vxprint command to find which disk carries the FAILING flag. See the output below as a reference:

# vxprint

Disk group: DISKGROUP

TY NAME ASSOC KSTATE LENGTH PLOFFS STATE TUTIL0 PUTIL0

dg DISKGROUP DISKGROUP - - - - - -

dm DISKGROUP01 ibm_ds8x000_2015 - 209616640 - - - -

dm DISKGROUP02 ibm_ds8x000_2016 - 209616640 - - - -

dm DISKGROUP03 ibm_ds8x000_5555 - 209616640 - - - -

dm DISKGROUP04 ibm_ds8x000_6666 - 209616640 - FAILING - -

OR

You can also see the failing flag on the disk with the command below:

# vxdisk list

Now clear the failing flag with the command below:

# vxedit -g DiskGroup set failing=off DiskName
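
With many disks in the group, the FAILING rows can be picked out of the vxprint output automatically. The sketch below assumes the default column layout shown above (TY in column 1, NAME in column 2, STATE in column 7); `failing_disks` is a hypothetical helper name.

```shell
# Read default `vxprint` output on stdin and print the name of every
# disk media (dm) record whose STATE column is FAILING.
failing_disks() {
    awk '$1 == "dm" && $7 == "FAILING" { print $2 }'
}

# Possible use (placeholders, review before running):
#   vxprint -g DiskGroup | failing_disks
# and then, for each name printed:
#   vxedit -g DiskGroup set failing=off DiskName
```

Run against the sample output above, this would print DISKGROUP04.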

4.) Detach the RLINK if it is attached (if required)

From the command prompt, run the command below if you need to detach the RLINK.

# vxrlink -g DiskGroup det RlinkName

5.) Disassociate the SRL Volume from the RVG which is in passthru state, and verify

From the command prompt, run the command below to disassociate the SRL Volume from the Replicated Volume Group/RVG

# vxvol -g DiskGroup dis SrlVolumeName

Example result before running the above command

# vxprint

Disk group: sourcesafe

TY NAME ASSOC KSTATE LENGTH PLOFFS STATE TUTIL0 PUTIL0

dg sourcesafe sourcesafe - - - - - -

dm sourcesafe01 sdc - 670990080 - - - -

dm sourcesafe02 sdd - 20872960 - - - -

rv home-rvg - ENABLED - - ACTIVE - -

rl rlk_192.168.1.102_home-rvg home-rvg CONNECT - - ACTIVE - -

v home home-rvg ENABLED 624951296 - ACTIVE - -

pl home-01 home ENABLED 624951296 - ACTIVE - -

sd sourcesafe01-01 home-01 ENABLED 624951296 0 - - -

pl home-02 home ENABLED LOGONLY - ACTIVE - -

sd sourcesafe02-01 home-02 ENABLED 512 LOG - - -

pl home-03 home ENABLED LOGONLY - ACTIVE - -

sd sourcesafe01-02 home-03 ENABLED 512 LOG - - -

v home-srl home-rvg ENABLED 20971520 SRL ACTIVE - -

pl home-srl-01 home-srl ENABLED 20971520 - ACTIVE - -

sd sourcesafe01-03 home-srl-01 ENABLED 20971520 0 - - -

Example result after running the above command

# vxprint

Disk group: sourcesafe

TY NAME ASSOC KSTATE LENGTH PLOFFS STATE TUTIL0 PUTIL0

dg sourcesafe sourcesafe - - - - - -

dm sourcesafe01 sdc - 670990080 - - - -

dm sourcesafe02 sdd - 20872960 - - - -

v home-srl home-rvg ENABLED 20971520 SRL ACTIVE - -

pl home-srl-01 home-srl ENABLED 20971520 - ACTIVE - -

sd sourcesafe01-03 home-srl-01 ENABLED 20971520 0 - - -

rv home-rvg - ENABLED - - ACTIVE - -

rl rlk_192.168.1.102_home-rvg home-rvg CONNECT - - ACTIVE - -

v home home-rvg ENABLED 624951296 - ACTIVE - -

pl home-01 home ENABLED 624951296 - ACTIVE - -

sd sourcesafe01-01 home-01 ENABLED 624951296 0 - - -

pl home-02 home ENABLED LOGONLY - ACTIVE - -

sd sourcesafe02-01 home-02 ENABLED 512 LOG - - -

pl home-03 home ENABLED LOGONLY - ACTIVE - -

sd sourcesafe01-02 home-03 ENABLED 512 LOG - - -

The above result shows that the volume is disassociated from the Replicated Volume Group/RVG

6.) Associate the SRL Volume back to the same RVG, and verify

With the command below you can associate the SRL Volume back to the Replicated Volume Group/RVG

# vxvol -g DiskGroup aslog RVGname SrlVolumeName

After running the above command, vxprint shows the same layout it showed before the SRL Volume was disassociated.

At this point you need to start the Replicated Volume Group/RVG.

7.) Stop and start replication, then verify the replication status

Now you need to re-establish replication between the sites

Stop replication: # vradmin -g DiskGroup -f stoprep RVGname

Start replication: # vradmin -g DiskGroup -a startrep RVGname

Replication status: # vradmin -g DiskGroup repstatus RVGname

The replication status output should no longer contain the word passthru. Earlier the RVG state looked like this:

RVG state: enabled for I/O (passthru)

8.) Mark the resources critical again

If you marked any resource non-critical before starting the activity, mark it critical again.

Save the configuration

9.) Unfreeze the Service Group

Unfreeze the Service Group which you froze in step 1

Save the configuration (if the freeze was persistent)

This completes the activity of fixing the passthru state of the Replicated Volume Group/RVG

Please Note: If the disassociate/associate procedure (steps 5-7) does not work for you, meaning that after some time the RVG goes into passthru mode again, make sure the disk on which the SRL Volume was created is not faulty. If the disk is found faulty, replace it using options 4 and 5 of vxdiskadm. The option numbers may vary between VxVM versions, so the options are shown below as a reference:

4 Remove a disk for replacement

5 Replace a failed or removed disk

Finally, remove the faulty disk(s) from the Disk Group using option 3 of vxdiskadm. Again, the option number may vary between VxVM versions, so the option is shown below as a reference:

3 Remove a disk

In my experience, before you replace the disk(s) as described above, you must verify with your SAN team that the disk(s) really have a problem. In my case the SAN team did not accept that the SAN disk had any problem, because nothing showed up on the SAN console; from the SAN team's perspective the SAN LUNs were 100% fine. I could only see I/O errors in the system logs, and I could not tell whether the problem lay with the Symantec product, the OS, or something else. See the I/O errors below, taken from the OS logs, as a reference:

Nov 21 04:39:37 NODE-2 kernel: sd 3:0:0:0: SCSI error: return code = 0x00020000
Nov 21 04:39:37 NODE-2 kernel: end_request: I/O error, dev sdb, sector 137918720
Nov 21 04:39:37 NODE-2 kernel: VxVM vxdmp V-5-0-112 disabled path 8/0x10 belonging to the dmpnode 201/0x30
Nov 21 04:39:37 NODE-2 kernel: sd 3:0:0:0: SCSI error: return code = 0x00020000
Nov 21 04:39:37 NODE-2 kernel: end_request: I/O error, dev sdb, sector 137918848
Nov 21 04:39:37 NODE-2 kernel:
Nov 21 04:39:37 NODE-2 kernel: VxVM vxdmp V-5-0-111 disabled dmpnode 201/0x30
Nov 21 04:39:37 NODE-2 kernel:
Nov 21 04:39:37 NODE-2 kernel: VxVM VVR vxio V-5-0-0 Subdisk XXDGXX03-01 block 137852928: Uncorrectable write error
Nov 21 04:39:37 NODE-2 kernel: VxVM VVR vxio V-5-0-0 Subdisk XXDGXX03-01 block 137853056: Uncorrectable write error
Nov 21 04:39:37 NODE-2 kernel: sd 3:0:0:0: SCSI error: return code = 0x00020000
Nov 21 04:39:37 NODE-2 kernel: end_request: I/O error, dev sdb, sector 137959192
Nov 21 04:39:37 NODE-2 kernel: VxVM vxio V-5-3-0 voldmp_errbuf_sio_start: Failed to flush the error buffer ffff8107df5e6600 on device 0xc900030 to DMP
Nov 21 04:39:37 NODE-2 kernel: VxVM VVR vxio V-5-0-0 Subdisk XXDGXX03-01 block 137893400: Uncorrectable write error
Nov 21 04:39:42 NODE-2 kernel: Buffer I/O error on device VxVM65533, logical block 17655604
Nov 21 04:39:42 NODE-2 kernel: lost page write due to I/O error on VxVM65533
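
When you take evidence like this to the SAN team, a per-device count of the errors can help. The sketch below assumes the usual one-entry-per-line syslog layout and the "end_request: I/O error, dev X, sector N" wording seen in the log above; `io_errors_by_device` is a hypothetical helper name.

```shell
# Read kernel log lines on stdin and print "<device> <count>" for every
# block device that reported an end_request I/O error.
io_errors_by_device() {
    sed -n 's/.*end_request: I\/O error, dev \([^,]*\),.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Possible use (the log path is an assumption for a RHEL host):
#   io_errors_by_device < /var/log/messages
```

Lines that do not contain an end_request I/O error are ignored, so the whole log file can be passed in unfiltered.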

TIP: The disk itself is not necessarily at fault. Cross-check the fiber cables, the HBA, the switch, and the rest of the connectivity between the SAN and the machine to which the disk is mapped; a problem in any of them can prevent I/O on the SRL Volume/SAN LUNs/disks. (This was not so in my case.)

Strong Recommendation: Before you run vxdiskadm and use options 4 and 5, make sure that no healthy DATA volume has a subdisk on the disk which you are going to replace.
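
One way to check this from vxprint output is sketched below. It relies on an assumption: that subdisks carry the default VxVM names of the form DiskName-NN (as in sourcesafe01-01 in the examples above). If your subdisks were renamed, this will miss them. `volumes_on_disk` is a hypothetical helper name.

```shell
# Read default `vxprint` output on stdin and list the volumes that have
# at least one subdisk on the given disk media name.
# Assumption: subdisks use the default <DiskName>-NN naming.
volumes_on_disk() {
    awk -v disk="$1" '
        $1 == "pl" { vol_of[$2] = $3 }              # remember plex -> volume
        $1 == "sd" && index($2, disk "-") == 1 {    # subdisk named <disk>-NN
            if ($3 in vol_of) print vol_of[$3]      # report the owning volume
        }
    ' | sort -u
}

# Possible use (placeholders):
#   vxprint -g DiskGroup | volumes_on_disk DiskName
```

If anything other than the SRL Volume is listed for the disk you plan to replace, a data volume also lives on that disk and needs to be handled first.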

TIP: When you build a new cluster, it is good practice to keep the SRL Volume on a disk which is not used by any Data Volume. (In my case the client gave me a SAN disk chunk which was not large enough for the volume I needed to create, so I had to use part of the disk which holds the SRL Volume.)

I hope this article will help people who run into the situation I faced. This is my own experience, described in as much detail as I can. If anyone finds a mistake, kindly share it in the comments.
