Graceful shutdown of an SFRAC 5.1SP1 node triggers a panic on the other node.

Article: 100008644
Last Published: 2012-12-20
Product(s): InfoScale & Storage Foundation

Problem

Graceful shutdown of an SFRAC 5.1SP1 node triggers a panic on the other node.

Error Message

From the messages log of node 0 (top1), which is being shut down for maintenance, we see that CVM did not go down cleanly because the application had not been shut down properly.

MM DD HH:MM:SS top1 snmpd[5609]: Received TERM or STOP signal... shutting down...
MM DD HH:MM:SS top1 xinetd[6111]: Exiting...
MM DD HH:MM:SS top1 kernel: GAB INFO V-15-1-20032 Port d closed
MM DD HH:MM:SS top1 kernel: GAB ERROR V-15-1-20015 unconfigure failed: clients still registered <<<<<===
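
The error above shows that GAB could not unconfigure because clients were still registered on its ports. Before stopping the stack, the remaining port memberships can be listed with gabconfig; a quick check, run as root (the port-to-client mapping in the comment follows the vxfen message quoted below):

# list current GAB port memberships; port h is VCS (had),
# ports u/v/w are CVM, port b is I/O fencing
gabconfig -a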


From the vxfen.log, we see vxfenconfig -U failing because VCS is still running.

MM DD HH:MM:SS  vxfenconfig -U returned 1
MM DD HH:MM:SS vxfenconfig -U output is VXFEN vxfenconfig ERROR V-11-2-1023 Unable to unconfigure fencing since clients still active
VXFEN vxfenconfig ERROR V-11-2-1060 Please retry after shutting down VCS (GAB port h) <<<<<===
and/or CVM (GAB ports u/v/w).
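
The message spells out the required order: VCS (port h) and CVM (ports u/v/w) must be down before fencing can be unconfigured. A minimal manual sequence on the node being taken out of the cluster, assuming root access and a standard install:

# stop VCS locally, taking its service groups (including CVM/CFS) offline
hastop -local
# confirm that ports h and u/v/w are no longer listed
gabconfig -a
# only then unconfigure fencing
vxfenconfig -U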


From the vxfendebug log captured after the vxfen stop failed, note that node 0 races for the coordinator disks because the force unload triggered it:

lbolt: 5367264401 vxfen_io.c ln 2574 VXFEN: vxfen_set_roles: RACER NODE is: 0
lbolt: 5367264401 vxfen_io.c ln 2575 VXFEN: Change fence state node: 0 from: STADIUM_OPEN to: RACER
lbolt: 5367264401 vxfen_io.c ln 2621 VXFEN: vxfen_set_roles: end
lbolt: 5367264401 vxfen_fence.c ln 154 VXFEN: vxfen_grab_coord_pt - begin
lbolt: 5367264401 vxfen_scsi3.c ln 198 VXFEN: vxfen_grab_coord_disks: - begin
lbolt: 5367264401 vxfen_scsi3.c ln 210 VXFEN: vxfen_grab_coord_disks: lowest_node: -1
lbolt: 5367264401 vxfen_scsi3.c ln 465 vxfen_grab_coord_disks: ejecting other node: 1
lbolt: 5367264401 vxfen_scsi3_device.c ln 469 VXFEN: vxfen_preempt_abort: begin lowest node 1
lbolt: 5367264401 vxfen_scsi3_device.c ln 483 VXFEN: node_num: 0
lbolt: 5367264401 vxfen_scsi3_device.c ln 494 VXFEN: resv_key: V victim: V <<<<==
These are the PGR keys.
...
...

Node 0 was able to clear the PGR keys for node 1.
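
The registration (PGR) keys on the coordinator disks can be inspected directly with vxfenadm; a quick way to see which nodes still hold keys, assuming the default coordinator disk list in /etc/vxfentab:

# read the SCSI-3 registration keys from every coordinator disk
vxfenadm -s all -f /etc/vxfentab
# show the fencing mode and membership as the vxfen driver sees it
vxfenadm -d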


Cause

Because node 1 (top2) sees GAB ports u/v/w/h still active when node 0 goes down, fencing is triggered on node 1 and it races for the coordinator disks. However, it cannot find its own PGR keys, because node 0 had already cleared them before shutting down. Node 1 therefore concludes that it lost the race and panics.

MM DD HH:MM:SS top2 Had[10485]: VCS INFO V-16-1-10077 Received new cluster membership
MM DD HH:MM:SS top2 kernel: sd 1:0:0:88: reservation conflict
MM DD HH:MM:SS top2 kernel: VXFEN WARNING V-11-1-65 Could not eject node 0 from disk
MM DD HH:MM:SS top2 kernel: with serial number 60060480000190103338533031374632 since
MM DD HH:MM:SS top2 kernel: keys of node 1 are not registered with it
MM DD HH:MM:SS top2 kernel: sd 1:0:0:89: reservation conflict

..
..

From the core, we see the following in vxfendebug:
lbolt: 4577747279 vxfen_linux.c ln 557 VXFEN: vxfen_plat_pgr_in: end. Status: 0
lbolt: 4577747279 vxfen_scsi3_device.c ln 114 VXFEN: vxfen_readkeys: end ret: 0
lbolt: 4577747279 vxfen_scsi3_device.c ln 170 VXFEN: vxfen_checkreg: num_keys: 2
<<<== two PGR keys found
lbolt: 4577747279 vxfen_scsi3_device.c ln 208 VXFEN: vxfen_checkreg: end
lbolt: 4577747279 vxfen_scsi3.c ln 343 vxfen_grab_coord_disks:READ KEYS shows LOCAL NODE no longer registered.
<<<<=== but not for this node
lbolt: 4577747289 vxfen_scsi3.c ln 69 vxfen_skip_multipaths: skipping the device: device_num: 8388736 serial_num: 60060480000190103338533031374634 npaths: 1
lbolt: 4577747289 vxfen_scsi3.c ln 432 Total coord disks: 3, grabbed disks: 0
lbolt: 4577747289 vxfen_scsi3.c ln 443: vxfen_grab_coord_disks: node 1 lost the race and committing suicide <<<<===
lbolt: 4577747289 vxfen_io.c ln 1627 VXFEN: vxfen_bcast_msg: begin
lbolt: 4577747289 vxfen_io.c ln 1643 VXFEN: vxfen_bcast_msg: end
lbolt: 4577747289 vxfen_fence.c ln 307 VXFEN: vxfen_racer_lost: Sent VXFEN_MSG_LOST_RACE. Shall panic when broadcast completes, or after 6 seconds
lbolt: 45777472

The key point is that the cluster was not shut down gracefully, which triggered fencing on both nodes and made them race for the coordinator disks. VCS fencing in 5.1SP1 did not have checks to stop the node that is going down from winning that race. Those checks are included in the 6.0 and 5.1SP1RP3 releases, so a node leaving the cluster (gracefully or ungracefully) no longer wins the race for the coordinator disks.
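
To check whether a node already carries these fencing checks, the installed fencing package level can be queried; a simple check, assuming an RPM-based Linux host:

# report the installed I/O fencing package version
# (the checks are present in 5.1SP1RP3 and 6.0)
rpm -q VRTSvxfen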

Solution

This race condition can be prevented by shutting down the application before unmounting the cluster filesystems.

Veritas recommends configuring the application as an Application resource under VCS, so that the application is shut down properly and all of its files are closed before the CFS/CVM shutdown is initiated.
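
As an illustration, the sketch below adds such an Application resource and links it above a CFS mount resource, so that VCS stops the application (closing its files) before unmounting the filesystem. The group, resource, and script names (appsg, app1, cfsmount1, /opt/app/bin/*) are placeholders, not taken from this article:

haconf -makerw
# define the application with its start/stop scripts and a process to monitor
hares -add app1 Application appsg
hares -modify app1 StartProgram "/opt/app/bin/start"
hares -modify app1 StopProgram "/opt/app/bin/stop"
hares -modify app1 MonitorProcesses "/opt/app/bin/appd"
hares -modify app1 Enabled 1
# make app1 a parent of the CFS mount so it is taken offline first
hares -link app1 cfsmount1
haconf -dump -makero

With this dependency in place, a shutdown takes app1 offline before cfsmount1, so no application files are open when CFS/CVM go down.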

VCS 6.0 has been improved to handle this situation more efficiently; this is documented on page 31 of the Veritas™ Cluster Server Release Notes: Linux.


References

Etrack: 2273238, Etrack: 2531558
