Description
----------------------------------------------------------------------------------------------------------------------------
# Split Brain Condition
----------------------------------------------------------------------------------------------------------------------------
A network partition can lead to a split-brain condition, an issue faced by all cluster implementations. This problem occurs when the HA software cannot distinguish between a system failure and an interconnect failure, because the symptoms look identical.
In this case, some nodes determine that their peer has departed and attempt to take corrective action. This can result in data corruption if nodes are able to take control of storage in an uncoordinated manner.
Other scenarios can cause this situation. If a system is so busy that it appears to be hung and is declared failed, its services can be started on another system. This can also happen on systems whose hardware supports a break and resume function. If the system is dropped to command-prompt level with a break and subsequently resumed, the system can appear to have failed. The cluster is re-formed, and then the system recovers and begins writing to shared storage again.
----------------------------------------------------------------------------------------------------------------------------
# Data Protection Requirements
----------------------------------------------------------------------------------------------------------------------------
The key to protecting data in a shared storage cluster environment is to guarantee that there is always a single consistent view of cluster membership. In other words, when one or more systems stop sending heartbeats, the HA software must determine which systems can continue to participate in the cluster membership and how to handle the other systems.
----------------------------------------------------------------------------------------------------------------------------
# I/O Fencing Concepts and Components
----------------------------------------------------------------------------------------------------------------------------
To guarantee data integrity in the event of communication failures among cluster members, full protection requires two things: determining which systems should be able to remain in the cluster (membership), and guaranteed blocking of storage access from any system that is not an acknowledged member of the cluster.
VCS 4.x and later use a mechanism called I/O fencing to guarantee data protection. I/O fencing uses SCSI-3 persistent reservations (PR) to fence off data drives and prevent split-brain conditions.
SCSI-3 PR (sometimes just PR) is an enhancement to the SCSI specification designed to resolve the problems of using older SCSI reservations in a modern clustered SAN environment. SCSI-3 PR supports multiple nodes accessing a device while at the same time blocking access to other nodes. Persistent reservations survive SCSI bus resets, and PR also supports multiple paths from a host to a disk.
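These PR semantics can be illustrated with a toy model. This is not the actual SCSI command set; the class and method names below are invented for the sketch, which only shows the behavior described above: registered nodes share access, unregistered nodes are blocked, and a key can be preempted.

```python
# Toy model of SCSI-3 PR semantics: multiple registrants share a device,
# everyone else is fenced off. Names are illustrative, not a real SCSI API.

class PrDisk:
    def __init__(self):
        self.keys = set()                # registered reservation keys

    def register(self, key):
        self.keys.add(key)               # persistent: survives bus resets

    def preempt(self, victim_key):
        self.keys.discard(victim_key)    # eject another node's key

    def write(self, key, data):
        if key not in self.keys:         # non-registrants are blocked
            raise PermissionError("reservation conflict")
        return f"wrote {data!r}"

disk = PrDisk()
disk.register("A-------")                # node 0 registers
disk.register("B-------")                # node 1 registers
print(disk.write("A-------", "block"))   # both registrants may write
print(disk.write("B-------", "block"))
disk.preempt("B-------")                 # node 1's key is ejected
```

After the preempt, any write attempt with the ejected key fails, which is the fencing behavior the real protocol provides at the device level.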
----------------------------------------------------------------------------------------------------------------------------
# I/O Fencing Components
----------------------------------------------------------------------------------------------------------------------------
VCS uses fencing to allow write access to members of the active cluster and to block access to nonmembers.
I/O fencing in VCS consists of several components. The physical components are coordinator disks and data disks. Each has a unique purpose and uses different physical disk devices.
*** Coordinator Disks
The coordinator disks act as a global lock mechanism, determining which nodes are currently registered in the cluster. This registration is represented by a unique key associated with each node that is written to the coordinator disks. In order for a node to access a data disk, that node must have a key registered on coordinator disks.
When system or interconnect failures occur, the coordinator disks ensure that only one cluster survives.
*** Data Disks
Data disks are standard disk devices used for shared data storage. These can be physical disks or RAID logical units (LUNs). These disks must support SCSI-3 PR. Data disks are incorporated into standard VM disk groups. In operation, Volume Manager is responsible for fencing data disks on a disk group basis.
Disks added to a disk group are automatically fenced, as are newly discovered paths to a device.
----------------------------------------------------------------------------------------------------------------------------
# Registration with Coordinator Disks
----------------------------------------------------------------------------------------------------------------------------
After GAB has started and port a membership is established, each system registers with the coordinator disks. HAD cannot start until registration is complete.
Registration keys are based on the LLT node number. Each key is eight characters long; the left-most character is the ASCII character corresponding to the LLT node ID.
The registration key is NOT actually written to disk, but is stored in the drive electronics or RAID controller.
All systems are aware of the keys of all other systems, forming a membership of registered systems. This fencing membership, maintained by way of GAB port b, is the basis for determining cluster membership and fencing data drives.
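The key layout described above can be sketched as follows. The exact mapping ('A' for node 0, 'B' for node 1) and the '-' padding are assumptions for illustration; verify the real key format with vxfenadm on a live cluster.

```python
# Sketch of building an 8-character fencing registration key from the
# LLT node ID. The 'A' + node_id mapping and '-' padding are assumed
# for illustration only.

def fencing_key(llt_node_id: int) -> str:
    node_char = chr(ord("A") + llt_node_id)  # left-most character from node ID
    return node_char + "-------"             # pad out to eight characters

print(fencing_key(0))   # key for LLT node 0
print(fencing_key(1))   # key for LLT node 1
```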
----------------------------------------------------------------------------------------------------------------------------
# Note: SCSI-3 Persistent Reservation
----------------------------------------------------------------------------------------------------------------------------
When a system registers a PR key, the SCSI-3 protocol carries the key value to the array. Physically, these requests are received by the drive electronics or RAID controller.
The drive electronics or RAID controller then validates the request. If there is no problem, it saves the reservation key, along with other information such as configuration data, the LUN map table, and the reservation key table, in its internal storage.
The device then returns the result of the key registration. Later, when a SCSI-3 inquiry asks for the reservation keys, the device checks this table and replies with the appropriate information.
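The controller-side flow above can be modeled as a small key table. The method names loosely mirror the SCSI-3 PERSISTENT RESERVE OUT (REGISTER) and PERSISTENT RESERVE IN (READ KEYS) service actions, but the data structures and return values are invented for this sketch.

```python
# Toy RAID-controller view of PR key registration: REGISTER validates and
# stores a key in a persistent table, READ KEYS reports the table back.
# Structures and status strings are illustrative only.

class Controller:
    def __init__(self):
        # initiator -> reservation key; nonvolatile in real hardware,
        # stored alongside configuration and LUN map data
        self.key_table = {}

    def persistent_reserve_out_register(self, initiator, key):
        if len(key) != 8:                 # reject malformed requests
            return "CHECK CONDITION"
        self.key_table[initiator] = key   # persist the key
        return "GOOD"                     # result returned to the host

    def persistent_reserve_in_read_keys(self):
        return sorted(self.key_table.values())

ctl = Controller()
print(ctl.persistent_reserve_out_register("node0", "A-------"))
print(ctl.persistent_reserve_out_register("node1", "B-------"))
print(ctl.persistent_reserve_in_read_keys())
```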
----------------------------------------------------------------------------------------------------------------------------
# I/O Fencing with Multiple Nodes
----------------------------------------------------------------------------------------------------------------------------
The lowest available LLT node number in each mini-cluster is selected as the 'racer'; the other nodes are 'spectators'. The spectators wait for the racer to finish the race.
If the racer wins, all nodes in its mini-cluster, including the racer, continue to provide service. If not, all nodes in the losing mini-cluster panic.
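The race outcome logic above can be sketched in a few lines. Three coordinator disks with a majority rule is the typical vxfen arrangement, but the function names and the hard-coded scenario below are invented for illustration.

```python
# Toy fencing race: each mini-cluster's lowest LLT node is the racer.
# The racer tries to eject the rival keys from the coordinator disks;
# winning a majority (2 of 3) keeps its mini-cluster alive, and losing
# means every node in that mini-cluster panics.

def pick_racer(mini_cluster):
    return min(mini_cluster)                  # lowest available LLT node number

def survives(coordinator_wins, total_disks=3):
    return coordinator_wins > total_disks // 2   # majority rule

cluster_a = {0, 2}                            # one side of the partition
cluster_b = {1, 3}                            # the other side
print("racer A:", pick_racer(cluster_a))
print("racer B:", pick_racer(cluster_b))

# Suppose racer A ejects B's keys from 2 of the 3 coordinator disks first:
print("A survives:", survives(coordinator_wins=2))   # mini-cluster A continues
print("B survives:", survives(coordinator_wins=1))   # mini-cluster B panics
```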