What is the general flow of an I/O between a volume and a disk (including DMP)?

Article: 100016357
Last Published: 2022-01-20
Product(s): InfoScale & Storage Foundation

Problem

What is the general flow of an I/O between a volume and a disk (including DMP)?

Solution

The general flow of an I/O through a multi-pathed stack can be summarized in the diagram below:

                 Application
                      |
                 FS (e.g. VxFS or UFS)
                      |
                 vxio (VxVM)
                      |
                 vxdmp (DMP)
                      |
              ----------------
              |              |
              sd             sd     (OS disk driver)
              |              |
              HBA            HBA    (Host Bus Adapter)
              |              |
              SAN            SAN
              |              |
              ----------------
                      |
                  Disk/LUN
 
Applications that run on top of a file system typically send I/O requests to files, while databases often issue them directly to raw devices. In the former case, the I/O is directed, based upon the path name of the file, to the relevant file system. The file system intercepts the request and channels its own I/O to the volume device it resides on: /dev/vx/[r]dsk/diskgroup_name/volume_name. As I/Os enter this device they are received by the VxVM kernel driver, vxio, which maintains the volume-plex-subdisk-disk configuration in the kernel. For a given I/O, vxio ascertains from this in-memory configuration which disk(s) must service the I/O and sends the I/O (buffer) to the relevant DMP metanode for each disk device.
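
As a rough illustration of this mapping step, the following sketch (plain Python with hypothetical names such as Subdisk, Plex and vxio_dispatch; it is not real VxVM code) shows how a volume-relative offset could be resolved to a disk and a physical block before being queued to that disk's DMP metanode:

    from dataclasses import dataclass

    @dataclass
    class Subdisk:
        disk: str           # disk access name, e.g. "disk_0"
        vol_offset: int     # block where the subdisk starts within the plex
        disk_offset: int    # block where the subdisk starts on the physical disk
        length: int         # length of the subdisk in blocks

    @dataclass
    class Plex:
        subdisks: list

        def map(self, offset):
            # Walk the plex's subdisks and translate a volume-relative offset
            # into (disk name, physical block offset on that disk).
            for sd in self.subdisks:
                if sd.vol_offset <= offset < sd.vol_offset + sd.length:
                    return sd.disk, sd.disk_offset + (offset - sd.vol_offset)
            raise ValueError("offset beyond volume length")

    def vxio_dispatch(plexes, offset, dmp_queues):
        # Queue the I/O to the DMP metanode of every disk that must service it
        # (one per plex, as would happen for a mirrored write).
        for plex in plexes:
            disk, phys_off = plex.map(offset)
            dmp_queues[disk].append(phys_off)   # stand-in for the metanode strategy call

    # Example: a concatenated volume built from two subdisks on different disks.
    plex = Plex([Subdisk("disk_0", 0, 1024, 2048), Subdisk("disk_1", 2048, 0, 2048)])
    queues = {"disk_0": [], "disk_1": []}
    vxio_dispatch([plex], offset=2100, dmp_queues=queues)
    print(queues)   # the I/O at volume offset 2100 lands on disk_1 at block 52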

The DMP metanode is a pseudo device located in /dev/vx/[r]dmp/ and is a representation of the disk with all of its paths. When the I/O is directed at the DMP metanode device, it is handled by the DMP kernel module, vxdmp. The vxdmp driver creates its own buffer to service the I/O and piggybacks the incoming (vxio) I/O on this buffer. DMP selects one of the sub-paths for the I/O and passes the buffer to the disk driver instance for that path. The buffer includes the lbolt value (the number of clock ticks since boot time) at the time the I/O was issued to the disk driver, as well as the number of times the I/O has been retried. DMP then waits on the I/O, which has left its domain and is now in the SCSI disk driver.
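
The sketch below illustrates the idea of the piggybacked buffer. The DmpBuf class and dmp_strategy function are hypothetical stand-ins, not the real vxdmp driver, and the lbolt value is approximated with a monotonic clock:

    import time
    from dataclasses import dataclass, field

    @dataclass
    class DmpBuf:
        vxio_buf: object                 # the original vxio buffer, piggybacked here
        lbolt: float = field(default_factory=time.monotonic)  # stand-in for clock ticks since boot
        retries: int = 0                 # times this I/O has been retried

    def dmp_strategy(vxio_buf, paths, send_to_sd):
        # Pick one enabled sub-path (the real driver applies its I/O policy here)
        # and hand the wrapped buffer to that path's SCSI disk driver instance.
        enabled = [p for p in paths if p["state"] == "enabled"]
        if not enabled:
            raise IOError("dmpnode has no usable paths")
        send_to_sd(enabled[0]["name"], DmpBuf(vxio_buf))

    # Example: two paths to the same LUN, one of which is currently disabled.
    paths = [{"name": "c1t0d0s2", "state": "disabled"},
             {"name": "c2t0d0s2", "state": "enabled"}]
    dmp_strategy("<vxio buffer>", paths, lambda path, buf: print(path, buf.retries))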
 
Note:
  • If VxVM foreign device support is used, the I/O bypasses the DMP layer, i.e. vxio passes the buffer directly to the relevant third-party driver.
  • On Solaris, I/Os enter the DMP layer even if only one path exists.
  • On HP-UX, even though vxio sends the buffer to a raw DMP metanode, DMP sends it down the block interface of the SCSI driver.
  • On AIX, if the LUN is single-pathed, the I/O is sent directly to the SCSI driver, bypassing DMP. This fast-pathing is done only for single-pathed LUNs. If the SCSI driver returns the I/O with an error, the I/O is then retried through vxdmp.

The SCSI disk driver (sd, ssd, etc.) now processes the I/O and sends it to the relevant HBA driver, which eventually sends it across to the relevant disk/LUN. On completion of the I/O, the buffer returns all the way up the stack, back to the system call that initiated the I/O.


What happens if there is an I/O failure?

If there is an I/O failure in the disk sub-system, the error is propagated up the I/O stack. If the SCSI disk driver detects the error, it goes through its own timeout and retry mechanism. When these are exhausted, the SCSI disk driver should return the buffer to DMP with the B_ERROR flag set. The buffer is then placed on the DMP error processing queue. The DMP error daemon then tests the problematic path to determine whether the problem is transient or permanent. This is done by sending a SCSI inquiry ioctl down the path to check whether the device is accessible. If the SCSI inquiry succeeds, DMP re-issues the I/O down the same path. This process is repeated dmp_retry_count times by vxdmp. If all of the retries are exhausted, the failure is interpreted as a media error. DMP assumes that a driver above it in the stack may wish to retry the I/O (with data relocation, etc.), and hence the error is returned to vxio without marking the DMP metanode as "failed" (no vxdmp message is logged).

If the SCSI inquiry fails, DMP logs the path as "disabled" and then checks the state of the device by sending SCSI inquiries to the other paths in the path set. If at least one path inquiry succeeds, the I/O is re-issued down the next enabled and active path. If all of the path inquiries fail, the device is determined to be a dead LUN and the DMP metanode is marked as "failed". The vxio buffer that was piggybacked on the buffer with the I/O error is then extracted and passed back to vxio, which in turn logs a vxio read/write error for the relevant subdisk and block offset.
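
The decision tree described in the two paragraphs above can be summarised with the following sketch. The function and object names are hypothetical, and dmp_retry_count is represented by an illustrative constant:

    from types import SimpleNamespace

    DMP_RETRY_COUNT = 5   # illustrative value standing in for the dmp_retry_count tunable

    def handle_failed_io(dmpnode, failed_path, buf, scsi_inquiry, reissue_io):
        # 1. Transient check: probe the path that returned the buffer with B_ERROR.
        if scsi_inquiry(failed_path):
            if buf.retries < DMP_RETRY_COUNT:
                buf.retries += 1
                return reissue_io(failed_path, buf)   # retry down the same path
            return "media error"                      # retries exhausted; returned to vxio
        # 2. The path looks bad: disable it and probe the remaining paths.
        failed_path.state = "disabled"
        for path in dmpnode.paths:
            if path is not failed_path and scsi_inquiry(path):
                return reissue_io(path, buf)          # fail the I/O over to a good path
        # 3. No path responds: treat the device as a dead LUN.
        dmpnode.state = "failed"
        return "I/O error returned to vxio"

    # Example: the primary path is dead but the second path answers inquiries.
    a = SimpleNamespace(name="c1t0d0s2", state="enabled")
    b = SimpleNamespace(name="c2t0d0s2", state="enabled")
    node = SimpleNamespace(paths=[a, b], state="enabled")
    buf = SimpleNamespace(retries=0)
    print(handle_failed_io(node, a, buf,
                           scsi_inquiry=lambda p: p is b,
                           reissue_io=lambda p, _: "re-issued on " + p.name))
    print(a.state)   # "disabled"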

vxio then initiates its own error processing. Depending on the configuration of the volume, the plex containing the failed subdisk is detached. If the volume has redundancy, the error is not reported to the upper layer (file system, etc.) because the volume is tolerant of the failure. If the volume has no redundancy, the I/O to the volume has failed and the error is propagated to the file system layer above. The file system logs the failed I/O and returns an error to the application's system call. With VxFS, depending on the failure and the mount options, the file system may itself be disabled (and this is logged).
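
A minimal sketch of that vxio-level decision, using hypothetical names and assuming a simple ACTIVE/DETACHED plex state model:

    def vxio_handle_subdisk_error(volume, failed_plex):
        # Detach the plex that holds the failed subdisk.
        failed_plex["state"] = "DETACHED"
        # Only surface the error if no other plex can still service the volume.
        if any(p["state"] == "ACTIVE" for p in volume["plexes"]):
            return "success"        # redundancy absorbs the failure
        return "EIO"                # no redundancy: error propagates to the file system

    # Example: a mirrored volume survives the failure, a simple volume does not.
    mirror = {"plexes": [{"state": "ACTIVE"}, {"state": "ACTIVE"}]}
    print(vxio_handle_subdisk_error(mirror, mirror["plexes"][0]))   # success
    simple = {"plexes": [{"state": "ACTIVE"}]}
    print(vxio_handle_subdisk_error(simple, simple["plexes"][0]))   # EIO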

Once a DMP metanode is marked as failed, subsequent I/Os on that dmpnode (to that LUN) will return immediately from DMP. If multiple I/O buffers are returned with errors for different LUNs/regions, these are processed serially by vxdmp.

When does a temporarily "failed" path get enabled again?

A path may be marked as "failed" due to a transient error caused by cabling, switch, or other issues. The retry mechanism described above is meant to avoid disabling a path straight away when the I/O failed due to a transient error. Once a path is marked as "failed", it remains in that state until the DMP restore daemon checks the health of the path and determines it to be good. The DMP restore daemon wakes up every "dmp_restore_daemon_interval" seconds and checks whether the paths are okay by issuing a SCSI inquiry ioctl. The paths chosen for the health check depend on the "dmp_restore_daemon_policy". If the daemon finds a path to be good, the path is marked as enabled and active and becomes available for I/O. dmp_restore_daemon_interval defaults to 300 seconds and can be changed with vxdmpadm.
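
The following sketch models the restore daemon's cycle. It is illustrative only: the function name is hypothetical, it probes every path (roughly a "check_all"-style policy), and the real path selection is governed by dmp_restore_daemon_policy:

    import time
    from types import SimpleNamespace

    def restore_daemon(dmpnodes, scsi_inquiry, interval=300, cycles=1):
        # The real daemon loops forever; 'cycles' just keeps this sketch finite.
        for _ in range(cycles):
            for node in dmpnodes:
                for path in node.paths:
                    if path.state == "failed" and scsi_inquiry(path):
                        path.state = "enabled"   # path is healthy again; I/O may use it
            time.sleep(interval)                 # dmp_restore_daemon_interval

    # Example: a previously failed path that now answers SCSI inquiries.
    path = SimpleNamespace(name="c1t0d0s2", state="failed")
    restore_daemon([SimpleNamespace(paths=[path])], scsi_inquiry=lambda p: True,
                   interval=0, cycles=1)
    print(path.state)   # "enabled"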

What is a "disabled" path?

A given path, or the paths under a given controller, can be marked as "disabled" through administrative intervention. When the vxdmpadm command is used to disable path(s) in this manner, they are marked as "disabled". Such paths can be brought back to the "active enabled" state only through the administrative command "vxdmpadm enable".

Active Active (A/A) and Active Passive (A/P) arrays and DMP behavior
For A/A arrays, DMP does load balancing by distributing the I/O across the multiple paths that exist to a given LUN.
 
For A/P arrays, I/Os are issued through the primary path(s) of the DMP metanode, which correspond to the "active" port on the array. The secondary paths of the dmpnode correspond to the "passive" port. If the primary fails for some reason, DMP fails over to the secondary path so that the secondary becomes the active path. When the primary comes back, DMP fails back to the primary path.
 
For A/PF (explicit failover) arrays, a command is sent to the array to enable the new path in the case of a failover or failback. If auto-trespass mode is used, the first I/O that goes down the new path makes it the active path.
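
The difference between the two behaviours can be sketched as a simple path chooser. This is not DMP's real scheduler (which offers several I/O policies); the names and the round-robin choice are illustrative assumptions:

    import itertools
    from types import SimpleNamespace

    _rr = itertools.count()   # round-robin counter for the A/A case

    def choose_path(dmpnode, array_mode):
        enabled = [p for p in dmpnode.paths if p.state == "enabled"]
        if not enabled:
            raise IOError("no enabled paths to this LUN")
        if array_mode == "A/A":
            return enabled[next(_rr) % len(enabled)]      # spread I/O over all paths
        primaries = [p for p in enabled if p.primary]
        return primaries[0] if primaries else enabled[0]  # A/P: primary first, else fail over

    # Example: an A/P dmpnode whose primary path has been disabled.
    paths = [SimpleNamespace(name="primary", state="disabled", primary=True),
             SimpleNamespace(name="secondary", state="enabled", primary=False)]
    print(choose_path(SimpleNamespace(paths=paths), "A/P").name)   # "secondary"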

Insane device behavior
It has been seen, in extreme and rare scenarios, that devices can go "insane": I/O failures are not returned for long periods of time, yet SCSI inquiries respond promptly. This can cause hang conditions for the I/O. Consider a case where vxdmp issues the I/O to a subsystem that holds on to it for ten minutes and then returns it as failed. DMP tests the path with a SCSI inquiry; if the inquiry succeeds, the I/O is re-issued down the problem path because the failure appears to be transient. This I/O may also take ten minutes, and the process repeats until dmp_retry_count is exhausted. In such cases I/Os have taken many minutes to fail, which clearly causes a problem for the upper-level application.
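
A rough, illustrative calculation of the delay (the counts and timings below are assumptions, not measured values):

    # Each attempt sits in the subsystem for ten minutes before failing, yet the
    # inquiry keeps succeeding, so DMP keeps retrying until dmp_retry_count is used up.
    dmp_retry_count = 5          # illustrative value only
    minutes_per_attempt = 10
    print("worst case before the error reaches vxio: about",
          (dmp_retry_count + 1) * minutes_per_attempt, "minutes")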

 
Hung I/O
The components in the I/O stack should always complete the I/O request by signaling success or failure. If any component holds onto the I/O without signaling success or failure, the I/O will hang. As more requests are channeled through the I/O subsystem, the hung I/O may become a bottleneck and the whole system may grind to a halt. In such hung systems, the cause of the hang should be investigated.
 
Note: VxVM/DMP has no initial timeout mechanism and relies upon the error notification of the driver layers beneath it (sd, etc.). If the sub-layer drivers do not detect the error, or hold on to the I/O, DMP is unable to process the error and rectify the problem (for example, by using another path).
 

 
