System panic is observed after applying InfoScale 8.0.2 Update 6 on RHEL 8.x/9.x and SLES 15 due to VxFS LRU list inconsistency
Error Message
The issue is seen on both Parallel File System (CFS) mounts and local/failover VxFS mounts. The panic backtrace looks like this:
#0 [ffffc13000bbfb58] machine_kexec at ffffffff9f26dde3
#1 [ffffc13000bbfbb0] __crash_kexec at ffffffff9f3b9caa
#2 [ffffc13000bbfc70] crash_kexec at ffffffff9f3babe1
#3 [ffffc13000bbfc88] oops_end at ffffffff9f22c131
#4 [ffffc13000bbfca8] do_trap at ffffffff9f228397
#5 [ffffc13000bbfcf0] do_invalid_op at ffffffff9f2290d6
#6 [ffffc13000bbfd10] invalid_op at ffffffff9fe00df4
[exception RIP: __list_del_entry_valid.cold.1+32]
RIP: ffffffff9f6f9308 RSP: ffffc13000bbfdc0 RFLAGS: 00010246
RAX: 0000000000000054 RBX: ffff9ef89d99e220 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9f0b60cde698 RDI: ffff9f0b60cde698
RBP: ffff9ef96fa30fc0 R8: 0000000000000000 R9: c0000000ffff7fff
R10: 0000000000000001 R11: ffffc13000bbfbe0 R12: 0000000000000029
R13: ffff9f0e399714a8 R14: ffffffffc2d40df8 R15: ffffffffc2d40e08
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffc13000bbfdc0] vx_ddelete at ffffffffc2b48c3e [vxfs]
#8 [ffffc13000bbfe00] dput at ffffffff9f589b30
#9 [ffffc13000bbfe18] do_renameat2 at ffffffff9f5806d3
#10 [ffffc13000bbff28] do_renameat2 at ffffffff9f580335
#11 [ffffc13000bbff38] do_syscall_64 at ffffffff9f203cab
#12 [ffffc13000bbff50] entry_SYSCALL_64_after_hwframe at ffffffff9fe0012e
To improve parallel lookup performance, the Least Recently Used (LRU) list logic in VxFS was changed. This change introduced a defect that surfaces during the VxFS rename operation.
In certain scenarios, the in-memory LRU lists can become inconsistent because the wrong lock is taken to protect a list. The inconsistency is confined to these internal list structures and does not result in any file system or data corruption.
The problem is limited to in-memory state; user data is not affected.
Solution
The Product Engineering team currently plans to address this issue through a patch or hotfix in the current software version. Please note that our company reserves the right to withdraw any fix from the targeted release if it fails quality assurance tests. Development plans are subject to change, and any actions you take based on this information, or your reliance on it, are at your own risk.
The InfoScale 8.0.2 U6 patch has been refreshed and now includes updated VxFS (8.0.2.2800) and ODM (8.0.2.2800) packages that resolve the system panic caused by the VxFS LRU list inconsistency. Customers who installed the original U6 bundle (i.e., before August 13) can do one of the following: refresh U6 with the installer, download the VxFS and ODM 8.0.2.2701 hotfixes from the Download Center, or install the VxFS and ODM 8.0.2.2800 packages from the yum repository. New deployments targeting U6 should use the refreshed patch bundle directly.
The U6 patch can be downloaded from the Download Center using the following URLs:
RHEL 8 : https://www.veritas.com/support/en_US/downloads/update.UPD552281
RHEL 9 : https://www.veritas.com/support/en_US/downloads/update.UPD367442
SLES 15 : https://www.veritas.com/support/en_US/downloads/update.UPD945272
VxFS HotFix :
RHEL 8 : https://www.veritas.com/content/support/en_US/downloads/update.UPD680412
RHEL 9 : https://www.veritas.com/content/support/en_US/downloads/update.UPD321672
SLES 15 : https://www.veritas.com/content/support/en_US/downloads/update.UPD725294
ODM HotFix :
RHEL 8 : https://www.veritas.com/content/support/en_US/downloads/update.UPD549595
RHEL 9 : https://www.veritas.com/content/support/en_US/downloads/update.UPD929313
SLES 15 : https://www.veritas.com/content/support/en_US/downloads/update.UPD250348
Internal Notes
More details regarding Cause
=====================
Multiple Least Recently Used (LRU) lists were introduced to avoid the contention on a single LRU lock seen during parallel lookups.
A panic is seen during a rename operation on the file system. It is caused by internal in-memory lists becoming corrupted; it does not cause any file system or data corruption. The lists become inconsistent because the wrong lock is taken to protect them. With the multiple LRU lists change, d_name.hash is used to pick the LRU list a dentry is put on, and the LRU list lock is picked the same way, from the dentry's d_name.hash, when the dentry is taken off the list. However, a dentry's d_name can change between the time the dentry is added to an LRU list and the time it is removed. When that happens, the lock is picked from the current (already changed) d_name.hash rather than the initial one, so the wrong LRU list lock is taken. The list the dentry actually sits on is then modified without its own lock held, which allows the list to become corrupted and leads to the panic.
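The lock mismatch described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not VxFS code: the class names, the toy `name_hash` function, and the list count of 4 are all hypothetical stand-ins chosen for illustration.

```python
import threading

NUM_LISTS = 4  # hypothetical number of LRU lists


def name_hash(name):
    """Toy deterministic hash standing in for d_name.hash (assumption)."""
    return sum(name.encode("utf-8"))


class Dentry:
    def __init__(self, name):
        self.name = name  # stands in for d_name


class BuggyMultiLRU:
    """Picks the list (and its lock) from the dentry's *current* name hash."""

    def __init__(self):
        self.lists = [[] for _ in range(NUM_LISTS)]
        self.locks = [threading.Lock() for _ in range(NUM_LISTS)]

    def _index(self, dentry):
        return name_hash(dentry.name) % NUM_LISTS

    def add(self, dentry):
        i = self._index(dentry)
        with self.locks[i]:
            self.lists[i].append(dentry)
        return i

    def lock_index_for_removal(self, dentry):
        # Recomputed at removal time: wrong if the name changed meanwhile.
        return self._index(dentry)


lru = BuggyMultiLRU()
d = Dentry("a")
added_to = lru.add(d)                    # list chosen from the initial name
d.name = "b"                             # a rename changes the name and its hash
locked = lru.lock_index_for_removal(d)   # lock chosen from the new name
print(added_to, locked, d in lru.lists[locked])  # prints: 1 2 False
```

In this model the dentry sits on list 1, but the removal path would take the lock for list 2, so list 1 would be modified without its lock held.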
Solution
===================
The VxFS code was changed so that once an LRU list is picked for a dentry based on its initial d_name.hash, the dentry stays on that same LRU list until it is deleted.
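One way to keep a dentry tied to its original list is to record the chosen list index at insertion time and reuse it on removal. The sketch below shows that idea in a user-space model; it is an assumption about how such a guarantee could be achieved, not a description of the actual VxFS change. The `lru_index` field, the class names, the toy hash, and the list count are hypothetical.

```python
import threading

NUM_LISTS = 4  # hypothetical number of LRU lists


def name_hash(name):
    """Toy deterministic hash standing in for d_name.hash (assumption)."""
    return sum(name.encode("utf-8"))


class Dentry:
    def __init__(self, name):
        self.name = name        # stands in for d_name
        self.lru_index = None   # hypothetical field: list chosen at insertion


class PinnedMultiLRU:
    """Remembers the list picked at insertion and reuses it for removal."""

    def __init__(self):
        self.lists = [[] for _ in range(NUM_LISTS)]
        self.locks = [threading.Lock() for _ in range(NUM_LISTS)]

    def add(self, dentry):
        i = name_hash(dentry.name) % NUM_LISTS  # initial hash only
        with self.locks[i]:
            dentry.lru_index = i
            self.lists[i].append(dentry)
        return i

    def remove(self, dentry):
        i = dentry.lru_index  # not recomputed, so a rename cannot change it
        with self.locks[i]:
            self.lists[i].remove(dentry)
            dentry.lru_index = None
        return i


lru = PinnedMultiLRU()
d = Dentry("a")
added_to = lru.add(d)
d.name = "b"                  # rename while the dentry is on the LRU list
removed_from = lru.remove(d)  # same list and same lock as at insertion
print(added_to == removed_from)  # prints: True
```

Because the index is stored rather than recomputed, the lock taken on removal always matches the list the dentry is on, regardless of renames in between.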
Notes regarding the original hotfixes provided to fix this panic for the original U6 patch (pre-refresh):
A hotfix is now available for this issue in the current version of the product(s) mentioned. Refer to the Hotfix link under Related Articles to obtain the hotfix needed to resolve the issue. Please note that our company reserves the right to withdraw any fix from the targeted release if it fails quality assurance tests. Development plans are subject to change, and any actions you take based on this information, or your reliance on it, are at your own risk.
VxFS HotFix :
https://www.veritas.com/content/support/en_US/downloads/update.UPD680412 (RHEL 8)
https://www.veritas.com/content/support/en_US/downloads/update.UPD321672 (RHEL 9)
https://www.veritas.com/content/support/en_US/downloads/update.UPD725294 (SLES 15)
ODM HotFix :
https://www.veritas.com/content/support/en_US/downloads/update.UPD549595 (RHEL 8)
https://www.veritas.com/content/support/en_US/downloads/update.UPD929313 (RHEL 9)
https://www.veritas.com/content/support/en_US/downloads/update.UPD250348 (SLES 15)
Currently, the hotfix is available on VRTSvxfs 8.0.2.2603. It must be installed together with VRTSodm patch 8.0.2.2400.
http://release.arctera.io/re/release_train/linux/8.0.2/patch_central/HF/fs/rhel8_x86_64/8.0.2.2603/
http://release.arctera.io/re/release_train/linux/8.0.2/patch_central/HF/fs/rhel9_x86_64/8.0.2.2603/
Engineering has also released an HF on top of 8.0.2 U6
=====================
HF Install Instructions
-> If VVR is configured, ensure the replication status is up-to-date. Wait until replication is consistent/up-to-date.
# vradmin -g <DG> repstatus <RVG>
-> Offline all the service groups in order (except the CVM service group in an FSS environment).
# hagrp -offline <SG> -any
# hastop -all
-> Unmount local VxFS file systems cleanly
# umount </MNTPT>
-> Ensure all the disk groups are deported.
# vxdg deport <DG>
# vxdg list
-> Perform the steps below to install the VxFS and ODM HF in a single step.
## Go to the VxFS patch folder
# ./installVRTSvxfs802P2701 -patch_path /var/tmp/HF/odm/ [-require /var/tmp/CPI/patches/<CPI Patch>]
Use the latest CPI patch if the system is not connected to the Internet
-> After installing the HF, verify that all modules are loaded and the packages are installed successfully
# lsmod | grep -i vx
# rpm -aq | egrep -i "VRTSvx|VRTSodm"    << Both RPMs should show as upgraded to 802.2701
-> Perform a reboot
# shutdown -Fr now OR init 6
-> Validate that the file systems are mounted cleanly and all VCS service groups come online successfully
# df -h
# hastatus -sum
-> Perform I/O operations on the file systems
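The package verification step in the instructions above (checking the rpm output for the expected VRTSvxfs and VRTSodm versions) could be scripted along the following lines. This is a minimal sketch that assumes the default rpm query format NAME-VERSION-RELEASE.ARCH and that the version string appears as 8.0.2.2701. The function name is hypothetical, and the sample lines are made-up illustrative strings, not output captured from a real system.

```python
REQUIRED_PACKAGES = ("VRTSvxfs", "VRTSodm")


def packages_at_version(rpm_lines, expected_version, required=REQUIRED_PACKAGES):
    """Map each required package name to True if some line of rpm output
    starts with '<name>-<expected_version>-' (default rpm query format)."""
    result = {}
    for pkg in required:
        prefix = f"{pkg}-{expected_version}-"
        result[pkg] = any(line.strip().startswith(prefix) for line in rpm_lines)
    return result


# Illustrative input only: the release and architecture suffixes are assumptions.
sample_lines = [
    "VRTSvxfs-8.0.2.2701-RHEL8.x86_64",
    "VRTSodm-8.0.2.2400-RHEL8.x86_64",
]
status = packages_at_version(sample_lines, "8.0.2.2701")
for name, ok in status.items():
    print(name, "OK" if ok else "NOT at expected version")
```

On a live system the input lines would come from the rpm query shown in the instructions; any package reported as not at the expected version should be investigated before the reboot.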