Problem
SDS Operator pod in CrashLoopBackOff state following OpenShift node removal
Error Message
CrashLoopBackOff
Cause
In a disk-based fencing configuration using the VIKE solution, when a worker node is completely removed from the OpenShift cluster, the sds-operator pod fails to transition to a running state even though it successfully initiates the node removal process.
Example:
# oc get infoscalecluster
NAME VERSION CLUSTERID STATE DISKGROUPS STATUS AGE
isc-primary 8.0.400 1000 ProcessingRemoveNode vrts_kube_dg-1000 Degraded 262d
Solution
Contact Arctera Support to obtain the updated SDS Operator images compatible with version 8.0.400.
Steps to replace the image:
- Log in to the private registry, then load, tag, and push the image:
podman load -i <sds-operator-image>
podman tag <image_id> <registry_path>/infoscale-sds-operator:8.0.400-rhel
podman push <registry_path>/infoscale-sds-operator:8.0.400-rhel
- Log in to the node where the sds-operator pod is running as the core user, then elevate to the root user:
sudo su - root
- Pull the updated image from the registry:
podman pull <registry image path>/infoscale-sds-operator:8.0.400-rhel
- On the bastion host, edit the SDS Operator deployment:
oc edit deployment infoscale-sds-operator
- Change both occurrences of image: to <registry image path>/infoscale-sds-operator:8.0.400-rhel
- At the top of the pod template's spec: section, add:
nodeName: <hostname of the worker node where the sds-operator pod is in CrashLoopBackOff>
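For reference, after the edit the relevant portion of the Deployment might look like the sketch below. This is a minimal illustrative fragment, not output from an actual cluster: the hostname is a placeholder, and other Deployment fields are omitted. Note that nodeName is a pod-spec field, so it belongs under the pod template's spec (spec.template.spec), which pins the pod to the node that already has the pulled image.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: infoscale-sds-operator
spec:
  template:
    spec:
      # Pin the pod to the worker node where the updated image was pulled
      # (illustrative hostname; replace with your node's hostname)
      nodeName: worker-node-1
      containers:
        - name: infoscale-sds-operator
          # Updated image path from your private registry
          image: <registry image path>/infoscale-sds-operator:8.0.400-rhel
```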