Cohesity Cloud Scale Technology Manual Deployment Guide for Kubernetes Clusters
Upgrade Cloud Scale using the kubectl plugin
Note the following:
During the upgrade, ensure that the value of minimumReplica in the media server CR is the same as it was before the upgrade.
Upgrading a Cloud Scale deployment using the kubectl plugin is supported for version 10.5 or later.
By default, is set to . As a result, only stdout logs are collected and available for download unless the flag is manually enabled.
Upgrading Cloud Scale deployment using kubectl plugin
- Before you proceed with the upgrade, ensure that the following prerequisites are met:
Infrastructure readiness: The Kubernetes cluster and the Cloud Scale environment are up and running.
Required container images: All Cloud Scale related images of the version you want to upgrade to are pushed to your container registry.
Helm setup:
Helm is installed and configured.
The jetstack repository is added:
helm repo add jetstack https://charts.jetstack.io
The cert-manager and trust-manager charts are installed using Helm.
Also review the prerequisites in the following section: Prerequisites for Cloud Scale Technology upgrade
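The Helm-related prerequisites above can be checked before starting the upgrade. The following is a minimal sketch, not part of the product: the function name check_helm_prereqs is hypothetical, and it assumes that the cert-manager and trust-manager releases are named after their charts. Adjust the release names if your installation uses different ones.

```shell
# Sketch: report which Helm prerequisites appear to be missing.
# Assumption: the Helm releases are named "cert-manager" and "trust-manager".
check_helm_prereqs() {
  local missing=0 releases release

  # Helm must be available on the PATH.
  if ! command -v helm >/dev/null 2>&1; then
    echo "MISSING: helm is not installed"
    return 1
  fi

  # The jetstack repository must be registered (first column of 'helm repo list').
  if ! helm repo list 2>/dev/null | awk 'NR > 1 {print $1}' | grep -qx 'jetstack'; then
    echo "MISSING: jetstack repository"
    missing=1
  fi

  # --short prints release names only, one per line.
  releases="$(helm list --all-namespaces --short 2>/dev/null)"
  for release in cert-manager trust-manager; do
    if ! printf '%s\n' "$releases" | grep -qx "$release"; then
      echo "MISSING: Helm release $release"
      missing=1
    fi
  done

  [ "$missing" -eq 0 ] && echo "OK: Helm prerequisites look satisfied"
  return "$missing"
}

# Usage: check_helm_prereqs
```

The function only reads Helm state and changes nothing, so it can be run repeatedly.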
- Execute the binary at bin/ with the upgrade option using the following command:
./kubectl-cloudscale upgrade
- Follow the prompts to complete the upgrade. For example:
/root/VRTSk8s-netbackup-11.1.0.2-0019/bin/kubectl-cloudscale upgrade
Cloud Scale Upgrade
Before you proceed with installation, please ensure the following prerequisites are in place:
1. Infrastructure readiness:
   - The Kubernetes cluster is up and running
   - Cloud Scale environment is up and running
2. Required container images:
   - All Cloud Scale related images of the version you would like to upgrade to are pushed to your container registry
3. Helm setup:
   - Helm is installed and configured
   - The "jetstack" repository is added: helm repo add jetstack https://charts.jetstack.io
   - The cert-manager and trust-manager charts are installed via helm
4. Container registry is logged in to this machine
Also review the 'Prerequisites for Cloud Scale Technology upgrade' section in the 'Cohesity Cloud Scale Technology Manual Deployment Guide for Kubernetes Clusters' document for additional required steps.
Once everything is ready, you can safely continue with the upgrade.
Would you like to continue? (y/n): y
Checking if the input file already exists.
Input file is not present.
Provide the file path of the extracted Cloud Scale folder: /root/VRTSk8s-netbackup-11.1.0.2-0019/
Provide a namespace for the Environment: netbackup
Environment is ready for upgrade
*******************************Checking for cert-manager********************************
Checking if Cert Manager is installed or not
Cert-manager is installed with version 'v1.13.3'. It is recommended to upgrade cert-manager to version v1.18.2 post Cloud Scale deployment
*******************************Checking for trust-manager********************************
Checking if Trust Manager is installed or not
Trust-manager is installed with version ''v0.7.0''. It is recommended to upgrade trust-manager to version v0.19.0 post Cloud Scale deployment
Provide the image tag for the NetBackup images that need to be upgraded: 11.1.0.2-0019
Validating images...
Provide the image tag for the MSDP images that need to be upgraded: 21.1.0.2-0013
Validating images...
Provide the image tag for the NetBackup Snapshot Manager images that need to be upgraded: 11.1.0.2-3006
Validating images...
Operator Namespace: netbackup-operator-system
Upgrade of operators started...
The operators chart has been installed successfully.
Provide the image tag for the PostgreSQL images that need to be upgraded: 16.11.1.0-0001
Validating images...
Current version of cloudscale: 11.1
Starting upgrade of Helm-based CloudScale environment in namespace netbackup
Successfully upgraded CloudScale Helm release to version 4
Successfully initiated the upgrade for Helm-based CloudScale environment in netbackup namespace.
Ensure that all pods are up and running before using the NetBackup.
For kubectl plugin specific logs, check the log at /root/.cloudscale/setup-cloudscale.log
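The plugin output asks you to ensure that all pods are up and running before using NetBackup. One way to check this is sketched below. The function name pods_not_ready is hypothetical, and the sketch assumes the default column layout of kubectl get pods (NAME, READY, STATUS). It treats a pod as healthy when it is Running with all containers ready, or when it is Completed.

```shell
# Sketch: list pods in a namespace that are not yet healthy.
# Assumption: default 'kubectl get pods' columns (NAME READY STATUS ...).
# Returns 0 when every pod is healthy, 1 when some are not, 2 if kubectl fails.
pods_not_ready() {
  local ns="${1:-netbackup}"
  local pods
  pods="$(kubectl get pods -n "$ns" --no-headers)" || return 2
  printf '%s\n' "$pods" | awk '
    NF == 0 { next }                      # ignore blank lines
    {
      split($2, ready, "/")               # READY column, for example 1/1
      healthy = ($3 == "Running" && ready[1] == ready[2]) || $3 == "Completed"
      if (!healthy) { print $1, $2, $3; bad = 1 }
    }
    END { exit bad }'
}

# Usage: pods_not_ready netbackup && echo "All pods are up and running"
```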
Post upgrade, the flexsnap-listener pod is migrated to the cp control nodepool as per the node selector settings in the environment CR. To reduce the TCO, the user can change the minimum size of the CP data nodepool to 0 through the portal.
Post upgrade, for cost optimization, the user can change the value of of the media server CR to 0. The user can also change the minimum size of the media nodepool to 0 through the portal.
(For Azure) Post upgrade, patch the StorageClass and PV with nconnect=4 for better NFS performance.
The nconnect mount option in Azure is a client-side NFS feature that enables multiple TCP connections between an NFS client and the NFS endpoint. This improves large-scale performance by reducing the number of clients needed to reach the maximum bandwidth of high-capacity SSD file shares. Azure Files supports up to 16 nconnect channels, with nconnect=4 recommended for optimal performance. For this reason, it is recommended to set nconnect=4 in the StorageClass specification for Cloud Scale.
To patch StorageClass and PV with nconnect=4 for better NFS performance
- Patch the StorageClass (affects newly created PVs):
Identify the StorageClass used by the NFS-mounted volume.
Use the following command to list the PVCs in the netbackup namespace:
kubectl get pvc -n netbackup
Example output:
NAME                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      VOLUMEATTRIBUTESCLASS   AGE
catalog-nbu-primary-0      Bound    pvc-37f2d9be-84e8-4e6c-8463-dceb16642370   110Gi      RWX            nb-file-premium   <unset>                 19d
certauth-pvc               Bound    pvc-defd5f28-0be5-4e70-8453-0e2472bfca06   1Gi        RWO            nb-disk-premium   <unset>                 19d
cloudpoint-pvc             Bound    pvc-e910737e-4d69-4e0b-8fc5-d57e5787eace   5Gi        RWX            nb-file-premium   <unset>                 19d
data-flexsnap-rabbitmq-0   Bound    pvc-794271f2-5440-47b8-b285-2929b0ab6669   5Gi        RWO            nb-disk-premium   <unset>                 19d
From this output, the PVC catalog-nbu-primary-0 uses the StorageClass nb-file-premium, so this is the StorageClass to patch.
Patch the StorageClass.
Apply the nconnect=4 mount option:
kubectl patch storageclass nb-file-premium --type merge -p '{"mountOptions":["nconnect=4"]}'
The following content must be displayed:
storageclass.storage.k8s.io/nb-file-premium patched
Verify the StorageClass patch.
kubectl get sc nb-file-premium -o yaml | sed -n '/mountOptions/,$p'
The following content must be displayed:
mountOptions:
- nconnect=4
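Instead of reading the StorageClass from the PVC table by eye, the lookup can be scripted. The following is a sketch under one assumption: NFS-backed volumes are the ones whose first access mode is ReadWriteMany, as in the example output for this step. The function name rwx_storage_classes is hypothetical.

```shell
# Sketch: print the distinct StorageClass names used by ReadWriteMany PVCs.
# Assumption: RWX volumes in this namespace are the NFS-backed ones.
rwx_storage_classes() {
  local ns="${1:-netbackup}"
  kubectl get pvc -n "$ns" \
    -o jsonpath='{range .items[*]}{.spec.accessModes[0]}{"\t"}{.spec.storageClassName}{"\n"}{end}' |
    awk -F'\t' '$1 == "ReadWriteMany" {print $2}' |
    sort -u
}

# Usage: rwx_storage_classes netbackup
```

Review the list before patching, because a StorageClass may be shared by volumes that you do not want to change.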
- Patch the existing PV (applies to currently mounted volumes):
Use the following command to export the PV name:
export PV=pvc-37f2d9be-84e8-4e6c-8463-dceb16642370
Use the following command to check whether the PV already has the mount option:
kubectl get pv "$PV" -o json | jq -e '.spec.mountOptions // [] | index("nconnect=4")'
If the output is null, the option is not present.
Use the following command to patch the PV:
kubectl patch pv "$PV" --type merge -p '{"spec": {"mountOptions": ["nconnect=4"]}}'
Example output:
persistentvolume/<PV_NAME> patched
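When several PVs use the same StorageClass, the check and patch from this step can be repeated in a loop. The sketch below is an illustration, not a product command: the function name patch_pvs_nconnect is hypothetical. A merge patch replaces the whole mountOptions list, so the sketch patches only PVs that have no mount options and reports the others for manual review.

```shell
# Sketch: add nconnect=4 to every PV of one StorageClass that has no mount options.
# Note: a merge patch REPLACES spec.mountOptions, so PVs that already carry
# other options are only reported, not patched.
patch_pvs_nconnect() {
  local sc="$1" opt="nconnect=4"
  kubectl get pv \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.storageClassName}{"\t"}{.spec.mountOptions}{"\n"}{end}' |
  while IFS="$(printf '\t')" read -r name pv_sc opts; do
    [ "$pv_sc" = "$sc" ] || continue
    case "$opts" in
      *"$opt"*) echo "skip $name (already has $opt)" ;;
      "") kubectl patch pv "$name" --type merge \
            -p '{"spec":{"mountOptions":["nconnect=4"]}}' </dev/null ;;
      *) echo "review $name manually (existing mountOptions: $opts)" ;;
    esac
  done
}

# Usage: patch_pvs_nconnect nb-file-premium
```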
- Restart pods to apply the new NFS mount option:
Use the following command to delete each MediaServer replica pod:
export POD=media1-media-0
kubectl delete pod "$POD" -n netbackup
The pod will restart and re-mount with the updated NFS settings.
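With more than one media server replica, the delete can be repeated one pod at a time, waiting for each pod to become Ready before moving on. The following sketch assumes that each deleted pod is recreated under the same name, as the name media1-media-0 suggests. The function name restart_pods and the retry limits are hypothetical.

```shell
# Sketch: delete pods one at a time and wait for each to return to Ready.
# Assumption: each pod is recreated with the same name after deletion.
restart_pods() {
  local ns="$1"; shift
  local pod tries
  for pod in "$@"; do
    kubectl delete pod "$pod" -n "$ns" || return 1
    tries=0
    # The replacement pod may not exist yet, so retry the wait a few times.
    until kubectl wait --for=condition=Ready "pod/$pod" -n "$ns" --timeout=300s; do
      tries=$((tries + 1))
      [ "$tries" -ge 5 ] && { echo "$pod did not become Ready" >&2; return 1; }
      sleep 10
    done
  done
}

# Usage: restart_pods netbackup media1-media-0 media1-media-1
```

Restarting the pods sequentially keeps the remaining replicas available while each one remounts its volumes.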
- Verify nconnect=4 inside the pod:
Once the pod has restarted, run:
mount | grep nfs
Example output:
... nconnect=4 ...
- Apply changes to primary nodes:
Scale down the primary nodepool to 0. This forces NFS unmounting. Then scale it back up so that all primary pods remount the storage with the updated nconnect=4 option.
Verify from primary pods:
mount | grep nfs
Example output:
... nconnect=4 ...
- Verify from all other related pods:
For example:
kubectl exec -it nbu-nbatd-0 -n netbackup -- mount | grep nfs
The nconnect=4 option must be visible on all relevant NFS mount paths.
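The verification steps above can be repeated across several pods with a small loop. The following sketch assumes the usual Linux mount output format, where NFS mounts appear as "type nfs" or "type nfs4". The function name check_nconnect is hypothetical. A pod with no NFS mounts is reported as OK.

```shell
# Sketch: report pods that have an NFS mount without nconnect=4.
# Assumption: 'mount' inside the pod prints lines such as
#   server:/share on /mnt/nbdata type nfs4 (rw,...,nconnect=4,...)
check_nconnect() {
  local ns="$1"; shift
  local pod mounts bad rc=0
  for pod in "$@"; do
    mounts="$(kubectl exec "$pod" -n "$ns" -- mount)" || {
      echo "ERROR $pod (exec failed)"; rc=1; continue
    }
    # Keep only NFS mounts, then drop the ones that already carry nconnect=4.
    bad="$(printf '%s\n' "$mounts" | grep -E ' type nfs4? ' | grep -v 'nconnect=4' || true)"
    if [ -n "$bad" ]; then
      echo "FAIL $pod"
      rc=1
    else
      echo "OK $pod"
    fi
  done
  return "$rc"
}

# Usage: check_nconnect netbackup nbu-nbatd-0 media1-media-0
```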