Important Update: Cohesity Products Documentation
All Cohesity product documentation is now managed via the Cohesity Docs Portal: https://docs.cohesity.com/HomePage/Content/home.htm. Some documentation available here may not reflect the latest information or may no longer be accessible.
Cohesity Cloud Scale Technology Manual Deployment Guide for Kubernetes Clusters
Upgrade Cloud Scale
Note the following:
During the upgrade, ensure that the minimumReplica value of the media server CR is the same as it was before the upgrade.
Upgrading a Cloud Scale deployment using the kubectl plugin is supported only from version 11.0 to 11.1.
By default, is set to . As a result, only stdout logs are collected and available for download unless the flag is manually enabled.
Upgrading Cloud Scale deployment
- Before you proceed with upgrade, ensure that the following prerequisites are met:
Infrastructure readiness: The Kubernetes cluster and Cloud Scale environment are up and running.
Required container images: All Cloud Scale related images of the version you would like to upgrade to are pushed to your container registry.
Helm setup:
Helm is installed and configured.
The jetstack repository is added:
helm repo add jetstack https://charts.jetstack.io
The cert-manager and trust-manager charts are installed via Helm.
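The Helm prerequisites above can be checked before running the upgrade. The following is a minimal sketch and not part of the product tooling: the helper name `helm_table_has` is hypothetical, and the release names `cert-manager` and `trust-manager` are assumptions that depend on how the charts were installed. The helper reads the tabular output of `helm repo list` or `helm list -A` on standard input.

```shell
#!/bin/sh
# helm_table_has NAME (hypothetical helper, for illustration only)
# Reads a Helm table (one header line followed by rows) on stdin and succeeds
# if the first column of any row equals NAME.
helm_table_has() {
  awk -v name="$1" 'NR > 1 && $1 == name { found = 1 } END { exit !found }'
}

# Example usage (requires helm and cluster access; release names are assumptions):
#   helm repo list | helm_table_has jetstack      || echo "jetstack repository is missing"
#   helm list -A   | helm_table_has cert-manager  || echo "cert-manager release not found"
#   helm list -A   | helm_table_has trust-manager || echo "trust-manager release not found"
```

Matching on the first column only avoids false positives when the name appears inside a URL or chart version.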
Also review the prerequisites in the following section:
- Execute the binary at bin/ with the upgrade option using the following command:
./kubectl-cloudscale upgrade
- Follow the upgrade steps as follows:
Cloudscale Plugin Installer - Setup Requirements
Before you proceed with installation, please ensure the following prerequisites are in place:
1. Infrastructure readiness:
   - A Kubernetes cluster is up and running
   - Node pools and external IPs are configured
   - Networking and access policies are in place
2. Required container images:
   - All Cloudscale-related images are built and pushed to your container registry
3. Helm setup:
   - Helm is installed and configured
   - The "jetstack" repository is added: helm repo add jetstack https://charts.jetstack.io
Once everything is ready, you can safely continue with the plugin installation.
Would you like to continue? (y/n): y
Checking if the input file already exists.
Input file is not present.
Provide the file path of the extracted Cloud Scale folder: /new-disk/VRTSk8s-netbackup-11.1-xxxx
Provide a namespace for the Environment: netbackup
*******************************Checking for cert-manager********************************
Checking if Cert Manager is installed or not
Cert-manager is installed with version 'v1.13.3'. It is recommended to upgrade cert-manager to version v1.18.2 post Cloud Scale deployment
*******************************Checking for trust-manager********************************
Checking if Trust Manager is installed or not
Trust-manager is installed with version ''v0.7.0''. It is recommended to upgrade trust-manager to version v0.19.0 post Cloud Scale deployment
Provide the image tag for NetBackup images: 11.1-xxxx
Provide the image tag for MSDP images: 21.1-0016
Provide the image tag for NetBackup Snapshot Manager images: 11.1-xxxx
Operator Namespace: netbackup-operator-system
Upgrade of operators started...
Operators chart upgraded successfully.
Provide the image tag for PostgreSQL images : 16.0.10.1-0001-ar2
Current version of cloudscale: 11.0
Started annotating resources in netbackup namespace for cloudscale upgrade...
Annotation completed
Starting upgrade of cloudscale environment in namespace netbackup...
Retrieving configuration values from existing Fluentbit and PostgreSQL Helm charts.
Successfully retrieved and saved configuration values for PostgreSQL and FluentBit Helm charts.
PostgreSQL values have been merged successfully.
Fluentbit values have been merged successfully.
Successfully loaded and merged the Environment CR.
All user-provided tags have been successfully validated in the generated cloudscale-values.yaml file.
Cloudscale values have been successfully saved to cloudscale-values.yaml
Successfully initiated the upgrade of environment in netbackup namespace.
Ensure all pods are up & running before using NetBackup.
For kubectl plugin specific logs, check the log at /home/user1/kubectl-plugin/setup-cloudscale.log
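The plugin output ends by asking you to ensure that all pods are up and running. One way to check this is sketched below, under stated assumptions: the helper name `pods_not_ready` is hypothetical, the namespace `netbackup` is the one entered in the example session, and a pod counts as healthy only if it is Completed or Running with every container ready.

```shell
#!/bin/sh
# pods_not_ready (hypothetical helper, for illustration only)
# Reads `kubectl get pods --no-headers` output on stdin and prints the name of
# every pod that is neither Completed nor Running with all containers ready.
pods_not_ready() {
  awk '
    NF < 3 { next }                       # skip blank or malformed lines
    $3 == "Completed" { next }            # finished job pods are not expected to be ready
    {
      split($2, ready, "/")               # READY column, for example "1/2"
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }
  '
}

# Example usage (requires kubectl and cluster access):
#   kubectl get pods -n netbackup --no-headers | pods_not_ready
# An empty result means every pod in the namespace is running and ready.
```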
Post upgrade, the flexsnap-listener pod is migrated to the cp control nodepool as per the node selector settings in the environment CR. To reduce the total cost of ownership (TCO), the user can change the minimum size of the CP data nodepool to 0 through the portal.
Post upgrade, for cost optimization, user has the option to change the value of of media server CR to 0. User can change the minimum size of media nodepool to 0 through the portal.
(For Azure) Post upgrade, patch StorageClass and PV with nconnect=4 for better NFS performance.
The nconnect mount option in Azure is a client-side NFS feature that enables multiple TCP connections between an NFS client and the NFS endpoint. This improves large-scale performance by reducing the number of clients needed to reach the maximum bandwidth of high-capacity SSD file shares. Azure Files supports up to 16 nconnect channels, with nconnect=4 recommended for optimal performance. For this reason, it is recommended to set nconnect=4 in the StorageClass specification for Cloud Scale.
To patch StorageClass and PV with nconnect=4 for better NFS performance
- Patch the StorageClass (affects newly created PVs):
Identify the StorageClass used by the NFS-mounted volume.
Use the following command to list the PVCs in the netbackup namespace:
$ kubectl get pvc -n netbackup
Example output:
NAME                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      VOLUMEATTRIBUTESCLASS   AGE
catalog-nbu-primary-0      Bound    pvc-37f2d9be-84e8-4e6c-8463-dceb16642370   110Gi      RWX            nb-file-premium   <unset>                 19d
certauth-pvc               Bound    pvc-defd5f28-0be5-4e70-8453-0e2472bfca06   1Gi        RWO            nb-disk-premium   <unset>                 19d
cloudpoint-pvc             Bound    pvc-e910737e-4d69-4e0b-8fc5-d57e5787eace   5Gi        RWX            nb-file-premium   <unset>                 19d
data-flexsnap-rabbitmq-0   Bound    pvc-794271f2-5440-47b8-b285-2929b0ab6669   5Gi        RWO            nb-disk-premium   <unset>                 19d
From this output, the PVC catalog-nbu-primary-0 uses the StorageClass nb-file-premium, so this is the StorageClass to patch.
Patch the StorageClass.
Apply the nconnect=4 mount option:
kubectl patch storageclass nb-file-premium --type merge -p '{"mountOptions":["nconnect=4"]}'
The following content must be displayed:
storageclass.storage.k8s.io/nb-file-premium patched
Verify the StorageClass patch.
kubectl get sc nb-file-premium -o yaml | sed -n '/mountOptions/,$p'
The following content must be displayed:
mountOptions:
- nconnect=4
- Patch the existing PV (applies to currently mounted volumes):
Use the following command to export the PV name:
export PV=pvc-37f2d9be-84e8-4e6c-8463-dceb16642370
Use the following command to check whether the PV already has the mount option:
kubectl get pv "$PV" -o json | jq -e '.spec.mountOptions // [] | index("nconnect=4")'
If the output is null, the option is not present.
Use the following command to patch the PV:
kubectl patch pv "$PV" --type merge -p '{"spec": {"mountOptions": ["nconnect=4"]}}'
Example output:
persistentvolume/<PV_NAME> patched
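In the example PVC listing, more than one PVC (catalog-nbu-primary-0 and cloudpoint-pvc) uses the nb-file-premium StorageClass, so more than one existing PV may need the same patch. The following is a minimal sketch for collecting those PV names. The helper name `pvs_for_storageclass` and the custom column headings are hypothetical, and the namespace and StorageClass names are taken from the example above.

```shell
#!/bin/sh
# pvs_for_storageclass SC (hypothetical helper, for illustration only)
# Reads rows of "PVC STORAGECLASS PV" on stdin and prints the PV name of every
# row whose StorageClass equals SC.
pvs_for_storageclass() {
  awk -v sc="$1" '$2 == sc { print $3 }'
}

# Example usage (requires kubectl and cluster access):
#   kubectl get pvc -n netbackup --no-headers \
#     -o custom-columns=PVC:.metadata.name,SC:.spec.storageClassName,PV:.spec.volumeName |
#     pvs_for_storageclass nb-file-premium |
#     while read -r PV; do
#       kubectl patch pv "$PV" --type merge -p '{"spec": {"mountOptions": ["nconnect=4"]}}'
#     done
```

Note that a merge patch replaces the whole mountOptions list, so review any mount options already set on a PV before patching it.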
- Restart pods to apply the new NFS mount option:
Use the following command to delete each MediaServer replica pod:
export POD=media1-media-0
kubectl delete pod $POD -n netbackup
The pod will restart and re-mount with the updated NFS settings.
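When there are several media server replicas, the delete step has to be repeated for each pod. The sketch below selects the replica pods by name. It assumes that replica pods are named with a common prefix followed by a replica number, as in the media1-media-0 example; the helper name `pods_with_prefix` is hypothetical.

```shell
#!/bin/sh
# pods_with_prefix PREFIX (hypothetical helper, for illustration only)
# Reads `kubectl get pods -o name` output on stdin (lines such as
# "pod/media1-media-0") and prints the names of pods that consist of PREFIX
# followed by a hyphen and a replica number.
pods_with_prefix() {
  sed -n "s|^pod/\($1-[0-9][0-9]*\)\$|\1|p"
}

# Example usage (requires kubectl and cluster access):
#   kubectl get pods -n netbackup -o name | pods_with_prefix media1-media |
#     while read -r POD; do
#       kubectl delete pod "$POD" -n netbackup
#     done
```

Depending on the workload, it may be preferable to confirm that each recreated pod is running again before deleting the next one.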
- Verify nconnect=4 inside the pod:
Once the pod has restarted, run:
mount | grep nfs
Example output:
... nconnect=4 ...
- Apply changes to primary nodes:
Scale down the Primary nodepool to 0. This forces NFS unmounting. Then scale it back up so that all primary pods remount the storage with the updated nconnect=4 option.
Verify from primary pods:
mount | grep nfs
Example output:
... nconnect=4 ...
- Verify from all other related pods:
For example, kubectl exec -it nbu-nbatd-0 -n netbackup -- mount | grep nfs
The nconnect=4 option must be visible on all relevant NFS mount paths.
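Checking the mount output by eye across many pods is error-prone. The sketch below lists only the NFS mounts that are still missing the option. It assumes the Linux `mount` output format ("SRC on DIR type FSTYPE (OPTIONS)") and mount points without spaces; the helper name `nfs_missing_nconnect` is hypothetical.

```shell
#!/bin/sh
# nfs_missing_nconnect (hypothetical helper, for illustration only)
# Reads `mount` output on stdin and prints the mount point of every NFS mount
# whose option list does not contain nconnect=4.
nfs_missing_nconnect() {
  awk '$5 ~ /^nfs/ && $6 !~ /[(,]nconnect=4[,)]/ { print $3 }'
}

# Example usage (requires kubectl and cluster access):
#   kubectl exec nbu-nbatd-0 -n netbackup -- mount | nfs_missing_nconnect
# An empty result means every NFS mount in the pod uses nconnect=4.
```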