Cohesity Cloud Scale Technology Manual Deployment Guide for Kubernetes Clusters
- Introduction
- Section I. Configurations
- Prerequisites
- Preparing the environment for NetBackup installation on Kubernetes cluster
- Prerequisites for Snapshot Manager (AKS/EKS)
- Prerequisites for Kubernetes cluster configuration
- Prerequisites for Cloud Scale configuration
- Prerequisites for deploying environment operators
- Prerequisites for using private registry
- Recommendations and Limitations
- Configurations
- Configuration of key parameters in Cloud Scale deployments
- Tuning touch files
- Setting maximum jobs per client
- Setting maximum jobs per media server
- Enabling intelligent catalog archiving
- Enabling security settings
- Configuring email server
- Reducing catalog storage management
- Configuring zone redundancy
- Enabling client-side deduplication capabilities
- Parameters for logging (fluentbit)
- Managing media server configurations in Web UI
- Prerequisites
- Section II. Deployment
- Section III. Monitoring and Management
- Monitoring NetBackup
- Monitoring Snapshot Manager
- Monitoring fluentbit
- Monitoring MSDP Scaleout
- Managing NetBackup
- Managing the Load Balancer service
- Managing PostgreSQL DBaaS
- Managing logging
- Performing catalog backup and recovery
- Section IV. Maintenance
- PostgreSQL DBaaS Maintenance
- Patching mechanism for primary, media servers, fluentbit pods, and postgres pods
- Upgrading
- Cloud Scale Disaster Recovery
- Uninstalling
- Troubleshooting
- Troubleshooting AKS and EKS issues
- View the list of operator resources
- View the list of product resources
- View operator logs
- View primary logs
- Socket connection failure
- Resolving an issue where external IP address is not assigned to a NetBackup server's load balancer services
- Resolving the issue where the NetBackup server pod is not scheduled for a long time
- Resolving an issue where the Storage class does not exist
- Resolving an issue where the primary server or media server deployment does not proceed
- Resolving an issue of failed probes
- Resolving issues when media server PVs are deleted
- Resolving an issue related to insufficient storage
- Resolving an issue related to invalid nodepool
- Resolve an issue related to KMS database
- Resolve an issue related to pulling an image from the container registry
- Resolving an issue related to recovery of data
- Check primary server status
- Pod status field shows as pending
- Ensure that the container is running the patched image
- Getting EEB information from an image, a running container, or persistent data
- Resolving the certificate error issue in NetBackup operator pod logs
- Pod restart failure due to liveness probe time-out
- NetBackup messaging queue broker takes more time to start
- Host mapping conflict in NetBackup
- Issue with capacity licensing reporting taking a long time
- Local connection is getting treated as insecure connection
- Backing up data from Primary server's /mnt/nbdata/ directory fails with primary server as a client
- Storage server not supporting Instant Access capability on Web UI after upgrading NetBackup
- Taint, Toleration, and Node affinity related issues in cpServer
- Operations performed on cpServer in environment.yaml file are not reflected
- Elastic media server related issues
- Failed to register Snapshot Manager with NetBackup
- Post Kubernetes cluster restart, flexsnap-listener pod went into CrashLoopBackoff state or pods were unable to connect to flexsnap-rabbitmq
- Post Kubernetes cluster restart, issues observed in case of containerized Postgres deployment
- Request router logs
- Issues with NBPEM/NBJM
- Issues with logging feature for Cloud Scale
- The flexsnap-listener pod is unable to communicate with RabbitMQ
- Job remains in queue for long time
- Extracting logs if the nbwsapp or log-viewer pods are down
- Helm installation failed with bundle error
- Deployment fails with private container registry and Postgres fails to pull the images
- Troubleshooting AKS-specific issues
- Troubleshooting EKS-specific issues
- Resolving the primary server connection issue
- NetBackup Snapshot Manager deployment on EKS fails
- Wrong EFS ID is provided in cloudscale-values.yaml file
- Primary pod is in ContainerCreating state
- Webhook displays an error for PV not found
- Cluster Autoscaler initialization issue
- Catalog backup job fails with an error (Status 9202)
- Troubleshooting issue for bootstrapper pod
- Troubleshooting issues for kubectl plugin
- Appendix A. CR template
- Appendix B. MSDP Scaleout
- About MSDP Scaleout
- Prerequisites for MSDP Scaleout (AKS/EKS)
- Limitations in MSDP Scaleout
- MSDP Scaleout configuration
- Installing the docker images and binaries for MSDP Scaleout (without environment operators or Helm charts)
- Deploying MSDP Scaleout
- Managing MSDP Scaleout
- MSDP Scaleout maintenance
MSDP-X and Primary server corrupted
- Note the storage server, cloud LSU, and cloud bucket name, as well as the DR passphrase.
- Copy the DRPackages files (packages) from the pod to the local VM, if they were not received over email, using the following command:
kubectl cp <primary-pod-namespace>/<primary-pod-name>:/mnt/nbdb/usr/openv/drpackage_<storageservername> <Path_where_to_copy_on_host_machine>
- Delete the corrupted MSDP and Primary server by running the following command:
kubectl delete -f environment.yaml -n <namespace>
Note:
Perform this step carefully, as it deletes NetBackup.
- Clean the PV and PVCs of primary and MSDP server as follows:
Get names of PV attached to primary and MSDP server PVC (catalog, log and data) using the kubectl get pvc -n <namespace> -o wide command.
Delete primary and MSDP server PVC (catalog, log and data) using the kubectl delete pvc <pvc-name> -n <namespace> command.
Delete the PV linked to primary server PVC using the kubectl delete pv <pv-name> command.
- (EKS-specific) Navigate to mounted EFS directory and delete the content from primary_catalog folder by running the rm -rf /efs/* command.
- Modify the environment.yaml file by setting the paused: true field in the MSDP and Media sections: change the CR spec from paused: false to paused: true for MSDP Scaleout and the media servers, then save the file.
Note:
Ensure that only the primary server is deployed.
Apply the modified environment.yaml file using the following command:
kubectl apply -f environment.yaml -n <namespace>
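As an illustration, the paused fields in environment.yaml might look like the following fragment; the exact section and field layout depends on your Environment CR schema, so treat this as a hedged sketch rather than an authoritative template:

```yaml
# Illustrative fragment only -- section names depend on your Environment CR schema.
spec:
  msdpScaleouts:
    - paused: true     # changed from paused: false
  mediaServers:
    - paused: true     # changed from paused: false
```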
- After the primary server is up and running, perform the following:
Execute the kubectl exec -it -n <namespace> <primary-pod-name> -- /bin/bash command in the primary server pod.
Increase the debug log level on the primary server.
Create a DRPackages directory at a persisted location using the mkdir /mnt/nblogs/DRPackages command.
- Copy the DR files copied earlier to the primary pod at /mnt/nblogs/DRPackages using the kubectl cp <Path_of_DRPackages_on_host_machine> <primary-pod-namespace>/<primary-pod-name>:/mnt/nblogs/DRPackages command.
- Execute the following steps (after exec) in the primary server pod:
Change ownership of the files in /mnt/nblogs/DRPackages using the chown nbsvcusr:nbsvcusr <file-name> command.
Deactivate the NetBackup health probes using the /opt/veritas/vxapp-manage/nb-health deactivate command.
Stop the NetBackup services using the /usr/openv/netbackup/bin/bp.kill_all command.
Execute the /usr/openv/netbackup/bin/admincmd/nbhostidentity -import -infile /mnt/nblogs/DRPackages/<filename>.drpkg command.
Clear NetBackup host cache by running the bpclntcmd -clear_host_cache command.
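The in-pod sequence above can be collected into one sketch. The function below is illustrative (not a shipped script): it defaults to a dry run that only prints the commands; set RUN to empty to execute them inside the primary pod.

```shell
# Sketch of the in-pod recovery sequence (run inside the primary server pod).
# RUN=echo (the default) gives a dry run; set RUN= (empty) to execute for real.
RUN="${RUN:-echo}"

recover_host_identity() {
  local pkg_dir="/mnt/nblogs/DRPackages"
  local pkg="$1"   # the DR package file name, e.g. <filename>.drpkg
  $RUN chown nbsvcusr:nbsvcusr "$pkg_dir/$pkg"
  # Stop the health probes so they do not restart the pod mid-recovery.
  $RUN /opt/veritas/vxapp-manage/nb-health deactivate
  # Stop the NetBackup services before importing the host identity.
  $RUN /usr/openv/netbackup/bin/bp.kill_all
  $RUN /usr/openv/netbackup/bin/admincmd/nbhostidentity -import -infile "$pkg_dir/$pkg"
  # Clear cached host mappings so the imported identity takes effect.
  $RUN bpclntcmd -clear_host_cache
}
```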
Restart the pods as follows:
Navigate to the VRTSk8s-netbackup-<version>/scripts folder.
Run the cloudscale_restart.sh script as follows:
./cloudscale_restart.sh <action> <namespace>
Provide the namespace and the required action:
stop: Stops all the services under primary server (waits until all the services are stopped).
start: Starts all the services and waits until the services are up and running under primary server.
restart: Stops the services and waits until all the services are down. Then starts all the services and waits until the services are up and running.
Note:
Ignore the policy job pod if it does not come up in the running state; it starts once the primary services start.
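As an illustrative sketch (not the shipped script), the action validation that a wrapper around cloudscale_restart.sh might perform looks like this; the function only builds and prints the invocation:

```shell
# Sketch: validate the action argument before invoking cloudscale_restart.sh.
# Hypothetical helper; the shipped script in VRTSk8s-netbackup-<version>/scripts
# is the authority on accepted arguments.
restart_usage() {
  local action="$1" namespace="$2"
  case "$action" in
    stop|start|restart)
      echo "./cloudscale_restart.sh $action $namespace" ;;
    *)
      echo "usage: cloudscale_restart.sh {stop|start|restart} <namespace>" >&2
      return 1 ;;
  esac
}
```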
Refresh the certificate revocation list using the /usr/openv/netbackup/bin/nbcertcmd -getcrl command.
- Update the SHA fingerprint in the primary CR's status by running the primary server reconciler as follows:
Run the following command to pause the primary server CR:
helm upgrade cloudscale cloudscale-<version>.tgz -n netbackup --reuse-values --set environment.primary.paused=true
Run the following command to un-pause the primary server CR:
helm upgrade cloudscale cloudscale-<version>.tgz -n netbackup --reuse-values --set environment.primary.paused=false
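The pause/un-pause cycle above can be wrapped in a single helper; this is a sketch, with the chart file name and namespace as placeholders, and --reuse-values preserving all other release settings:

```shell
# Sketch: bounce the primary server reconciler by pausing, then un-pausing, the CR.
bounce_primary_reconciler() {
  local chart="$1" namespace="$2"   # e.g. cloudscale-<version>.tgz, netbackup
  helm upgrade cloudscale "$chart" -n "$namespace" --reuse-values \
    --set environment.primary.paused=true
  helm upgrade cloudscale "$chart" -n "$namespace" --reuse-values \
    --set environment.primary.paused=false
}
```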
- From Web UI, allow reissue of token from primary server for MSDP, media and Snapshot Manager server as follows:
Navigate to Security > Host Mappings for the MSDP storage server and select Allow Auto reissue Certificate.
Repeat this for media and Snapshot Manager server entries.
- Edit the environment using kubectl edit environment -n <namespace> command and change paused field to false for MSDP.
- Perform the procedure from step 2 in the following section:
- Edit the environment CR and change paused: false for the media server.
- Once the media server pods are ready, perform a full catalog recovery using one of the following options:
Trigger a catalog recovery from the Web UI.
Or
Exec into the primary pod and run the bprecover -wizard command.
- Once recovery is completed, restart the NetBackup services:
Stop NetBackup services using the /usr/openv/netbackup/bin/bp.kill_all command.
Start NetBackup services using the /usr/openv/netbackup/bin/bp.start_all command.
- Activate NetBackup health probes using the /opt/veritas/vxapp-manage/nb-health activate command.
- Verify, back up, and restore backup images on the NetBackup server to confirm that the MSDP-X cluster has recovered.
- Verify that the Primary, Media, MSDP, and Snapshot Manager servers are up and running.