Cohesity Cloud Scale Technology Manual Deployment Guide for Kubernetes Clusters
- Introduction
- Section I. Configurations
- Prerequisites
- Preparing the environment for NetBackup installation on Kubernetes cluster
- Prerequisites for Snapshot Manager (AKS/EKS)
- Prerequisites for Kubernetes cluster configuration
- Prerequisites for Cloud Scale configuration
- Prerequisites for deploying environment operators
- Prerequisites for using private registry
- Recommendations and Limitations
- Configurations
- Configuration of key parameters in Cloud Scale deployments
- Tuning touch files
- Setting maximum jobs per client
- Setting maximum jobs per media server
- Enabling intelligent catalog archiving
- Enabling security settings
- Configuring email server
- Reducing catalog storage management
- Configuring zone redundancy
- Enabling client-side deduplication capabilities
- Parameters for logging (fluentbit)
- Managing media server configurations in Web UI
- Prerequisites
- Section II. Deployment
- Section III. Monitoring and Management
- Monitoring NetBackup
- Monitoring Snapshot Manager
- Monitoring fluentbit
- Monitoring MSDP Scaleout
- Managing NetBackup
- Managing the Load Balancer service
- Managing PostgreSQL DBaaS
- Managing logging
- Performing catalog backup and recovery
- Section IV. Maintenance
- PostgreSQL DBaaS Maintenance
- Patching mechanism for primary, media servers, fluentbit pods, and postgres pods
- Upgrading
- Cloud Scale Disaster Recovery
- Uninstalling
- Troubleshooting
- Troubleshooting AKS and EKS issues
- View the list of operator resources
- View the list of product resources
- View operator logs
- View primary logs
- Socket connection failure
- Resolving an issue where external IP address is not assigned to a NetBackup server's load balancer services
- Resolving the issue where the NetBackup server pod is not scheduled for a long time
- Resolving an issue where the Storage class does not exist
- Resolving an issue where the primary server or media server deployment does not proceed
- Resolving an issue of failed probes
- Resolving issues when media server PVs are deleted
- Resolving an issue related to insufficient storage
- Resolving an issue related to invalid nodepool
- Resolving an issue related to KMS database
- Resolving an issue related to pulling an image from the container registry
- Resolving an issue related to recovery of data
- Check primary server status
- Pod status field shows as pending
- Ensure that the container is running the patched image
- Getting EEB information from an image, a running container, or persistent data
- Resolving the certificate error issue in NetBackup operator pod logs
- Pod restart failure due to liveness probe time-out
- NetBackup messaging queue broker takes more time to start
- Host mapping conflict in NetBackup
- Issue with capacity licensing reporting taking a long time
- Local connection is getting treated as insecure connection
- Backing up data from Primary server's /mnt/nbdata/ directory fails with primary server as a client
- Storage server not supporting Instant Access capability on Web UI after upgrading NetBackup
- Taint, Toleration, and Node affinity related issues in cpServer
- Operations performed on cpServer in cloudscale-values.yaml file are not reflected
- Elastic media server related issues
- Failed to register Snapshot Manager with NetBackup
- Post Kubernetes cluster restart, flexsnap-listener pod went into CrashLoopBackoff state or pods were unable to connect to flexsnap-rabbitmq
- Post Kubernetes cluster restart, issues observed in case of containerized Postgres deployment
- Request router logs
- Issues with NBPEM/NBJM
- Issues with logging feature for Cloud Scale
- The flexsnap-listener pod is unable to communicate with RabbitMQ
- Job remains in queue for a long time
- Extracting logs if the nbwsapp or log-viewer pods are down
- Helm installation failed with bundle error
- Deployment fails with private container registry and Postgres fails to pull the images
- Troubleshooting AKS-specific issues
- Troubleshooting EKS-specific issues
- Resolving the primary server connection issue
- NetBackup Snapshot Manager deployment on EKS fails
- Wrong EFS ID is provided in cloudscale-values.yaml file
- Primary pod is in ContainerCreating state
- Webhook displays an error for PV not found
- Cluster Autoscaler initialization issue
- Catalog backup job fails with an error (Status 9202)
- Troubleshooting issue for bootstrapper pod
- Troubleshooting issues for kubectl plugin
- Appendix A. CR template
- Appendix B. MSDP Scaleout
- About MSDP Scaleout
- Prerequisites for MSDP Scaleout (AKS/EKS)
- Limitations in MSDP Scaleout
- MSDP Scaleout configuration
- Installing the docker images and binaries for MSDP Scaleout (without environment operators or Helm charts)
- Deploying MSDP Scaleout
- Managing MSDP Scaleout
- MSDP Scaleout maintenance
Environment backup
Note down the MSDP operator Namespace, NodeSelector, StorageClassName, Tolerations and Image tag as follows:
Obtain the name of the msdp operator statefulset using the following command:
kubectl get statefulset -n <msdp-operator-system-namespace>
Use the following command to back up the MSDP operator Image tag, Tolerations, and NodeSelector:
kubectl get sts <msdp-operator-statefulset-name> -n <msdp-operator-sample-namespace> -o=jsonpath='{"Namespace :"}{$.metadata.namespace}{$"\nImage :"}{$.spec.template.spec.containers[0].image}{$"\nNodeSelector :"}{$.spec.template.spec.nodeSelector}{$"\nTolerations :"}{$.spec.template.spec.tolerations[2]}{$"\nStorageClassName :"}{$.spec.volumeClaimTemplates[0].spec.storageClassName}{$"\n"}'
From the output, note down the Image tag, StorageClassName, Tolerations and NodeSelector:
Sample Output: Namespace :msdp-operator-system Image :nbuk8sreg.azurecr.io/msdp-operator:21.0 NodeSelector :{"agentpool":"nbuxpool"} Tolerations :{"key":"agentpool","operator":"Equal","value":"nbuxpool"} StorageClassName :nb-disk-premium
If toleration is not provided for the MSDP operator, then use the following command:
kubectl get sts <msdp-operator-statefulset-name> -n <msdp-operator-sample-namespace> -o=jsonpath='{"Namespace :"}{$.metadata.namespace}{$"\nImage :"}{$.spec.template.spec.containers[0].image}{$"\nNodeSelector :"}{$.spec.template.spec.nodeSelector}{$"\nStorageClassName :"}{$.spec.volumeClaimTemplates[0].spec.storageClassName}{$"\n"}'
Sample Output: Namespace :msdp-operator-system Image :nbuk8sreg.azurecr.io/msdp-operator:21.0 NodeSelector :{"agentpool":"nbuxpool"} StorageClassName :nb-disk-premium
Back up the above msdp-operator storageClass using the following command:
kubectl get sc <msdp-operator-storageclass-name> -o yaml > msdpopstorageclass_backup.yaml
Note down the NetBackup operator Namespace, NodeSelector, Tolerations and Image tag as follows:
Obtain the name of the NetBackup operator deployment using the following command:
kubectl get deployment -n <netbackup-operator-system-namespace>
Use the following command to back up the NetBackup operator Image tag, Tolerations, and NodeSelector:
kubectl get deployment <netbackup-operator-deployment-name> -n <netbackup-operator-system-namespace> -o=jsonpath='{"Namespace :"}{$.metadata.namespace}{$"\nImage :"}{$.spec.template.spec.containers[0].image}{$"\nNodeSelector :"}{$.spec.template.spec.nodeSelector}{$"\nTolerations: "}{$.spec.template.spec.tolerations}{$"\n"}'
From the output, note down the Image tag, Tolerations and NodeSelector:
Sample Output: Namespace :netbackup-operator-system Image :nbuk8sreg.azurecr.io/netbackup/operator:11.1.x.x.xxxx NodeSelector :{"agentpool":"agentpool"} Tolerations: [{"key":"agentpool","operator":"Equal","value":"agentpool"}]
Note down the flexsnap-operator Namespace, NodeSelector, Tolerations and Image tag as follows:
Obtain the name of the flexsnap-operator deployment using the following command:
kubectl get deployment -n <netbackup-operator-system-namespace>
Use the following command to back up the flexsnap operator Image tag, Tolerations, and NodeSelector:
kubectl get deployment <flexsnap-operator-deployment-name> -n <netbackup-operator-system-namespace> -o=jsonpath='{"Namespace :"}{$.metadata.namespace}{$"\nImage :"}{$.spec.template.spec.containers[0].image}{$"\nNodeSelector :"}{$.spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0]}{$"\nTolerations :"}{$.spec.template.spec.tolerations}{$"\n"}'
From the output, note down the Image tag, Tolerations and NodeSelector:
Sample Output: Namespace :netbackup-operator-system Image :nbuk8sreg.azurecr.io/veritas/flexsnap-deploy:11.1.x.x.xxxx NodeSelector :{"key":"agentpool","operator":"In","values":["agentpool"]} Tolerations :[{"effect":"NoSchedule","key":"agentpool","operator":"Equal","value":"agentpool"}]
(For DBaaS) Note the FQDN of the Postgres server created.
(Applicable only if unified container is created) Note the Postgres unified container image tag, containerPort:
kubectl get statefulset.apps/nb-postgresql -n <sample-namespace> -o=jsonpath='{$"\nImage :"}{$.spec.template.spec.containers[0].image}{$"\ncontainerPort :"}{$.spec.template.spec.containers[0].ports[0].containerPort}{$"\n"}'
Sample output:
Image :cpautomation.azurecr.io/netbackup/postgresql:16.10.1.0-0001-DR1 containerPort :13787
Obtain the fluentbit image tags and nodeselector using the following command:
kubectl get deployment.apps/nb-fluentbit-collector -n netbackup -o=jsonpath='{$"\nImage :"}{$.spec.template.spec.containers[0].image}{$"\nImage2 :"}{$.spec.template.spec.containers[1].image}{"\n"}'
Sample output:
Image :cpautomation.azurecr.io/netbackup/fluentbit:11.1.x-xxxx Image2 :cpautomation.azurecr.io/netbackup/fluentbit-log-cleanup:11.1.x-xxxx
Take a backup of the operator-values.yaml file using the following command:
helm get values operators -n netbackup-operator-system > operator-values.yaml
Or
Save the operator-values.yaml file.
Take a backup of the cloudscale-values.yaml file using the following command:
helm get values cloudscale -n netbackup > cloudscale-values.yaml
Or
Save the cloudscale-values.yaml file from kubectl-plugin.
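The two helm backups above can be wrapped in a small script that adds a timestamp suffix, so repeated backups do not overwrite each other. This is an illustrative sketch only (the timestamped file names are our addition, not part of the documented procedure); it prints the commands rather than executing them, so it can be reviewed before use in a live cluster:

```shell
#!/bin/sh
# Compose timestamped backup commands for both Helm value files.
# Release names and namespaces are taken from the steps above.
ts=$(date +%Y%m%d-%H%M%S)
op_cmd="helm get values operators -n netbackup-operator-system > operator-values-${ts}.yaml"
cs_cmd="helm get values cloudscale -n netbackup > cloudscale-values-${ts}.yaml"
# In a live cluster, run them with: eval "$op_cmd"; eval "$cs_cmd"
echo "$op_cmd"
echo "$cs_cmd"
```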
Note down the values of spec, cpServer, storage, log, and storageClassName from the cloudscale-values.yaml file using the following command:
kubectl get sc nb-file-premium -o yaml > CPServerLog_storageclass_backup.yaml
Search for the storageClassName entries in the cloudscale-values.yaml file (for example, nb-disk-standardssd) and provide these storage class names in the following command:
kubectl get sc <storageclass-name1> <storageclass-name2> <storageclass-name3> -o yaml > storageclass_backup.yaml
For example, kubectl get sc nb-disk-standardssd -o yaml > storageclass_backup.yaml
Note down and save the required values (names) of the secrets obtained from the cloudscale-values.yaml file in the above step. For example:
credSecretName: primary-credential-secret
Save the secrets yaml file as follows:
kubectl get secret <secret-name1> <secret-name2> <secret-name3> -n <sample-namespace> -o yaml > secret_backup.yaml
For example, kubectl get secret primary-credential-secret kms-secret example-key-secret -n example-ns -o yaml > secret_backup.yaml
Note:
(For DBaaS) The primary-credential-secret and kms-secret key values will not be present in the cloudscale-values.yaml file that has been backed up. Use the helm command from the above step to get the values; by default, helm uses the values provided during the deployment.
Save the secrets named MSDP credential and drInfoSecret during creation, as the operator deletes these secrets after using them.
MSDP credential: Step 2 in the following section:
drInfoSecret: Step 2 in the following section:
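The combined command in the step above writes all secrets to a single file. An alternative sketch (our variation, not part of the documented procedure) backs up each secret to its own file, which makes selective restore of a single secret easier during recovery. The secret names and namespace are the examples from this guide; the commands are printed rather than executed:

```shell
#!/bin/sh
# Build one backup command per secret instead of a single combined file.
# Names and namespace below are the guide's examples, not fixed values.
ns="example-ns"
for s in primary-credential-secret kms-secret example-key-secret; do
  cmd="kubectl get secret $s -n $ns -o yaml > ${s}_backup.yaml"
  echo "$cmd"
done
```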
(For DBaaS) Note the password changed during DBaaS cluster deployment:
(For Azure) To check the OLD_DBADMINPASSWORD, execute the following commands after exec-ing into the primary pod:
[root@user-primary-0 mnt]# ls -a
.  ..  atdata  .db-cert  nbdata  nbdb  .nb-kmsdb  nblogs  .nb-pgdb  nbu-primary-env  .nb-user  .passphrase-secret  .password-secret  podemptydir  .token-secret
[root@user-primary-0 mnt]# cd .nb-pgdb/
[root@user-primary-0 .nb-pgdb]# ls
dbadminlogin  dbadminpassword  dbport  dbserver  pgbouncerport
[root@user-primary-0 .nb-pgdb]# cat dbadminlogin
dbadminlogin
[root@user-primary-0 .nb-pgdb]# cat dbadminpassword
5MrJrGaLSnDKxTJ0UiPvotqmbqSzqU5a
[root@user-primary-0 .nb-pgdb]# cat dbport
5432
[root@user-primary-0 .nb-pgdb]# cat dbserver
user-postgres.postgres.database.azure.com
[root@user-primary-0 .nb-pgdb]# cat pgbouncerport
6432
(For EKS) Log in to the AWS UI, navigate to Secrets Manager, and find the adminSecret. The naming convention for admin secrets is as follows:
admin-secret-<use cluster name remove prefix eks->
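The naming convention above (strip the `eks-` prefix from the cluster name, then prepend `admin-secret-`) can be sketched with shell parameter expansion. The cluster name used here is hypothetical; the derived name could then be passed to `aws secretsmanager get-secret-value`:

```shell
#!/bin/sh
# Derive the Secrets Manager admin-secret name from an EKS cluster name
# per the convention above. "eks-prod-cluster" is a hypothetical example.
cluster_name="eks-prod-cluster"
admin_secret="admin-secret-${cluster_name#eks-}"   # strip leading "eks-"
echo "$admin_secret"                               # prints admin-secret-prod-cluster
# To fetch it: aws secretsmanager get-secret-value --secret-id "$admin_secret"
```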
Note the values (names) of the secretProviderClass.
For example, dbSecretProviderClass: db-secret-provider-class
Save the secretProviderClass.yaml file using the following command:
kubectl get secretproviderclass <secretproviderclass-name> -n <sample-namespace> -o yaml > secretproviderclass_backup.yaml
Note:
The dbSecretProviderClass is an optional field. If it is not present in the cloudscale-values.yaml file, then skip this step.
Note the required values (names) and save the internal configmap yaml using the following command:
kubectl get configmap nbu-media-autoscaler-configmap flexsnap-conf nbuconf cs-config -n <sample-namespace> -o yaml > internalconfigmap_backup.yaml
Note:
The nbu-media-autoscaler-configmap is an optional internal configmap. If it is not present in the namespace, then remove nbu-media-autoscaler-configmap from the above command.
Take a note of the cert-manager and trust-manager versions, as the same versions must be used when deploying them during recovery.
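Since nbu-media-autoscaler-configmap is optional, the configmap backup command can be built conditionally so the same script works whether or not it exists. This is a hedged sketch: the kubectl existence check is shown as a comment so the sketch runs anywhere, and the namespace is a placeholder:

```shell
#!/bin/sh
# Build the configmap-backup command, including the optional
# nbu-media-autoscaler-configmap only when it is present.
ns="netbackup"   # substitute your namespace
cms="flexsnap-conf nbuconf cs-config"
has_autoscaler_cm=false
# In a live cluster, detect it with:
# kubectl get configmap nbu-media-autoscaler-configmap -n "$ns" >/dev/null 2>&1 && has_autoscaler_cm=true
if [ "$has_autoscaler_cm" = true ]; then
  cms="nbu-media-autoscaler-configmap $cms"
fi
cmd="kubectl get configmap $cms -n $ns -o yaml > internalconfigmap_backup.yaml"
echo "$cmd"
```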
Note the details of the cloud STU used for MSDP storage, such as the name of the bucket, volume, and credential, and the respective details added through Credential management in the UI.
(Applicable only for DBaaS based deployment environment) Snapshot Manager backup steps:
For AKS
In the Azure cloud portal, search for the disk (PV) to which psql-pvc is attached and create a snapshot of it in a resource group other than the cluster infra resource group. Note down this resource group and wait for the resource to be available.
Note:
The snapshot must be created in a resource group in a different availability zone, so that recovery is possible in case of zone failures or corruption.
Save the pgsql-pv.yaml file:
kubectl get pv | grep psql-pvc
pvc-079b631e-a905-4586-80b5-46acc7011669 30Gi RWO Retain Bound nbu/psql-pvc managed-csi-hdd 3h10m
kubectl describe pv <PV which is bound to psql-pvc> > pgsql-pv.yaml
For example, kubectl describe pv pvc-079b631e-a905-4586-80b5-46acc7011669 > pgsql-pv.yaml
Note down the snapshot ID, which will be used to create a disk from the snapshot during recovery.
Note:
A disk snapshot must be taken after every plugin addition, as the latest database is required to recover all the plugins during database recovery.
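The portal snapshot step above can also be performed with the Azure CLI, using `az snapshot create` with a `--source` disk ID obtained from `az disk show`. All resource group and snapshot names below are placeholders, not values from this guide; the command is printed rather than executed so it can be reviewed first:

```shell
#!/bin/sh
# Sketch: create the psql-pvc disk snapshot via the Azure CLI.
# Resource group and snapshot names are hypothetical placeholders.
disk_rg="cluster-infra-rg"   # resource group holding the PV disk
snap_rg="dr-backup-rg"       # different resource group, per the note above
disk_name="pvc-079b631e-a905-4586-80b5-46acc7011669"
cmd="az snapshot create --resource-group $snap_rg --name psql-snap --source \$(az disk show --resource-group $disk_rg --name $disk_name --query id -o tsv)"
echo "$cmd"
```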
For EKS
Describe the PV attached to psql-pvc and save the VolumeID (for example, vol-xxxxxxxxxxxxxxx), storage class name, and availability zone (AZ) from the output of the following command:
kubectl get pv | grep psql-pvc
pvc-079b631e-a905-4586-80b5-46acc7011669 30Gi RWO Retain Bound nbu/psql-pvc managed-csi-hdd 3h10m
kubectl describe pv <PV which is bound to psql-pvc> > pgsql-pv.yaml
For example, kubectl describe pv pvc-079b631e-a905-4586-80b5-46acc7011669 > pgsql-pv.yaml
Search for the above VolumeID in the AWS cloud portal.
Create a snapshot (expand the drop-down) from the volume and wait for completion. Note down the snapshot ID (for example, snap-xxxxxxxxxxxx).
Note:
A disk snapshot must be taken after every plugin addition, as the latest database is required to recover all the plugins during database recovery.
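The console snapshot step above can also be performed with the AWS CLI using `aws ec2 create-snapshot`. The volume ID below is the placeholder from the example output; the command is printed rather than executed:

```shell
#!/bin/sh
# Sketch: create the EBS snapshot of the psql-pvc volume via the AWS CLI.
# The volume ID is the placeholder from the example output above.
volume_id="vol-xxxxxxxxxxxxxxx"
cmd="aws ec2 create-snapshot --volume-id $volume_id --description 'psql-pvc DR snapshot'"
echo "$cmd"
# To wait for completion: aws ec2 wait snapshot-completed --snapshot-ids <snap-id>
```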
Take the backup of the catalog policy at /mnt/nbdata/DrPackages on the master server.
Note:
For manual deployment using Helm charts, ensure that you save the operator-values.yaml and cloudscale-values.yaml files. These files are used at the time of recovery.