NetBackup™ Deployment Guide for Kubernetes Clusters
- Introduction
- Section I. Configurations
- Prerequisites
- Recommendations and Limitations
- Configurations
- Configuration of key parameters in Cloud Scale deployments
- Tuning touch files
- Setting maximum jobs per client
- Setting maximum jobs per media server
- Enabling intelligent catalog archiving
- Enabling security settings
- Configuring email server
- Reducing catalog storage management
- Configuring zone redundancy
- Enabling client-side deduplication capabilities
- Parameters for logging (fluentbit)
- Section II. Deployment
- Section III. Monitoring and Management
- Monitoring NetBackup
- Monitoring Snapshot Manager
- Monitoring fluentbit
- Monitoring MSDP Scaleout
- Managing NetBackup
- Managing the Load Balancer service
- Managing PostgreSQL DBaaS
- Managing fluentbit
- Performing catalog backup and recovery
- Section IV. Maintenance
- PostgreSQL DBaaS Maintenance
- Patching mechanism for primary, media servers, fluentbit pods, and postgres pods
- Upgrading
- Cloud Scale Disaster Recovery
- Uninstalling
- Troubleshooting
- Troubleshooting AKS and EKS issues
- View the list of operator resources
- View the list of product resources
- View operator logs
- View primary logs
- Socket connection failure
- Resolving an issue where an external IP address is not assigned to a NetBackup server's load balancer services
- Resolving the issue where the NetBackup server pod is not scheduled for a long time
- Resolving an issue where the Storage class does not exist
- Resolving an issue where the primary server or media server deployment does not proceed
- Resolving an issue of failed probes
- Resolving token issues
- Resolving an issue related to insufficient storage
- Resolving an issue related to invalid nodepool
- Resolving a token expiry issue
- Resolve an issue related to KMS database
- Resolve an issue related to pulling an image from the container registry
- Resolving an issue related to recovery of data
- Check primary server status
- Pod status field shows as pending
- Ensure that the container is running the patched image
- Getting EEB information from an image, a running container, or persistent data
- Resolving the certificate error issue in NetBackup operator pod logs
- Pod restart failure due to liveness probe time-out
- NetBackup messaging queue broker takes more time to start
- Host mapping conflict in NetBackup
- Issue with capacity licensing reporting which takes a longer time
- Local connection is getting treated as insecure connection
- Primary pod is in pending state for a long duration
- Backing up data from Primary server's /mnt/nbdata/ directory fails with primary server as a client
- Storage server not supporting Instant Access capability on Web UI after upgrading NetBackup
- Taint, Toleration, and Node affinity related issues in cpServer
- Operations performed on cpServer in environment.yaml file are not reflected
- Elastic media server related issues
- Failed to register Snapshot Manager with NetBackup
- Post Kubernetes cluster restart, flexsnap-listener pod went into CrashLoopBackoff state or pods were unable to connect to flexsnap-rabbitmq
- Post Kubernetes cluster restart, issues observed in case of containerized Postgres deployment
- Request router logs
- Issues with NBPEM/NBJM
- Issues with logging feature for Cloud Scale
- The flexsnap-listener pod is unable to communicate with RabbitMQ
- Troubleshooting AKS-specific issues
- Troubleshooting EKS-specific issues
- Troubleshooting issue for bootstrapper pod
- Appendix A. CR template
- Appendix B. MSDP Scaleout
- About MSDP Scaleout
- Prerequisites for MSDP Scaleout (AKS/EKS)
- Limitations in MSDP Scaleout
- MSDP Scaleout configuration
- Installing the docker images and binaries for MSDP Scaleout (without environment operators or Helm charts)
- Deploying MSDP Scaleout
- Managing MSDP Scaleout
- MSDP Scaleout maintenance
Cluster specific settings
It is recommended to create a private Kubernetes cluster for the Cloud Scale deployment.
Ensure that the control plane or API server of the private Kubernetes cluster has an internal IP address.
Note:
Using a private cluster ensures that the network traffic between your API server and node pools remains on the private network only.
Select a Linux-based operating system for the control and data pool nodes.
A Linux-based operating system is supported only with its default settings.
Ensure that the cluster runs the latest Kubernetes version that is supported by Cloud Scale version 10.3 and later.
Autoscaling parameters
The autoscaling value for the node pool must always be set to True.
The minimum number of nodes in this node pool must be 1, and the maximum number can be obtained using the following formula:
Number of max nodes = Number of parallel backup and snapshot jobs to run / Minimum of (RAM per node in GB, max_jobs setting)
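The following minimal sketch illustrates the node-count formula above; the function name and sample figures are assumptions for this example only, not NetBackup settings.

```python
# Illustrative helper for the node-count formula above; names and sample
# values are assumptions for this sketch, not NetBackup settings.
import math

def max_nodes(parallel_jobs: int, ram_per_node_gb: int, max_jobs: int) -> int:
    """Number of max nodes = parallel jobs / min(RAM per node in GB, max_jobs)."""
    jobs_per_node = min(ram_per_node_gb, max_jobs)
    return math.ceil(parallel_jobs / jobs_per_node)

# Example: 64 parallel backup/snapshot jobs on 16 GB nodes with max_jobs = 8
print(max_nodes(64, 16, 8))  # -> 8
```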
Maximum pods per node setting
For Azure:
MAX pods per node = (RAM size in GB * 2) + number of Kubernetes and CSP pods (approximately 10) + 3 (listener + fluent collector + fluentbit)
For AWS:
Node size in AWS must be selected depending on the number of elastic network interfaces (ENIs) available with the node type. For more information on changing the value of max pods per node in AWS, refer to the AWS documentation.
Note:
If the max pods per node value is not sufficient, the max jobs per node can be reduced as described in the 'max_jobs tunable' content in the following section.
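As a hedged illustration of the Azure formula above, the helper below computes the value for a sample node; the default pod count and the 16 GB example are assumptions, not fixed product values.

```python
# Illustrative calculation of the Azure "max pods per node" formula above.
# The default pod count is an assumption taken from the formula, not a fixed product value.

def azure_max_pods_per_node(ram_gb: int, k8s_and_csp_pods: int = 10) -> int:
    # (RAM size in GB * 2) + Kubernetes/CSP pods + 3 (listener + fluent collector + fluentbit)
    return (ram_gb * 2) + k8s_and_csp_pods + 3

# Example: a 16 GB node with roughly 10 Kubernetes/CSP pods
print(azure_max_pods_per_node(16))  # -> 45
```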
Pool settings
NetBackup pool: Used for deployment of NetBackup primary services along with Snapshot Manager control plane services.
Minimum CPU and RAM per node: 4 CPUs and 16 GB RAM
cpdata pool: Used for deployment of Snapshot Manager data plane (dynamically created) services.
| Average size of VM to be backed up ** | RAM requirement (GB) for cpdata node | Number of CPUs | Tunable |
| --- | --- | --- | --- |
| <= 2 TB | 8 | 2 | |
| > 2 TB and < 4 TB | 8 | 2 | Max_jobs = 4 |
| > 4 TB and < 8 TB | 16 | 4 | Max_jobs = 5 |
| > 8 TB and < 16 TB | 16 | 4 | Max_jobs = 4 |
| > 16 TB and < 24 TB | 24 | 4 | Max_jobs = 3 |
| > 24 TB and < 32 TB | 32 | 4 | Max_jobs = 3 |
Note:
** If the customer has hosts of different sizes to protect, use the larger VM size as the average VM size.
Media pool: CPU and RAM per node: 4 CPUs and 16 GB RAM
MSDP pool: CPU and RAM per node: 4 CPUs and 16 GB RAM
max_jobs tunable: The max_jobs tunable parameter restricts the number of jobs that can run on a single Cloud Scale cpdata node.
The max_jobs value must be updated as follows:
$ kubectl edit configmap flexsnap-conf -n <nbux ns>
Add the following entry in the flexsnap.conf section:
[capability_limit]
max_jobs=16
For example,
~$ k describe cm flexsnap-conf
Name:         flexsnap-conf
Namespace:    nbux-002522
Labels:       <none>
Annotations:  <none>

Data
====
flexsnap.conf:
----
[agent]
id = agent.8308b7c831af4b0388fdd7f1d91541e0
[capability_limit]
max_jobs=16
Tuning account rate limit: To improve BFS performance, the API limits per AWS account can be updated according to the following formula:
X = Account rate limit
V1 = Number of VMs
S1 = Number of schedules per day
D1 = Data change rate (TB per incremental backup)

If ((S1 * D1 * V1) / 40) < 1, keep X = 1000 requests/sec; otherwise, X = (((S1 * D1 * V1) / 40) + 1) * 1000 requests/sec.
For example,
The default theoretical speed for the account is 43 TB/day (1000 requests per second x 86400 seconds in a day x 512 KB block size).
Consider a protection plan (PP) schedule frequency of one per day, with each VM around 1 TB in size.
If the backup window is the full day, the theoretical maximum number of full backups is 43 VMs/day.
With a 10% incremental change every day, the theoretical maximum is 380 incremental VMs/day, assuming all incrementals have a similar change rate. This estimate does not account for obtaining the changed list and other pre- and post-backup functionality. If you assume this takes 20% of the time, the result is around 250 incremental VMs/day.
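The rate-limit calculation above can be sketched as follows; the function name and the sample inputs (1 schedule/day, 0.1 TB change, 500 VMs) are illustrative assumptions, not values from the example above.

```python
# Minimal sketch of the account rate-limit formula above; s1, d1, v1 mirror the
# symbols in the formula, and the sample inputs are illustrative assumptions.

def account_rate_limit(s1: float, d1: float, v1: int) -> int:
    """Return the AWS account API rate limit (requests/sec) per the formula above."""
    x = (s1 * d1 * v1) / 40
    if x < 1:
        return 1000              # keep the default 1000 requests/sec
    return int((x + 1) * 1000)   # scale the limit up

# Example: 1 schedule/day, 0.1 TB incremental change per VM, 500 VMs
print(account_rate_limit(1, 0.1, 500))  # (50/40 = 1.25) -> 2250 requests/sec
```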