Restore Process for Kubernetes

Commvault provides several types of restores for Kubernetes.

Note

  • If a backup or restore operation is interrupted, Commvault might not have a chance to remove temporary VolumeSnapshots, volumes, and Commvault worker pods. To simplify identifying and, if required, manually removing these resources, Commvault attaches labels to them. For more information, see Restrictions and Known Issues for Kubernetes.

  • When recovering external IP addresses, Commvault does not set hard-coded IP addresses on resources (such as load balancers), to avoid conflicts with existing resources.
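As an illustration of the labeling noted above, leftover temporary resources can be located and removed with standard kubectl label selectors. The label key below is a placeholder, not a documented Commvault label; check Restrictions and Known Issues for Kubernetes for the actual labels:

```shell
# List leftover temporary resources by label (label key is a placeholder).
kubectl get pods,pvc,volumesnapshots -A -l 'example.commvault.io/temporary=true'

# After confirming that the resources are safe to remove:
kubectl delete pods,pvc,volumesnapshots -n <namespace> -l 'example.commvault.io/temporary=true'
```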

Full Application Restores

A Kubernetes full application restore operation performs the following steps.

  1. If the unconditional overwrite setting is enabled for restores, then the existing applications in the destination namespace are deleted.

  2. For each volume that is associated with the applications, Commvault performs these steps in the restore namespace:

    1. Create a PersistentVolumeClaim (PVC) with the same provisioned size as the original volume.

    2. Create a temporary pod and mount the volume to the temporary pod.

    3. Read data from the MediaAgent and write it to the persistent volume.

    4. After the writes are complete, unmount the volume from the temporary pod and delete the temporary pod.

    5. Deploy the application and mount the volume to it.
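Conceptually, the intermediate objects created in steps 2.1 and 2.2 resemble the following manifests. This is an illustrative sketch only; the names, namespace, image, and sizes are placeholders, not the actual objects that Commvault creates:

```yaml
# Illustrative PVC sized to match the original volume (step 2.1).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data-pvc        # placeholder name
  namespace: restore-ns          # placeholder restore namespace
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi              # same provisioned size as the original volume
---
# Illustrative temporary worker pod that mounts the PVC for writing (step 2.2).
apiVersion: v1
kind: Pod
metadata:
  name: cv-temp-worker           # placeholder name
  namespace: restore-ns
spec:
  containers:
    - name: writer
      image: busybox             # placeholder image
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /restore-target
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: restored-data-pvc
```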

How Commvault Handles Failures During Backups and Restores

What happens when a failure occurs during a backup or a restore?

By default, Commvault handles failures as follows:

  • If the backup or restore for an individual application fails, then the Kubernetes access node restarts the application-specific job from the beginning, starting by scheduling a new Commvault temporary worker pod. Backup data from the previous failed attempt is discarded and must be transferred again.

  • If a child Kubernetes access node fails, then the operations that it was handling are rescheduled on another access node.

  • If the coordinator Kubernetes access node fails, then a partial or complete failure status is registered for the job. The job is not restarted.

  • If the Commvault temporary worker pod fails, then the job is not restarted and is marked as Completed with Errors.
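The default failure-handling behavior above can be summarized as a small lookup table. This is an illustrative sketch of the documented policy, not Commvault code:

```python
# Sketch of the documented default failure handling, keyed by what failed.
FAILURE_HANDLING = {
    "application": "restart the application job from the beginning on the access node",
    "child_access_node": "reschedule its operations on another access node",
    "coordinator_access_node": "register partial/complete failure; job is not restarted",
    "temporary_worker_pod": "mark the job Completed with Errors; job is not restarted",
}

def handling_for(failure: str) -> str:
    """Return the documented default handling for a failure type."""
    return FAILURE_HANDLING[failure]

print(handling_for("child_access_node"))
```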

For more information about job restarts, see Job Status and Control for Virtual Machines. This page contains information for the Virtual Server Agent, which is the agent that the Commvault software uses to manage Kubernetes jobs.

How many times, and how frequently, does Commvault try to restart a failed job?

By default, restarts are enabled for all job types. For backup jobs, the default number of restarts is 10. For restore jobs, the default number of restarts is 144. For both backup and restore jobs, the default restart interval is 20 minutes.
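With these defaults, the maximum automatic-retry window can be computed directly. A minimal sketch, assuming the documented defaults (10 restarts for backups, 144 for restores, 20-minute interval):

```python
# Documented defaults: restart interval of 20 minutes for both job types.
RESTART_INTERVAL_MIN = 20
DEFAULT_RESTARTS = {"backup": 10, "restore": 144}

def max_retry_window_hours(job_type: str) -> float:
    """Upper bound on the automatic-retry window, in hours."""
    return DEFAULT_RESTARTS[job_type] * RESTART_INTERVAL_MIN / 60

print(max_retry_window_hours("backup"))   # 10 restarts * 20 min ≈ 3.3 hours
print(max_retry_window_hours("restore"))  # 144 restarts * 20 min = 48 hours
```

In other words, a restore job can be retried automatically for up to about 48 hours before Commvault gives up.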


If resources are created for failed attempts to run the job, are those resources cleaned up?

If a Kubernetes backup or restore operation is interrupted, the Commvault access node might not have an opportunity to remove temporary VolumeSnapshots, volumes, and Commvault temporary worker pods.

For a workaround and other information, see "Cleanup of Temporary Resources" in Restrictions and Known Issues for Kubernetes.

What happens when the storage snapshotter pod fails during a backup job?

If the Container Storage Interface (CSI) components (specifically, the external snapshotter pod) fail, the following occurs:

  1. Commvault's attempts to issue snapshot commands fail.

  2. The kube-apiserver reports the failures as failed Commvault snapshot attempts.

  3. Commvault fails the backup for that container.

  4. Commvault tries to restart the application snapshot job up to 5 times, with the following results:

    • If snapshots are already created, the backup proceeds. However, the cleanup of Commvault-created PVCs/snapshots might fail. For more information, see "Cleanup of Temporary Resources" in Restrictions and Known Issues for Kubernetes.

    • If snapshots are not created yet, the backup job fails, and Commvault does not attempt to restart the job.

Commvault recommends adding observability and health checks to your CSI controllers, including the external snapshot controller, to ensure that your cluster is always in a state in which backups and restores can be performed.
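As a starting point for such health checks, the snapshot controller and CSI components can be inspected with standard kubectl commands. Namespaces and label selectors vary by distribution, so the values below are illustrative:

```shell
# Verify that the snapshot controller pods are running
# (namespace and label are placeholders; adjust for your distribution).
kubectl get pods -n kube-system -l app=snapshot-controller

# Confirm that VolumeSnapshot objects reach readyToUse=true.
kubectl get volumesnapshots -A \
  -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,READY:.status.readyToUse'
```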
