Backup Process for Kubernetes

Commvault provides several backup types for Kubernetes.

Note: If a backup or restore operation is interrupted, Commvault might not have a chance to remove temporary VolumeSnapshots, volumes, and Commvault worker pods. To make these resources easy to identify and, if required, remove manually, Commvault attaches labels to them. For more information, see Restrictions and Known Issues for Kubernetes.

Backup Types

Full Cluster Backups

A full cluster backup is a special application group that selects all namespaces (including system namespaces) and all cluster-scoped API resources for protection.

Namespace Backups

A namespace backup is the default backup method for applications. A namespace backup protects all API resources within the parent namespace of a protected application. You can select namespaces by label, or individually by browsing the live cluster.

Application Backups

Application backups are invoked when an individual application (Pod, DaemonSet, Deployment, StatefulSet, or Helm Chart) is selected for backup.

Application backups take an application-centric approach to protection and infer the related API resources and objects (such as Secrets, ConfigMaps, and Namespaces).

Backup Modes

Full cluster, namespace, and application backups can use either snapshot-based backups or streaming backups. If a Kubernetes cluster has a VolumeSnapshotClass object configured, then backups are snapshot-based. If the cluster does not have a VolumeSnapshotClass object configured, then backups are streaming.
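The selection rule above can be sketched as a small function. This is a minimal illustration, not Commvault's implementation: the function name and the returned strings are hypothetical, and the input is assumed to be the list of VolumeSnapshotClass names that were already retrieved from the cluster.

```python
def choose_backup_mode(volume_snapshot_classes):
    """Return the backup mode implied by the cluster configuration.

    volume_snapshot_classes: names of the VolumeSnapshotClass objects found
    in the cluster (hypothetical input; listing them is out of scope here).
    """
    # Any configured VolumeSnapshotClass means snapshot-based backups.
    if volume_snapshot_classes:
        return "snapshot"
    # No VolumeSnapshotClass configured: backups are streaming.
    return "streaming"


print(choose_backup_mode(["example-snapclass"]))
print(choose_backup_mode([]))
```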

CSI Volume Snapshot-Based Backups

A CSI volume snapshot-based backup uses the VolumeSnapshotClass object configured in the Kubernetes cluster to take snapshots of the persistent volumes and back up the data. The CSI snapshot creation process runs in parallel for each application. However, the persistent volumes (PVs) of an application are not snapshotted as a consistency group. Instead, the access node creates the snapshots sequentially and then protects them in parallel.
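The "sequential snapshots, parallel protection" pattern for the volumes of one application can be sketched as follows. This is an illustration only: create_snapshot and protect_snapshot are hypothetical placeholders for the real snapshot and data-transfer work, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def create_snapshot(pvc_name):
    # Placeholder for the real snapshot request (hypothetical naming).
    return f"snap-{pvc_name}"


def protect_snapshot(snapshot_name):
    # Placeholder for reading and backing up the snapshot data.
    return f"protected:{snapshot_name}"


def back_up_application(pvc_names, max_workers=4):
    # Step 1: create the snapshots one at a time, in order.
    snapshots = [create_snapshot(name) for name in pvc_names]
    # Step 2: protect the snapshots concurrently; map() keeps input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(protect_snapshot, snapshots))


print(back_up_application(["data-0", "data-1"]))
```

Because the snapshots are created one after another rather than as a consistency group, each volume is captured at a slightly different point in time.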

A snapshot-based backup operation includes the following steps:

  1. Discover applications based on the application group content.

  2. For each Persistent Volume Claim that is associated with applications, in the backup namespace:

    1. Determine the VolumeSnapshotClass (for the PVC).

    2. Create a snapshot of the Persistent Volume Claim.

    3. Create a temporary Persistent Volume Claim by using the snapshot created earlier.

    4. Create a temporary pod and mount the temporary Persistent Volume Claim to read the data.

  3. Clean up the Job Results folder on the access node and complete the backup.

  4. After the backup completes, unmount the temporary Persistent Volume Claim and delete the temporary VolumeSnapshot, the Persistent Volume Claim, and the temporary pod.
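Steps 2.2 and 2.3 correspond to two standard Kubernetes objects: a VolumeSnapshot that references the source PVC, and a PVC whose dataSource is that snapshot. The sketch below builds both manifests as Python dictionaries. The field names follow the Kubernetes snapshot.storage.k8s.io/v1 and core v1 APIs. The function names, the "tmp-" name prefixes, the access mode, and the example values are assumptions made for illustration and are not the manifests that Commvault generates.

```python
def build_volume_snapshot(pvc_name, snapshot_class, namespace):
    """Manifest for a VolumeSnapshot of an existing PVC."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        # Hypothetical naming scheme for the temporary snapshot.
        "metadata": {"name": f"tmp-snap-{pvc_name}", "namespace": namespace},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


def build_temporary_pvc(snapshot_name, storage_class, size, namespace):
    """Manifest for a PVC that is populated from a VolumeSnapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"tmp-pvc-{snapshot_name}", "namespace": namespace},
        "spec": {
            "storageClassName": storage_class,
            # dataSource tells the CSI driver to restore from the snapshot.
            "dataSource": {
                "apiGroup": "snapshot.storage.k8s.io",
                "kind": "VolumeSnapshot",
                "name": snapshot_name,
            },
            "accessModes": ["ReadWriteOnce"],  # assumed access mode
            "resources": {"requests": {"storage": size}},
        },
    }


snapshot = build_volume_snapshot("data-0", "example-snapclass", "demo")
temp_pvc = build_temporary_pvc(snapshot["metadata"]["name"], "example-sc", "1Gi", "demo")
print(temp_pvc["spec"]["dataSource"]["name"])
```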

Streaming Backups

A streaming backup is a legacy backup process that is used when the Kubernetes cluster does not have a VolumeSnapshotClass object configured.

A streaming backup operation includes the following steps:

  1. Discover applications based on the application group content.

  2. For each volume that is associated with the applications, in the backup namespace, create a temporary pod and mount the volume to read data.

  3. Clean up the Job Results folder on the access node and complete the backup.

  4. After the backup completes, unmount the volume and delete the temporary pod.
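The temporary pod in step 2 can be pictured as an ordinary Pod manifest that mounts an existing PersistentVolumeClaim. The sketch below builds such a manifest as a Python dictionary using standard core v1 fields. The function name, pod name, image, mount path, and the read-only mount are assumptions for illustration and do not describe the actual Commvault worker pod.

```python
def build_reader_pod(pvc_name, namespace, image):
    """Manifest for a pod that mounts an existing PVC in order to read it."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        # Hypothetical naming scheme for the temporary pod.
        "metadata": {"name": f"tmp-reader-{pvc_name}", "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "reader",
                    "image": image,
                    "volumeMounts": [
                        # Read-only mount is an assumption, not a documented setting.
                        {"name": "data", "mountPath": "/mnt/data", "readOnly": True}
                    ],
                }
            ],
            "volumes": [
                {
                    "name": "data",
                    "persistentVolumeClaim": {"claimName": pvc_name},
                }
            ],
        },
    }


pod = build_reader_pod("data-0", "demo", "example.com/reader:latest")
print(pod["metadata"]["name"])
```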

vsphereVolume Snapshot-Based Backups

Caution: vsphereVolume is an in-tree volume plug-in and is deprecated in active Kubernetes releases. Upgrade your VMware cluster to support the Container Storage Interface (CSI) out-of-tree driver.

If a Kubernetes cluster has the vsphereVolume (deprecated) volume plug-in installed, and if you enabled Commvault VMware vSphere integration for Kubernetes, then Commvault uses the following process during backups to create VMware VMDK snapshots from the Kubernetes access node.

A vsphereVolume snapshot-based backup requires the PersistentVolumeClaim (PVC) to reside on a StorageClass that uses the kubernetes.io/vsphere-volume provisioner. VMware has two provisioners:

  • CSI provisioner: csi.vsphere.vmware.com

  • vCP provisioner: kubernetes.io/vsphere-volume

This process applies to the vCP provisioner.

The backup process for PersistentVolumes that reside on a vsphereVolume-controlled StorageClass is as follows:

  1. Discover applications based on the application group content. (See vsdiscovery.log on the access node.)

  2. For each PersistentVolumeClaim that is associated with an in-scope application or namespace, do the following:

    1. Determine the StorageClass (for the PVC).

    2. Validate that the StorageClass provisioner is kubernetes.io/vsphere-volume.

    3. Contact the VMware vCenter SDK endpoint and create a snapshot of the PersistentVolumeClaim VMDK.

    4. Create a temporary VMDK volume and PersistentVolumeClaim by using the snapshot created earlier.

    5. Create a temporary pod and mount the temporary PersistentVolumeClaim to read the data.

    6. Complete the backup.

  3. Unmount the temporary PersistentVolumeClaim, and terminate the temporary Pod.

  4. Contact the VMware vCenter SDK endpoint and delete the temporary VMDK volume and VMDK snapshot.

  5. Clean up the Job Results folder on the access node and complete the backup.
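The validation in steps 2.1 and 2.2 amounts to reading the provisioner field of the PVC's StorageClass and comparing it with the vCP provisioner name. A minimal sketch, assuming the StorageClass has already been fetched and is available as a dictionary (the function name and the example StorageClass names are hypothetical):

```python
VCP_PROVISIONER = "kubernetes.io/vsphere-volume"


def uses_vsphere_volume(storage_class):
    """Return True if the StorageClass is backed by the in-tree vCP provisioner."""
    # "provisioner" is a top-level field of a StorageClass object.
    return storage_class.get("provisioner") == VCP_PROVISIONER


in_tree = {"kind": "StorageClass", "metadata": {"name": "example-vcp"},
           "provisioner": "kubernetes.io/vsphere-volume"}
csi = {"kind": "StorageClass", "metadata": {"name": "example-csi"},
       "provisioner": "csi.vsphere.vmware.com"}

print(uses_vsphere_volume(in_tree))
print(uses_vsphere_volume(csi))
```

Only a StorageClass for which the check returns True would follow the vsphereVolume process described above.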

How Commvault Handles Failures During Backups and Restores

What happens when a failure occurs during a backup or a restore?

By default, Commvault handles failures as follows:

  • If the backup or restore for an individual application fails, then the Kubernetes access node restarts the application-specific job from the beginning, starting with rescheduling a new Commvault temporary worker pod. Backup data from the previous failed backup is discarded and must be re-transferred.

  • If a child Kubernetes access node fails, then the operations that it was handling are rescheduled on another access node.

  • If the coordinator Kubernetes access node fails, then a partial or complete failure status is registered for the job. The job is not restarted.

  • If the Commvault temporary worker pod fails, then the job is not restarted and is marked as Completed with Errors.

For more information about job restarts, see Job Status and Control for Virtual Machines. This page contains information for the Virtual Server Agent, which is the agent that the Commvault software uses to manage Kubernetes jobs.

How many times, and how frequently, does Commvault try to restart a failed job?

By default, restarts are enabled for all job types. For backup jobs, the default number of restarts is 10. For restore jobs, the default number of restarts is 144. For both backup and restore jobs, the default restart interval is 20 minutes.
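To put these defaults in perspective, the following sketch computes the total time spent waiting between restarts. It assumes that every restart is used and that the interval is applied once per restart, and it ignores the run time of each attempt. The function name is hypothetical.

```python
def max_restart_wait_minutes(restarts, interval_minutes=20):
    """Total waiting time if every allowed restart is used."""
    return restarts * interval_minutes


backup_wait = max_restart_wait_minutes(10)    # default for backup jobs
restore_wait = max_restart_wait_minutes(144)  # default for restore jobs

print(f"backup: {backup_wait} minutes")
print(f"restore: {restore_wait} minutes ({restore_wait / 60} hours)")
```

With the defaults, a backup job can spend up to 200 minutes waiting between restarts, and a restore job up to 2,880 minutes (48 hours).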

For more information about job restarts, see Job Status and Control for Virtual Machines. This page contains information for the Virtual Server Agent, which is the agent that the Commvault software uses to manage Kubernetes jobs.

If resources are created for failed attempts to run the job, are those resources cleaned up?

If a Kubernetes backup or restore operation is interrupted, the Commvault access node might not have an opportunity to remove temporary VolumeSnapshots, volumes, and Commvault temporary worker pods.

For a workaround and other information, see "Cleanup of Temporary Resources" in Restrictions and Known Issues for Kubernetes.
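Because the temporary resources carry labels, leftovers can be found by filtering on a label. The sketch below shows the idea on resources that are already loaded as dictionaries. The label key and value used here are hypothetical placeholders, not the labels that Commvault actually attaches; use the labels listed in Restrictions and Known Issues for Kubernetes.

```python
def find_labeled_resources(resources, label_key, label_value):
    """Return the names of resources whose labels match the given key and value."""
    matches = []
    for resource in resources:
        metadata = resource.get("metadata", {})
        labels = metadata.get("labels") or {}
        if labels.get(label_key) == label_value:
            matches.append(metadata.get("name"))
    return matches


# Hypothetical label and resources, for illustration only.
resources = [
    {"kind": "Pod", "metadata": {"name": "tmp-worker-1",
                                 "labels": {"example.com/created-by": "backup"}}},
    {"kind": "Pod", "metadata": {"name": "web-0", "labels": {"app": "web"}}},
    {"kind": "PersistentVolumeClaim", "metadata": {"name": "data-0"}},
]

print(find_labeled_resources(resources, "example.com/created-by", "backup"))
```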

What happens when the storage snapshotter pod fails during a backup job?

If the Container Storage Interface (CSI) components (specifically, the external snapshotter pod) fail, the following occurs:

  1. Commvault's attempts to issue snapshot commands fail.

  2. The kube-apiserver reports the failures as failed Commvault snapshot attempts.

  3. Commvault fails the backup for that container.

  4. Commvault tries to restart the application snapshot job 5 times, with the following results:

    • If snapshots are already created, the backup proceeds. However, the cleanup of Commvault-created PVCs/snapshots might fail. For more information, see "Cleanup of Temporary Resources" in Restrictions and Known Issues for Kubernetes.

    • If snapshots are not created yet, the backup job fails, and Commvault does not attempt to restart the job.