Backup Process for Kubernetes

Commvault provides several backup types for Kubernetes.

Backup Types

Full Cluster Backups

A full cluster backup is a special application group that selects all namespaces (including System namespaces) and cluster-scoped API resources for protection. Protection then occurs according to these steps:

Namespace Backups

A namespace backup is the default backup method for applications. A namespace backup protects all API resources within the parent namespace of a protected application. Namespaces can be selected by label, or individually by browsing the live cluster.
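
To illustrate selection by label, the following minimal sketch applies Kubernetes-style equality matching (every label in the selector must be present with the same value). It is not Commvault code: the function name `select_namespaces`, the input shape, and the sample namespaces are assumptions made for the example.

```python
def select_namespaces(namespaces, selector):
    """Return the names of namespaces whose labels satisfy the selector.

    namespaces: dict mapping namespace name -> dict of labels (hypothetical input)
    selector:   dict of label key -> required value (equality-based matching)
    """
    return sorted(
        name
        for name, labels in namespaces.items()
        # A namespace matches only if every selector pair is present on it.
        if all(labels.get(key) == value for key, value in selector.items())
    )


cluster = {
    "shop": {"env": "prod", "team": "web"},
    "shop-test": {"env": "test", "team": "web"},
    "kube-system": {},
}
print(select_namespaces(cluster, {"env": "prod"}))
```

With the sample data, only `shop` carries the label `env=prod`, so it is the only namespace selected.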

Application Backups

Application backups are invoked when an individual application (Pod, DaemonSet, Deployment, StatefulSet, or Helm Chart) is selected for backup.

Application backups take an application-centric approach to protection and infer the related API resources and objects (such as Secrets, ConfigMaps, and Namespaces).

Backup Modes

Full cluster, namespace, and application backups can use either snapshot-based backups or streaming backups. If a Kubernetes cluster has a VolumeSnapshotClass object configured, then backups are snapshot-based. If the cluster does not have a VolumeSnapshotClass object configured, then backups are streaming.
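
The selection rule can be summarized in a few lines of Python. This is only an illustration of the rule stated above, not Commvault's implementation; the function name `choose_backup_mode` and its list-of-names input are assumptions.

```python
def choose_backup_mode(volume_snapshot_classes):
    """Return the backup mode implied by the cluster configuration.

    volume_snapshot_classes: names of the VolumeSnapshotClass objects found in
    the cluster (hypothetical input; discovery is out of scope for this sketch).
    """
    # At least one configured VolumeSnapshotClass means snapshot-based backups.
    if volume_snapshot_classes:
        return "snapshot"
    # With none configured, backups are streaming.
    return "streaming"


print(choose_backup_mode(["example-snapclass"]))
print(choose_backup_mode([]))
```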

CSI Volume Snapshot-Based Backups

A CSI volume snapshot-based backup uses the VolumeSnapshotClass object configured in the Kubernetes cluster to take snapshots of the persistent volumes and back up the data. The CSI snapshot creation process runs in parallel for each application. Persistent volume (PV) backups do not occur as a consistency group across all PVs for an application. Instead, the access node creates the snapshots sequentially and then protects them in parallel.

A snapshot-based backup operation includes the following steps:

  1. Discover applications based on the application group content.

  2. For each Persistent Volume Claim that is associated with applications, in the backup namespace:

    1. Determine the VolumeSnapshotClass (for the PVC).

    2. Create a snapshot of the Persistent Volume Claim.

    3. Create a temporary Persistent Volume Claim by using the snapshot created earlier.

    4. Create a temporary pod and mount the temporary Persistent Volume Claim to read the data.

  3. Clean up the Job Results folder on the access node and complete the backup.

  4. After the backup completes, unmount the temporary Persistent Volume Claim and delete the temporary VolumeSnapshot, the Persistent Volume Claim, and the temporary pod.
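
Steps 2.2 and 2.3 correspond to two standard Kubernetes objects: a VolumeSnapshot that references the source PVC, and a PVC whose `dataSource` points at that snapshot. The sketch below builds both manifests as Python dictionaries. The field names follow the Kubernetes `snapshot.storage.k8s.io/v1` and core `v1` APIs; the function names, the `tmp-` naming scheme, the access mode, and the sample values are hypothetical and do not reflect the names that Commvault actually uses.

```python
import json


def snapshot_manifest(pvc_name, namespace, snapshot_class):
    """Build a VolumeSnapshot manifest for the given PVC (step 2.2)."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": f"tmp-snap-{pvc_name}", "namespace": namespace},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


def temp_pvc_manifest(snapshot, storage_class, size):
    """Build a temporary PVC that is provisioned from the snapshot (step 2.3)."""
    meta = snapshot["metadata"]
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"tmp-pvc-{meta['name']}", "namespace": meta["namespace"]},
        "spec": {
            "storageClassName": storage_class,
            # dataSource tells the CSI driver to populate the volume from the snapshot.
            "dataSource": {
                "name": meta["name"],
                "kind": "VolumeSnapshot",
                "apiGroup": "snapshot.storage.k8s.io",
            },
            "accessModes": ["ReadWriteOnce"],  # assumption for the example
            "resources": {"requests": {"storage": size}},
        },
    }


snap = snapshot_manifest("data", "shop", "example-snapclass")
pvc = temp_pvc_manifest(snap, "example-sc", "10Gi")
print(json.dumps(pvc, indent=2))
```

The temporary pod in step 2.4 would then mount `tmp-pvc-tmp-snap-data` rather than the application's live volume, which is why the application can keep running while the data is read.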

Streaming Backups

A streaming backup is a legacy backup process that is used when the Kubernetes cluster does not have a VolumeSnapshotClass object configured.

A streaming backup operation includes the following steps:

  1. Discover applications based on the application group content.

  2. For each volume that is associated with the applications, in the backup namespace, create a temporary pod and mount the volume to read data.

  3. Clean up the Job Results folder on the access node and complete the backup.

  4. After the backup completes, unmount the volume and delete the temporary pod.
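
For comparison with the snapshot-based flow, step 2 of a streaming backup amounts to a pod that mounts the application's existing volume directly. The following sketch builds such a pod manifest as a Python dictionary. The Pod fields are standard Kubernetes core `v1` fields; the function name, pod name, mount path, and image are placeholders and are not the values that Commvault uses.

```python
def temp_reader_pod(pvc_name, namespace, image):
    """Build a manifest for a temporary pod that mounts an existing PVC."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"tmp-reader-{pvc_name}", "namespace": namespace},
        "spec": {
            # The pod is short-lived, so it should not be restarted when it exits.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "reader",
                    "image": image,  # placeholder image name
                    "volumeMounts": [{"name": "data", "mountPath": "/mnt/data"}],
                }
            ],
            # The volume refers to the application's own PVC, not to a snapshot copy.
            "volumes": [
                {"name": "data", "persistentVolumeClaim": {"claimName": pvc_name}}
            ],
        },
    }


pod = temp_reader_pod("data", "shop", "example.invalid/reader:latest")
print(pod["spec"]["volumes"][0]["persistentVolumeClaim"]["claimName"])
```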

How Commvault Handles Failures During Backups and Restores

What happens when a failure occurs during a backup or a restore?

By default, Commvault handles failures as follows:

  • If the backup or restore for an individual application fails, then the Kubernetes access node restarts the application-specific job from the beginning, starting with rescheduling a new Commvault temporary worker pod. Backup data from the previous failed backup is discarded and must be re-transferred.

  • If a child Kubernetes access node fails, then the operations that it was handling are rescheduled on another access node.

  • If the coordinator Kubernetes access node fails, then a partial or complete failure status is registered for the job. The job is not restarted.

  • If the Commvault temporary worker pod fails, then the job is not restarted and is marked as Completed with Errors.
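
The default behavior in the list above can be read as a lookup from the failed component to an outcome. The sketch below encodes exactly those four cases; the component keys, the `Outcome` class, and the `handle_failure` function are hypothetical names chosen for the example and are not part of any Commvault interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    retried: bool  # is the affected work attempted again automatically?
    note: str


DEFAULT_HANDLING = {
    "application": Outcome(True, "application job restarts from the beginning with a new worker pod"),
    "child_access_node": Outcome(True, "operations are rescheduled on another access node"),
    "coordinator_access_node": Outcome(False, "job registers a partial or complete failure"),
    "worker_pod": Outcome(False, "job is marked Completed with Errors"),
}


def handle_failure(component):
    """Return the default outcome for a failed component."""
    return DEFAULT_HANDLING[component]


for name in DEFAULT_HANDLING:
    result = handle_failure(name)
    print(f"{name}: retried={result.retried} ({result.note})")
```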

For more information about job restarts, see Job Status and Control for Virtual Machines. This page contains information for the Virtual Server Agent, which is the agent that the Commvault software uses to manage Kubernetes jobs.

How many times, and how frequently, does Commvault try to restart a failed job?

By default, restarts are enabled for all job types. For backup jobs, the default number of restarts is 10. For restore jobs, the default number of restarts is 144. For both backup and restore jobs, the default restart interval is 20 minutes.
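
As a worked example of these defaults, the following sketch multiplies the number of restarts by the restart interval to estimate how long the software keeps retrying. It counts only the waiting time between restarts and assumes that every restart waits the full interval; the time spent running each attempt is not included. The names `DEFAULT_RESTARTS` and `max_restart_window` are illustrative.

```python
from datetime import timedelta

DEFAULT_RESTARTS = {"backup": 10, "restore": 144}
RESTART_INTERVAL = timedelta(minutes=20)


def max_restart_window(job_type):
    """Total waiting time if every default restart is used."""
    return DEFAULT_RESTARTS[job_type] * RESTART_INTERVAL


print(max_restart_window("backup"))   # 3:20:00
print(max_restart_window("restore"))  # 2 days, 0:00:00
```

In other words, a backup job can be retried over a window of about 3 hours and 20 minutes, and a restore job over a window of about 48 hours.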

For more information about job restarts, see Job Status and Control for Virtual Machines. This page contains information for the Virtual Server Agent, which is the agent that the Commvault software uses to manage Kubernetes jobs.

If resources are created for failed attempts to run the job, are those resources cleaned up?

If a Kubernetes backup or restore operation is interrupted, the Commvault access node might not have an opportunity to remove temporary VolumeSnapshots, volumes, and Commvault temporary worker pods.

For a workaround and other information, see "Cleanup of Temporary Resources" in Restrictions and Known Issues for Kubernetes.

What happens when the storage snapshotter pod fails during a backup job?

If the Container Storage Interface (CSI) components (specifically, the external snapshotter pod) fail, the following occurs:

  1. Commvault's attempts to issue snapshot commands fail.

  2. The kube-apiserver reports the failures as failed Commvault snapshot attempts.

  3. Commvault fails the backup for that container.

  4. Commvault tries to restart the application snapshot job 5 times, with the following results:

    • If snapshots are already created, the backup proceeds. However, the cleanup of Commvault-created PVCs/snapshots might fail. For more information, see "Cleanup of Temporary Resources" in Restrictions and Known Issues for Kubernetes.

    • If snapshots are not created yet, the backup job fails, and Commvault does not attempt to restart the job.