Hadoop (HDFS)

The Commvault software provides the integrated approach that you need to back up and archive HDFS (Hadoop Distributed File System) data.

You install the Commvault software on a Hadoop DataNode or a Hadoop Client Node. These nodes are referred to as data access nodes.

When you configure Hadoop, you specify one data access node as a master node. The master node must always be available. The master node is a control client that distributes the backup and restore operations among the data access nodes.
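As a rough illustration of how a control node could spread work across data access nodes, the following sketch assigns files to nodes with a greedy least-loaded strategy. The function name `distribute`, the node names, and the balancing strategy itself are assumptions made for illustration. They are not taken from the Commvault software, whose actual scheduling logic is not described here.

```python
import heapq

def distribute(files, nodes):
    """Assign each (path, size) pair to the node with the fewest bytes so far.

    Hypothetical helper: a greedy balancing sketch, not the product's scheduler.
    """
    heap = [(0, node) for node in nodes]   # (assigned_bytes, node_name)
    heapq.heapify(heap)
    plan = {node: [] for node in nodes}
    # Placing the largest files first gives the greedy strategy a better balance.
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        load, node = heapq.heappop(heap)
        plan[node].append(path)
        heapq.heappush(heap, (load + size, node))
    return plan

files = [("/data/a", 400), ("/data/b", 300), ("/data/c", 200), ("/data/d", 100)]
plan = distribute(files, ["node1", "node2"])
```

With these sample inputs, each node ends up with 500 bytes of work, so neither node sits idle while the other finishes.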

During backup and restore operations, communication that is related to the file system namespace operations between the data access nodes and the Hadoop cluster occurs through the Hadoop NameNode. The actual data transfer occurs between the Hadoop DataNodes and the data access nodes.
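The split between namespace traffic and data traffic can be pictured with a toy in-memory model. The classes and method names below (`NameNode`, `DataNode`, `get_block_locations`, `read_block`, `read_file`) are hypothetical stand-ins, not the Hadoop or Commvault APIs. The sketch only shows that the reader asks the NameNode where blocks live and then fetches the bytes from the DataNodes.

```python
class NameNode:
    """Toy metadata service: knows which DataNode stores each block of a file."""
    def __init__(self, block_map):
        self.block_map = block_map          # path -> [(block_id, datanode_name)]

    def get_block_locations(self, path):
        return self.block_map[path]         # namespace operation, no file data

class DataNode:
    """Toy storage node: holds the actual block contents."""
    def __init__(self, blocks):
        self.blocks = blocks                # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]        # data transfer

def read_file(path, namenode, datanodes):
    """Read a file the way a data access node would in this simplified model."""
    data = b""
    for block_id, dn_name in namenode.get_block_locations(path):
        data += datanodes[dn_name].read_block(block_id)
    return data

namenode = NameNode({"/logs/app.log": [("blk_1", "dn1"), ("blk_2", "dn2")]})
datanodes = {"dn1": DataNode({"blk_1": b"hello "}), "dn2": DataNode({"blk_2": b"world"})}
content = read_file("/logs/app.log", namenode, datanodes)
```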

For an Azure HDInsight cluster that uses Azure Blob Storage as its backend storage, we recommend using Cloud Apps to back up the Azure Blob Storage directly. For more information about Azure Blob Storage, see Overview: Azure Blob Storage.

Key Features

Simplified Data Management

Management of all the Hadoop data in your environment using the same console and infrastructure.

Distributed Backup and Restores

Distributed backup and restores, which run in parallel on multiple data access nodes, for optimal sharing of the backup load.

Extent-Based Backups

Extent-based backups use extent-based (chunk-based) technology to speed up backup operations. An extent-based backup breaks a file down into small independent extents, which can lead to quicker backups and more resiliency to network disruptions. Extent-based backups are helpful when the backup content consists of large files that might otherwise cause some of the available data streams to remain idle.
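The idea of breaking a file into independent extents can be sketched as follows. The extent size, the function names, and the round-robin assignment to streams are assumptions for illustration only. The source does not state the extent size or the way the software maps extents to streams.

```python
def split_into_extents(file_size, extent_size):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    if extent_size <= 0:
        raise ValueError("extent_size must be positive")
    return [(offset, min(extent_size, file_size - offset))
            for offset in range(0, file_size, extent_size)]

def assign_to_streams(extents, stream_count):
    """Spread extents over streams round-robin (an assumed, simple policy)."""
    streams = [[] for _ in range(stream_count)]
    for index, extent in enumerate(extents):
        streams[index % stream_count].append(extent)
    return streams

extents = split_into_extents(10, 4)        # a 10-byte file, 4-byte extents
streams = assign_to_streams(extents, 2)    # two backup streams
```

Because each extent is described only by an offset and a length, one large file can keep several streams busy instead of occupying a single stream while the others wait.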

Extent-based backups are Hadoop rack-aware. If data access nodes are configured in the same rack as the Hadoop DataNodes that contain the physical blocks, then those data access nodes are given priority when the blocks from that rack are backed up.
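A minimal sketch of rack-aware selection, assuming each block reports the racks that hold its replicas and each data access node reports its own rack. The function `pick_node`, the load-based tie-break, and the fallback to any node when no rack matches are hypothetical choices, not documented product behavior.

```python
def pick_node(block_racks, node_racks, load):
    """Choose a data access node for one block.

    block_racks: set of racks that hold a replica of the block
    node_racks:  dict of data access node name -> rack name
    load:        dict of node name -> number of blocks already assigned
    """
    # Prefer nodes that share a rack with one of the block's replicas.
    local = [node for node, rack in node_racks.items() if rack in block_racks]
    candidates = local or list(node_racks)   # fall back to any node
    # Among the candidates, take the least-loaded node (name breaks ties).
    return min(candidates, key=lambda node: (load.get(node, 0), node))

node_racks = {"dan1": "rack1", "dan2": "rack2"}
chosen = pick_node({"rack2"}, node_racks, {"dan1": 0, "dan2": 5})
```

In this example the block is on rack2, so `dan2` is chosen even though it carries more load than `dan1`.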

Benefits

Faster backups
Large files are backed up using multiple streams, which increases the speed of backups.
Efficient backups
After a network disruption, a backup job resumes from the exact point at which it was interrupted.
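Resuming after an interruption can be pictured as a loop that records each extent once it has been sent and skips recorded extents on the next attempt. This is a sketch under assumptions: `run_backup`, the `completed` set, and the `send` callback are hypothetical names, and the sketch resumes at extent granularity, which may be coarser or finer than what the software does.

```python
def run_backup(extents, completed, send):
    """Send every extent that is not yet in completed.

    completed is updated after each successful send, so calling this
    function again after a failure continues where the last run stopped.
    """
    for extent in extents:
        if extent in completed:
            continue              # already backed up in an earlier attempt
        send(extent)              # may raise, for example on a network error
        completed.add(extent)
    return completed

# Simulate a transfer that fails once, on the second call.
sent, calls = [], {"count": 0}

def flaky_send(extent):
    calls["count"] += 1
    if calls["count"] == 2:
        raise ConnectionError("simulated network disruption")
    sent.append(extent)

extents = [(0, 4), (4, 4), (8, 2)]
completed = set()
try:
    run_backup(extents, completed, flaky_send)
except ConnectionError:
    pass                          # the job is interrupted here
after_failure = set(completed)
run_backup(extents, completed, flaky_send)   # resume: the first extent is skipped
```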

Limitations

The Restore by Jobs feature is not supported for extent-based backups.
When you kill a restore operation that is restoring files that were backed up as extents, empty files might be left behind.

Fault-Tolerant Model

Fault-tolerant model that redistributes task loads when a data access node fails.
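Redistribution after a node failure could look like the following sketch, which moves the failed node's pending tasks to the surviving node with the fewest tasks. The function `redistribute` and the least-tasks policy are assumptions for illustration. The source does not describe how the software chooses the new node.

```python
def redistribute(plan, failed_node):
    """Return a new plan with the failed node's tasks moved to survivors.

    plan: dict of data access node name -> list of pending tasks
    """
    survivors = {node: list(tasks) for node, tasks in plan.items()
                 if node != failed_node}
    if not survivors:
        raise RuntimeError("no data access nodes left to take over the tasks")
    for task in plan.get(failed_node, []):
        # Give each orphaned task to the survivor with the fewest tasks.
        target = min(survivors, key=lambda node: (len(survivors[node]), node))
        survivors[target].append(task)
    return survivors

plan = {"n1": ["a", "b"], "n2": ["c"], "n3": ["d", "e"]}
new_plan = redistribute(plan, "n3")
```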

Data Archiving

Archive and delete inactive Hadoop data from the primary disk storage based on user-defined policy.
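One possible form of a user-defined policy is an idle-time threshold: files that have not been accessed for a given number of days qualify for archiving. The sketch below assumes that rule. The function `select_for_archive` and the `max_idle_days` parameter are hypothetical and do not correspond to named product settings.

```python
from datetime import datetime, timedelta

def select_for_archive(files, now, max_idle_days):
    """Return the paths whose last access is older than max_idle_days.

    files: iterable of (path, last_access_datetime) pairs
    """
    cutoff = now - timedelta(days=max_idle_days)
    return [path for path, last_access in files if last_access < cutoff]

now = datetime(2024, 6, 1)
files = [
    ("/warehouse/2019/sales.csv", datetime(2023, 1, 15)),
    ("/warehouse/2024/sales.csv", datetime(2024, 5, 30)),
]
inactive = select_for_archive(files, now, max_idle_days=365)
```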

Diverse Backup Types

Support for full, incremental, and synthetic full backup types for archive subclients.

LAN-Free Backup and Restores

Faster LAN-free backup and restores using a grid storage policy.

Restores to Big Data Application Targets

Support for restoring Hadoop data to a big data application target (any other file system).

Versioning

Support for multiple file versions, so that you can select a specific version of a file to restore.

Data Recovery

Support for recovering data lost due to file deletion or corruption.

Reports

A variety of reports are automatically provided for managing the Hadoop data. You can access Reports from the Web Console, the Cloud Services site, or the CommCell Console.

Terminology

The Hadoop documentation uses the following terminology:

Pseudo-client	The logical entity that represents one or more Hadoop clusters.
Instance	The entity that represents one Hadoop cluster.
Subclient	The logical entity that defines the data to be backed up or archived.
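The relationship between these terms can be sketched as a small data model. The class and field names are hypothetical, and the nesting of subclients under an instance is an assumption made for illustration. The definitions above state only what each entity represents.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Subclient:
    """Defines the data to be backed up or archived."""
    name: str
    content_paths: List[str]

@dataclass
class Instance:
    """Represents one Hadoop cluster."""
    cluster_name: str
    subclients: List[Subclient] = field(default_factory=list)  # assumed nesting

@dataclass
class PseudoClient:
    """Represents one or more Hadoop clusters."""
    name: str
    instances: List[Instance] = field(default_factory=list)

pseudo_client = PseudoClient(
    name="hadoop-prod",
    instances=[
        Instance("cluster-a", [Subclient("default", ["/user", "/warehouse"])]),
        Instance("cluster-b"),
    ],
)
```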