V11 SP8

Hyperconverged Storage Pool - Resiliency

The overall resiliency of a platform is a function of the system architecture and the best practices implemented to deliver the required level of service. In Commvault’s HCI, the inherent application-level resiliency of the distributed deduplication database (DDB) and indexing is complemented by the scale-out architecture of the underlying hardware. Implementing industry best practices such as a mirrored root disk, hot spares, and bonded network interfaces further enhances resiliency at the node level.

Disk-level and node-level resiliency on a Commvault hyperconverged platform is a function of the erasure coding scheme and the block size of the cluster. The erasure code determines the number of chunks into which incoming data is broken before it is saved on HDDs (bricks), which are dispersed across the nodes in a block. The block size is the number of server nodes in a block. The set of disks, across all nodes within a block, that houses the dispersed chunks constitutes a sub-volume. Higher tolerance to node-level failures is achieved by dispersing the chunks across a greater number of nodes within the cluster. Higher tolerance to disk failures is achieved by having more HDDs, which leads to more sub-volumes within each node in the cluster.

The following example illustrates a (4+2) erasure code on a block of three (3) nodes.

In the above example with an erasure code of (4+2), the sub-volume size is 6 HDDs, two on each of the three nodes. This cluster can tolerate the loss of any two HDDs in each sub-volume, or the failure of one complete node. Either condition leaves the minimum number of chunks, four (4), available for continued operation. Adding more sub-volumes by adding more HDDs does not alter the tolerance to node failures, because a node failure affects every sub-volume that has disks on that node.
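The reasoning in this example can be checked with a short sketch. It assumes the six chunks of a sub-volume are spread evenly across the three nodes (two per node); the names `place_chunks` and `survives` are hypothetical helpers for illustration and are not part of any Commvault tool.

```python
from itertools import combinations

DATA_CHUNKS = 4      # minimum chunks needed to keep operating, from (4+2)
PARITY_CHUNKS = 2    # redundant chunks, from (4+2)
NODES = 3            # block size

def place_chunks(total_chunks, nodes):
    """Spread chunk indices evenly over nodes (assumed round-robin layout)."""
    layout = {node: [] for node in range(nodes)}
    for chunk in range(total_chunks):
        layout[chunk % nodes].append(chunk)
    return layout

def survives(layout, failed_nodes=(), failed_chunks=()):
    """True if at least DATA_CHUNKS chunks remain after the given failures."""
    lost = set(failed_chunks)
    for node in failed_nodes:
        lost.update(layout[node])
    total = sum(len(chunks) for chunks in layout.values())
    return total - len(lost) >= DATA_CHUNKS

total = DATA_CHUNKS + PARITY_CHUNKS
layout = place_chunks(total, NODES)

# Any single node may fail (2 chunks lost, 4 remain).
single_node_ok = all(survives(layout, failed_nodes=[n]) for n in layout)
# Any two HDDs in the sub-volume may fail (one chunk per HDD).
two_hdd_ok = all(
    survives(layout, failed_chunks=pair)
    for pair in combinations(range(total), 2)
)
# Two nodes failing loses 4 chunks, leaving only 2.
two_node_ok = any(
    survives(layout, failed_nodes=pair) for pair in combinations(layout, 2)
)

print(single_node_ok, two_hdd_ok, two_node_ok)
```

Under these assumptions the sketch reports that any one node or any two HDDs can be lost, but not two nodes, which matches the tolerance stated for the (4+2), three-node configuration.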

The following table lists the resiliency for different erasure codes and block sizes. The choice of erasure code scheme and cluster block size affects both the resiliency and the disk capacity growth options available to a customer.

| Erasure Code | Block Size (Nodes / Block) | Sub-Volume Size (HDDs) | Erasure Code Disk Overhead | Tolerance to Node Failures | Tolerance to HDD Failures | Comments |
|---|---|---|---|---|---|---|
| (4 + 2) | 3 Nodes | 6 | 33% | 1 node per block | 2 HDDs per sub-volume | Basic configuration. More usable capacity than 2-way replication. |
| (4 + 2) | 6 Nodes | 6 | 33% | 2 nodes per block | 2 HDDs per sub-volume | Good balance of resiliency with disk capacity and scaling in place. Resiliency equivalent to 3-way replication with more usable disk capacity. |
| (8 + 4) | 3 Nodes | 12 | 33% | 1 node per block | 4 HDDs per sub-volume | Better resiliency against disk failures. Starter for future cluster growth. |
| (8 + 4) | 6 Nodes | 12 | 33% | 2 nodes per block | 4 HDDs per sub-volume | Better overall resiliency and disk capacity scaling options. |
| (8 + 4) | 12 Nodes | 12 | 33% | 4 nodes per block | 4 HDDs per sub-volume | Fully scaled configuration with maximum resiliency. |
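The numeric columns of the table follow from simple arithmetic on the erasure code and block size. The sketch below is one way to reproduce them, assuming the chunks of a sub-volume are divided evenly among the nodes in a block; the function name `resiliency` and its dictionary keys are hypothetical and chosen only for illustration.

```python
def resiliency(data_chunks, parity_chunks, nodes):
    """Derive the table's numeric columns for a (data + parity) code.

    Assumes the sub-volume's disks are spread evenly over the block,
    so the sub-volume size must be a multiple of the node count.
    """
    sub_volume = data_chunks + parity_chunks
    if sub_volume % nodes != 0:
        raise ValueError("sub-volume size must divide evenly across nodes")
    chunks_per_node = sub_volume // nodes
    return {
        "sub_volume_size": sub_volume,
        # Share of raw capacity consumed by redundancy, as a whole percent.
        "overhead_pct": round(100 * parity_chunks / sub_volume),
        # Each failed node removes chunks_per_node chunks from a sub-volume.
        "node_tolerance": parity_chunks // chunks_per_node,
        # Each failed HDD removes one chunk from its sub-volume.
        "hdd_tolerance": parity_chunks,
    }

for code, nodes in [((4, 2), 3), ((4, 2), 6), ((8, 4), 3), ((8, 4), 6), ((8, 4), 12)]:
    print(code, nodes, resiliency(*code, nodes))
```

The node tolerance grows with the block size because each node then holds fewer chunks of any one sub-volume, while the HDD tolerance and overhead depend only on the erasure code itself.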