Deduplication Building Block Guide

Overview

A Building Block is a combination of server and storage that provides a modular approach to data management.

This building block guide illustrates how to choose the right number of deduplication nodes based on your environment, storage requirements, and production data size. Choosing the right deduplication model allows you to protect large amounts of data with minimal infrastructure, faster backups, and better scalability.

Server

For a building block, choose servers with fast processors and sufficient memory to deliver good performance and scalability.

Storage

Before setting up a building block, plan for sufficient storage space that balances cost, availability, and performance. Sufficient storage space includes:

  • Space for the deduplication database

  • Space for configuring the disk library

DDB Backup on Cloud

Backing up the deduplication database (DDB) to archive object storage, such as Amazon Glacier, Oracle Archive, or Azure Archive, is not recommended.

We recommend performing DDB backups to a disk library. If a cloud library must be used, create a new cloud library whose target does not have archive object storage enabled (for example, standard Amazon S3) and use it for both the DDB and index backups.

Deduplication System Requirements

Supported Platforms

DDB can be hosted on any of the following operating systems:

Important: Partitioned DDBs are supported only on x64 versions of the following operating systems, not on 32-bit versions.

Windows

All platforms on which the Windows MediaAgent is supported, except 64-bit editions on Intel Itanium (IA-64) and Windows XP.

Supported on NTFS and ReFS. For more information on supported Windows MediaAgents, see MediaAgent System Requirements.

Linux

All platforms on which the Linux MediaAgent is supported, except PowerPC (includes IBM System p). 32-bit Linux editions are not supported.

Supported on ext3, ext4, and XFS. For more information on supported Linux MediaAgents, see MediaAgent System Requirements.

Note: If using NFS paths, use NFS version 3 (NFSv3) with Network Lock Manager (NLM) or NFSv4. A quick version check appears after this table.

Microsoft Cluster Service (MSCS)

Clusters supported by Windows MediaAgents. Supported on NTFS and ReFS.

Linux Cluster

Clusters supported by Linux MediaAgents. Supported on ext3, ext4, and XFS.
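
To confirm which NFS version a mount path is actually using, you can inspect /proc/mounts on the Linux MediaAgent. The following is a minimal sketch; the mount point path is an example, not a required location.

    # Read the NFS protocol version of a mount point from /proc/mounts.
    # On modern kernels the mount options include a vers= entry for both
    # NFSv3 and NFSv4 mounts.
    def nfs_version(mount_point: str) -> str | None:
        with open("/proc/mounts") as f:
            for line in f:
                device, mpoint, fstype, options = line.split()[:4]
                if mpoint == mount_point and fstype.startswith("nfs"):
                    for opt in options.split(","):
                        if opt.startswith("vers="):
                            return opt.split("=", 1)[1]
        return None

    print(nfs_version("/mnt/ddb"))  # e.g. "3" or "4.1"; "/mnt/ddb" is an example path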

Hardware Requirements

The hardware requirements for the MediaAgent that hosts the DDB are explained in Hardware Specifications for Deduplication Mode.

You can configure or modify the kernel parameters on the MediaAgent. For more information, see Kernel Parameter Configuration.

Tips:

  • The DDB must be stored on solid-state drives (SSDs) that are local to the MediaAgent. Before setting up the DDB, validate the storage volumes for high performance using a tool that measures IOPS (input/output operations per second); a minimal sketch follows these tips. For optimal backup performance, the DDB needs to be on a fast, dedicated disk.

  • The DDB disk on a Windows MediaAgent should be formatted at a 32 KB file system allocation unit size to reduce the impact of NTFS fragmentation over time. The DDB disk on a Linux MediaAgent should be formatted at a 4 KB file system block size.
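
A minimal sketch of a random-read IOPS probe against a scratch file on the candidate DDB volume; the path and sizes are examples, and a dedicated benchmarking tool gives more rigorous numbers because this sketch does not bypass the page cache.

    import os, random, time

    path = "/ddb-volume/iops_test.bin"    # example path on the candidate DDB volume
    block = 4096                          # 4 KB random reads
    file_size = 256 * 1024 * 1024         # 256 MB scratch file

    # Create the scratch file once.
    with open(path, "wb") as f:
        f.truncate(file_size)

    fd = os.open(path, os.O_RDONLY)
    ops, deadline = 0, time.time() + 10   # measure for 10 seconds
    while time.time() < deadline:
        offset = random.randrange(0, file_size // block) * block
        os.pread(fd, block, offset)       # os.pread is Unix-only
        ops += 1
    os.close(fd)
    print(f"~{ops / 10:.0f} random-read IOPS (page cache not bypassed)")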

Deduplication Components

Consider the following aspects before configuring deduplication in your environment.

Storage Policy

Deduplication is centrally managed through storage policies. Each storage policy can maintain its own deduplication settings or can be associated with a global deduplication storage policy. Depending on the type of data and the production data size, you can use a dedicated storage policy or a global deduplication policy.

Deduplication Storage Policy

A dedicated deduplication storage policy consists of one library, one deduplication database and one or more MediaAgents. For scalability purposes, using a dedicated deduplication policy allows for the efficient movement of very large amounts of data.

Dedicated policies are recommended when backing up large amounts of data with separate data types that do not deduplicate well against each other, such as database and file system data. If you enable horizontal scaling for deduplication databases, the dedicated policies are created automatically.

For more information, see Data Protection and Archiving Deduplication.

Global Deduplication Policy

A global deduplication storage policy provides one large global deduplication database that can be shared by multiple deduplication storage policy copies. Each storage policy can manage specific content and its own retention rules. However, all participating storage policy copies share the same data paths (which consist of MediaAgents and disk library mount paths) and the global deduplication database.

Notes:

  • Client computers and subclients cannot be associated with a Global Deduplication Storage Policy. They should be associated only with standard storage policies.

  • Once a storage policy copy is associated with a Global Deduplication Storage Policy, you cannot change the association.

  • Multiple copies within a storage policy cannot use the same Global Deduplication Storage Policy.

A global deduplication policy is recommended:

  • For data that exists in multiple remote sites and is being consolidated into a centralized data center.

  • For small data sizes with different retention requirements.

For more information, see Global Deduplication.

Deduplication Database

  • Place the DDB locally on the MediaAgent, in a folder on a different volume or partition from the root file system. The faster the disk performance, the more efficient the data protection and deduplication process will be.

    Do not host the DDB under the software installation directory, for example, the software_installation_directory\Base directory.

The DDB backup process uses VSS (Windows), LVM (Unix), or thin volume (Unix) snapshots to create a snapshot of the DDB. Consider the following to improve your backup process.

  • When the DDB is on an LVM volume, verify that the volume group has enough space for the LVM snapshot. Maintain at least 15% unallocated space in the volume group.

  • Verify that the amount of copy-on-write (COW) space that is reserved for snapshots is at least 10% of the logical volume size (a sketch for checking free volume group space follows this list). For instructions on reserving the space, see Modifying Copy-on-Write Space Size for Snapshots.

    Note: You can add more partitions to an existing deduplication database (DDB) that is used by a storage policy enabled with deduplication. For more information, see Configuring Additional Partitions for a Deduplication Database.

  • For a partitioned DDB, host each partition on a different physical drive for better performance.

  • When hosting the DDB on different MediaAgents, ensure that all MediaAgents run the same type of 64-bit operating system and are online.

  • Do not host the DDB on a MediaAgent that is on a CommServe with Live Sync configured.

  • For Windows MediaAgents, we recommend hosting the DDB on a fast, dedicated disk formatted at 32 KB, with dedicated disk libraries formatted at 64 KB or a higher block size up to 512 KB, if supported by the operating system. For Linux MediaAgents, we recommend DDB disks formatted at a 4 KB block size.

  • Configure LUNs so that no more than one DDB is configured on any one RAID group.

  • Configure no more than two DDBs per MediaAgent, each on a different LUN group.

  • You must use only one deduplication database (DDB) path for all the storage pools that are associated with a MediaAgent. The DDB path can be the root directory of a drive (for example, D:) or a folder (for example, D:\DDB) on the MediaAgent.

    If you use D:\DDB as the DDB path for the first storage pool on a MediaAgent, then you must use the same folder, D:\DDB, as the DDB path for all the other storage pools that you create on the same MediaAgent.
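
A minimal sketch for checking that a volume group keeps at least 15% unallocated space for LVM snapshots, assuming the standard LVM2 vgs utility is available; the volume group name is an example.

    import subprocess

    def vg_free_percent(vg: str) -> float:
        # vgs with machine-friendly output: no headings, sizes in plain bytes.
        out = subprocess.run(
            ["vgs", "--noheadings", "--units", "b", "--nosuffix",
             "-o", "vg_size,vg_free", vg],
            capture_output=True, text=True, check=True).stdout.split()
        size, free = float(out[0]), float(out[1])
        return 100.0 * free / size

    vg = "vg_ddb"  # example name for the volume group hosting the DDB
    if vg_free_percent(vg) < 15.0:
        print(f"WARNING: {vg} has less than 15% unallocated space for LVM snapshots")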

Disk Library

The disk library consists of disk devices that point to the location of the disk library folders. Each disk device may have a read/write path or a read-only path. The read/write path is used by the MediaAgent controlling the mount path to perform backups. The read-only path allows an alternate MediaAgent to read data from the host MediaAgent, so that restores or auxiliary copy operations can run while the local MediaAgent is busy.

For deduplication backups:

  • Run the disk performance tool to test the performance of read and write operations on a disk; a simple throughput sketch follows below.

    See Disk Performance Tool for instructions.

  • Non-deduplicated data should be backed up to a separate disk library.

  • Separating data types into different disk libraries allows for easier reporting on the overall deduplication savings.

If non-deduplicated and deduplicated data are written to a single library, the overall disk usage information is skewed, making space usage prediction difficult.
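
As a rough complement to the disk performance tool, the following sketch measures sequential write throughput to a mount path; the path and sizes are examples.

    import os, time

    path = "/mnt/disklib/throughput_test.bin"  # example path on a disk library mount path
    block = 1024 * 1024                        # 1 MB sequential writes
    total = 512 * 1024 * 1024                  # write 512 MB in total

    buf = os.urandom(block)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(total // block):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())                   # force data to disk before timing stops
    elapsed = time.time() - start
    print(f"~{total / elapsed / 1024 / 1024:.0f} MB/s sequential write")
    os.remove(path)                            # clean up the scratch file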

Follow the best practices recommended by your disk storage vendor for disk partitioning, to allow for easier maintenance of the disk library.

When you commission disk storage, plan for and measure its optimal performance before running your data protection operations. For more information, see Disk Library Volume Performance.

Note: If you have configured disk mount paths that do not support sparse files (for example, NAS mount paths), but you want to reclaim unused disk space, then you can use the Reclaim idle space on Mount Paths with no drill hole capability option.

For disk storage, the mount paths can be divided into two types:

NAS paths (Disk Library over shared storage)

  • This is the preferred method for a mount path configuration.

  • With NAS paths, the disk storage is on the network and the MediaAgent connects to it through a network protocol.

  • If a MediaAgent goes offline, the disk library is still accessible by the other MediaAgents that share the library.

Direct Attached Block Storage (Disk Library over Direct Attached Storage)

  • With direct-attached block storage (SAN), the mount paths are locally attached to the MediaAgent.

  • If the MediaAgent is lost, the disk library goes offline.

  • In a direct-attached design, configure the mount paths as mount points instead of drive letters. This allows larger-capacity solutions to configure more mount paths than the available drive letters would permit.

  • Smaller capacity sites can use drive letters as long as they do not exceed the number of available drive letters.

Block Size

We recommend using the default block size of 128 KB for disk storage and 512 KB for cloud storage. If cloud storage is used for secondary copies that use disk copies as the source, then we recommend using the same block size as the source copy.

For a complete cloud environment where all copies use cloud storage, we recommend the default block size of 512 KB.

For a mixed environment, where some workloads use cloud storage for both primary and secondary copies and other workloads use disk storage for primary copies with cloud storage for secondary copies, we recommend creating separate storage pools with different block sizes, as follows:

  • 512 KB (default) for complete cloud workloads

  • 128 KB for secondary copies that use disk copies as source

You can configure the block size from the Storage Policy Properties - Advanced tab. When configuring a global deduplication policy, all other storage policy copies that are associated with the global deduplication policy must use the same block size. To modify the block size of a global deduplication policy, see Modifying Global Deduplication Policy Settings for instructions.

Note

If you change the block size without sealing the deduplication database, a new baseline is created.
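
The selection rules above reduce to a small lookup. In this sketch the function and parameter names are illustrative, not product settings.

    def choose_block_size_kb(copy_storage: str, source_storage: str | None = None) -> int:
        # copy_storage: "disk" or "cloud" for this copy.
        # source_storage: storage type of the source copy, for secondary copies.
        if copy_storage == "cloud":
            # A cloud secondary copy sourced from disk matches the source (128 KB);
            # all-cloud copies use the 512 KB default.
            return 128 if source_storage == "disk" else 512
        return 128  # default for disk storage

    assert choose_block_size_kb("disk") == 128
    assert choose_block_size_kb("cloud") == 512
    assert choose_block_size_kb("cloud", source_storage="disk") == 128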

Application Read Size

Application read size is the size of the data read from the clients for data transfer during backup operations. By default, the application read size is set to 512 KB.

Compression

By default, when a deduplication storage policy is configured, source-side compression is automatically enabled at the storage policy level. This setting overrides the subclient compression settings.

When a global deduplication storage policy is configured, the compression settings on the global deduplication policy override the storage policy compression settings.

For more information, see Data Compression.
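
A sketch of the override order described above, with illustrative names; the effective setting comes from the highest configured level.

    def effective_compression(global_policy: bool | None,
                              storage_policy: bool | None,
                              subclient: bool | None) -> bool | None:
        # The global deduplication policy overrides the storage policy, which in
        # turn overrides the subclient setting; None means "not configured".
        for setting in (global_policy, storage_policy, subclient):
            if setting is not None:
                return setting
        return None

    # Storage-policy-level source-side compression wins over the subclient setting.
    assert effective_compression(None, True, False) is True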

Datapaths

Consider the following when using SAN storage for data path configuration:

  • When using SAN storage for the mount path, use Alternate Data Paths > When Resources are offline.

    If a data path fails or is marked offline for maintenance, the job will failover to the next data path configured in the Data Path tab.

    Although Round-Robin between Data Paths works for SAN storage, it is not recommended because of the performance penalty during DASH copies and restores, caused by the multiple hops that must occur to restore or copy the data.

Consider the following when using NAS storage for data path configuration:

  • When using NAS storage for the mount path, Round-Robin between data paths is recommended. This is configured in the Copy Properties > Data Path Configuration tab of the storage policy. If using a global deduplication policy, the data path configuration is configured in each associated storage policy, not in the Global Deduplication Policy properties.

  • NAS mount paths do not have the same performance penalty because the network communication is directly between the servicing MediaAgent and the NAS mount path; a summary sketch follows this list.
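
A compact summary of the two recommendations above, as a sketch with an illustrative helper name; the returned labels quote the option names used in this section.

    def recommended_data_path_option(storage_type: str) -> str:
        # SAN mount paths: fail over only when a path is offline, to avoid the
        # DASH copy/restore penalty of extra hops. NAS mount paths: round-robin.
        if storage_type == "SAN":
            return "Alternate Data Paths > When Resources are offline"
        if storage_type == "NAS":
            return "Round-Robin between Data Paths"
        raise ValueError(f"unknown storage type: {storage_type}")

    assert recommended_data_path_option("NAS") == "Round-Robin between Data Paths"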

Deduplicating Different Data Types

For best performance and scalability when backing up different data types (such as file system data, SQL data, and Exchange data) that exceed the suggested capacity referenced in Hardware Specifications for MediaAgent, it is best practice to use different global deduplication policies to protect the different data types.

Designing for Remote Sites

Consider a setup with multiple remote sites and a centralized data center. Each remote site backs up the internal data using individual storage policies and saves a copy of the backup locally and on the centralized data center. Here, redundant data within the individual sites can be eliminated using deduplication on primary copies at the remote site. Secondary copies stored at the data center might contain redundant data among the sites. This redundant data can be identified and eliminated using global deduplication on the secondary copies.

For instructions on how to set up remote office backups, see Global Deduplication.

Horizontal Scaling of DDBs

When you create a storage pool with deduplication, the software creates a DDB named StoragePoolName_DDBStoreID. When you perform a backup operation for a subclient, the software renames the DDB to StoragePoolName_SubclientDataType_DDBStoreID, where SubclientDataType is Files for File System agents, VMs for virtual machines, and Databases for databases. The software also renames DDBs that existed before you enabled horizontal scaling; the existing subclients of all data types still back up to the renamed current DDB. When you perform a backup operation for a subclient of a different data type, the software creates a new DDB for that data type. Any new subclients are associated with a DDB based on their data type (see the naming sketch below).

When a DDB of a data type is marked full upon reaching the system threshold limits, the software automatically creates a new DDB for the data type and associates any new subclients of the data type to the new DDB.

Horizontal scaling improves deduplication efficiency because similar data types deduplicate more efficiently than dissimilar data types.
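
A sketch of the naming scheme described above; the mapping keys and helper name are illustrative.

    def ddb_name(storage_pool: str, agent_type: str, store_id: int) -> str:
        # Data-type suffixes from the description above: Files for File System
        # agents, VMs for virtual machines, Databases for databases.
        suffix = {"file_system": "Files",
                  "virtual_machine": "VMs",
                  "database": "Databases"}[agent_type]
        return f"{storage_pool}_{suffix}_{store_id}"

    assert ddb_name("Pool1", "virtual_machine", 42) == "Pool1_VMs_42"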

When you create a storage pool with deduplication, if one of the following conditions is true, then horizontal scaling does not apply for the copy:

  • You select a library that is associated with a MediaAgent that is on Service Pack 14 or an earlier version.

  • The deduplication database or a deduplication database partition is hosted on a MediaAgent that is on Service Pack 14 or an earlier version.

When you enable horizontal scaling, the auxiliary copy operation for a storage policy copy uses the Use Scalable Resource Allocation option by default, and you cannot turn off the option. For more information, see Performing an Auxiliary Copy Operation.

You can review and update values for the DDB horizontal scaling threshold free space percentage and DDB horizontal scaling threshold number of subclients per DDB parameters in the Media Management Configuration: Deduplication tab.

DDB horizontal scaling threshold number of primary entries per DDB:

When the average number of primary records available on a DDB partition disk reaches the threshold of 800 million, the software creates a new DDB.

DDB horizontal scaling threshold QI Time:

The Query and Insert (QI) time threshold is 1000 μs, averaged over 30 days. This trigger applies only when the average number of primary records on a DDB partition disk is 200 million or more.

When the 30-day average QI time exceeds the threshold and the average number of primary records is above 200 million per partition, the software creates a new DDB (a sketch of both triggers follows).
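
A sketch of the two triggers, using the thresholds quoted above; the constant and function names are illustrative.

    PRIMARY_RECORDS_FULL = 800_000_000   # per-partition primary record threshold
    QI_TIME_THRESHOLD_US = 1000          # 30-day average QI time threshold (microseconds)
    QI_MIN_RECORDS = 200_000_000         # QI trigger applies only above this size

    def should_create_new_ddb(avg_primary_records: int, avg_qi_time_us: float) -> bool:
        if avg_primary_records >= PRIMARY_RECORDS_FULL:
            return True
        # A high 30-day average QI time triggers a new DDB only once the store
        # holds 200 million or more primary records per partition.
        return avg_qi_time_us > QI_TIME_THRESHOLD_US and avg_primary_records >= QI_MIN_RECORDS

    assert should_create_new_ddb(850_000_000, 300)
    assert should_create_new_ddb(250_000_000, 1200)
    assert not should_create_new_ddb(150_000_000, 1200)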

Note

  • Any new subclients from client computers on Service Pack 14 or a more recent service pack are associated with the new deduplication databases that are created with horizontal scaling for backups.

  • The existing and any new subclients from client computers on Service Pack 13 or an earlier service pack continue to use the earlier deduplication database for backups.

  • The software performs a weekly check on the full DDB. When the number of records in the full DDB exceeds the system threshold, the software starts associating new subclients with the new DDB.

  • You can use the Move Clients to New DDB workflow to move the subclients of a client computer from the previous full DDB to a new DDB created for their data type. For instructions, see Moving Clients from a Full Deduplication Database to a New Deduplication Database.

  • When the 30-day average QI time crosses 600 μs, the software generates a Major severity alert.

  • When the 30-day average QI time crosses 800 μs, the software generates a Critical severity alert. (A sketch of the alert levels follows this list.)
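
A sketch of the alert levels, with an illustrative helper name; the thresholds are the 30-day average QI times quoted above.

    def qi_time_alert(avg_qi_time_us: float) -> str | None:
        # Returns the alert severity for a 30-day average QI time, or None.
        if avg_qi_time_us > 800:
            return "Critical"
        if avg_qi_time_us > 600:
            return "Major"
        return None

    assert qi_time_alert(650) == "Major"
    assert qi_time_alert(900) == "Critical"
    assert qi_time_alert(400) is None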
