Deduplication Building Block Guide
A Building Block is a combination of server and storage that provides a modular approach to data management.
This building block guide illustrates how to choose the right number of deduplication nodes depending on your environment, storage requirements and the production data size. Choosing the right deduplication model will allow you to protect large amounts of data with minimal infrastructure, faster backups and better scalability.
For a building block, choose efficient servers with fast processors and enough memory to deliver good performance and scalability.
Before setting up a building block, plan for sufficient storage space that balances cost, availability, and performance. Sufficient storage space includes:
- Space for deduplication database
- Space for configuring disk library
DDB Backup on Cloud
Deduplication Database (DDB) backup is not recommended on Archive Object Storage, such as Glacier, Oracle Archive, or Azure Archive.
We recommend that you perform DDB backups to a Disk library. If, however, a Cloud library is used, create a new Cloud library that does not have Archive Object Storage enabled (such as Amazon S3) as the target, and use it for both the DDB and Index backups.
Deduplication System Requirements
DDB can be hosted on any of the following operating systems:
Important: Partitioned DDB is supported only on the x64 version of the following operating systems.
- Windows: All platforms on which Windows MediaAgent is supported, except 64-bit editions on Intel Itanium (IA 64) and Windows XP. Supported on NTFS and ReFS. For more information on supported Windows MediaAgent, see MediaAgent System Requirements.
- Linux: All platforms on which Linux MediaAgent is supported, except Power PC (includes IBM System p). Supported on ext3, ext4 and XFS. For more information on supported Linux MediaAgent, see MediaAgent System Requirements.
- Microsoft Cluster Service (MSCS): Clusters supported by Windows MediaAgents. Supported on NTFS and ReFS.
- Linux clusters: Clusters supported by Linux MediaAgents. Supported on ext3, ext4 and XFS.
The hardware requirements for the MediaAgent that hosts the DDB are explained in Hardware Specifications for MediaAgent.
Tip: For optimal backup performance, the DDB needs to be on a fast, dedicated disk formatted at 32 KB or larger block size. The DDB must be stored on solid state drives (SSD) that are local to the MediaAgent. Before setting up the DDB, validate the storage volumes for high performance by using a tool that measures IOPS (input/output operations per second).
Consider the following aspects before configuring deduplication in your environment.
Deduplication is centrally managed through storage policies. Each storage policy can maintain its own deduplication settings or can be associated with a global deduplication storage policy. Depending on the type of data and the production data size, you can use a dedicated storage policy or a global deduplication policy.
Deduplication Storage Policy
A dedicated deduplication storage policy consists of one library, one deduplication database and one or more MediaAgents. For scalability purposes, using a dedicated deduplication policy allows for the efficient movement of very large amounts of data.
Dedicated policies are recommended when backing up large amounts of data with separate data types that do not deduplicate well against each other, such as database and file system data.
For more information, see Data Protection and Archiving Deduplication.
Global Deduplication Policy
A global deduplication storage policy provides one large global deduplication database that can be shared by multiple deduplication storage policy copies. Each storage policy can manage specific content and its own retention rules. However, all participating storage policy copies share the same data paths (which consist of MediaAgents and Disk Library mount paths) and the global deduplication database.
- Client computers and their subclients cannot be associated with a Global Deduplication Storage Policy. They should be associated only with standard storage policies.
- Once a storage policy copy is associated with a Global Deduplication Storage Policy, you cannot change the association.
- Multiple copies within a storage policy cannot use the same Global Deduplication Storage Policy.
Global deduplication policy is recommended:
- For data that exists in multiple remote sites and is being consolidated into a centralized data center.
- For small amounts of data with different retention requirements.
For more information, see Global Deduplication.
- The deduplication database maintains all signature hash records for a deduplication storage policy. A DDB partition that is hosted on a solid-state drive (SSD) might scale up to a Back-End Terabyte (BET) size of 200 TB of data residing on the disk library and 2 PB of application (backup) data, assuming a 10:1 deduplication ratio.
- We also recommend that you locate the DDB locally on the MediaAgent. The faster the disk performance, the more efficient the data protection and deduplication process will be.
- The DDB Backup process uses VSS (Windows), LVM (Unix) or thin volume (Unix) snapshots to create a snapshot of the DDB. Consider the following to improve this backup process.
- For a DDB configured on an LVM volume, ensure that enough disk space is available to accommodate the LVM snapshot. We recommend that you maintain at least 15% of unallocated space on the volume group.
- Also, make sure that the amount of copy-on-write (COW) space reserved when creating snapshots is set to at least 10% of the Logical Volume size. For instructions on reserving the space, see Modifying Copy-on-Write Space Size for Snapshots.
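The two LVM thresholds above come down to simple arithmetic. The following Python sketch is only an illustration and is not part of the product: the function name and parameters are hypothetical, and it assumes that "15% of unallocated space" means that at least 15% of the volume group's total size is unallocated. The sizes themselves would have to come from your own LVM reporting, for example the output of vgs and lvs.

```python
def lvm_snapshot_space_ok(vg_size_gb: int, vg_free_gb: int,
                          lv_size_gb: int, cow_reserve_gb: int) -> dict:
    """Check the two LVM snapshot recommendations from this guide.

    All sizes are whole gigabytes. Integer arithmetic is used so that
    values sitting exactly on a threshold compare reliably.
    """
    # Assumption: at least 15% of the volume group must be unallocated.
    free_ok = vg_free_gb * 100 >= 15 * vg_size_gb
    # COW space reserved for snapshots: at least 10% of the logical volume.
    cow_ok = cow_reserve_gb * 100 >= 10 * lv_size_gb
    return {"vg_free_ok": free_ok, "cow_reserve_ok": cow_ok,
            "ok": free_ok and cow_ok}


# Example: 1000 GB volume group with 150 GB free, 500 GB logical volume,
# 40 GB of COW space reserved (below the 50 GB that 10% would require).
result = lvm_snapshot_space_ok(1000, 150, 500, 40)
print(result)
```

In this example the volume group check passes but the COW reserve check does not, so the combined result is False.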
Note: You can add more partitions to an existing deduplication database (DDB) that is used by a storage policy enabled with deduplication. For more information, see Configuring Additional Partitions for a Deduplication Database.
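As a rough illustration of the per-partition figures quoted earlier in this section (200 TB of back-end data per SSD-hosted partition, at an assumed 10:1 deduplication ratio), the Python sketch below estimates how many DDB partitions a given amount of application data might need. The function name is hypothetical and the result is only a planning estimate; actual limits depend on the hardware described in Hardware Specifications for MediaAgent and on the deduplication ratio that your data really achieves.

```python
import math

# Figures taken from this guide for an SSD-hosted DDB partition.
MAX_BACKEND_TB_PER_PARTITION = 200   # data residing on the disk library
ASSUMED_DEDUP_RATIO = 10             # the 10:1 ratio assumed by the guide


def partitions_needed(application_tb: float,
                      dedup_ratio: float = ASSUMED_DEDUP_RATIO) -> int:
    """Estimate the number of DDB partitions for application_tb of backup data."""
    if application_tb < 0 or dedup_ratio <= 0:
        raise ValueError("application_tb must be >= 0 and dedup_ratio > 0")
    # Back-end size is the application data divided by the dedup ratio.
    backend_tb = application_tb / dedup_ratio
    # A DDB always has at least one partition.
    return max(1, math.ceil(backend_tb / MAX_BACKEND_TB_PER_PARTITION))


print(partitions_needed(2000))   # 2 PB of application data at 10:1
print(partitions_needed(2500))   # slightly more than one partition can hold
```

With these assumptions, 2000 TB of application data fits in one partition, while 2500 TB needs two.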
The Disk Library consists of disk devices that point to the location of the disk library folders. Each disk device may have a read/write path or a read-only path. The read/write path is used by the MediaAgent that controls the mount path to perform backups. The read-only path lets an alternate MediaAgent read the data from the host MediaAgent, which allows restores or auxiliary copy operations to run while the local MediaAgent is busy.
For deduplication backups:
- We recommend that you use dedicated disk libraries formatted with a 64 KB block size for each MediaAgent.
- Run the disk performance tool to test the performance of the read and write operation on a disk.
See Disk Performance Tool for instructions.
- Back up non-deduplicated data to a separate disk library.
- Configuring the data types into separate disk libraries allows for easier reporting on the overall deduplication savings.
If non-deduplicated and deduplicated data are written to a single library, the overall disk usage information is skewed and space usage becomes difficult to predict.
Follow the best practice as recommended by your disk storage vendor for disk partitioning to allow for ease of maintenance for the disk library.
When you commission disk storage, plan for and measure its optimal performance before running your data protection operations. For more information, see Disk Library Volume Performance.
Note: If you have configured disk mount paths that do not support sparse files (for example, NAS mount paths), but you want to reclaim unused disk space, you can use the Reclaim idle space on Mount Paths with no drill hole capability option.
For disk storage the mount paths can be divided into two types:
NAS paths (Disk Library over shared storage)
- This is the preferred method for a mount path configuration.
- In NAS paths the disk storage is on the network and the MediaAgent connects through a network protocol.
- If a MediaAgent goes offline, the disk library is still accessible by other MediaAgents in the library.
Direct Attached Block Storage (Disk Library over Direct Attached Storage)
- In direct attached block storage (SAN) the mount paths are locally attached to the MediaAgent.
- If a MediaAgent is lost then the disk library is offline.
- In a direct attached design, configure the mount paths as mount points instead of drive letters. This allows larger-capacity solutions to configure more mount paths than there are available drive letters.
- Smaller capacity sites can use drive letters as long as they do not exceed the number of available drive letters.
We recommend that you use the default block size of 128 KB for deduplicated storage policies, for backups of all data types.
For storage policies associated with cloud storage libraries, the recommended block size is 512 KB.
Block size can be configured from the Storage Policy Properties - Advanced tab. When configuring the global deduplication policy, all other storage policy copies that are associated with the global deduplication policy must use the same block size. To modify the block size of global deduplication policy, see Modifying Global Deduplication Policy Settings for instructions.
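The requirement that every associated storage policy copy uses the same block size as the global deduplication policy can be checked before making changes. The Python sketch below is a hypothetical illustration only: the PolicyCopy class, the policy names, and the function are made up for this example and do not correspond to any product API. The block sizes would have to be read from the Storage Policy Properties by hand or by your own tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PolicyCopy:
    """Hypothetical record of a storage policy copy and its block size."""
    name: str
    block_size_kb: int


def mismatched_copies(global_block_size_kb: int,
                      copies: List[PolicyCopy]) -> List[str]:
    """Return the names of copies whose block size differs from the global policy."""
    return [c.name for c in copies if c.block_size_kb != global_block_size_kb]


# Example: a global deduplication policy using the default 128 KB block size.
copies = [
    PolicyCopy("sp_filesystem", 128),
    PolicyCopy("sp_sql", 512),        # does not match the global policy
    PolicyCopy("sp_exchange", 128),
]
print(mismatched_copies(128, copies))
```

An empty list means that all associated copies are consistent with the global policy.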
Application Read Size
Application read size is the size of the data read from the clients for data transfer during backup operations. By default, the application read size is set to 512 KB.
To achieve an optimal rate of data transfer during DDB backups, we recommend that you set the read size value to 256 KB.
By default, when a deduplication storage policy is configured, source-side compression is automatically enabled at the storage policy level. This setting overrides the subclient compression settings.
When a global deduplication storage policy is configured, the compression settings on the global deduplication policy override the storage policy compression settings.
For more information, see Data Compression.
Consider the following when using SAN storage for data path configuration:
- When using SAN storage for the mount path, use Alternate Data Paths > When Resources are offline.
If a data path fails or is marked offline for maintenance, the job fails over to the next data path configured in the Data Path tab.
Although Round-Robin between data paths works for SAN storage, it is not recommended because of the performance penalty during DASH copies and restores. The penalty comes from the multiple hops that are needed to restore or copy the data.
Consider the following when using NAS storage for data path configuration:
- When using NAS storage for the mount path, Round Robin between data paths is recommended. This is configured in the Copy Properties > Data Path Configuration tab of the storage policy. If you use a global deduplication policy, the data path is configured in each associated storage policy and not in the Global Deduplication Policy properties.
- NAS mount paths do not have the same performance penalty because the network communication is between the servicing MediaAgent and the NAS mount path directly.
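The difference between the two data path behaviors described above can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not the product's scheduling logic: the function names, MediaAgent names, and job names are hypothetical. The first function models "use alternate data paths when resources are offline" (SAN), and the second models round robin between data paths (NAS).

```python
from itertools import cycle
from typing import Dict, List, Optional


def pick_failover_path(paths: List[str], online: Dict[str, bool]) -> Optional[str]:
    """SAN style: use the first configured path that is online.

    Later paths are used only when earlier ones are offline.
    """
    for path in paths:
        if online.get(path, False):
            return path
    return None  # no data path is available


def assign_round_robin(paths: List[str], jobs: List[str]) -> Dict[str, str]:
    """NAS style: spread jobs across all data paths in rotation."""
    if not paths:
        raise ValueError("at least one data path is required")
    rotation = cycle(paths)
    return {job: next(rotation) for job in jobs}


paths = ["mediaagent1", "mediaagent2"]
# mediaagent1 is offline for maintenance, so the job fails over.
print(pick_failover_path(paths, {"mediaagent1": False, "mediaagent2": True}))
# Three jobs alternate between the two data paths.
print(assign_round_robin(paths, ["job1", "job2", "job3"]))
```

With failover, the second path carries load only while the first is unavailable; with round robin, both paths share the load all the time, which suits NAS mount paths because each MediaAgent reaches the storage directly.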
Deduplicating Different Data Types
For best performance and scalability when backing up different data types (such as file system data, SQL data, and Exchange data) that exceed the suggested capacity referenced in Hardware Specifications for MediaAgent, the best practice is to use separate global deduplication policies for the different data types.
Designing for Remote Sites
Consider a setup with multiple remote sites and a centralized data center. Each remote site backs up its internal data using individual storage policies and saves one copy of the backup locally and another at the centralized data center. Here, redundant data within an individual site can be eliminated by using deduplication on the primary copies at the remote site. The secondary copies stored at the data center might still contain data that is redundant across sites. This redundant data can be identified and eliminated by using global deduplication on the secondary copies.
For instructions on how to set up remote office backups, see Global Deduplication.
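The effect of per-site deduplication compared with global deduplication across sites can be illustrated with a toy Python example. This is a conceptual sketch only: it treats each byte string as one data block, uses SHA-256 as a stand-in signature, and uses made-up site names and contents. It does not describe how the product computes or stores signatures.

```python
import hashlib
from typing import Iterable


def signature(block: bytes) -> str:
    """Stand-in block signature (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(block).hexdigest()


def unique_block_count(blocks: Iterable[bytes]) -> int:
    """Number of blocks that would be stored after deduplication."""
    return len({signature(b) for b in blocks})


# Hypothetical data blocks backed up at two remote sites.
sites = {
    "site_a": [b"os-image", b"os-image", b"sales-db"],
    "site_b": [b"os-image", b"hr-db", b"hr-db"],
}

# Primary copies: each site deduplicates only against its own data.
stored_per_site = {name: unique_block_count(blocks) for name, blocks in sites.items()}

# Secondary copies at the data center: one global deduplication database.
all_blocks = [b for blocks in sites.values() for b in blocks]
stored_globally = unique_block_count(all_blocks)

print(stored_per_site, sum(stored_per_site.values()), stored_globally)
```

Here each site stores two unique blocks locally (four in total), while global deduplication at the data center stores three, because the block shared by both sites is kept only once.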
Last modified: 2/26/2019 7:28:26 AM