Deduplication Best Practices
Review the following best practices before using deduplication.
Installation and Configuration
Installing Software Updates
When installing updates or patches in a deduplication-enabled setup, ensure all deduplication-enabled jobs are either suspended or stopped prior to installing the updates or patches. This will prevent accidental sealing of deduplication databases due to services being stopped when data protection operations are in progress.
Space Requirements for Deduplication Database
The following calculations can be used to approximately determine the amount of space required for the deduplication database:
(0.2 * Number of blocks above the deduplication size threshold to be backed up on the mount path (KB))
- If a client has 100 files to be backed up
- With a schedule of every day full backup
- For 30 days
Then the size of the deduplication database should be:
(0.2 * 100 * 30) = 600 KB
This size is the maximum space required for the deduplication database. The deduplication database size might be much smaller if there is significant deduplication during backups.
In the above example, if the 100 files do not change at all, the amount will be (0.1 * 100 * 30) KB = 300 KB.
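The arithmetic above can be sketched in a few lines of Python. The function and parameter names are illustrative assumptions, not part of the product. The 0.2 KB and 0.1 KB factors are the ones used in the example, which appears to count each file as one block.

```python
def estimate_ddb_size_kb(blocks_per_backup, backups_retained, kb_per_block=0.2):
    """Rough deduplication database size in KB.

    kb_per_block=0.2 is the upper-bound factor from the example above;
    0.1 is the factor the example uses when the data never changes.
    """
    if blocks_per_backup < 0 or backups_retained < 0:
        raise ValueError("counts must be non-negative")
    return kb_per_block * blocks_per_backup * backups_retained


# 100 blocks, one full backup per day, kept for 30 days
worst_case = estimate_ddb_size_kb(100, 30)
unchanged = estimate_ddb_size_kb(100, 30, kb_per_block=0.1)
print(f"maximum: {worst_case:.0f} KB, unchanged data: {unchanged:.0f} KB")
```

Treat the result as a planning figure only. The actual database size depends on how much of the data deduplicates.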
Deduplication Building Block Guide
The Building Block Guide illustrates how to choose the right number of deduplication nodes for your environment, storage requirements, and production data size. Choosing the right deduplication model allows you to protect a large amount of data with minimal infrastructure, faster backups, and better scalability.
For more information, see Deduplication Building Block Guide.
Determining Number of Partitions for Deduplication Database
Depending on your environment and the application data size, we recommend that you choose the right number of partitions for a deduplication database during storage policy creation. Choosing the right number of partitions involves several important factors, such as:
- Application Data Size or Front-End Terabytes (FET) - Size of the data (files, databases, mailboxes, and so on) on the client computer that needs to be backed up.
- Backup Windows - Time allotted to protect the application data.
- Daily Change Rate - Change rate in the application data size.
- Retention Period - How long the data is to be kept before aging off the system.
Considering the above factors, a deduplication database with a single partition can be configured for environments where the FET is approximately 120 TB or less. Such a database can scale up to a Back-End Terabytes (BET) size of 200 TB of unique data on the disk library. The 120 TB figure for application data is based on the following assumptions.
- Front-End Terabytes (FET)*
- Full Backup Schedule: Once per week
- Incremental Backups Schedule
- Daily Change Rate
- Deduplication Block Size
*Note that the FET size that can be protected with one partition can vary depending on the Retention Period and Block Size settings. For example, if the retention period is longer than the above value, then the amount of data that can be managed by a single partition may be less than 120 TB.
Based on the above assumptions, deduplication databases with two partitions are suitable for environments where the FET size is approximately 120 - 200 TB.
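As a rough illustration, the FET guidance above can be written as a small lookup. This is only a sketch: the function name is hypothetical, the thresholds are the approximate values quoted in the text, and it ignores the retention period and block size, which can lower the limits.

```python
def recommended_partitions(fet_tb):
    """Suggest a partition count for a Front-End Terabytes (FET) figure.

    Thresholds are the approximate values from the text. Returns None
    when the text gives no guidance (FET above roughly 200 TB).
    """
    if fet_tb <= 0:
        raise ValueError("FET must be positive")
    if fet_tb <= 120:
        return 1
    if fet_tb <= 200:
        return 2
    return None


for fet in (80, 150, 250):
    print(fet, "TB ->", recommended_partitions(fet))
```

Returning None rather than guessing a larger partition count keeps the sketch within what the guidance actually states.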
Managing Deduplication Database
Performance Tuning for DDB Backup
The DDB Backup process uses VSS (Windows), LVM (Unix), or thin volume (Unix) snapshots to create a snapshot of the DDB. Consider the following to improve the performance of this process.
- The LVM (Linux) snapshot of a DDB hosted on a Linux machine requires free space for copy-on-write (COW) data, calculated as follows:
The COW space is 5% of the volume space, with a minimum of 4 GB and a maximum of 50 GB. That is:
If 5% of the volume space is less than 4 GB, then 4 GB is used as COW space.
If 5% of the volume space is more than 50 GB, then 50 GB is used as COW space.
This percentage also includes any free space requirement specified by your hardware vendor. However, the snapshot of a DDB hosted on a thin logical volume grows dynamically and shares space from the thin pool if the required free space is not available. For instructions on reserving the space, see Modifying Copy-on-Write Space Size for Snapshots.
- If there is not enough disk space to accommodate the snapshot for a DDB configured on a thin volume (Unix), then the snapshot grows dynamically and shares the thin pool space with other logical volumes.
- For better performance, configure the VSS Shadow Area for the DDB volumes on a different volume (if possible, on a different hard disk) that has less I/O during the backup and does not hold the active paging file.
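The COW space rule for LVM snapshots described above amounts to clamping 5% of the volume size between 4 GB and 50 GB. A minimal sketch follows, with a hypothetical function name and sizes expressed in GB:

```python
MIN_COW_GB = 4   # lower bound stated in the text
MAX_COW_GB = 50  # upper bound stated in the text


def cow_space_gb(volume_gb):
    """Return the COW space in GB: 5% of the volume, clamped to 4..50 GB."""
    if volume_gb <= 0:
        raise ValueError("volume size must be positive")
    five_percent = volume_gb * 5 / 100
    return min(max(five_percent, MIN_COW_GB), MAX_COW_GB)


for size in (40, 200, 2000):
    print(f"{size} GB volume -> {cow_space_gb(size)} GB COW space")
```

A 40 GB volume falls back to the 4 GB minimum, a 200 GB volume uses 10 GB, and a 2000 GB volume is capped at 50 GB.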
Deleting a Deduplication Database
Never delete the Deduplication database manually. The deduplication database facilitates the deduplication of backup jobs and data aging jobs. If deleted, new deduplicated backup jobs cannot be performed and the existing data in the disk mount paths will never be pruned.
Auxiliary Copy operations automatically unravel (or explode) the deduplicated data if deduplication is not enabled on the copy.
If the secondary copy is set up for deduplication, then a separate deduplication database is created for the copy and the associated data is deduplicated for the secondary copy.
However, note that if the secondary copy has deduplication enabled but the DASH Copy option is disabled, the deduplicated data is always unraveled on the source and then deduplicated on the destination.
Data Aging Operations
- Ensure that all data aging operations run in a specific window every week. Also, make sure that all data aging jobs are successful.
- Data Aging operations will automatically look up the Deduplication database before data is deleted from the disk.
- Data Aging will only delete the source data when all the references to a given block are pruned. So if you see older chunks in disk libraries remaining on the volume even after the original data is deleted, it might be because valid deduplication references to the chunk still exist within the data.
- If you have a partially available deduplication database (that is, one of the partitions is offline), then the pruning of backed up data from disk will not be performed until all partitions of the deduplication database are available. So, make sure to recover the offline partitions to reclaim the disk space.
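The pruning rule above behaves like reference counting: a block can be removed only when no job refers to it any longer. The toy model below illustrates the idea. It is not the product's implementation, and the class, method, and job names are hypothetical.

```python
from collections import Counter


class ToyDedupStore:
    """Toy model: one reference count per block signature."""

    def __init__(self):
        self.refs = Counter()  # signature -> number of jobs referencing it
        self.jobs = {}         # job id -> set of signatures it references

    def add_job(self, job_id, signatures):
        sigs = set(signatures)
        self.jobs[job_id] = sigs
        for sig in sigs:
            self.refs[sig] += 1

    def age_job(self, job_id):
        """Age a job and return the signatures whose blocks can be deleted."""
        freed = set()
        for sig in self.jobs.pop(job_id):
            self.refs[sig] -= 1
            if self.refs[sig] == 0:
                del self.refs[sig]
                freed.add(sig)
        return freed


store = ToyDedupStore()
store.add_job("week1", {"a", "b", "c"})
store.add_job("week2", {"b", "c", "d"})
print(sorted(store.age_job("week1")))  # prints ['a']
```

Blocks "b" and "c" stay on disk after the first job is aged because the second job still references them, which is why older chunks can remain on the volume.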
Data Encryption and Data Compression
When Data Encryption and/or Data Compression are enabled, the system automatically runs data compression, the signature module, and data encryption, in that order. If the setup contradicts this order, the system will automatically perform compression, signature generation, and encryption on the source client computer.
However, for database agents, when Data Encryption and/or Data Compression are enabled, the system automatically runs the signature module, the data compression, and data encryption in that order.
When you have a primary copy that is encrypted (and is not deduplicated), enabling deduplication on a secondary copy will not accomplish any viable deduplication on the secondary copy. This is because each backup includes unique encryption keys which in turn will cause unique signatures for each backup.
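The effect of per-backup encryption keys on signatures can be demonstrated with a short sketch. SHA-256 and the XOR keystream below are stand-ins chosen for illustration; the source does not name the product's signature or encryption algorithms, and the toy cipher is not secure.

```python
import hashlib


def signature(data: bytes) -> str:
    """Stand-in block signature (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy XOR stream cipher, for illustration only."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ s for b, s in zip(data, stream))


block = b"the same application data block" * 4
backup1 = toy_encrypt(block, b"key-for-backup-1")
backup2 = toy_encrypt(block, b"key-for-backup-2")

# Signatures taken before encryption match, so the block deduplicates.
print(signature(block) == signature(block))      # True
# Signatures taken after encryption differ, so nothing deduplicates.
print(signature(backup1) == signature(backup2))  # False
```

This is why generating signatures before encryption, as in the order described above, preserves deduplication, while deduplicating already encrypted data does not.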
Avoid WAN Accelerator Appliance for DASH Copy
Using WAN accelerator appliances in conjunction with DASH Copy is not recommended. DASH Copy already sends only unique data efficiently, so the overhead of a WAN accelerator can negatively impact performance with no perceived benefit.
Where WAN bandwidth constraints are in place, we recommend using DASH Copy with source side cache enabled, because this limits the signature lookups over the network.
Deduplication to Tape
Consider the following when using deduplication to tape:
- The Silo storage feature is for long term data preservation and not short term data recovery.
It can be used where you have several TB of data (10 TB or more) with frequent backups and long term retention (that is, a year or more).
- Silo storage is not recommended for short term data preservation because data on the Silo storage does not conform to traditional deduplication pruning.
- Tapes used for silo backups will not be refreshed and made available for re-use until the associated deduplication database has been sealed and all the backup jobs associated with that deduplication database have been aged.
Additionally, silo-enabled deduplication databases that are not sealed periodically will adversely affect large restores. Therefore, sealing a deduplication database at periodic intervals (for example, monthly or quarterly) is recommended.
- Silo storage is not recommended for frequent restore operations (such as recovering data from last week). This feature is useful for recovering data from a specific time period (such as from last year or five years ago). This is because a Silo restore operation requires a large number of silo retrievals spread across multiple tapes to restore the data back to the disk library.
- Silo storage is not recommended where a high deduplication ratio is required, because periodic sealing of the deduplication database is recommended.
Therefore, it is important to note that the Silo storage is less of a data recovery solution and more of a data preservation solution.
Last modified: 3/1/2018 7:33:15 PM