Deduplication - Best Practices
Review the following best practices before using deduplication.
Installation and Configuration
The maximum size of the deduplication database (in KB) can be estimated as:

0.2 * (number of blocks above the deduplication size threshold to be backed up on the mount path)

For example:
- If a client has 100 files to be backed up
- With a full backup scheduled every day
- For 30 days

Then the size of the deduplication database should be:

(0.2 * 100 * 30) = 600 KB
This size is the maximum space required for the deduplication database; the actual size might be much smaller if there is significant deduplication during backups. Hence, in the above example, if the 100 files do not change at all, the size will be approximately (0.1 * 100 * 30) KB = 300 KB.
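The estimate above can be expressed as a small helper. This is an illustrative sketch of the sizing rule, not a vendor-supplied tool; the 0.2 (and 0.1) KB-per-block factors come directly from the example above.

```python
def estimate_ddb_size_kb(blocks_per_backup: int,
                         backups_retained: int,
                         kb_per_block: float = 0.2) -> float:
    """Estimated deduplication database size in KB (worst case by default)."""
    return kb_per_block * blocks_per_backup * backups_retained

# 100 files, a full backup every day, retained for 30 days:
print(estimate_ddb_size_kb(100, 30))       # worst case, about 600 KB
print(estimate_ddb_size_kb(100, 30, 0.1))  # files never change, about 300 KB
```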
The Building Block Guide illustrates how to choose the right number of deduplication nodes depending on your environment, storage requirements, and production data size. Choosing the right deduplication model allows you to protect a large amount of data with minimal infrastructure, faster backups, and better scalability.
For more information, see Deduplication Building Block Guide.
Depending on your environment and the application data size, we recommend that you choose the right number of partitions for the deduplication database during storage policy creation. Choosing the right number of partitions involves several important factors, such as:
- Application Data Size or Front-End Terabytes (FET) - Size of the data (files, databases, mailboxes, and so on) on the client computer that needs to be backed up.
- Backup Windows - Time allotted to protect the application data.
- Daily Change Rate - Change rate in the application data size.
- Retention Period - How long the data is to be kept before aging off the system.
Considering these factors, a deduplication database with a single partition can be configured for environments where the FET is approximately 120 TB or less. This value (120 TB of application data) is based on the following assumptions. Such a database can scale up to a Back-End Terabytes (BET) size of 200 TB of unique data on the disk library.
| Assumption | Value |
| --- | --- |
| Front-End Terabytes (FET)* | 120 TB |
| Full Backup Schedule | Once per week |
| Incremental Backup Schedule | Daily |
| Daily Change Rate | 5% |
| Retention Period | 4 weeks |
| Block Size | 128 KB |
*Note that the FET size that can be protected with one partition varies depending on the Retention Period and Block Size settings. For example, if the retention period is longer than the value above, then the amount of data that can be managed by a single partition may be less than 120 TB.
Based on the above assumptions, a deduplication database with two partitions can be useful in environments where the FET size is approximately 120 - 200 TB.
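The FET ranges above can be summarized as a simple lookup. This is a hedged sketch based only on the thresholds stated here (one partition up to roughly 120 TB FET, two partitions for roughly 120 - 200 TB) under the table's workload assumptions; different retention or block-size settings shift these thresholds.

```python
def suggested_partitions(fet_tb: float) -> int:
    """Suggest a DDB partition count from the FET ranges discussed above."""
    if fet_tb <= 120:
        return 1  # a single partition handles roughly 120 TB FET or less
    if fet_tb <= 200:
        return 2  # two partitions for roughly 120 - 200 TB FET
    # Larger environments are outside the ranges covered here.
    raise ValueError("FET above 200 TB: consult the Deduplication Building Block Guide")

print(suggested_partitions(100))  # 1
print(suggested_partitions(150))  # 2
```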
Managing Deduplication Database
The DDB backup process uses VSS (Windows), LVM (Unix), or thin-volume (Unix) snapshots to create a snapshot of the DDB. During this process, consider the following to improve your backup process.
- For a DDB configured on an LVM volume, ensure that enough disk space is available to accommodate the LVM snapshot. We recommend maintaining at least 15% unallocated space in the volume group.
Also, make sure that the copy-on-write (COW) space reserved when creating snapshots is set to at least 10% of the logical volume size. For instructions on reserving this space, see Modifying Copy-on-Write Space Size for Snapshots.
- For a DDB configured on a thin volume (Unix), if enough disk space is not available to accommodate the snapshot, the snapshot grows dynamically and shares the thin pool space with other logical volumes.
- For better performance, configure the VSS shadow storage area for the DDB volumes on a different volume (if possible, on a different hard disk) that has less I/O during the backup and does not host the active paging file.
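The LVM headroom guidelines above (at least 15% unallocated space in the volume group, and a COW reserve of at least 10% of the logical volume) reduce to a quick calculation. The function below is a hypothetical helper for illustration only; in practice the sizes would come from `vgs`/`lvs` output.

```python
def snapshot_headroom_ok(vg_size_gb: float, vg_free_gb: float,
                         lv_size_gb: float, cow_reserve_gb: float) -> bool:
    """Check the recommended LVM snapshot headroom for a DDB volume."""
    enough_vg_free = vg_free_gb >= 0.15 * vg_size_gb   # >= 15% unallocated in the VG
    enough_cow = cow_reserve_gb >= 0.10 * lv_size_gb   # COW reserve >= 10% of the LV
    return enough_vg_free and enough_cow

# 1 TB volume group with 200 GB free, a 500 GB DDB volume, 60 GB COW reserve:
print(snapshot_headroom_ok(1024, 200, 500, 60))  # True
```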
If the secondary copy is set up for deduplication, a separate deduplication database is created for the copy, and the associated data is deduplicated for the secondary copy.
However, note that if the secondary copy has deduplication enabled but the DASH Copy option is disabled, the deduplicated data is always unraveled (read in full) on the source and then deduplicated again on the destination.
- Ensure that all data aging operations run every week within a specific window. Also, make sure that all data aging jobs are successful.
- Data Aging operations will automatically look up the Deduplication database before data is deleted from the disk.
- Data aging deletes the source data only when all references to a given block are pruned. If older chunks remain in disk libraries even after the original data is deleted, it may be because valid deduplication references to those chunks still exist within the data.
- If the deduplication database is only partially available (that is, one of its partitions is offline), pruning of backed-up data from disk is not performed until all partitions of the deduplication database are available again. Make sure to recover the offline partitions to reclaim the disk space.
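The pruning behaviour described above amounts to reference counting: a block is physically removed from disk only when its last reference is aged off. The sketch below illustrates that rule with a toy in-memory structure (an assumption for illustration, not the product's on-disk format).

```python
from collections import Counter

class BlockStore:
    """Toy reference-counted block store mirroring the pruning rule above."""
    def __init__(self):
        self.refs = Counter()  # block signature -> number of job references
        self.disk = set()      # signatures physically present on disk

    def write(self, signature: str) -> None:
        self.refs[signature] += 1
        self.disk.add(signature)

    def age_off(self, signature: str) -> None:
        self.refs[signature] -= 1
        if self.refs[signature] == 0:  # last reference gone: prune from disk
            self.disk.discard(signature)
            del self.refs[signature]

store = BlockStore()
store.write("abc")            # job 1 references block "abc"
store.write("abc")            # job 2 deduplicates against the same block
store.age_off("abc")          # job 1 ages off; job 2 still references it
print("abc" in store.disk)    # True: the chunk stays on disk
store.age_off("abc")          # last reference pruned
print("abc" in store.disk)    # False: block removed from disk
```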
For database agents, when data encryption and/or data compression are enabled, the system automatically runs signature generation, data compression, and data encryption, in that order.
When the primary copy is encrypted (and not deduplicated), enabling deduplication on a secondary copy will not achieve any viable deduplication on that copy. This is because each backup uses unique encryption keys, which in turn produce unique signatures for each backup.
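The point above can be demonstrated directly: the same plaintext block encrypted under two per-backup keys yields two different ciphertexts, and therefore two different signatures. The snippet uses a toy XOR keystream purely for illustration; it is not real encryption and not the product's cipher.

```python
import hashlib
import os

def toy_encrypt(block: bytes, key: bytes) -> bytes:
    """Toy XOR 'cipher' for illustration only -- not real encryption."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(block))

block = b"identical application data" * 4   # the same block in two backups
key1, key2 = os.urandom(16), os.urandom(16) # unique per-backup keys

sig1 = hashlib.sha256(toy_encrypt(block, key1)).hexdigest()
sig2 = hashlib.sha256(toy_encrypt(block, key2)).hexdigest()
print(sig1 != sig2)  # True: unique keys produce unique signatures, so no dedup
```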
Using WAN accelerator appliances in conjunction with DASH Copy is not recommended. Because DASH Copy already sends only unique data efficiently, the overhead of a WAN accelerator can negatively impact performance with no perceivable benefit.
Where WAN bandwidth constraints exist, we recommend using DASH Copy with the source-side cache enabled, because this limits the number of signature lookups over the network.
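The effect of a source-side cache can be sketched as follows: signatures already seen locally are answered from the cache, so only unseen signatures trigger a lookup over the WAN. The helper names below are assumptions for illustration; the real client logic is internal to the product.

```python
def dash_copy(signatures, remote_has, source_cache):
    """Return how many signature lookups crossed the network."""
    network_lookups = 0
    for sig in signatures:
        if sig in source_cache:   # answered locally: no WAN round trip
            continue
        network_lookups += 1      # unseen signature: ask the destination
        remote_has(sig)
        source_cache.add(sig)
    return network_lookups

# Six blocks, but only three distinct signatures:
sigs = ["a", "b", "a", "c", "b", "a"]
lookups = dash_copy(sigs, remote_has=lambda s: True, source_cache=set())
print(lookups)  # 3: only the first occurrence of each signature hits the WAN
```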
Consider the following when using deduplication to tape:
- The silo storage feature is for long-term data preservation, not short-term data recovery.
It can be used where you have several terabytes of front-end data (10 TB or more) with frequent backups and long-term retention (that is, a year or more).
Silo storage is not recommended for short-term data preservation because data in silo storage does not conform to traditional deduplication pruning.
- Tapes used for silo backups are not refreshed and made available for reuse until the associated deduplication database has been sealed and all backup jobs associated with that deduplication database have been aged.
Additionally, a silo-enabled deduplication database that is not sealed periodically results in large restores. Therefore, sealing the deduplication database at periodic intervals (for example, monthly or quarterly) is recommended.
- Silo storage is not recommended for frequent restore operations (such as recovering data from last week). The feature is useful for recovering data from a specific time period (such as last year or five years ago). This is because a silo restore operation requires a large number of silo retrievals, spread across multiple tapes, to restore the data back to the disk library.
- Silo storage is not recommended for environments with a high deduplication ratio, because periodic sealing of the deduplication database is recommended.
Therefore, it is important to note that silo storage is less a data recovery solution and more a data preservation solution.