Deduplication Overview

Deduplication provides an efficient method to transmit and store data by identifying and eliminating duplicate blocks of data during backups.

Data from Windows, Linux, and UNIX operating systems, as well as from other platforms, can be deduplicated when it is copied to secondary storage. For information about the supported data types, see Deduplication - Support.

Deduplication offers the following benefits:

  • Optimizes use of storage media by eliminating duplicate blocks of data.
  • Reduces network traffic by sending only unique data during backup operations.

How Deduplication Works

The following is the general workflow for deduplication:

  • Generating signatures for data blocks

    A block of data is read from the source and a unique signature for the block of data is generated by using a hash algorithm.

    Data blocks can be compressed (default), encrypted (optional), or both. Data block compression, signature generation, and encryption are performed in that order on the source or destination host.

  • Comparing signatures

    The new signature is compared against a database of existing signatures for previously backed up data blocks on the destination storage. The database that contains the signatures is called the Deduplication Database (DDB).

    • If the signature exists, the DDB records that an existing data block is used again on the destination storage. The associated MediaAgent writes the index information to the DDB on the destination storage, and the duplicate data block is discarded.
    • If the signature does not exist, the new signature is added to the DDB. The associated MediaAgent writes both the index information and the data block to the destination storage.

    Signature comparison is done on a MediaAgent. For improved performance, you can use a locally cached set of signatures on the source host for the comparison. If a signature does not exist in the local cache set, it is sent to the MediaAgent for comparison.

  • Using MediaAgent roles

    During the deduplication process, two different MediaAgent roles are used. These roles can be hosted by the same MediaAgent or by different MediaAgents.

    • Data Mover Role: The MediaAgent has write access to disk libraries where the data blocks are stored.
    • Deduplication Database Role: The MediaAgent has access to the DDB that stores the data block signatures.

    An object (file, message, document, and so on) written to the destination storage may contain one or many data blocks. These blocks might be distributed across the destination storage, and their locations are tracked by the MediaAgent index. This index allows the blocks to be reassembled so that the object can be restored or copied to other locations. The DDB is not used during the restore process.
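The workflow above can be illustrated with a minimal sketch. It is not the product's implementation: the class name DedupStore, the 4-byte block size, and the choice of SHA-256 and zlib are assumptions made only for illustration. It shows blocks being compressed and signed, signatures being checked against a DDB-like table, and restore relying on the index alone.

```python
import hashlib
import zlib

BLOCK_SIZE = 4  # hypothetical, tiny block size so the example is easy to follow


class DedupStore:
    """Toy stand-in for destination storage, a signature table (DDB), and an index."""

    def __init__(self):
        self.ddb = {}     # signature -> number of references to the block
        self.blocks = {}  # signature -> compressed block held on "storage"
        self.index = {}   # object name -> ordered list of block signatures

    def backup(self, name, data):
        signatures = []
        for offset in range(0, len(data), BLOCK_SIZE):
            # Compression happens before signature generation, as in the text.
            block = zlib.compress(data[offset:offset + BLOCK_SIZE])
            signature = hashlib.sha256(block).hexdigest()  # assumed hash algorithm
            if signature in self.ddb:
                self.ddb[signature] += 1        # duplicate: record the reuse, discard the block
            else:
                self.ddb[signature] = 1         # new: add the signature and store the block
                self.blocks[signature] = block
            signatures.append(signature)
        self.index[name] = signatures

    def restore(self, name):
        # Reassembly uses only the index and the stored blocks, not the DDB.
        return b"".join(zlib.decompress(self.blocks[s]) for s in self.index[name])
```

Backing up b"AAAABBBBAAAA" with this sketch stores two blocks instead of three, and restore returns the original bytes.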

Strategies for Deduplication Implementation

You can set up one of the following deduplication implementations:

Source-Side (Client-Side) Deduplication

Recommended. Use source-side deduplication when the MediaAgent and the clients are in a high-latency or low-bandwidth network environment, such as a WAN. You can also use source-side deduplication for Remote Office backup solutions, for example, Laptop Backup (DLO).

Implementing this method reduces the amount of data that is transferred across the network.
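As a rough sketch of why less data crosses the network, the function below signs blocks on the client, consults a local signature cache first, and sends a block only when neither the cache nor the MediaAgent already knows its signature. The names client_side_backup, local_cache, and media_agent_signatures are hypothetical, a plain set stands in for the MediaAgent's signature comparison, and SHA-256 is an assumed hash algorithm.

```python
import hashlib


def client_side_backup(blocks, local_cache, media_agent_signatures):
    """Return the blocks that actually have to be sent over the network."""
    to_send = []
    for block in blocks:
        signature = hashlib.sha256(block).digest()  # assumed hash algorithm
        if signature in local_cache:
            continue  # known duplicate: no lookup on the MediaAgent, no data sent
        # Cache miss: the set membership test simulates asking the MediaAgent.
        if signature not in media_agent_signatures:
            media_agent_signatures.add(signature)
            to_send.append(block)  # only unique blocks are transferred
        local_cache.add(signature)
    return to_send
```

In this sketch, a second backup of unchanged data returns an empty list, so no block data is transferred at all.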

MediaAgent-Side (Storage-Side) Deduplication

Use MediaAgent-side deduplication when the MediaAgent and the clients are in a fast network environment, such as a LAN, and you want to avoid deduplication-related CPU usage on client computers.

When the signature generation option is enabled on the MediaAgent, MediaAgent-side deduplication reduces the CPU usage on the client computers by moving the processing to the MediaAgent.

Global Deduplication

Global deduplication provides greater flexibility in defining retention policies when protecting the data.

Use global deduplication storage policies in the following scenarios:

  • To consolidate Remote Office backup data in one location.
  • When you must manage data types, such as file system data and virtual machine data, by different storage policies but in the same disk library.
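The second scenario can be pictured with a small sketch in which two storage policies keep their own retention settings but write through one shared signature pool. The class names GlobalDedupPool and StoragePolicy, the retention values, and the signature strings are all hypothetical and do not reflect the product's object model.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalDedupPool:
    """Hypothetical shared signature pool backing one disk library."""
    stored: set = field(default_factory=set)

    def write(self, signature):
        """Record a signature; return True only if its block is new to the pool."""
        is_new = signature not in self.stored
        self.stored.add(signature)
        return is_new


@dataclass
class StoragePolicy:
    """Hypothetical policy with its own retention but a shared pool."""
    name: str
    retention_days: int
    pool: GlobalDedupPool
    new_blocks_written: int = 0

    def backup(self, signatures):
        for signature in signatures:
            if self.pool.write(signature):
                self.new_blocks_written += 1


pool = GlobalDedupPool()
file_system = StoragePolicy("file_system", retention_days=30, pool=pool)
virtual_machines = StoragePolicy("virtual_machines", retention_days=90, pool=pool)
file_system.backup(["s1", "s2"])
virtual_machines.backup(["s2", "s3"])  # "s2" is already in the shared pool
```

Here the second policy writes only one new block, because the block with signature "s2" was already stored by the first policy.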

Deduplication to Tape (Silo Storage)

Deduplication to Tape can copy deduplicated data to tape in a deduplicated format.

Deduplication to Tape extends the primary disk storage by managing the disk space and periodically moving the deduplicated data to the secondary storage.

When a restore is requested, only the necessary data is automatically copied from tape back to the disk library, and the data is then restored from there.

DASH Copy

DASH (Deduplication Accelerated Streaming Hash) Copy is an option for a deduplication-enabled storage policy copy. When the option is enabled, an Auxiliary Copy job sends only unique data to that copy. DASH Copy uses network bandwidth efficiently and minimizes the use of storage resources.

DASH Copy transmits only unique data blocks, which reduces the volume and time of an Auxiliary Copy job by up to 90%.

Use DASH Copy when remote secondary copies can be reached only over low-bandwidth connections.
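To make the savings concrete, the sketch below plans a copy by comparing the signatures on the source copy with those already on the destination copy. The function name dash_copy_plan and the signature values are hypothetical, and the numbers are illustrative only. The actual reduction depends on how much of the data already exists at the destination.

```python
def dash_copy_plan(source_signatures, destination_signatures):
    """Return (signatures whose blocks must be sent, fraction of unique blocks not sent)."""
    unique_source = list(dict.fromkeys(source_signatures))  # de-duplicate, keep order
    missing = [s for s in unique_source if s not in destination_signatures]
    if not unique_source:
        return missing, 0.0
    saved = 1 - len(missing) / len(unique_source)
    return missing, saved


# Illustrative case: 9 of 10 blocks already exist on the remote copy.
source = ["s%d" % i for i in range(10)]
destination = set(source[:9])
missing, saved = dash_copy_plan(source, destination)
```

In this illustrative case only one block has to cross the link, so nine tenths of the block transfers are avoided.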

For more information, see DASH Copy.

DASH Full (Accelerated Synthetic Full Backups)

DASH Full is a Synthetic Full operation that updates the DDB and index files for existing data rather than physically copying data like a normal Synthetic Full backup.
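The difference from a normal Synthetic Full can be sketched as a metadata-only operation: the new full is built by merging per-job indexes and incrementing reference counts, and no data block is read or written. The function name dash_full and the data layout (one dictionary per job that maps object names to block signatures) are assumptions for illustration, not the product's internal format.

```python
from collections import Counter


def dash_full(job_indexes, ref_counts):
    """Build the index of a synthetic full from earlier job indexes, oldest first.

    job_indexes: list of dicts mapping object name -> list of block signatures.
    ref_counts:  Counter of signature -> number of references (a DDB stand-in).
    """
    full_index = {}
    for job in job_indexes:
        full_index.update(job)  # a later job replaces an older version of an object
    for signatures in full_index.values():
        ref_counts.update(signatures)  # add references only; block data is untouched
    return full_index


jobs = [
    {"a.txt": ["s1", "s2"], "b.txt": ["s3"]},  # previous full backup
    {"b.txt": ["s4"]},                         # incremental: b.txt changed
]
refs = Counter({"s1": 1, "s2": 1, "s3": 1, "s4": 1})
synthetic_full = dash_full(jobs, refs)
```

After the call, the new full references the latest version of each object, and only the reference counts of the blocks it uses have changed.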

Use DASH Full backup operations to increase performance and reduce network usage for full backups.

DASH Full is used with Simpana OnePass to manage the retention of archived data.