Deduplication has been one of the useful features of Windows Server since the launch of Windows Server 2012.
It is a native feature, added through Server Manager, that helps system administrators plan server storage and network volume management.
Most Server administrators rarely talk about this feature until it is time to address the organization’s storage crunch.
Data deduplication works by identifying identical data blocks and keeping a single copy as the central source that every duplicate refers to, thus reducing the spread of redundant data across storage. Deduplication takes place at the file or block level, giving you more free space on the server.
Block-level deduplication requires special, relatively expensive hardware components because of the complex processing involved.
File-level deduplication is less complicated and does not require the additional hardware. As such, in most cases, administrators implementing deduplication prefer the file approach.
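The difference between the two approaches can be illustrated with a minimal Python sketch. It is not how Windows Server implements deduplication; the function names, the SHA-256 hash, and the tiny 4-byte block size are assumptions chosen only to keep the example readable.

```python
import hashlib

def file_level_dedup(files):
    """Keep one copy per distinct file content (hypothetical helper)."""
    store = {}   # content hash -> file bytes
    index = {}   # file name -> content hash
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)
        index[name] = digest
    return store, index

def block_level_dedup(files, block_size=4):
    """Keep one copy per distinct fixed-size block (hypothetical helper)."""
    store = {}   # block hash -> block bytes
    index = {}   # file name -> ordered list of block hashes
    for name, data in files.items():
        hashes = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)
            hashes.append(digest)
        index[name] = hashes
    return store, index

def stored_bytes(store):
    """Total bytes physically kept in a store."""
    return sum(len(value) for value in store.values())

# Three small sample files: an original, an exact copy, and a revision.
files = {
    "report_v1.txt": b"AAAABBBBCCCC",
    "report_copy.txt": b"AAAABBBBCCCC",
    "report_v2.txt": b"AAAABBBBDDDD",
}

file_store, file_index = file_level_dedup(files)
block_store, block_index = block_level_dedup(files)

print(sum(len(d) for d in files.values()))  # 36 bytes before deduplication
print(stored_bytes(file_store))             # 24: the exact copy is stored once
print(stored_bytes(block_store))            # 16: shared blocks are stored once
```

The file-level pass only removes the exact copy, while the block-level pass also shares the unchanged blocks of the revised file, which is why it saves more space at the cost of more processing.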
When to Apply Windows Server Deduplication
Because Windows Server deduplication works at the file level, it operates at a higher level than block-level deduplication as it tries to match chunks of data.
File deduplication is an operating system-level feature, meaning that you can enable it within a virtual guest in a hypervisor environment.
Data growth across industries is also driving the demand for deduplication, even though storage hardware keeps getting larger and more affordable.
Deduplication is all about fulfilling this growing demand.
Why Is the Deduplication Feature Found on Servers?
Servers are central to any organization’s data, as users store their information in shared repositories. Not all users embrace new ways of handling their work, and some feel safer making multiple copies of the same work.
Since most Server administrators do the work of managing and backing up users’ data, using the Windows deduplication feature greatly enhances their productivity.
Data deduplication is a straightforward feature and takes only a few minutes to activate.
Deduplication is one of the server roles found on Windows Server, and you do not need a restart for it to work.
However, restarting is a safe way to make sure the entire process is configured correctly.
Preparing for Windows Server Deduplication
- Click on Start
- Open the Run command window
- Enter the following command and press Enter (it runs against the selected volume to estimate the potential space savings): DDPEval.exe
- Right click on the volume in Server Manager to activate data deduplication
The wizard that follows guides you through the deduplication setup according to the type of server in place (choose a VDI or Hyper-V configuration, or File Server).
Set Up the Timing for Deduplication
Deduplication should run on a schedule to reduce the strain on existing resources. You should not aim to save storage space at the expense of overworking the server.
The timing should be set when there is little strain on the server to allow for quick and effective deduplication.
Deduplication requires considerable CPU time because of the numerous activities and processes involved in each job.
Deduplication jobs include optimization, integrity scrubbing, and garbage collection. All these activities should run during off-peak hours unless the server has enough resources to withstand the resulting slowdowns.
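The idea of restricting jobs to a quiet period can be sketched in a few lines of Python. This is only an illustration of the scheduling logic, not the Windows scheduler itself; the function name and the 22:00 to 06:00 window are hypothetical.

```python
from datetime import time

def in_off_peak_window(now, start, end):
    """Return True if `now` falls in [start, end).

    Handles windows that cross midnight, such as 22:00 to 06:00.
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end

# Hypothetical off-peak window for a file server.
WINDOW_START = time(22, 0)
WINDOW_END = time(6, 0)

print(in_off_peak_window(time(23, 30), WINDOW_START, WINDOW_END))  # True
print(in_off_peak_window(time(12, 0), WINDOW_START, WINDOW_END))   # False
```

A job runner could call such a check before starting an optimization, scrubbing, or garbage collection pass and defer the work when the answer is False.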
The capacity that deduplication reclaims varies depending on server use and storage available.
General files, ISOs, office applications files, and virtual disks usually consume much of the storage allocations.
Benefits of Windows Server Deduplication
Windows Server deduplication brings several benefits to an organization, including the following:
- Reduced storage allocation
Deduplication can reduce the storage space consumed by files and backups. An enterprise therefore gets more usable space from the same hardware, reducing the annual cost of storage. With more usable storage, backups run more efficiently and quickly, which can reduce the need for backup tapes.
- Efficient volume replication
Deduplication ensures that only unique data is written to the disk, so less data has to be replicated and network traffic is reduced.
- Freeing up network bandwidth
If deduplication is configured to run at the source, then there is no need to transfer duplicate data over the network.
- Cost-effective solution
Because less extra storage is required at both local and remote locations, power consumption and space requirements drop. The organization buys less storage and spends less on maintaining it, thus reducing the overall storage costs.
- Fast file recovery process
Deduplication ensures faster file recoveries and restorations without straining the day’s business activities.
Features of Deduplication
1. Transparency and Ease of Use
Installation is straightforward on the target volume(s). Running applications and users will not know when deduplication is taking place.
Deduplication works with NTFS file semantics. However, files encrypted with the Encrypting File System (EFS), files smaller than 32 KB, and files with Extended Attributes (EAs) cannot be processed during deduplication.
In such cases, file interaction takes place through NTFS alone, without deduplication. A file with an alternate data stream will only have its primary data stream deduplicated; the alternate stream is left on the disk as it is.
2. Works on Primary Data
This feature, once installed on the primary data volumes, will operate without interfering with the server’s primary objective.
This feature will ignore hot data (files that are active at the time of deduplication) until the files reach a given age in days. Skipping such files maintains the consistency of active files and shortens the deduplication time.
This feature uses the following approach when processing special files:
- Post-processing: new files are written directly to the NTFS volume, where they are evaluated on a regular schedule. By default, background processing checks file eligibility for deduplication every hour. The schedule is configurable.
- File age: a deduplication setting called MinimumFileAgeDays controls how old a file must be before it is processed. The default is 5 days. The administrator can set it to 0 to process all files regardless of age.
- File type and location exclusions: you can instruct the deduplication feature not to process specific file types. You can choose to skip CAB files, which gain nothing from the process, as well as other already-compressed file types such as PNG. You can also direct the feature not to process a particular folder.
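The selection rules above can be summarized as a small eligibility check. The following Python sketch is an illustration only: the class, function, and field names are hypothetical (only MinimumFileAgeDays, the 5-day default, and the 32 KB minimum come from the text), and the sample folder paths are made up.

```python
from dataclasses import dataclass
from pathlib import PureWindowsPath

@dataclass
class DedupPolicy:
    """Hypothetical container for the settings described above."""
    minimum_file_age_days: int = 5          # mirrors MinimumFileAgeDays
    minimum_file_size: int = 32 * 1024      # files under 32 KB are skipped
    excluded_extensions: frozenset = frozenset({".cab", ".png"})
    excluded_folders: tuple = ()

def is_eligible(path, size_bytes, age_days, encrypted, policy):
    """Return True if a file would be picked up by a background pass."""
    file_path = PureWindowsPath(path)
    if encrypted:                                    # EFS files are skipped
        return False
    if size_bytes < policy.minimum_file_size:        # too small to chunk
        return False
    if age_days < policy.minimum_file_age_days:      # still "hot" data
        return False
    if file_path.suffix.lower() in policy.excluded_extensions:
        return False
    for folder in policy.excluded_folders:
        if PureWindowsPath(folder) in file_path.parents:
            return False
    return True

# Example policy with one excluded folder (path is illustrative).
policy = DedupPolicy(excluded_folders=(r"D:\Shares\Temp",))
print(is_eligible(r"D:\Shares\Docs\plan.docx", 200_000, 10, False, policy))  # True
print(is_eligible(r"D:\Shares\Temp\plan.docx", 200_000, 10, False, policy))  # False
```

Setting `minimum_file_age_days` to 0 in this sketch corresponds to processing all files regardless of age.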
3. Portability
Any volume that is under deduplication behaves as an atomic unit. The volume can be backed up and moved to a different location.
Moving it to another server means that everything on that volume is accessible at its new site.
The only thing that you need to change is the schedule timings, because the native Task Scheduler on each server controls the schedule.
If the new server does not have the deduplication feature running, you can only access the files that have not been deduplicated.
4. Minimal Use of Resources
The default operations of the deduplication feature use minimal resources on the primary server.
If deduplication is running and another process runs short of resources, deduplication surrenders its resources to that process and resumes when enough are available.
Here’s how storage resources are utilized:
- The hash index uses few resources and reduces read/write operations so that it can scale to large datasets and deliver high edit/search performance. The index footprint is extremely low, and the index uses temporary partitioning.
- Deduplication verifies the amount of space before it executes. If no storage space is available, it will keep trying at regular intervals. You can schedule and run any deduplication tasks during off-peak hours or during idle time.
5. Sub-file Segmentation
The process segments files into variable-sized chunks of between 32 and 128 KB using an algorithm developed by Microsoft and other researchers.
The segmentation splits each file into a sequence of chunks depending on its content. A Rabin fingerprint, which is based on a sliding-window hash, helps to identify the chunk boundaries.
The average chunk size is 64 KB. Each chunk is compressed and placed into a chunk store hidden in the System Volume Information (SVI) folder.
The normal file is replaced by a reparse point, which is a pointer to the map of all its data streams and is used to reassemble the file when it is requested.
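A rough Python sketch of content-defined chunking follows. It is not Microsoft's algorithm: a simple polynomial rolling hash stands in for the Rabin fingerprint, SHA-256 and zlib are assumed choices for chunk identity and compression, and all names are hypothetical. The size limits follow the 32 to 128 KB range given above, with a mask chosen so that chunks average roughly 64 KB.

```python
import hashlib
import zlib

WINDOW = 48              # bytes covered by the sliding-window hash (assumed)
BASE = 257
MOD = (1 << 61) - 1

def chunk_boundaries(data, min_size=32 * 1024, max_size=128 * 1024,
                     mask=(1 << 15) - 1):
    """Yield (start, end) offsets of content-defined chunks.

    A boundary is declared when the rolling hash of the last WINDOW bytes
    matches the mask, subject to the minimum and maximum chunk sizes.
    """
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        pos = i - start                # offset inside the current chunk
        if pos >= WINDOW:
            rolling = (rolling - data[i - WINDOW] * top) % MOD
        rolling = (rolling * BASE + byte) % MOD
        length = pos + 1
        if (length >= min_size and (rolling & mask) == 0) or length >= max_size:
            yield start, i + 1
            start = i + 1
            rolling = 0
    if start < len(data):
        yield start, len(data)         # final, possibly short, chunk

def deduplicate(files, **chunking):
    """Build a compressed chunk store plus a per-file map of chunk ids."""
    chunk_store = {}   # chunk id -> compressed chunk bytes
    stream_maps = {}   # file name -> ordered chunk ids (the reparse-point role)
    for name, data in files.items():
        chunk_ids = []
        for begin, end in chunk_boundaries(data, **chunking):
            chunk = data[begin:end]
            chunk_id = hashlib.sha256(chunk).hexdigest()
            if chunk_id not in chunk_store:
                chunk_store[chunk_id] = zlib.compress(chunk)
            chunk_ids.append(chunk_id)
        stream_maps[name] = chunk_ids
    return chunk_store, stream_maps

def rehydrate(name, chunk_store, stream_maps):
    """Reassemble a file from its chunk map, as happens on a read request."""
    return b"".join(zlib.decompress(chunk_store[cid])
                    for cid in stream_maps[name])
```

Calling `deduplicate({"a.bin": data, "b.bin": data})` stores each distinct chunk once, and `rehydrate("b.bin", ...)` returns the original bytes. Because boundaries depend on content rather than fixed offsets, two files that share a long prefix produce the same leading chunks.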
Another benefit of deduplication is that its sub-file segmentation and indexing engine is shared with the BranchCache feature.
This sharing is important because, when the data segments on a running Windows Server are already indexed, they can be sent over the network quickly as needed, saving a lot of network traffic within the office or the branch.
How Does Deduplication Affect Data Access?
The fragmentation created by deduplication means file segments are spread across the disk, which increases seek time.
As each file is processed, the filter driver works to keep sequences of segments together on disk rather than scattering them in a random fashion.
Deduplication keeps a cache so that repeated file segments do not have to be read from disk again, which helps in their quick access. When multiple users access the same resource simultaneously, that access pattern speeds up access for each user.
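The effect of such a cache can be shown with a minimal least-recently-used sketch in Python. The class name, the capacity, and the eviction policy are assumptions made for illustration; the text does not describe how the real cache is organized.

```python
from collections import OrderedDict

class ChunkCache:
    """Tiny LRU cache in front of a slow chunk reader (hypothetical)."""

    def __init__(self, read_from_disk, capacity=2):
        self._read_from_disk = read_from_disk
        self._capacity = capacity
        self._entries = OrderedDict()   # chunk id -> chunk bytes, oldest first
        self.disk_reads = 0

    def get(self, chunk_id):
        if chunk_id in self._entries:
            self._entries.move_to_end(chunk_id)   # mark as recently used
            return self._entries[chunk_id]
        self.disk_reads += 1                      # cache miss: go to disk
        data = self._read_from_disk(chunk_id)
        self._entries[chunk_id] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)     # evict least recently used
        return data

# Two users reading files that share the same chunks.
disk = {"c1": b"header", "c2": b"body", "c3": b"footer"}
cache = ChunkCache(disk.__getitem__, capacity=2)
for chunk_id in ["c1", "c2", "c1", "c1", "c2"]:
    cache.get(chunk_id)
print(cache.disk_reads)  # 2: only the first access to each chunk hits the disk
```

The more users request the same chunks at the same time, the larger the share of reads that the cache can answer without touching the disk.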
Here are some important points to note:
- Not much difference is noticed when opening an Office document; users cannot tell whether the feature is running or not.
- When copying one bulky file, the end-to-end copy can take about 1.5 times as long as it would on a non-deduplicated volume.
- When multiple bulky files are transferred simultaneously, caching can make the transfer up to 30% faster.
- When the file-server load simulator (File Server Capacity Tool) is used to test multiple file access scenarios, a reduction of about 10% in the number of users supported will be noticed.
- Data optimization runs at between 20 and 35 MB/s per job, which translates to about 100 GB/hour for a single 2 TB volume using one CPU core and 1 GB of RAM. This is an indicator that multiple volumes can be processed in parallel if additional CPU, disk, and memory resources are available.
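The throughput figures in the last point can be checked with simple arithmetic. The short Python sketch below assumes binary units (1 GB = 1024 MB, 1 TB = 1024 GB); the function names are hypothetical.

```python
def gb_per_hour(mb_per_second):
    """Convert a job throughput in MB/s to GB per hour (1 GB = 1024 MB)."""
    return mb_per_second * 3600 / 1024

def hours_to_optimize(volume_tb, mb_per_second):
    """Hours needed for one job to pass over a volume of the given size."""
    return volume_tb * 1024 / gb_per_hour(mb_per_second)

print(gb_per_hour(20))   # 70.3125 GB/hour at the low end
print(gb_per_hour(35))   # 123.046875 GB/hour at the high end
```

The quoted figure of about 100 GB/hour sits inside this range, and at that rate a single job would need roughly 20 hours to pass over a full 2 TB volume, which is why additional cores and memory matter when several volumes are involved.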
Reliability and Risk Preparedness
Even when you configure the server environment using RAID, there is still a risk of data corruption and loss caused by disk malfunctions, controller errors, and firmware bugs.
Other environmental risks to stored data include radiation or disk vibrations.
Deduplication raises the impact of disk corruption, especially when one file segment that is referenced by thousands of files is located in a bad sector.
Such a scenario creates the possibility of losing thousands of users’ data.
The Windows Server Backup tool supports optimized backups, and a selective file restore API enables backup applications to pull individual files out of an optimized backup.
Detect and Report
When a deduplication filter comes across a corrupted file or section of the disk, a quick checksum validation will be done on data and metadata.
This validation helps to recognize any data corruption during file access, hence reducing accumulated failures.
An extra copy of critical data is created, and any file segment with more than 100 references is treated as a popular chunk and given a duplicate copy.
Once the deduplication process is active, scanning and fixing of errors becomes a continuous process.
Inspection of the deduplication process and host volumes takes place on a regular basis to scrub any logged errors and fix them from alternative copies.
An optional deep scrubber will walk through the whole data set by identifying errors and fixing them, if possible.
When the disk configurations are set to mirror each other, deduplication will look for a better copy on the other side and use it as a replacement.
If there are no other alternatives, data will be recovered from an existing backup.
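The detect-and-repair flow described in this section can be sketched as follows. This Python example is a simplified illustration under stated assumptions: CRC32 stands in for whatever checksum the feature really uses, the class and attribute names are hypothetical, and only the threshold of more than 100 references comes from the text.

```python
import zlib

POPULAR_THRESHOLD = 100   # segments with more references get a spare copy

class ChunkStore:
    """Hypothetical chunk store with checksums and spare copies."""

    def __init__(self):
        self.primary = {}     # chunk id -> bytes
        self.spare = {}       # chunk id -> duplicate of popular chunks
        self.checksums = {}   # chunk id -> CRC32 recorded at write time
        self.references = {}  # chunk id -> number of files using the chunk

    def add_reference(self, chunk_id, data):
        if chunk_id not in self.primary:
            self.primary[chunk_id] = bytes(data)
            self.checksums[chunk_id] = zlib.crc32(data)
            self.references[chunk_id] = 0
        self.references[chunk_id] += 1
        if (self.references[chunk_id] > POPULAR_THRESHOLD
                and chunk_id not in self.spare):
            self.spare[chunk_id] = bytes(self.primary[chunk_id])

    def read(self, chunk_id):
        """Validate on access; repair from the spare copy when possible."""
        expected = self.checksums[chunk_id]
        data = self.primary[chunk_id]
        if zlib.crc32(data) == expected:
            return data
        spare = self.spare.get(chunk_id)
        if spare is not None and zlib.crc32(spare) == expected:
            self.primary[chunk_id] = spare        # fix the damaged copy
            return spare
        raise IOError(f"chunk {chunk_id} is corrupt; restore from backup")

store = ChunkStore()
for _ in range(101):                     # 101 files share the same segment
    store.add_reference("popular", b"shared logo image")
store.add_reference("rare", b"one-off data")

store.primary["popular"] = b"bit rot"    # simulate a bad sector
print(store.read("popular"))             # b'shared logo image'
```

A segment without a spare copy, such as `"rare"` here, raises an error when it fails validation, which corresponds to the last resort of recovering the data from an existing backup.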
Verdict on Deduplication
Some of the features described above do not work in all Windows Server 2012 editions and may be subject to limitations.
Deduplication was built for volumes that support the NTFS data structure.
Therefore, it cannot be used with Cluster Shared Volumes (CSV).
Also, Live Virtual Machines (VMs) and active SQL databases are not supported by deduplication.
Deduplication Data Evaluation Tool
To get a better understanding of the deduplication environment, Microsoft created a portable evaluation tool that installs into the \Windows\System32\ directory.
The tool can be tested on Windows 7 and later Windows operating systems.
It runs as DDPEval.exe and supports local drives as well as mapped or unmapped remote shares.
If you are using a Windows NAS or an EMC/NetApp NAS, you can test it on a remote share.
The native Windows Server deduplication feature is becoming increasingly popular.
It mirrors the needs of a typical server administrator working in production deployments.
However, planning for deduplication before implementation is necessary because of the various situations in which its use may not be applicable.