Data deduplication is regarded as one of the most useful features of Windows Server since its introduction in Windows Server 2012. It is a native feature, added through Server Manager, that gives system administrators more room to plan server storage and network volume management.
Most server administrators rarely talk about this feature until it is time to address the organization’s storage crunch. Data deduplication identifies duplicate data blocks and keeps a single copy as the central source, reducing the spread of redundant data across the storage areas. Deduplication works at the file or block level and gives you more free space on the server.
Block-level deduplication requires special hardware components, which are relatively expensive; the extra hardware is needed because of the complex processing requirements. File-level deduplication is less complicated and does not require additional hardware. In most cases, administrators implementing deduplication prefer the file approach.
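The idea of keeping one central copy of identical data can be illustrated with a minimal Python sketch. This is not how Windows Server implements the feature; the function name `dedupe_files`, the sample file names, and the use of SHA-256 as the content identifier are all assumptions made for illustration.

```python
import hashlib

def dedupe_files(files):
    """Store each distinct file content once (hypothetical helper).

    files maps a file name to its content as bytes.
    Returns (store, index): store maps a content digest to the single
    kept copy, index maps every file name to the digest of its content.
    """
    store = {}
    index = {}
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)  # keep only the first copy seen
        index[name] = digest            # every file points at that copy
    return store, index

# Illustrative data: two users saved the same document under different names.
files = {
    "report.docx": b"quarterly numbers",
    "report_copy.docx": b"quarterly numbers",
    "notes.txt": b"meeting notes",
}
store, index = dedupe_files(files)
print(len(files), "files,", len(store), "stored copies")
```

Every file name still resolves to its content through the index, but identical content occupies storage only once.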
When to Apply Windows Server Deduplication
Windows Server deduplication operates at the file-system level, which is higher than the block level, and it works by matching chunks of data within files. Because it is an operating system feature, you can enable it inside a virtual guest running in a hypervisor environment.
Data growth across industries is also driving the demand for deduplication, even though storage hardware is becoming larger and more affordable. Deduplication is all about meeting that growing demand.
Why Is the Deduplication Feature Found on Servers?
Servers are central to any organization's data, as users store their information in the server repositories. Not all users embrace new ways of handling their work, and some feel safer making multiple copies of the same work. Much of a server administrator's work is managing and backing up user data, and the Windows dedupe feature makes that work easier.
Data deduplication is a straightforward feature and takes only a few minutes to activate. Deduplication is one of the server roles found on Windows servers, and you do not need a restart for it to work. However, a restart is a safe way to make sure the entire process is configured correctly.
Preparing for Windows Server Deduplication
- Click Start
- Open the Run command window
- Enter the command below and press Enter (this command runs against the selected volume to analyze the potential space savings)
- Right-click the volume in Server Manager to activate data deduplication
- The wizard that follows will guide you through the deduplication process depending on the type of server in place (choose a VDI or Hyper-V configuration, or File Server)
Setting Up the Timing for Deduplication
Deduplication should run on a schedule to reduce the strain on existing resources. You should not aim to save storage space at the expense of server performance. Schedule it for a time when there is little load on the server to allow for quick and effective deduplication.
Deduplication requires considerable CPU time because of the numerous activities and processes involved in each job. The deduplication jobs include optimization, integrity scrubbing, and garbage collection. All these activities should run outside peak hours unless the server has enough resources to withstand system slowdowns.
The capacity that deduplication reclaims varies depending on server use and the storage available. General files, ISOs, Office application files, and virtual disks usually consume much of the storage.
Benefits of Windows Server Deduplication
Deduplication brings these direct benefits to the organization:
Reduced Storage Allocation
Deduplication can reduce the storage space used by files and backups. An enterprise therefore gets more usable storage, which reduces the annual cost of storage hardware. With enough storage, operations are more efficient and faster, and the need for backup tapes is reduced.
Efficient Volume Replication
Deduplication ensures that only unique data is written to the disk, which reduces network traffic.
Increasing Network Bandwidth
If deduplication is configured to run at the source, duplicate data does not need to be transferred over the network.
Power consumption is reduced, and less space is required for extra storage at both local and remote locations. The organization buys less storage and spends less on maintaining it, which reduces the overall storage costs.
Deduplication ensures faster file recoveries and restoration without straining the day’s business activities.
Features of Deduplication
Transparency and Ease of Use
Installation is straightforward on the target volume(s). Running applications and users will not know when deduplication takes place. The feature is compatible with NTFS file semantics. Files encrypted with the Encrypting File System (EFS), files smaller than 32 KB, and files with Extended Attributes (EAs) cannot be processed during deduplication. In such cases, file interaction takes place directly through NTFS and not through deduplication. A file with alternate data streams will have only its primary data stream deduplicated; the alternate streams are left on the disk as they are.
Works on Primary Data
Once installed on the primary data volumes, the feature operates without interfering with the server’s primary workload. It ignores hot data (files that are active at the time of deduplication) until the files reach a given age in days. Skipping such files maintains the consistency of the active files and shortens the deduplication time.
The feature uses the following approach when processing files:
- Post-processing: new files are written directly to the NTFS volume, where they are evaluated on a regular schedule. By default, background processing checks file eligibility for deduplication every hour. This schedule is configurable.
- File age: a deduplication setting called MinimumFileAgeDays controls how long a file must stay unchanged before it is processed. The default number of days is 5. The administrator can set it to 0 to process all files regardless of age.
- File type and location exclusions: you can instruct the deduplication feature not to process specific file types. You can choose to ignore CAB files, which do not benefit from the process in any way, and any file type that is already heavily compressed, such as PNG files. There is also an option to direct the tool not to process a particular folder.
Any volume that is under deduplication behaves as a self-contained (atomic) unit. The volume can be backed up and moved to a different location. Moving it to another server means that everything on that volume is accessible at its new site. The only thing you need to reconfigure is the schedule, because schedules are controlled by the native Task Scheduler on each server. If the new server does not have the deduplication feature running, you can only access the files that have not been deduplicated.
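The selection rules above can be sketched as a small eligibility check. This is only an illustration in Python, not the feature's real implementation: the `DedupPolicy` class and `is_eligible` function are hypothetical names, the field `minimum_file_age_days` mirrors the MinimumFileAgeDays setting, and the 32 KB minimum size is taken from the limits described earlier in this article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from pathlib import PureWindowsPath

@dataclass(frozen=True)
class DedupPolicy:
    """Hypothetical container for the settings described in the text."""
    minimum_file_age_days: int = 5            # mirrors MinimumFileAgeDays
    minimum_file_size: int = 32 * 1024        # files under 32 KB are skipped
    excluded_extensions: frozenset = frozenset({".cab", ".png"})
    excluded_folders: tuple = ()

def is_eligible(path, size_bytes, last_modified, policy, now):
    """Return True if a file would be picked up by a post-processing pass."""
    file_path = PureWindowsPath(path)
    if size_bytes < policy.minimum_file_size:
        return False
    if file_path.suffix.lower() in policy.excluded_extensions:
        return False
    # A file is excluded if any excluded folder is one of its ancestors.
    if any(PureWindowsPath(folder) in file_path.parents
           for folder in policy.excluded_folders):
        return False
    # Hot data is skipped until it has been unchanged for long enough.
    return now - last_modified >= timedelta(days=policy.minimum_file_age_days)

now = datetime(2024, 1, 10)
policy = DedupPolicy(excluded_folders=(r"D:\Shares\Temp",))
print(is_eligible(r"D:\Shares\Docs\plan.docx", 100 * 1024,
                  datetime(2024, 1, 1), policy, now))
```

Setting `minimum_file_age_days` to 0 in this sketch makes every file old enough, which matches the behavior described for a MinimumFileAgeDays value of 0.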
Minimal Use of Resources
By default, the deduplication feature uses minimal resources on the primary server. If another process is active and a shortage of resources is likely, deduplication will surrender its resources to the active process and resume when enough resources are available.
How storage resources are utilized
- The hash index used by the chunk store consumes few resources and reduces read/write operations so that it can scale to large datasets and deliver high insert/lookup performance. The index footprint is extremely low, and the index uses a temporary partitioning technique.
- Deduplication verifies the amount of available space before it executes. If no storage space is available, it will keep trying at regular intervals. You can schedule and run any deduplication task during off-peak hours or during idle time.
The process segments files into chunks of variable size, for example between 32 KB and 128 KB, using an algorithm based on work by Microsoft Research and other developers. The segmentation splits the file into a sequence of chunks whose boundaries depend on the content of the file. A Rabin fingerprint, a scheme based on a sliding-window hash, helps to identify the chunk boundaries.
The average size of a segment is 64 KB. Segments are compressed and placed into a chunk store hidden in a folder inside the System Volume Information (SVI) folder. The normal file is replaced by a reparse point, which is a pointer to the map of all its data streams and is used to reassemble the file when it is requested.
Another benefit of deduplication is that the sub-file segmentation and indexing engine is shared with the BranchCache feature. This sharing is important because when a Windows Server is running and all the data segments are already indexed, they can be sent quickly over the network as needed, which saves a lot of network traffic within the office or the branch.
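The content-defined segmentation described above can be illustrated with a scaled-down Python sketch. It is not Microsoft's algorithm: a simple polynomial rolling hash stands in for a true Rabin fingerprint, chunk sizes are in bytes rather than kilobytes (32 to 128, averaging roughly 64), and the constants and the function name `split_into_chunks` are assumptions chosen for illustration.

```python
import hashlib
import random

WINDOW = 16                      # bytes covered by the sliding window
BASE = 257                       # polynomial base of the rolling hash
MOD = (1 << 31) - 1              # modulus keeping the hash bounded
MASK = (1 << 6) - 1              # boundary when the low 6 bits are zero
MIN_CHUNK, MAX_CHUNK = 32, 128   # scaled-down stand-ins for 32 KB and 128 KB

def split_into_chunks(data):
    """Split bytes into variable-size chunks at content-defined boundaries."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    chunks = []
    rolling = 0
    start = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Slide the window: remove the contribution of the oldest byte.
            rolling = (rolling - data[i - WINDOW] * top) % MOD
        rolling = (rolling * BASE + byte) % MOD
        length = i - start + 1
        at_boundary = length >= MIN_CHUNK and (rolling & MASK) == 0
        if at_boundary or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])   # trailing partial chunk
    return chunks

rng = random.Random(0)
data = bytes(rng.randrange(256) for _ in range(4096))
chunks = split_into_chunks(data)

# A toy chunk store: identical chunks are stored once, keyed by their digest.
chunk_store = {hashlib.sha256(c).hexdigest(): c for c in chunks}
file_map = [hashlib.sha256(c).hexdigest() for c in chunks]  # plays the reparse-point role
print(len(chunks), "chunks,", len(chunk_store), "unique")
```

Because a boundary depends only on the bytes inside the window, identical regions of two files tend to produce identical chunks even when the regions sit at different offsets, which is what lets the chunk store keep a single copy.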
How Does Deduplication Affect Data Access?
Deduplication stores files on disk as segments that may be spread across the volume, which increases seek time. As each file is processed, the filter driver works to keep sequences of segments together on disk so that their placement is not random. Deduplication also keeps a cache of file segments to avoid repeated reads and to speed up file access. When multiple users access the same resource simultaneously, that access pattern speeds up access for each user.
- Not much difference is noted when opening an Office document; users cannot tell whether the feature is running or not.
- When copying a single large file, the end-to-end copy time can be about 1.5 times what it takes for a non-deduplicated file.
- During the simultaneous transfer of multiple large files, caching can make the transfer up to 30% faster.
- When the file-server load simulator (File Server Capacity Tool) is used to test multiple file access scenarios, you will notice a reduction of about 10% in the number of users supported.
- Data optimization runs at 20-35 MB/sec per job, which translates to roughly 100 GB/hour for a single 2 TB volume using one CPU core and 1 GB of RAM. This indicates that multiple volumes can be processed in parallel if additional CPU, disk, and memory resources are available.
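The throughput figures in the last item can be checked with a few lines of arithmetic. The Python sketch below assumes binary units (1 GB = 1024 MB, 1 TB = 1024 GB); the function name `gb_per_hour` is illustrative.

```python
def gb_per_hour(mb_per_sec):
    """Convert a rate in MB/sec to GB/hour, assuming 1 GB = 1024 MB."""
    return mb_per_sec * 3600 / 1024

low = gb_per_hour(20)    # slow end of the quoted range
high = gb_per_hour(35)   # fast end of the quoted range
print(f"{low:.1f} to {high:.1f} GB/hour")

# Time for one job to make a full pass over a 2 TB volume at 100 GB/hour.
hours_for_2tb = 2 * 1024 / 100
print(f"{hours_for_2tb:.1f} hours for 2 TB")
```

The quoted 100 GB/hour sits inside the range the per-job rate implies, and a full first pass over a 2 TB volume at that rate takes most of a day, which is one reason scheduling matters.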
Reliability and Risk Preparedness
Even when you configure the server environment using RAID, there is a risk of data corruption and loss caused by disk malfunctions, controller errors, and firmware bugs. Other environmental risks to stored data include radiation and disk vibration. Deduplication raises the impact of disk corruption, especially when one file segment that is referenced by thousands of files is located in a bad sector. Such a scenario could mean losing thousands of user files.
Windows Server Backup supports optimized backups, and a selective file restore API enables backup applications to pull individual files out of an optimized backup.
Detect and Report
When the deduplication filter comes across a corrupted file or section of the disk, a quick checksum validation is done on the data and metadata. This validation helps the process to recognize data corruption during file access, which reduces accumulated failures.
An extra copy of critical metadata is created, and any file segment with more than 100 references is treated as a popular chunk and is also kept in an extra copy.
The deduplication process and host volumes are inspected on a weekly basis: a scrubbing job reviews any logged errors and tries to fix them from alternative copies. An optional deep scrubbing job walks through the whole data set, identifying errors and fixing them if possible.
When the disks are configured to mirror each other, deduplication will look for a good copy on the other side of the mirror and use it as a replacement. If there is no alternative copy, the data has to be recovered from an existing backup. Scanning for and fixing errors is a continuous process once deduplication is active.
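The detect-and-repair behavior described in this section can be sketched with a toy chunk store. This Python illustration is not the Windows implementation: the `ChunkStore` class, its method names, and the use of SHA-256 as the checksum are assumptions, and only the threshold of 100 references comes from the text above.

```python
import hashlib
from collections import Counter

POPULAR_THRESHOLD = 100   # chunks referenced more often get an extra copy

class ChunkStore:
    """Hypothetical reference-counted chunk store with checksum validation."""

    def __init__(self):
        self.chunks = {}        # digest -> primary copy
        self.replicas = {}      # digest -> extra copy of popular chunks
        self.refs = Counter()   # digest -> number of references

    def add_reference(self, data):
        digest = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(digest, data)
        self.refs[digest] += 1
        if self.refs[digest] > POPULAR_THRESHOLD:
            self.replicas.setdefault(digest, data)
        return digest

    def read(self, digest):
        data = self.chunks[digest]
        if hashlib.sha256(data).hexdigest() == digest:
            return data                      # checksum matches
        if digest not in self.replicas:
            raise IOError("chunk is corrupt; restore from backup")
        data = self.replicas[digest]
        self.chunks[digest] = data           # repair the primary copy
        return data

store = ChunkStore()
for _ in range(POPULAR_THRESHOLD + 1):
    popular = store.add_reference(b"shared template header")

store.chunks[popular] = b"bit rot"           # simulate a bad sector
print(store.read(popular))
```

The read detects the mismatch through the checksum, serves the data from the extra copy, and repairs the primary copy. A chunk without an extra copy raises an error instead, which corresponds to falling back to a backup.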
Verdict on Deduplication
Some of the features described above do not work in all Windows Server 2012 editions and may be subject to limitations. Deduplication was built for NTFS data volumes. It therefore cannot be used on boot volumes or system drives, and it cannot be used with Cluster Shared Volumes (CSV). Live virtual machines (VMs) and active SQL databases are not supported by deduplication.
Deduplication Data Evaluation Tool
To give a better understanding of the deduplication environment, Microsoft created a portable evaluation tool, DDPEval.exe, that installs into the \Windows\System32\ directory. The tool can be run on Windows 7 and later Windows operating systems. It supports local drives as well as mapped and unmapped remote shares. If you are using a Windows NAS or an EMC/NetApp NAS, you can run it against a remote share.
The Windows Server native deduplication feature is now becoming a popular feature. It mirrors the needs of a typical server administrator working in production deployments. However, planning for deduplication before implementation is necessary because of the varying situations in which its use may not be applicable.