Data Deduplication is a Microsoft Windows Server feature, first introduced in Windows Server 2012.
Simply defined, data deduplication is the elimination of redundant data in a data set, storing only one copy of identical data. It works by identifying duplicate byte patterns through data analysis, removing the duplicates, and replacing them with a reference that points to the single stored copy.
In 2017, according to IBM, the world was creating about 2.5 quintillion (10^18) bytes of data per day. That fact shows that today's servers handle huge amounts of data in every aspect of human life.
A certain percentage of that is duplicated data in one form or another, and such data is nothing more than unnecessary load on servers.
Microsoft recognized these trends back in 2012 when Data Deduplication was introduced, and kept developing the feature, so in Windows Server 2016 Data Deduplication is both more advanced and more important.
But let's start with 2012 and understand the feature in its basics.
Data Deduplication Characteristics:
Usage – Data Deduplication is very easy to use. It can be enabled on a data volume in one click, with no delays or impact on system functionality. Simply put, if a user requests a file, they get it as usual, whether or not that file has been processed by deduplication.
Deduplication does not target all files. For example, files smaller than 32 KB, files encrypted with EFS, and files that have extended attributes are not processed.
If a file has alternate data streams, only the primary stream is deduplicated; the alternates are not.
Deduplication can be used on primary data volumes without affecting files that are actively being written to, until those files reach a certain age. This preserves performance on active files while saving space on the rest. Files are sorted into categories by criteria; those categorized as "in policy" are deduplicated, while the others are not.
Deduplication does not change the write path of new files. New files are written directly to NTFS and evaluated later by a background monitoring process.
When files reach a certain age, the MinimumFileAgeDays setting (previously configured by the administrator) determines whether they are eligible for deduplication. The default is 5 days, but it can be changed; a minimum of 0 days processes files regardless of age.
Some file types can be excluded, such as already-compressed formats like PNG or CAB, if it is decided that the system would not benefit much from processing them.
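Both the age threshold and the file-type exclusions can be set from PowerShell with the Set-DedupVolume cmdlet. A short illustrative example (E: is an example volume letter):

```powershell
# Process files regardless of age, and skip already-compressed formats
Set-DedupVolume -Volume "E:" -MinimumFileAgeDays 0 -ExcludeFileType png,cab
```

Setting MinimumFileAgeDays to 0 makes every file "in policy" immediately, which is convenient for testing but not always desirable on busy primary volumes.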
Backing up and restoring to another server does not cause problems for deduplication. All settings are maintained on the volume itself and move with it during relocation, except for schedule settings, which are not stored on the volume. If the volume is moved to a server that does not use deduplication, users will not be able to access the files affected by the process.
The feature is designed to follow the server workload and adapt to system resources. Servers usually have roles to fill, and storage, as the administrator sees it, exists only to hold data in the background, so deduplication adapts to that philosophy. If resources are available for deduplication, the process runs; if not, it stands by and waits for resources to become available.
The feature is designed to use few resources and reduce input/output operations per second (IOPS), so it can scale to large data sets and improve performance, with an index footprint of only 6 bytes of RAM per chunk (average chunk size 64 KB) and temporary partitioning.
– As mentioned, deduplication works on a "chunk" principle: an algorithm splits files into pieces of about 64 KB, compresses them, and stores them in a hidden folder. When a user requests a file, it is reassembled from those chunks and served to the user.
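The chunk store on a volume can be inspected from PowerShell. As an illustrative sketch, Get-DedupMetadata reports, among other things, how many chunks exist and their average size (E: is an example volume):

```powershell
# Inspect the hidden chunk store on volume E:
Get-DedupMetadata -Volume "E:" |
    Select-Object Volume, DataChunkCount, DataChunkAverageSize
```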
– BranchCache™: deduplication shares its sub-file chunking and indexing engine with this feature. When needed, already-indexed chunks can be sent over the WAN to a branch office, saving a lot of time and data.
Is there fragmentation, and what about data access?
The question that comes up when reading about deduplication is fragmentation: does spreading chunks around the hard drive fragment it?
The answer is no. Deduplication's filter driver keeps sequences of unique chunks together on disk, so chunks are not scattered randomly. Deduplication also has its own cache, so when multiple requests for the same file arrive in an organization, the access pattern speeds things up instead of starting multiple file "reassembly" processes, and users see the same response time as for a file without deduplication. When copying a single large file, end-to-end copy times can be about 1.5 times those on a non-deduplicated volume, but the real quality and savings show up when copying multiple large files at the same time: thanks to the cache, copy times can improve by up to an impressive 30%.
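The actual savings on a volume can be checked at any time with Get-DedupStatus; a minimal sketch (E: is an example volume):

```powershell
# Show how much space deduplication has reclaimed on E:
Get-DedupStatus -Volume "E:" |
    Select-Object Volume, FreeSpace, SavedSpace, OptimizedFilesCount
```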
Deduplication Risks and Solutions
Of course, like any other feature, this way of working carries some risks.
Any type of data corruption poses serious risks, but there are solutions too.
Errors caused by disk anomalies, controller errors, firmware bugs, or environmental factors such as radiation or disk vibration can corrupt chunks, and because one chunk may back many files, a corrupted chunk can cause major problems such as the loss of multiple files. But with good administrative practice, the use of backup tools, on-time corruption detection, redundant copies, and regular checkups, the risk of corrupted data and losses can be minimized.
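On-time corruption detection is handled by the built-in scrubbing job, which validates the chunk store and attempts repairs from redundant copies. It can also be started manually; an illustrative example (E: is an example volume):

```powershell
# Run an integrity scrub of the chunk store on E:
Start-DedupJob -Volume "E:" -Type Scrubbing
# Add -Full to walk the entire chunk store instead of only logged corruptions
```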
Deduplication in Windows Server 2016
As with all other features, data deduplication went through upgrades and gained new capabilities in the latest edition of Windows Server.
We will describe the most important ones and show how to enable and configure the feature in a Windows Server 2016 environment.
Multithreading is flagged as the most important change in 2016 compared with Windows Server 2012 R2. On Server 2012 R2, deduplication operated in single-threaded mode, using one processor core per volume. Microsoft saw this as a performance limit, and in 2016 introduced a multi-threaded mode: each volume now uses multiple threads and multiple I/O queues. This changed the size limits per file and per volume: in Server 2012 R2 the maximum volume size was 10 TB, while the 2016 edition supports 64 TB volumes and 1 TB files, which represents a huge breakthrough.
In the first edition of the deduplication feature (Windows Server 2012), there was a single type of deduplication, created only for standard file servers, with no support for continuously running VMs.
Windows Server 2012 R2 started using the Volume Shadow Copy Service (VSS): deduplication optimizes data through optimization jobs, while VSS captures and copies stable volume images for backup on running server systems. With VSS, Microsoft introduced virtual machine deduplication support as a separate deduplication type in 2012 R2.
Windows Server 2016 went one step further and introduced another type of deduplication, designed specifically for virtualized backup servers (such as DPM).
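These three deduplication types map directly to the -UsageType parameter of the Enable-DedupVolume cmdlet; the volume letters below are examples:

```powershell
Enable-DedupVolume -Volume "E:" -UsageType Default  # general-purpose file server
Enable-DedupVolume -Volume "F:" -UsageType HyperV   # volumes hosting running Hyper-V VMs
Enable-DedupVolume -Volume "G:" -UsageType Backup   # virtualized backup servers such as DPM
```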
Nano server support
Nano Server is a minimal-footprint, fully operational Windows Server 2016 installation, similar to the Server Core editions but smaller, and without GUI support, ideal for purpose-built, cloud-based apps, infrastructure services, or virtual clusters.
Windows Server 2016 fully supports the deduplication feature on this type of server.
Cluster OS Rolling Upgrade support
Cluster OS Rolling Upgrade is a Windows Server 2016 feature that allows upgrading cluster nodes from Windows Server 2012 R2 to Windows Server 2016 without stopping Hyper-V, using the so-called "mixed mode" operation of the cluster. From the deduplication angle, that means the same data can be located on nodes with different versions of deduplication. Windows Server 2016 supports mixed mode and provides access to deduplicated data while the cluster upgrade is ongoing.
Installation and Setup of Data Deduplication on Windows Server 2016
In this section, we give an overview of best-practice installation and setup of Data Deduplication on a Windows Server 2016 system.
As usual, everything starts with a role.
In Server Manager, choose Data Deduplication (located in the drop-down menu of File and Storage Services), or use the following PowerShell cmdlet (as administrator):
Install-WindowsFeature -Name FS-Data-Deduplication
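A quick check that the feature was installed can be done with Get-WindowsFeature:

```powershell
# The Install State column should read "Installed"
Get-WindowsFeature -Name FS-Data-Deduplication
```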
Enabling and Configuring Data Deduplication on Windows Server 2016
On GUI systems, deduplication can be enabled from Server Manager – File and Storage Services – Volumes; select a volume, right-click it, and choose Configure Data Deduplication.
After selecting the desired type of deduplication, it is possible to specify file types or folders that will be excluded from the process.
Next, set up the schedule by clicking the Set Deduplication Schedule button, which allows the selection of days, weeks, start time, and duration.
From a PowerShell terminal, deduplication can be enabled with the following command (E: is an example volume letter):
Enable-DedupVolume -Name E: -UsageType HyperV
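An optimization job can also be started immediately, without waiting for the schedule; an illustrative example (again with E: as the example volume):

```powershell
# Kick off deduplication of in-policy files right away
Start-DedupJob -Volume "E:" -Type Optimization
```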
Running and queued jobs can be listed with the command:
Get-DedupJob
And a job can be scheduled with the following command (example: a garbage collection job):
New-DedupSchedule -Name "OffHoursGC" -Type GarbageCollection -Start 08:00 -DurationHours 5 -Days Sunday -Priority Normal
These are only the basics of the deduplication PowerShell commands; there are many more deduplication-specific cmdlets, documented in Microsoft's Data Deduplication PowerShell reference.
Do you want to avoid data loss and unwanted data access?
Protect yourself and your clients against security leaks and get your free trial of the easiest and fastest NTFS Permission Reporter now!