Windows Server 2012 – Storage Spaces and Data Deduplication Hands-on Review Part 7

This is the continuation of my Storage Spaces and Data Deduplication review. Here’s an index of test cases on this part and links to other parts:

  • Part 1: Introduction and Lab Environment Preparation
  • Part 2
    • Physical Disk Pull
    • Introduce the Pulled Disk Back into the System
  • Part 3
    • Extend Thinly Provisioned Virtual Disk and Observe Behavior at Limits
    • Bonus: Detecting and Replacing Physical Disk Failures
  • Part 4
    • Removing a Disk from the Storage Pool
  • Part 5
    • Reclaim Unused Space on Thin Provisioned Disks
    • Bonus: Defragmentation Attempt and Observations
  • Part 6
    • Understanding Hot-Spare Behavior
  • Part 7 (You’re here) 
    • Evaluating and Enabling Data Deduplication

Evaluating and Enabling Data Deduplication

Let me start by showing you what I have in terms of virtual disks and volume(s) within:

[Screenshot: Server Manager showing my virtual disks and the volumes within them]

For the moment, my virtual disks are not deduplication-enabled, as seen here:

[Screenshot: Server Manager showing deduplication disabled]

Before blindly enabling deduplication, we might want to assess the contents of the NTFS volumes that live inside these virtual disks. Notice how I worded that, by the way: Server Manager presents the setting alongside the virtual disks here, yet there can be multiple volumes (NTFS, etc.) within each one. The first lesson is to keep the layers straight: deduplication is configured and invoked per volume, while the savings it achieves happen below the file level within each volume.

Now, here’s how we can assess a volume for deduplication. The utility is called DDPEval.exe and, depending on the size of the volume being evaluated, it may take an extended amount of time to run. Below is the Resource Monitor view while DDPEval is working:

[Screenshot: Resource Monitor while DDPEval.exe is running]
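
The invocation itself is simple. A minimal sketch, assuming DDPEval.exe is reachable (it should be under \Windows\System32 once the deduplication feature is installed) and E: is the target volume; my exact command line is in the screenshots, so treat this as illustrative:

    # Evaluate potential deduplication savings for volume E:
    DDPEval.exe E:\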

When done, DDPEval produces an output like this:

[Screenshot: DDPEval.exe output]

Notice that it reports expected space savings of 99%. Why is that? Let’s take a look at the files on this volume E:

[Screenshot: the files on volume E:]

That’s all of it in my test scenario. So how can 99% of this be saved?… Well, that’s the beauty of the deduplication feature in Windows Server 2012: it works below the file level, so even repetitions inside the same file can be optimized. My files above happen to contain nothing but “a” characters.
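
For reference, my test content can be reconstructed with something along these lines (the file names and sizes here are hypothetical; the point is that every file is nothing but repeated “a” characters):

    # Hypothetical reconstruction of the test data on E:
    # every file is the letter "a" repeated, so chunks repeat both
    # within each file and across files
    $payload = "a" * 1MB
    1..10 | ForEach-Object { Set-Content -Path ("E:\testfile{0}.txt" -f $_) -Value $payload }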

Anyway, now that we have done our assessment, let’s proceed to enable deduplication and see what happens. There is a way to do this in the graphical interface of Server Manager, here:

[Screenshot: the deduplication option in Server Manager]

The wizard looks like this. The “Enable” box was unchecked, so I checked it:

[Screenshot: the deduplication settings page with the Enable box checked]

At this point I’ll proceed with the default values. You see the “Set Deduplication Schedule” button above? Here is what comes up when you click it:

[Screenshot: the Set Deduplication Schedule dialog]

There are two separate schedules offered here. This helps with scenarios like running a more aggressive optimization over the weekend vs. a shortened one on regular weekdays. For the moment, I will only enable background optimization, because I can control when and how idle the volume is, and I’ll be able to report when it actually kicks in on its own, if ever.

So I clicked OK on the above dialog.

When I did that, the following service started:

[Screenshot: the Data Deduplication service running]
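
You can verify the same from PowerShell; I believe the service short name is ddpsvc:

    # Confirm the Data Deduplication service is running
    # (short name assumed to be ddpsvc)
    Get-Service -Name ddpsvc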

I can check the deduplication status of any volume where it is enabled using the Get-DedupStatus cmdlet, as in:

[Screenshot: Get-DedupStatus output]
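
In text form, that is simply the following (a sketch; the actual output columns are in the screenshot):

    # Status for all deduplication-enabled volumes
    Get-DedupStatus

    # Or scoped to a single volume
    Get-DedupStatus -Volume E: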

<I’ll wait 30 minutes to see if the idle-detection timer kicks in and does some optimization>

[Screenshot: Get-DedupStatus output after the 30-minute wait; no change]

As you can see, nothing happened on its own. I will inquire as to what specific algorithm is used to determine idle behavior and provide an update if I find anything relevant.

Meanwhile, let’s force it to optimize. Like this:

[Screenshot: starting a manual optimization job]
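
In text form, the manual kick looks roughly like this (my exact parameters are in the screenshot):

    # Queue a manual optimization job against E: and watch its progress
    Start-DedupJob -Volume E: -Type Optimization
    Get-DedupJob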

Hmm, that finished too quickly, and as you can see, nothing has actually happened. We must be operating below some minimum amount of duplication for the system to care… or maybe I’m using the wrong set of command parameters.

My D: volume is much larger, so let me enable deduplication on it. Since you have seen the graphical method above, I’ll show the PowerShell method this time:

[Screenshot: enabling deduplication on D: via PowerShell]
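
A minimal sketch of the PowerShell route, assuming the deduplication feature is already installed:

    # One time only: install the feature if it isn't present yet
    Add-WindowsFeature -Name FS-Data-Deduplication

    # Enable deduplication on the D: volume, then confirm
    Enable-DedupVolume -Volume D:
    Get-DedupVolume -Volume D: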

This time I decided to test whether actually scheduling an optimization would make a difference. So I went to the Properties of the volume, clicked the “Set deduplication schedule” button, and reached the following dialog.

[Screenshot: the schedule dialog opened from the volume’s properties]

What’s interesting here is that the dialog, although accessed from the properties of an individual volume, refers to the server name. From this we gather that the deduplication schedule operates at the level of the entire server: we can set two separate schedules for it to run, but each run targets all volumes where deduplication is enabled. Keep this in mind.
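
The schedules can also be inspected and created from PowerShell; here is a sketch (the schedule name is made up):

    # List the server-wide deduplication schedules
    Get-DedupSchedule

    # Add an optimization window for weekends (hypothetical example)
    New-DedupSchedule -Name "WeekendOptimization" -Type Optimization `
        -Days Saturday,Sunday -Start "08:00" -DurationHours 10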

[Screenshot: the configured schedule]

After configuring it through the UI, I ended up in the position shown above. It was about 5:06pm when I set it to run at 5:08pm.

Sure enough, at 5:08pm, the disk activity began. Get-DedupJob is now showing the following:

[Screenshot: Get-DedupJob output showing both jobs]

There are three things I want to point out in the above screenshot. First, as soon as the clock hit 5:08pm, both volume optimization jobs kicked in at the same time (in parallel) and changed to the “Running” state. This parallelism may be a good thing or a bad thing. I know that optimization can be targeted at individual volumes, but perhaps not through the schedule set in the graphical interface.

Second, after waiting just a short while (less than 2-3 minutes), the E: drive was done. Remember, the E: drive has 11GB of content, of which 8GB was showing as recoverable per DDPEval.exe. Despite DDPEval’s estimate, no actual bytes were recovered. This may be because we’re flying below the minimum thresholds that the deduplication engine is designed to work with. I will research and update this write-up later; see the sketch after the next paragraph for where I plan to start.

Third, volume D: has been going for a while. After about 4 minutes, it has saved 1.34GB and is continuing to run. Disk activity is heavy right now. I’ll wait until it makes more progress and then report. Pretty much everything on this disk is a duplicate, so I expect a good amount of data footprint reduction.
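
On the volume E: question above, one obvious place to start looking is the per-volume settings: deduplication skips files below a minimum size and, by default, files younger than a minimum age, which could matter for freshly created lab data. A sketch:

    # Inspect the threshold-related knobs on E:
    Get-DedupVolume -Volume E: |
        Format-List Volume, MinimumFileAgeDays, MinimumFileSize

    # Lab-only experiment: allow brand-new files to be optimized
    Set-DedupVolume -Volume E: -MinimumFileAgeDays 0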

<wait until optimization finishes>

[Screenshot: the finished optimization pass]

Above, you see the finished optimization pass. The level of space saving is indicative of nothing, because my test files were simply full of “a” characters. Regardless, we were able to exercise the end-to-end process. Let’s take a look at how Server Manager is showing things now:

[Screenshot: deduplication savings in the Volumes Overview of Server Manager]

Although there are multiple places where the deduplication savings are displayed, I am showing the Volumes Overview.

Let’s recap where we are and what we learnt:

  • We have established that deduplication operates below the file level and can achieve savings even when the repetitions are inside a single file. (I should check whether duplicates across different volumes of the same disk benefit from deduplication.)
  • Deduplication runs on a set schedule or as a background task. Simple experimentation showed that what I consider “idle” may not be enough for the system to kick things off. Until I learn exactly how background optimization gets triggered, I will use scheduled optimization.
  • The deduplication schedule applies to the service, and scheduled optimizations launch simultaneously against the entire server. This may just be the behavior of the graphical interface, though, because it appears possible to launch volume-specific optimizations individually.
  • Because this concept is important, let me re-state it: deduplication works below the file level (on chunks of file data, beneath whole files as NTFS presents them), it is enabled and invoked against individual volumes, and the out-of-the-box scheduled jobs run against the entire server.
  • There is a threshold, which I do not yet know, that must be crossed before an optimization pass actually does anything; our volume E: experiment above demonstrates this. The implication is that on very small volumes, even if DDPEval.exe reports space to be gained, it may not actually be reclaimable (the settings sketch earlier shows where I plan to start looking). If anyone knows the specific threshold, please drop me a note.

That’s it for the disk deduplication feature review.

Replies

  1. Hey, great article. Wondering if you looked into recovering disk allocation with data deduplication enabled? I’ve found an interesting scenario with thin provisioning and deduplication: it’s near impossible to trim the virtual disk allocation after deleting files. This means you have a scenario where the volume keeps growing in size, no matter what you do! I currently have 2.7TB stored as 3.6TB on disk!

    • Hi there,

      Yes, I hit this scenario and had to revert to fixed volumes. First, you should evaluate your situation using Optimize-Volume; see http://technet.microsoft.com/en-us/library/hh848675.aspx.
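
      Something along these lines, as a sketch (adjust the drive letter to your volume):

          # See what the volume reports before acting
          Optimize-Volume -DriveLetter D -Analyze -Verbose

          # Ask the volume to release unused space back to the
          # thinly provisioned disk
          Optimize-Volume -DriveLetter D -ReTrim -Verbose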

      There are reasons, such as unmovable files (related to the NTFS maps), that can prevent optimization from catching up. Therefore thin provisioning is not for all workloads, on any platform.

      At a high level, I don’t think dedupe has any role in your situation.

      Let me know how it goes.
