Windows Server 2012 – Storage Spaces and Data Deduplication Hands-on Review Part 7

This is the continuation of my Storage Spaces and Data Deduplication review. Here’s an index of the test cases in this part and links to the other parts:

  • Part 1: Introduction and Lab Environment Preparation
  • Part 2
    • Physical Disk Pull
    • Introduce the Pulled Disk Back into the System
  • Part 3
    • Extend Thinly Provisioned Virtual Disk and Observe Behavior at Limits
    • Bonus: Detecting and Replacing Physical Disk Failures
  • Part 4
    • Removing a Disk from the Storage Pool
  • Part 5
    • Reclaim Unused Space on Thin Provisioned Disks
    • Bonus: Defragmentation Attempt and Observations
  • Part 6
    • Understanding Hot-Spare Behavior
  • Part 7 (You’re here) 
    • Evaluating and Enabling Data Deduplication

Evaluating and Enabling Data Deduplication

Let me start by showing you what I have in terms of virtual disks and volume(s) within:

image

For the moment, deduplication is not enabled on my virtual disks, as seen in:

image

Before blindly enabling deduplication, we might want to assess the contents of the NTFS volumes that live inside these virtual disks. Did you notice how I worded that, by the way? Deduplication is enabled on a “virtual disk” basis, yet there could be multiple volumes (NTFS etc.) within each. The first lesson is that deduplication operates at the disk level, not at the volume level.

Now, here’s how we can assess a “volume” for deduplication. The utility is called “DDPEval.exe”. Depending on the size of the volume being evaluated, it may take an extended amount of time. Below, I’m also showing you the Resource Monitor view while DDPEval is working.

image

When done, DDPEval produces an output like this:

image

If you noticed, it is reporting that space savings will be 99%. But why is this? Let’s take a look at the files on this volume E:

image

That’s all of it in my test scenario. So how come 99% of this can be saved? Well, that’s the beauty of the deduplication feature in Windows Server 2012. It works below the file level, so even repetitions inside the same file can be optimized. My files above happen to contain nothing but “a” characters.
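
To make that 99% figure less mysterious, here is a minimal Python sketch of chunk-level deduplication. It is not how Windows Server 2012 implements it; the fixed 64KB chunk size, the `dedup_savings` helper and the two stand-in files are all assumptions made purely for illustration.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # assumed fixed chunk size, chosen only for this illustration


def dedup_savings(files, chunk_size=CHUNK_SIZE):
    """Return (logical_bytes, stored_bytes) if identical chunks are stored only once."""
    unique_chunks = {}  # chunk hash -> chunk length
    logical = 0
    for data in files:
        logical += len(data)
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            unique_chunks.setdefault(hashlib.sha256(chunk).hexdigest(), len(chunk))
    stored = sum(unique_chunks.values())
    return logical, stored


# Two hypothetical stand-in files that contain nothing but the letter "a".
files = [b"a" * (8 * 1024 * 1024), b"a" * (3 * 1024 * 1024)]
logical, stored = dedup_savings(files)
savings_pct = 100 * (logical - stored) / logical
print(f"logical={logical} stored={stored} savings={savings_pct:.1f}%")
```

Every chunk in both files is identical, so only one chunk needs to be kept and the estimated savings land above 99%, even though the repetition is inside each file rather than across files.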

Anyway, now that we have done our assessment, let’s proceed to enable it and see what happens. There is a way to do this in the graphical interface in Server Manager, here:

image

The wizard looks like the one below. The “Enable” box was unchecked, so I checked it:

image

At this point I’ll proceed with the default values. See the “Set Deduplication Schedule” button above? Here’s what comes up if you click it:

image

There are two separate schedules offered here. This helps with scenarios such as a more aggressive optimization window on weekends vs. a shorter one on regular weekdays. For the moment, I will only enable background optimization, because I can control when and how idle the volume is, and I’ll be able to report to you when it actually kicks in on its own, if ever.

So I clicked OK on the above dialog.

When I did that, the following service started:

image

I can look at the status of deduplication on any of my volumes where deduplication is enabled using the “Get-DedupStatus” command, as in:

image

<I’ll wait for 30 minutes to see if idle detection timer kick in and do some optimizations>

image

As you can see, nothing happened on its own. I will inquire as to what specific algorithm is used to determine idle behavior and provide an update if I find anything relevant.

Meanwhile, let’s force it to optimize. Like this:

image

Hmm... that finished too quickly, and as you can see, nothing has actually happened. We must be below the amount of duplication the system needs before it cares, or maybe I’m using the wrong set of command parameters.

My D: volume is much larger, so let me enable deduplication on it. Since you have seen the graphical method above, I’ll show the PowerShell method this time:

image

This time I decided to test whether actually scheduling an optimization would make a difference. So I went to the “Properties” of the volume, clicked the “Set deduplication schedule” button and reached the following dialog.

image

What’s interesting here is that the dialog, although it was opened from the properties of an individual volume, refers to the server name. From this we gather that the deduplication schedule operates at the level of the entire server. On one hand we can set two separate schedules for it to run; on the other hand, it will run against all volumes where deduplication is enabled. Keep this in mind.

image

After configuring it through the UI, I got to the above position. It was about 5:06pm when I configured it to run at 5:08pm.

Sure enough, at 5:08pm, the disk activity began. Get-DedupJob is now showing the following:

image

There are three things I want to explain in the above screenshot. First, as soon as the clock hit 5:08pm, both volume optimization processes kicked in at the same time (in parallel) and changed to the “Running” state. This parallelism may be a good thing or a bad thing. I know that optimization activities can be targeted at individual volumes, but perhaps not through the schedule set in the graphical interface.

Second, after a short wait (less than 2-3 minutes), the E: drive was done. Remember, the E: drive has 11GB of content, of which 8GB was showing as recoverable per DDPEval.exe. Despite DDPEval’s estimate, no actual bytes were recovered. This may be because we’re below the minimum thresholds that the deduplication engine is designed to work with. I will research this and update the write-up later.

Third, the optimization of volume D: has been going on for a while. After about 4 minutes, it had saved “1.34GB” and was still running. Disk activity is heavy right now. I’ll wait until it makes more progress and then report back. Pretty much everything on this disk is a duplicate, so I expect a good amount of data footprint reduction.

<wait until optimization finishes>

image

You see above the finished optimization pass. The level of space savings is not indicative of anything, because my test files were simply full of “a” characters. Regardless, we were able to experience the end-to-end process. Let’s take a look at how Server Manager is showing things now:

image

Although there are multiple places where the deduplication savings are displayed, I am showing the Volumes Overview.

Let’s recap where we are and what we learnt:

  • We have established that disk deduplication operates below the file level, and can achieve savings even if the repetitions are inside the same file. (I should check whether duplication across different volumes of the same disk benefits from deduplication.)
  • Deduplication runs on a set schedule or as a background task. Simple experimentation showed that what I consider “idle” may not be idle enough for the system to kick things off. Until I learn exactly how background optimization kicks in, I will use scheduled optimization.
  • The deduplication schedule applies to the service, and scheduled optimizations launch simultaneously against the entire server. This may just be the behavior of the graphical interface, though, because it appears to be possible to launch volume-specific optimizations individually.
  • Because this concept is important, let me restate what I just said. Disk deduplication operates at the disk layer (below a file system like NTFS), but it is invoked against individual volumes, and the out-of-the-box scheduled jobs run against the entire server.
  • There is a threshold, which I do not know, that must be crossed before an optimization pass actually does anything. Our volume E: experiment above demonstrates this. The implication is that, on very small volumes, even if DDPEval.exe reports space to be gained, it may not be possible to gain it. If anyone knows the specific threshold, please drop me a note.

That’s it for the disk deduplication feature review.

Posted in Computers and Internet | 1 Comment

Lumia 900 under microscope after 6 months of use

Some of you know I use a cyan Lumia 900 Windows Phone, and that I made a nightstand dock for it using Lego. I use my phone pretty heavily and I do not like to use a protective case for it. I never used a case for any phone, and the Lumia 900 is no exception. A few days ago I wanted to look at the phone under a microscope and see what those scratches look like. I wanted to share with you the pictures I took through the microscope. Note that my gear for taking photos through a microscope is not top-notch. Apologies in advance for the lack of proper depth of field, or even focus for that matter, but I think you’ll get the idea.

Overall I don’t consider my phone scratched much (despite what you might think after looking at the pictures :) ). The cyan casing itself is holding up really nicely. I’ll be showing you the only dent it has from falling.

So first picture is the corner where the headphone socket is. Phone dropped from shoulder height onto asphalt while exiting a car. Keep reading after…


Zooming into the “crater” you can see the tiny pieces of dirt that were able to penetrate the cyan casing. The case is cyan through and through alright; but dent being the dent and impact being to asphalt, it did collect some particles. Entire crater is smaller than the top of a needle, and this “dirt” is barely noticeable.


Ok… moving onto camera lens / below. Keep in mind, the inner glass (not the black-looking-but-actually-silver lining between cyan case and glass) is actually positioned deep. When touched under microscope, I can see it moving around with the push of my finger independent of the lining.

By holding the phone in my hand, I don’t see much dirt or scratches on the glass. However, it’s a different story under the microscope. At this point I also started to believe that every one of those really close-up phone/gear photos we see in product marketing materials is rendered. There are just way too many small things floating around to keep them perfectly clean. The image below was taken after a good wipe. I am speculating that you’re looking at remains of my dead skin around the glass. At best they are little dust particles from the wipe I used. The lens element is positioned deeper than the silver lining around it, so it’s hard to reach the edges of the glass where you see little dust particles.

As for the silver lining around the glass element: it does come into contact with surfaces when you put the phone down face-up, and naturally takes most of the scratching on the back. Keep reading.


To give you another perspective, here is a photo of the “T” in the word Tessar 2.2/2.8. It’s also worth noting that the microscope’s own light is coming in at an angle and is quite powerful. The phone must easily have been put down 3000+ times over the course of 6-7 months. It’s always with me, always placed face-up on a table somewhere. Sometimes it’s concrete, sometimes a wooden desk. I rarely keep it in my pocket while sitting.


I’ll let you guess what these below are. Mind you, depth of field in photos like this changes everything. There is a protective cover with dust particles on it. I decided to focus on what’s underneath. This also tells you how unimportant those dust particles in front of the camera lens above are. The camera will be focusing on objects far away, so a few dust particles will only change the amount of light coming in by a certain percentage. That’s about all the impact they will have.


You know there is a protective edge around the display of the Lumia 900, so you can safely put the phone face-down. On a flat surface the display doesn’t touch down, thanks to this feature. What you’re seeing below is me holding the phone vertically under the microscope, focusing on the edge. In the background (supposed to be white) you are seeing the microscope’s white plate. Horizontally, the upper half of the black line is actually the mirror image of the edge reflecting off the Lumia’s display. I wanted to include this to show you the texture of that barely recognizable edging.


Next one is classic – but I wanted to include it anyway. You’re looking at a portion of letter “O” of Outlook on the main display of Lumia 900. Exposure time here was something like 20 seconds at ISO 100 with no external light.


That’s it. I like the material Nokia used in the Lumia line. It is holding up nicely: despite repeated falls and me not using any additional case, it scratched only in extreme situations. I would say that the metal element on the back needs better protection. It perhaps shouldn’t have been the lowest point of the camera when the phone is placed face-up. Looking forward to the 920 and 820 models soon.


Windows Server 2012 – Storage Spaces and Data Deduplication Hands-on Review Part 6

This is the continuation and Part 6 of my Storage Spaces and Data Deduplication review. Here’s an index of the test cases in this part and links to the other parts:

  • Part 1: Introduction and Lab Environment Preparation
  • Part 2
    • Physical Disk Pull
    • Introduce the Pulled Disk Back into the System
  • Part 3
    • Extend Thinly Provisioned Virtual Disk and Observe Behavior at Limits
    • Bonus: Detecting and Replacing Physical Disk Failures
  • Part 4
    • Removing a Disk from the Storage Pool
  • Part 5
    • Reclaim Unused Space on Thin Provisioned Disks
    • Bonus: Defragmentation Attempt and Observations
  • Part 6 (You’re here)
    • Understanding Hot-Spare Behavior
  • Part 7
    • Evaluating and Enabling Data Deduplication

 

Understanding Hot-Spare Behavior

Let’s take a look at the hot-spare feature in Storage Spaces. Most storage administrators have experience with array controllers and SAN/NAS devices, and are accustomed to a certain behavior from hot-spares. I’m curious how similar the hot-spare functionality in Windows Server 2012 is to a traditional hot-spare. For example, will it automatically become active? What will happen to the disk that it replaced? Keep reading as I experiment with this.

Here’s where we were last time in Part 5:


I’m going to add a fifth disk as a hot-spare. Here’s how it’s done:



…adding the disk takes just a second. Then you see the hot-spare disk like below:


Before we go any further, keep in mind this very important terminology distinction. The disks are added to the Storage Pool object, not to “virtual disks”. You carve out virtual disks out of storage pools. In my example, there is only one virtual disk. However, to continue my tests, I will create another virtual disk – just to ensure you fully understand disks are part of the pool, not the virtual disk. In other words, spare could serve any of the virtual disks within. They are not assigned to individual virtual disks.

For that reason, here’s our TestVirtualDisk2, also thinly provisioned and parity protected. I also created an NTFS volume within it and loaded some data on it. Here’s how things look now. Look at the Allocated column on each disk as well. This is a great example of how we can over-subscribe the underlying storage.


Something else worth noting. In the pool, I have 4 disks with the usage attribute set to “Automatic”. I copied just one 8GB file to the E: volume (hosted on the TestVirtualDisk2 that I just created). It ended up getting distributed across all disks. Through experimentation in earlier parts of this blog series, we had seen that the pool distributes data among multiple disks, which is surely expected behavior. However, we didn’t know whether it was spreading the data over as many disks as it had at the time, vs. just 3. It is now confirmed that all disks are being utilized, which is very good and consistent with other storage solutions out there.


Ok, the time has come to test how this hot-spare works. I will go ahead and remove one of the disks by physically pulling it out. As a reminder, neither my disks nor my eSATA enclosure support hot-swap; therefore this is a truly fatal operation that I normally wouldn’t do. That said, the purpose is to simulate the loss of a disk, so it’s real enough.

One of the difficulties here is matching physical disks to the disks you see in the management console. In a large disk set, this could be a seriously challenging task where you really cannot afford to make a mistake. For this reason, the Storage Spaces management console gives you a “Toggle Disk Light” option when you right-click on any physical disk. However, your controller and disks need to support that feature. Sadly, my test gear doesn’t support it:


At any rate… I’m proceeding to just pull the disk.

Action: Physically remove PhysicalDisk5 from the eSATA enclosure.

After I removed the disk and refreshed the management console UI, here’s what I saw.


Normally, hot-spare means “hot” spare, so it should have kicked in. But as far as I can see, it has not. Data is still available; however, each virtual disk is in a degraded state (because both of them had data on Disk5). None of the physical disks are doing anything right now. They are happily idling away.

I would have expected the hot-spare to kick in and the system to bring everything to a healthy state on its own. But this is not happening.

After some research, here’s what I found. Let’s take a look at the properties of my storage pool. Pay special attention to current “degraded” status and the default setting of “Auto” on the “RetireMissingPhysicalDisks” property.


Here’s the behavior of RetireMissingPhysicalDisks under various settings:

  • Enabled: If just a disk is missing, meaning its enclosure is still present, then treat the missing disk as failed.
  • Disabled: When a disk is missing, wait for either the disk to reconnect or for admin action.
  • Auto: If the pool has a hot spare, then follow the Enabled logic. Otherwise, follow the Disabled logic.

Since we have “hot-spare” in the pool, it should behave as if it’s set to “Enabled”. But it’s not doing that.
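
The three settings listed above boil down to a small decision rule. Here is a minimal Python sketch of it; the function name `treat_missing_disk_as_failed` is hypothetical, and the sketch assumes the disk’s enclosure is still present. It only models the documented rule, not what the pool actually did in my test.

```python
def treat_missing_disk_as_failed(setting, pool_has_hot_spare):
    """Model the documented RetireMissingPhysicalDisks rule (enclosure assumed present)."""
    setting = setting.lower()
    if setting == "enabled":
        return True                 # missing disk is treated as failed
    if setting == "disabled":
        return False                # wait for the disk to reconnect or for admin action
    if setting == "auto":
        return pool_has_hot_spare   # Enabled logic only when a hot spare exists
    raise ValueError(f"unknown setting: {setting!r}")


for setting in ("Enabled", "Disabled", "Auto"):
    for spare in (True, False):
        print(setting, spare, treat_missing_disk_as_failed(setting, spare))
```

With “Auto” and a hot spare in the pool, the rule says the missing disk should be treated as failed, which is exactly the behavior I was expecting to see.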

More research and more findings:

  • Storage Spaces will wait for 5 minutes before taking any action on the hot-spare disk. In my case, we’re past 5 minutes already and nothing is happening.
  • Storage Spaces will not do anything unless there is a write failure. Who knows maybe this is why…

Let me attempt to save a file onto one of those disks and see what happens:

Action: Copy some 4GB worth of files to TestVolume2 (E:)

Nope. Nothing has been written to the hot-spare disk. The usage designation of the disk hasn’t changed immediately either. But let me see if the 5-minute count starts after I made a write attempt… So I’ll wait a little.

<5 mins later>

Sure enough, the physical disk LEDs have gone crazy. Something is finally happening; let’s check the Usage designations of the disks. Remember, PhysicalDisk4 was set as hot-spare (see the earlier screenshots above).


Finally, Windows decided to change the hot-spare disk to Automatic, and apparently initiated a repair. But there is something very interesting going on here. The repair activity was only for the virtual disk that received my write request (TestVirtualDisk2).

After a short wait, disk activity has ended. Everything is seemingly idle, and look at where we are:


Apparently it repaired only the virtual disk where I had a write operation.

If what I’m observing is true, a read-only NTFS volume (think of a file share for distribution of deployment packages) might not be able to benefit from the hot-spare and could be vulnerable to the loss of another disk. The possibility that the repair does not happen automatically is concerning. So I’ll attempt to write something to drive D: and see if auto-repair kicks in then.

Action: Copy a tiny file to drive D (which sits on TestVirtualDisk1)

Time: 4:48pm

<nothing is happening for now. Will wait>

Time: 4:52pm

<nothing is happening for now. Will wait some more>

Time: 4:55pm

Nothing happened.

So even if I write something to the drive, auto-repair is not kicking in. This is strange. Maybe the size of the file matters for some reason. Let me copy a large file.

Action: Copy 8GB file to drive D:

After the file copy operation finished, disk activity is continuing; I’m assuming some sort of repair is in progress. Strange that a small file write didn’t trigger this, whereas a big one seemingly did. Here’s how things look:


More importantly, the allocation of TestVirtualDisk1:


A couple of things to note here. Physical disk activity is very heavy right now. Health status, as can be seen above, is still “Warning”. Also, the missing disk is not showing as allocated, whereas Disk4 is now showing as being used. Strangely, the allocation level of Disk4 is very high. It has only been about 15-20 minutes, and free space is down to 19.8GB. It would appear to me that the repair process promptly reserves the final footprint as “used” on the physical disk and then fills it in. Interesting.

I have not done anything with the “missing disk”. It’s still showing as missing. I will play with it later (such as inserting it back after a while, as well as attempting to re-purpose it; stay tuned for those).

For now, I’m waiting for the repair process to complete. Given the data set on this disk, I expect it to take several hours.

<Several hours later>


Repair has been completed and the virtual disk is reporting healthy. At this point pool can lose another disk and still maintain data integrity.

So what did we learn thus far?

  • The hot-spare drive, unlike on traditional SAN/NAS systems, does NOT get activated as soon as a failure is logged. Rather, it happens 5 minutes after a write operation is made to any of the volumes inside a virtual disk. If a virtual disk does not have any volume that receives write activity, it does not get repaired and remains in a degraded state. Bottom line: if you use a hot-spare, trust it only if you have continuous read/write activity on at least one volume in each of your virtual disks. Otherwise, you will need to monitor the logs constantly and raise alerts for an administrator to perform the change. For an admin-initiated operation, the following has to occur:
    • Change Usage property of the hot-spare disk to “Automatic”
    • Initiate “Repair” on each of the virtual disks that are in degraded state.

Next, we want to see what happens when the missing disk is made available back to the system again.

Action: Insert the previously removed disk (PhysicalDisk5) back in and reboot the computer. (I reboot on re-inserts not because it’s a must but because my eSATA enclosure doesn’t officially support hot-swap. It’s OK to pull a disk to properly simulate a failure, but the opposite is not good practice.)

The disk showed up and maintained its “Retired” status. Compare the screen below with the one above. Because it’s set to Retired, we know from earlier parts of this blog series that it will never be used for write operations.


At this point I want to mark that disk (PhysicalDisk5) as a hot-spare. I could not find a way to change the “Usage” property from the user interface. So this is what I did instead:


Quick refresh on Server Manager UI shows that the change has been reflected properly. (Compare with above where Usage is showing Retired)


That’s all I have for the hot-spare feature. My conclusions are these:

  • It is a new, useful feature that works slightly differently from traditional storage systems.
  • It allows flexibility in that you could potentially leverage a hot-spare that is different in size from the other disks in the pool.
  • A hot-spare services multiple virtual disks. Read this one more time. Not every storage solution allows you to use a single spare disk to potentially service any of the arrays. Oftentimes you are required to assign dedicated spares to each. Windows Server 2012 allows you to build parity, mirror or no-protection virtual disks across physical disks of varying sizes, and allows you to assign a disk or disks of any size as hot-spares. This is nice flexibility that I personally have not seen at this cost level.
  • It does require administrators to learn it properly. The behavior that causes hot-spare(s) to not auto-activate due to lack of write requests could catch someone by surprise. Experiment with it and learn how it works.

Next up, we will look at Disk Deduplication. Read about it in Part 7.


Windows Server 2012 – Storage Spaces and Data Deduplication Hands-on Review Part 5

This is the continuation and Part 5 of my Storage Spaces and Data Deduplication review. Here’s an index of the test cases in this part and links to the other parts:

  • Part 1: Introduction and Lab Environment Preparation
  • Part 2
    • Physical Disk Pull
    • Introduce the Pulled Disk Back into the System
  • Part 3
    • Extend Thinly Provisioned Virtual Disk and Observe Behavior at Limits
    • Bonus: Detecting and Replacing Physical Disk Failures
  • Part 4
    • Removing a Disk from the Storage Pool
  • Part 5 (You’re here)
    • Reclaim Unused Space on Thin Provisioned Disks
    • Bonus: Defragmentation Attempt and Observations
  • Part 6
    • Understanding Hot-Spare Behavior
  • Part 7
    • Evaluating and Enabling Data Deduplication

 

Reclaim Unused Space on Thin Provisioned Disks

I expect footprint reduction to be a very important scenario for migrations between systems, as large data sets move and get re-organized. Organizations should not have to provision twice the amount of physical disks just to be able to re-organize their storage spaces. Thankfully there is a way to do this, which I could not figure out previously. Thanks to Nandu from the Storage Spaces team, I learnt a few things and I’ll show them to you now:

Here’s my test case:

  1. Have a virtual disk, thinly provisioned. Use 4 physical disks.
  2. Fill it up with actual data with 100% NTFS volume allocation within.
  3. Delete data until one of the disks can be removed from the virtual disk (this was the part that I previously thought was not reducing the allocation of thinly provisioned volume on its own)
  4. Remove the disk.
  5. Show that everything is healthy and functioning as expected.

 

We’re starting with those 4 disks that I mentioned. Here’s how things look from the virtual disk perspective:

I’d like you to keep an eye on the “Allocated” column of the virtual disk “TestVirtualDisk1”, which is showing 265GB at the moment.

Because I just added the 4th disk to show you this process, I need to copy some data to it. I will create 3 additional single files, each of which is 45GB or so in size. Yes, these are large single files, and they are in addition to the 265GB that already exists on the virtual disk. I give you all these details because file sizes and total allocations all matter very much in this test case. For example, deleting a small file may not decrease the allocation, but deleting a larger one might. Keep reading to observe these variances.

Copy operation complete, here’s where we are:

…but more importantly the individual disk level allocations:

Remember our purpose: we want to remove Disk3 (92.5GB) from this virtual disk and we want the virtual disk to remain healthy. The test scenario for removal is that the underlying NTFS file system utilization went down, and we want to re-purpose some physical disks in other places.

First test I’m going to do is to hard-delete one of those 3 files I have added and review how allocations and disk spaces are changing, if at all:

After this operation, let’s see if anything changed in Server Manager. As you can see, allocation has dropped from 383GB to 351GB.

If you noticed, I deleted a file 45GB in size. The allocation drop, however, was only 383-351=32GB. It’s hard to reverse engineer the math here, but the reclamation appears to be happening in chunks.

Before I forget, let me include disk level distribution (although it’s very possible this would vary based on which file I have deleted – so don’t read too much into utilization of individual disks)

So let’s try something else. I’m going to delete an 8GB file and see what happens. The theory I’m testing is that, if the allocation trimming happens in chunks, there must be a threshold, and I wonder when I will hit it.

Let’s see what Server Manager shows now.

As suspected – no change!

At this point let me introduce two bullet points provided to me by Nandu:

There are two ways in which space reclamation occurs:

1. The file system sends down TRIMs as soon as it has released the allocation. If the TRIM is for a large enough region (the slab granularity; in the Spaces case, it is 256MB), the space will be immediately released and will be reflected in the allocatedSize property of the physical disks, and the FootPrintOnPool and AllocatedSize properties of the virtual disk. You can observe this by deleting a single large file.

2. The file system sends down TRIMs, but no single TRIM covers an entire slab; the driver is unable to release the allocation and nothing changes.

 

Given that 8GB is quite a large file, I must be hitting the condition 2.
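
The two conditions above can be illustrated with a small Python sketch that counts how many whole 256MB slabs a single TRIM region covers. The `slabs_released` helper and the extent layouts are hypothetical; I do not know how my 8GB file was actually laid out on disk, so treat this only as one way condition 2 could arise.

```python
SLAB = 256 * 1024 * 1024  # slab granularity quoted above (256MB)
MB = 1024 * 1024


def slabs_released(trim_offset, trim_length, slab=SLAB):
    """Count the whole slabs that one TRIM region covers completely."""
    first = -(-trim_offset // slab)              # first slab boundary at or after the start
    last = (trim_offset + trim_length) // slab   # last slab boundary at or before the end
    return max(0, last - first)


# One contiguous 1GB extent that starts on a slab boundary.
aligned = slabs_released(0, 1024 * MB)
# The same 1GB, but starting 100MB into a slab.
unaligned = slabs_released(100 * MB, 1024 * MB)
# Roughly 8GB freed as forty separate 200MB extents (an assumed fragmented layout).
fragmented = sum(slabs_released(i * 300 * MB, 200 * MB) for i in range(40))

print(aligned, unaligned, fragmented)
```

In this model the aligned extent frees four slabs, the unaligned one frees only three, and the fragmented case frees none, because no single TRIM spans a full slab even though about 8GB was deleted in total.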

For that condition, we need to use “Optimize-Volume” PowerShell cmdlet with some special parameters. Like this:

In my case however, it didn’t help. Allocation is still showing 351GB. Let me delete more files and try this again:

Now the Server Manager is showing…

337GB allocated. I have deleted exactly 38GB worth of files. 351GB (previous allocation) – 337GB = 14GB reclaimed. Why the difference?

At this point, we have about 13GB from the prior delete operation, and 38-14=24GB from the last delete operation, waiting to be reclaimed. Let’s see if optimize-volume works differently this time:

Looks like there are too few slabs for optimize-volume to care enough to do something about them.

Defragmentation Attempt and Observations

At this point I will attempt a good old defrag and observe – I have a strange feeling about this. I ran:

As you can see we’re into 9% of the defrag operation (it’s been about 30 mins or so since it started).

Let’s see how my virtual disk allocation is doing (prepare for a surprise)

What?! It actually increased. I have not added any new files; I’m just letting defrag do its thing. It went from 337GB to 345GB. Now the system owes us:

  • 13GB when I deleted a 45GB file and it reclaimed only 32GB of it.
  • 24GB when I deleted 38GB worth of files and it reclaimed only 14GB of it.
  • 8GB of “defrag” overhead only when 9% into defragmentation.
  • Total: 13+24+8 = 45GB system owes us (i.e. we should be able to reclaim)
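
The same bookkeeping as a minimal Python sketch, using only the numbers reported above. The variable names are mine, chosen for illustration.

```python
# (description, GB deleted, GB by which the allocation actually dropped)
deletions = [
    ("deleted a 45GB file", 45, 383 - 351),
    ("deleted 38GB worth of files", 38, 351 - 337),
]

owed_gb = 0
for description, deleted_gb, reclaimed_gb in deletions:
    pending_gb = deleted_gb - reclaimed_gb
    owed_gb += pending_gb
    print(f"{description}: reclaimed {reclaimed_gb}GB, still pending {pending_gb}GB")

defrag_growth_gb = 345 - 337  # allocation growth observed 9% into the defrag
owed_gb += defrag_growth_gb
print(f"defrag overhead: {defrag_growth_gb}GB")
print(f"total the system owes us: {owed_gb}GB")
```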

     

Let’s highlight this observation: Defragmenting a volume created inside a thinly provisioned virtual disk could increase the allocation footprint of the virtual disk.

What I just said is really bizarre and I will research it to figure out what’s going on, or if the defrag-time bloat is temporary.

<waiting till defrag finishes. I’ll note allocation size and defrag % as I check the status occasionally>

Defrag percent    Allocated Size on Virtual Disk
~10%              346GB
~11%              347GB
~12%              349GB
~13%              352GB
~14%              367GB

>>I have interrupted the defrag, as the allocation is getting very high while we’re still below 15% completion. At this rate, the footprint could reach 100% of the provisioned capacity. I’ll learn more about this and report back.
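
To see how quickly the footprint was growing, here is a minimal Python sketch over the readings above, with the 345GB reading at about 9% included as the first sample. It only computes differences between my observations; it does not try to predict where the growth would have stopped.

```python
# (approximate defrag percent, allocated size on the virtual disk in GB)
samples = [(9, 345), (10, 346), (11, 347), (12, 349), (13, 352), (14, 367)]
baseline_gb = 337  # allocation before the defrag started

# Growth between consecutive readings.
deltas = [later[1] - earlier[1] for earlier, later in zip(samples, samples[1:])]
for (earlier, later), delta in zip(zip(samples, samples[1:]), deltas):
    print(f"~{earlier[0]}% -> ~{later[0]}%: +{delta}GB")

total_growth_gb = samples[-1][1] - baseline_gb
print(f"total growth since defrag started: {total_growth_gb}GB")
```

The growth per percentage point is not constant: it goes from 1GB early on to 15GB for the last step, which is why I stopped the defrag rather than waiting to see where it would end.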

Let me re-run the previous optimize-volume command and see what happens:

Not good. Meanwhile physical disks are all still active as if they are continuing to do defrag. I’ll wait a while and let them stabilize a bit.

<I decided to let the system do its thing for a few hours>

Disks have eventually idled and when I re-tried the above command, it behaved better, as in:

Slab Consolidation part is taking a short while, reached about 35% in 5 minutes. Let’s recap what we have so far while waiting:

  • Traditional defrag on a thin-provisioned volume significantly increases the volume’s actual physical footprint. Pending further research, for now I am inclined to say it’s not a good idea to defrag a volume that sits on a thinly provisioned virtual disk.
  • The optimize-volume command could error out if you run it immediately after interrupting a traditional defrag process.
  • Disks continue to be active even after you interrupt the defrag process, still causing optimize-volume to fail. I have not been able to measure exactly how long after the interruption of defrag the disks reach an idle state.
  • The slab consolidation process has some minimum slab count in mind, and will refuse to reclaim space if the changes are too small.
  • Only after I reached a significant level of file deletions did optimize-volume decide to actually do slab consolidation.

 

Ok. While I was typing these up, slab consolidation finished. Here’s how the finishing lines look:

Alright… As you can see, the system owed us 45GB of space from deleted files. It was also showing a slab with potential to be purged but declining to process it based on the assessment of “too few slabs”. The result is within my expectations and, allowing for rounding errors, I believe I have been able to reclaim all the space I was expecting to reclaim.

Here’s how the Server Manager looks:

Next up is “Hot Spare”. Continue on Part 6.

Posted in Computers and Internet

Windows Server 2012 – Storage Spaces and Data Deduplication Hands-on Review Part 4

This is the continuation and Part 4 of my Storage Spaces and Data Deduplication review. Here’s an index of test cases on this part and links to other parts:

  • Part 1: Introduction and Lab Environment Preparation
  • Part 2
    • Physical Disk Pull
    • Introduce the Pulled Disk Back into the System
  • Part 3
    • Extend Thinly Provisioned Virtual Disk and Observe Behavior at Limits
    • Bonus: Detecting and Replacing Physical Disk Failures
  • Part 4 (You’re here)
    • Removing a Disk from the Storage Pool
  • Part 5
    • Reclaim Unused Space on Thin Provisioned Disks
    • Bonus: Defragmentation Attempt and Observations
  • Part 6
    • Understanding Hot-Spare Behavior
  • Part 7
    • Evaluating and Enabling Data Deduplication

 

Removing a Disk from the Storage Pool

At the moment, we have 4 disks in our pool. I’d like to reduce this to 3 disks. The scenario is that my virtual disk no longer needs that much space, so I’ll delete some files and then see if I can remove a disk from the pool. I think you’ll find a few things here quite interesting.

Current state: Remember that the virtual disk just went offline and will not accept any further writes (see screenshots earlier in Part 3). I can, however, bring it online and then delete files. As I do this, I will check the health page of the virtual disk to see if some of the space on the physical disks is reclaimed. If it is, I will attempt to remove a disk from the pool.

Action: Delete 20GB worth of files.

Before deletion:


After hard deletion of 20GB of files (didn’t make a silly recycle.bin type mistake here – just saying):


It would appear that deletions that happen at the volume level are not immediately (or perhaps ever – Update: see Part 5 on my learnings as it turns out reclaiming of space is possible) reflected on the thinly provisioned virtual disk. Let me attempt to remove Disk3 from the pool and see what happens.


I clicked on Yes, and this is what I got:


I know there is sufficient space to hold things together – but how do I vacate the disk? I did some research. The “Usage” property of the physical disk needs to be changed, and then a “Repair” operation needs to be run on the Virtual Disk. Note that setting the usage policy to “Retired” doesn’t by itself vacate the disk. I have not found a way to do this from the graphical interface, so here’s how it’s done in PowerShell. I also have NOT been able to find a way to reduce the allocation footprint of a thinly provisioned virtual disk. That is, once the allocation has grown to a certain point, it doesn’t appear to ever go back. I will continue to research this but for now that’s my assumption.

Update (October 4, 2012): I have learnt (Thanks Nandu!) the ways to reduce allocation. See Part 5 for a special case to demonstrate reduction of storage allocation on a virtual disk.

Given that I am unable to reclaim the needed space right now, I will not be able to take out 1 disk just yet. I will need to bring in more total capacity and take 2 disks out. Specifically, the Disk5 from primordial is coming in, Disks 3 and 4 are going back to primordial. I am hoping to be able to achieve this by first setting usage policy on 3 and 4 as retired, then adding the disk5 to the pool. When I then run repair on the volume, data on 3 and 4 will transfer over to 5, I will then expect to be able to remove those disks from my pool. Let’s see if theory works.



Now I’m adding Disk5 to the pool. After these changes, and a refresh operation, here’s how my VirtualDisk1 is showing. Because I set two of the disks as retired, no new data can be written to them. However, I can read data off of the volume without any issue. Also note that despite adding Disk5 to the pool, Disk5 does not yet have any allocation showing under the virtual disk.


Next, we need to run the “Repair-VirtualDisk” command.


During repair process, you can see that Disk5 started to be utilized and retired disks are not showing. At this point I’m not sure if retired disks can simply be removed from the system or not. I will attempt to do so:


While the repair job is still in progress, I am proceeding to remove Disks 3 and 4 from the pool – but look what happens. Because the repair is still in progress, it’s preventing me from doing that. This is actually strange – because the Virtual Disk Health properties clearly show Disks 3 and 4 have no allocation on them whatsoever.


<I don’t know how long repair took – I resumed after a day or so and it was done>

Take a quick look at the state of our virtual disk. If you notice, status is reporting healthy and that disks3 and 4 are not in use. This should be sufficient for me to do the removal. Let’s check…


Action: Remove Disk3


This time it proceeded to remove, and displayed this message:


I don’t think it’s actually doing any lengthy repair tasks as the disk had no content for this virtual disk. A quick “refresh” confirms this. The disk is not in repair mode and the removal appears to be successful. I did the same for Disk4 and here’s where we are now:


Primordial disks look like this:


So what did we learn here?

  • Thin provisioned Virtual Disk only expands its footprint, never shrinks. I say this based on pure observation. If there is some PowerShell magic that does offline maintenance or something, I don’t know. If I do learn something new, I’ll come back and update. For now, it only appears to expand. This means that temporary file additions to thin provisioned disks should be carefully planned. See Part 5 for a test case that shows how footprint reduction is done.
  • Removal of a disk from a pool is possible, but requires the retire-repair-remove cycle I have demonstrated above. The remaining disks must provide at least the same amount of capacity and at least 3 columns per virtual disk in order for you to be able to remove an existing disk from a pool. This method will preserve parity protection in case you lose a physical disk while performing this maintenance. I would avoid the pull-addNew-repair method as you’re risking complete data loss until the repair is complete.
  • In Windows Server 2012 RTM, it does appear that the dismount-due-to-pool-full behavior is recoverable in that the Virtual Disk stays mounted (unlike Release Preview version) if no writes are attempted to it. This statement needs time-testing with different data sets – so if you do repro a situation where you cannot mount a virtual disk that is full, drop me a comment with your details and I’ll attempt to replicate your scenario. It’s very critical that without additional physical disk we should be able to at least read the data out of our virtual disks. For the RTM version of Windows Server 2012, my experience shows that we indeed can mount & read everything from it.
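The two removal conditions from the list above (enough remaining capacity, and at least 3 disks to hold the columns) can be turned into a quick pre-check. This is a minimal Python sketch; the function name, disk names and sizes are all hypothetical, and passing it is necessary but not sufficient, because it knows nothing about how Storage Spaces actually places data.

```python
def can_remove(pool_disks_gb, to_remove, allocated_gb, min_columns=3):
    """Rough pre-check before retiring disks from a pool.

    pool_disks_gb: dict of disk name -> capacity in GB (hypothetical values)
    to_remove:     set of disk names we would like to take out
    allocated_gb:  physical footprint currently allocated in the pool
    """
    remaining = {name: gb for name, gb in pool_disks_gb.items()
                 if name not in to_remove}
    enough_disks = len(remaining) >= min_columns          # parity needs 3 columns
    enough_capacity = sum(remaining.values()) >= allocated_gb
    return enough_disks and enough_capacity


# Made-up pool of four 100GB disks with 250GB allocated.
pool = {"DiskA": 100, "DiskB": 100, "DiskC": 100, "DiskD": 100}
print(can_remove(pool, {"DiskD"}, allocated_gb=250))           # one disk out
print(can_remove(pool, {"DiskC", "DiskD"}, allocated_gb=250))  # two disks out
```

If the check fails, the options are the ones used in this post: reclaim space first, or bring in more capacity before retiring anything.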

 

Continue reading additional cases in Part 5.


Windows Server 2012 – Storage Spaces and Data Deduplication Hands-on Review Part 3

This is the continuation and Part 3 of my Storage Spaces and Data Deduplication review. Here’s an index of test cases on this part and links to other parts:

  • Part 1: Introduction and Lab Environment Preparation
  • Part 2
    • Physical Disk Pull
    • Introduce the Pulled Disk Back into the System
  • Part 3 (You’re here)
    • Extend Thinly Provisioned Virtual Disk and Observe Behavior at Limits
    • Bonus: Detecting and Replacing Physical Disk Failures
  • Part 4
    • Removing a Disk from the Storage Pool
  • Part 5
    • Reclaim Unused Space on Thin Provisioned Disks
    • Bonus: Defragmentation Attempt and Observations
  • Part 6
    • Understanding Hot-Spare Behavior
  • Part 7
    • Evaluating and Enabling Data Deduplication

 

It’s strongly recommended that you do review Part 1 and Part 2 for the situation leading up to this point – otherwise screenshots and scenarios may not make much sense.

Extend Thinly Provisioned Virtual Disk and Observe Behavior at Limits

Next, I’m going to test the behavior of a thin provisioned disk at its limits. Here’s the scenario:

  • Test how extending a virtual disk works. I expect this to just work with no interesting findings.
  • I will then test extending the virtual disk beyond the capacity of the storage pool. I also expect this to work fine, at least at the time of extension process.
  • I will then load up real data inside the disk, beyond the capacity of the storage pool. This is the interesting part.

 

First, here’s how I’m extending the disk. Keep in mind my test Storage Pool can handle 579GB total. For this first extension, I’ll use something small just to see the operation.



Well… It took literally 1 second to reach this state. I couldn’t even catch the progress status.


Next extension is going to go beyond 579GB barrier of the Storage Pool. Let’s see how that works.

That too got confirmed almost instantly:


Note that the NTFS volume inside this disk does NOT get automatically extended. It needs to be done separately. Here’s how:


Note that we’re not creating additional stripes to extend – just leave the selected disk (circled) as-is.


Result:


At this point, just by pure coincidence, one of my disks appears to be having a serious problem. So let’s use this opportunity to replace it with another.

Detecting and Replacing Physical Disk Failures

Event log is full of these:



So “Disk 4” corresponds to this one that I’m marking, and is the one that’s failing.


I happen to have same-size disk also available in primordial group at the moment. I will add that one, then remove the bad one. Here’s the current situation with primordial group. The one with red arrow is the replacement/new disk.


Here’s how I added “Disk 5” to my Storage Pool:



I then clicked OK. This is the dialog that came up. After waiting for about a minute, disk was added. I hope these “durations” are helping you because in most other systems such activities can take hours or cause hangs/freezes especially if the corresponding disks are failing. Thus far, I am pleased with configuration change commit durations in Windows Server 2012. Keep reading.


Here’s the situation with my Storage Pool now:


Next I will remove “Disk 4”. Remember, Disk 4 is giving all sorts of errors and delays, sounding really bad with constant hopeless seeks, a constantly active drive LED etc. I was simply lucky to hit this problem in the middle of writing this blog post. Anyway… After only about 2 minutes of waiting in the below dialog, the removal operation completed. I also want to take this opportunity to remind you that my lab gear is very old. I do believe the disks are S.M.A.R.T capable but Storage Spaces had no proactive warning about them failing. This could be because of various reasons – and if I may speculate: a) Errors were not severe enough for Storage Spaces to notice (yet they were clearly causing significant commit delays per the event log), b) The disks/controllers I have are too old and failure warnings were not making it all the way to Storage Spaces, c) S.M.A.R.T is simply useless and proactive warnings about disk failures are just not something admins should expect to rely on. At any rate, the failing drive was still showing as healthy in Storage Spaces. Keep reading…


Couple interesting things on this next screen. Remember this came up after 2 minutes or so. What I like about this is that Windows Server proceeded to remove the disk and didn’t bother trying to read all the data out of it before committing removal action. It promptly took the drive out, then began repairing the now degraded virtual disk within. Because I had added the new disk, there is no capacity challenge. I do expect to lose data should another disk go down at this point. I say this because I took the 3rd disk in a parity protected set out and system didn’t have a chance to rebuild the parity on the disk I just added.


At this point all 3 disks are very active, bringing the virtual disk back to its parity protected healthy state. I don’t know how long this will take; however, I don’t want to make any other changes/tests until it’s back to healthy state again. Keeping in mind that my system is not designed for performance evaluation, following is the current activity from resource monitor. I’m providing this so you have some reference to show which process is doing what activity.


..after about an hour, rebuild activity still going; this time, even on this slow system and disks, strangely transfer rates picked up, as in:


After waiting some more, disks went idle and everything showing as healthy again.


Milestone: We were able to replace a failing disk (not simulated) with another one.

In an attempt to continue with the test “Extend Thinly Provisioned Virtual Disk and Observe Behavior at Limits”, I initiated a number of copy operations. I completely saturated the write capacity of the virtual disk. While this was happening, I made a few observations.

“Refresh” operation in Server Manager takes upwards of 3 to 6 minutes. If the virtual disk is not under load, this happens within 5 seconds. Whichever way the prioritization works, expect “Delays” while underlying disks are busy. The refreshing of statistical information may not be high priority, although 3 to 6 minutes of wait feels a bit on the extreme. Then again, this gear is not designed for performance. Everything I have is old.


Another observation concerns a simple chkdsk operation against the volume while copy operations are in progress. This too takes exceedingly long. Here’s a snapshot of the chkdsk output in case there is anything of interest to you:


Also as an FYI, here’s how the copy processes are going right now. I am quite pleased with the way Windows can load balance I/O this evenly. Look at the write B/Sec values across all 5 copy operations I initiated against the same target virtual disk. While it’s causing excessive delays in operations like chkdsk, it manages to share the available write throughput among all processes almost perfectly. Disk queue length gives you a clue about the saturation of write capacity. In this day and age, getting saturated at a mere 15MB/sec tells you how old my gear is. Do not draw performance conclusions from my post.


…after a while, 296GB of the volume was showing as allocated. I wanted to continue with copy operations. I started a couple more file copy threads and strangely got this (D: is my test volume and C: is the system/boot drive that has nothing to do with Storage Spaces):


A couple more screenshots to show you around. The disk just went offline. Here’s how it looks from the Computer Management console:


Here are some event log entries. I marked the point of failure that kicked in as I started another round of copy operations. Note that it did start copying, and about 30 seconds into the copy operation this happened:


Strangely Server Manager is completely unaware of this fatal situation. Below is after a complete refresh of Server Manager UI:


 

It caught my attention that one of the physical disks servicing the Storage Pool is showing 0.00 B free space (see above screenshot). Let’s keep this in mind. For now I will attempt to bring the disk online. As in:


…and it does go online.


Let’s repeat the same copy operation and see what happens:


As soon as I want to write a new file to this disk, same thing happens. Disk goes offline. We’ve got a reproducible situation. Here’s the event log one more time:


So let’s recap what just happened. If one of the Storage Pool disks reaches a zero-bytes-free state, the thin-provisioned virtual disks on top of it are promptly taken offline if they need to expand. After some research, I learnt why this is the case:

  • Virtual Disk is taken offline because there are no disks available in the pool to continue to grow that 3rd column onto.
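The rule above can be pictured with a tiny Python sketch. It is a simplified model of my own, not the actual Storage Spaces allocator: a parity virtual disk with 3 columns can only grow if at least 3 physical disks still have free space. The disk names and free-space figures are hypothetical, apart from the 93.2GB disk that gets added later in this post.

```python
def can_grow(free_gb_by_disk, columns=3):
    """True if at least `columns` disks still have free space to extend onto.

    Simplified model: ignores slab sizes and how much each column actually needs.
    """
    disks_with_space = [name for name, free in free_gb_by_disk.items() if free > 0]
    return len(disks_with_space) >= columns


# Hypothetical pool state: one of three disks has hit 0.00 B free.
pool_free = {"DiskA": 25.0, "DiskB": 40.0, "DiskC": 0.0}
print(can_grow(pool_free))   # only 2 disks have room for a 3-column layout

# Adding an empty 93.2GB disk gives the third column somewhere to go.
pool_free["Disk3"] = 93.2
print(can_grow(pool_free))
```

In this model the total free space in the pool is irrelevant; what matters is how many separate disks still have room.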

 

Because this is exactly the case I was curious about, and because it did behave in a way to cause outage, I will continue with experimentation. I will add another disk and keep adding files. We’ll see what happens there. So here’s where we are from pool/disk configuration standpoint:


Here’s a look at “Disks”:


…and finally, here’s what Primordial pool looks like:


Now I’ll proceed to add Disk3 (93.2GB) to the TestSP1 Storage Pool. Here’s how things look after the addition:

If the reason for disk dismount is accurate, I should be able to add files until next drive in the pool reaches zero free space. My reasoning for this theory is that parity protection requires minimum of 3 disks. For any new data file to be added, it needs 3 physically separate places to write the information to.

Because of the newly added disk, copy operation is progressing without disk going offline… for now… and here’s the situation from Virtual Disk health perspective. Disk3 (with red arrow) is the one I just added. Circled one is the next disk that I expect to go to zero free space.


Alright… the copy operation was designed to add a single 8GB file to the disk. The reason for this test was to stress the condition where only 1.5GB is free on the next drive with the least space. Assuming there will be 3 columns for it, the 8GB file should have a total physical footprint of:

(8GB / 2 disks for data) x 3 (due to parity) = 12GB in a parity protected disk. This would have been distributed evenly over 3 non-full disks, which gives us 4GB per physical disk. The next available disk did not have that much free space (it had 1.5GB), thus the copy operation should have failed. But it didn’t fail. The copy operation succeeded and only then did the disk go offline. Check this out:


Here’s what I learnt so far:

A thin provisioned disk can go offline without any application-friendly notification from the file system’s perspective. For example, there is no “disk full” or “General error” type condition that, while preventing writes to the disk, would let read operations continue or allow the application to deal with the situation gracefully. Therefore, the underlying pool capacity as well as the free space on each individual disk need to be well understood and well managed (monitored). There is the following property of the Storage Pool that caught my attention, although I don’t know what it does at this point:


Unknown: Parity allocation and disk utilization algorithms. I made an attempt to estimate these through some common sense methods but somehow the math doesn’t work out. I do understand when the disk is taken offline; however, I don’t yet have the accuracy I would need to comfortably design a system. I will keep experimenting on this boundary condition later. Moving on to the next test case.
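For reference, the estimate from the 8GB test can be written down as a small Python sketch. It assumes, as I did above, a 3-column parity layout where two columns hold data and one holds parity; the function name is made up, and this is my back-of-the-envelope model rather than how Storage Spaces really allocates.

```python
def parity_footprint_gb(file_gb, columns=3):
    """Return (total physical GB, GB per disk) for a file on a parity layout.

    Assumption: one column's worth of every stripe is parity, so the data is
    split across (columns - 1) disks and each disk receives an equal share.
    """
    data_columns = columns - 1
    per_disk_gb = file_gb / data_columns
    total_gb = per_disk_gb * columns
    return total_gb, per_disk_gb


total_gb, per_disk_gb = parity_footprint_gb(8)
least_free_gb = 1.5                      # free space on the tightest disk
fits = least_free_gb >= per_disk_gb      # would the estimate predict success?

print(f"Total footprint: {total_gb} GB, per disk: {per_disk_gb} GB")
print(f"Estimate says the copy fits: {fits}")
```

The estimate predicts that the copy should not fit, yet the copy succeeded before the disk went offline, which is exactly the mismatch noted above.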

Continue reading additional test cases in Part 4.


Windows Server 2012 – Storage Spaces and Data Deduplication Hands-on Review Part 2

This is the continuation of my Storage Spaces and Data Deduplication review. Here’s an index of test cases on this part and links to other parts:

  • Part 1: Introduction and Lab Environment Preparation
  • Part 2 (You’re here)
    • Physical Disk Pull
    • Introduce the Pulled Disk Back into the System
  • Part 3
    • Extend Thinly Provisioned Virtual Disk and Observe Behavior at Limits
    • Bonus: Detecting and Replacing Physical Disk Failures
  • Part 4
    • Removing a Disk from the Storage Pool
  • Part 5
    • Reclaim Unused Space on Thin Provisioned Disks
    • Bonus: Defragmentation Attempt and Observations
  • Part 6
    • Understanding Hot-Spare Behavior
  • Part 7
    • Evaluating and Enabling Data Deduplication

 

Test Cases

Up until this point we have been provisioning the storage so we can do some tests. Starting at this point I will poke at different things in the system to see how it breaks, recovers or handles the situation.

Physical Disk Pull

Note that even though my gear doesn’t support hot-swap, I’ll just pull a disk anyway… It could fail just like that in real life.

Action: 100GB disk that is part of the storage pool is physically pulled while copy activity was in progress.

  1. Answer: No apparent impact to operations. File copy continues as normal. This would have been the case with Windows Server 2008 R2 (and prior) software RAID5, so nothing new so far. Keep in mind, if this was just a “Simple” virtual disk (as opposed to parity or mirror protected), I fully expect removal action like this to bring it down along with loss of data. As we’re on parity protection here, no impact to operations observed.


Just so I show you all aspects of how Windows Server handles this “disk pull” event, here’s how Server Manager “feels” – which is not looking right at first glance. Disk 2 (100GB) is already pulled out. There is definitely some lag between the time the disk is pulled and the time Server Manager decides to show it. At this point I’m realizing that the “Refresh” button in Server Manager is quite important. Keep reading…


As I hit refresh button, it shows what I would expect:


Ok. Now, suppose I would like to assign one of the other drives out of “Primordial” pool. To do this, I right-click on the “Degraded” Virtual Disk, and choose “Add Physical Disk“, as in:


This brings up the following dialog. Remember, these disks are the ones showing in the Primordial group (basically any unallocated disk that the system can see at this point). I chose Disk4 and clicked OK.


Addition operation is quite fast, it shows like this almost immediately:


At the same time, disk LEDs of all physical drives in the Storage Pool went active. Presumably the new disk is getting its data generated as we have parity protection on the Virtual Disk.

At this point I’m going to attempt to remove the shadow configuration of the disk that is no longer in the system from the list of disks. Specifically the one with the exclamation mark in above dialog. Interesting observation here is going to be how long this will take. I will right-click on it and choose “Remove Disk“, like this:


The Remove Disk operation on a disk that’s physically out of the system is actually quite fast. Within a few seconds, I got this confirmation and a brief note as to what I should expect to happen next:


At this point I refreshed the Server Manager and everything seems back to normal/healthy, as in:


Milestone: We were able to successfully replace a failed disk without losing data.

Introduce the Pulled Disk Back into the System

Next up is a small test about the disk that we pulled. I’d like to introduce it back to the system – but can I do that? First, remember from above steps, that I did the following:

  • Pulled it hot while running
  • Disk entry turned yellow
  • I right-clicked on it and chose “Remove Disk”
  • Server Manager refreshed and it’s no longer showing

 

Now… question is, will it come back if I just insert the disk? Such a simple looking question – you may be in for a surprise here. Keep reading.

Action: I insert the disk back in. I did this while system is running.

Drive LED came up green. Server Manager is not showing it. I refreshed the UI / no luck. So here’s the status right now (Remember, there should be total of 6 disks in the system, we’re seeing only 5). Also remember that my controller or disks do NOT support hot-swap. If you notice the “Rescan Storage” option in below dialog, I will click that. Keep reading…


Nothing happened. Still seeing only 5 disks.

Because hot-swap is not supported on this gear, I will proceed to reboot the system. Keep in mind, these disks are in a separately powered external eSATA enclosure. At this stage I have only done a soft reboot of the system. Not even a power-down.

No change. Still seeing only 5 disks in Server Manager. However, when I look at the Computer Management, I see below situation. The circled disk is the one that I re-inserted and the one that’s not showing in Server Manager / Storage Spaces. Keep reading…


I did some research and asked around. I learnt the following:

  • Since I did a “remove disk” on the shadow configuration entry while the disk was physically out of the system, Storage Pool knows not to take the disk back into the pool.
  • Since the disk itself was offline during this change, it still thinks it can join the pool.
  • Because the disk in question has the remnants of the pool, it’s not showing in the primordial pool.
  • Given all of this, disk needs to be cleaned up before it can show in primordial.
  • There is still some speculation on whether the Storage Spaces / Disks view should or shouldn’t be able to show. In my case it does NOT show. Only Computer Management shows it.

 

Now let’s figure out how to do that… Right click action on the volume is all grayed out, check:


Given that all else in the GUI is grayed out, I decided to do this from diskpart. I simply issued a “clean” command on it, here:


Now I expect this to show in primordial group… and sure enough it’s there after a “Refresh”. See below.


Milestone: We were able to re-introduce the pulled disk back into the system.

Continue reading additional test cases in Part 3.
