Computers and Internet

Windows Server 2012 – Storage Spaces and Data Deduplication Hands-on Review Part 3

This is Part 3 of my Storage Spaces and Data Deduplication review. Here’s an index of the test cases in this part, along with links to the other parts:

  • Part 1: Introduction and Lab Environment Preparation
    • Physical Disk Pull
    • Introduce the Pulled Disk Back into the System
  • Part 3 (You’re here)
    • Extend Thinly Provisioned Virtual Disk and Observe Behavior at Limits
    • Bonus: Detecting and Replacing Physical Disk Failures
    • Removing a Disk from the Storage Pool
    • Reclaim Unused Space on Thin Provisioned Disks
    • Bonus: Defragmentation Attempt and Observations
    • Understanding Hot-Spare Behavior
    • Evaluating and Enabling Data Deduplication

 

It’s strongly recommended that you review Part 1 and Part 2 for the situation leading up to this point; otherwise the screenshots and scenarios may not make much sense.

Extend Thinly Provisioned Virtual Disk and Observe Behavior at Limits

Next, I’m going to test the behavior of a thin provisioned disk at its limits. Here’s the scenario:

  • Test how extending the virtual disk works. I expect this to just work, with no interesting findings.
  • I will then test extending the virtual disk beyond the capacity of the storage pool. I also expect this to work fine, at least at the time of the extension itself.
  • I will then load real data onto the disk, beyond the capacity of the storage pool. This is the interesting part.

 

First, here’s how I’m extending the disk. Keep in mind my test Storage Pool can handle 579GB total. For this first extension, I’ll use something small just to see the operation.



Well… It took literally 1 second to reach this state. I couldn’t even catch the progress status.
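For those who prefer scripting over the wizard, the same extension can presumably be done with the Storage module’s Resize-VirtualDisk cmdlet. A minimal sketch; the virtual disk name and target size below are placeholders, not my lab’s actual values:

# List virtual disks to confirm the friendly name and current size
Get-VirtualDisk | Select-Object FriendlyName, Size, ProvisioningType

# Grow the thin-provisioned virtual disk (placeholder name and size)
Resize-VirtualDisk -FriendlyName "TestVD1" -Size 750GB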


The next extension is going to go beyond the 579GB barrier of the Storage Pool. Let’s see how that works.

That too got confirmed almost instantly:


Note that the NTFS volume inside this disk does NOT get automatically extended. It needs to be done separately. Here’s how:


Note that we’re not creating additional stripes to extend – just leave the selected disk (circled) as-is.
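By the way, if you’d rather script the volume extension, the Storage cmdlets should be able to do the same thing as the wizard. A rough sketch, assuming D: is the volume sitting on the extended virtual disk (as it is in my lab):

# Find the maximum size the extended virtual disk allows, then grow the NTFS volume to it
$max = (Get-PartitionSupportedSize -DriveLetter D).SizeMax
Resize-Partition -DriveLetter D -Size $max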


Result:


At this point, purely by coincidence, one of my disks appears to be having a serious problem. So let’s use this opportunity to replace it with another.

Detecting and Replacing Physical Disk Failures

Event log is full of these:



So “Disk 4” corresponds to the one I’m marking below, and it’s the one that’s failing.
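As an aside, the same health information can be cross-checked from PowerShell. Here’s a sketch of what I’d run (and, as you’ll see shortly, the failing drive still reports Healthy):

# Show health-related columns for every physical disk the system sees
Get-PhysicalDisk | Select-Object FriendlyName, DeviceId, OperationalStatus, HealthStatus, Usage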


I happen to have a same-size disk available in the Primordial group at the moment. I will add that one, then remove the bad one. Here’s the current situation with the Primordial group; the one with the red arrow is the replacement/new disk.


Here’s how I added “Disk 5” to my Storage Pool:



I then clicked OK, and this is the dialog that came up. After waiting for about a minute, the disk was added. I hope these “durations” are helpful, because in most other systems such activities can take hours or cause hangs/freezes, especially if the corresponding disks are failing. Thus far, I am pleased with the configuration change commit durations in Windows Server 2012. Keep reading.
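For reference, the PowerShell equivalent of this Add Physical Disk step would look roughly like the following. The pool name matches my TestSP1, but the disk selection logic is just an illustration:

# Pick an available (primordial) disk and add it to the pool
$newDisk = Get-PhysicalDisk -CanPool $true | Select-Object -First 1
Add-PhysicalDisk -StoragePoolFriendlyName "TestSP1" -PhysicalDisks $newDisk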


Here’s the situation with my Storage Pool now:


Next I will remove “Disk 4”. Remember, Disk 4 is giving all sorts of errors and delays, sounding really bad with constant hopeless seeks, a constantly active drive LED, etc. I was simply lucky to hit this problem in the middle of writing this blog post. Anyway… after only about 2 minutes of waiting in the dialog below, the removal operation completed. I also want to take this opportunity to remind you that my lab gear is very old. I do believe the disks are S.M.A.R.T. capable, but Storage Spaces gave no proactive warning about them failing. This could be for various reasons; if I may speculate: a) the errors were not severe enough for Storage Spaces to notice (yet they were clearly causing significant commit delays per the event log), b) the disks/controllers I have are too old and failure warnings were not making it all the way to Storage Spaces, or c) S.M.A.R.T. is simply useless and proactive warnings about disk failures are not something admins should rely on. At any rate, the failing drive was still showing as healthy in Storage Spaces. Keep reading…


A couple of interesting things on this next screen. Remember, this came up after 2 minutes or so. What I like here is that Windows Server proceeded to remove the disk and didn’t bother trying to read all the data out of it before committing the removal. It promptly took the drive out, then began repairing the now-degraded virtual disk. Because I had added the new disk, there is no capacity challenge. I do expect to lose data should another disk go down at this point; I say this because I took the 3rd disk of a parity-protected set out, and the system hasn’t yet had a chance to rebuild parity onto the disk I just added.
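If you wanted to drive the same removal and repair from PowerShell instead of Server Manager, I believe it would look roughly like this; the disk and virtual disk names are placeholders for the ones in my pool:

# Remove the failing disk from the pool, then repair the now-degraded virtual disk
$badDisk = Get-PhysicalDisk -FriendlyName "PhysicalDisk4"
Remove-PhysicalDisk -StoragePoolFriendlyName "TestSP1" -PhysicalDisks $badDisk
Repair-VirtualDisk -FriendlyName "TestVD1"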


At this point all 3 disks are very active, bringing the virtual disk back to its parity-protected healthy state. I don’t know how long this will take; however, I don’t want to make any other changes or run other tests until it’s back to a healthy state. Keeping in mind that my system is not designed for performance evaluation, the following is the current activity from Resource Monitor. I’m providing this so you have some reference showing which process is doing what activity.


…after about an hour, the rebuild activity is still going. This time, even on this slow system and these disks, transfer rates strangely picked up, as in:


After waiting some more, the disks went idle and everything is showing as healthy again.
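Rather than watching drive LEDs, the repair progress can presumably also be followed from PowerShell. A sketch of what should show the running job and the resulting virtual disk health:

# Show any running storage jobs (the parity rebuild should appear here)
Get-StorageJob

# Confirm the virtual disk is back to Healthy
Get-VirtualDisk | Select-Object FriendlyName, OperationalStatus, HealthStatus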


Milestone: We were able to replace a failing disk (not simulated) with another one.

In an attempt to continue with the test “Extend Thinly Provisioned Virtual Disk and Observe Behavior at Limits”, I initiated a number of copy operations and completely saturated the write capacity of the virtual disk. While this was happening, I made a few observations.

The “Refresh” operation in Server Manager takes upwards of 3 to 6 minutes. If the virtual disk is not under load, it completes within 5 seconds. Whichever way the prioritization works, expect delays while the underlying disks are busy. Refreshing the statistical information may not be high priority, although 3 to 6 minutes of waiting feels a bit extreme. Then again, this gear is not designed for performance; everything I have is old.


Another observation is a simple chkdsk operation against the volume while the copy operations are in progress. This too takes exceedingly long. Here’s a snapshot of the chkdsk output in case there is anything of interest to you:


Also, as an FYI, here’s how the copy processes are going right now. I am quite pleased with how evenly Windows load-balances I/O here. Look at the write B/sec values across all 5 copy operations I initiated against the same target virtual disk. While it’s causing excessive delays in operations like chkdsk, it manages to share the available write throughput among all processes almost perfectly. The disk queue length gives you a clue about write saturation. At this day and age, getting saturated at a mere 15MB/sec tells you how old my gear is. Do not draw performance conclusions from my post.


…after a while, 296GB of the volume was showing as allocated. I wanted to continue with the copy operations, so I started a couple more file copy threads and, strangely, got this (D: is my test volume and C: is the system/boot drive that has nothing to do with Storage Spaces):


A couple more screenshots to show you around. The disk just went offline. Here’s how it looks from the Computer Management console:


Here are some event log entries. I marked the point where the failure kicked in as I started another round of copy operations. Note that it did start copying, and about 30 seconds into the copy operation this happened:


Strangely, Server Manager is completely unaware of this fatal situation. Below is what it shows after a complete refresh of the Server Manager UI:


 

Now it caught my attention that one of the physical disks servicing the Storage Pool is showing 0.00 B of free space (see the screenshot above). Let’s keep this in mind. For now I will attempt to bring the disk online, as in:


…and it does go online.
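(For the record, bringing the disk back online can also be done with a one-liner. A sketch; the disk number is a placeholder for whatever the dismounted virtual disk shows up as in Disk Management:

# Bring the offline disk back online (placeholder disk number)
Set-Disk -Number 6 -IsOffline $false)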


Let’s repeat the same copy operation and see what happens:


As soon as I try to write a new file to this disk, the same thing happens: the disk goes offline. We’ve got a reproducible situation. Here’s the event log one more time:


So let’s recap what just happened. If one of the Storage Pool disks reaches a zero-bytes-free state, the thin-provisioned virtual disks on top of it are promptly taken offline as soon as they need to expand. After some research, I learnt why this is the case:

  • The virtual disk is taken offline because there are no disks available in the pool onto which its 3rd column can continue to grow (a quick way to check this from PowerShell is sketched below).
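Here’s a rough sketch of how I’d verify that: check the virtual disk’s column count, and check how much of each pool disk is already allocated. The pool name is my TestSP1; the virtual disk name is a placeholder:

# Column count of the virtual disk (a 3-column parity disk needs space on 3 separate disks to grow)
Get-VirtualDisk -FriendlyName "TestVD1" | Select-Object FriendlyName, NumberOfColumns, ResiliencySettingName

# Per-disk capacity vs. what's already allocated in the pool
Get-StoragePool -FriendlyName "TestSP1" | Get-PhysicalDisk | Select-Object FriendlyName, Size, AllocatedSize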

 

Because this is exactly the case I was curious about, and because it behaved in a way that causes an outage, I will continue experimenting. I will add another disk and keep adding files; we’ll see what happens. So here’s where we are from a pool/disk configuration standpoint:


Here’s a look at “Disks”:


…and finally, here’s what the Primordial pool looks like:


Now I’ll proceed to add Disk3 (93.2GB) to the TestSP1 Storage Pool. Here’s how things look after the addition:

If the reasoning behind the disk dismount is accurate, I should be able to add files until the next drive in the pool reaches zero free space. My reasoning for this theory is that parity protection requires a minimum of 3 disks: for any new data to be added, it needs 3 physically separate places to write the information to.

Thanks to the newly added disk, the copy operation is progressing without the disk going offline… for now. Here’s the situation from the virtual disk health perspective. Disk3 (with the red arrow) is the one I just added; the circled one is the next disk that I expect to reach zero free space.


Alright… this copy operation was designed to add a single 8GB file to the disk. The reason for the test was to stress the condition where the drive with the least space has only 1.5GB free. Assuming the file is written across 3 columns, an 8GB file should have a total footprint of:

(8GB / 2 disks for data) x 3 (due to parity) = 12GB total physical footprint on a parity-protected disk. This would be distributed evenly over the 3 non-full disks, which gives us 4GB per physical disk. The next available disk did not have that much free space (it had 1.5GB), so the copy operation should have failed. But it didn’t fail: the copy operation succeeded, and only then did the disk go offline. Check this out:


Here’s what I learnt so far:

A thin-provisioned disk can go offline without any application-friendly notification from the file system’s perspective. For example, there is no “disk full” or general-error type of condition that, while preventing writes to the disk, would allow reads to continue or let the application deal with the situation gracefully. Therefore, the underlying pool capacity, as well as the individual free space on each of these disks, needs to be well understood and well managed (monitored). There is the following property of the Storage Pool that caught my attention, although I don’t know what it does at this point:


Unknown: parity allocation and disk utilization algorithms. I made an attempt to estimate these through some common-sense methods, but somehow the math doesn’t work out. I do understand when the disk is taken offline; however, the accuracy I would need to comfortably design a system is not there yet. I will keep experimenting on this boundary condition later. Moving on to the next test case.

Continue reading additional test cases in Part 4.

3 replies »

  1. Excellent series, Baris!

    I created 5 iSCSI targets (LUNs) on my iSCSI storage, all with the same capacity (for example, 100 GB each), and connected all the LUNs to my WinSrv2k12. They are listed as “Primordial” in “Server Manager\File and Storage Services\Volumes\Storage Pools”. I created a new Storage Pool named StPool1 using three of the five available disks. Then I created a virtual disk with provisioning type = FIXED (instead of thin), maximum capacity, and layout = parity (it doesn’t matter which parity type I choose). Then I created a volume and formatted it, and up to this point everything is OK.

    Now I try to add another physical disk (one of the other two, not used until now) to my StPool1; no problem. But… when I try to extend the virtual disk, the process always finishes with the error “The physical resources of this disk have been exhausted”. As far as I can tell, it doesn’t matter which parity type I use or how many physical disks I add to the pool later. If the provisioning type of my virtual disk is “Fixed”, this error occurs.

    Can you explain to me why this happens?

    • Hi Oleg,

      Almost every unexplainable behavior in Storage Spaces revolves around the “column” concept. When you do things from the UI instead of PowerShell, you don’t get to choose the most optimal column count. I can speculate that because you used a 3-disk configuration initially, the column count of the vDisk ended up being “3”. Now, since the disk needs to be parity protected, and is already FIXED and occupying all of the space on the original disks, when it tries to extend onto the 4th disk it is probably failing to find space to hold its parity.

      Before continuing to read, do this:
      – Launch PowerShell, run “Get-VirtualDisk | fl”, and review the column count.
      – Use the Storage Spaces UI to get to the “Health” view of the virtual disk. When you expand the details of each disk, you’ll see that they are all showing 0 free space.

      Now, on your 4th disk you’ll have 100% free space, but there is no other disk to keep the parity on. For this reason the extend action will fail.

      You could try to first shrink the volume to free up some space, then extend over to the 4-disk set.

      Lastly, the column count: every column has to be stored on a different disk. If your vDisk is formed using 3 columns, you’ll need free space across all 3 of them to do any kind of expansion.
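      If you end up recreating the vDisk, PowerShell lets you pick the column count yourself instead of taking whatever the wizard decides. A rough sketch (your pool name, placeholder vDisk name):

      # Create a parity vDisk with an explicit column count (sketch, not a recommendation)
      New-VirtualDisk -StoragePoolFriendlyName "StPool1" -FriendlyName "VDisk1" -ResiliencySettingName Parity -NumberOfColumns 3 -ProvisioningType Fixed -UseMaximumSize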

      Rough math on this works out like this:

      4 disks:
      1TB with 100GB free
      1TB with 100GB free
      2TB with 60GB free
      3TB with 2TB free

      what can you create?
      3-column parity: ~200GB usable (it will need to form a 100+100+100GB 3-column footprint; 1/3rd of that will be parity, so you get roughly 200GB usable. If you noticed, I picked 100GB because that’s the largest common free space across any 3 of the 4 disks.)
      4-column parity: ~180GB usable (it will need to form a 60+60+60+60GB footprint and give you 3×60=180GB usable, as roughly 1/4th will go to parity. Again, 60GB is the largest common free space across all 4 disks.)

      Does this help?

      (Happy to stand corrected on my math or concepts here; although I blogged about Spaces, what I included as a sample here is not authoritative info from the Windows engineering team. I learnt these concepts through experimentation and there might be errors in them. Absolutely feel free to poke holes in and correct my response.)
