
Windows Server 2012 – Storage Spaces and Data Deduplication Hands-on Review Part 6

This is Part 6 of my Storage Spaces and Data Deduplication review. Here’s an index of the test cases in this part, along with the other parts:

  • Part 1: Introduction and Lab Environment Preparation
    • Physical Disk Pull
    • Introduce the Pulled Disk Back into the System
    • Extend Thinly Provisioned Virtual Disk and Observe Behavior at Limits
    • Bonus: Detecting and Replacing Physical Disk Failures
    • Removing a Disk from the Storage Pool
    • Reclaim Unused Space on Thin Provisioned Disks
    • Bonus: Defragmentation Attempt and Observations
  • Part 6 (You’re here)
    • Understanding Hot-Spare Behavior
    • Evaluating and Enabling Data Deduplication

 

Understanding Hot-Spare Behavior

Let’s take a look at the hot-spare feature in Storage Spaces. Most storage administrators have experience with array controllers and SAN/NAS devices, and are accustomed to a certain behavior from hot spares. I’m curious how similar the hot-spare functionality in Windows Server 2012 is to a traditional hot spare. For example, will it become active automatically? What will happen to the disk that it replaced? Keep reading as I experiment with this.

Here’s where we were last time in Part 5:


I’m going to add a fifth disk as a hot spare. Here’s how it’s done:



Adding the disk takes just a second. The hot-spare disk then shows up like this:


Before we go any further, keep in mind this very important terminology distinction: disks are added to the storage pool object, not to virtual disks. You carve virtual disks out of storage pools. In my example, there is only one virtual disk. However, to continue my tests, I will create another virtual disk, just to make it fully clear that disks are part of the pool, not of a virtual disk. In other words, a spare can serve any of the virtual disks in the pool. Spares are not assigned to individual virtual disks.

For that reason, here’s our TestVirtualDisk2, also thin provisioned and parity protected. I also created an NTFS volume on it and loaded some data onto it. Here’s how things look now. Look at the Allocated column for each disk as well. This is a great example of how we can over-subscribe the underlying storage.
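To put a number on over-subscription, here is a minimal Python sketch. The sizes are made-up figures for illustration only, not the values from this lab, and the function name `oversubscription_ratio` is hypothetical.

```python
def oversubscription_ratio(provisioned_gb, pool_capacity_gb):
    """Return total provisioned virtual disk size divided by raw pool capacity."""
    if pool_capacity_gb <= 0:
        raise ValueError("pool capacity must be positive")
    return sum(provisioned_gb) / pool_capacity_gb


# Hypothetical example: two thin virtual disks of 2000 GB each,
# carved out of a pool with 1000 GB of raw capacity.
ratio = oversubscription_ratio([2000, 2000], 1000)
print(f"Provisioned {ratio:.1f}x the raw pool capacity")
```

A ratio above 1 means the virtual disks promise more space than the pool physically has, which is exactly what thin provisioning permits until the data actually written approaches the real capacity.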


Something else worth noting: in the pool, I have 4 disks with the usage attribute set to “Automatic”. I copied just one 8GB file to the E: volume (hosted on TestVirtualDisk2, which I just created). It ended up being distributed across all of those disks. Through experimentation in earlier parts of this review, we had seen that the pool distributes data among multiple disks, which is the expected behavior. However, we didn’t know whether it spreads the data over as many disks as it has at the time, or over just 3. It is now confirmed that all disks are being utilized, which is very good and consistent with other storage solutions out there.


OK, the time has come to test how this hot spare works. I will go ahead and remove one of the disks by physically pulling it out. As a reminder, neither my disks nor my eSATA enclosure supports hot-swap, so this is a truly fatal operation that I normally wouldn’t do. That said, the purpose is to simulate the loss of a disk, so it’s real enough.

One of the difficulties here is matching physical disks to the disks you see in the management console. In a large disk set, this can be a seriously challenging task where you really cannot afford to make a mistake. For this reason, the Storage Spaces management console gives you a “Toggle Disk Light” option when you right-click any physical disk. However, your controller and disks need to support that feature. Sadly, my test gear doesn’t support it:


At any rate, I’m proceeding to just pull the disk.

Action: Physically remove PhysicalDisk5 from the eSATA enclosure.

After removing the disk and refreshing the management console UI, here’s what I see:


Normally, hot spare means a “hot” spare, so it should have kicked in. But as far as I can see, it has not. Data is still available; however, each virtual disk is in a degraded state (because both of them had data on Disk5). None of the physical disks are doing anything right now. They are happily idling away.

I would have expected the hot spare to kick in and the system to bring everything back to a healthy state on its own. But this is not happening.

After some research, here’s what I found. Let’s take a look at the properties of my storage pool. Pay special attention to the current “degraded” status and the default setting of “Auto” on the “RetireMissingPhysicalDisks” property.


Here’s the behavior of RetireMissingPhysicalDisks under various settings:

  • Enabled: If just a disk is missing, meaning its enclosure is still present, then treat the missing disk as failed.
  • Disabled: When a disk is missing, wait for either the disk to reconnect or for admin action.
  • Auto: If the pool has a hot spare, then follow the Enabled logic. Otherwise, follow the Disabled logic.
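The three settings can be read as a small decision function. The Python sketch below is only one reading of the descriptions above, not Storage Spaces code. The names `RetirePolicy` and `treat_missing_disk_as_failed` are made up for illustration, and returning `False` when the enclosure itself is missing is an assumption, since the descriptions only cover the case where the enclosure is still present.

```python
from enum import Enum


class RetirePolicy(Enum):
    ENABLED = "Enabled"
    DISABLED = "Disabled"
    AUTO = "Auto"


def treat_missing_disk_as_failed(policy, pool_has_hot_spare, enclosure_present=True):
    """Return True if a missing disk should be treated as failed."""
    # Auto resolves to Enabled when the pool has a hot spare, else to Disabled.
    if policy is RetirePolicy.AUTO:
        policy = RetirePolicy.ENABLED if pool_has_hot_spare else RetirePolicy.DISABLED

    if policy is RetirePolicy.ENABLED:
        # Assumption: only a disk whose enclosure is still present counts as failed.
        return enclosure_present

    # Disabled: wait for the disk to reconnect or for admin action.
    return False


# The configuration in this test: default "Auto" policy, one hot spare in the pool.
print(treat_missing_disk_as_failed(RetirePolicy.AUTO, pool_has_hot_spare=True))
```

Under this reading, the pool in this test should treat the pulled disk as failed, which is why the lack of any reaction is surprising.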

 

Since we have a hot spare in the pool, it should behave as if it’s set to “Enabled”. But it’s not doing that.

More research and more findings:

  • Storage Spaces will wait for 5 minutes before taking any action on the hot-spare disk. In my case, we’re past 5 minutes already and nothing is happening.
  • Storage Spaces will not do anything unless there is a write failure. Who knows, maybe this is why…

Let me attempt to save a file onto one of those disks and see what happens:

Action: Copy about 4GB worth of files to TestVolume2 (E:)

Nope. Nothing is written to the hot-spare disk. The Usage designation of the disk hasn’t changed immediately either. But let me see whether the 5-minute count starts after my write attempt… So I’ll wait a little.

<5 mins later>

Sure enough, the physical disk LEDs have gone crazy. Something is finally happening, so let’s check the Usage designations of the disks. Remember, PhysicalDisk4 was set as the hot spare (see the earlier screenshots above).


Finally, Windows decided to change the hot-spare disk to Automatic, and apparently initiated a repair. But there is something very interesting going on here. The repair activity was only for the virtual disk I had made the write request on (TestVirtualDisk2).

After waiting just a short while, disk activity has ended and everything is seemingly idle. Look at where we are:


Apparently it repaired only the virtual disk where I had a write operation.

If what I’m observing is true, a read-only NTFS volume (think of a file share for distributing deployment packages) might not be able to benefit from the hot spare and could be vulnerable to the loss of another disk. The possibility that the repair does not happen automatically is concerning. So I’ll attempt to write something to drive D: and see if auto-repair kicks in then.

Action: Copy a tiny file to drive D (which sits on TestVirtualDisk1)

Time: 4:48pm

<nothing is happening for now. Will wait>

Time: 4:52pm

<nothing is happening for now. Will wait some more>

Time: 4:55pm

Nothing happened.

So even if I write something to the drive, auto-repair is not kicking in. This is strange. Maybe the size of the file matters for some reason. Let me copy a large file.

Action: Copy 8GB file to drive D:

After the file copy operation finished, disk activity continued, so I’m assuming some sort of repair is in progress. Strange that a small file write didn’t trigger this, whereas a big one seemingly did. Here’s how things look:


More importantly, here is the allocation of TestVirtualDisk1:


A couple of things to note here. Physical disk activity is very heavy right now. Health status, as can be seen above, is still “Warning”. Also, the missing disk is not showing as allocated, whereas Disk4 is now showing as being used. Strangely, the allocation level of Disk4 is very high. It has only been about 15-20 minutes, and free space is down to 19.8GB. It appears that the repair process promptly reserves the final footprint as “used” on the physical disk and then fills it in. Interesting.

I have not done anything with the “missing disk”. It’s still showing as missing. I will play with it later (inserting it back after a while, as well as attempting to re-purpose it; stay tuned for those).

For now, I’m waiting for the repair process to complete. Given the data set on this virtual disk, I expect it to take several hours.

<Several hours later>


The repair has completed and the virtual disk is reporting healthy. At this point the pool can lose another disk and still maintain data integrity.

So what did we learn thus far?

  • The hot-spare drive, unlike on traditional SAN/NAS systems, does NOT get activated as soon as a failure is logged. Rather, in my tests it was activated about 5 minutes after a write operation was made to a volume inside a virtual disk, and only the virtual disk that received the write was repaired. If a virtual disk does not have any volume that receives write activity, it does not get repaired and remains in a degraded state. Bottom line: if you use a hot spare, trust it only if you have continuous read/write activity to at least one volume in each of your virtual disks. Otherwise, you will need to monitor the logs constantly and raise alerts for an administrator to perform the change. For an admin-initiated operation, the following has to occur:
    • Change Usage property of the hot-spare disk to “Automatic”
    • Initiate “Repair” on each of the virtual disks that are in degraded state.
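The two manual steps map to the `Set-PhysicalDisk` and `Repair-VirtualDisk` cmdlets in the Windows Storage module. As a hedged sketch, the Python helper below only builds the command text so it can be reviewed before being pasted into an elevated PowerShell session; it does not run anything. The helper name is hypothetical, and `AutoSelect` is assumed to be the PowerShell value behind the “Automatic” usage shown in the UI.

```python
def manual_spare_activation_commands(spare_name, degraded_virtual_disks):
    """Build PowerShell command lines for a manual hot-spare activation.

    The commands are returned as plain strings and are never executed here.
    """
    # Step 1: change the spare's Usage from HotSpare to automatic allocation.
    # Assumption: "AutoSelect" corresponds to "Automatic" in Server Manager.
    commands = [f'Set-PhysicalDisk -FriendlyName "{spare_name}" -Usage AutoSelect']

    # Step 2: start a repair on every virtual disk that is still degraded.
    for name in degraded_virtual_disks:
        commands.append(f'Repair-VirtualDisk -FriendlyName "{name}"')

    return commands


# Disk names taken from this test; adjust them to your own pool.
for line in manual_spare_activation_commands(
    "PhysicalDisk4", ["TestVirtualDisk1", "TestVirtualDisk2"]
):
    print(line)
```

Generating the text first keeps a record of exactly which disks are touched, which matters when matching physical disks to console entries is already error-prone.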

Next, we want to see what happens when the missing disk is made available back to the system again.

Action: Insert the previously removed disk (PhysicalDisk5) back in and reboot the computer. (I reboot on re-inserts not because it’s a must but because my eSATA enclosure doesn’t officially support hot-swap. It’s OK to pull a disk to properly simulate a failure, but the opposite is not good practice.)

The disk showed up and kept its “Retired” status. Compare the screen below with the one above. Because it’s set to Retired, we know from earlier parts of this blog that it will never be used for write operations.


At this point I want to mark that disk (PhysicalDisk5) as a hot spare. I could not find a way to change the “Usage” property from the user interface. So this is what I did instead:


A quick refresh of the Server Manager UI shows that the change has been reflected properly. (Compare with the screenshot above, where Usage shows Retired.)


That’s all I have for the hot-spare feature. My conclusions are these:

  • It is a new, useful feature that works slightly differently from traditional storage systems.
  • It allows flexibility in that you could potentially leverage a hot spare that differs in size from the other disks in the pool.
  • A hot spare services multiple virtual disks. Read this one more time. Not every storage solution allows you to use a single spare disk to potentially service any of the arrays. Oftentimes you are required to assign dedicated spares to each. Windows Server 2012 allows you to build parity, mirror, or unprotected virtual disks across physical disks of varying sizes, and to assign a disk or disks of any size as hot spares. Personally, I have not seen this kind of flexibility at this cost level.
  • It does require administrators to learn it properly. The behavior that causes hot spares not to auto-activate due to a lack of write requests could catch someone by surprise. Experiment with it and learn how it works.

Next up, we will look at Data Deduplication. Read about it in Part 7.

15 replies »

  1. That’s obviously a spam post – notice the link instead of the name.
    Same as “January 24, 2013 at 12:39 am” post below.
    Might want to check out if there are any countermeasures available.

    PS thx for the article

  2. Baris: Could you please explain what will happen to the data (files, folders, etc.) that was residing on a disk that crashed or was taken out? I understand the hot-spare disk becomes active once a disk goes missing, but I would like to know what happens to the data. Do we have data loss in this case?

    • Hi there, sorry for the late reply. In the case of protected volumes like parity or mirror, the loss of a single disk will not cause data loss. In some circumstances volumes can survive the loss of multiple disks. The specific scenario we talked about in the blog will not suffer data loss. Hope this helps.

  3. Is it possible that the spare never kicked in because the data was still redundant on the existing set even with the failed drive?

    • In the scenario I went through, the spare policy and conditions were well understood. So no, even though the data was vulnerable to another failure, the spare did not engage immediately. That said, proper configuration of the hot spare with correct policies does allow it to engage as needed. It just behaves differently than your typical RAID system. I’m told that one reason for that is to account for the temporary loss of a disk in a parity group. If they were to engage the spare the instant one of the disks is gone, they’d be launching an expensive and lengthy process for what might just be a simple disk seating or cabling issue. So instead of engaging immediately, various alerts are raised to allow the admin to take corrective action. If it was a temporary error (think of an enclosure power failure affecting a mirror pair), fixing it would allow a quick incremental resync. Long story short, you *can* make it engage quicker; it’s just not the default behavior. HTH.

    • Hi – no in this case I did research the reason for spare not kicking in and noted in the blog. Reread perhaps? Whole point of spare is to have “additional” disk to standby and replace the one that goes bad automatically, minimizing the risk to redundancy of the volume. Hope this helps.

  4. Hi. Real great article. Currently I’m running some tests on Windows Server 2012 R2 and noted 3 differences:

    First, The Options are no longer {Enabled,Disabled,Auto} but {Always,Disabled,Auto} (I’m using Always).

    Second: About 5 minutes after pulling a disk, ALL Virtual Disks started to rebuild. And I HAVE some readonly disks.

    The third one might not be a difference, but not covered by your test: I pulled a Disk from a 10 Disk Pool (plus 2 Hotspares), while the “biggest” Layout (2 DataCopies, 4 Columns) would require 8 Physical Disks. Storage Spaces started to rebuild the virtual Disks WITHOUT enabling one of the available hotspares. I assume that a Hotspare is only enabled, if the remaining disks could not serve enough columns for the Virtual Disks.

    If this is true, there would be two more questions:
    1.) Will it also enable a Hotspare if it runs out of Columns in a “constructive” way (i.e. one disk runs full, so only 7 columns would be available)
    2.) What happens if it activates a Hotspare that runs out of space during rebuild? Will it just enable another Hotspare?

    I might run some tests on this during the next days. If i manage to find a pattern, ill post another comment.

    • Thanks for the comment – all these differences are because you are on R2. This post was written based on R1. If I find some time I will publish an update covering R2. That said, yes, hot spare behavior has changed in a number of ways. To answer your questions – yes that is true and #1: no because you wouldn’t be able to create the volume in the first place. Column / disk mapping is done at the time of provisioning. Hot spare is only to replace failed disks. #2: N/A. I will keep these in mind when I start writing the R2 version of the article. Hope this helps.

  5. I built a physical server with 1 SSD running Windows Server 2012 R2 Essentials. After setting up and configuring the server, I added two 1TB disks as a thin mirror, with 1 storage space in 1 storage pool of 930GB, and added about 500GB of data to the virtual disk. Then I turned off the server, removed 1 disk, and added a new 1TB disk (empty, with no partitions). Then I started the server again. I added the new disk by right-clicking the ‘Storage pool’ and choosing ‘Add physical disk…’. I chose ‘Hot Spare’ this time, because when I chose ‘automatic’ in a previous test Windows just added the 1TB to my pool size. This is not what I want. I want the new disk to replace the removed disk.
    Now I wanted Windows to start using this ‘Hot Spare’ as a replacement for the disk I removed. So I right-clicked the virtual disk and clicked ‘repair virtual disk’. The removed disk now got ‘usage’ = ‘retired’, and the new disk got the status ‘automatic’. After a couple of hours I was able to right-click the ‘retired’ disk and choose ‘remove disk’.
    I wanted to test this again after some days to write up nice documentation of how I did this without using any PowerShell. This time I get to the point where the removed disk gets the usage ‘retired’, but the ‘Repair virtual disk’ option just goes very quickly to 100% and then stops. The new disk stays at ‘hot Spare’ under usage.
    Then I tried your tip to copy some files to the virtual disk. First 150MB, then 500MB, then 750MB, then >1GB, and finally a 4.5GB file. But even after more than 5 minutes of waiting and clicking ‘Repair virtual disk’ several times, the new disk stays at ‘Hot Spare’ under Usage.

    Of course I can set the Usage mode of the new disk to ‘automatic’ with the help of PowerShell, but because I did it once without PowerShell, It must be possible to do without!?

    Any additional tips from somebody?

  6. I ended up making a backup of the complete storage pool to an external disk and completely deleting the storage pool. The first time I simulated a disk failure, the disks were formatted with 1 NTFS partition. The second time I just inserted an empty disk with no formatting or partitioning. In the first test I could remove either of the 2 mirrored disks, connect it through a USB cradle to a Windows 8 or 10 laptop, and read the data. If I do that with disks from the second test, my laptop pops up with the exact storage pool of the server and Disk Management does not want to show my disk (so I cannot assign a drive letter to it); even a separate tool (like MiniTool) does not want to show my disk anymore. I needed to grab an old Windows 7 laptop to be able to access the disks and format them.
    So I will add 2 new, cleanly formatted NTFS disks with 1 partition spanning the whole disk, create a new storage pool, and start over with the failure tests. I really want a storage system that is independent of hardware failures of the NAS or server (just like WHSv1), and I think Storage Pool/Space could be it.
