Here’s another post on Storage Spaces. For those who haven’t followed my earlier posts on Spaces: I experiment with and actually use Storage Spaces both at home and at work. I recently ran into a physical disk issue and wanted to report on it for your benefit, since there doesn’t seem to be any documented case of this “Stale Metadata” error out there. So here’s what happened: the 4-disk pool on my Windows Server 2012 R2 based home server dropped to a degraded state. Strangely, the underlying virtual disk (named FixedMirror) was still showing healthy. Here’s how it looks; first the Pool view, then the Virtual Disk view:
The really strange piece in the above screenshot is that Health Status shows Healthy while one of the disks in a 2-way mirror protected virtual disk shows a warning sign. Keep in mind that the virtual disk in question is “fixed size” (not dynamically expanding) and 4.5TB in size, meaning all 4 of those disks need to be available and healthy for the virtual disk to remain healthy as well. I would have expected the virtual disk to drop to the “degraded” state along with the parent pool object, but it did not.
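A quick back-of-the-envelope check shows why losing any one disk matters for this layout. The sketch below uses only the figures quoted in this post (4.5TB fixed virtual disk, 2-way mirror, 10TB pool) and ignores pool metadata overhead, so treat it as a rough illustration and not as an exact accounting. The second half reads the real numbers from the system, assuming the virtual disk is named FixedMirror as it is here.

```powershell
# Rough capacity check using the figures from this post (illustrative only).
$virtualDiskTB = 4.5   # fixed-size virtual disk
$dataCopies    = 2     # a 2-way mirror keeps two copies of the data
$poolTB        = 10    # total pool capacity across the 4 disks

$footprintTB = $virtualDiskTB * $dataCopies
"Estimated footprint on pool: $footprintTB TB of $poolTB TB"

# Read the actual values from Storage Spaces (read-only query).
Get-VirtualDisk -FriendlyName "FixedMirror" |
    Select-Object FriendlyName, Size, FootprintOnPool, NumberOfDataCopies, ProvisioningType
```

With roughly 9TB of a 10TB pool allocated up front, there is very little room for the mirror copies to avoid any one of the four disks.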
Also, notice how Primordial pool popped up – it didn’t exist before. Let’s take a look at what’s in there. First, “before problem” shot:
Now however, Primordial group showed up…
Seems like 1 of the 4 disks decided to go bad and left the pool.
When we then look at the disk properties, we see this:
I have no idea how it got into that state, but I care about the roughly 4.5TB of family memories on the 10TB pool, and I need to urgently restore it to a healthy state.
Of course, before I do any kind of maintenance work on this precious volume, I will take a backup while I still can. Keep in mind I already have backups of the most critical files in the cloud (Azure); however, recovery from there would take a long time, and since the data volumes are still functional, it’s wise to take a good local backup at home right at this point in time.
<2 days later…>
Alright – backup is successful.
At this point I want to point out that the Virtual Disk, despite showing “healthy” earlier, is now showing “degraded” as I would expect. Whether this was due to:
- It taking a few hours to recognize the loss of the disk
- Me neglecting to refresh the Server Manager console
- It somehow waiting for a write request to go to the virtual disk
…is unknown, but I’m glad it eventually showed the correct status. I wanted to keep all these notes here for you to see and relate to your own experiences. Comment below if you observe a similar delay in status changes of virtual disks.
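One way to rule out a stale Server Manager view is to query the status directly from an elevated PowerShell prompt. This is a minimal sketch using the standard Storage module cmdlets; the only name it assumes is FixedMirror, the virtual disk from this post. All three queries are read-only.

```powershell
# Pool status (skip the Primordial pool, which only lists unpooled disks).
Get-StoragePool -IsPrimordial $false |
    Select-Object FriendlyName, HealthStatus, OperationalStatus

# Virtual disk status; DetachedReason is useful if the disk went offline.
Get-VirtualDisk -FriendlyName "FixedMirror" |
    Select-Object FriendlyName, HealthStatus, OperationalStatus, DetachedReason

# Per-disk status, including whether a disk is eligible to join a pool.
Get-PhysicalDisk |
    Sort-Object FriendlyName |
    Select-Object FriendlyName, HealthStatus, OperationalStatus, Usage, CanPool
```

Running these a few times over the course of an incident would also show whether the virtual disk status really lags behind the pool status.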
Ok, where were we… Yes, proceeding with the repair options. Below, you’ll see that the disk had fallen out of the main pool. This is also not expected, because I had not removed it from the virtual disk; even if I had wanted to, the system wouldn’t have let me before I inserted a replacement disk. But let’s investigate why that is later. I have a good backup, and this operation is supposed to impact only the physical disk, which is lost anyway, so it wouldn’t hurt to try.
So I’ll proceed with resetting the disk that’s sitting as unhealthy in the Primordial pool. I have no real basis for expecting the “reset” action to help.
It pops this warning…
…which doesn’t make much sense here because the disk had already fallen out of the pool. So I’m proceeding with Yes. Reset process goes like this:
It took about 90 seconds to complete the reset process.
It didn’t help. The disk is still in the “Stale Metadata” state. The idea with “reset” was that if it had fixed the stale metadata, maybe I could just issue a Repair-VirtualDisk command and be done with this. Now let’s take a look at the event log:
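For reference, the same reset can be attempted from PowerShell with Reset-PhysicalDisk. This is a sketch and not what I ran above (I used the Server Manager UI); the friendly name below is hypothetical, so substitute whatever Get-PhysicalDisk reports for the affected disk.

```powershell
# Hypothetical friendly name; check the output of Get-PhysicalDisk first.
$badDiskName = "PhysicalDisk2"

# Confirm we are looking at the right disk before touching it.
Get-PhysicalDisk -FriendlyName $badDiskName |
    Select-Object FriendlyName, SerialNumber, HealthStatus, OperationalStatus

# Reset-PhysicalDisk clears the Storage Spaces metadata on the disk,
# which destroys whatever pool data was on it.
Reset-PhysicalDisk -FriendlyName $badDiskName

# Check whether the operational status changed.
Get-PhysicalDisk -FriendlyName $badDiskName |
    Select-Object FriendlyName, HealthStatus, OperationalStatus
```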
System event log is full of these “The IO operation at logical block address 0x0 for Disk 2 (PDO name: \Device000003d) was retried.” errors filed under event 153, about 5 errors per second. Seems like the disk is really having physical hardware issues.
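If you prefer not to scroll through Event Viewer, a query along these lines can count and sample the event 153 entries. It is a sketch that filters only on the System log and the event ID mentioned above; the one-hour window is an arbitrary choice for illustration.

```powershell
# Look at the last hour of the System log for event ID 153.
$since  = (Get-Date).AddHours(-1)
$events = Get-WinEvent -FilterHashtable @{
    LogName   = 'System'
    Id        = 153
    StartTime = $since
} -ErrorAction SilentlyContinue   # Get-WinEvent raises an error when nothing matches

"Event 153 entries in the last hour: $(@($events).Count)"

# Show a few of the most recent entries in full.
$events |
    Select-Object -First 5 TimeCreated, ProviderName, Message |
    Format-List
```

The message text names the disk number, which helps confirm that the retries all point at the same physical disk.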
I decided to order a replacement. After a quick search on the web, I ordered a 4TB SATA disk as the replacement.
<2 days later>
The recovery process was pretty simple and in line with what I had posted a few years ago here. The steps are:
- Shut down the PC (remember, this is a home-built desktop class PC with a bunch of onboard SATA ports, not hot-swap capable)
- Remove the bad disk
- Install the replacement disk
- Power on
- Launch Server Manager, navigate to File and Storage Services
- Find the new disk under Primordial. Add it to the storage pool that hosts the virtual disk.
- Launch an elevated PowerShell and run:
  - Set-PhysicalDisk -FriendlyName <NameOfTheBadDisk> -Usage Retired
  - Repair-VirtualDisk -FriendlyName "FixedMirror"
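The steps above, from adding the new disk onward, can also be sketched end to end in PowerShell. I did the "add" part through Server Manager, so take this as an untested equivalent and not as a record of what I ran. The pool name and bad-disk name are hypothetical placeholders, and the sketch assumes the replacement is the only disk reporting CanPool as true.

```powershell
# Hypothetical names; replace with the values from Get-StoragePool / Get-PhysicalDisk.
$poolName    = "MyPool"
$badDiskName = "PhysicalDisk2"

# 1. Add the replacement disk (assumed to be the only poolable disk) to the pool.
$newDisk = Get-PhysicalDisk -CanPool $true
Add-PhysicalDisk -StoragePoolFriendlyName $poolName -PhysicalDisks $newDisk

# 2. Mark the failed disk as retired so no new data is placed on it.
Set-PhysicalDisk -FriendlyName $badDiskName -Usage Retired

# 3. Rebuild the mirror onto the remaining disks, running as a background job.
Repair-VirtualDisk -FriendlyName "FixedMirror" -AsJob

# 4. Check progress; rerun until the repair job no longer appears.
Get-StorageJob
Get-VirtualDisk -FriendlyName "FixedMirror" |
    Select-Object FriendlyName, HealthStatus, OperationalStatus
```

Running the repair with -AsJob keeps the prompt free, which is handy because a rebuild of this size can take many hours.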
Once the virtual disk is fully repaired and showing Healthy, I can remove the left-over disk record that corresponds to the bad disk I physically took out. I did that from the UI by right-clicking on it and choosing “Remove”.
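The same clean-up should be possible with Remove-PhysicalDisk. Again, this is a sketch of the PowerShell equivalent and not what I actually did; the pool and disk names are hypothetical, and the health check is a safety guard I added for illustration.

```powershell
# Hypothetical names; replace with the values from your own system.
$poolName    = "MyPool"
$badDiskName = "PhysicalDisk2"

# Only remove the stale record once the mirror is fully rebuilt.
$vdisk = Get-VirtualDisk -FriendlyName "FixedMirror"
if ($vdisk.HealthStatus -eq "Healthy") {
    $retired = Get-PhysicalDisk -FriendlyName $badDiskName
    Remove-PhysicalDisk -StoragePoolFriendlyName $poolName -PhysicalDisks $retired
} else {
    "FixedMirror is still $($vdisk.HealthStatus); leaving the retired disk in place."
}
```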
Now everything is back to normal and healthy. Next question is, what to do with the disk that has gone bad. This part is somewhat interesting so I wanted to share those notes with you as well.
First, I checked whether the disk was still under warranty. Sure enough, it was, with merely 20 days left. But then I really needed to clean up the contents before sending it back to the manufacturer. One way to do that is to insert it into another PC and just fill it with zeros.
What I discovered, however, is that if I insert it into a bare metal PC with nothing else installed and attempt to boot from a Windows 8.1 USB stick, the disk cannot be seen as a valid OS installation target, either because it had been part of a “Space” before or because of its physical issues. I was somewhat expecting this, but then during setup I launched a command prompt (Shift-F10 or something, I don’t remember now) and ran diskpart, and even that could not see the disk.
At that point I had a choice: give up, assume the disk had really gone bad and just send it in for warranty replacement, or try something else.
You see, I have this Synology DS1812+ at home, with a Hot-Spare disk slot in it. I decided to insert the disk there and see what happens. Here’s how Synology saw the disk:
Despite showing the SMART status as abnormal, it was still able to reset all the contents. I then created a volume on it and filled it with zeros. After that, I took it out and shipped it for warranty replacement. I haven’t done much investigation into why diskpart could not see the disk on the PC side, but kudos to Synology for recognizing the disk and allowing me to zero it out easily.
Hope this walkthrough gives you a bit of a head start when attempting to recover from a physical disk failure in Windows Server Storage Spaces.