Before putting technology in production I like to have a very good understanding of how it handles failure. Just putting something cool into production because it feels right is not my style. I have to prove it in my lab first.
I started with a Dell Precision T3500 in my lab set up with SmartOS and one 1TB Western Digital Caviar Green SATA drive. To get off the ground I configured a ZFS pool on that first drive during boot configuration in SmartOS. I played around a bit with zones and VMs and then decided I would run ZFS through its paces to see how it copes with unplanned adds and removals of disks.
When ZFS resilvers a drive it copies data from the other drives in the pool to repopulate it with the correct replica data set, depending on your ZFS configuration. For anyone new to SmartOS: zones is the name of SmartOS's default ZFS pool (zpool); it is not a ZFS-specific thing. I added another 1TB WD SATA drive and added it to SmartOS's default zones pool:
zpool add zones mirror c1t0d0 c1t1d0
UPDATE: when I applied this to my research servers (Dell PowerEdge R720xd's with PERC H710) I needed to run "zpool attach zones c0t0d0 c0t1d0" instead, because simply adding a mirror did not work. YMMV.
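The distinction between the two commands is worth spelling out: zpool add creates a new top-level vdev (here, a whole new mirror striped into the pool), while zpool attach mirrors an existing device. A minimal dry-run sketch, using a hypothetical zpool_cmd wrapper that only prints the command so the two forms can be compared safely:

```shell
# zpool_cmd: hypothetical dry-run wrapper; it prints the zpool command it
# would run instead of executing it, so this is safe to experiment with.
zpool_cmd() { echo "zpool $*"; }

# Grow the pool by striping in a brand-new two-disk mirror vdev:
zpool_cmd add zones mirror c1t2d0 c1t3d0

# Turn the existing single disk c0t0d0 into a mirror by attaching c0t1d0
# (the form the PERC H710 boxes needed):
zpool_cmd attach zones c0t0d0 c0t1d0
```

Drop the wrapper to run the real commands; attach is the one you want when converting a lone disk into a mirror.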
Immediately ZFS began to resilver the new drive by copying data to it from the existing 1TB drive in the zones pool. This process did not seem to take long, but I was not timing it at that point.
This evening I shut down the machine (regrettably I don't have a hotswap bay) and removed one of the 1TB drives and replaced it with a 2TB WD Red NAS Hard Drive. I expected that ZFS would recognize the drive, format it and resilver it. In retrospect I am very glad that is not the case since it would result in nuking a data-bearing drive.
What actually happened is told in detail by zpool status zones as you will see here:
pool: zones
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: http://illumos.org/msg/ZFS-8000-2Q
scan: resilvered 4.01G in 0h0m with 0 errors on Fri Nov 16 03:43:09 2012
config:
NAME                      STATE     READ WRITE CKSUM
zones                     DEGRADED     0     0     0
  mirror-0                DEGRADED     0     0     0
    c1t0d0                ONLINE       0     0     0
    12819642171874297944  UNAVAIL      0     0     0  was /dev/dsk/c1t1d0s0
errors: No known data errors
One oddity: the missing device's name is now shown as a numeric ID. The really nice thing Illumos gives you here is the link http://illumos.org/msg/ZFS-8000-2Q, which brought me directly to a step-by-step process for recovering from my "failed" drive. I happen to know that my new 2TB drive is both good and has no data I need. It actually just came out of its sealed anti-static bag, so I'm really sure.
Back to business. I go back to the Illumos link and it effectively advises me to physically replace the disk (done) and then run zpool online zones c1t1d0. Because I removed the drive without telling ZFS, I get the following message:
warning: device 'c1t1d0' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
To recover from this corner I ran zpool offline zones c1t1d0, which yields no output (indicating that it succeeded), and then zpool status zones to get the following:
pool: zones
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: resilvered 4.01G in 0h0m with 0 errors on Fri Nov 16 03:43:09 2012
config:
NAME                      STATE     READ WRITE CKSUM
zones                     DEGRADED     0     0     0
  mirror-0                DEGRADED     0     0     0
    c1t0d0                ONLINE       0     0     0
    12819642171874297944  OFFLINE      0     0     0  was /dev/dsk/c1t1d0s0
errors: No known data errors
To tell ZFS that I have intentionally replaced the drive, and that this is not just a rogue drive, I run zpool replace zones c1t1d0 c1t1d0. This looks redundant, but it makes sense: I am indicating that the hardware formerly referenced by c1t1d0 has been replaced by new hardware also referenced by c1t1d0. Upon this replace ZFS resilvers the 2TB drive, and zpool status zones shows that it does so very quickly:
pool: zones
state: ONLINE
scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 02:54:13 2012
config:
NAME          STATE     READ WRITE CKSUM
zones         ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    c1t0d0    ONLINE       0     0     0
    c1t1d0    ONLINE       0     0     0
errors: No known data errors
Excelsior! We have a fully functional ZFS mirror.
Had I been simulating a planned drive replacement, I would have run zpool offline zones c1t1d0 before physically replacing the drive. This is the correct approach if you are hotswapping drives, and I expect that failing to do it while hotswapping could cause serious problems. If I had a hotswap bay I would definitely try it, because with hundreds of servers, some operator (possibly me) will eventually pull the wrong drive from a running system, and it is worth knowing what happens. Regrettably I don't have the gear, so I can't try this yet. There's an idea for my next expense claim.
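Speaking of hundreds of servers: checking pool health by eye does not scale. A minimal health-check sketch, assuming the tab-separated output of zpool list -H -o name,health (one pool name and health state per line):

```shell
# unhealthy_pools: reads `zpool list -H -o name,health` output on stdin
# and prints the name of every pool whose health is not ONLINE.
unhealthy_pools() { awk '$2 != "ONLINE" {print $1}'; }

# Live usage (assumes zpool is on PATH):
#   zpool list -H -o name,health | unhealthy_pools

# Against sample captured output it flags only the degraded pool:
printf 'zones\tDEGRADED\nrpool\tONLINE\n' | unhealthy_pools   # prints: zones
```

Wired into cron or a monitoring agent, this catches a silently degraded mirror long before the second disk dies.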
The next step in this process is to replace the other 1TB drive with my other WD 2TB Red NAS drive and see if ZFS will give me a 2TB pool as expected.
A slight detour first: can I restore the original array configuration with two 1TB drives? Let's say the 2TB drives are needed elsewhere. I replace the 2TB drive with the original 1TB drive, again "forget" to offline it first, and zpool status zones yields:
pool: zones
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-4J
scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 02:54:13 2012
config:
NAME                      STATE     READ WRITE CKSUM
zones                     DEGRADED     0     0     0
  mirror-0                DEGRADED     0     0     0
    c1t0d0                ONLINE       0     0     0
    15137371295940956084  FAULTED      0     0     0  was /dev/dsk/c1t1d0s0
First step: replace in place. I run zpool replace zones c1t1d0 c1t1d0 to bring the disk back to life. Here is the output of zpool status zones caught mid-recovery:
pool: zones
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Thu Nov 22 03:37:57 2012
258M scanned out of 6.26G at 43.0M/s, 0h2m to go
255M resilvered, 4.03% done
config:
NAME                        STATE     READ WRITE CKSUM
zones                       DEGRADED     0     0     0
  mirror-0                  DEGRADED     0     0     0
    c1t0d0                  ONLINE       0     0     0
    replacing-1             UNAVAIL      0     0     0
      15137371295940956084  FAULTED      0     0     0  was /dev/dsk/c1t1d0s0/old
      c1t1d0                ONLINE       0     0     0  (resilvering)
You can see the resilvering status right in the device tree this time. I also learned that offlining the drive first is unnecessary for a cold (powered-off) disk replacement. It is probably still the right thing to do, and it is what I will do to bring my 2TB drive back in now. Running zpool status zones before moving forward shows:
pool: zones
state: ONLINE
scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 03:39:44 2012
config:
NAME          STATE     READ WRITE CKSUM
zones         ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    c1t0d0    ONLINE       0     0     0
    c1t1d0    ONLINE       0     0     0
All is well again in ZFS land. To mix it up, I am going to replace c1t0d0 (SATA-0) instead of c1t1d0 (SATA-1) this time, starting with zpool offline zones c1t0d0. My immediate thought is that this is a solid way to nuke myself: if I replace the wrong drive, what will happen? It is interesting enough to try, so I am going there.
I put the 2TB drive in c1t1d0 (SATA-1), the wrong slot, and I am shocked that it boots, but it does. Trusty zpool status zones reports a degraded state, with c1t1d0 showing as a numeric ID and FAULTED, which is much better than I expected:
pool: zones
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-4J
scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 03:39:44 2012
config:
NAME                      STATE     READ WRITE CKSUM
zones                     DEGRADED     0     0     0
  mirror-0                DEGRADED     0     0     0
    c1t0d0                ONLINE       0     0     0
    14628442290675348894  FAULTED      0     0     0  was /dev/dsk/c1t1d0s0
So even with deliberate stupidity it seems I am safe. This is very positive. Now I will put the disks back the way they should be for my next step: remove the misplaced 2TB drive from SATA-1 and restore the original 1TB drive there, then remove the 1TB drive from SATA-0 and put the 2TB drive in its place. I'm feeling lucky, so I'm going to do this live with the case open and no hotswap bay. This is likely a very stupid idea that could electrocute me or fry the motherboard or the drives, but hey, I live on the edge.
So here's the hotswap procedure as I understand it. I will zpool offline zones c1t1d0 to prepare the erroneously placed 2TB drive for removal. After this zpool status zones gives me:
pool: zones
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 03:39:44 2012
config:
NAME                      STATE     READ WRITE CKSUM
zones                     DEGRADED     0     0     0
  mirror-0                DEGRADED     0     0     0
    c1t0d0                ONLINE       0     0     0
    14628442290675348894  OFFLINE      0     0     0  was /dev/dsk/c1t1d0s0
Next, I will unplug the SATA-1 data cable (the narrower, 7-pin connector) and then its power cable (the wider, 15-pin one). This should only be done with proper SATA power connectors. Do not attempt it with Molex or you are guaranteed some serious trouble. If this fries your computer, it is on you. Okay, zpool status zones:
pool: zones
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 03:39:44 2012
config:
NAME                      STATE     READ WRITE CKSUM
zones                     DEGRADED     0     0     0
  mirror-0                DEGRADED     0     0     0
    c1t0d0                ONLINE       0     0     0
    14628442290675348894  OFFLINE      0     0     0  was /dev/dsk/c1t1d0s0
As expected, the drive is still offline. Let's run zpool online zones c1t1d0 and see what happens. Unfortunately, the result is disappointment. Even though it is the correct drive, take a look at zpool status zones:
pool: zones
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: http://illumos.org/msg/ZFS-8000-2Q
scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 03:39:44 2012
config:
NAME                      STATE     READ WRITE CKSUM
zones                     DEGRADED     0     0     0
  mirror-0                DEGRADED     0     0     0
    c1t0d0                ONLINE       0     0     0
    14628442290675348894  UNAVAIL      0     0     0  was /dev/dsk/c1t1d0s0
errors: No known data errors
This indicates that the drive isn't recognized as the drive it was. This is curious. I wonder what happens with an in-place zpool replace zones c1t1d0 c1t1d0. It indicates that the drive doesn't even exist:
cannot open 'c1t1d0': no such device in /dev/dsk
must be a full path or shorthand device name
Why is that? Let's check whether SmartOS even sees this poor man's "hot-swapped" disk. When I run format I get the following:
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c1t0d0 <ATA-WDC WD10EARX-32N-AB51-931.51GB>
          /pci@0,0/pci1028,293@1f,2/disk@0,0
Specify disk (enter its number): ^D
So: apparently this T3500 does not do hotswap. That is a shame. What happens when I reboot? My hypothesis is that the zones pool will come back fine. And indeed, the world is as it should be, as I can see from zpool status zones:
pool: zones
state: ONLINE
scan: resilvered 11.6M in 0h0m with 0 errors on Thu Nov 22 05:13:10 2012
config:
NAME          STATE     READ WRITE CKSUM
zones         ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    c1t0d0    ONLINE       0     0     0
    c1t1d0    ONLINE       0     0     0
errors: No known data errors
So, stupidity followed by insanity, and ZFS keeps it all afloat. What about growing the array to 2TB like I planned? This time I shut down before swapping in the 2TB drive. As expected, zpool status zones reports the drive as faulted:
pool: zones
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-4J
scan: resilvered 11.6M in 0h0m with 0 errors on Thu Nov 22 05:13:10 2012
config:
NAME                      STATE     READ WRITE CKSUM
zones                     DEGRADED     0     0     0
  mirror-0                DEGRADED     0     0     0
    c1t0d0                ONLINE       0     0     0
    14628442290675348894  FAULTED      0     0     0  was /dev/dsk/c1t1d0s0
The next obvious step, as before, is zpool replace zones c1t1d0 c1t1d0. As before, resilvering begins, as reported by zpool status zones:
pool: zones
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Thu Nov 22 05:36:21 2012
320M scanned out of 6.26G at 79.9M/s, 0h1m to go
316M resilvered, 4.98% done
config:
NAME                        STATE     READ WRITE CKSUM
zones                       DEGRADED     0     0     0
  mirror-0                  DEGRADED     0     0     0
    c1t0d0                  ONLINE       0     0     0
    replacing-1             UNAVAIL      0     0     0
      14628442290675348894  FAULTED      0     0     0  was /dev/dsk/c1t1d0s0/old
      c1t1d0                ONLINE       0     0     0  (resilvering)
errors: No known data errors
I am amazed by how quiet both computers and drives have become; even with the case open this machine is surprisingly quiet. I check zpool status zones repeatedly until I see that resilvering is complete (there is probably a smarter way to do this):
pool: zones
state: ONLINE
scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 05:37:57 2012
config:
NAME          STATE     READ WRITE CKSUM
zones         ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    c1t0d0    ONLINE       0     0     0
    c1t1d0    ONLINE       0     0     0
errors: No known data errors
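That polling can be scripted. A minimal sketch, assuming zpool status keeps the "resilver in progress" wording shown above:

```shell
# resilver_running: reads `zpool status` output on stdin and succeeds
# while a resilver is still in progress.
resilver_running() { grep -q 'resilver in progress'; }

# Live usage: poll the pool every 10 seconds until the resilver finishes.
#   while zpool status zones | resilver_running; do sleep 10; done

# Against the captured output above it behaves as expected:
if printf 'scan: resilver in progress since Thu Nov 22 05:36:21 2012\n' | resilver_running; then
  echo "still resilvering"   # prints: still resilvering
fi
```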
Now another power down and drive replacement should get me a 2TB pool. Let's check the size of the pool first with zpool list zones:
NAME    SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
zones   928G  6.26G   922G         -     0%  1.00x  ONLINE  -
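The 928G figure for a pair of "1TB" drives is mostly a units mismatch rather than lost space: drive vendors count decimal bytes, while zpool list reports binary units (and ZFS holds back a little for labels). A quick sanity check:

```shell
# A "1TB" drive is 10^12 decimal bytes; dividing by 2^30 gives the binary
# gigabytes that tools like zpool list and format report.
echo $(( 1000*1000*1000*1000 / (1024*1024*1024) ))   # prints 931
```

That matches the 931.51GB that format reported for the drive earlier; the last few gigabytes go to ZFS labels and reserved space.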
As expected, the pool is a little under 1TB usable even though there is a 2TB drive in there. Once the other 2TB drive is in, I expect it to be slightly under 2TB. Now I power off the machine again and replace the SATA-0 1TB WD Caviar Green drive with the other 2TB WD Red NAS drive. With SATA-0 holding the swapped drive, zpool status zones now shows c1t0d0 as a numeric ID and FAULTED:
pool: zones
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: http://illumos.org/msg/ZFS-8000-2Q
scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 05:37:57 2012
config:
NAME                      STATE     READ WRITE CKSUM
zones                     DEGRADED     0     0     0
  mirror-0                DEGRADED     0     0     0
    1897234209816080938   UNAVAIL      0     0     0  was /dev/dsk/c1t0d0s0
    c1t1d0                ONLINE       0     0     0
Just as before, but with the other drive this time, bringing it back online is as easy as zpool replace zones c1t0d0 c1t0d0. As with the previous replaces in the mirror, zpool status zones shows c1t0d0 resilvering under replacing-0 under mirror-0, with the original drive's ID being replaced:
pool: zones
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Thu Nov 22 05:55:55 2012
1.40M scanned out of 6.26G at 1.40M/s, 1h16m to go
1.04M resilvered, 0.02% done
config:
NAME                       STATE     READ WRITE CKSUM
zones                      DEGRADED     0     0     0
  mirror-0                 DEGRADED     0     0     0
    replacing-0            DEGRADED     0     0     0
      1897234209816080938  FAULTED      0     0     0  was /dev/dsk/c1t0d0s0/old
      c1t0d0               ONLINE       0     0     0  (resilvering)
    c1t1d0                 ONLINE       0     0     0
errors: No known data errors
The 2TB drive in SATA-0 resilvers in roughly a minute, as before, according to zpool status zones:
pool: zones
state: ONLINE
scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 05:58:55 2012
config:
NAME          STATE     READ WRITE CKSUM
zones         ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    c1t0d0    ONLINE       0     0     0
    c1t1d0    ONLINE       0     0     0
How big is the zones zpool now? I can check that with zpool list zones:
NAME    SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
zones   928G  6.26G   922G      932G     0%  1.00x  ONLINE  -
Curious why it is still only roughly 1TB rather than 2TB? At this point ZFS still treats the devices as 1TB drives. Running zpool online -e zones c1t0d0 tells ZFS to expand the zones pool to the full physical extent of c1t0d0. Now zpool list zones shows the expected mirrored 2TB of space:
NAME     SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
zones   1.81T  6.26G  1.81T         -     0%  1.00x  ONLINE  -
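One refinement I did not use here: if your illumos build supports the pool's autoexpand property, the expansion happens automatically once every disk in the vdev is large enough, with no manual zpool online -e. A configuration sketch:

```shell
# With autoexpand=on, the pool grows on its own after the last small disk
# in a vdev is replaced with a larger one (assumes platform support).
zpool set autoexpand=on zones
zpool get autoexpand zones
```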
I was inspired to take this on, and to push for SmartOS in our QA environment, by reading Constantin Gonzalez Schmitz's excellent OpenSolaris ZFS: Mirroring vs. other RAID schemes. Thank you, Constantin, and thank you very much to Joyent for releasing the excellent SmartOS cloud OS. SmartOS and ZFS are fine examples of Concise Software.